
Optimized Architecture for Cooperative Multi-agent Reinforcement Learning

Numerous real-world tasks require the collaboration of multiple agents, often under limited communication and incomplete observations. Deep multi-agent reinforcement learning (Deep-MARL) algorithms have shown remarkable effectiveness in such challenging scenarios. Among them, QTRAN and QTRAN++ are representative approaches capable of learning a broad class of joint action-value functions with strong theoretical guarantees. However, the performance of QTRAN and QTRAN++ is hindered by their reliance on a single joint action-value estimator and their neglect of preprocessing agent observations. This study introduces a novel algorithm called OPTQTRAN, which improves significantly upon QTRAN and QTRAN++. First, it proposes a dual joint action-value estimator structure that leverages a decomposition network module to compute additional joint action-values. To ensure accurate joint action-value estimation, it designs an adaptive network module that facilitates efficient value-function learning. In addition, it introduces a multi-unit network that groups agent observations into different units for effective estimation of each agent's utility function. Extensive experiments across diverse scenarios on the widely used StarCraft benchmark demonstrate that the proposed approach outperforms state-of-the-art MARL methods.
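For context, the "strong theoretical guarantees" of QTRAN and QTRAN++ refer to the individual-global-max (IGM) condition: greedy action selection on each agent's utility must recover the greedy joint action of the joint action-value function. In the standard notation of the QTRAN literature (not reproduced from this paper's text):

    \arg\max_{\mathbf{u}} Q_{\mathrm{jt}}(\boldsymbol{\tau}, \mathbf{u}) =
    \begin{pmatrix}
      \arg\max_{u_1} Q_1(\tau_1, u_1) \\
      \vdots \\
      \arg\max_{u_n} Q_n(\tau_n, u_n)
    \end{pmatrix}

QTRAN can represent a broad class of factorizable joint action-value functions because it enforces this consistency through its training losses rather than through a restrictive mixing structure.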
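The abstract describes the architecture only at a high level, so the following PyTorch sketch is one plausible reading of it, not the paper's implementation. All module names (UtilityNet, DecompositionNet, AdaptiveMixer) and dimensions are hypothetical: per-agent utilities computed from grouped observations, a decomposition network producing a second joint action-value, and an adaptive gate blending the two estimates.

    # Hypothetical sketch (not the paper's code): dual joint action-value
    # estimation with an adaptive blend, in PyTorch.
    import torch
    import torch.nn as nn

    class UtilityNet(nn.Module):
        # Per-agent utility Q_i(tau_i, .) from that agent's (grouped) observation.
        def __init__(self, obs_dim, n_actions, hidden=64):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(obs_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, n_actions))

        def forward(self, obs):
            return self.net(obs)  # (batch, n_actions)

    class DecompositionNet(nn.Module):
        # Second joint estimator: maps the agents' chosen utilities to a joint value.
        def __init__(self, n_agents, hidden=32):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(n_agents, hidden), nn.ReLU(),
                nn.Linear(hidden, 1))

        def forward(self, chosen_qs):
            return self.net(chosen_qs)  # (batch, 1)

    class AdaptiveMixer(nn.Module):
        # State-conditioned gate blending the two joint action-value estimates.
        def __init__(self, state_dim):
            super().__init__()
            self.gate = nn.Sequential(nn.Linear(state_dim, 1), nn.Sigmoid())

        def forward(self, q_a, q_b, state):
            w = self.gate(state)              # (batch, 1), in (0, 1)
            return w * q_a + (1.0 - w) * q_b

    if __name__ == "__main__":
        batch, n_agents, obs_dim, n_actions, state_dim = 4, 3, 10, 5, 12
        utilities = nn.ModuleList(
            UtilityNet(obs_dim, n_actions) for _ in range(n_agents))
        decomp, mixer = DecompositionNet(n_agents), AdaptiveMixer(state_dim)

        obs = torch.randn(batch, n_agents, obs_dim)       # grouped observations
        actions = torch.randint(n_actions, (batch, n_agents))
        state = torch.randn(batch, state_dim)

        # Utilities of the actions actually taken, one column per agent.
        chosen = torch.stack(
            [utilities[i](obs[:, i]).gather(1, actions[:, i:i + 1]).squeeze(1)
             for i in range(n_agents)], dim=1)            # (batch, n_agents)

        q_decomp = decomp(chosen)                         # second joint estimate
        q_joint = torch.zeros(batch, 1)                   # stand-in for the QTRAN-style joint head
        print(mixer(q_joint, q_decomp, state).shape)      # torch.Size([4, 1])

The gate plays the role the abstract assigns to the adaptive module: rather than trusting a single estimator, the blended value lets training weight whichever joint estimate is currently more reliable.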

Keywords: reinforcement learning (RL); intelligent game; multi-agent reinforcement learning (MARL); agent collaboration

Liu Wei, Cheng Xu, Li Haoyuan


School of Computer Science and School of Cyberspace Security, Nanjing University of Information Science & Technology, Nanjing 210044, China


2024

Computer Systems & Applications (计算机系统应用)
Institute of Software, Chinese Academy of Sciences


CSTPCD
Impact factor: 0.449
ISSN: 1003-3254
Year, volume (issue): 2024, 33(11)