
Improved MATD3 Algorithm and Its Adversarial Application

Improving the training performance of multi-agent systems has long been a focus of reinforcement learning. Building on the multi-agent twin delayed deep deterministic policy gradient (MATD3) algorithm, a parameter sharing mechanism is introduced to improve training efficiency. To alleviate the inconsistency between real rewards and auxiliary rewards, a decay factor for auxiliary rewards is proposed, drawing on ideas from curriculum learning, so as to preserve active policy exploration in early training while ensuring reward consistency in late training. The improved MATD3 algorithm is applied to combat-vehicle adversarial games to realize intelligent decision-making for the vehicles. The application results show that the reward curves of the intelligent vehicles converge stably and perform well. In addition, simulations comparing the improved algorithm with the original MATD3 algorithm verify that the improvement effectively increases both the convergence speed and the converged reward value.
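The abstract does not give the exact form of the decay schedule, only that the auxiliary reward's weight should start high (to encourage early exploration) and vanish by the end of training (to restore reward consistency). A minimal sketch of that idea, where the function name `shaped_reward`, the `beta` schedules, and the exponential rate are all illustrative assumptions, not the paper's actual formulation:

```python
import math

def shaped_reward(real_reward, aux_reward, episode, total_episodes, mode="linear"):
    """Combine the true task reward with a decaying auxiliary reward.

    The decay factor beta starts at 1 (auxiliary shaping fully active,
    encouraging exploration early in training) and decays toward 0, so
    late training is driven by the real reward alone.

    Both schedules below are hypothetical examples of such a factor.
    """
    progress = min(episode / total_episodes, 1.0)
    if mode == "linear":
        beta = 1.0 - progress
    else:
        # Exponential decay; the rate 5.0 is arbitrary, for illustration only.
        beta = math.exp(-5.0 * progress)
    return real_reward + beta * aux_reward
```

With either schedule the shaped reward equals `real_reward + aux_reward` at episode 0 and converges to `real_reward` by the final episode, which is the consistency property the abstract describes.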

reinforcement learning; parameter sharing; reward consistency; intelligent decision-making

Wang Kun, Zhao Yingce, Wang Guangyao, Li Jianxun


Department of Automation, Shanghai Jiao Tong University, Shanghai 200240

Shenyang Aircraft Design and Research Institute, Shenyang 110035


Command Control & Simulation
The 716th Research Institute of China Shipbuilding Industry Corporation

CSTPCD
Impact factor: 0.309
ISSN: 1673-3819
Year, Volume (Issue): 2024, 46(5)