Multi-unmanned-vehicle cooperative encirclement control based on bidirectional long short-term memory and a mixed reward function
For the problem of multi-unmanned-vehicle cooperative encirclement in unknown and uncertain environments, this paper proposes BM-MADDPG, a multi-agent cooperative encirclement decision-making algorithm based on bidirectional long short-term memory (Bi-LSTM) and a mixed reward function, to address encirclement strategy generation and cooperative control for unmanned vehicles. First, a Bi-LSTM network captures the temporal features of state and action sequences, enabling an assessment of the long-term effect of taking different actions in the current state and addressing the low utilization of information in cooperative encirclement. Second, to overcome the low learning efficiency caused by sparse rewards and delayed feedback in unmanned vehicle encirclement tasks, a mixed reward function combining sparse and dense rewards is proposed; it guides the pursuers' exploration, accelerates training convergence, and strengthens cooperation among multiple unmanned vehicles. Simulation and experimental results show that, in multi-unmanned-vehicle cooperative encirclement scenarios, the proposed BM-MADDPG algorithm improves the encirclement success rate by 4.5% over MADDPG, effectively enhancing cooperative encirclement capability and training efficiency.
Keywords: multiple unmanned vehicles; BM-MADDPG; encirclement strategy; Bi-LSTM; mixed reward function
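For concreteness, the sketch below illustrates the two ingredients named in the abstract: a centralized critic that encodes a short history of joint states and actions with a Bi-LSTM before producing a Q-value, and a mixed reward that adds a dense distance-shaping term to the sparse capture bonus. This is a minimal PyTorch sketch of the general technique, not the paper's implementation; the network sizes, the `capture_radius`, the `dense_weight`, and the 10.0 success bonus are illustrative assumptions.

```python
import torch
import torch.nn as nn


class BiLSTMCritic(nn.Module):
    """Centralized critic: encodes a (state, action) sequence with a
    Bi-LSTM and maps the final step's features to a scalar Q-value."""

    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 64):
        super().__init__()
        self.lstm = nn.LSTM(obs_dim + act_dim, hidden,
                            batch_first=True, bidirectional=True)
        self.head = nn.Sequential(nn.Linear(2 * hidden, hidden),
                                  nn.ReLU(),
                                  nn.Linear(hidden, 1))

    def forward(self, obs_seq: torch.Tensor, act_seq: torch.Tensor) -> torch.Tensor:
        # obs_seq: (batch, T, obs_dim); act_seq: (batch, T, act_dim)
        x = torch.cat([obs_seq, act_seq], dim=-1)
        feats, _ = self.lstm(x)           # (batch, T, 2 * hidden)
        return self.head(feats[:, -1])    # Q-value from the last time step


def mixed_reward(pursuer_pos: torch.Tensor, target_pos: torch.Tensor,
                 prev_dist: float, capture_radius: float = 0.5,
                 dense_weight: float = 0.1):
    """Sparse capture bonus plus a dense progress term (illustrative values)."""
    dist = torch.norm(pursuer_pos - target_pos).item()
    sparse = 10.0 if dist < capture_radius else 0.0   # success bonus
    dense = dense_weight * (prev_dist - dist)          # reward closing distance
    return sparse + dense, dist


# Usage: batch of 32 sequences of length 5 with 8-dim observations, 2-dim actions.
critic = BiLSTMCritic(obs_dim=8, act_dim=2)
q = critic(torch.randn(32, 5, 8), torch.randn(32, 5, 2))  # shape (32, 1)
```

The dense term rewards each step that closes the pursuer-target distance, so the agents receive a learning signal long before the sparse capture bonus fires, which is the intuition behind combining the two reward types.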