Intelligent decision making and target assignment of multi-aircraft air combat based on the LSTM-PPO algorithm

Ding Yunlong¹, Kuang Minchi¹, Zhu Jihong², Zhu Jingyu², Qiao Zhi²

Author information

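1. School of Computer Science and Technology, Xinjiang University, Urumqi 830000
2. Department of Precision Instrument, Tsinghua University, Beijing 100084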

Abstract (translated from the Chinese)

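Conventional multi-aircraft air combat suffers from inefficient intelligent decision-making, difficulty in meeting the demands of complex air combat environments, and unreasonable target assignment. To address these problems, this paper proposes a reinforcement-learning-based method for intelligent decision-making and target assignment in multi-aircraft air combat. A long short-term memory (LSTM) network extracts features from the state and provides situational awareness; the normalized, feature-fused state information is used to train a residual network and a value network; and the agent selects the optimal action for the current situation through proximal policy optimization (PPO). Threat-assessment indicators serve as the basis for assignment: a combined threat degree is computed, and the aircraft with the highest threat value is prioritized as the attack target. To verify the algorithm, 4v4 multi-aircraft air combat experiments were carried out in a digital twin simulation environment built by the research group, and the algorithm was compared with other mainstream reinforcement learning algorithms in the same environment. The results show that LSTM-PPO achieves a markedly higher win rate in multi-aircraft air combat than the other mainstream reinforcement learning algorithms, which verifies its effectiveness.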
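
For reference, the clipped surrogate objective of standard PPO is given below. This is the textbook form; the abstract does not say which PPO variant or hyperparameters the authors use, so the symbols here ($\epsilon$, $\hat{A}_t$, $\pi_{\theta_{\text{old}}}$) are the usual ones and not taken from the paper.

$$
L^{\mathrm{CLIP}}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\big(r_t(\theta),\,1-\epsilon,\,1+\epsilon\big)\,\hat{A}_t\right)\right],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}
$$

Here $\hat{A}_t$ is an advantage estimate (typically computed from the value network) and $\epsilon$ is the clipping range that limits how far one update can move the policy away from the previous one.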

Abstract

As air battlefields become increasingly intelligent and information-driven, intelligent air combat has become a key factor in determining the outcome of a battle. Conventional multi-aircraft air combat suffers from inefficient intelligent decision-making, difficulty in meeting the demands of complex air combat environments, and unreasonable target allocation. To address these problems, we introduce a long short-term memory-proximal policy optimization (LSTM-PPO) algorithm. A long short-term memory network extracts features from the state and perceives the situation; the agent trains a residual network and a value network on the normalized, feature-fused state information and chooses the optimal action for the current situation through proximal policy optimization. A reward function containing expert knowledge is embedded in the training process to address the problem of sparse rewards. In addition, a target allocation algorithm based on threat value calculation is presented. Using angle, speed, and height threat values as the basis for target allocation, the ID of the target aircraft with the highest threat value on the battlefield is calculated in real time. When the policy network outputs an attack action, target allocation is performed. To confirm the effectiveness of the algorithm, we carried out 4v4 multi-aircraft air combat experiments in a digital twin simulation environment built by our research group. The red team consists of reinforcement learning agents based on the LSTM-PPO algorithm, whereas the blue team is a finite state machine built from expert knowledge bases. After more than 1200 rounds of aerial confrontation, the algorithm converged and the win rate of the red team reached 82%. Furthermore, we assessed the performance of four other mainstream reinforcement learning algorithms in 4v4 air combat experiments under the same experimental conditions. The results show that the deep Q-network (DQN) and soft actor-critic (SAC) algorithms have difficulty handling high-dimensional continuous action spaces and multi-agent collaboration. The multi-agent deep deterministic policy gradient (MADDPG) algorithm employs a multi-agent strategy and cooperative training, so it exhibits a significantly higher win rate than the DQN and SAC algorithms. The multi-agent proximal policy optimization (MAPPO) algorithm has a relatively high failure rate and, in some cases, is not stable enough to cope with the enemy aircraft's strategies. The LSTM-PPO algorithm shows a significantly higher win rate than the other mainstream reinforcement learning algorithms in multi-aircraft collaborative air combat, which confirms its effectiveness in dealing with high-dimensional continuous action spaces and multi-aircraft collaborative operations.
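
The threat-based target allocation described above can be illustrated with a minimal Python sketch. Everything specific in it is an assumption made for illustration and is not taken from the paper: the `Aircraft` fields, the form of the three threat functions, the height scale `H_SCALE`, and the weights `WEIGHTS`. It only shows the general idea of combining angle, speed, and height threat values and selecting the enemy with the highest combined value.

```python
import math
from dataclasses import dataclass

@dataclass
class Aircraft:
    ident: int
    x: float         # horizontal position, m
    y: float         # horizontal position, m
    altitude: float  # m
    speed: float     # m/s
    heading: float   # rad, direction of travel in the horizontal plane

H_SCALE = 5000.0           # hypothetical altitude scale, m
WEIGHTS = (0.4, 0.3, 0.3)  # hypothetical weights: angle, speed, height

def angle_threat(own: Aircraft, enemy: Aircraft) -> float:
    """1.0 when the enemy's nose points straight at us, 0.0 when pointing away."""
    line_of_sight = math.atan2(own.y - enemy.y, own.x - enemy.x)
    # Wrap the difference into [-pi, pi) and take its magnitude.
    off_boresight = abs((enemy.heading - line_of_sight + math.pi) % (2 * math.pi) - math.pi)
    return 1.0 - off_boresight / math.pi

def speed_threat(own: Aircraft, enemy: Aircraft) -> float:
    """Share of the enemy's speed in the combined speed; 0.5 means equal speeds."""
    total = own.speed + enemy.speed
    return enemy.speed / total if total > 0 else 0.5

def height_threat(own: Aircraft, enemy: Aircraft) -> float:
    """Above 0.5 when the enemy is higher than us, clamped to [0, 1]."""
    value = 0.5 + (enemy.altitude - own.altitude) / (2 * H_SCALE)
    return min(max(value, 0.0), 1.0)

def combined_threat(own: Aircraft, enemy: Aircraft) -> float:
    w_a, w_s, w_h = WEIGHTS
    return (w_a * angle_threat(own, enemy)
            + w_s * speed_threat(own, enemy)
            + w_h * height_threat(own, enemy))

def assign_target(own: Aircraft, enemies: list[Aircraft]) -> int:
    """Return the ID of the enemy aircraft with the highest combined threat."""
    return max(enemies, key=lambda e: combined_threat(own, e)).ident

own = Aircraft(0, 0.0, 0.0, 5000.0, 250.0, 0.0)
enemies = [
    Aircraft(1, 10000.0, 0.0, 6000.0, 300.0, math.pi),  # head-on, higher, faster
    Aircraft(2, 10000.0, 0.0, 4000.0, 200.0, 0.0),      # flying away, lower, slower
]
print(assign_target(own, enemies))
```

In a setup like the one the abstract describes, `assign_target` would be called only when the policy network outputs an attack action; how the paper actually weights and normalizes the individual threat terms is not stated in the abstract.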

Key words

multi-aircraft air combat/intelligent decision/proximal policy optimization/threat assessment/dynamic target assignment

Publication year

2024

Chinese Journal of Engineering (工程科学学报)

University of Science and Technology Beijing

Indexed in: CSTPCD, CSCD, PKU Core (北大核心)

Impact factor: 0.801

ISSN: 2095-9389

Number of references: 27
