
Attention-based Recurrent PPO Algorithm and Its Application

To address the problems that deep reinforcement learning algorithms face in partially observable environments, such as insufficient information about the environment and the presence of random factors, a proximal policy optimization algorithm combining an attention mechanism with a recurrent neural network (ARPPO) is proposed. The algorithm first extracts features from the encoded environment image through convolutional network layers; an attention mechanism then highlights the key information in the state; an LSTM network next extracts the temporal characteristics of the data; finally, policy learning and training are performed with a PPO algorithm built on the Actor-Critic architecture. Ablation and comparative experiments on two exploration tasks were designed in the Gym-Minigrid environment. The results show that ARPPO converges faster than the existing A2C, PPO, and RPPO algorithms, remains highly stable after convergence, and adapts better to unknown environments containing random factors.
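As a reference for the architecture the abstract describes (convolutional feature extraction, attention over the state, an LSTM for temporal features, and Actor-Critic heads for PPO), here is a minimal sketch assuming PyTorch. The layer sizes, attention head count, and pooling choice are illustrative assumptions, not the paper's reported hyperparameters.

```python
import torch
import torch.nn as nn

class ARPPONet(nn.Module):
    """Sketch of the ARPPO pipeline: CNN features -> attention -> LSTM -> actor/critic heads.
    All sizes below are illustrative assumptions."""

    def __init__(self, n_actions: int, embed_dim: int = 64):
        super().__init__()
        # Convolutional layers extract features from the (partial) image observation.
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=2), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=2), nn.ReLU(),
            nn.Conv2d(32, embed_dim, kernel_size=2), nn.ReLU(),
        )
        # Self-attention over spatial positions highlights the key information in the state.
        self.attn = nn.MultiheadAttention(embed_dim, num_heads=4, batch_first=True)
        # LSTM captures temporal dependencies across successive observations.
        self.lstm = nn.LSTM(embed_dim, 128, batch_first=True)
        # Actor-Critic heads used by the PPO update.
        self.actor = nn.Linear(128, n_actions)
        self.critic = nn.Linear(128, 1)

    def forward(self, obs, hidden=None):
        # obs: (batch, 3, H, W) one observation per step; hidden: LSTM state carried across steps.
        f = self.cnn(obs)                        # (batch, embed_dim, h, w)
        f = f.flatten(2).transpose(1, 2)         # (batch, h*w, embed_dim) spatial tokens
        a, _ = self.attn(f, f, f)                # attention over spatial tokens
        pooled = a.mean(dim=1, keepdim=True)     # (batch, 1, embed_dim)
        out, hidden = self.lstm(pooled, hidden)  # one recurrent step
        h = out.squeeze(1)                       # (batch, 128)
        return self.actor(h), self.critic(h), hidden
```

Threading the LSTM hidden state across time steps is what lets the policy integrate information beyond the current partial observation, which is the point of the recurrent component in a partially observable environment.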
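The policy-learning step relies on PPO's clipped surrogate objective; the abstract names PPO but gives no equations, so the standard formulation (Schulman et al., 2017) is sketched below with illustrative variable names.

```python
import torch

def ppo_clipped_loss(new_logp, old_logp, advantage, clip_eps=0.2):
    """Standard PPO clipped surrogate loss (returned as a quantity to minimize)."""
    ratio = torch.exp(new_logp - old_logp)  # r_t(theta) = pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage
    # Taking the elementwise minimum keeps the policy update within the trust region.
    return -torch.min(unclipped, clipped).mean()
```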

Keywords: deep reinforcement learning; partially observable; attention mechanism; LSTM network; proximal policy optimization algorithm

Authors: 吕相霖, 臧兆祥, 李思博, 王俊英


Hubei Key Laboratory of Intelligent Vision Based Monitoring in Hydroelectric Engineering, China Three Gorges University, Yichang, Hubei 443002, China

College of Computer and Information Technology, China Three Gorges University, Yichang, Hubei 443002, China


Funding: National Natural Science Foundation of China (61502274); Natural Science Foundation of Hubei Province (2015CFB336)

Journal: Computer Technology and Development (计算机技术与发展)
Publisher: Shaanxi Computer Society
Indexed in: CSTPCD
Impact factor: 0.621
ISSN: 1673-629X
Year, Volume (Issue): 2024, 34(1)