
Attention-based Recurrent PPO Algorithm and Its Application

To address the problems that deep reinforcement learning algorithms face in partially observable environments, such as insufficient information about the environment and the presence of random factors, a proximal policy optimization algorithm combining an attention mechanism with a recurrent neural network (ARPPO) is proposed. The algorithm first extracts features from the encoded environment image through convolutional network layers; an attention mechanism then highlights the key information in the state; an LSTM network next extracts the temporal characteristics of the data; finally, policy learning and training are performed with a PPO algorithm built on the Actor-Critic architecture. Ablation and comparative experiments on two exploration tasks were designed in the Gym-Minigrid environment. The results show that ARPPO converges faster than the existing A2C, PPO, and RPPO algorithms, remains highly stable after convergence, and adapts better to unknown environments containing random factors.
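As a reference for the architecture the abstract describes (convolutional feature extraction, attention over the state, an LSTM for temporal features, and Actor-Critic heads for PPO), here is a minimal sketch assuming PyTorch. The layer sizes, attention head count, and pooling choice are illustrative assumptions, not the paper's reported hyperparameters.

```python
import torch
import torch.nn as nn

class ARPPONet(nn.Module):
    """Sketch of the ARPPO pipeline: CNN features -> attention -> LSTM -> actor/critic heads.
    All sizes below are illustrative assumptions."""

    def __init__(self, n_actions: int, embed_dim: int = 64):
        super().__init__()
        # Convolutional layers extract features from the (partial) image observation.
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=2), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=2), nn.ReLU(),
            nn.Conv2d(32, embed_dim, kernel_size=2), nn.ReLU(),
        )
        # Self-attention over spatial positions highlights the key information in the state.
        self.attn = nn.MultiheadAttention(embed_dim, num_heads=4, batch_first=True)
        # LSTM captures temporal dependencies across successive observations.
        self.lstm = nn.LSTM(embed_dim, 128, batch_first=True)
        # Actor-Critic heads used by the PPO update.
        self.actor = nn.Linear(128, n_actions)
        self.critic = nn.Linear(128, 1)

    def forward(self, obs, hidden=None):
        # obs: (batch, 3, H, W) one observation per step; hidden: LSTM state carried across steps.
        f = self.cnn(obs)                        # (batch, embed_dim, h, w)
        f = f.flatten(2).transpose(1, 2)         # (batch, h*w, embed_dim) spatial tokens
        a, _ = self.attn(f, f, f)                # attention over spatial tokens
        pooled = a.mean(dim=1, keepdim=True)     # (batch, 1, embed_dim)
        out, hidden = self.lstm(pooled, hidden)  # one recurrent step
        h = out.squeeze(1)                       # (batch, 128)
        return self.actor(h), self.critic(h), hidden
```

Threading the LSTM hidden state across time steps is what lets the policy integrate information beyond the current partial observation, which is the point of the recurrent component in a partially observable environment.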
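The policy-learning step relies on PPO's clipped surrogate objective; the abstract names PPO but gives no equations, so the standard formulation (Schulman et al., 2017) is sketched below with illustrative variable names.

```python
import torch

def ppo_clipped_loss(new_logp, old_logp, advantage, clip_eps=0.2):
    """Standard PPO clipped surrogate loss (returned as a quantity to minimize)."""
    ratio = torch.exp(new_logp - old_logp)  # r_t(theta) = pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage
    # Taking the elementwise minimum keeps the policy update within the trust region.
    return -torch.min(unclipped, clipped).mean()
```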

Keywords: deep reinforcement learning; partially observable; attention mechanism; LSTM network; proximal policy optimization algorithm

Authors: 吕相霖, 臧兆祥, 李思博, 王俊英


Hubei Key Laboratory of Intelligent Vision Based Monitoring in Hydroelectric Engineering, China Three Gorges University, Yichang, Hubei 443002, China

College of Computer and Information Technology, China Three Gorges University, Yichang, Hubei 443002, China


Funding: National Natural Science Foundation of China (61502274); Natural Science Foundation of Hubei Province (2015CFB336)

Journal: Computer Technology and Development (计算机技术与发展)
Publisher: Shaanxi Computer Society
Indexed in: CSTPCD
Impact factor: 0.621
ISSN: 1673-629X
Year, Volume (Issue): 2024, 34(1)