Twin delayed deep deterministic policy gradient based on optimistic exploration
Twin delayed deep deterministic policy gradient (TD3) is a mainstream model-free deep reinforcement learning algorithm that has been successfully applied to challenging continuous control tasks. However, when rewards are sparse or the state space is large, TD3 suffers from poor sample efficiency and weak exploration of the environment. To address the inefficient exploration caused by defining the objective function through the lower bound of the double Q-value functions, a twin delayed deep deterministic policy gradient based on optimistic exploration (TD3-OE) is proposed. First, starting from the double Q-value functions, it is shown that taking their lower bound makes exploration somewhat pessimistic. Then, a Gaussian function and a piecewise function are used to fit the double Q-value functions, respectively. Finally, an exploration policy is constructed from the fitted Q-value function and the target policy to guide the agent's exploration of the environment. This exploration policy prevents the agent from settling on sub-optimal policies, thereby effectively alleviating the problem of inefficient exploration. The proposed algorithm is compared with benchmark algorithms on control tasks built on the MuJoCo physics engine to verify its effectiveness. The experimental results show that the proposed algorithm matches or exceeds other baseline reinforcement learning algorithms in terms of reward, stability and learning speed.
deep reinforcement learning; twin delayed deep deterministic policy gradient; exploration policy; optimistic exploration
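
The following is a minimal Python sketch contrasting TD3's pessimistic lower-bound target, min(Q1, Q2), with an optimistic value estimate built from the mean and spread of the twin Q-values for use on the exploration side. The specific Gaussian and piecewise fitting used by TD3-OE is not reproduced here; the upper-confidence form and the coefficient beta are illustrative assumptions, not the paper's exact construction.

```python
# Illustrative sketch (assumptions): TD3's pessimistic target vs. an
# optimistic exploration-side estimate from the twin critics.
# The optimistic form below is a stand-in for TD3-OE's Gaussian/piecewise
# fit of the double Q-value functions, which the abstract does not detail.
import torch


def pessimistic_target(q1, q2, reward, not_done, gamma=0.99):
    """Standard TD3 target: the lower bound of the twin Q-values."""
    q_min = torch.min(q1, q2)
    return reward + gamma * not_done * q_min


def optimistic_estimate(q1, q2, beta=1.0):
    """Illustrative optimistic estimate: the mean of the twin Q-values plus
    a share of their disagreement, used as a rough uncertainty proxy."""
    q_mean = 0.5 * (q1 + q2)
    q_spread = 0.5 * (q1 - q2).abs()
    return q_mean + beta * q_spread


# Toy usage: twin critic outputs for a small batch of next state-action pairs.
q1 = torch.tensor([1.0, 2.5, 0.3])
q2 = torch.tensor([1.4, 2.0, 0.9])
reward = torch.tensor([0.1, 0.0, 1.0])
not_done = torch.tensor([1.0, 1.0, 0.0])

print(pessimistic_target(q1, q2, reward, not_done))  # conservative TD target
print(optimistic_estimate(q1, q2))                   # exploration-side estimate
```

The intent of the sketch is only to show the distinction the abstract draws: the lower bound is kept for stable value targets, while a more optimistic combination of the twin estimates can drive the exploration policy.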