Deep reinforcement learning navigation algorithm combining advantage structure and minimum target Q-value
Existing deep reinforcement learning methods based on policy gradients suffer from long training times and low learning efficiency when applied to robot navigation in complex indoor scenes such as offices and corridors. This paper proposes a deep reinforcement learning navigation algorithm that combines an advantage structure with a minimum target Q-value. The algorithm introduces the advantage structure into policy-gradient-based deep reinforcement learning to distinguish differences between actions under the same state value, improving learning efficiency. In multi-target navigation scenarios, the method estimates the state value separately, using map information to provide more accurate value judgments. Because methods that mitigate target Q-value overestimation in discrete control are difficult to apply within the mainstream Actor-Critic framework, a minimum target Q-value method based on Gaussian smoothing is designed to reduce the influence of overestimation on training. Experimental results show that the proposed algorithm effectively speeds up learning: in both single-target and multi-target continuous navigation training, it converges faster than SAC (Soft Actor-Critic), TD3 (Twin Delayed Deep Deterministic Policy Gradient), and DDPG (Deep Deterministic Policy Gradient), keeps the mobile robot effectively away from obstacles, and yields a navigation model with good generalization ability.
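The minimum target Q-value with Gaussian smoothing described above resembles TD3-style target-policy smoothing combined with taking the minimum over two target critics. The following is a minimal NumPy sketch of that general idea; the function names, noise parameters, and the assumed action range [-1, 1] are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def smoothed_min_target_q(q1, q2, next_states, target_policy,
                          sigma=0.2, noise_clip=0.5, rng=None):
    """Conservative target Q: minimum of two target critics evaluated at
    Gaussian-smoothed target actions.

    q1, q2:        callables (state, action) -> Q estimate
    target_policy: callable state -> action (assumed range [-1, 1])
    sigma:         std-dev of the Gaussian smoothing noise (assumed value)
    noise_clip:    clipping range for the noise (assumed value)
    """
    rng = np.random.default_rng(0) if rng is None else rng
    actions = target_policy(next_states)
    # Gaussian smoothing of the target action, with clipped noise
    noise = np.clip(rng.normal(0.0, sigma, size=actions.shape),
                    -noise_clip, noise_clip)
    smoothed = np.clip(actions + noise, -1.0, 1.0)
    # Minimum over the two critics mitigates overestimation bias
    return np.minimum(q1(next_states, smoothed), q2(next_states, smoothed))
```

The element-wise minimum keeps the bootstrapped target pessimistic, while the smoothing noise prevents the critic from exploiting sharp, spurious Q-value peaks at particular actions.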

reinforcement learning; mobile robot; navigation; advantage structure; minimize target Q-value

朱威、洪力栋、施海东、何德峰


College of Information Engineering, Zhejiang University of Technology, Hangzhou 312000, Zhejiang, China


Funding: National Natural Science Foundation of China (62173303); Zhejiang Provincial Natural Science Foundation (LY21F010009)

2024

Control Theory & Applications (控制理论与应用)
South China University of Technology; Academy of Mathematics and Systems Science, Chinese Academy of Sciences


Indexed in: CSTPCD; Peking University Core Journals (北大核心)
Impact factor: 1.076
ISSN: 1000-8152
Year, volume (issue): 2024, 41(4)