
Grasping experiment design of a mobile manipulator based on an improved proximal policy optimization algorithm

To address the learning difficulty and the tendency to fall into local optima that the proximal policy optimization (PPO) algorithm exhibits when training a mobile manipulator, six feasible improvement methods are introduced: advantage normalization, state normalization, reward scaling, policy entropy, gradient clipping, and standard deviation limitation. These methods are used to adjust the steps of the PPO algorithm at each stage of data collection and training, optimizing the algorithm's stability and learning efficiency, and a corresponding experiment is designed for each improvement point. The experimental results show that, on the task of training the mobile manipulator to grasp objects, all six improvements enhance the PPO algorithm to different degrees. The improved PPO algorithm greatly improves the reward curve of the mobile manipulator and converges rapidly to the desired result.
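The abstract names the data-side adjustments only at a high level. The following minimal Python sketch illustrates how advantage normalization, state normalization with running statistics, and reward scaling are commonly implemented in PPO pipelines; the class and function names, clipping bounds, and discount factor are illustrative assumptions, not the authors' code.

import numpy as np

class RunningMeanStd:
    """Running mean/variance tracker used for state normalization (Welford-style batch update)."""
    def __init__(self, shape=()):
        self.mean = np.zeros(shape, dtype=np.float64)
        self.var = np.ones(shape, dtype=np.float64)
        self.count = 1e-4

    def update(self, x):
        batch_mean, batch_var, batch_count = x.mean(axis=0), x.var(axis=0), x.shape[0]
        delta = batch_mean - self.mean
        total = self.count + batch_count
        new_var = (self.var * self.count + batch_var * batch_count
                   + delta ** 2 * self.count * batch_count / total) / total
        self.mean = self.mean + delta * batch_count / total
        self.var, self.count = new_var, total

def normalize_state(obs, rms, clip=10.0):
    """State normalization: zero-mean, unit-variance observations, clipped to a fixed range."""
    return np.clip((obs - rms.mean) / np.sqrt(rms.var + 1e-8), -clip, clip)

def normalize_advantage(adv):
    """Advantage normalization: standardize the advantage estimates within each batch."""
    return (adv - adv.mean()) / (adv.std() + 1e-8)

class RewardScaler:
    """Reward scaling: divide each reward by the running std of the discounted return."""
    def __init__(self, gamma=0.99):
        self.rms, self.ret, self.gamma = RunningMeanStd(shape=()), 0.0, gamma

    def __call__(self, reward):
        self.ret = self.gamma * self.ret + reward
        self.rms.update(np.array([self.ret]))
        return reward / np.sqrt(self.rms.var + 1e-8)

In a typical training loop, observations would be normalized before being fed to the policy, rewards scaled before computing returns, and advantages standardized once per update batch.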
[Objective] As a common reinforcement learning algorithm, proximal policy optimization (PPO) performs well in terms of stability, sample utilization, and wide applicability. However, its training effect in the mobile manipulator environment is not ideal. To address the learning difficulty and the tendency to fall into local optima when PPO is used to train a mobile manipulator, six feasible improvement methods are introduced in this paper and the original algorithm is adjusted accordingly. These methods are advantage normalization, state normalization, reward scaling, policy entropy, gradient clipping, and standard deviation limitation.

[Methods] The methods adjust the steps of the PPO algorithm at each stage of data collection and training and optimize the stability and learning efficiency of the algorithm. They are generally applicable, and an appropriate subset can be selected to adjust the original algorithm in different environments. Before training, the training environment of the mobile manipulator was modeled, including the design of the state space, the action space, and the reward function. The main flow of the PPO algorithm in this environment was analyzed, and the steps that could be improved were identified. To verify the effect of each improved PPO algorithm on training the mobile manipulator to grasp objects, an experiment was designed for each improvement point. All six improvements were then added to the PPO algorithm to train the mobile manipulator, and its reward curve was compared with that of the original algorithm.

[Results] The experimental results show the following. (1) The native PPO algorithm performs poorly in the mobile manipulator environment, and its reward often decays to a small value late in training. (2) With advantage normalization, the reward curve of PPO rises faster and reaches a stable value in fewer episodes. (3) Although state normalization yields the largest improvement in the reward range, it introduces instability, causing the reward curve to decline noticeably late in training. (4) Reward scaling destabilizes the training of the mobile manipulator in the early stage; however, the PPO algorithm is robust enough for the adjustment effect of reward scaling to emerge later in training. (5) A policy entropy coefficient of 0.2 is more suitable for training the PPO algorithm on grasping objects with the mobile manipulator. (6) Gradient clipping prevents excessively large gradient updates from causing training instability or drastic performance fluctuations and has a significant positive effect on the rise of the reward curve.

[Conclusions] Analysis of the reward curves of the PPO algorithm improved by each method shows that the six improvement methods enhance the PPO algorithm to different degrees on the task of training the mobile manipulator to grasp objects. The PPO algorithm improved with all six methods (PPO6+) outperforms the original PPO algorithm in terms of the stability of the reward curve, the speed at which the reward rises, and the time to complete training; it converges quickly to the desired result and accomplishes the task goal.
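The optimizer-side adjustments mentioned above (policy entropy, gradient clipping, and standard deviation limitation) can be illustrated with a short, hedged PyTorch-style sketch of a single PPO update step. The policy interface, the loss composition, the clipping bounds, and the coefficient values (including reusing the 0.2 entropy weight reported in the Results) are assumptions for illustration, not the paper's implementation.

import torch
import torch.nn as nn

LOG_STD_MIN, LOG_STD_MAX = -20.0, 2.0                   # standard deviation limitation (assumed bounds)
CLIP_EPS, ENTROPY_COEF, MAX_GRAD_NORM = 0.2, 0.2, 0.5   # illustrative coefficients

def ppo_update(policy, optimizer, obs, actions, old_log_probs, advantages, value_loss):
    """One PPO update with an entropy bonus, gradient clipping, and a bounded Gaussian std."""
    mean, log_std = policy(obs)                               # assumed policy interface
    log_std = torch.clamp(log_std, LOG_STD_MIN, LOG_STD_MAX)  # keep the action std in a sane range
    dist = torch.distributions.Normal(mean, log_std.exp())
    log_probs = dist.log_prob(actions).sum(-1)
    entropy = dist.entropy().sum(-1).mean()

    # Clipped surrogate objective of PPO.
    ratio = torch.exp(log_probs - old_log_probs)
    surr1 = ratio * advantages
    surr2 = torch.clamp(ratio, 1.0 - CLIP_EPS, 1.0 + CLIP_EPS) * advantages
    policy_loss = -torch.min(surr1, surr2).mean()

    # Policy entropy: an entropy bonus encourages exploration.
    loss = policy_loss + value_loss - ENTROPY_COEF * entropy

    optimizer.zero_grad()
    loss.backward()
    # Gradient clipping: bound the global gradient norm before the optimizer step.
    nn.utils.clip_grad_norm_(policy.parameters(), MAX_GRAD_NORM)
    optimizer.step()
    return loss.item()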

proximal policy optimization; mobile robotic arm; deep reinforcement learning

王永华、钟欣见、李明


School of Automation, Guangdong University of Technology, Guangzhou, Guangdong 510006, China

Collaborative Innovation Center for "Great Ideological and Political Course" Construction, Guangdong University of Technology, Guangzhou, Guangdong 510006, China


Special Project of the Ministry of Education Virtual Teaching and Research Office for the Control Theory Course Group in Higher Education Institutions; Guangdong Provincial Postgraduate Education Innovation Plan (2022); Research Project of the Collaborative Center for "Great Ideological and Political Course" Construction (2022)

220305; 2022JGXM052; 2022DSZK06

2024

实验技术与管理 (Experimental Technology and Management)
Tsinghua University


CSTPCD; Peking University Core Journals (北大核心)
Impact factor: 1.651
ISSN: 1002-4956
Year, volume (issue): 2024, 41(4)