Grasping experiment design of a mobile manipulator based on proximal policy optimization
[Objective] As a common reinforcement learning algorithm, proximal policy optimization (PPO) performs well in terms of stability, sample efficiency, and wide applicability. However, its training effect in the mobile manipulator environment is not ideal. To address the learning difficulty of the PPO algorithm and its tendency to fall into local optima when training a mobile manipulator, this paper introduces six feasible improvement methods and adjusts the original algorithm accordingly. These methods are advantage normalization, state normalization, reward scaling, policy entropy, gradient clipping, and standard deviation limitation.

[Methods] The six methods adjust the steps of the PPO algorithm at each stage of data collection and training and improve the stability and learning efficiency of the algorithm. They are broadly applicable, and several of them can be selected as appropriate to adjust the original algorithm in different environments. Before training, the environment of the mobile manipulator was modeled, including the design of the state space, the action space, and the reward function. The main flow of the PPO algorithm in this environment was analyzed, and the training steps that could be improved were identified. To verify the effect of each improved PPO algorithm on training the mobile manipulator to pick up objects, an experiment was designed for each improvement point. All six improvements were then added to the PPO algorithm to train the mobile manipulator, and its reward curve was compared with that of the original algorithm.

[Results] The experimental results show that (1) the native PPO algorithm performs poorly in the mobile manipulator environment, and its reward often decays to a small value late in training; (2) with advantage normalization, the reward curve of PPO rises faster and reaches a stable value in fewer episodes; (3) although state normalization brings the greatest improvement in reward magnitude, it introduces instability, and the reward curve shows a clear downward trend late in training; (4) reward scaling destabilizes the training of the mobile manipulator in the early stage, but the PPO algorithm is robust enough for the reward-scaling adjustment to take effect in the later stage of training; (5) a policy entropy coefficient of 0.2 is more suitable for training the PPO algorithm on the object-grasping task of the mobile manipulator; and (6) gradient clipping prevents excessively large gradient updates from causing training instability or dramatic performance fluctuations and has a significant positive effect on the rise of the reward curve.

[Conclusions] Analysis of the reward curves of the PPO algorithm improved by each method shows that the six improvement methods improve the PPO algorithm to different degrees in the task of training the mobile manipulator to pick up objects. The improved PPO6+ algorithm outperforms the original PPO algorithm in stability, reward rising speed, and training completion time; it converges quickly to the desired result and completes the task goal.
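The abstract names three design choices for the environment model: the state space, the action space, and the reward function. As a purely illustrative sketch of where these choices sit in code (the paper's actual observation dimensions, action definitions, and reward shaping are not given in the abstract, so every number and term below is a placeholder), a gymnasium-style environment skeleton might look as follows.

```python
import numpy as np
import gymnasium as gym
from gymnasium import spaces


class MobileManipulatorGraspEnv(gym.Env):
    """Hypothetical skeleton marking where the three design choices live."""

    def __init__(self):
        super().__init__()
        # State space design: e.g. base pose, joint angles, end-effector and
        # object positions. The 20-dim size is a placeholder, not from the paper.
        self.observation_space = spaces.Box(-np.inf, np.inf, shape=(20,), dtype=np.float32)
        # Action space design: e.g. base velocity commands plus joint velocity
        # commands. Eight dimensions is likewise a placeholder.
        self.action_space = spaces.Box(-1.0, 1.0, shape=(8,), dtype=np.float32)

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        obs = np.zeros(self.observation_space.shape, dtype=np.float32)
        return obs, {}

    def step(self, action):
        obs = np.zeros(self.observation_space.shape, dtype=np.float32)
        # Reward function design: commonly a distance-to-object term plus a
        # grasp-success bonus; the exact shaping used in the paper is not shown here.
        reward = 0.0
        terminated, truncated = False, False
        return obs, reward, terminated, truncated, {}
```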
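Two of the six adjustments act on the data-collection side. The following is a minimal sketch, assuming a standard running-statistics implementation rather than the authors' code: state normalization rescales observations with a running mean and variance, and reward scaling divides each reward by the running standard deviation of the discounted return. All class and parameter names are illustrative.

```python
import numpy as np


class RunningMeanStd:
    """Tracks the mean and variance of a stream of values (Welford-style update)."""

    def __init__(self, shape):
        self.mean = np.zeros(shape, dtype=np.float64)
        self.var = np.ones(shape, dtype=np.float64)   # warm start avoids dividing by ~0 early on
        self.count = 1e-4

    def update(self, x: np.ndarray):
        delta = x - self.mean
        self.count += 1.0
        self.mean += delta / self.count
        self.var += (delta * (x - self.mean) - self.var) / self.count


class StateNormalizer:
    """State normalization: rescale observations to roughly zero mean, unit variance."""

    def __init__(self, obs_dim: int):
        self.rms = RunningMeanStd(obs_dim)

    def __call__(self, obs: np.ndarray) -> np.ndarray:
        self.rms.update(obs)
        return (obs - self.rms.mean) / np.sqrt(self.rms.var + 1e-8)


class RewardScaler:
    """Reward scaling: divide each reward by the running std of the discounted return."""

    def __init__(self, gamma: float = 0.99):
        self.gamma = gamma
        self.ret = 0.0
        self.rms = RunningMeanStd(())

    def __call__(self, reward: float) -> float:
        self.ret = self.gamma * self.ret + reward
        self.rms.update(np.asarray(self.ret))
        return reward / float(np.sqrt(self.rms.var + 1e-8))

    def reset(self):
        self.ret = 0.0  # typically called at the end of each episode
```

Both objects keep their statistics across episodes; only the reward scaler's return accumulator is reset at episode boundaries, which is one reason reward scaling can look unstable early in training before the statistics settle.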
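The remaining four adjustments act inside the PPO update itself. The sketch below is again an assumption rather than the paper's implementation: it shows how advantage normalization, a policy-entropy bonus (with the 0.2 value from the experiments interpreted here as the entropy coefficient in the loss), gradient clipping, and a clamp on the policy's log standard deviation can be folded into one PyTorch update step. Network sizes, clamp bounds, and the other hyperparameters are placeholders.

```python
import torch
import torch.nn as nn


class GaussianPolicy(nn.Module):
    """Diagonal-Gaussian policy whose log standard deviation is clamped."""

    def __init__(self, obs_dim: int, act_dim: int, log_std_bounds=(-2.0, 0.5)):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(),
                                  nn.Linear(64, 64), nn.Tanh())
        self.mu_head = nn.Linear(64, act_dim)
        self.log_std = nn.Parameter(torch.zeros(act_dim))
        self.log_std_bounds = log_std_bounds  # assumed interval, not from the paper

    def dist(self, obs: torch.Tensor) -> torch.distributions.Normal:
        mu = self.mu_head(self.body(obs))
        # Standard-deviation limitation: keep log_std inside a fixed interval.
        log_std = self.log_std.clamp(*self.log_std_bounds)
        return torch.distributions.Normal(mu, log_std.exp())


def ppo_update(policy, value_fn, optimizer, obs, actions, old_log_probs,
               returns, advantages, clip_eps=0.2, entropy_coef=0.2,
               value_coef=0.5, max_grad_norm=0.5):
    # Advantage normalization: zero-mean, unit-variance advantages per batch.
    advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)

    dist = policy.dist(obs)
    log_probs = dist.log_prob(actions).sum(-1)
    ratio = (log_probs - old_log_probs).exp()

    # Clipped surrogate objective of PPO.
    surr1 = ratio * advantages
    surr2 = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    policy_loss = -torch.min(surr1, surr2).mean()

    value_loss = (value_fn(obs).squeeze(-1) - returns).pow(2).mean()

    # Policy-entropy bonus encourages exploration; 0.2 follows the coefficient
    # reported as suitable in the grasping experiments.
    entropy = dist.entropy().sum(-1).mean()

    loss = policy_loss + value_coef * value_loss - entropy_coef * entropy

    optimizer.zero_grad()
    loss.backward()
    # Gradient clipping: bound the global gradient norm before the parameter update.
    nn.utils.clip_grad_norm_(list(policy.parameters()) + list(value_fn.parameters()),
                             max_grad_norm)
    optimizer.step()
    return loss.item()
```

In practice such an update is run for several epochs over minibatches of each collected rollout; the clipping threshold, entropy coefficient, and gradient-norm bound are the usual knobs to adjust when the reward curve is unstable.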