
Model-Agnostic Meta-Reinforcement Learning Based on Similarity Weighting

Reinforcement learning has achieved strong results in fields such as game playing and robot control. To further improve training efficiency, meta-learning has been extended to reinforcement learning, and the resulting meta-reinforcement learning has become a research hotspot in the reinforcement learning community. The quality of the meta-knowledge is the key factor determining the effectiveness of meta-reinforcement learning; gradient-based meta-reinforcement learning uses the initial parameters of the model as meta-knowledge to guide subsequent learning. To improve the quality of this meta-knowledge, we propose a general meta-reinforcement learning method that makes the contribution of each subtask to the training effect explicit through a weighting mechanism. The method takes the similarity between the gradient update vector obtained on each subtask and all gradient update vectors in the task set as the update weight, refining the gradient update process, improving the quality of the meta-knowledge (the initial model parameters), and enabling the trained model to solve new tasks from a good starting point. The method can be applied to any gradient-based reinforcement learning algorithm and achieves the goal of quickly solving new tasks with a small number of samples. In comparative experiments on 2D navigation tasks and simulated robot locomotion control tasks, the method outperforms the baseline algorithms, demonstrating the soundness of the weighting mechanism.
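As a minimal sketch of the weighting mechanism described above, the outer (meta) update of a MAML-style learner could weight each task's gradient update vector by its similarity to the aggregate of all task gradients in the batch. The abstract does not specify the similarity measure or the weight normalization, so the cosine similarity and softmax normalization below, and all function and variable names, are assumptions for illustration only:

    import numpy as np

    def similarity_weighted_meta_update(theta, task_grads, beta=0.01):
        # task_grads: list of per-task meta-gradient vectors, each of shape (dim,)
        grads = np.stack(task_grads)                  # (num_tasks, dim)
        aggregate = grads.sum(axis=0)                 # combined update direction of the task set
        # Cosine similarity of each task's gradient to the aggregate direction.
        sims = grads @ aggregate / (
            np.linalg.norm(grads, axis=1) * np.linalg.norm(aggregate) + 1e-8
        )
        # Assumed normalization: softmax over similarities to obtain weights.
        weights = np.exp(sims - sims.max())
        weights /= weights.sum()
        # Outer-loop gradient step with the similarity-weighted combination.
        return theta - beta * (weights[:, None] * grads).sum(axis=0)

Under this scheme, a task whose gradient points in roughly the same direction as the rest of the batch contributes more to the meta-update, while an outlier task is down-weighted, which matches the stated goal of making each subtask's contribution to the shared initial parameters explicit.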

meta-learning; reinforcement learning; meta-reinforcement learning; gradient descent; model-agnostic

Zhao Chunyu, Lai Jun, Chen Xiliang, Zhang Renwen


College of Command and Control Engineering, Army Engineering University of PLA, Nanjing, Jiangsu 210007, China


National Natural Science Foundation of China (61806221)

2024

Computer Technology and Development
Shaanxi Computer Society


CSTPCD
Impact factor: 0.621
ISSN: 1673-629X
Year, volume (issue): 2024, 34(5)