Conservative Q-Learning Offline Reinforcement Learning Algorithm Based on Uncertainty Weights

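Abstract: In offline reinforcement learning (Offline RL), the agent does not interact with the environment but learns from a fixed dataset; this is a hot topic in reinforcement learning research. Most current offline RL algorithms apply conservative regularization to policy training so that the trained policy tends to choose actions present in the dataset, thereby addressing the erroneous value estimates for out-of-distribution (OOD) state-action pairs. The conservative Q-learning (CQL) algorithm avoids this problem by using value-function regularization to assign lower values to OOD state-action pairs. However, because its regularization is overly conservative, in-distribution state-action pairs in the dataset are also assigned lower values, which makes it hard to steer the trained policy toward dataset actions and therefore hard to learn the optimal policy. To address this, an uncertainty-weighted conservative Q-learning algorithm (UWCQL) is proposed. The method introduces uncertainty estimation and adds uncertainty weights on top of CQL, giving higher conservative weights to actions with high uncertainty so that the policy can more reasonably select in-distribution state-action pairs. UWCQL is evaluated on the D4RL MuJoCo datasets, and the experimental results show that it achieves better performance, which verifies the effectiveness of the algorithm.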

English title: Offline Reinforcement Learning Algorithm for Conservative Q-learning Based on Uncertainty Weight

English abstract: Offline reinforcement learning, in which the agent learns from a fixed dataset without interacting with the environment, is a current hot spot in the field of reinforcement learning. Many offline reinforcement learning algorithms regularize the value function to force the agent to choose actions in the given dataset, so as to avoid misestimating the values of out-of-distribution (OOD) state-action pairs. The conservative Q-learning (CQL) algorithm addresses this problem by assigning lower values to OOD state-action pairs through value-function regularization. However, the algorithm is too conservative to distinguish OOD state-action pairs precisely, so in-distribution pairs are also assigned lower values and it is difficult to learn the optimal policy. To address this problem, the uncertainty-weighted conservative Q-learning algorithm (UWCQL) is proposed, which introduces an uncertainty mechanism during training. UWCQL adds an uncertainty weight to the CQL regularization term and assigns a higher conservative weight to actions with high uncertainty, so that the algorithm can more effectively train the agent to choose proper state-action pairs in the dataset. The effectiveness of UWCQL is verified by applying it to the D4RL MuJoCo datasets alongside leading offline reinforcement learning algorithms, and the experimental results show that UWCQL achieves better performance.
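The abstract does not give the UWCQL formula, so the following is only a minimal sketch of the general idea of weighting the CQL regularizer by uncertainty. It assumes a discrete action space, an ensemble of Q-estimates whose standard deviation serves as the uncertainty measure, and a weighted log-sum-exp as the conservative term. All three are illustrative assumptions, and the function names `uncertainty_weights` and `weighted_cql_penalty` are hypothetical rather than taken from the paper.

```python
import numpy as np

def logsumexp(x, axis):
    """Numerically stable log(sum(exp(x))) along one axis."""
    m = np.max(x, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(x - m), axis=axis))

def uncertainty_weights(q_ensemble, eps=1e-8):
    """Hypothetical uncertainty weights from ensemble disagreement.

    q_ensemble has shape (n_models, batch, n_actions). The weight of each
    action is its ensemble standard deviation, normalized so the weights
    average to 1 within each state (assumption, not the paper's definition).
    """
    std = q_ensemble.std(axis=0)  # shape (batch, n_actions)
    return (std + eps) / (std.mean(axis=1, keepdims=True) + eps)

def weighted_cql_penalty(q_values, data_actions, weights):
    """Uncertainty-weighted variant of the CQL regularizer (illustrative).

    Standard CQL pushes down log-sum-exp of Q over all actions and pushes up
    Q at the dataset action. Here each action's term inside the log-sum-exp
    is scaled by its weight, so more uncertain actions are penalized more.
    With all weights equal to 1 this reduces to the standard penalty.
    """
    pushed_down = logsumexp(q_values + np.log(weights), axis=1)
    kept_up = q_values[np.arange(q_values.shape[0]), data_actions]
    return float(np.mean(pushed_down - kept_up))

# Usage with random placeholder values: 5 models, 4 states, 3 actions.
rng = np.random.default_rng(0)
q_ensemble = rng.normal(size=(5, 4, 3))
weights = uncertainty_weights(q_ensemble)
penalty = weighted_cql_penalty(q_ensemble.mean(axis=0), np.array([0, 2, 1, 0]), weights)
```

In a full training loop this penalty would be added, with some coefficient, to the usual Bellman error. For the continuous-control MuJoCo tasks mentioned in the abstract, the sum over actions would have to be approximated with sampled actions, which this sketch does not cover.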

English keywords:

Offline reinforcement learning; Deep reinforcement learning; Reinforcement learning; Conservative Q-learning; Uncertainty

Authors:

王天久、刘全、乌兰

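Author affiliations:

School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu 215006

Provincial Key Laboratory for Computer Information Processing Technology, Soochow University, Suzhou, Jiangsu 215006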

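Keywords:

Offline reinforcement learning; deep reinforcement learning; reinforcement learning; conservative Q-learning; uncertainty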

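Funding:

National Natural Science Foundation of China (four grants); Natural Science Foundation of Xinjiang Uygur Autonomous Region; Priority Academic Program Development of Jiangsu Higher Education Institutions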

Project numbers:

61772355; 61702055; 61876217; 62176175; 2022D01A238

Publication year:

2024

DOI:

10.11896/jsjkx.230700151

计算机科学 (Computer Science)

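Chongqing Southwest Information Co., Ltd. (formerly the Southwest Information Center of the Ministry of Science and Technology)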

Indexed in: CSTPCD; Peking University Core Journals (北大核心)

Impact factor: 0.944

ISSN: 1002-137X

Year, volume (issue): 2024, 51(9)