Offline Reinforcement Learning Algorithm Based on Selection of High-Quality Samples

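Abstract (translated from Chinese): To address the over-reliance of offline reinforcement learning algorithms on the quality of dataset samples, an offline reinforcement learning algorithm based on the selection of high-quality samples is proposed. First, in the policy evaluation stage, samples with higher advantage values are given larger update weights and a policy entropy term is added, so that high-quality action samples with high probability under the data distribution are identified quickly and more valuable action samples are selected. Then, in the policy optimization stage, the normalized advantage function is maximized while a policy constraint toward the actions in the dataset is maintained, so that the algorithm can use high-quality samples efficiently even when the overall sample quality of the dataset is low, improving the learning efficiency and performance of the policy. Experiments show that the algorithm performs well on the D4RL offline datasets for the MuJoCo-Gym environments and successfully selects more valuable samples, which verifies its effectiveness.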
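The abstract does not give the loss used in the policy evaluation stage, so the following is only a minimal sketch of what "larger update weights for samples with higher advantage values" could look like. It assumes an asymmetric squared-error weighting in the style of expectile regression, which is a swapped-in technique and not necessarily the authors' formulation. The function names and the `up_weight` parameter are hypothetical, and the policy entropy term mentioned in the abstract is left out because its form is not stated.

```python
import numpy as np


def advantage_weights(q_values, v_values, up_weight=0.9):
    """Return advantages and hypothetical per-sample update weights.

    Assumption (not from the paper): samples with positive advantage
    receive weight `up_weight`, all other samples receive 1 - up_weight.
    """
    advantages = np.asarray(q_values, dtype=float) - np.asarray(v_values, dtype=float)
    weights = np.where(advantages > 0.0, up_weight, 1.0 - up_weight)
    return advantages, weights


def weighted_value_loss(q_values, v_values, up_weight=0.9):
    """Weighted squared error between Q(s, a) and V(s).

    With up_weight > 0.5, high-advantage samples dominate the update of V.
    """
    advantages, weights = advantage_weights(q_values, v_values, up_weight)
    return float(np.mean(weights * advantages ** 2))


# Two samples: one better than the state value, one worse.
loss = weighted_value_loss(q_values=[1.0, 0.0], v_values=[0.5, 0.5])
```

With `up_weight` above 0.5, the sample whose action value exceeds the state value contributes more to the loss than the one below it, which is the selective behaviour the abstract describes at a high level.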

English title: Offline Reinforcement Learning Algorithm Based on Selection of High-Quality Samples

English abstract: To address the over-reliance of offline reinforcement learning algorithms on the quality of dataset samples, an offline reinforcement learning algorithm based on selection of high-quality samples (SHS) is proposed. In the policy evaluation stage, higher update weights are assigned to samples with higher advantage values, and a policy entropy term is added to quickly identify high-quality action samples with high probability within the data distribution, thereby selecting more valuable action samples. In the policy optimization stage, SHS maximizes the normalized advantage function while maintaining a policy constraint on the actions in the dataset. Consequently, high-quality samples can be used efficiently even when the sample quality of the dataset is low, improving the learning efficiency and performance of the policy. Experiments show that SHS performs well on the D4RL offline datasets in the MuJoCo-Gym environments and successfully selects more valuable samples, which verifies its effectiveness.
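The policy optimization stage described above combines two terms: maximizing a normalized advantage and staying close to dataset actions. A rough sketch of such an objective follows. The abstract does not specify the normalization or the constraint, so this sketch assumes scaling by the batch mean absolute advantage and a squared-error behavior-cloning penalty, both borrowed from common practice rather than from the paper. The function name and the `bc_coef` parameter are hypothetical.

```python
import numpy as np


def constrained_policy_loss(adv_of_policy_actions, policy_actions,
                            dataset_actions, bc_coef=1.0, eps=1e-8):
    """Loss = -(normalized advantage) + bc_coef * (policy constraint).

    Assumptions (not from the paper):
      * advantages are normalized by their batch mean absolute value;
      * the constraint is the mean squared distance between the policy's
        actions and the dataset actions for the same states.
    """
    adv = np.asarray(adv_of_policy_actions, dtype=float)
    pi_a = np.asarray(policy_actions, dtype=float)
    data_a = np.asarray(dataset_actions, dtype=float)

    # Scale-only normalization keeps the sign of each advantage.
    scale = np.mean(np.abs(adv)) + eps
    normalized_adv = adv / scale

    # Per-sample squared distance to the dataset action, averaged over the batch.
    bc_penalty = np.mean(np.sum((pi_a - data_a) ** 2, axis=-1))

    return float(-np.mean(normalized_adv) + bc_coef * bc_penalty)


loss = constrained_policy_loss(
    adv_of_policy_actions=[2.0, -1.0],
    policy_actions=[[0.0, 0.0], [1.0, 1.0]],
    dataset_actions=[[0.0, 0.0], [1.0, 0.0]],
)
```

Normalizing the advantage keeps the two terms on comparable scales, so a single coefficient such as `bc_coef` can trade off improvement against staying within the data distribution regardless of the reward magnitude of a given dataset.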

English keywords:

Reinforcement Learning; Offline Reinforcement Learning; Distribution Shift; Policy Constraint; Value Function; Sample Selection

Authors:

Hou Yonghong (侯永宏), Ding Wang (丁旺), Ren Yi (任懿), Dong Hongwei (董洪伟), Yang Songling (杨松领)

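Author affiliations:

School of Electrical and Information Engineering, Tianjin University, Tianjin 300072

National Key Laboratory of Space Integrated Information System, Institute of Software, Chinese Academy of Sciences, Beijing 100190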

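Keywords (translated from Chinese):

reinforcement learning; offline reinforcement learning; distribution shift; policy constraint; value function; sample selection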

Publication year:

2024

DOI:

10.16451/j.cnki.issn1003-6059.202411007

Pattern Recognition and Artificial Intelligence

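Chinese Association of Automation; National Research Center for Intelligent Computing Systems; Hefei Institute of Intelligent Machines, Chinese Academy of Sciences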

Indexed in: CSTPCD; Peking University Core Journals

Impact factor: 0.954

ISSN: 1003-6059

Year, volume (issue): 2024, 37(11)