
Offline Reinforcement Learning Algorithm Based on Selection of High-Quality Samples

To address the over-reliance of offline reinforcement learning algorithms on the quality of dataset samples, an offline reinforcement learning algorithm based on selection of high-quality samples (SHS) is proposed. In the policy evaluation stage, samples with advantage values are assigned higher update weights, and a policy entropy term is added, so that high-quality action samples with high probability under the data distribution are identified quickly and the more valuable action samples are selected. In the policy optimization stage, SHS maximizes the normalized advantage function while maintaining a policy constraint on the actions in the dataset. As a result, high-quality samples can be used efficiently even when the overall sample quality of the dataset is low, which improves the learning efficiency and performance of the policy. Experiments show that SHS performs well on the D4RL offline datasets in the MuJoCo-Gym environments and successfully selects the more valuable samples, which verifies its effectiveness.
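The two stages described in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation: the abstract gives no formulas, so the weighting rule (a fixed `boost` for positive-advantage samples), the scale normalization of the advantage (borrowed from TD3+BC-style objectives), the squared-error form of the policy constraint, and every function and parameter name below are assumptions made for illustration. The policy entropy term mentioned in the abstract is omitted.

```python
import numpy as np


def evaluation_weights(advantages, boost=2.0):
    """Hypothetical rule: positive-advantage samples get weight `boost`, others 1."""
    advantages = np.asarray(advantages, dtype=float)
    return np.where(advantages > 0.0, boost, 1.0)


def weighted_value_loss(td_errors, advantages, boost=2.0):
    """Policy evaluation (assumed form): weighted mean of squared TD errors."""
    td_errors = np.asarray(td_errors, dtype=float)
    weights = evaluation_weights(advantages, boost)
    return float(np.sum(weights * td_errors ** 2) / np.sum(weights))


def policy_loss(policy_advantages, policy_actions, dataset_actions,
                bc_coef=1.0, eps=1e-8):
    """Policy optimization (assumed form): loss to minimize.

    The advantage is divided by its mean absolute value (a scale
    normalization), and a squared-error penalty keeps the policy's
    actions close to the dataset actions.
    """
    policy_advantages = np.asarray(policy_advantages, dtype=float)
    policy_actions = np.asarray(policy_actions, dtype=float)
    dataset_actions = np.asarray(dataset_actions, dtype=float)

    scale = np.mean(np.abs(policy_advantages)) + eps
    advantage_term = np.mean(policy_advantages) / scale
    constraint_term = np.mean((policy_actions - dataset_actions) ** 2)
    return float(-advantage_term + bc_coef * constraint_term)
```

In this sketch, raising `boost` makes the value update focus more strongly on samples judged advantageous, while `bc_coef` trades off advantage maximization against staying close to the dataset actions.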

Reinforcement Learning; Offline Reinforcement Learning; Distribution Shift; Policy Constraint; Value Function; Sample Selection

Hou Yonghong (侯永宏), Ding Wang (丁旺), Ren Yi (任懿), Dong Hongwei (董洪伟), Yang Songling (杨松领)


School of Electrical and Information Engineering, Tianjin University, Tianjin 300072

National Key Laboratory of Space Integrated Information System, Institute of Software, Chinese Academy of Sciences, Beijing 100190


2024

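Pattern Recognition and Artificial Intelligence
Chinese Association of Automation; National Research Center for Intelligent Computing Systems; Hefei Institute of Intelligent Machines, Chinese Academy of Sciences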


Indexed in: CSTPCD; Peking University Core Journals (北大核心)
Impact factor: 0.954
ISSN:1003-6059
Year, volume (issue): 2024, 37(11)