Advantage Weighted Double Actors-Critics Algorithm Based on Key-Minor Architecture for Policy Distillation

Offline reinforcement learning (offline RL) defines the task of learning from a fixed batch of data, which avoids the risks of interacting with the environment and improves the efficiency and stability of learning. Within this setting, the advantage-weighted actor-critic algorithm combines sample-efficient dynamic programming with maximum-likelihood policy updates, exploiting large amounts of offline data while quickly performing fine-grained online policy adjustment. However, that algorithm relies on uniformly random experience replay, and its actor-critic model employs only a single actor, so data sampling and replay are unbalanced. To address these problems, this paper proposes an Advantage Weighted Double Actors-Critics algorithm based on Policy Distillation with Data Experience Optimization and Replay (DOR-PDAWAC). The algorithm prefers new experiences while repeatedly replaying both new and old ones, uses double actors to increase exploration, and applies a policy-distillation-based key-minor (master-servant) architecture that divides the actors into a key actor and a minor actor to improve cooperation efficiency. Ablation and comparison experiments on the MuJoCo tasks of the general-purpose D4RL benchmark show that the proposed algorithm achieves better performance in learning efficiency and other respects.
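
To make the advantage-weighted update concrete, here is a minimal PyTorch sketch of an AWAC-style actor loss, in which actions from the offline batch are re-weighted by their exponentiated advantage. The function name `awac_actor_loss` and the temperature `lam` are illustrative choices, not taken from the paper.

```python
import torch

def awac_actor_loss(log_probs, advantages, lam=1.0):
    # Advantage-weighted maximum-likelihood loss: actions with higher
    # critic advantage receive exponentially larger weight.
    # The weights are detached constants; gradients flow only
    # through log_probs, i.e. through the actor.
    weights = torch.exp(advantages.detach() / lam)
    return -(weights * log_probs).mean()

# Usage with dummy tensors standing in for an offline batch:
log_probs = torch.randn(256, requires_grad=True)   # log pi(a|s)
advantages = torch.randn(256)                      # Q(s,a) - V(s)
loss = awac_actor_loss(log_probs, advantages)
loss.backward()
```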
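
The experience-optimization mechanism (prefer new experiences, yet keep replaying both new and old ones) can be illustrated with a recency-weighted sampler. The sketch below assumes a geometric decay of sampling probability with transition age; `RecencyPreferringBuffer` and `decay` are hypothetical names, and the paper's DOR scheme may differ in detail.

```python
import numpy as np

class RecencyPreferringBuffer:
    # Circular buffer whose sampling distribution geometrically favors
    # newer transitions. Sampling is done with replacement, so both new
    # and old experiences can be drawn repeatedly.
    def __init__(self, capacity, decay=0.99):
        self.capacity = capacity
        self.decay = decay          # smaller decay -> stronger recency bias
        self.storage = []
        self.next_idx = 0
        self.rng = np.random.default_rng()

    def add(self, transition):
        if len(self.storage) < self.capacity:
            self.storage.append(transition)
        else:
            self.storage[self.next_idx] = transition
        self.next_idx = (self.next_idx + 1) % self.capacity

    def sample(self, batch_size):
        n = len(self.storage)
        if n == self.capacity:
            # Age of each slot: 0 for the most recent write.
            ages = (self.next_idx - 1 - np.arange(n)) % n
        else:
            ages = np.arange(n)[::-1]
        probs = self.decay ** ages
        probs = probs / probs.sum()
        idx = self.rng.choice(n, size=batch_size, p=probs, replace=True)
        return [self.storage[i] for i in idx]

# Usage: newer transitions are sampled more often than older ones.
buf = RecencyPreferringBuffer(capacity=1000)
for t in range(1000):
    buf.add({"step": t})
batch = buf.sample(32)
```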
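
Similarly, the key-minor policy distillation can be sketched as a KL term that pulls the minor actor toward the (detached) key actor. The discrete-action softmax form and the name `distillation_loss` are assumptions for illustration; for the continuous-action MuJoCo tasks, a KL between Gaussian policies would play the same role.

```python
import torch
import torch.nn.functional as F

def distillation_loss(key_logits, minor_logits, tau=1.0):
    # KL(key || minor) over action distributions at temperature tau.
    # The key (teacher) actor is detached, so gradients update only
    # the minor (student) actor.
    key_probs = F.softmax(key_logits.detach() / tau, dim=-1)
    minor_log_probs = F.log_softmax(minor_logits / tau, dim=-1)
    return F.kl_div(minor_log_probs, key_probs, reduction="batchmean")

# Usage with dummy logits for a batch of 64 states and 6 actions:
key_logits = torch.randn(64, 6)
minor_logits = torch.randn(64, 6, requires_grad=True)
loss = distillation_loss(key_logits, minor_logits)
loss.backward()
```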

Offline reinforcement learning; Deep reinforcement learning; Policy distillation; Double actors-critics framework; Experience replay mechanism

YANG Haolin, LIU Quan

School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu 215006, China

Jiangsu Provincial Key Laboratory for Computer Information Processing Technology, Soochow University, Suzhou, Jiangsu 215006, China

National Natural Science Foundation of China (62376179, 61772355, 61702055, 61876217, 62176175); Natural Science Foundation of Xinjiang Uygur Autonomous Region (2022D01A238); Priority Academic Program Development of Jiangsu Higher Education Institutions

2024

Computer Science (计算机科学)
Chongqing Southwest Information Co., Ltd. (formerly the Southwest Information Center of the Ministry of Science and Technology)

Indexed in: CSTPCD; Peking University Core Journals (北大核心)
Impact factor: 0.944
ISSN: 1002-137X
Year, Volume (Issue): 2024, 51(11)