Abstract
Temporal abstraction, a central topic in hierarchical reinforcement learning, allows an agent to learn policies at different time scales and can effectively address the sparse-reward problem that deep reinforcement learning struggles with. Learning good temporally abstract policies end-to-end remains a challenge for hierarchical reinforcement learning. The Option-Critic (OC) framework, which builds on the Option framework, addresses this problem through policy gradient theory. During policy learning, however, the OC framework suffers from a degradation problem in which the action distributions of the intra-option policies become very similar. This degradation hurts the experimental performance of the OC framework and makes the options less interpretable. To solve these problems, mutual information is introduced as an internal reward, and an Option-Critic algorithm with Mutual Information Optimization (MIOOC) is proposed. MIOOC is combined with the Proximal Policy Option-Critic (PPOC) algorithm to ensure the diversity of the lower-level policies. To verify its effectiveness, MIOOC is compared with several common reinforcement learning methods in continuous experimental environments. The experimental results show that MIOOC speeds up model learning, achieves better experimental performance, and yields more distinguishable intra-option policies.
Key words
Deep reinforcement learning/Temporal abstraction/Hierarchical reinforcement learning/Mutual information/Internal rewards/Diversity in option policies
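The degradation described in the abstract, where the action distributions of different options become nearly identical, can be quantified by the mutual information between the active option and the chosen action. The following is only a minimal sketch under stated assumptions, not the MIOOC formulation itself, which the abstract does not spell out: it assumes a single state with discrete options and discrete actions (the paper evaluates on continuous environments), and the function names and the weighting coefficient `beta` are hypothetical.

```python
import math


def option_action_mutual_information(option_probs, action_probs_per_option):
    """Return I(O; A) in nats for one state.

    option_probs[o]               -- probability of option o being active
    action_probs_per_option[o][a] -- probability of action a under option o
    (Discrete options and actions are an assumption of this sketch.)
    """
    num_actions = len(action_probs_per_option[0])
    # Marginal action distribution: p(a) = sum_o p(o) * pi(a | o)
    marginal = [
        sum(p_o * pi_o[a] for p_o, pi_o in zip(option_probs, action_probs_per_option))
        for a in range(num_actions)
    ]
    mi = 0.0
    for p_o, pi_o in zip(option_probs, action_probs_per_option):
        for a, p_a_given_o in enumerate(pi_o):
            # Zero-probability terms contribute nothing to the sum.
            if p_o > 0.0 and p_a_given_o > 0.0:
                mi += p_o * p_a_given_o * math.log(p_a_given_o / marginal[a])
    return mi


def shaped_reward(env_reward, mi_bonus, beta=0.1):
    """Hypothetical shaping: environment reward plus a weighted MI bonus."""
    return env_reward + beta * mi_bonus


uniform_options = [0.5, 0.5]
degenerate = [[0.5, 0.5], [0.5, 0.5]]  # both options act identically
distinct = [[1.0, 0.0], [0.0, 1.0]]    # each option prefers its own action

mi_degenerate = option_action_mutual_information(uniform_options, degenerate)
mi_distinct = option_action_mutual_information(uniform_options, distinct)
```

In this toy case the mutual information is zero when both options share one action distribution and reaches log 2 when the options are fully distinguishable, which is why adding such a term to the reward would push the lower-level policies apart.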