Actor-Critic is a reinforcement learning method that learns a policy from samples collected through online trial-and-error interaction with the environment, and it is an effective approach to sequential perception and decision-making problems. However, this active learning paradigm of online interaction raises cost and safety concerns when samples are collected in some complex real-world environments. Offline reinforcement learning, a data-driven reinforcement learning paradigm, emphasizes learning a policy from a static dataset without any exploratory interaction with the environment. It provides a feasible solution for real-world deployment applications such as robotics, autonomous driving, and healthcare, and has become a research hotspot in recent years. Current offline reinforcement learning methods face the challenge of distribution shift between the learned policy and the behavior policy. To address this challenge, policy constraints or value-function regularization are usually adopted to restrict access to actions outside the dataset distribution (Out-Of-Distribution, OOD), which makes the learning performance overly conservative and hinders both the generalization of the value-function network and the performance improvement of the learned policy. To this end, this paper leverages uncertainty estimation and OOD sampling to balance the generalization and conservatism of value-function learning, and proposes an Offline Deterministic Actor-Critic method based on Uncertainty Estimation (ODACUE). First, for deterministic policies, a definition of an uncertainty estimation operator over the Q-value function is given, and it is theoretically proved that the Q-value function learned by this operator is a pessimistic estimate of the optimal Q-value function. Then, the uncertainty estimation operator is applied to the deterministic Actor-Critic framework, and the objective function for Critic learning is constructed through a convex combination of the uncertainty estimation operators. Finally, experimental results on D4RL benchmark tasks show that, compared with the baseline algorithms, the overall performance improvement of ODACUE on 11 datasets of different quality levels is at least 9.56% and at most 64.92%. In addition, parameter analysis and ablation experiments further verify the stability and generalization ability of ODACUE.
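For intuition, the following is a minimal sketch of the kind of ensemble-based pessimistic (lower-confidence-bound) Q-value estimate that an uncertainty estimation operator of this sort can rely on; the helper name `pessimistic_q`, the ensemble interface, and the coefficient `xi` are illustrative assumptions rather than the paper's exact definitions.

```python
import torch

def pessimistic_q(critics, state, action, xi=1.0):
    """Ensemble lower-confidence-bound estimate: mean(Q_k) - xi * std(Q_k).

    `critics` is an iterable of K >= 2 Q-networks mapping (state, action) to a
    (batch, 1) tensor; a larger `xi` yields a more pessimistic estimate.
    """
    qs = torch.stack([q(state, action) for q in critics], dim=0)  # (K, batch, 1)
    return qs.mean(dim=0) - xi * qs.std(dim=0)
```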
Offline Deterministic Actor-Critic Based on Uncertainty Estimation
Actor-critic is a reinforcement learning method that learns a policy by collecting samples through online trial-and-error interaction with the environment, and it is an effective tool for solving sequential perception and decision-making problems. However, the active learning paradigm of online interaction raises cost and safety issues when collecting samples in some complex real-world environments. Offline reinforcement learning, as a data-driven reinforcement learning paradigm, emphasizes learning a policy from a static sample dataset without exploratory interaction with the environment; it has become a research hotspot in recent years and provides a feasible solution for real-world deployment applications such as robotics, autonomous driving, and healthcare. At present, offline reinforcement learning methods face the challenge of distribution shift between the learned and behavior policies, which generates extrapolation errors in the value function estimation for the out-of-distribution (OOD) actions of the static sample dataset. These extrapolation errors accumulate through the Bellman bootstrapping operation, leading to performance degradation or even non-convergence of offline reinforcement learning. To deal with the distribution shift problem, policy constraints or value function regularization are usually used to restrict the agent's access to OOD actions, which may result in overly conservative learning performance and hinder the generalization of the value function network and the performance improvement of the policy. To this end, an offline deterministic actor-critic method based on uncertainty estimation (ODACUE) is proposed to balance the generalization and conservatism of value function learning by utilizing uncertainty estimation and OOD sampling. Firstly, for the deterministic policy, the definition of the uncertainty estimation operator is given according to the different estimation methods of the Q value function for in-dataset and OOD actions. The in-dataset action value function is estimated according to the Bellman bootstrapping operation and ensemble uncertainty estimation, whereas the OOD action value function is estimated based on a pseudo-target constructed by the ensemble uncertainty estimation and the OOD sampling method. The pessimism of the uncertainty estimation operator is theoretically analyzed by ξ-uncertainty estimation theory: with appropriately chosen parameters, the Q value function learned according to the uncertainty estimation operator is a pessimistic estimation of the optimal Q value function. Then, by applying the uncertainty estimation operator to the deterministic actor-critic framework, the objective function of critic learning is constructed via a convex combination of the in-dataset and OOD action value functions, so that the conservative constraints and the generalization of value function learning are balanced through the convex combination coefficient. Moreover, the uncertainty estimation operator of the value function is implemented by the critic target network during the in-dataset action value function learning process; during the OOD action value function learning process, the OOD sampling is implemented by the actor main network and the uncertainty estimation operator of the value function is implemented by the critic main network. Finally, ODACUE and several state-of-the-art baseline algorithms are evaluated on the D4RL benchmark. Experimental results show that, in contrast to the comparative algorithms, the overall performance improvement of ODACUE on the 11 datasets with different quality levels is at least 9.56% and at most 64.92%. In addition, parameter analysis and ablation experiments further validate the stability and generalization ability of ODACUE.
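As a rough illustration of the critic objective described above, the sketch below combines an in-dataset Bellman target (computed with the critic target networks) and an OOD pseudo-target (computed from actions perturbed around the actor main network's output and evaluated with the critic main networks) through a convex combination coefficient. It reuses the `pessimistic_q` helper from the earlier sketch; the names `lam`, `xi`, and `ood_noise_std`, the Gaussian perturbation scheme, and the [-1, 1] action range are assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def critic_loss(batch, actor, critics, target_critics,
                lam=0.75, xi=1.0, gamma=0.99, ood_noise_std=0.1):
    """Critic objective as a convex combination of an in-dataset Bellman term
    and an OOD pseudo-target term; `lam` trades conservatism for generalization.
    `pessimistic_q` is the ensemble lower-confidence-bound helper sketched earlier."""
    s, a, r, s_next, done = batch  # transitions sampled from the static dataset

    with torch.no_grad():
        # In-dataset target: Bellman bootstrapping through the critic target networks.
        q_next = pessimistic_q(target_critics, s_next, actor(s_next), xi)
        y_in = r + gamma * (1.0 - done) * q_next

        # OOD sampling via the actor main network: perturb its action, then build a
        # pessimistic pseudo-target with the critic main networks.
        a_ood = (actor(s) + ood_noise_std * torch.randn_like(a)).clamp(-1.0, 1.0)
        y_ood = pessimistic_q(critics, s, a_ood, xi)

    loss = 0.0
    for q in critics:
        in_term = F.mse_loss(q(s, a), y_in)        # fit dataset actions to the Bellman target
        ood_term = F.mse_loss(q(s, a_ood), y_ood)  # pull OOD actions toward the pessimistic pseudo-target
        loss = loss + lam * in_term + (1.0 - lam) * ood_term
    return loss
```

In this sketch, a `lam` close to 1 recovers an almost standard Bellman regression on dataset actions, while smaller values weight the pessimistic OOD term more heavily, which is one concrete way the convex combination coefficient can balance conservatism against generalization.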