
Weighted Double Q-Learning Algorithm Based on Softmax

Reinforcement learning, a branch of machine learning, describes and solves the problem of an agent learning a policy that maximizes return while interacting with an environment. Q-Learning, a classical model-free reinforcement learning method, suffers from maximization bias caused by overestimation and performs poorly when the rewards in the environment are noisy. Double Q-Learning (DQL) removes the overestimation but introduces underestimation instead. To address both the overestimation and the underestimation of these algorithms, a weighted Q-Learning algorithm based on softmax is proposed and then combined with DQL to give a new softmax-based weighted Double Q-Learning algorithm (WDQL-Softmax). The algorithm builds on a weighted double estimator: it applies a softmax operation to the sample expected values to obtain weights and uses these weights to estimate the action value, which effectively balances overestimation against underestimation and brings the estimate closer to the theoretical value. Experimental results show that, in discrete action spaces, WDQL-Softmax converges faster than Q-Learning, DQL, and weighted Double Q-Learning (WDQL), and that the error between its estimates and the theoretical values is smaller.
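The abstract does not state the update rule, so the following is only a minimal sketch of how a softmax-weighted double estimator could be wired into a tabular update. Everything specific here is an assumption for illustration rather than the authors' formulation: the function names, the temperature `tau`, the choice to compute weights from the table being updated and to take values from the other table, and the default step size and discount.

```python
import math
import random


def softmax(values, tau=1.0):
    """Numerically stable softmax; tau is an assumed temperature parameter."""
    m = max(values)
    exps = [math.exp((v - m) / tau) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_weighted_value(selector_q, evaluator_q, tau=1.0):
    """Weight each action by softmax over one estimator's values,
    then average the other estimator's values with those weights."""
    weights = softmax(selector_q, tau)
    return sum(w * q for w, q in zip(weights, evaluator_q))


def wdql_softmax_update(q_a, q_b, s, a, r, s_next, done,
                        alpha=0.1, gamma=0.95, tau=1.0, rng=random):
    """One tabular step (hypothetical): as in Double Q-Learning, pick one
    table at random to update, and bootstrap from the other table using
    softmax weights instead of a hard argmax."""
    if rng.random() < 0.5:
        updated, other = q_a, q_b
    else:
        updated, other = q_b, q_a
    if done:
        bootstrap = 0.0
    else:
        bootstrap = softmax_weighted_value(updated[s_next], other[s_next], tau)
    target = r + gamma * bootstrap
    updated[s][a] += alpha * (target - updated[s][a])
    return target
```

In this sketch the bootstrap value always lies between the smallest and largest entry of the evaluating table for the next state. As `tau` approaches zero the weights concentrate on the selector's greedy action, which recovers the standard Double Q-Learning target, while a large `tau` approaches a plain average over actions.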

Keywords: Reinforcement learning; Q-Learning; Double Q-Learning; Softmax

Zhong Yu'ang (钟雨昂), Yuan Weiwei (袁伟伟), Guan Donghai (关东海)


College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing 211106


2024

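Computer Science (计算机科学)
Chongqing Southwest Information Co., Ltd. (formerly the Southwest Information Center of the Ministry of Science and Technology)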

Indexed in: CSTPCD, Peking University Core Journals
Impact factor: 0.944
ISSN: 1002-137X
Year, volume (issue): 2024, 51(z1)
  • 15