Chinese Journal of Aeronautics, 2024, Vol. 37, Issue 7: 406-417. DOI: 10.1016/j.cja.2024.03.008

Controlling underestimation bias in reinforcement learning via minmax operation

Fanghui HUANG (1), Yixin HE (2), Yu ZHANG (3), Xinyang DENG (3), Wen JIANG (3)

Author information

  • 1. School of Electronics and Information, Northwestern Polytechnical University, Xi'an 710129, China; College of Information Science and Engineering, Jiaxing University, Jiaxing 314001, China
  • 2. College of Information Science and Engineering, Jiaxing University, Jiaxing 314001, China
  • 3. School of Electronics and Information, Northwestern Polytechnical University, Xi'an 710129, China

Abstract

Obtaining accurate value estimates and reducing estimation bias are key issues in reinforcement learning. However, current methods that address the overestimation problem tend to introduce underestimation, which poses a challenge for precise decision-making in many fields. To address this issue, we conduct a theoretical analysis of the underestimation bias and propose the minmax operation, which allows flexible control of the estimation bias. Specifically, we select the maximum value of each action from multiple parallel state-action networks to create a new state-action value sequence. Then, the minimum value is selected to obtain more accurate value estimations. Moreover, based on the minmax operation, we propose two novel algorithms by combining it with Deep Q-Network (DQN) and Double DQN (DDQN), named minmax-DQN and minmax-DDQN. Meanwhile, we conduct theoretical analyses of the estimation bias and variance caused by the proposed minmax operation, which show that this operation significantly mitigates both underestimation and overestimation biases and leads to unbiased estimation. Furthermore, the variance is also reduced, which helps improve network training stability. Finally, we conduct numerous comparative experiments in various environments, which empirically demonstrate the superiority of our method.
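
The two selection steps described in the abstract can be illustrated with a minimal sketch. It follows a literal reading of the abstract only: the function name `minmax_value`, the list-of-lists layout, and the toy numbers are assumptions made for illustration, and the abstract does not say how the resulting value enters the training targets of minmax-DQN or minmax-DDQN, so that part is not shown.

```python
def minmax_value(q_values):
    """Literal reading of the abstract's minmax operation.

    q_values[k][a] is the estimate for action a produced by the k-th
    parallel state-action network (hypothetical layout).
    """
    num_actions = len(q_values[0])
    # Step 1: for each action, keep the largest estimate across networks,
    # which yields a new state-action value sequence.
    per_action_max = [max(q[a] for q in q_values) for a in range(num_actions)]
    # Step 2: select the minimum value of that new sequence.
    return per_action_max, min(per_action_max)


# Toy estimates from two parallel networks over three actions (made up).
q_next = [
    [1.0, 2.5, 0.3],
    [1.4, 2.0, 0.9],
]
per_action_max, value = minmax_value(q_next)
```

With these toy numbers the intermediate sequence is `[1.4, 2.5, 0.9]` and the selected value is `0.9`.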

Key words

Reinforcement learning; Minmax operation; Estimation bias; Underestimation bias; Variance


Funding

National Natural Science Foundation of China (62173272)

Publication year

2024
Chinese Journal of Aeronautics
Chinese Society of Aeronautics and Astronautics
Indexed in: CSTPCD, EI
Impact factor: 0.847
ISSN: 1000-9361