
A Novel Probability Distribution Update Strategy for Distributed Deep Q-Networks Based on the Sigmoid Function

The distributed deep Q-network (Dist-DQN) extends the conventional expected-value deep Q-network by continuizing discrete action rewards over an interval and repeatedly updating the probability distribution over the support interval, thereby handling stochastic rewards in complex environments. As a key function in any Dist-DQN implementation, the reward-probability distribution update strategy significantly affects the agent's learning efficiency. To address this issue, a new probability distribution update strategy, Sig-Dist-DQN, is proposed. The strategy accounts for how strongly each element of the reward-probability support correlates with the observed reward, raising the update rate of the probability mass on strongly correlated supports while lowering the update rate on weakly correlated ones. Experiments in environments provided by OpenAI Gym show that the exponential and harmonic-series update strategies vary considerably from one training run to the next, whereas the training curves of the Sig-Dist-DQN strategy are very stable. Compared with the exponential and harmonic-series update strategies, an agent using Sig-Dist-DQN achieves markedly faster convergence of the loss function and greater stability during convergence.
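The abstract does not give the exact form of the Sig-Dist-DQN update rule, so the following Python/NumPy sketch is only illustrative. It assumes the "correlation" between a support atom and the observed reward is measured by their normalized distance; the function name sig_dist_update and the parameters lr (base step size) and k (sigmoid sharpness) are hypothetical, not taken from the paper. The sigmoid gate approaches 1 for atoms near the observed reward and 0 for distant ones, which is one simple way to realize the strongly-correlated-fast, weakly-correlated-slow behavior described above.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sig_dist_update(probs, support, reward, lr=0.1, k=5.0):
    # probs: current probability mass over the support atoms (sums to 1)
    # support: evenly spaced atoms discretizing the reward interval
    # reward: observed reward; lr, k are hypothetical parameters
    dz = support[1] - support[0]
    # Normalized distance of each atom to the observed reward; a small
    # distance stands in for "strong correlation" with the observation.
    dist = np.abs(support - reward) / dz
    # Sigmoid gate: close to 1 for atoms near the reward, close to 0 for
    # far atoms, so strongly correlated atoms receive faster mass updates.
    gate = sigmoid(k * (1.0 - dist))
    # Point-mass target on the atom nearest the observed reward.
    target = np.zeros_like(probs)
    target[np.argmin(np.abs(support - reward))] = 1.0
    # Per-atom step size scaled by the gate; renormalize to a distribution.
    new_probs = probs + lr * gate * (target - probs)
    return new_probs / new_probs.sum()

# Usage: 51 atoms on the reward interval [-10, 10], uniform prior.
support = np.linspace(-10.0, 10.0, 51)
probs = np.full(51, 1.0 / 51)
probs = sig_dist_update(probs, support, reward=3.0)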

Distributed deep Q-network; Continuation of reward intervals; Probability distribution update; Learning efficiency; Training stability

高卓凡 (GAO Zhuofan), 郭文利 (GUO Wenli)


Luoyang Institute of Electro-Optical Equipment, Aviation Industry Corporation of China (AVIC), Luoyang, Henan 471000, China


Journal: Computer Science (计算机科学)
Publisher: Chongqing Southwest Information Co., Ltd. (formerly the Southwest Information Center of the Ministry of Science and Technology)
Indexing: CSTPCD; Peking University Core Journals (北大核心)
Impact factor: 0.944
ISSN: 1002-137X
Year, Volume (Issue): 2024, 51(12)