Policy gradient algorithm and its convergence analysis for two-player zero-sum Markov games
To address the low learning efficiency of policy-based reinforcement learning methods in two-player zero-sum Markov games, an approximate Nash equilibrium policy optimization algorithm that simultaneously updates the policies of both players was proposed. The two-player zero-sum Markov game was formulated as a maximum-minimum optimization problem. The policy gradient theorem for Markov games was given for parameterized policies, and the derivation of the approximate stochastic policy gradient provided a feasible basis for implementing the algorithm. Different gradient update methods for the maximum-minimum problem were compared and analyzed, and the extragradient method was found to have better convergence performance than the other methods. Based on this finding, an approximate Nash equilibrium policy optimization algorithm using the extragradient was proposed, and a convergence proof of the algorithm was given. Tabular softmax parameterization and a neural network were used as the parameterized policies on the Oshi-Zumo game to verify the effectiveness of the algorithm at different game scales. Comparative experiments verified the convergence of the algorithm and its superiority over other methods.
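As a concrete illustration of the extragradient update discussed above, the sketch below applies it to a generic max-min problem max_theta min_phi V(theta, phi). This is a minimal sketch under stated assumptions, not the paper's implementation: the names extragradient_step, grad_theta, and grad_phi, and the bilinear toy game used for the demonstration, are illustrative assumptions standing in for the approximate stochastic policy gradients and the Markov game itself.

```python
# Minimal extragradient sketch for a max-min problem (illustrative, not the paper's code).
# theta: parameters of the maximizing player's policy; phi: the minimizing player's.
# grad_theta / grad_phi stand in for (approximate stochastic) gradients of V(theta, phi).
import numpy as np

def extragradient_step(theta, phi, grad_theta, grad_phi, lr=0.1):
    """One extragradient step for max_theta min_phi V(theta, phi)."""
    # Extrapolation (half) step from the current point.
    theta_half = theta + lr * grad_theta(theta, phi)
    phi_half = phi - lr * grad_phi(theta, phi)
    # Update step using gradients evaluated at the extrapolated point.
    theta_new = theta + lr * grad_theta(theta_half, phi_half)
    phi_new = phi - lr * grad_phi(theta_half, phi_half)
    return theta_new, phi_new

# Toy bilinear game V(theta, phi) = theta * phi: simultaneous gradient
# ascent-descent cycles around the saddle point, while extragradient converges.
theta, phi = np.array([1.0]), np.array([1.0])
gt = lambda t, p: p   # dV/dtheta
gp = lambda t, p: t   # dV/dphi
for _ in range(200):
    theta, phi = extragradient_step(theta, phi, gt, gp)
print(theta, phi)  # both approach 0, the saddle point (Nash equilibrium of the toy game)
```

The extrapolation step is what distinguishes the extragradient from simultaneous gradient ascent-descent: gradients are re-evaluated at the look-ahead point before the actual update, which is the property behind its better convergence behavior on max-min problems.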