Policy gradient algorithm and its convergence analysis for two-player zero-sum Markov games
To address the low learning efficiency of policy-based reinforcement learning methods in two-player zero-sum Markov games, an approximate Nash equilibrium policy optimization algorithm that simultaneously updates the policies of both players was proposed. The two-player zero-sum Markov game was formulated as a maximum-minimum optimization problem. The policy gradient theorem for Markov games was given for parameterized policies, and the derivation of the approximate stochastic policy gradient provided a feasible basis for implementing the algorithm. Different gradient update methods for the maximum-minimum problem were compared and analyzed, and the extragradient method was found to have better convergence performance than the other methods. Based on this finding, an approximate Nash equilibrium policy optimization algorithm using the extragradient was proposed, and a convergence proof of the algorithm was given. Tabular softmax parameterization and a neural network were used as the parameterized policies on the Oshi-Zumo game to verify the effectiveness of the algorithm at different game scales. Comparative experiments verified the convergence of the algorithm and its superiority over other methods.
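As a concrete illustration of the extragradient update discussed above, the sketch below applies it to a generic max-min problem max_theta min_phi V(theta, phi). This is a minimal sketch under stated assumptions, not the paper's implementation: the names extragradient_step, grad_theta, and grad_phi, and the bilinear toy game used for the demonstration, are illustrative assumptions standing in for the approximate stochastic policy gradients and the Markov game itself.

```python
# Minimal extragradient sketch for a max-min problem (illustrative, not the paper's code).
# theta: parameters of the maximizing player's policy; phi: the minimizing player's.
# grad_theta / grad_phi stand in for (approximate stochastic) gradients of V(theta, phi).
import numpy as np

def extragradient_step(theta, phi, grad_theta, grad_phi, lr=0.1):
    """One extragradient step for max_theta min_phi V(theta, phi)."""
    # Extrapolation (half) step from the current point.
    theta_half = theta + lr * grad_theta(theta, phi)
    phi_half = phi - lr * grad_phi(theta, phi)
    # Update step using gradients evaluated at the extrapolated point.
    theta_new = theta + lr * grad_theta(theta_half, phi_half)
    phi_new = phi - lr * grad_phi(theta_half, phi_half)
    return theta_new, phi_new

# Toy bilinear game V(theta, phi) = theta * phi: simultaneous gradient
# ascent-descent cycles around the saddle point, while extragradient converges.
theta, phi = np.array([1.0]), np.array([1.0])
gt = lambda t, p: p   # dV/dtheta
gp = lambda t, p: t   # dV/dphi
for _ in range(200):
    theta, phi = extragradient_step(theta, phi, gt, gp)
print(theta, phi)  # both approach 0, the saddle point (Nash equilibrium of the toy game)
```

The extrapolation step is what distinguishes the extragradient from simultaneous gradient ascent-descent: gradients are re-evaluated at the look-ahead point before the actual update, which is the property behind its better convergence behavior on max-min problems.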