Twin delayed deep deterministic policy gradient based on optimistic exploration
Twin delayed deep deterministic policy gradient (TD3) is a mainstream model-free deep reinforcement learning algorithm that has been successfully applied to challenging continuous control tasks. However, when rewards are sparse or the state space is large, TD3 suffers from poor sample efficiency and weak exploration of the environment. To address the inefficient exploration caused by defining the objective function through the lower bound of the double Q-value functions, a twin delayed deep deterministic policy gradient based on optimistic exploration (TD3-OE) is proposed. First, starting from the double Q-value functions, it is shown that taking their lower bound makes exploration somewhat pessimistic. Then, a Gaussian function and a piecewise function are used to fit the double Q-value functions, respectively. Finally, an exploration policy is constructed from the fitted Q-value function and the target policy to guide the agent's exploration of the environment. This exploration policy prevents the agent from settling on sub-optimal policies, thereby effectively alleviating the problem of inefficient exploration. The proposed algorithm is compared with benchmark algorithms on control tasks built on the MuJoCo physics engine to verify its effectiveness. The experimental results show that the proposed algorithm matches or exceeds other baseline reinforcement learning algorithms in terms of reward, stability and learning speed.
deep reinforcement learning; twin delayed deep deterministic policy gradient; exploration policy; optimistic exploration
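
The following is a minimal Python sketch contrasting TD3's pessimistic lower-bound target, min(Q1, Q2), with an optimistic value estimate built from the mean and spread of the twin Q-values for use on the exploration side. The specific Gaussian and piecewise fitting used by TD3-OE is not reproduced here; the upper-confidence form and the coefficient beta are illustrative assumptions, not the paper's exact construction.

```python
# Illustrative sketch (assumptions): TD3's pessimistic target vs. an
# optimistic exploration-side estimate from the twin critics.
# The optimistic form below is a stand-in for TD3-OE's Gaussian/piecewise
# fit of the double Q-value functions, which the abstract does not detail.
import torch


def pessimistic_target(q1, q2, reward, not_done, gamma=0.99):
    """Standard TD3 target: the lower bound of the twin Q-values."""
    q_min = torch.min(q1, q2)
    return reward + gamma * not_done * q_min


def optimistic_estimate(q1, q2, beta=1.0):
    """Illustrative optimistic estimate: the mean of the twin Q-values plus
    a share of their disagreement, used as a rough uncertainty proxy."""
    q_mean = 0.5 * (q1 + q2)
    q_spread = 0.5 * (q1 - q2).abs()
    return q_mean + beta * q_spread


# Toy usage: twin critic outputs for a small batch of next state-action pairs.
q1 = torch.tensor([1.0, 2.5, 0.3])
q2 = torch.tensor([1.4, 2.0, 0.9])
reward = torch.tensor([0.1, 0.0, 1.0])
not_done = torch.tensor([1.0, 1.0, 0.0])

print(pessimistic_target(q1, q2, reward, not_done))  # conservative TD target
print(optimistic_estimate(q1, q2))                   # exploration-side estimate
```

The intent of the sketch is only to show the distinction the abstract draws: the lower bound is kept for stable value targets, while a more optimistic combination of the twin estimates can drive the exploration policy.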