A Survey of Interpretability Research Methods for Reinforcement Learning
Reinforcement learning, as a machine learning paradigm, is attracting increasing attention for its powerful trial-and-error learning capability. Combined with deep learning, reinforcement learning methods have achieved remarkable success on complex real-world control tasks. However, the black-box nature of deep reinforcement learning networks makes them hard to interpret, raising problems of insecurity, lack of control, and poor comprehensibility. This limitation hampers the adoption of reinforcement learning in critical domains such as autonomous driving and intelligent healthcare. To address this issue, researchers have conducted extensive work on explainable reinforcement learning. Nevertheless, these studies are relatively recent, and there is still no systematic summary of explainable methods tailored to multi-agent reinforcement learning. Moreover, the definition of interpretability involves subjective elements, which makes it harder to comprehensively categorize explainability research targeting the reinforcement learning process.

This article provides a comprehensive review and synthesis of the current state of interpretability research in reinforcement learning. It first establishes a definition of interpretability for reinforcement learning and outlines the relevant evaluation methods. Then, grounded in the Markov decision process, it categorizes interpretability into four classes: action-level explanation, feature-level explanation, reward-level explanation, and policy-level explanation.

At the action level, which concerns the action factors of the decision process, methods are categorized by the extent to which they explain an agent's decision-making actions: self-explanatory model construction, formal explanation methods, and generative explanation methods. At the feature level, which concerns the state factors, methods are classified by the importance of state features and the form of the explanation: interaction-trajectory explanation methods and key-feature visualization methods. At the reward level, which concerns the reward factors, methods are categorized by the impact of reward feedback on policy effectiveness under different task scenarios: reward decomposition methods and reward shaping methods. At the policy level, which concerns the learning strategy, methods are further divided into policy decomposition methods and policy aggregation methods, according to the hierarchical structure of policy explanations.

Together, these four categories cover the entire Markov decision process of an agent: action execution, state transition, reward computation, and policy learning. Furthermore, the description of each category traces the progression of interpretability methods from single-agent to multi-agent reinforcement learning, addressing the interpretability of more complex multi-agent environments; this includes analyzing and discussing credit assignment, collaborative cooperation, and adversarial games in the context of explainable methods, thereby bridging the gap in interpretability research methods for multi-agent reinforcement learning. In addition, in view of the subjective, human factors in interpretability research, the paper summarizes and organizes research on human-machine interactive interpretability, emphasizing human involvement in the explanation process. Finally, the article summarizes the current challenges of interpretability research in reinforcement learning and outlines prospects for future research directions.
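For orientation, the four explanation classes map onto the components of a standard Markov decision process; the notation below is a minimal sketch in common textbook usage, not the survey's own formalism:

\[
\mathcal{M} = \langle \mathcal{S}, \mathcal{A}, P, R, \gamma \rangle
\]

where the state space \(\mathcal{S}\) is the target of feature-level explanation, the action space \(\mathcal{A}\) of action-level explanation, and the reward function \(R : \mathcal{S} \times \mathcal{A} \to \mathbb{R}\) of reward-level explanation; \(P(s' \mid s, a)\) is the transition probability and \(\gamma \in [0, 1)\) the discount factor. The policy \(\pi(a \mid s)\), learned to maximize the expected return \(\mathbb{E}\!\left[\sum_{t \ge 0} \gamma^{t} R(s_t, a_t)\right]\), is the object of policy-level explanation.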
Keywords: reinforcement learning; interpretability; machine learning; artificial intelligence; Markov decision process