A Survey of Task-Oriented Dialogue Policies Based on Reinforcement Learning
Dialogue systems hold a crucial position in natural language processing (NLP), serving as a valuable component in facilitating human-machine interaction. At present, dialogue systems are attracting growing attention in both academia and industry because they are valuable for real-world conversational applications as well as for academic research. Pipeline-based human-computer dialogue systems consist of four distinct modules, with dialogue policy learning serving as a central component. In the pipeline framework, dialogue policy learning is responsible for selecting suitable dialogue actions based on the dialogue states obtained from the natural language understanding and dialogue state tracking modules. These selected actions subsequently drive the natural language generation process to produce a coherent and complete response. Dialogue policy learning is commonly formulated as either a Markov decision process (MDP) or a semi-Markov decision process (SMDP), which is then solved as a sequential decision problem using reinforcement learning methods. In recent years, research on task-oriented dialogue policy learning using reinforcement learning has expanded rapidly. However, to the best of our knowledge, the existing reviews on dialogue policy learning based on reinforcement learning fall notably short in terms of comprehensiveness and depth. Therefore, this paper focuses on task-oriented dialogue policy learning with reinforcement learning methods. We undertake a thorough analysis, categorization, and synthesis of task-oriented dialogue policy learning based on reinforcement learning techniques. First, we classify the reinforcement learning algorithms that are commonly used in dialogue policy learning. Then, based on this classification, we analyze the concept of dialogue policy
learning in general, and summarize the problems and limitations of existing dialogue policy learning methods. Furthermore, we present a comprehensive examination of current research directions and obstacles in the field of dialogue policy learning, covering prominent areas of investigation such as multi-domain, multi-modal, multi-agent, and empathetic dialogue policies. Next, we introduce additional studies related to dialogue policy learning. These include work on user simulators, methodologies for evaluating dialogue policy learning, dialogue policy platforms, datasets tailored for dialogue systems, and the interplay between large language models and the learning of dialogue policies. To address the deficiencies found in current research on dialogue policy learning, this paper analyzes prospective research directions for dialogue policy learning from five distinct vantage points. These perspectives span both reinforcement learning techniques and various applications. Finally, we conclude the paper and look ahead to the future of dialogue policy learning. This paper not only provides a classification and comprehensive overview of task-oriented dialogue policy learning based on reinforcement learning algorithms but also categorizes it from different application perspectives. It offers a multi-dimensional, comprehensive, and systematic synthesis of task-oriented dialogue policy learning. We believe that this paper can provide valuable insights and inspiration for future research in task-oriented dialogue policy learning and promote the development of human-machine dialogue systems.