Visual intelligence evaluation techniques for single object tracking: a survey
Single object tracking (SOT), which aims to model the human dynamic vision system and achieve human-like object tracking ability in complex environments, has been widely used in real-world applications such as self-driving, video surveillance, and robot vision. Over the past decade, progress in deep learning has encouraged many research groups to design different tracking frameworks, such as correlation filters (CF) and Siamese neural networks (SNNs), which have facilitated the progress of SOT research. However, many factors in natural application scenes (e.g., target deformation, fast motion, and illumination changes) still challenge SOT trackers. Thus, algorithms with novel architectures have been proposed to achieve robust tracking and better performance in representative experimental environments. Nevertheless, several failure cases in natural application environments reveal a large gap between the performance of state-of-the-art trackers and human expectations, which motivates us to pay close attention to the evaluation aspects. Therefore, instead of the traditional reviews that mainly concentrate on algorithm design, this study systematically reviews the visual intelligence evaluation techniques for SOT, covering four key aspects: the task definition, evaluation environments, task executors, and evaluation mechanisms.

First, we present the development of the task definition, from the original short-term tracking, through long-term tracking, to the recently proposed global instance tracking. With the evolution of the SOT definition, research has progressed from perceptual to cognitive intelligence. We also summarize the challenging factors in the SOT task to help readers understand the research bottlenecks in actual applications.

Second, we compare representative experimental environments in SOT evaluation. Unlike existing reviews that mainly introduce datasets in chronological order, this study divides the environments into three categories (i.e., general datasets, dedicated datasets, and competition datasets) and introduces them separately.

Third, we introduce the executors of SOT tasks, which include not only tracking algorithms represented by traditional trackers, CF-based trackers, SNN-based trackers, and Transformer-based trackers but also human visual tracking experiments conducted in interdisciplinary fields. To our knowledge, none of the existing SOT reviews have included related work on human dynamic visual ability. Introducing interdisciplinary works can therefore support visual intelligence evaluation by comparing machines with humans and better reveal the degree of intelligence of existing algorithm modeling methods.

Fourth, we review the evaluation mechanisms and metrics, which encompass traditional machine-machine comparisons and novel human-machine comparisons, and analyze the target tracking capability of various task executors. We also provide an overview of the human-machine comparison known as the visual Turing test, including its application in many vision tasks (e.g., image comprehension, game navigation, image classification, and image recognition). In particular, we hope that this study can help researchers focus on this novel evaluation technique, better understand the capability bottlenecks, further explore the gaps between existing methods and humans, and ultimately achieve the goal of algorithmic intelligence.

Finally, we indicate the evolution trend of visual intelligence evaluation techniques: 1) designing more human-like task definitions, 2) constructing more comprehensive and realistic evaluation environments, 3) including human subjects as task executors, and 4) using human abilities as a baseline to evaluate machine intelligence. In conclusion, this study summarizes the evolution trend of visual intelligence evaluation techniques for the SOT task, further analyzes the existing challenging factors, and discusses possible future research directions.
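As context for the machine-machine comparison mentioned above, the sketch below illustrates how the commonly used one-pass evaluation metrics for SOT, a success score based on bounding-box overlap (IoU) and a precision score based on center location error, can be computed per sequence. It is a minimal illustrative sketch, not a metric definition taken from this survey; the function names and the 20-pixel precision threshold are assumptions.

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x, y, w, h)."""
    xa, ya, wa, ha = box_a
    xb, yb, wb, hb = box_b
    xi1, yi1 = max(xa, xb), max(ya, yb)
    xi2, yi2 = min(xa + wa, xb + wb), min(ya + ha, yb + hb)
    inter = max(0.0, xi2 - xi1) * max(0.0, yi2 - yi1)
    union = wa * ha + wb * hb - inter
    return inter / union if union > 0 else 0.0

def evaluate_sequence(pred_boxes, gt_boxes, dist_threshold=20.0):
    """Return (success AUC, precision) for one sequence.

    Success AUC averages the fraction of frames whose IoU exceeds each
    threshold in [0, 1]; precision is the fraction of frames whose
    center location error stays within dist_threshold pixels
    (20 px is an assumed, commonly seen default).
    """
    overlaps, center_errors = [], []
    for p, g in zip(pred_boxes, gt_boxes):
        overlaps.append(iou(p, g))
        pc = (p[0] + p[2] / 2.0, p[1] + p[3] / 2.0)  # predicted box center
        gc = (g[0] + g[2] / 2.0, g[1] + g[3] / 2.0)  # ground-truth box center
        center_errors.append(np.hypot(pc[0] - gc[0], pc[1] - gc[1]))
    overlaps = np.asarray(overlaps)
    thresholds = np.linspace(0.0, 1.0, 21)            # IoU thresholds of the success plot
    success_curve = [(overlaps > t).mean() for t in thresholds]
    auc = float(np.mean(success_curve))               # area under the success curve
    precision = float((np.asarray(center_errors) <= dist_threshold).mean())
    return auc, precision
```

In a benchmark-style machine-machine comparison, a tracker's predicted boxes for each sequence would be scored against the ground-truth annotations with a routine like this, and the per-sequence scores averaged over the whole dataset.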