The task of video text tracking mainly involves detection and tracking. However, existing models fail to fully capture the semantic connections between consecutive video frames and neglect the real-time requirements of video text tracking. To address these issues, this paper presents a real-time end-to-end video text tracking model (RTTVTS), which achieves end-to-end video text tracking by predicting across multiple consecutive frames, addressing the challenges of dynamically detecting and continuously tracking text in video. First, a computationally efficient feature enhancement network composed of stacked feature pyramid enhancement modules is employed. Second, a lightweight detection head, working in conjunction with Pixel Aggregation, captures and learns detection information across consecutive video frames. Finally, during the inference phase, Kalman filtering is employed to associate the detection boxes across frames. Experimental results show that the proposed RTTVTS model improves both the effectiveness and the real-time performance of video text tracking.
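To make the inference-time association step concrete, the following is a minimal sketch of Kalman-filter-based box association: a constant-velocity Kalman filter predicts each tracked text box's centre in the next frame, and predictions are greedily matched to new detections by distance. The state layout, noise values, and the greedy matcher are illustrative assumptions for exposition, not details taken from the RTTVTS paper.

```python
import numpy as np

class KalmanTracker:
    """Constant-velocity Kalman filter over a box centre (cx, cy).

    State: [cx, cy, vx, vy]; measurement: [cx, cy].
    Noise magnitudes below are illustrative assumptions.
    """
    def __init__(self, cx, cy):
        self.x = np.array([cx, cy, 0.0, 0.0])           # state estimate
        self.P = np.eye(4) * 10.0                       # state covariance
        self.F = np.array([[1, 0, 1, 0],                # transition (dt = 1 frame)
                           [0, 1, 0, 1],
                           [0, 0, 1, 0],
                           [0, 0, 0, 1]], dtype=float)
        self.H = np.array([[1, 0, 0, 0],                # measurement model
                           [0, 1, 0, 0]], dtype=float)
        self.Q = np.eye(4) * 0.01                       # process noise
        self.R = np.eye(2) * 1.0                        # measurement noise

    def predict(self):
        """Propagate the state one frame ahead; return predicted centre."""
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[:2]

    def update(self, z):
        """Correct the state with an observed detection centre z = (cx, cy)."""
        y = np.asarray(z, dtype=float) - self.H @ self.x    # innovation
        S = self.H @ self.P @ self.H.T + self.R             # innovation covariance
        K = self.P @ self.H.T @ np.linalg.inv(S)            # Kalman gain
        self.x = self.x + K @ y
        self.P = (np.eye(4) - K @ self.H) @ self.P

def associate(trackers, detections, max_dist=50.0):
    """Greedily match each predicted track centre to its nearest unused detection."""
    preds = [t.predict() for t in trackers]
    matches, used = [], set()
    for i, p in enumerate(preds):
        best, best_d = None, max_dist
        for j, d in enumerate(detections):
            if j in used:
                continue
            dist = float(np.hypot(*(p - np.asarray(d, dtype=float))))
            if dist < best_d:
                best, best_d = j, dist
        if best is not None:
            used.add(best)
            matches.append((i, best))
            trackers[i].update(detections[best])  # fold the match back into the filter
    return matches
```

For example, with tracks initialised at (10, 10) and (100, 100), detections at (12, 11) and (98, 102) in the next frame are associated to tracks 0 and 1 respectively. A production tracker would typically use IoU-based cost and Hungarian matching instead of this greedy centre-distance scheme.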