Adaptive keyframe selection for continuous sign language recognition
Vision-based continuous sign language recognition (CSLR), which aims to recognize unsegmented signs from image sequences, provides a convenient communication tool for sign language users. Recent CSLR approaches typically extract visual and contextual features frame by frame, which leads to redundant computation because adjacent frames carry highly similar visual information. This paper analyzes the impact of frame rate on CSLR algorithms and finds that reducing the frame rate significantly improves computational efficiency but can also degrade recognition performance. To retain key sign language information while reducing computational cost, this paper proposes an adaptive dynamic temporal pooling (ADTP) layer that dynamically downsamples a sequence according to the self-similarity of its features. Furthermore, a two-stage training scheme is introduced to better exploit the spatiotemporal information in the original sequences: in the first stage, a CSLR model is trained on the original full-frame-rate sequences; in the second stage, the model equipped with the ADTP module is trained with knowledge distillation, guided by the first-stage model as a teacher. Experimental results demonstrate that the proposed method significantly reduces the computation required for recognition at only a small cost in accuracy. In addition, ADTP can also be applied to sign language video structure analysis, generating concise and intuitive summaries of sign language videos.
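To make the downsampling idea concrete, the following is a minimal sketch of self-similarity-based keyframe selection in PyTorch. It is an illustrative assumption, not the paper's exact ADTP layer: the function name, the cosine-similarity measure, and the fixed threshold are all hypothetical stand-ins for the learned, dynamic pooling described in the abstract.

```python
# Hypothetical sketch: greedy keyframe selection by feature self-similarity.
# Not the authors' ADTP implementation; threshold and similarity are assumed.
import torch
import torch.nn.functional as F


def select_keyframes(feats: torch.Tensor, sim_threshold: float = 0.95):
    """Downsample a feature sequence by dropping near-duplicate frames.

    feats: (T, D) per-frame features from a visual backbone.
    Returns the kept features and the indices of the selected keyframes.
    """
    normed = F.normalize(feats, dim=-1)  # unit-norm features for cosine similarity
    keep = [0]  # always keep the first frame
    for t in range(1, feats.size(0)):
        # Cosine similarity between this frame and the last kept keyframe.
        sim = torch.dot(normed[t], normed[keep[-1]])
        # Keep the frame only if it differs enough from the previous keyframe.
        if sim < sim_threshold:
            keep.append(t)
    idx = torch.tensor(keep, dtype=torch.long)
    return feats[idx], idx
```

In this sketch, the threshold controls the trade-off the abstract describes: a higher threshold keeps more frames (less speedup, more information), while a lower one pools more aggressively.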
Keywords: continuous sign language recognition · time series analysis · visual languages · knowledge distillation · computational efficiency
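The second-stage objective can likewise be sketched in hedged form, assuming a CTC-trained CSLR model (common in this line of work) whose downsampled student outputs have been aligned to the teacher's (e.g., by pooling the teacher's posteriors with the same keyframe indices). The loss weighting, temperature, and alignment are assumptions, not the paper's stated formulation.

```python
# Hedged sketch of a second-stage loss: CTC supervision plus soft-label
# distillation from the full-frame-rate teacher. Hyperparameters are assumed.
import torch
import torch.nn.functional as F


def stage2_loss(student_logits, teacher_logits, targets,
                input_lengths, target_lengths,
                temperature: float = 2.0, alpha: float = 0.5):
    """student_logits / teacher_logits: (T', B, C) gloss logits, where the
    teacher's outputs are assumed already aligned to the student's length T'."""
    # Standard CTC objective on the student's downsampled sequence.
    log_probs = F.log_softmax(student_logits, dim=-1)
    ctc = F.ctc_loss(log_probs, targets, input_lengths, target_lengths)
    # Soften both distributions and match them with KL divergence.
    kd = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    return alpha * ctc + (1.0 - alpha) * kd
```

The intent of such a combination is that the teacher, trained on original sequences in the first stage, supplies the spatiotemporal cues that pure CTC supervision on the pooled sequence would otherwise lose.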