To address the challenges of aligning multimodal data and the slow translation speed in sign language translation, this paper proposes a non-autoregressive Transformer for sign language translation (Trans-SLT-NA), which builds on the self-attention mechanism and incorporates a contrastive learning loss to align the multimodal data. By capturing the contextual and interaction information between the input sequence (sign language video) and the target sequence (text), the proposed model translates sign language into natural language in a single step. The effectiveness of the proposed model is evaluated on publicly available datasets, including PHOENIX-2014-T (German), CSL (Chinese), and How2Sign (English). Results demonstrate that the proposed method speeds up translation by a factor of 11.6 to 17.6 compared with autoregressive models, while remaining comparable in BiLingual Evaluation Understudy (BLEU-4) and Recall-Oriented Understudy for Gisting Evaluation (ROUGE) scores.
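The abstract does not specify the exact form of the contrastive learning loss used to align the video and text modalities. A minimal sketch of one common choice, a symmetric InfoNCE-style loss over paired video and text embeddings, is shown below; the function name, embedding dimensions, and temperature value are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss over a batch of paired
    (video, text) embeddings. Hypothetical sketch, not the
    paper's exact objective."""
    # L2-normalize so the dot product is cosine similarity
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    # Similarity matrix: entry (i, j) compares video i to text j
    logits = v @ t.T / temperature
    # Matching pairs lie on the diagonal
    targets = torch.arange(v.size(0))
    # Average the video-to-text and text-to-video directions
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.T, targets)) / 2

# Toy usage with random 256-dimensional embeddings for 8 pairs
video_emb = torch.randn(8, 256)
text_emb = torch.randn(8, 256)
loss = contrastive_alignment_loss(video_emb, text_emb)
```

The loss pulls each video embedding toward its paired sentence embedding and pushes it away from the other sentences in the batch, which is one standard way to realize the multimodal alignment the abstract describes.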
Key words
Sign language translation; Self-attention mechanism; Non-autoregressive translation; Deep learning; Multimodal data alignment