LingAlign: A Multilingual Sentence Aligner Using Cross-Lingual Sentence Embeddings
[Objective] This paper develops a multilingual sentence aligner for parallel-corpus-based research in digital humanities and machine translation. [Methods] The system first encodes the bitext to be aligned in a shared vector space, and then measures the semantic relationship between sentences using a modified cosine similarity. Finally, a two-stage dynamic programming algorithm automatically extracts parallel sentence pairs. [Results] We use both intrinsic and extrinsic evaluation to assess the performance of the system. In the intrinsic evaluation, the average precision, recall and F1 values reach 0.950, 0.960 and 0.955, respectively. In the extrinsic evaluation, the chrF, chrF++ and COMET scores are 55.65, 55.85 and 87.31, respectively. [Limitations] A data capture platform that integrates document alignment and sentence alignment has yet to be developed. [Conclusions] The proposed approach outperforms existing methods in both intrinsic and extrinsic evaluation tasks, which may help promote the construction of large, high-quality multilingual parallel corpora.
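The pipeline summarized under [Methods] can be illustrated with a minimal sketch. The abstract does not specify how the cosine similarity is modified or how the two stages of the dynamic programming are organized, so everything below is an assumption made for illustration: the function names are hypothetical, the "modified" similarity is taken to be a ratio-margin normalization over nearest neighbours, and the alignment is a single-pass, monotonic one-to-one search rather than the two-stage procedure. The embeddings are toy vectors standing in for the output of a cross-lingual sentence encoder.

```python
import numpy as np


def cosine_matrix(src, tgt):
    """Plain cosine similarity between every source and target embedding."""
    src = src / np.linalg.norm(src, axis=1, keepdims=True)
    tgt = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    return src @ tgt.T


def margin_scores(sim, k=2):
    """ASSUMED modification: divide each similarity by the average of the
    mean top-k similarities of its row and its column (ratio margin).
    Assumes those averages are positive."""
    k_row = min(k, sim.shape[1])
    k_col = min(k, sim.shape[0])
    row_nn = np.sort(sim, axis=1)[:, -k_row:].mean(axis=1)  # per source sentence
    col_nn = np.sort(sim, axis=0)[-k_col:, :].mean(axis=0)  # per target sentence
    return sim / ((row_nn[:, None] + col_nn[None, :]) / 2.0)


def align_one_to_one(score, skip=0.0):
    """Monotonic alignment by dynamic programming (single pass, 1-1 links only).

    Each step either links source i with target j, or skips one sentence
    on either side at cost `skip`. Returns the linked index pairs in order.
    """
    n, m = score.shape
    best = np.full((n + 1, m + 1), -np.inf)
    best[0, 0] = 0.0
    back = {}
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:  # link source i-1 with target j-1
                cand = best[i - 1, j - 1] + score[i - 1, j - 1]
                if cand > best[i, j]:
                    best[i, j], back[(i, j)] = cand, (i - 1, j - 1)
            if i > 0:  # leave source i-1 unaligned
                cand = best[i - 1, j] + skip
                if cand > best[i, j]:
                    best[i, j], back[(i, j)] = cand, (i - 1, j)
            if j > 0:  # leave target j-1 unaligned
                cand = best[i, j - 1] + skip
                if cand > best[i, j]:
                    best[i, j], back[(i, j)] = cand, (i, j - 1)

    # Walk back from the final cell, keeping only the diagonal (link) steps.
    pairs, i, j = [], n, m
    while (i, j) != (0, 0):
        pi, pj = back[(i, j)]
        if pi == i - 1 and pj == j - 1:
            pairs.append((pi, pj))
        i, j = pi, pj
    return pairs[::-1]


# Toy embeddings standing in for encoder output (hypothetical values).
src_emb = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
tgt_emb = np.array([[1.0, 0.1, 0.0], [0.1, 1.0, 0.0], [0.0, 0.1, 1.0]])

scores = margin_scores(cosine_matrix(src_emb, tgt_emb))
print(align_one_to_one(scores))  # [(0, 0), (1, 1), (2, 2)]
```

A full aligner would also need to score one-to-many and many-to-one links, which this sketch omits. The reported F1 value is consistent with the harmonic mean of the two preceding figures: 2 × 0.950 × 0.960 / (0.950 + 0.960) ≈ 0.955.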