Construction and Performance Evaluation of Parallel Corpora in Chinese and English Cross-Language Information Retrieval

Zhang Yuhui (张宇辉)¹, Zhang Xueping (张雪萍)²

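Author information

1. School of English, Jilin International Studies University, Changchun, Jilin 130117
2. School of Foreign Languages, Jilin Agricultural University, Changchun, Jilin 130118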

Abstract

[Purpose/significance] Corpora are an important source of translation data in cross-language information retrieval (CLIR). Within CLIR, evaluating corpus performance, extracting bilingual dictionaries from translations, and performing semantic disambiguation help meet people's needs for knowledge and information. [Method/process] This paper collects Chinese and English web pages from news websites such as the Wall Street Journal, the Financial Times, and the Hong Kong Government, uses the open-source software HTML Parser to filter out non-text content, and converts the result into XML files, thereby building its own parallel corpus. The CL-LSI and TDS models are then applied to this corpus and their performance is evaluated. [Result/conclusion] Validation on the constructed CLIR evaluation corpus shows that, in bilingual paired retrieval, the TDS model can fully and objectively extract semantically associated bilingual topic features, and that its CLIR performance exceeds that of the CL-LSI model. [Innovation/limitation] Building on an in-depth study of corpora, this paper proposes TDS, a cross-language information retrieval model based on the dual space of a parallel corpus. Chinese and English texts are collected separately for each given topic, the resulting keywords are applied to the TDS model, and the co-occurrence semantic information of bilingual terms is analyzed, achieving the goals of parallel corpus construction and performance evaluation. A limitation is that translation accuracy is low when the number of bilingual topics is small and improves only as the number of topics increases.
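The corpus-building steps described in the abstract (filter non-text content out of a web page, then store the aligned Chinese and English text as XML) can be illustrated with a minimal sketch. It is not the authors' implementation: the paper names the open-source HTML Parser software, whereas this sketch substitutes Python's standard-library `html.parser`, and the element and attribute names (`pair`, `text`, `id`, `lang`) as well as the helper names `TextExtractor`, `html_to_text`, and `make_pair_xml` are hypothetical choices made for illustration only.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def html_to_text(html):
    """Return the visible text of an HTML page as one space-joined string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def make_pair_xml(pair_id, zh_html, en_html):
    """Serialize one aligned Chinese/English page pair as an XML string.

    The <pair>/<text lang="..."> layout is an assumed schema, not the
    format used in the paper.
    """
    pair = ET.Element("pair", id=pair_id)
    ET.SubElement(pair, "text", lang="zh").text = html_to_text(zh_html)
    ET.SubElement(pair, "text", lang="en").text = html_to_text(en_html)
    return ET.tostring(pair, encoding="unicode")
```

Calling `make_pair_xml("doc1", zh_page, en_page)` on two fetched pages yields one `<pair>` element per aligned document. Keeping both languages under a shared identifier is what lets later models treat the two texts as a single bilingual unit.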

Key words

cross-language information retrieval/parallel corpora/dual space/TDS model/CL-LSI model

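Funding

Social Science Research Project of the Education Department of Jilin Province (2023) (JJKH20231385SK)

Jilin Province Education Science Planning Project (13th Five-Year Plan) (GH20391)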

Publication year

2024

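Information Science (情报科学)

China Society for Scientific and Technical Information; Jilin University

CSTPCD / CSSCI / CHSSCD / Peking University Core Journals

Impact factor: 2.275

ISSN: 1007-7634

Number of references: 8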
