Digital Text Recognition of China's Ethnic Language Documents: Based on OCR and Its Tools

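China's minority language documents are vast in number and written in many different scripts. Their content spans politics, economics, law, history, literature, art, religion, astronomy, geography, medicine, and other fields, and they form an important part of the cultural knowledge of the Chinese nation. Building text data from the documents of all ethnic groups and applying it to natural language processing and artificial intelligence can effectively promote the innovative transmission of China's fine traditional knowledge and the socialization of knowledge; such data construction rests on character recognition and text conversion of ancient documents and modern books, newspapers, and periodicals in each ethnic language. Early OCR technology in China solved the recognition of several major minority scripts, but it was abandoned because its characters were not encoded in the Unicode basic set. Current OCR technology recognizes documents in Mongolian, Tibetan, Uighur, Kazakh, Korean, and other scripts fairly well, but it still performs poorly on image text in which Chinese is set together with minority scripts. OCR technology for minority language documents should therefore be advanced through innovation. More than ten living scripts are currently in use in China's minority language documents, 11 of which are non-Latin. OCR technology should focus on recognizing manuscripts, block-printed texts, and letterpress-printed texts in these scripts, as well as texts that mix Chinese with minority scripts, and on developing open, multifunctional tools and platforms. On this basis, large-scale text data for minority language documents can then be built to promote innovation in China's linguistic research and natural language processing.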
Digital Recognition of Minority Language Documents in China: Based on OCR and its ways
China has over 130 minority languages and more than 10 minority scripts. These have preserved a wealth of ethnic language documents, including a large number of ancient manuscripts and modern printed documents. These records capture the long-standing civilization of the Chinese nation and the knowledge and practices of various ethnic groups in their production and daily life. The content covers a wide range of areas, including politics, economics, law, history, literature, art, religion, medicine, astronomy, and geography, reflecting the exchange, integration, and innovation of various ethnic cultures. Fully utilizing contemporary data science and artificial intelligence (AI) to innovate text recognition technology for minority language documents and achieving the digitization of massive amounts of literature is of great historical and cultural significance, and practical political significance. This effort is crucial for the scientific protection of Chinese minority language document resources and the inheritance of excellent Chinese traditional knowledge and cultural spirit. Collecting and organizing minority language books, newspapers, and manuscripts for large-scale text recognition and digitization is a crucial source for building natural language processing (NLP) and AI datasets. The digitization of minority language documents involves two fundamental tasks: (1) compiling and cataloging various documents to create indexed data; and (2) performing optical character recognition (OCR) on the content of these documents to convert them into computer-processable text files. The recognition of minority language text is the prerequisite for the digitization of document content, while OCR text recognition is key to constructing large-scale corpora and knowledge text data in China's native languages. Efficient document OCR recognition technology has broad applications. It enables publishers to transition from passively receiving manuscripts to actively creating knowledge content, maximizing content production potential. Additionally, it facilitates the extraction of text data such as characters, words, sentences, and paragraphs from vast social language landscapes, addressing the issue of entity name corpora in NLP tasks. Through OCR technology, a large number of language examples and corpora in linguistic works can be automatically extracted and annotated. This enables the large-scale integration and utilization of discrete corpora in mixed-language documents of various minority languages and Chinese, driving a data-oriented shift in linguistic research in China. The text recognition of minority language documents involves two key technologies: the accuracy of single-language recognition, and the differentiation and recognition of texts in documents with mixed scripts. Currently, the OCR technology in China performs well in recognizing documents in Mongolian, Tibetan, Uighur, Kazakh, Korean, and other languages, but it performs poorly at the application level in recognizing mixed-language documents. Therefore, the R&D of text recognition technology for minority language documents in China currently focuses on four main tasks: (1) solving the text recognition of ancient documents in minority languages; (2) addressing the recognition and extraction of mixed scripts involving multiple minority languages, the coexistence of Chinese characters and minority scripts, and various linguistic works in minority languages; (3) advancing the OCR recognition and simultaneous digitization of single-language documents in minority languages; (4) rapidly developing various tools and integrated platforms for the text recognition of minority language documents in China. To accomplish these tasks, interdisciplinary research that combines linguistics and contemporary AI science is necessary.

ethno-languages; ethnic documents; text recognition; OCR; data construction; digital humanities

范俊军、刘贤娴

暨南大学文学院 (College of Liberal Arts, Jinan University)

minority languages; ethnic documents; text recognition; OCR; data construction; digital humanities

2024

暨南学报(哲学社会科学版) (Jinan Journal, Philosophy and Social Sciences)
暨南大学 (Jinan University)

CSSCI; CHSSCD; 北大核心 (Peking University Core Journals)
Impact factor: 0.69
ISSN: 1000-5072
Year, volume (issue): 2024, 46(6)