Digital Recognition of Minority Language Documents in China: OCR-Based Approaches
China has over 130 minority languages and more than 10 minority scripts. These scripts have preserved a wealth of ethnic language documents, including a large number of ancient manuscripts and modern printed documents. These records capture the long-standing civilization of the Chinese nation and the knowledge and practices of various ethnic groups in their production and daily life. Their content covers a wide range of areas, including politics, economics, law, history, literature, art, religion, medicine, astronomy, and geography, and reflects the exchange, integration, and innovation of various ethnic cultures. Using contemporary data science and artificial intelligence (AI) to advance text recognition technology for minority language documents, and thereby digitizing this massive body of literature, is of great historical and cultural significance as well as practical political significance. This effort is crucial for the scientific protection of China's minority language document resources and for the inheritance of excellent Chinese traditional knowledge and cultural spirit. Collecting and organizing minority language books, newspapers, and manuscripts for large-scale text recognition and digitization also provides a crucial source of data for building natural language processing (NLP) and AI datasets.
The digitization of minority language documents involves two fundamental tasks: (1) compiling and cataloging various documents to create indexed data; and (2) performing optical character recognition (OCR) on the content of these documents to convert it into computer-processable text files. Recognizing minority language text is the prerequisite for digitizing document content, and OCR is key to constructing large-scale corpora and knowledge text data in China's native languages. Efficient document OCR technology has broad applications. It enables publishers to move from passively receiving manuscripts to actively creating knowledge content, maximizing content production potential. It also facilitates the extraction of text data such as characters, words, sentences, and paragraphs from vast social language landscapes, helping to address the problem of entity-name corpora in NLP tasks. Through OCR, the many language examples and corpora contained in linguistic works can be automatically extracted and annotated. This enables the large-scale integration and use of the scattered corpora found in documents that mix various minority languages with Chinese, driving a data-oriented shift in linguistic research in China.
The text recognition of minority language documents depends on two key technical capabilities: accurate single-language recognition, and the differentiation and recognition of text in documents with mixed scripts. Currently, OCR technology in China performs well on documents in Mongolian, Tibetan, Uighur, Kazakh, Korean, and other languages, but performs poorly at the application level on mixed-language documents. Therefore, R&D on text recognition technology for minority language documents in China currently focuses on four main tasks: (1) solving the text recognition of ancient documents in minority languages; (2) addressing the recognition and extraction of mixed scripts, including documents involving multiple minority languages, documents in which Chinese characters and minority scripts coexist, and various linguistic works in minority languages; (3) advancing the OCR and simultaneous digitization of single-language documents in minority languages; and (4) rapidly developing tools and integrated platforms for the text recognition of minority language documents in China. Accomplishing these tasks requires interdisciplinary research that combines linguistics and contemporary AI science.
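As a rough illustration of the script differentiation step mentioned above, the following Python sketch splits already-digitized text into runs by Unicode block. It is only a minimal sketch under stated assumptions, not the method of any particular OCR system: the names `SCRIPT_RANGES`, `script_of`, and `segment_by_script` are hypothetical, the block list covers only a few of the scripts named in this paper, and a real pipeline would have to separate scripts at the image level before recognition rather than on text.

```python
from typing import List, Tuple

# Hypothetical, deliberately incomplete table of Unicode blocks.
# Arabic covers the script used for Uighur and Kazakh in China;
# Han covers only the basic CJK Unified Ideographs block.
SCRIPT_RANGES = [
    ("Arabic", 0x0600, 0x06FF),
    ("Tibetan", 0x0F00, 0x0FFF),
    ("Mongolian", 0x1800, 0x18AF),
    ("Han", 0x4E00, 0x9FFF),
    ("Hangul", 0xAC00, 0xD7AF),
]


def script_of(ch: str) -> str:
    """Return the script label for one character, or "Common" if unlisted."""
    cp = ord(ch)
    for name, lo, hi in SCRIPT_RANGES:
        if lo <= cp <= hi:
            return name
    # Assumption: spaces, digits, punctuation, Latin letters and any other
    # unlisted characters are treated as script-neutral.
    return "Common"


def segment_by_script(text: str) -> List[Tuple[str, str]]:
    """Split text into consecutive (script, run) pairs.

    Script-neutral characters are attached to the run that precedes them,
    so a space between two Tibetan syllables does not start a new run.
    """
    runs: List[Tuple[str, str]] = []
    for ch in text:
        script = script_of(ch)
        if runs and (script == "Common" or script == runs[-1][0]):
            label, run = runs[-1]
            runs[-1] = (label, run + ch)
        else:
            runs.append((script, ch))
    return runs
```

Calling `segment_by_script` on a line that mixes Chinese characters with Tibetan or Hangul text yields one run per script, and each run could then be passed to a recognizer or language model for that script. Attaching neutral characters to the preceding run is a simplifying design choice; punctuation shared between scripts may need finer handling in practice.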