我国低资源语言大规模数据建构及语言田野实践的数据转向

Large-scale Data Construction of Low-resource Languages in China and the Data-oriented Turn in Linguistic Fieldwork

范俊军 1, 沐华 2


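Author information

  • 1. College of Liberal Arts, Jinan University, Guangzhou, Guangdong 510632
  • 2. College of Liberal Arts, Jinan University, Guangzhou, Guangdong 510632; School of Language and Culture, Chuxiong Normal University, Chuxiong, Yunnan 675099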

Abstract

Low-resource languages are languages that lack sufficient basic data for natural language processing tasks and quantitative linguistic analysis. The scarcity of data for such languages is a problem shared by linguistics and natural language processing today. The most basic part of a language data resource is speech and text data for monolingual or bilingual words and sentences. In China, Mandarin, Cantonese, Tibetan, Uyghur, Mongolian, and Zhuang are on the whole high-resource languages; all other languages are low-resource, and the languages and dialects spoken at the county and township level are zero-resource. Building large-scale data for China's low-resource languages would help strengthen the country's control over its own language resources, allow China's natural language processing community to play a distinctive role in technical innovation in language models, promote a data-oriented turn in linguistic fieldwork, renew the theory and practice of field linguistics, and advance wide-area linguistic research based on quantitative data. Building this data involves four main tasks: first, building large-scale word datasets; second, building a knowledge-based semantic wordnet; third, building large-scale sentence datasets; and fourth, converting existing language materials into data.

Key words

low-resource languages/minority languages/natural language processing (NLP)/field linguistics

Funding

Major Project of the National Social Science Fund of China (2014ZDB106)

Publication year

2023
云南师范大学学报(哲学社会科学版)
云南师范大学


Indexed in: CSSCI, CHSSCD, 北大核心 (Peking University Core Journals)
Impact factor: 1.025
ISSN: 1000-5110
Citations: 2
References: 7