Abstract
2024 FEB 20 (NewsRx) – By a News Reporter-Staff News Editor at Robotics & Machine Learning Daily News – New research on artificial intelligence is the subject of a new report. According to news reporting from Faisalabad, Pakistan, by NewsRx journalists, the research stated, "In text applications, pre-processing is deemed a significant parameter for enhancing the outcomes of natural language processing (NLP) tasks. Text normalization and tokenization are two pivotal procedures of text pre-processing whose importance cannot be overstated."

Our news editors obtained a quote from the research from National Textile University: "Text normalization refers to transforming raw text into scripturally standardized text, while word tokenization splits the text into tokens or words. Well-defined normalization and tokenization approaches exist for most of the world's spoken languages. However, Urdu, the world's 10th most widely spoken language, has been overlooked by the research community. This research presents improved text normalization and tokenization techniques for the Urdu language. For Urdu text normalization, multiple regular expressions and rules are proposed, including removing diacritics, normalizing single characters, and separating digits. For word tokenization, core features are defined and extracted for each character of the text. A machine learning model, combined with specified handcrafted rules, predicts spaces and thereby tokenizes the text. The experiments were performed while creating the largest human-annotated dataset composed in Urdu script, covering five different domains."
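The normalization and feature-extraction steps described above can be sketched with standard Python regular expressions. This is an illustrative assumption of how such rules might look, not the paper's actual rule set; the function names (`normalize_urdu`, `char_features`), the Unicode ranges chosen, and the specific features are hypothetical.

```python
import re

# Arabic-script diacritics (harakat), e.g. fatha, damma, kasra, shadda,
# plus superscript alef — an assumed subset of the paper's removal rules.
DIACRITICS = re.compile(r"[\u064B-\u0652\u0670]")

# Urdu/Arabic letter range and digit ranges (ASCII plus extended
# Arabic-Indic digits used in Urdu) — illustrative character classes.
LETTER = r"[\u0621-\u063A\u0641-\u064A\u0671-\u06D3]"
DIGIT = r"[0-9\u06F0-\u06F9]"

def normalize_urdu(text: str) -> str:
    """Remove diacritics and insert a space between letters and digits."""
    text = DIACRITICS.sub("", text)
    # Separate digits glued to the end of a letter sequence, and vice versa.
    text = re.sub(rf"(?<={LETTER})(?={DIGIT})", " ", text)
    text = re.sub(rf"(?<={DIGIT})(?={LETTER})", " ", text)
    return text

def char_features(text: str, i: int) -> dict:
    """Toy per-character features for a space-prediction model.

    The paper extracts "core features against each character"; the exact
    feature set is not published here, so these are stand-in examples.
    """
    ch = text[i]
    return {
        "char": ch,
        "is_digit": ch.isdigit(),
        "prev": text[i - 1] if i > 0 else "<s>",
        "next": text[i + 1] if i + 1 < len(text) else "</s>",
    }
```

A classifier (the paper does not name the model family in this excerpt) would then be trained on such feature vectors to predict, for each character position, whether a space boundary follows it.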