Abstract
2024 FEB 20 (NewsRx) – By a News Reporter-Staff News Editor at Robotics & Machine Learning Daily News – New research on artificial intelligence is the subject of a new report. According to news reporting from Faisalabad, Pakistan, by NewsRx journalists, the research stated, "In text applications, pre-processing is deemed a significant parameter for enhancing the outcomes of natural language processing (NLP) tasks. Text normalization and tokenization are two pivotal procedures of text pre-processing whose importance cannot be overstated."

Our news editors obtained a quote from the research from National Textile University: "Text normalization refers to transforming raw text into scripturally standardized text, while word tokenization splits the text into tokens or words. Well-defined normalization and tokenization approaches exist for most of the world's spoken languages. However, Urdu, the world's 10th most widely spoken language, has been overlooked by the research community. This research presents improved text normalization and tokenization techniques for the Urdu language. For Urdu text normalization, multiple regular expressions and rules are proposed, including removing diacritics, normalizing single characters, and separating digits. For word tokenization, core features are defined and extracted for each character of the text. A machine learning model, combined with specified handcrafted rules, predicts spaces and thereby tokenizes the text. The experiments were performed while creating the largest human-annotated dataset composed in Urdu script, covering five different domains."
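The normalization and feature-extraction steps described above can be sketched with standard Python regular expressions. This is an illustrative assumption of how such rules might look, not the paper's actual rule set; the function names (`normalize_urdu`, `char_features`), the Unicode ranges chosen, and the specific features are hypothetical.

```python
import re

# Arabic-script diacritics (harakat), e.g. fatha, damma, kasra, shadda,
# plus superscript alef — an assumed subset of the paper's removal rules.
DIACRITICS = re.compile(r"[\u064B-\u0652\u0670]")

# Urdu/Arabic letter range and digit ranges (ASCII plus extended
# Arabic-Indic digits used in Urdu) — illustrative character classes.
LETTER = r"[\u0621-\u063A\u0641-\u064A\u0671-\u06D3]"
DIGIT = r"[0-9\u06F0-\u06F9]"

def normalize_urdu(text: str) -> str:
    """Remove diacritics and insert a space between letters and digits."""
    text = DIACRITICS.sub("", text)
    # Separate digits glued to the end of a letter sequence, and vice versa.
    text = re.sub(rf"(?<={LETTER})(?={DIGIT})", " ", text)
    text = re.sub(rf"(?<={DIGIT})(?={LETTER})", " ", text)
    return text

def char_features(text: str, i: int) -> dict:
    """Toy per-character features for a space-prediction model.

    The paper extracts "core features against each character"; the exact
    feature set is not published here, so these are stand-in examples.
    """
    ch = text[i]
    return {
        "char": ch,
        "is_digit": ch.isdigit(),
        "prev": text[i - 1] if i > 0 else "<s>",
        "next": text[i + 1] if i + 1 < len(text) else "</s>",
    }
```

A classifier (the paper does not name the model family in this excerpt) would then be trained on such feature vectors to predict, for each character position, whether a space boundary follows it.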