Journal Information

Language Resources and Evaluation
Publisher: Springer
Frequency: Quarterly
ISSN: 1574-020X
Indexed in: AHCI, ISTP, SCI
Status: Officially published

    Improving Arabic sentiment analysis across context-aware attention deep model based on natural language processing

    Abubakr H. Ombabi, Wael Ouarda, Adel M. Alimi
    pp. 639-663
    Abstract: With the enormous growth of social data in recent years, sentiment analysis has gained increasing research attention and has been widely explored in various languages. The nature of the Arabic language imposes several challenges, such as its complicated morphological structure and limited resources; consequently, the current state-of-the-art methods for sentiment analysis still need to be enhanced. This inspired us to explore the application of emerging deep-learning architectures to Arabic text classification. In this paper, we present an ensemble model that integrates a convolutional neural network, bidirectional long short-term memory (Bi-LSTM), and an attention mechanism to predict the sentiment orientation of Arabic sentences. The convolutional layer extracts features from the higher-level sentence representation layer, and the Bi-LSTM further captures contextual information from the resulting feature set. Two attention units highlight the critical information in the contextual feature vectors produced by the Bi-LSTM hidden layers. The context-related vectors generated by the attention layers are then concatenated and passed to a classifier to predict the final label. To disentangle the influence of these components, the proposed model is validated as three variant architectures on a multi-domain corpus as well as four benchmarks. Experimental results show that incorporating the Bi-LSTM and the attention mechanism improves the model's performance, yielding 96.08% accuracy. This architecture consistently outperforms other state-of-the-art approaches, with improvements of up to +14.47% in accuracy, +20.38% in precision, and +18.45% in recall. These results demonstrate the strengths of this model in addressing the challenges of text classification tasks.
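    The following PyTorch sketch illustrates the general shape of such a pipeline: a convolutional layer over word embeddings, a Bi-LSTM over the convolved features, and an attention unit that pools the hidden states before classification. It is a minimal illustration, not the authors' model; all hyperparameters are invented, and a single attention unit stands in for the paper's two.

```python
# Minimal PyTorch sketch of a CNN + Bi-LSTM + attention sentence classifier.
# Illustration only: hyperparameters are invented, and a single attention
# unit stands in for the paper's two.
import torch
import torch.nn as nn

class CnnBiLstmAttention(nn.Module):
    def __init__(self, vocab_size, embed_dim=300, conv_channels=128,
                 hidden=128, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        # Convolution over embeddings extracts local n-gram features.
        self.conv = nn.Conv1d(embed_dim, conv_channels, kernel_size=3, padding=1)
        # Bi-LSTM captures contextual information over the convolved features.
        self.bilstm = nn.LSTM(conv_channels, hidden, batch_first=True,
                              bidirectional=True)
        # Additive attention scores each time step of the Bi-LSTM output.
        self.attn = nn.Linear(2 * hidden, 1)
        self.classifier = nn.Linear(2 * hidden, num_classes)

    def forward(self, token_ids):                     # (batch, seq_len)
        x = self.embed(token_ids)                     # (batch, seq, embed)
        x = torch.relu(self.conv(x.transpose(1, 2)))  # (batch, chan, seq)
        h, _ = self.bilstm(x.transpose(1, 2))         # (batch, seq, 2*hidden)
        weights = torch.softmax(self.attn(h), dim=1)  # (batch, seq, 1)
        context = (weights * h).sum(dim=1)            # (batch, 2*hidden)
        return self.classifier(context)

logits = CnnBiLstmAttention(vocab_size=50_000)(torch.randint(1, 50_000, (4, 20)))
print(logits.shape)  # torch.Size([4, 2])
```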

    Data augmentation and transfer learning for cross-lingual Named Entity Recognition in the biomedical domain

    Brayan Stiven Lancheros, Gloria Corpas Pastor, Ruslan Mitkov
    pp. 665-684
    Abstract: Given the increase in data production in the biomedical field and the unstoppable growth of the internet, the need for Information Extraction (IE) techniques has skyrocketed. Named Entity Recognition (NER) is one such IE task, useful for professionals in many areas. Biomedical NER is needed in several settings, for instance, extraction and analysis of biomedical literature, relation extraction, organisation of biomedical documents, and knowledge-base completion. However, the computational treatment of entities in the biomedical domain has faced a number of challenges, including the high cost of annotation, ambiguity, and a lack of biomedical NER datasets in languages other than English. These difficulties have hampered data development, affecting both the domain itself and its multilingual coverage. The purpose of this study is to overcome the scarcity of biomedical NER data in Spanish, for which only two datasets exist, by developing a robust bilingual NER model. Inspired by back-translation, this paper leverages progress in Neural Machine Translation (NMT) to create a synthetic Spanish version of the Colorado Richly Annotated Full-Text (CRAFT) dataset. Additionally, a new CRAFT dataset is constructed by replacing 20% of the entities in the original dataset, generating an augmented dataset. We evaluate two training methods, concatenation of datasets and continuous training, to assess the transfer-learning capabilities of transformers on the newly obtained datasets. The best-performing NER system achieved an F1 score of 86.39% on the development set. The novel methodology proposed in this paper presents the first bilingual NER system and has the potential to improve applications for under-resourced languages.
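    As a rough illustration of the entity-replacement idea (not the authors' exact pipeline), the sketch below swaps each entity mention in a BIO-tagged corpus for a random mention of the same type with 20% probability; the data format, entity pool, and function names are assumptions made for the demo.

```python
# Rough sketch of entity-replacement augmentation over a BIO-tagged corpus:
# each entity mention is swapped for a random same-type mention with 20%
# probability. Data format and the entity pool are assumptions for the demo.
import random

def augment(sentences, entity_pool, swap_prob=0.2, seed=13):
    """sentences: list of [(token, bio_tag), ...] pairs;
    entity_pool: {entity_type: [replacement token sequences]}."""
    rng = random.Random(seed)
    augmented = []
    for sent in sentences:
        out, i = [], 0
        while i < len(sent):
            token, tag = sent[i]
            if tag.startswith("B-") and rng.random() < swap_prob:
                etype = tag[2:]
                i += 1
                while i < len(sent) and sent[i][1] == f"I-{etype}":
                    i += 1  # skip the rest of the original mention
                for k, new_tok in enumerate(rng.choice(entity_pool[etype])):
                    out.append((new_tok, ("B-" if k == 0 else "I-") + etype))
            else:
                out.append((token, tag))
                i += 1
        augmented.append(out)
    return augmented

pool = {"CHEBI": [["glucosa"], ["trifosfato", "de", "adenosina"]]}
sents = [[("El", "O"), ("ATP", "B-CHEBI"), ("regula", "O")]]
print(augment(sents, pool, swap_prob=1.0))  # force a swap for the demo
```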

    Analyzing learner language: the case of the Hebrew Learner Essay Corpus

    Chen Gafni, Livnat Herzig Sheinfux, Hadar Klunover, Anat Bar Siman Tov, et al.
    pp. 685-726
    Abstract: We present the Hebrew Learner Essay Corpus (HELEECS): an annotated corpus of argumentative essays in Hebrew authored by prospective higher-education students. The corpus includes essays by two main populations: (1) native speakers of Hebrew, who wrote the essays as part of the psychometric exam used to assess their future success in academic studies; and (2) non-native speakers of Hebrew with three different native languages (Arabic, French, and Russian), who wrote the essays as part of a language aptitude test. The corpus is uniformly encoded and stored. The non-native essays were annotated with target hypotheses (i.e., hypothesized intended formulations in standard written Hebrew). The corpus is available for research purposes upon request. We describe the corpus and the error-correction and annotation schemes used in its analysis. In addition to introducing this new resource, we discuss the challenges of identifying and analyzing non-native language use. Among these challenges are determining whether the language used in a particular utterance is native-like, and determining the target hypothesis when it is not. We propose various ways of dealing with these challenges.

    Cross-linguistically consistent semantic and syntactic annotation of child-directed speech

    Ida Szubert, Omri Abend, Nathan Schneider, Samuel Gibbon, et al.
    pp. 727-776
    Abstract: Corpora of child speech and child-directed speech (CDS) have enabled major contributions to the study of child language acquisition, yet semantic annotation for such corpora is still scarce and lacks a uniform standard. Semantic annotation of CDS is particularly important for understanding the nature of the input children receive and for developing computational models of child language acquisition. For example, under the assumption that children are able to infer meaning representations for (at least some of) the utterances they hear, the acquisition task is to learn a grammar that can map novel adult utterances onto their corresponding meaning representations, in the face of noise and distraction by other contextually possible meanings. To study this problem and to develop computational models of it, we need corpora that provide both adult utterances and their meaning representations, ideally using annotation that is consistent across a range of languages in order to facilitate cross-linguistic comparative studies. This paper proposes a methodology for constructing such corpora of CDS paired with sentential logical forms, and uses this method to create two such corpora, in English and Hebrew. The approach enforces a cross-linguistically consistent representation, building on recent advances in dependency representation and semantic parsing. Specifically, the approach involves two steps. First, we annotate the corpora using the Universal Dependencies (UD) scheme for syntactic annotation, which has been developed to apply consistently to a wide variety of domains and typologically diverse languages. Next, we further annotate these data by applying an automatic method for transducing sentential logical forms (LFs) from UD structures. The UD and LF representations have complementary strengths: UD structures are language-neutral and support consistent and reliable annotation by multiple annotators, whereas LFs are neutral as to their syntactic derivation and transparently encode semantic relations. Using this approach, we provide syntactic and semantic annotation for two corpora from CHILDES: Brown’s Adam corpus (English; we annotate ≈80% of its child-directed utterances) and all child-directed utterances from Berman’s Hagar corpus (Hebrew). We verify the quality of the UD annotation using an inter-annotator agreement study, and manually evaluate the transduced meaning representations. We then demonstrate the utility of the compiled corpora through (1) a longitudinal corpus study of the prevalence of different syntactic and semantic phenomena in the CDS, and (2) applying an existing computational model of language acquisition to the two corpora and briefly comparing the results across languages.
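    The following toy sketch, using the third-party conllu package, shows the flavor of transducing a logical form from a UD parse; the single predicate-argument rule is a deliberate oversimplification of the paper's transduction method, and the example sentence is invented.

```python
# Toy illustration (not the paper's transducer) of deriving a logical form
# from a Universal Dependencies parse, using the third-party `conllu`
# package (pip install conllu). One simplistic rule: predicate from the
# root verb, arguments from its nsubj/obj dependents.
from conllu import parse

SAMPLE = """1\tAdam\tAdam\tPROPN\t_\t_\t2\tnsubj\t_\t_
2\tsees\tsee\tVERB\t_\t_\t0\troot\t_\t_
3\ta\ta\tDET\t_\t_\t4\tdet\t_\t_
4\tdog\tdog\tNOUN\t_\t_\t2\tobj\t_\t_
"""

def to_logical_form(sentence):
    root = next(tok for tok in sentence if tok["deprel"] == "root")
    args = [tok["lemma"] for tok in sentence
            if tok["head"] == root["id"] and tok["deprel"] in ("nsubj", "obj")]
    return f"{root['lemma']}({', '.join(args)})"

print(to_logical_form(parse(SAMPLE)[0]))  # -> see(Adam, dog)
```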

    Correction: Cross-linguistically consistent semantic and syntactic annotation of child-directed speech

    Ida Szubert, Omri Abend, Nathan Schneider, Samuel Gibbon, et al.
    pp. 777-778

    Preservation of sentiment in machine translation of low-resource languages: a case study on Slovak movie subtitles

    Jaroslav Reichel, Ľubomír Benko
    pp. 779-805
    Abstract: This research explores the effectiveness of machine translation from Slovak to English for sentiment analysis, focusing on the translation of movie subtitles. The study employs a parallel corpus of segmented movie subtitles in both languages and utilizes the IBM Watson™ Natural Language Understanding service and Google Translate. The research aims to assess the correlation between sentiment scores for human-generated text and for machine-translated text. A comparative analysis was also conducted using OpenAI models to evaluate the sentiment of the Slovak text directly, without translation into English. The findings reveal a strong correlation between human text and machine translation, with a Pearson correlation coefficient of 0.86, and a correlation of 0.72 with OpenAI's GPT model evaluation. Despite the relatively high accuracy of the end-to-end OpenAI solution, the pipeline of machine translation followed by sentiment analysis in English proved significantly more precise. The research further investigates the challenges of translating language-specific nuances, such as humor and vulgarisms, and their impact on sentiment analysis. The study concludes that machine translation can be used effectively for sentiment analysis in Slovak, an inflected language, and highlights the potential of advanced language models for low-resource languages. Future research directions include expanding the study to other text types and to comparable languages beyond Slovak.
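    The core of the reported evaluation is a correlation between sentiment scores produced for the same segments by different pipelines. A minimal sketch with SciPy, using made-up scores rather than the paper's data:

```python
# Minimal sketch of the correlation analysis: sentiment scores for the same
# subtitle segments from two pipelines compared with Pearson's r. The scores
# below are toy values, not the paper's data.
from scipy.stats import pearsonr

# Pipeline A: sentiment of the human (reference) English text, in [-1, 1].
human_scores = [0.8, -0.4, 0.1, -0.9, 0.6, 0.3, -0.2]
# Pipeline B: sentiment of the machine-translated (Slovak -> English) text.
mt_scores = [0.7, -0.5, 0.2, -0.8, 0.5, 0.4, -0.1]

r, p_value = pearsonr(human_scores, mt_scores)
print(f"Pearson r = {r:.2f} (p = {p_value:.4f})")
```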

    Dataset on sentiment-based cryptocurrency-related news and tweets in English and Malay language

    Nur Azmina Mohamad Zamani, Norhaslinda Kamaruddin, Ahmad Muhyiddin B. Yusof
    pp. 807-842
    Abstract: Cryptocurrency trading has become popular as a potentially profitable investment and has led to worldwide involvement in buying and selling cryptocurrency assets. Sentiments expressed by cryptocurrency enthusiasts toward news items via social media or other online platforms may affect cryptocurrency market activity. It is therefore more challenging to determine the degree of positivity or negativity inherent in a text (regression) than to simply classify its sentiment into categorical classes. Regression offers more detailed information than simple classification and can be robust to noisy data, since it considers the entire range of possible target values; classification, by contrast, can yield biased models on imbalanced datasets and tends to overfit. Hence, this work focuses on creating sentiment-based cryptocurrency-related corpora in English and Malay, covering Bitcoin and Ethereum. The data was collected from January to December 2021 from publicly available online news and from tweets on Twitter, in both English and Malay. The dataset contains a total of 29,694 instances: 5694 news items and 24,000 tweets. During the annotation process, the annotators were trained until a Krippendorff's alpha agreement above 60% was achieved, which is considered an acceptable benchmark given the complexity of the annotation. The corpora are available on GitHub for cryptocurrency-related experiments using various machine-learning or deep-learning models to study the effect of English and Malay sentiment on the global market, particularly the Malaysian market, and can be extended for further analysis of the volatile Bitcoin and Ethereum markets.
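    The agreement criterion mentioned above can be checked with the third-party krippendorff package; the sketch below uses invented ratings purely to illustrate the computation.

```python
# Illustration of the agreement criterion: Krippendorff's alpha over
# annotator ratings, via the third-party `krippendorff` package
# (pip install krippendorff). The ratings are invented toy data.
import numpy as np
import krippendorff

# Rows = annotators, columns = items; np.nan marks a missing rating.
ratings = np.array([
    [0.8, 0.6,    -0.2, 0.1, -0.9],
    [0.7, 0.6,    -0.1, 0.0, -0.8],
    [0.9, np.nan, -0.2, 0.2, -0.9],
])

alpha = krippendorff.alpha(reliability_data=ratings,
                           level_of_measurement="interval")
print(f"Krippendorff's alpha = {alpha:.2f}")  # retrain annotators until > 0.6
```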

    Automatic construction of direction-aware sentiment lexicon using direction-dependent words

    Jihye Park, Hye Jin Lee, Sungzoon Cho
    pp. 843-869
    Abstract: Explainability, the degree to which an interested stakeholder can understand the key factors that led to a data-driven model's decision, has been considered essential in the financial domain. Accordingly, lexicons that achieve reasonable performance while providing clear explanations to users have been among the most popular resources in sentiment-based financial forecasting. Since deep-learning-based techniques offer little basis for interpreting their results, lexicons have consistently attracted the community's attention as a crucial tool in studies that demand explanations of the sentiment-estimation process. One challenge in constructing a financial sentiment lexicon is the domain-specific feature that the sentiment orientation of a word can change depending on an accompanying directional expression. For instance, the word "cost" typically conveys a negative sentiment; however, when it is juxtaposed with "decrease" to form the phrase "cost decrease," the associated sentiment is positive. Several studies have manually built lexicons containing directional expressions, but manual inspection inevitably requires intensive human labor and time. In this study, we propose to automatically construct a sentiment lexicon composed of direction-dependent words, which expresses each term as a pair consisting of a directional word and a direction-dependent word. Experimental results show that the proposed sentiment lexicon yields enhanced classification performance, proving the effectiveness of our method for the automated construction of a direction-aware sentiment lexicon.
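    A toy sketch of the direction-dependent idea, after the abstract's "cost decrease" example: the polarity of a term flips with the direction word it is paired with. Both tiny lexicons here are invented; the paper constructs its lexicon automatically.

```python
# Toy illustration of direction-dependent sentiment, after the abstract's
# "cost decrease" example. Both tiny lexicons are invented; the paper builds
# them automatically.
DIRECTION = {"increase": +1, "rise": +1, "decrease": -1, "fall": -1}
# Polarity a term carries when the paired direction word points UP.
WHEN_UP = {"cost": -1, "revenue": +1, "risk": -1, "profit": +1}

def phrase_sentiment(term: str, direction_word: str) -> int:
    """Return +1 (positive) or -1 (negative); polarity flips with direction."""
    return WHEN_UP[term] * DIRECTION[direction_word]

print(phrase_sentiment("cost", "decrease"))  # +1: "cost decrease" is positive
print(phrase_sentiment("cost", "increase"))  # -1: "cost increase" is negative
print(phrase_sentiment("revenue", "fall"))   # -1: "revenue fall" is negative
```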

    A sentiment corpus for the cryptocurrency financial domain: the CryptoLin corpus

    Manoel Fernando Alonso Gadi, Miguel Ángel Sicilia
    pp. 871-889
    Abstract: This paper describes Cryptocurrency Linguo (CryptoLin), a novel corpus of 2683 cryptocurrency-related news articles covering more than a three-year period. CryptoLin was human-annotated with discrete values representing negative, neutral, and positive news. Eighty-three people participated in the annotation process; each news title was randomly assigned and blindly annotated by three human annotators, one from each cohort, followed by a consensus mechanism using simple voting. The annotators were intentionally drawn from three cohorts of students with a very diverse set of nationalities and educational backgrounds to minimize bias as much as possible. When one annotator was in total disagreement with the other two (e.g., one negative vs. two positive, or one positive vs. two negative), we treated this as a minority report and defaulted the label to neutral. Fleiss's Kappa, Krippendorff's Alpha, and Gwet's AC1 inter-rater reliability coefficients demonstrate that CryptoLin has acceptable inter-annotator agreement. The dataset also includes the text span with all three manual label annotations, for further auditing of the annotation mechanism. To further assess the quality of the labeling and the usefulness of the CryptoLin dataset, we evaluate four pretrained sentiment analysis models on it: VADER, TextBlob, Flair, and FinBERT. VADER and FinBERT demonstrate reasonable performance on CryptoLin, indicating that the data was not annotated randomly and is therefore useful for further research. FinBERT (negative) performs best, indicating the advantage of being trained on financial news. Both the CryptoLin dataset and the Jupyter notebook with the analysis are available on the project's GitHub for reproducibility. Overall, CryptoLin complements current knowledge by providing a novel, publicly available cryptocurrency sentiment corpus (Gadi and Ángel Sicilia, 2022) and fostering research on cryptocurrency sentiment analysis and potential applications in behavioral science. This can be useful for businesses and policymakers who want to understand how cryptocurrencies are being used and how they might be regulated. Finally, the rules for selecting and assigning annotators make CryptoLin unique and interesting for new research on annotator selection, assignment, and bias.
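    The consensus rule described above is simple enough to state in code. The sketch below is a reconstruction from the abstract; treating a full three-way disagreement as neutral is an assumption, since the abstract does not cover that case.

```python
# Reconstruction of the consensus rule from the abstract: majority vote over
# three blind annotations, except that a positive-vs-negative split defaults
# to neutral ("minority report"). Treating a full three-way disagreement as
# neutral is an assumption; the abstract does not cover that case.
from collections import Counter

NEG, NEU, POS = -1, 0, 1

def consensus(labels: tuple) -> int:
    counts = Counter(labels)
    label, votes = counts.most_common(1)[0]
    if votes >= 2 and not (NEG in counts and POS in counts):
        return label
    return NEU  # minority report or three-way disagreement

print(consensus((POS, POS, NEG)))  # 0: minority report -> neutral
print(consensus((POS, POS, NEU)))  # 1: simple majority wins
print(consensus((NEG, NEG, NEG)))  # -1: unanimous
```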

    Slovenian parliamentary corpus siParl

    Katja Meden, Tomaž Erjavec, Andrej Pančur
    pp. 891-911
    Abstract: Parliamentary debates represent an essential part of democratic discourse and provide insights into various socio-demographic and linguistic phenomena. Parliamentary corpora, which contain transcripts of parliamentary debates and extensive metadata, are therefore an important resource for parliamentary discourse analysis and other research areas. This paper presents the Slovenian parliamentary corpus siParl, the latest version of which contains transcripts of plenary sessions and other legislative bodies of the Assembly of the Republic of Slovenia from 1990 to 2022, comprising more than 1 million speeches and 210 million words. We outline the development history of the corpus, mention other initiatives that siParl has influenced (such as the Parla-CLARIN encoding and the ParlaMint corpora of European parliaments), present the corpus creation process from initial data collection to the structural development and encoding of the corpus, and, given the growing influence of the ParlaMint corpora, compare siParl with the Slovenian ParlaMint-SI corpus. Finally, we discuss updates planned for the next version as well as the long-term development and enrichment of siParl.