Complex Word Identification for Italian Language: A Dictionary–based Approach
The result's identifiers
Result code in IS VaVaI
<a href="https://www.isvavai.cz/riv?ss=detail&h=RIV%2F00216208%3A11320%2F25%3AST9IB3JF" target="_blank" >RIV/00216208:11320/25:ST9IB3JF - isvavai.cz</a>
Result on the web
<a href="https://aclanthology.org/2024.clib-1.12" target="_blank" >https://aclanthology.org/2024.clib-1.12</a>
DOI - Digital Object Identifier
—
Alternative languages
Result language
English
Original language name
Complex Word Identification for Italian Language: A Dictionary–based Approach
Original language description
Assessing word complexity in Italian poses significant challenges, particularly due to the absence of a standardized dataset. This study introduces the first automatic model designed to identify word complexity for native Italian speakers. A dictionary of simple and complex words was constructed, and various configurations of linguistic features were explored to find the best statistical classifier based on the Random Forest algorithm. Using the probability that a word belongs to each class, the models' predictions were compared with human assessments derived from a dataset annotated for complexity perception. Finally, the degree of agreement between the model predictions and the human inter-annotator agreement was analyzed using Spearman correlation. Our findings indicate that a model incorporating both linguistic features and word embeddings performed better than simpler models and showed a correlation with human judgements similar to the inter-annotator agreement. This study demonstrates the feasibility of an automatic system for detecting complexity in Italian, with good performance and effectiveness comparable to that of humans in this subjective task.
Czech name
—
Czech description
—
Classification
Type
D - Article in proceedings
CEP classification
—
OECD FORD branch
10201 - Computer sciences, information science, bioinformatics (hardware development to be 2.2, social aspect to be 5.8)
Result continuities
Project
—
Continuities
—
Others
Publication year
2024
Confidentiality
S - Complete and truthful data on the project are not subject to protection under special legal regulations
Data specific for result type
Article name in the collection
Proceedings of the Sixth International Conference on Computational Linguistics in Bulgaria (CLIB 2024)
ISBN
—
ISSN
2367-5578
e-ISSN
—
Number of pages
11
Pages from-to
119-129
Publisher name
Department of Computational Linguistics, Institute for Bulgarian Language, Bulgarian Academy of Sciences
Place of publication
—
Event location
Sofia, Bulgaria
Event date
Jan 1, 2025
Type of event by nationality
WRD - Worldwide event
UT code for WoS article
—