A machine learning approach to Czech readability
The result's identifiers
Result code in IS VaVaI
<a href="https://www.isvavai.cz/riv?ss=detail&h=RIV%2F00216208%3A90244%2F24%3A10495705" target="_blank" >RIV/00216208:90244/24:10495705 - isvavai.cz</a>
Result on the web
<a href="https://doi.org/10.4995/EuroCALL2023.2023.16991" target="_blank" >https://doi.org/10.4995/EuroCALL2023.2023.16991</a>
DOI - Digital Object Identifier
<a href="http://dx.doi.org/10.4995/EuroCALL2023.2023.16991" target="_blank" >10.4995/EuroCALL2023.2023.16991</a>
Alternative languages
Result language
English
Original language name
A machine learning approach to Czech readability
Original language description
We present a new corpus of Czech texts labeled for second-language readability, and show results of experiments to train machine-learning classifiers to automatically label new texts according to reading level. We report results comparing the performance of traditional machine-learning models (including Random Forest, XGBoost, Linear Discriminant Analysis, and XGBoost Random Forest) and a neural network (XLM-RoBERTa). The results of our research can be implemented in tools to support learning Czech, a less commonly taught language. We extract 46 linguistic features in various categories for use with traditional machine-learning algorithms. We train models on these features and evaluate their performance with recursive feature elimination to determine how informative each feature is for each model. We then compare these results with those of a transformer trained for the same task on the same corpus. XGBoost achieves the highest accuracy at 0.81, suggesting that these traditional models can still perform as well as, or better than, newer models on this task. Notably, the transformer has the lowest mean F1 at 0.74.
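The description reports two different metrics, accuracy and mean F1, which can diverge when reading levels are unevenly represented. The following is a minimal sketch of how the two could be computed, assuming that "mean F1" refers to the unweighted (macro) average of per-class F1 scores; the record does not state which averaging was used. The level names and labels are hypothetical and are not data from the paper.

```python
from collections import Counter


def accuracy(gold, pred):
    """Fraction of texts whose predicted level matches the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over every label seen in gold or pred.

    Assumption: macro averaging; the source only says "mean F1".
    """
    labels = sorted(set(gold) | set(pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label p was wrong
            fn[g] += 1  # gold label g was missed
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical reading-level labels for six texts (illustration only).
gold = ["A1", "A1", "A1", "A1", "A2", "B1"]
pred = ["A1", "A1", "A1", "A1", "A1", "B1"]

print(f"accuracy = {accuracy(gold, pred):.3f}")
print(f"macro F1 = {macro_f1(gold, pred):.3f}")
```

In this toy case a single missed text in a rare level lowers macro F1 far more than accuracy, which is one reason a ranking of models by accuracy need not match a ranking by mean F1.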
Czech name
—
Czech description
—
Classification
Type
D - Article in proceedings
CEP classification
—
OECD FORD branch
60203 - Linguistics
Result continuities
Project
—
Continuities
—
Others
Publication year
2024
Confidentiality
S - Complete and true data on the project are not subject to protection under special legal regulations
Data specific for result type
Article name in the collection
EuroCALL 2023. CALL for all Languages
ISBN
978-84-13-96131-6
ISSN
—
e-ISSN
—
Number of pages
6
Pages from-to
159-164
Publisher name
Universitat Politècnica de València
Place of publication
Valencia
Event location
University of Iceland
Event date
Aug 15, 2023
Type of event by nationality
WRD - Worldwide event
UT code for WoS article
—