A machine learning approach to Czech readability
Result identifiers
Result code in IS VaVaI
<a href="https://www.isvavai.cz/riv?ss=detail&h=RIV%2F00216208%3A90244%2F24%3A10495705" target="_blank" >RIV/00216208:90244/24:10495705 - isvavai.cz</a>
Result on the web
<a href="https://doi.org/10.4995/EuroCALL2023.2023.16991" target="_blank" >https://doi.org/10.4995/EuroCALL2023.2023.16991</a>
DOI - Digital Object Identifier
<a href="http://dx.doi.org/10.4995/EuroCALL2023.2023.16991" target="_blank" >10.4995/EuroCALL2023.2023.16991</a>
Alternative languages
Result language
English
Title in the original language
A machine learning approach to Czech readability
Result description in the original language
We present a new corpus of Czech texts labeled for second-language readability and report experiments training machine-learning classifiers to automatically label new texts by reading level. We compare the performance of traditional machine-learning models (Random Forest, XGBoost, Linear Discriminant Analysis, and XGBoost Random Forest) with a neural network (XLM-RoBERTa). The results of our research can be applied in tools that support learning Czech, a less commonly taught language. We extract 46 linguistic features in various categories for use with the traditional machine-learning algorithms, train models on these features, and evaluate their performance with recursive feature elimination to determine how informative each feature is for each model. We then compare those results with those of a transformer trained for the same task on the same corpus. XGBoost achieves the highest accuracy at 0.81, suggesting that traditional models can still perform as well as, or better than, newer models on this task. Notably, the transformer has the lowest mean F1 at 0.74.
Title in English
A machine learning approach to Czech readability
Result description in English
We present a new corpus of Czech texts labeled for second-language readability and report experiments training machine-learning classifiers to automatically label new texts by reading level. We compare the performance of traditional machine-learning models (Random Forest, XGBoost, Linear Discriminant Analysis, and XGBoost Random Forest) with a neural network (XLM-RoBERTa). The results of our research can be applied in tools that support learning Czech, a less commonly taught language. We extract 46 linguistic features in various categories for use with the traditional machine-learning algorithms, train models on these features, and evaluate their performance with recursive feature elimination to determine how informative each feature is for each model. We then compare those results with those of a transformer trained for the same task on the same corpus. XGBoost achieves the highest accuracy at 0.81, suggesting that traditional models can still perform as well as, or better than, newer models on this task. Notably, the transformer has the lowest mean F1 at 0.74.
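The feature-based pipeline described above — training a traditional classifier on linguistic features and ranking the features with recursive feature elimination (RFE) — can be sketched with scikit-learn. This is a minimal illustration only: the feature matrix below is synthetic, and the class `y` is a hypothetical stand-in for reading levels; the paper's actual corpus and its 46 linguistic features are not reproduced here.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
n_texts, n_features = 400, 46  # 46 features, matching the count in the paper

# Synthetic stand-in for the extracted linguistic features.
X = rng.normal(size=(n_texts, n_features))
# Hypothetical reading levels, loosely tied to a few features so the
# task is learnable by the classifier.
y = (X[:, :4].sum(axis=1) > 0).astype(int) + (X[:, 4] > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
# RFE drops the least informative features one step at a time,
# keeping the 10 that the model finds most useful.
rfe = RFE(clf, n_features_to_select=10).fit(X_tr, y_tr)

acc = accuracy_score(y_te, rfe.predict(X_te))
print(f"accuracy with top-10 features: {acc:.2f}")
print("selected feature indices:", np.flatnonzero(rfe.support_))
```

The same loop can be repeated for each classifier in the comparison (e.g. swapping in an XGBoost model) to see which features each model relies on.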
Classification
Type
D - Paper in conference proceedings
CEP field
—
OECD FORD field
60203 - Linguistics
Result linkages
Project
—
Linkages
—
Others
Year of publication
2024
Data confidentiality code
S - Complete and true data on the project are not subject to protection under special legal regulations
Data specific to the result type
Proceedings title
EuroCALL 2023. CALL for all Languages
ISBN
978-84-13-96131-6
ISSN
—
e-ISSN
—
Number of pages
6
Pages from-to
159-164
Publisher name
Universitat Politècnica de València
Place of publication
Valencia
Event venue
University of Iceland
Event date
15. 8. 2023
Event type by nationality
WRD - Worldwide event
Article UT WoS code
—