New Human-Annotated Dataset of Czech Health Records for Training Medical Concept Recognition Models
The result's identifiers
Result code in IS VaVaI
<a href="https://www.isvavai.cz/riv?ss=detail&h=RIV%2F00216224%3A14330%2F24%3A00136991" target="_blank" >RIV/00216224:14330/24:00136991 - isvavai.cz</a>
Result on the web
<a href="http://dx.doi.org/10.1007/978-3-031-70563-2_9" target="_blank" >http://dx.doi.org/10.1007/978-3-031-70563-2_9</a>
DOI - Digital Object Identifier
<a href="http://dx.doi.org/10.1007/978-3-031-70563-2_9" target="_blank" >10.1007/978-3-031-70563-2_9</a>
Alternative languages
Result language
English
Original language name
New Human-Annotated Dataset of Czech Health Records for Training Medical Concept Recognition Models
Original language description
Following the widespread success of recent large language models (LLMs) in various NLP tasks, this paper focuses on understanding medical text content. Adapting a foundational LLM to the medical domain requires a special kind of dataset in which core medical concepts are accurately annotated. This paper addresses the need for better medical concept recognition in free-text electronic health records in low-resourced Slavic languages and introduces CSEHR, a new human-annotated dataset of Czech oncology health records. It describes the dataset's inception, management, considerations, and processing, and presents baseline concept recognition model results. XLM-RoBERTa models trained on the dataset using 5-fold cross-validation achieved an average weighted F1 score of 0.672 in exact and 0.777 in partial medical concept recognition, ranging from 0.335 to 0.857 across individual concept classes. The paper then describes future plans for bootstrapping larger annotated corpora from the CSEHR dataset and for making the dataset publicly available. This endeavor is unique in the realm of Slavic languages, and already at this stage it represents a major step in the field of Slavic medical concept recognition.
Czech name
—
Czech description
—
Classification
Type
D - Article in proceedings
CEP classification
—
OECD FORD branch
10200 - Computer and information sciences
Result continuities
Project
<a href="/en/project/LM2023062" target="_blank" >LM2023062: Digital Research Infrastructure for Language Technologies, Arts and Humanities</a>
Continuities
P - Research and development project financed from public funds (with a link to CEP)
S - Specific research at universities
Others
Publication year
2024
Confidentiality
S - Complete and truthful data on the project are not subject to protection under special legal regulations
Data specific for result type
Article name in the collection
Text, Speech, and Dialogue
ISBN
9783031705625
ISSN
0302-9743
e-ISSN
—
Number of pages
11
Pages from-to
110-120
Publisher name
Springer Nature Switzerland
Place of publication
Cham
Event location
Brno
Event date
Jan 1, 2024
Type of event by nationality
WRD - Worldwide event
UT code for WoS article
001307840300009