Dataset from a human-in-the-loop approach to identify functionally important protein residues from literature
Result identifiers
Result code in IS VaVaI
RIV/00216224:14740/24:00138864 - isvavai.cz (https://www.isvavai.cz/riv?ss=detail&h=RIV%2F00216224%3A14740%2F24%3A00138864)
Result on the web
https://www.nature.com/articles/s41597-024-03841-9
DOI - Digital Object Identifier
10.1038/s41597-024-03841-9 (http://dx.doi.org/10.1038/s41597-024-03841-9)
Alternative languages
Result language
English
Title in original language
Dataset from a human-in-the-loop approach to identify functionally important protein residues from literature
Result description in original language
We present a system that keeps curators in the loop to develop a dataset and a model for detecting structure features and functional annotations at the residue level in standard publication text. Our approach integrates data from multiple resources, including PDBe, EuropePMC, PubMedCentral, and PubMed, with annotation guidelines from UniProt, and uses LitSuggest and HuggingFace models as tools in the annotation process. A team of seven annotators manually curated ten articles for named entities, which we used to train a starting PubMedBERT model from HuggingFace. Using a human-in-the-loop annotation system, we iteratively improved the model; the best version reached 0.90 precision, 0.92 recall, and 0.91 F1-measure. The system shows that machine learning techniques and human expertise can be combined effectively to curate a dataset for residue-level functional annotations and protein structure features. The results suggest potential for broader applications in protein research, connecting advanced machine learning models with the insights of domain experts.
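As a quick consistency check on the metrics quoted in the description, the F1-measure can be recomputed as the harmonic mean of precision and recall. This is only a sketch: the function name `f1_score` is introduced here for illustration, and it assumes the three reported values are single aggregate figures rather than averages over entity types, which the record does not state.

```python
def f1_score(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall (the F1-measure)."""
    if precision + recall == 0:
        # Avoid division by zero when both inputs are zero.
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values quoted in the description above.
reported_precision = 0.90
reported_recall = 0.92

f1 = f1_score(reported_precision, reported_recall)
print(round(f1, 2))  # 0.91
```

The recomputed value rounds to the reported 0.91, so the three figures are mutually consistent under that assumption.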
Classification
Type
J<sub>imp</sub> - Article in a periodical indexed in the Web of Science database
CEP field
—
OECD FORD field
10700 - Other natural sciences
Result linkages
Project
—
Linkages
—
Other
Year of application
2024
Data confidentiality code
S - Complete and truthful project data are not subject to protection under special legal regulations
Data specific to the result type
Periodical name
Scientific Data
ISSN
2052-4463
e-ISSN
2052-4463
Periodical volume
11
Issue number within the volume
1
Publisher's country
DE - Federal Republic of Germany
Number of pages
18
Pages from-to
1-18
UT WoS code of the article
001325129100022
EID of the result in the Scopus database
2-s2.0-85205275590