Dataset from a human-in-the-loop approach to identify functionally important protein residues from literature

Result identifiers

Result code in IS VaVaI
<a href="https://www.isvavai.cz/riv?ss=detail&h=RIV%2F00216224%3A14740%2F24%3A00138864" target="_blank" >RIV/00216224:14740/24:00138864 - isvavai.cz</a>
Result on the web
<a href="https://www.nature.com/articles/s41597-024-03841-9" target="_blank" >https://www.nature.com/articles/s41597-024-03841-9</a>
DOI - Digital Object Identifier
<a href="http://dx.doi.org/10.1038/s41597-024-03841-9" target="_blank" >10.1038/s41597-024-03841-9</a>

Alternative languages

Result language
English
Title in the original language
Dataset from a human-in-the-loop approach to identify functionally important protein residues from literature
Result description in the original language
We present a novel system that leverages curators in the loop to develop a dataset and model for detecting structure features and functional annotations at residue-level from standard publication text. Our approach involves the integration of data from multiple resources, including PDBe, EuropePMC, PubMedCentral, and PubMed, combined with annotation guidelines from UniProt, and LitSuggest and HuggingFace models as tools in the annotation process. A team of seven annotators manually curated ten articles for named entities, which we utilized to train a starting PubmedBert model from HuggingFace. Using a human-in-the-loop annotation system, we iteratively developed the best model with commendable performance metrics of 0.90 for precision, 0.92 for recall, and 0.91 for F1-measure. Our proposed system showcases a successful synergy of machine learning techniques and human expertise in curating a dataset for residue-level functional annotations and protein structure features. The results demonstrate the potential for broader applications in protein research, bridging the gap between advanced machine learning models and the indispensable insights of domain experts.
Title in English
Dataset from a human-in-the-loop approach to identify functionally important protein residues from literature
Result description in English
We present a novel system that leverages curators in the loop to develop a dataset and model for detecting structure features and functional annotations at residue-level from standard publication text. Our approach involves the integration of data from multiple resources, including PDBe, EuropePMC, PubMedCentral, and PubMed, combined with annotation guidelines from UniProt, and LitSuggest and HuggingFace models as tools in the annotation process. A team of seven annotators manually curated ten articles for named entities, which we utilized to train a starting PubmedBert model from HuggingFace. Using a human-in-the-loop annotation system, we iteratively developed the best model with commendable performance metrics of 0.90 for precision, 0.92 for recall, and 0.91 for F1-measure. Our proposed system showcases a successful synergy of machine learning techniques and human expertise in curating a dataset for residue-level functional annotations and protein structure features. The results demonstrate the potential for broader applications in protein research, bridging the gap between advanced machine learning models and the indispensable insights of domain experts.

Classification

Type
J<sub>imp</sub> - Article in a periodical indexed in the Web of Science database
CEP field
—
OECD FORD field
10700 - Other natural sciences

Result continuities

Project
—
Continuities
—

Others

Publication year
2024
Data confidentiality code
S - Complete and true data on the project are not subject to protection under special legal regulations

Data specific to the result type

Name of the periodical
Scientific Data
ISSN
2052-4463
e-ISSN
2052-4463
Volume of the periodical
11
Issue of the periodical within the volume
1
Country of the periodical's publisher
DE - Federal Republic of Germany
Number of pages of the result
18
Pages from-to
1-18
UT WoS code of the article
001325129100022
EID of the result in the Scopus database
2-s2.0-85205275590

Similar results (10)

Genomic benchmarks: a collection of datasets for genomic sequence classification
Reproducible MS/MS library cleaning pipeline in matchms
SoluProtMutDB: A manually curated database of protein solubility changes upon mutations
