Revealing data leakage in protein interaction benchmarks
Identifikátory výsledku
Kód výsledku v IS VaVaI
<a href="https://www.isvavai.cz/riv?ss=detail&h=RIV%2F68407700%3A21730%2F24%3A00380615" target="_blank" >RIV/68407700:21730/24:00380615 - isvavai.cz</a>
Výsledek na webu
<a href="https://openreview.net/attachment?id=ORMXYUK5IY&name=pdf" target="_blank" >https://openreview.net/attachment?id=ORMXYUK5IY&name=pdf</a>
DOI - Digital Object Identifier
—
Alternativní jazyky
Jazyk výsledku
angličtina
Název v původním jazyce
Revealing data leakage in protein interaction benchmarks
Popis výsledku v původním jazyce
In recent years, there has been remarkable progress in machine learning for protein–protein interactions. However, prior work has predominantly focused on improving learning algorithms, with less attention paid to evaluation strategies and data preparation. Here, we demonstrate that further development of machine learning methods may be hindered by the quality of existing train-test splits. Specifically, we find that commonly used splitting strategies for protein complexes, based on protein sequence or metadata similarity, introduce major data leakage. This may result in overoptimistic evaluation of generalization, as well as unfair benchmarking of the models, biased towards assessing their overfitting capacity rather than practical utility. To overcome the data leakage, we recommend constructing data splits based on 3D structural similarity of protein–protein interfaces and suggest corresponding algorithms. We believe that addressing the data leakage problem is critical for further progress in this research area.
Název v anglickém jazyce
Revealing data leakage in protein interaction benchmarks
Popis výsledku anglicky
In recent years, there has been remarkable progress in machine learning for protein–protein interactions. However, prior work has predominantly focused on improving learning algorithms, with less attention paid to evaluation strategies and data preparation. Here, we demonstrate that further development of machine learning methods may be hindered by the quality of existing train-test splits. Specifically, we find that commonly used splitting strategies for protein complexes, based on protein sequence or metadata similarity, introduce major data leakage. This may result in overoptimistic evaluation of generalization, as well as unfair benchmarking of the models, biased towards assessing their overfitting capacity rather than practical utility. To overcome the data leakage, we recommend constructing data splits based on 3D structural similarity of protein–protein interfaces and suggest corresponding algorithms. We believe that addressing the data leakage problem is critical for further progress in this research area.
Klasifikace
Druh
D - Stať ve sborníku
CEP obor
—
OECD FORD obor
10201 - Computer sciences, information science, bioinformathics (hardware development to be 2.2, social aspect to be 5.8)
Návaznosti výsledku
Projekt
—
Návaznosti
—
Ostatní
Rok uplatnění
2024
Kód důvěrnosti údajů
S - Úplné a pravdivé údaje o projektu nepodléhají ochraně podle zvláštních právních předpisů
Údaje specifické pro druh výsledku
Název statě ve sborníku
Proceeding The Twelfth International Conference on Learning Representations (ICLR 2024)
ISBN
9781713898658
ISSN
—
e-ISSN
—
Počet stran výsledku
13
Strana od-do
—
Název nakladatele
International Conference on Learning Representations
Místo vydání
—
Místo konání akce
Vídeň
Datum konání akce
7. 5. 2024
Typ akce podle státní příslušnosti
WRD - Celosvětová akce
Kód UT WoS článku
—