Genomic benchmarks: a collection of datasets for genomic sequence classification

The result's identifiers

  • Result code in IS VaVaI

    RIV/00216224:14740/23:00131330 - isvavai.cz (https://www.isvavai.cz/riv?ss=detail&h=RIV%2F00216224%3A14740%2F23%3A00131330)

  • Result on the web

    https://link.springer.com/article/10.1186/s12863-023-01123-8

  • DOI - Digital Object Identifier

    10.1186/s12863-023-01123-8 (http://dx.doi.org/10.1186/s12863-023-01123-8)

Alternative languages

  • Result language

    English

  • Original language name

    Genomic benchmarks: a collection of datasets for genomic sequence classification

  • Original language description

    Background: Recently, deep neural networks have been successfully applied in many biological fields. In 2020, the deep learning model AlphaFold won the protein folding competition with predicted structures within the error tolerance of experimental methods. However, this solution to the most prominent bioinformatic challenge of the past 50 years has been possible only thanks to a carefully curated benchmark of experimentally determined protein structures. In genomics, we have similar challenges (annotation of genomes and identification of functional elements), but we currently lack benchmarks similar to the protein folding competition.

    Results: Here we present a collection of curated and easily accessible sequence classification datasets in the field of genomics. The proposed collection is based on a combination of novel datasets constructed from the mining of publicly available databases and existing datasets obtained from published articles. The collection currently contains nine datasets that focus on regulatory elements (promoters, enhancers, open chromatin regions) from three model organisms: human, mouse, and roundworm. A simple convolutional neural network is also included in the repository and can be used as a baseline model. Benchmarks and the baseline model are distributed as the Python package ‘genomic-benchmarks’, and the code is available at https://github.com/ML-Bioinfo-CEITEC/genomic_benchmarks.

    Conclusions: Deep learning techniques have revolutionized many biological fields, but mainly thanks to carefully curated benchmarks. For the field of genomics, we propose a collection of benchmark datasets for the classification of genomic sequences with an interface for the most commonly used deep learning libraries, an implementation of a simple neural network, and a training framework that can be used as a starting point for future research. The main aim of this effort is to create a repository for shared datasets that will make machine learning for genomics more comparable and reproducible while reducing the overhead for researchers who want to enter the field, leading to healthy competition and new discoveries.

  • Czech name

  • Czech description

Classification

  • Type

    Jimp - Article in a specialist periodical, which is included in the Web of Science database

  • CEP classification

  • OECD FORD branch

    10610 - Biophysics

Result continuities

  • Project

    The result was created during the implementation of more than one project.

  • Continuities

    P - Research and development project financed from public funds (with a link to CEP)

Others

  • Publication year

    2023

  • Confidentiality

    S - Complete and truthful data on the project are not subject to protection under special legal regulations

Data specific for result type

  • Name of the periodical

    BMC Genomic Data

  • ISSN

    2730-6844

  • e-ISSN

    2730-6844

  • Volume of the periodical

    24

  • Issue of the periodical within the volume

    1

  • Country of publishing house

    GB - UNITED KINGDOM

  • Number of pages

    9

  • Pages from-to

    1-9

  • UT code for WoS article

    000981254200001

  • EID of the result in the Scopus database

    2-s2.0-85157964171