All

What are you looking for?

All
Projects
Results
Organizations

Quick search

  • Projects supported by TA ČR
  • Excellent projects
  • Projects with the highest public support
  • Current projects

Smart search

  • That is how I find a specific +word
  • That is how I leave the -word out of the results
  • “That is how I can find the whole phrase”

Fast Hybrid Data Structure for a Large Alphabet K-Mers Indexing for Whole Genome Alignment

The result's identifiers

  • Result code in IS VaVaI

    <a href="https://www.isvavai.cz/riv?ss=detail&h=RIV%2F61989100%3A27240%2F21%3A10248617" target="_blank" >RIV/61989100:27240/21:10248617 - isvavai.cz</a>

  • Result on the web

    <a href="https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9583298" target="_blank" >https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9583298</a>

  • DOI - Digital Object Identifier

    <a href="http://dx.doi.org/10.1109/ACCESS.2021.3121749" target="_blank" >10.1109/ACCESS.2021.3121749</a>

Alternative languages

  • Result language

    angličtina

  • Original language name

    Fast Hybrid Data Structure for a Large Alphabet K-Mers Indexing for Whole Genome Alignment

  • Original language description

    The most common index data structures used by whole genome aligners (WGA) are based on suffix trees (ST), suffix arrays, and FM-indexes. These data structures show good performance results as WGA works with sequences of letters over small alphabets; for example, four letters a, c, t, g for DNA alignment. A novel whole genome aligner, which we are developing, will work with distances between the label sites on DNA samples, which are represented as a sequence of positive integers. The size of alphabet sigma is theoretically unlimited. This has prompted us to investigate if there is a better structure that would improve search performance on large alphabets compared to the commonly used suffix-based structures. This paper presents the implementation of a highly optimized hybrid index data structure based on a ternary search tree (TST) and hash tables, which perform much better when working with strings on large alphabets compared to the ST. Single core parallelism was achieved using advanced vector extensions (AVX) single instruction multiple data (SIMD) instruction set. When searching for short k-mers over an alphabet of 25, 695 letters, our index search performance was up to 29 times better than the search performance of the reference ST. When the alphabet was compressed to approximately 1; 300 letters, our index search performance was still up to 2.6 times better than the ST. The source code is available free on http://olgen.cz/Resources/Upload/Home/public/software/hds.zip under the MIT license.

  • Czech name

  • Czech description

Classification

  • Type

    J<sub>imp</sub> - Article in a specialist periodical, which is included in the Web of Science database

  • CEP classification

  • OECD FORD branch

    10200 - Computer and information sciences

Result continuities

  • Project

    <a href="/en/project/NU20-06-00269" target="_blank" >NU20-06-00269: Utility of cellular profiles and proteomics of synovial fluid and periprosthetic tissues for clinical decision making in knee osteoarthritis</a><br>

  • Continuities

    P - Projekt vyzkumu a vyvoje financovany z verejnych zdroju (s odkazem do CEP)

Others

  • Publication year

    2021

  • Confidentiality

    S - Úplné a pravdivé údaje o projektu nepodléhají ochraně podle zvláštních právních předpisů

Data specific for result type

  • Name of the periodical

    IEEE Access

  • ISSN

    2169-3536

  • e-ISSN

  • Volume of the periodical

    9

  • Issue of the periodical within the volume

    2021

  • Country of publishing house

    US - UNITED STATES

  • Number of pages

    8

  • Pages from-to

    161890-161897

  • UT code for WoS article

    000730465200001

  • EID of the result in the Scopus database

    2-s2.0-85121784198