Fast Hybrid Data Structure for a Large Alphabet K-Mers Indexing for Whole Genome Alignment

Identifikátory výsledku

Kód výsledku v IS VaVaI
<a href="https://www.isvavai.cz/riv?ss=detail&h=RIV%2F61989100%3A27240%2F21%3A10248617" target="_blank" >RIV/61989100:27240/21:10248617 - isvavai.cz</a>
Výsledek na webu
<a href="https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9583298" target="_blank" >https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9583298</a>
DOI - Digital Object Identifier
<a href="http://dx.doi.org/10.1109/ACCESS.2021.3121749" target="_blank" >10.1109/ACCESS.2021.3121749</a>

Alternativní jazyky

Jazyk výsledku
angličtina
Název v původním jazyce
Fast Hybrid Data Structure for a Large Alphabet K-Mers Indexing for Whole Genome Alignment
Popis výsledku v původním jazyce
The most common index data structures used by whole genome aligners (WGA) are based on suffix trees (ST), suffix arrays, and FM-indexes. These data structures show good performance results as WGA works with sequences of letters over small alphabets; for example, four letters a, c, t, g for DNA alignment. A novel whole genome aligner, which we are developing, will work with distances between the label sites on DNA samples, which are represented as a sequence of positive integers. The size of alphabet sigma is theoretically unlimited. This has prompted us to investigate if there is a better structure that would improve search performance on large alphabets compared to the commonly used suffix-based structures. This paper presents the implementation of a highly optimized hybrid index data structure based on a ternary search tree (TST) and hash tables, which perform much better when working with strings on large alphabets compared to the ST. Single core parallelism was achieved using advanced vector extensions (AVX) single instruction multiple data (SIMD) instruction set. When searching for short k-mers over an alphabet of 25, 695 letters, our index search performance was up to 29 times better than the search performance of the reference ST. When the alphabet was compressed to approximately 1; 300 letters, our index search performance was still up to 2.6 times better than the ST. The source code is available free on http://olgen.cz/Resources/Upload/Home/public/software/hds.zip under the MIT license.
Název v anglickém jazyce
Fast Hybrid Data Structure for a Large Alphabet K-Mers Indexing for Whole Genome Alignment
Popis výsledku anglicky
The most common index data structures used by whole genome aligners (WGA) are based on suffix trees (ST), suffix arrays, and FM-indexes. These data structures show good performance results as WGA works with sequences of letters over small alphabets; for example, four letters a, c, t, g for DNA alignment. A novel whole genome aligner, which we are developing, will work with distances between the label sites on DNA samples, which are represented as a sequence of positive integers. The size of alphabet sigma is theoretically unlimited. This has prompted us to investigate if there is a better structure that would improve search performance on large alphabets compared to the commonly used suffix-based structures. This paper presents the implementation of a highly optimized hybrid index data structure based on a ternary search tree (TST) and hash tables, which perform much better when working with strings on large alphabets compared to the ST. Single core parallelism was achieved using advanced vector extensions (AVX) single instruction multiple data (SIMD) instruction set. When searching for short k-mers over an alphabet of 25, 695 letters, our index search performance was up to 29 times better than the search performance of the reference ST. When the alphabet was compressed to approximately 1; 300 letters, our index search performance was still up to 2.6 times better than the ST. The source code is available free on http://olgen.cz/Resources/Upload/Home/public/software/hds.zip under the MIT license.

Klasifikace

Druh
J<sub>imp</sub> - Článek v periodiku v databázi Web of Science
CEP obor
—
OECD FORD obor
10200 - Computer and information sciences

Návaznosti výsledku

Projekt
<a href="/cs/project/NU20-06-00269" target="_blank" >NU20-06-00269: Využití buněčných profilů a proteomiky synoviální tekutiny, případně tkání pro podporu klinického rozhodování u osteoartrózy kolena</a><br>
Návaznosti
P - Projekt vyzkumu a vyvoje financovany z verejnych zdroju (s odkazem do CEP)

Ostatní

Rok uplatnění
2021
Kód důvěrnosti údajů
S - Úplné a pravdivé údaje o projektu nepodléhají ochraně podle zvláštních právních předpisů

Údaje specifické pro druh výsledku

Název periodika
IEEE Access
ISSN
2169-3536
e-ISSN
—
Svazek periodika
9
Číslo periodika v rámci svazku
2021
Stát vydavatele periodika
US - Spojené státy americké
Počet stran výsledku
8
Strana od-do
161890-161897
Kód UT WoS článku
000730465200001
EID výsledku v databázi Scopus
2-s2.0-85121784198

Podobné výsledky(10)

Towards faster matching algorithm using ternary tree in the area of genome mapping Testing the impact of Whole Genome Amplification on genome comparison using the polyploid flagellated Giardia duodenalis as a model On-line Searching in IUPAC nucleotide sequences

Co hledáte?

Rychlé hledání

Chytré vyhledávání

Fast Hybrid Data Structure for a Large Alphabet K-Mers Indexing for Whole Genome Alignment

Identifikátory výsledku

Alternativní jazyky

Klasifikace

Návaznosti výsledku

Ostatní

Údaje specifické pro druh výsledku

Podobné výsledky(10)

Co hledáte?

Rychlé hledání

Chytré vyhledávání

Popis výsledku

Identifikátory výsledku

Identifikátory výsledku

Alternativní jazyky

Alternativní jazyky

Klasifikace

Klasifikace

Návaznosti výsledku

Návaznosti výsledku

Ostatní

Ostatní

Údaje specifické pro druh výsledku

Údaje specifické pro druh výsledku

Podobné výsledky(10)