Index-based N-gram extraction from large document collections

Result identifiers

Result code in IS VaVaI
<a href="https://www.isvavai.cz/riv?ss=detail&h=RIV%2F61989100%3A27240%2F11%3A86081508" target="_blank" >RIV/61989100:27240/11:86081508 - isvavai.cz</a>
Result on the web
<a href="http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6093324" target="_blank" >http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6093324</a>
DOI - Digital Object Identifier
<a href="http://dx.doi.org/10.1109/ICDIM.2011.6093324" target="_blank" >10.1109/ICDIM.2011.6093324</a>

Alternative languages

Result language
English
Title in the original language
Index-based N-gram extraction from large document collections
Result description in the original language
N-grams are used in applications that search text documents, especially where one must work with phrases, e.g. in plagiarism detection. An n-gram is a sequence of n terms (or, more generally, tokens) from a document. We obtain a set of n-grams by moving a sliding window from the beginning to the end of the document. During extraction, we must remove duplicate n-grams and store additional values for each n-gram type, e.g. the n-gram type frequency for each document; which values are stored depends on the query model used. Previous works use a sorting algorithm to compute the n-gram frequency. These approaches must handle a high number of identical n-grams, resulting in high time and space overhead. Moreover, these techniques are often main-memory only, which means they can be executed only for small or medium-sized collections. In this paper, we present an index-based method for n-gram extraction from large collections. This method uses common data structures such as the B-tree and the hash table. We show
Title in English
Index-based N-gram extraction from large document collections
Result description in English
N-grams are used in applications that search text documents, especially where one must work with phrases, e.g. in plagiarism detection. An n-gram is a sequence of n terms (or, more generally, tokens) from a document. We obtain a set of n-grams by moving a sliding window from the beginning to the end of the document. During extraction, we must remove duplicate n-grams and store additional values for each n-gram type, e.g. the n-gram type frequency for each document; which values are stored depends on the query model used. Previous works use a sorting algorithm to compute the n-gram frequency. These approaches must handle a high number of identical n-grams, resulting in high time and space overhead. Moreover, these techniques are often main-memory only, which means they can be executed only for small or medium-sized collections. In this paper, we present an index-based method for n-gram extraction from large collections. This method uses common data structures such as the B-tree and the hash table. We show
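The extraction step described in the abstract (a window moved over the tokens, duplicate n-grams collapsed into one n-gram type, and a per-document frequency kept for each type) can be illustrated with a minimal sketch. It is not the authors' implementation: an in-memory Python dictionary stands in for the hash table, no B-tree or disk-based storage is involved, and the names `extract_ngrams`, `build_ngram_index`, and the sample documents are hypothetical.

```python
from collections import defaultdict


def extract_ngrams(tokens, n):
    """Yield each n-gram (as a tuple) from a window moved over the tokens."""
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i:i + n])


def build_ngram_index(documents, n):
    """Map each distinct n-gram type to a {doc_id: frequency} dictionary.

    Assumption: whitespace tokenization and lowercasing; the abstract does
    not specify how tokens are produced.
    """
    index = defaultdict(lambda: defaultdict(int))
    for doc_id, text in documents.items():
        tokens = text.lower().split()
        for gram in extract_ngrams(tokens, n):
            # A repeated n-gram only increments a counter; it is never
            # stored a second time, so no later sort/deduplicate pass is needed.
            index[gram][doc_id] += 1
    return index


# Hypothetical sample collection, for illustration only.
docs = {
    "d1": "to be or not to be",
    "d2": "not to be outdone",
}
index = build_ngram_index(docs, 2)
to_be = dict(index[("to", "be")])
```

Looking up each n-gram as it is produced means a repeated n-gram costs one counter update, whereas a sort-based approach first has to materialize every occurrence, which is the overhead the abstract attributes to earlier work.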

Classification

Type
D - Article in proceedings
CEP field
IN - Informatics
OECD FORD field
—

Result linkages

Project
<a href="/cs/project/GAP202%2F10%2F0573" target="_blank" >GAP202/10/0573: Zpracování XML dat v heterogenních a dynamických prostředích (Processing of XML data in heterogeneous and dynamic environments)</a><br>
Linkages
P - Research and development project financed from public funds (with a link to CEP)<br>S - Specific university research

Other

Year of implementation
2011
Data confidentiality code
S - Complete and true data on the project are not subject to protection under special legal regulations

Data specific to the result type

Title of the article in the proceedings
2011 6th International Conference on Digital Information Management, ICDIM 2011
ISBN
978-1-4577-1538-9
ISSN
—
e-ISSN
—
Number of pages
6
Pages from-to
73-78
Publisher name
IEEE
Place of publication
NEW YORK
Event location
Melbourne
Event date
12 September 2011
Event type by nationality
WRD - Worldwide event
UT WoS code of the article
—

Similar results (10)

Využití N-Gramů při klasifikaci textu (Use of N-grams in text classification)
Modelling crosslinguistic n-gram correspondence in typologically different languages
N-Gram-Based Text Compression
