Index-based N-gram extraction from large document collections

The result's identifiers

  • Result code in IS VaVaI

    RIV/61989100:27240/11:86081508 - isvavai.cz (https://www.isvavai.cz/riv?ss=detail&h=RIV%2F61989100%3A27240%2F11%3A86081508)

  • Result on the web

    http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6093324

  • DOI - Digital Object Identifier

    10.1109/ICDIM.2011.6093324 (http://dx.doi.org/10.1109/ICDIM.2011.6093324)

Alternative languages

  • Result language

    English

  • Original language name

    Index-based N-gram extraction from large document collections

  • Original language description

    N-grams are used in applications that search text documents, especially when phrases must be handled, e.g. in plagiarism detection. An n-gram is a sequence of n terms (or, more generally, tokens) from a document. We obtain a set of n-grams by moving a sliding window from the beginning to the end of the document. During extraction we must remove duplicate n-grams and store additional values for each n-gram type, e.g. the n-gram type frequency for each document, depending on the query model used. Previous works use a sorting algorithm to compute the n-gram frequency. These approaches must handle a high number of identical n-grams, resulting in high time and space overhead. Moreover, these techniques are often main-memory only, which means they can be executed only for small or medium-sized collections. In this paper, we show an index-based method for n-gram extraction from large collections. This method utilizes common data structures like the B-tree and hash table. We show

  • Czech name

  • Czech description
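
    The sliding-window extraction and per-document frequency counting described in the original language description above can be illustrated with a minimal sketch. This is not the paper's index-based method: it keeps everything in a main-memory Python dict (a hash table) and assumes simple whitespace tokenization. The function names `extract_ngrams` and `ngram_frequencies` are hypothetical and chosen only for this example.

```python
from typing import Dict, List, Tuple

NGram = Tuple[str, ...]


def extract_ngrams(tokens: List[str], n: int) -> List[NGram]:
    """Slide a window of n tokens from the beginning to the end of the document."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_frequencies(documents: Dict[str, str], n: int) -> Dict[NGram, Dict[str, int]]:
    """Map each distinct n-gram type to its frequency in every document.

    Assumption: tokens are lowercased, whitespace-separated words.
    Duplicates are merged by the dict lookup, so no sorting step is needed.
    """
    index: Dict[NGram, Dict[str, int]] = {}
    for doc_id, text in documents.items():
        tokens = text.lower().split()
        for gram in extract_ngrams(tokens, n):
            per_doc = index.setdefault(gram, {})
            per_doc[doc_id] = per_doc.get(doc_id, 0) + 1
    return index


if __name__ == "__main__":
    docs = {"d1": "a b a b", "d2": "a b c"}
    for gram, freqs in ngram_frequencies(docs, 2).items():
        print(gram, freqs)
```

    Because the whole dict lives in memory, this sketch has the same limitation the description attributes to main-memory techniques. A disk-backed structure such as a B-tree would be needed for large collections.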

Classification

  • Type

    D - Article in proceedings

  • CEP classification

    IN - Informatics

  • OECD FORD branch

Result continuities

  • Project

    GAP202/10/0573: Handling XML Data in Heterogeneous and Dynamic Environments (/en/project/GAP202%2F10%2F0573)

  • Continuities

    P - Research and development project financed from public funds (with a link to CEP)
    S - Specific research at universities

Others

  • Publication year

    2011

  • Confidentiality

    S - Complete and truthful data on the project are not subject to protection under special legal regulations

Data specific for result type

  • Article name in the collection

    2011 6th International Conference on Digital Information Management, ICDIM 2011

  • ISBN

    978-1-4577-1538-9

  • ISSN

  • e-ISSN

  • Number of pages

    6

  • Pages from-to

    73-78

  • Publisher name

    IEEE

  • Place of publication

    NEW YORK

  • Event location

    Melbourne

  • Event date

    Sep 12, 2011

  • Type of event by nationality

    WRD - Worldwide event

  • UT code for WoS article