A modified algorithm of the latent semantic analysis for text processing in the Russian language

Result identifiers

Result code in IS VaVaI
<a href="https://www.isvavai.cz/riv?ss=detail&h=RIV%2F00216208%3A11320%2F21%3A10441753" target="_blank" >RIV/00216208:11320/21:10441753 - isvavai.cz</a>
Result on the web
<a href="https://doi.org/10.1088/1742-6596/1715/1/012009" target="_blank" >https://doi.org/10.1088/1742-6596/1715/1/012009</a>
DOI - Digital Object Identifier
<a href="http://dx.doi.org/10.1088/1742-6596/1715/1/012009" target="_blank" >10.1088/1742-6596/1715/1/012009</a>

Alternative languages

Result language
English
Title in the original language
A modified algorithm of the latent semantic analysis for text processing in the Russian language
Result description in the original language
The paper presents a methodology for analyzing texts in the Russian language. The methodology is based on the Latent Semantic Analysis (LSA) algorithm. A number of disadvantages of the classical method are considered, and modifications to the methods of extracting N-grams from the text are proposed. The modified method reduces the number of extracted N-grams and increases the meaningfulness of the retrieved collection in comparison with the standard method. The reduction of the collection size leads to a reduced dimension of the TF-IDF matrix and accelerates the execution of the SVD method. The advantages of the developed machine learning algorithm are demonstrated on simple sentences. Owing to the ideas discussed, it becomes possible to parallelize the text processing effectively at the lemmatization step.
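The link between the number of extracted N-grams and the dimension of the TF-IDF matrix can be illustrated with a minimal sketch. It is not the paper's algorithm: the stop-word filter stands in for the proposed N-gram modifications, which the abstract does not specify, and the names `STOP_WORDS`, `extract_ngrams` and `tfidf_matrix` as well as the toy English sentences are assumptions made for illustration only.

```python
import math
from collections import Counter

# Hypothetical stop-word list, used only for this illustration.
STOP_WORDS = {"the", "a", "of", "on", "in", "is"}


def extract_ngrams(tokens, n_max=2, drop_stop_words=False):
    """Return all word N-grams of length 1..n_max from a token list.

    With drop_stop_words=True, any N-gram containing a stop word is skipped.
    This filter is an assumed stand-in, not the method from the paper.
    """
    ngrams = []
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            gram = tuple(tokens[i:i + n])
            if drop_stop_words and any(t in STOP_WORDS for t in gram):
                continue
            ngrams.append(gram)
    return ngrams


def tfidf_matrix(docs_ngrams):
    """Build a dense TF-IDF matrix (rows: documents, columns: N-grams)."""
    vocab = sorted({g for doc in docs_ngrams for g in doc})
    n_docs = len(docs_ngrams)
    # Document frequency: number of documents that contain each N-gram.
    df = Counter(g for doc in docs_ngrams for g in set(doc))
    rows = []
    for doc in docs_ngrams:
        counts = Counter(doc)
        total = len(doc) or 1  # guard against empty documents
        rows.append([
            (counts[g] / total) * math.log(n_docs / df[g])
            for g in vocab
        ])
    return vocab, rows


# Toy corpus of simple, already tokenized sentences.
docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the rug".split(),
    "a cat is the friend of a dog".split(),
]

full_vocab, full = tfidf_matrix([extract_ngrams(d) for d in docs])
small_vocab, small = tfidf_matrix(
    [extract_ngrams(d, drop_stop_words=True) for d in docs]
)
print(len(full_vocab), "columns without filtering")
print(len(small_vocab), "columns with filtering")
```

Each column of the matrix corresponds to one distinct N-gram, so every N-gram that is not extracted removes a column. In an LSA pipeline the SVD would then be applied to this smaller matrix, which is where the speed-up mentioned in the abstract would come from.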

Classification

Type
D - Article in proceedings
CEP field
—
OECD FORD field
10201 - Computer sciences, information science, bioinformatics (hardware development to be 2.2, social aspect to be 5.8)

Result continuities

Project
—
Continuities
—

Other

Publication year
2021
Data confidentiality code
S - Complete and true data about the project are not subject to protection under special legal regulations

Data specific to the result type

Title of the proceedings
Journal of Physics: Conference Series
ISBN
—
ISSN
1742-6588
e-ISSN
—
Number of pages
7
Pages from-to
—
Publisher name
IOP Publishing Ltd
Place of publication
Bristol
Event location
Akademgorodok
Event date
19 October 2020
Event type by nationality
WRD - Worldwide event
UT WoS article code
—

Similar results (10)

Index-based N-gram extraction from large document collections
Využití N-Gramů při klasifikaci textu (Use of N-grams in text classification)
N-Gram-Based Text Compression
