Feature extraction from unstructured texts as a combination of the morphological and the syntactic analysis and its usage in fake news classification tasks
Identifikátory výsledku
Kód výsledku v IS VaVaI
<a href="https://www.isvavai.cz/riv?ss=detail&h=RIV%2F00216275%3A25410%2F23%3A39920745" target="_blank" >RIV/00216275:25410/23:39920745 - isvavai.cz</a>
Výsledek na webu
<a href="https://link.springer.com/article/10.1007/s00521-023-08967-2" target="_blank" >https://link.springer.com/article/10.1007/s00521-023-08967-2</a>
DOI - Digital Object Identifier
<a href="http://dx.doi.org/10.1007/s00521-023-08967-2" target="_blank" >10.1007/s00521-023-08967-2</a>
Alternativní jazyky
Jazyk výsledku
angličtina
Název v původním jazyce
Feature extraction from unstructured texts as a combination of the morphological and the syntactic analysis and its usage in fake news classification tasks
Popis výsledku v původním jazyce
In this paper, a new technique of feature extraction is proposed, which is considered an essential part of natural language processing. Feature extraction is the process of transformation of the unstructured text to a format which is recognizable by computers. This means a transformation to a vector of numbers. The study evaluates and compares the performance of three methods: M1, which is the baseline method TfIdf; M2, which combines TfIdf with POS tags; and M3, a novel technique called MDgwPosF that incorporates weighted TfIdf values based on word depths and the relative frequency of POS tags. The primary focus of the study is to assess and compare the performance of these methods, with particular emphasis on evaluating how M3 performs in comparison with M1 and M2. Two different datasets and feed-forward, LSTM and GRU neural networks were used in this study. The results showed that the feed-forward model with the proposed method MDgwPosF in moderate topology achieved the best performance across various measures. The dataset created automatically performed better than the manual dataset. The differences between methods and topologies were not statistically significant. Statistically significant differences between the classification models were proven. The MDgwPosF method achieved higher accuracy compared to the baseline TfIdf, indicating that incorporating additional information into the vector can enhance the performance of TfIdf.
Název v anglickém jazyce
Feature extraction from unstructured texts as a combination of the morphological and the syntactic analysis and its usage in fake news classification tasks
Popis výsledku anglicky
In this paper, a new technique of feature extraction is proposed, which is considered an essential part of natural language processing. Feature extraction is the process of transformation of the unstructured text to a format which is recognizable by computers. This means a transformation to a vector of numbers. The study evaluates and compares the performance of three methods: M1, which is the baseline method TfIdf; M2, which combines TfIdf with POS tags; and M3, a novel technique called MDgwPosF that incorporates weighted TfIdf values based on word depths and the relative frequency of POS tags. The primary focus of the study is to assess and compare the performance of these methods, with particular emphasis on evaluating how M3 performs in comparison with M1 and M2. Two different datasets and feed-forward, LSTM and GRU neural networks were used in this study. The results showed that the feed-forward model with the proposed method MDgwPosF in moderate topology achieved the best performance across various measures. The dataset created automatically performed better than the manual dataset. The differences between methods and topologies were not statistically significant. Statistically significant differences between the classification models were proven. The MDgwPosF method achieved higher accuracy compared to the baseline TfIdf, indicating that incorporating additional information into the vector can enhance the performance of TfIdf.
Klasifikace
Druh
J<sub>imp</sub> - Článek v periodiku v databázi Web of Science
CEP obor
—
OECD FORD obor
10201 - Computer sciences, information science, bioinformathics (hardware development to be 2.2, social aspect to be 5.8)
Návaznosti výsledku
Projekt
—
Návaznosti
I - Institucionalni podpora na dlouhodoby koncepcni rozvoj vyzkumne organizace
Ostatní
Rok uplatnění
2023
Kód důvěrnosti údajů
S - Úplné a pravdivé údaje o projektu nepodléhají ochraně podle zvláštních právních předpisů
Údaje specifické pro druh výsledku
Název periodika
Neural Computing and Applications
ISSN
0941-0643
e-ISSN
1433-3058
Svazek periodika
35
Číslo periodika v rámci svazku
29
Stát vydavatele periodika
US - Spojené státy americké
Počet stran výsledku
13
Strana od-do
22055-22067
Kód UT WoS článku
001066965500041
EID výsledku v databázi Scopus
2-s2.0-85170046366