Feature extraction from unstructured texts as a combination of the morphological and the syntactic analysis and its usage in fake news classification tasks
The result's identifiers
Result code in IS VaVaI
RIV/00216275:25410/23:39920745 - isvavai.cz (https://www.isvavai.cz/riv?ss=detail&h=RIV%2F00216275%3A25410%2F23%3A39920745)
Result on the web
https://link.springer.com/article/10.1007/s00521-023-08967-2
DOI - Digital Object Identifier
10.1007/s00521-023-08967-2 (http://dx.doi.org/10.1007/s00521-023-08967-2)
Alternative languages
Result language
English
Original language name
Feature extraction from unstructured texts as a combination of the morphological and the syntactic analysis and its usage in fake news classification tasks
Original language description
This paper proposes a new feature extraction technique. Feature extraction, an essential part of natural language processing, transforms unstructured text into a format that computers can process, that is, a vector of numbers. The study evaluates and compares three methods: M1, the baseline TfIdf; M2, which combines TfIdf with POS tags; and M3, a novel technique called MDgwPosF that incorporates weighted TfIdf values based on word depths and the relative frequency of POS tags. The main aim is to assess how M3 performs in comparison with M1 and M2. Two different datasets and feed-forward, LSTM and GRU neural networks were used. The feed-forward model with the proposed MDgwPosF method in a moderate topology achieved the best performance across various measures. The automatically created dataset yielded better results than the manually created one. The differences between methods and topologies were not statistically significant, whereas the differences between the classification models were. MDgwPosF achieved higher accuracy than the baseline TfIdf, indicating that incorporating additional information into the vector can enhance the performance of TfIdf.
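To make the idea of combining TfIdf with POS information concrete, the following is a minimal Python sketch in the spirit of M2: a TfIdf vector concatenated with the relative frequency of each POS tag in the document. It is not the authors' implementation and it does not reproduce MDgwPosF, because the description does not define the word-depth weighting. The toy corpus, its hand-assigned tags, the tag list, and all function names are assumptions made for illustration only.

```python
import math
from collections import Counter

# Toy corpus: each document is a list of (token, POS) pairs.
# The tags are hand-assigned for illustration, not produced by a real tagger.
DOCS = [
    [("aliens", "NOUN"), ("built", "VERB"), ("the", "DET"), ("pyramids", "NOUN")],
    [("the", "DET"), ("council", "NOUN"), ("approved", "VERB"),
     ("the", "DET"), ("budget", "NOUN")],
]
POS_TAGS = ["NOUN", "VERB", "DET"]  # assumed tag inventory for this toy example


def tfidf_vectors(docs):
    """Plain TF-IDF: relative term frequency times log(N / document frequency)."""
    vocab = sorted({tok for doc in docs for tok, _ in doc})
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(tok for doc in docs for tok in {t for t, _ in doc})
    vectors = []
    for doc in docs:
        counts = Counter(tok for tok, _ in doc)
        vectors.append([
            (counts[term] / len(doc)) * math.log(n_docs / df[term])
            for term in vocab
        ])
    return vocab, vectors


def pos_relative_frequencies(doc, tags=POS_TAGS):
    """Share of the document's tokens that carry each POS tag."""
    counts = Counter(tag for _, tag in doc)
    return [counts[tag] / len(doc) for tag in tags]


def combined_vectors(docs):
    """Concatenate each TF-IDF vector with the document's POS-tag frequencies."""
    vocab, tfidf = tfidf_vectors(docs)
    features = [vec + pos_relative_frequencies(doc) for vec, doc in zip(tfidf, docs)]
    return vocab, features


vocab, features = combined_vectors(DOCS)
```

Each resulting vector has one entry per vocabulary term followed by one entry per POS tag, and could be passed to any of the classifier types named in the description. A real pipeline would obtain the tags from a POS tagger rather than from hand-labelled pairs.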
Czech name
—
Czech description
—
Classification
Type
J<sub>imp</sub> - Article in a specialist periodical, which is included in the Web of Science database
CEP classification
—
OECD FORD branch
10201 - Computer sciences, information science, bioinformatics (hardware development to be 2.2, social aspect to be 5.8)
Result continuities
Project
—
Continuities
I - Institutional support for the long-term conceptual development of a research organisation
Others
Publication year
2023
Confidentiality
S - Complete and truthful data on the project are not subject to protection under special legal regulations
Data specific for result type
Name of the periodical
Neural Computing and Applications
ISSN
0941-0643
e-ISSN
1433-3058
Volume of the periodical
35
Issue of the periodical within the volume
29
Country of publishing house
US - UNITED STATES
Number of pages
13
Pages from-to
22055-22067
UT code for WoS article
001066965500041
EID of the result in the Scopus database
2-s2.0-85170046366