Pipeline and dataset generation for automated fact-checking in almost any language
The result's identifiers
Result code in IS VaVaI
RIV/00216208:11230/24:10491323 - isvavai.cz (https://www.isvavai.cz/riv?ss=detail&h=RIV%2F00216208%3A11230%2F24%3A10491323)
Alternative codes found
RIV/68407700:21230/24:00376915
Result on the web
https://verso.is.cuni.cz/pub/verso.fpl?fname=obd_publikace_handle&handle=LlIO7PMEmJ
DOI - Digital Object Identifier
10.1007/s00521-024-10113-5 (http://dx.doi.org/10.1007/s00521-024-10113-5)
Alternative languages
Result language
English
Original language name
Pipeline and dataset generation for automated fact-checking in almost any language
Original language description
This article presents a pipeline for automated fact-checking leveraging publicly available language models and data. The objective is to assess the accuracy of textual claims using evidence from a ground-truth evidence corpus. The pipeline consists of two main modules: evidence retrieval and claim veracity evaluation. Our primary focus is on the ease of deployment in various languages that remain unexplored in the field of automated fact-checking. Unlike most similar pipelines, which work with evidence sentences, our pipeline processes data on a paragraph level, simplifying the overall architecture and data requirements. Given the high cost of annotating language-specific fact-checking training data, our solution builds on the question answering for claim generation method, which we adapt and use to generate the data for all models of the pipeline. Our strategy enables the introduction of new languages through machine translation of only two fixed datasets of moderate size. Subsequently, any number of training samples can be generated based on an evidence corpus in the target language. We provide open access to all data and fine-tuned models for Czech, English, Polish, and Slovak pipelines, as well as to our codebase that may be used to reproduce the results. We comprehensively evaluate the pipelines for all four languages, including human annotations and per-sample difficulty assessment using Pointwise V-information. The presented experiments are based on full Wikipedia snapshots to promote reproducibility. To facilitate implementation and user interaction, we develop the FactSearch application, which features the proposed pipeline, and report preliminary feedback on its performance.
Czech name
—
Czech description
—
Classification
Type
J<sub>SC</sub> - Article in a specialist periodical, which is included in the SCOPUS database
CEP classification
—
OECD FORD branch
10201 - Computer sciences, information science, bioinformatics (hardware development to be 2.2, social aspect to be 5.8)
Result continuities
Project
Result was created during the realization of more than one project. More information in the Projects tab.
Continuities
P - Research and development project financed from public funds (with a link to CEP)
Others
Publication year
2024
Confidentiality
S - Complete and truthful data on the project are not subject to protection under special legal regulations
Data specific for result type
Name of the periodical
Neural Computing and Applications
ISSN
0941-0643
e-ISSN
1433-3058
Volume of the periodical
36
Issue of the periodical within the volume
30
Country of publishing house
US - UNITED STATES
Number of pages
32
Pages from-to
19023-19054
UT code for WoS article
—
EID of the result in the Scopus database
2-s2.0-85200201059