Vše

Co hledáte?

Vše
Projekty
Výsledky výzkumu
Subjekty

Rychlé hledání

  • Projekty podpořené TA ČR
  • Významné projekty
  • Projekty s nejvyšší státní podporou
  • Aktuálně běžící projekty

Chytré vyhledávání

  • Takto najdu konkrétní +slovo
  • Takto z výsledků -slovo zcela vynechám
  • “Takto můžu najít celou frázi”

Pipeline and dataset generation for automated fact-checking in almost any language

Identifikátory výsledku

  • Kód výsledku v IS VaVaI

    <a href="https://www.isvavai.cz/riv?ss=detail&h=RIV%2F00216208%3A11230%2F24%3A10491323" target="_blank" >RIV/00216208:11230/24:10491323 - isvavai.cz</a>

  • Nalezeny alternativní kódy

    RIV/68407700:21230/24:00376915

  • Výsledek na webu

    <a href="https://verso.is.cuni.cz/pub/verso.fpl?fname=obd_publikace_handle&handle=LlIO7PMEmJ" target="_blank" >https://verso.is.cuni.cz/pub/verso.fpl?fname=obd_publikace_handle&handle=LlIO7PMEmJ</a>

  • DOI - Digital Object Identifier

    <a href="http://dx.doi.org/10.1007/s00521-024-10113-5" target="_blank" >10.1007/s00521-024-10113-5</a>

Alternativní jazyky

  • Jazyk výsledku

    angličtina

  • Název v původním jazyce

    Pipeline and dataset generation for automated fact-checking in almost any language

  • Popis výsledku v původním jazyce

    This article presents a pipeline for automated fact-checking leveraging publicly available language models and data. The objective is to assess the accuracy of textual claims using evidence from a ground-truth evidence corpus. The pipeline consists of two main modules-the evidence retrieval and the claim veracity evaluation. Our primary focus is on the ease of deployment in various languages that remain unexplored in the field of automated fact-checking. Unlike most similar pipelines, which work with evidence sentences, our pipeline processes data on a paragraph level, simplifying the overall architecture and data requirements. Given the high cost of annotating language-specific fact-checking training data, our solution builds on the question answering for claim generation method, which we adapt and use to generate the data for all models of the pipeline. Our strategy enables the introduction of new languages through machine translation of only two fixed datasets of moderate size. Subsequently, any number of training samples can be generated based on an evidence corpus in the target language. We provide open access to all data and fine-tuned models for Czech, English, Polish, and Slovak pipelines, as well as to our codebase that may be used to reproduce the results. We comprehensively evaluate the pipelines for all four languages, including human annotations and per-sample difficulty assessment using Pointwise-information. The presented experiments are based on full Wikipedia snapshots to promote reproducibility. To facilitate implementation and user interaction, we develop the FactSearch application featuring the proposed pipeline and the preliminary feedback on its performance.

  • Název v anglickém jazyce

    Pipeline and dataset generation for automated fact-checking in almost any language

  • Popis výsledku anglicky

    This article presents a pipeline for automated fact-checking leveraging publicly available language models and data. The objective is to assess the accuracy of textual claims using evidence from a ground-truth evidence corpus. The pipeline consists of two main modules-the evidence retrieval and the claim veracity evaluation. Our primary focus is on the ease of deployment in various languages that remain unexplored in the field of automated fact-checking. Unlike most similar pipelines, which work with evidence sentences, our pipeline processes data on a paragraph level, simplifying the overall architecture and data requirements. Given the high cost of annotating language-specific fact-checking training data, our solution builds on the question answering for claim generation method, which we adapt and use to generate the data for all models of the pipeline. Our strategy enables the introduction of new languages through machine translation of only two fixed datasets of moderate size. Subsequently, any number of training samples can be generated based on an evidence corpus in the target language. We provide open access to all data and fine-tuned models for Czech, English, Polish, and Slovak pipelines, as well as to our codebase that may be used to reproduce the results. We comprehensively evaluate the pipelines for all four languages, including human annotations and per-sample difficulty assessment using Pointwise-information. The presented experiments are based on full Wikipedia snapshots to promote reproducibility. To facilitate implementation and user interaction, we develop the FactSearch application featuring the proposed pipeline and the preliminary feedback on its performance.

Klasifikace

  • Druh

    J<sub>SC</sub> - Článek v periodiku v databázi SCOPUS

  • CEP obor

  • OECD FORD obor

    10201 - Computer sciences, information science, bioinformathics (hardware development to be 2.2, social aspect to be 5.8)

Návaznosti výsledku

  • Projekt

    Výsledek vznikl pri realizaci vícero projektů. Více informací v záložce Projekty.

  • Návaznosti

    P - Projekt vyzkumu a vyvoje financovany z verejnych zdroju (s odkazem do CEP)

Ostatní

  • Rok uplatnění

    2024

  • Kód důvěrnosti údajů

    S - Úplné a pravdivé údaje o projektu nepodléhají ochraně podle zvláštních právních předpisů

Údaje specifické pro druh výsledku

  • Název periodika

    Neural Computing and Applications

  • ISSN

    0941-0643

  • e-ISSN

    1433-3058

  • Svazek periodika

    36

  • Číslo periodika v rámci svazku

    30

  • Stát vydavatele periodika

    US - Spojené státy americké

  • Počet stran výsledku

    32

  • Strana od-do

    19023-19054

  • Kód UT WoS článku

  • EID výsledku v databázi Scopus

    2-s2.0-85200201059