Vše

Co hledáte?

Vše
Projekty
Výsledky výzkumu
Subjekty

Rychlé hledání

  • Projekty podpořené TA ČR
  • Významné projekty
  • Projekty s nejvyšší státní podporou
  • Aktuálně běžící projekty

Chytré vyhledávání

  • Takto najdu konkrétní +slovo
  • Takto z výsledků -slovo zcela vynechám
  • “Takto můžu najít celou frázi”

CsFEVER and CTKFacts: acquiring Czech data for fact verification

Identifikátory výsledku

  • Kód výsledku v IS VaVaI

    <a href="https://www.isvavai.cz/riv?ss=detail&h=RIV%2F00216208%3A11230%2F23%3A10470715" target="_blank" >RIV/00216208:11230/23:10470715 - isvavai.cz</a>

  • Nalezeny alternativní kódy

    RIV/68407700:21230/23:00372837

  • Výsledek na webu

    <a href="https://verso.is.cuni.cz/pub/verso.fpl?fname=obd_publikace_handle&handle=90debLHSb3" target="_blank" >https://verso.is.cuni.cz/pub/verso.fpl?fname=obd_publikace_handle&handle=90debLHSb3</a>

  • DOI - Digital Object Identifier

    <a href="http://dx.doi.org/10.1007/s10579-023-09654-3" target="_blank" >10.1007/s10579-023-09654-3</a>

Alternativní jazyky

  • Jazyk výsledku

    angličtina

  • Název v původním jazyce

    CsFEVER and CTKFacts: acquiring Czech data for fact verification

  • Popis výsledku v původním jazyce

    In this paper, we examine several methods of acquiring Czech data for automated fact-checking, which is a task commonly modeled as a classification of textual claim veracity w.r.t. a corpus of trusted ground truths. We attempt to collect sets of data in form of a factual claim, evidence within the ground truth corpus, and its veracity label (supported, refuted or not enough info). As a first attempt, we generate a Czech version of the large-scale FEVER dataset built on top of WIKIPEDIA corpus. We take a hybrid approach of machine translation and document alignment; the approach and the tools we provide can be easily applied to other languages. We discuss its weaknesses, propose a future strategy for their mitigation and publish the 127k resulting translations, as well as a version of such dataset reliably applicable for the Natural Language Inference task-the CSFEVER-NLI. Furthermore, we collect a novel dataset of 3,097 claims, which is annotated using the corpus of 2.2 M articles of Czech News Agency. We present an extended dataset annotation methodology based on the FEVER approach, and, as the underlying corpus is proprietary, we also publish a standalone version of the dataset for the task of Natural Language Inference we call CTKFACTSNLI. We analyze both acquired datasets for spurious cues-annotation patterns leading to model overfitting. CTKFACTS is further examined for inter-annotator agreement, thoroughly cleaned, and a typology of common annotator errors is extracted. Finally, we provide baseline models for all stages of the fact-checking pipeline and publish the NLI datasets, as well as our annotation platform and other experimental data.

  • Název v anglickém jazyce

    CsFEVER and CTKFacts: acquiring Czech data for fact verification

  • Popis výsledku anglicky

    In this paper, we examine several methods of acquiring Czech data for automated fact-checking, which is a task commonly modeled as a classification of textual claim veracity w.r.t. a corpus of trusted ground truths. We attempt to collect sets of data in form of a factual claim, evidence within the ground truth corpus, and its veracity label (supported, refuted or not enough info). As a first attempt, we generate a Czech version of the large-scale FEVER dataset built on top of WIKIPEDIA corpus. We take a hybrid approach of machine translation and document alignment; the approach and the tools we provide can be easily applied to other languages. We discuss its weaknesses, propose a future strategy for their mitigation and publish the 127k resulting translations, as well as a version of such dataset reliably applicable for the Natural Language Inference task-the CSFEVER-NLI. Furthermore, we collect a novel dataset of 3,097 claims, which is annotated using the corpus of 2.2 M articles of Czech News Agency. We present an extended dataset annotation methodology based on the FEVER approach, and, as the underlying corpus is proprietary, we also publish a standalone version of the dataset for the task of Natural Language Inference we call CTKFACTSNLI. We analyze both acquired datasets for spurious cues-annotation patterns leading to model overfitting. CTKFACTS is further examined for inter-annotator agreement, thoroughly cleaned, and a typology of common annotator errors is extracted. Finally, we provide baseline models for all stages of the fact-checking pipeline and publish the NLI datasets, as well as our annotation platform and other experimental data.

Klasifikace

  • Druh

    J<sub>imp</sub> - Článek v periodiku v databázi Web of Science

  • CEP obor

  • OECD FORD obor

    10201 - Computer sciences, information science, bioinformathics (hardware development to be 2.2, social aspect to be 5.8)

Návaznosti výsledku

  • Projekt

    <a href="/cs/project/TL02000288" target="_blank" >TL02000288: Proměna etických aspektů s nástupem žurnalistiky umělé inteligence</a><br>

  • Návaznosti

    P - Projekt vyzkumu a vyvoje financovany z verejnych zdroju (s odkazem do CEP)

Ostatní

  • Rok uplatnění

    2023

  • Kód důvěrnosti údajů

    S - Úplné a pravdivé údaje o projektu nepodléhají ochraně podle zvláštních právních předpisů

Údaje specifické pro druh výsledku

  • Název periodika

    Language Resources and Evaluation

  • ISSN

    1574-020X

  • e-ISSN

    1574-0218

  • Svazek periodika

    57

  • Číslo periodika v rámci svazku

    4

  • Stát vydavatele periodika

    NL - Nizozemsko

  • Počet stran výsledku

    35

  • Strana od-do

    1571-1605

  • Kód UT WoS článku

    000980799100007

  • EID výsledku v databázi Scopus

    2-s2.0-85158137690