CaLMQA: Exploring culturally specific long-form question answering across 23 languages

The result's identifiers

Result code in IS VaVaI
RIV/00216208:11320/25:U6ZJ6W7I - isvavai.cz (https://www.isvavai.cz/riv?ss=detail&h=RIV%2F00216208%3A11320%2F25%3AU6ZJ6W7I)
Result on the web
http://arxiv.org/abs/2406.17761
DOI - Digital Object Identifier
10.48550/arXiv.2406.17761 (http://dx.doi.org/10.48550/arXiv.2406.17761)

Alternative languages

Result language
English
Original language name
CaLMQA: Exploring culturally specific long-form question answering across 23 languages
Original language description
Large language models (LLMs) are used for long-form question answering (LFQA), which requires them to generate paragraph-length answers to complex questions. While LFQA has been well-studied in English, this research has not been extended to other languages. To bridge this gap, we introduce CaLMQA, a collection of 1.5K complex culturally specific questions spanning 23 languages and 51 culturally agnostic questions translated from English into 22 other languages. We define culturally specific questions as those uniquely or more likely to be asked by people from cultures associated with the question's language. We collect naturally-occurring questions from community web forums and hire native speakers to write questions to cover under-resourced, rarely-studied languages such as Fijian and Kirundi. Our dataset contains diverse, complex questions that reflect cultural topics (e.g. traditions, laws, news) and the language usage of native speakers. We automatically evaluate a suite of open- and closed-source models on CaLMQA by detecting incorrect language and token repetitions in answers, and observe that the quality of LLM-generated answers degrades significantly for some low-resource languages. Lastly, we perform human evaluation on a subset of models and languages. Manual evaluation reveals that model performance is significantly worse for culturally specific questions than for culturally agnostic questions. Our findings highlight the need for further research in non-English LFQA and provide an evaluation framework.
Czech name
—
Czech description
—

Classification

Type
O - Miscellaneous
CEP classification
—
OECD FORD branch
10201 - Computer sciences, information science, bioinformatics (hardware development to be 2.2, social aspect to be 5.8)

Result continuities

Project
—
Continuities
—

Others

Publication year
2024
Confidentiality
S - Complete and truthful data on the project are not subject to protection under special legal regulations

Similar results (10)

Task-Agnostic Low-Rank Adapters for Unseen English Dialects
MEGAVERSE: Benchmarking Large Language Models Across Languages, Modalities, Models and Tasks
This Word Mean What: Constructing a Singlish Dictionary with ChatGPT
