Reducing Gender Bias in Machine Translation through Counterfactual Data Generation

The result's identifiers

Result code in IS VaVaI
<a href="https://www.isvavai.cz/riv?ss=detail&h=RIV%2F00216208%3A11320%2F23%3A3ADSLJUB" target="_blank" >RIV/00216208:11320/23:3ADSLJUB - isvavai.cz</a>
Result on the web
<a href="http://arxiv.org/abs/2311.16362" target="_blank" >http://arxiv.org/abs/2311.16362</a>
DOI - Digital Object Identifier
<a href="http://dx.doi.org/10.48550/arXiv.2311.16362" target="_blank" >10.48550/arXiv.2311.16362</a>

Alternative languages

Result language
English
Original language name
Reducing Gender Bias in Machine Translation through Counterfactual Data Generation
Original language description
"Recent advances in neural methods have led to substantial improvement in the quality of Neural Machine Translation (NMT) systems. However, these systems frequently produce translations with inaccurate gender (Stanovsky et al., 2019), which can be traced to bias in training data. Saunders and Byrne (2020) tackle this problem with a handcrafted dataset containing balanced gendered profession words. By using this data to fine-tune an existing NMT model, they show that gender bias can be significantly mitigated, albeit at the expense of translation quality due to catastrophic forgetting. They recover some of the lost quality with modified training objectives or additional models at inference. We find, however, that simply supplementing the handcrafted dataset with a random sample from the base model training corpus is enough to significantly reduce the catastrophic forgetting. We also propose a novel domain-adaptation technique that leverages in-domain data created with the counterfactual data generation techniques proposed by Zmigrod et al. (2019) to further improve accuracy on the WinoMT challenge test set without significant loss in translation quality. We show its effectiveness in NMT systems from English into three morphologically rich languages French, Spanish, and Italian. The relevant dataset and code will be available at Github."
Czech name
—
Czech description
—

Classification

Type
O - Miscellaneous
CEP classification
—
OECD FORD branch
10201 - Computer sciences, information science, bioinformatics (hardware development to be 2.2, social aspect to be 5.8)

Result continuities

Project
—
Continuities
—

Others

Publication year
2023
Confidentiality
S - Complete and true data on the project are not subject to protection under special legal regulations

Similar results(10)

CUNI Transformer Neural MT System for WMT18
Deep Multi-Lingual Cross Sentence Alignment
English-Indonesian Neural Machine Translation for Spoken Language Domains
