Exploring the Impact of Training Data Distribution and Subword Tokenization on Gender Bias in Machine Translation

The result's identifiers

  • Result code in IS VaVaI

    RIV/00216208:11320/23:10475930 - isvavai.cz (https://www.isvavai.cz/riv?ss=detail&h=RIV%2F00216208%3A11320%2F23%3A10475930)

  • Result on the web

    http://www.afnlp.org/conferences/ijcnlp2023/proceedings/main-long/cdrom/pdf/2023.ijcnlp-long.57.pdf

  • DOI - Digital Object Identifier

Alternative languages

  • Result language

    English

  • Original language name

    Exploring the Impact of Training Data Distribution and Subword Tokenization on Gender Bias in Machine Translation

  • Original language description

    We study the effect of tokenization on gender bias in machine translation, an aspect that has been largely overlooked in previous works. Specifically, we focus on the interactions between the frequency of gendered profession names in training data, their representation in the subword tokenizer's vocabulary, and gender bias. We observe that female and non-stereotypical gender inflections of profession names (e.g., Spanish "doctora" for "female doctor") tend to be split into multiple subword tokens. Our results indicate that the imbalance of gender forms in the model's training corpus is a major factor contributing to gender bias and has a greater impact than subword splitting. We show that analyzing subword splits provides good estimates of gender-form imbalance in the training data and can be used even when the corpus is not publicly available. We also demonstrate that fine-tuning just the token embedding layer can decrease the gap in gender prediction accuracy between female and male forms without impairing the translation quality.

  • Czech name

  • Czech description
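
The subword-split analysis mentioned in the original language description can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy vocabulary `TOY_VOCAB`, the word pairs in `PAIRS`, and the helper names `greedy_split` and `split_gap` are hypothetical, and the greedy longest-match splitter is a simple stand-in rather than the tokenizer used in the paper. It only shows the idea of comparing how many subword pieces the male and female forms of a profession name receive.

```python
from typing import List, Set, Tuple

# Hypothetical toy vocabulary (an assumption); a real subword vocabulary
# would be learned from the training corpus of the translation model.
TOY_VOCAB: Set[str] = {"doctor", "profesor", "enfermera", "enfermer", "a", "o"}

# Hypothetical (male form, female form) pairs of Spanish profession names.
PAIRS: List[Tuple[str, str]] = [
    ("doctor", "doctora"),
    ("profesor", "profesora"),
    ("enfermero", "enfermera"),
]


def greedy_split(word: str, vocab: Set[str]) -> List[str]:
    """Split a word into pieces by greedy longest-match against vocab.

    Characters not covered by any vocabulary entry become single-character pieces.
    """
    pieces: List[str] = []
    i = 0
    while i < len(word):
        # Try the longest candidate starting at position i first.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            # No vocabulary entry matches here: fall back to one character.
            pieces.append(word[i])
            i += 1
    return pieces


def split_gap(male: str, female: str, vocab: Set[str]) -> int:
    """Number of pieces of the female form minus that of the male form."""
    return len(greedy_split(female, vocab)) - len(greedy_split(male, vocab))


if __name__ == "__main__":
    for male, female in PAIRS:
        print(
            male, greedy_split(male, TOY_VOCAB),
            female, greedy_split(female, TOY_VOCAB),
            "gap:", split_gap(male, female, TOY_VOCAB),
        )
```

A positive gap means the female form is split into more pieces than the male form. Under the abstract's claim, such gaps hint that the female form was rarer in the data the vocabulary was built from.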

Classification

  • Type

    D - Article in proceedings

  • CEP classification

  • OECD FORD branch

    10201 - Computer sciences, information science, bioinformatics (hardware development to be 2.2, social aspect to be 5.8)

Result continuities

  • Project

    GA23-06912S: Identification and Prevention of Unwanted Gender Bias in Neural Language Models (/en/project/GA23-06912S)

  • Continuities

    P - Research and development project financed from public funds (with a link to CEP)

Others

  • Publication year

    2023

  • Confidentiality

    S - Complete and truthful data on the project are not subject to protection under special legal regulations

Data specific for result type

  • Article name in the collection

    Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

  • ISBN

    979-8-89176-014-1

  • ISSN

  • e-ISSN

  • Number of pages

    12

  • Pages from-to

    885-896

  • Publisher name

    Association for Computational Linguistics

  • Place of publication

    Stroudsburg, PA, USA

  • Event location

    Nusa Dua, Bali, Indonesia

  • Event date

    Nov 1, 2023

  • Type of event by nationality

    WRD - Worldwide event

  • UT code for WoS article