
Modeling Diachronic Change in English Scientific Writing over 300+ Years with Transformer-based Language Model Surprisal

The result's identifiers

  • Result code in IS VaVaI

    RIV/00216208:11320/25:6RCGJHPY - isvavai.cz (https://www.isvavai.cz/riv?ss=detail&h=RIV%2F00216208%3A11320%2F25%3A6RCGJHPY)

  • Result on the web

    https://www.scopus.com/inward/record.uri?eid=2-s2.0-85198668184&partnerID=40&md5=f39b9e4e7762bbbc6b4fea9cd5212861

  • DOI - Digital Object Identifier

Alternative languages

  • Result language

    English

  • Original language name

    Modeling Diachronic Change in English Scientific Writing over 300+ Years with Transformer-based Language Model Surprisal

  • Original language description

    This study presents an analysis of diachronic linguistic changes in English scientific writing, utilizing surprisal from transformer-based language models. Unlike traditional n-gram models, transformer-based models are potentially better at capturing nuanced linguistic changes such as long-range dependencies by considering variable context sizes. However, creating diachronically comparable language models from historical data poses several challenges, notably an exponential increase in the number of texts, tokens per text, and vocabulary size over time. We address these by using a shared vocabulary and employing a robust training strategy that includes initial uniform sampling from the corpus and continued pre-training on specific temporal segments. Our empirical analysis highlights the predictive power of surprisal from transformer-based models, particularly in analyzing complex linguistic structures like relative clauses. The models’ broader contextual awareness and the inclusion of dependency length annotations contribute to a more intricate understanding of communicative efficiency. While our focus is on scientific English, our approach can be applied to other low-resource scenarios. © 2024 ELRA Language Resource Association: CC BY-NC 4.0.

  • Czech name

  • Czech description
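The description above refers to per-token surprisal computed with transformer-based language models. Below is a minimal sketch of how such surprisal values can be obtained from an off-the-shelf causal language model in Python using the Hugging Face transformers library; the model name "gpt2" and the example sentence are placeholders, not the diachronic models trained in the study.

import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder; the study pre-trains its own diachronic models

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def token_surprisal(text):
    """Return (token, surprisal in bits) pairs, where surprisal = -log2 p(token | left context)."""
    input_ids = tokenizer(text, return_tensors="pt")["input_ids"]
    with torch.no_grad():
        logits = model(input_ids).logits          # shape: (1, seq_len, vocab_size)
    # Score each token given its left context; the first token has no context here.
    log_probs = torch.log_softmax(logits[:, :-1, :], dim=-1)
    targets = input_ids[:, 1:]
    token_log_probs = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    surprisals = (-token_log_probs / math.log(2)).squeeze(0).tolist()
    tokens = tokenizer.convert_ids_to_tokens(targets.squeeze(0).tolist())
    return list(zip(tokens, surprisals))

for tok, s in token_surprisal("The experiment which we describe below was repeated twice."):
    print(f"{tok:>15}  {s:6.2f} bits")

Higher surprisal marks a token that is less predictable given its context; the study aggregates such values over temporal segments of a scientific corpus, for example around relative clauses, to trace changes in communicative efficiency.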

Classification

  • Type

    D - Article in proceedings

  • CEP classification

  • OECD FORD branch

    10201 - Computer sciences, information science, bioinformatics (hardware development to be 2.2, social aspect to be 5.8)

Result continuities

  • Project

  • Continuities

Others

  • Publication year

    2024

  • Confidentiality

    S - Complete and true data on the project are not subject to protection under special legal regulations

Data specific for result type

  • Article name in the collection

    Workshop on Building and Using Comparable Corpora (BUCC), LREC-COLING - Proceedings

  • ISBN

    978-249381431-9

  • ISSN

  • e-ISSN

  • Number of pages

    12

  • Pages from-to

    12-23

  • Publisher name

    European Language Resources Association (ELRA)

  • Place of publication

  • Event location

    Torino, Italy

  • Event date

    Jan 1, 2025

  • Type of event by nationality

    WRD - Worldwide event

  • UT code for WoS article