CorpusArièja: Building an Annotated Corpus with Variation in Occitan

The result's identifiers

  • Result code in IS VaVaI

    RIV/00216208:11320/25:4JW38D3L - isvavai.cz (https://www.isvavai.cz/riv?ss=detail&h=RIV%2F00216208%3A11320%2F25%3A4JW38D3L)

  • Result on the web

    https://www.scopus.com/inward/record.uri?eid=2-s2.0-85195237095&partnerID=40&md5=d233ffeeff7bb6d7fc12c19187f794bf

  • DOI - Digital Object Identifier

Alternative languages

  • Result language

    English

  • Original language name

    CorpusArièja: Building an Annotated Corpus with Variation in Occitan

  • Original language description

    Occitan is a less-resourced language and is classified as ‘in danger’ by UNESCO. It is therefore important to build resources and tools that can help to safeguard the language and develop its digitisation. CorpusArièja is a collection of 72 texts (just over 41,000 tokens) in the Occitan language of the French department of Ariège. The majority of the texts needed to be digitised and processed with Optical Character Recognition. The corpus contains dialectal and spelling variation, but is limited to prose, without diachronic variation or genre variation. It is an annotated corpus with two levels of lemmatisation, POS tags and verbal inflection. One of the main aims of the corpus is to enable the design of tools that can automatically annotate all Occitan texts, regardless of the dialect or spelling used. The Ariège territory is interesting because it includes the two kinds of variation that we focus on, dialectal and spelling. It has many authors who write in their native language, their own variety of Occitan. © 2024 ELRA Language Resource Association.

  • Czech name

  • Czech description

Classification

  • Type

    D - Article in proceedings

  • CEP classification

  • OECD FORD branch

    10201 - Computer sciences, information science, bioinformatics (hardware development to be 2.2, social aspect to be 5.8)

Result continuities

  • Project

  • Continuities

Others

  • Publication year

    2024

  • Confidentiality

    S - Complete and true data on the project are not subject to protection under special legal regulations

Data specific for result type

  • Article name in the collection

    Annu. Meet. ELRA-ISCA Spec. Interest Group Under-Resour. Lang., SIGUL LREC-COLING - Workshop Proc.

  • ISBN

    978-249381429-6

  • ISSN

  • e-ISSN

  • Number of pages

    6

  • Pages from-to

    66-71

  • Publisher name

    European Language Resources Association (ELRA)

  • Place of publication

  • Event location

    Torino, Italia

  • Event date

    Jan 1, 2025

  • Type of event by nationality

    WRD - Worldwide event

  • UT code for WoS article