Aalamaram: A Large-Scale Linguistically Annotated Treebank for the Tamil Language
The result's identifiers
Result code in IS VaVaI
<a href="https://www.isvavai.cz/riv?ss=detail&h=RIV%2F00216208%3A11320%2F25%3AZPC3NT23" target="_blank" >RIV/00216208:11320/25:ZPC3NT23 - isvavai.cz</a>
Result on the web
<a href="https://www.scopus.com/inward/record.uri?eid=2-s2.0-85195195757&partnerID=40&md5=87a815daf7cbcfbce6daa970a198049f" target="_blank" >https://www.scopus.com/inward/record.uri?eid=2-s2.0-85195195757&partnerID=40&md5=87a815daf7cbcfbce6daa970a198049f</a>
DOI - Digital Object Identifier
—
Alternative languages
Result language
English
Original language name
Aalamaram: A Large-Scale Linguistically Annotated Treebank for the Tamil Language
Original language description
Tamil is a relatively low-resource language in the field of Natural Language Processing (NLP). Recent years have seen growth in Tamil NLP datasets for Natural Language Understanding (NLU) and Natural Language Generation (NLG) tasks, but high-quality linguistic resources remain scarce. To help close this gap, this paper introduces Aalamaram, a treebank with rich linguistic annotations for the Tamil language. It is to date the largest publicly available Tamil treebank, with almost 10,000 sentences from diverse sources, and is annotated for Part-of-Speech (POS) tagging, Named Entity Recognition (NER), morphological parsing, and dependency parsing. Close attention has also been paid to multi-word segmentation, especially in the context of Tamil clitics. Although the treebank is based largely on the Universal Dependencies (UD) specifications, significant effort has been made to adjust the annotation rules to the idiosyncrasies and complexities of the Tamil language, thereby providing a valuable resource for linguistic research and NLP development. © 2024 ELRA Language Resource Association: CC BY-NC 4.0.
Czech name
—
Czech description
—
Classification
Type
D - Article in proceedings
CEP classification
—
OECD FORD branch
10201 - Computer sciences, information science, bioinformatics (hardware development to be 2.2, social aspect to be 5.8)
Result continuities
Project
—
Continuities
—
Others
Publication year
2024
Confidentiality
S - Complete and true data on the project are not subject to protection under special legal regulations
Data specific for result type
Article name in the collection
Workshop Indian Lang. Data Resour. Eval., WILDRE LREC-COLING - Workshop Proc.
ISBN
978-249381437-1
ISSN
—
e-ISSN
—
Number of pages
11
Pages from-to
73-83
Publisher name
European Language Resources Association (ELRA)
Place of publication
—
Event location
Torino, Italia
Event date
Jan 1, 2025
Type of event by nationality
WRD - Worldwide event
UT code for WoS article
—