WTF-LOD - A New Resource for Large-Scale NER Evaluation
The result's identifiers
Result code in IS VaVaI
<a href="https://www.isvavai.cz/riv?ss=detail&h=RIV%2F00216305%3A26230%2F16%3APU121611" target="_blank" >RIV/00216305:26230/16:PU121611 - isvavai.cz</a>
Result on the web
—
DOI - Digital Object Identifier
—
Alternative languages
Result language
English
Original language name
WTF-LOD - A New Resource for Large-Scale NER Evaluation
Original language description
This paper introduces the Web TextFull linkage to Linked Open Data (WTF-LOD) dataset intended for large-scale evaluation of named entity recognition (NER) systems. First, we present the process of collecting data from the largest publicly available textual corpora, including Wikipedia dumps, monthly runs of the CommonCrawl, and ClueWeb09/12. We discuss similarities to and differences from related initiatives such as WikiLinks and WikiReverse. Our work primarily focuses on links from "textfull" documents (links surrounded by text that provides a useful context for entity linking), de-duplication of the data, and advanced cleaning procedures. The presented statistics demonstrate that the collected data forms one of the largest available resources of its kind. They also show the suitability of the result for complex NER evaluation campaigns, and include an analysis of the most ambiguous name mentions appearing in the data.
Czech name
—
Czech description
—
Classification
Type
D - Article in proceedings
CEP classification
—
OECD FORD branch
10201 - Computer sciences, information science, bioinformatics (hardware development to be 2.2, social aspect to be 5.8)
Result continuities
Project
<a href="/en/project/LQ1602" target="_blank" >LQ1602: IT4Innovations excellence in science</a>
Continuities
P - Research and development project financed from public funds (with a link to CEP)
Others
Publication year
2016
Confidentiality
S - Complete and truthful data on the project are not subject to protection under special legal regulations
Data specific for result type
Article name in the collection
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
ISBN
978-2-9517408-9-1
ISSN
—
e-ISSN
—
Number of pages
4
Pages from-to
3299-3302
Publisher name
European Language Resources Association
Place of publication
Portorož
Event location
Portorož
Event date
May 23, 2016
Type of event by nationality
WRD - Worldwide event
UT code for WoS article
000526952503084