Practical Web Crawling for Text Corpora
The result's identifiers
Result code in IS VaVaI
RIV/00216224:14330/11:00050166 - isvavai.cz (https://www.isvavai.cz/riv?ss=detail&h=RIV%2F00216224%3A14330%2F11%3A00050166)
Result on the web
—
DOI - Digital Object Identifier
—
Alternative languages
Result language
English
Original language name
Practical Web Crawling for Text Corpora
Original language description
SpiderLing--a web spider for linguistics--is new software for creating text corpora from the web, which we present in this article. Many documents on the web contain only material that is not useful for text corpora, such as lists of links, lists of products, and other kinds of text not composed of full sentences. In fact, such pages represent the vast majority of the web. Therefore, unrestricted web crawls typically download a lot of data that gets filtered out during post-processing, which makes web corpus collection inefficient. The aim of our work is to focus the crawling on the text-rich parts of the web and to maximize the number of words in the final corpus per downloaded megabyte. We present our preliminary results from creating web corpora of texts in Czech and Tajik.
Czech name
—
Czech description
—
Classification
Type
D - Article in proceedings
CEP classification
IN - Informatics
OECD FORD branch
—
Result continuities
Project
The result was created during the realization of more than one project.
Continuities
P - Research and development project financed from public funds (with a link to CEP)
S - Specific research at universities
Others
Publication year
2011
Confidentiality
S - Complete and true data on the project are not subject to protection under special legal regulations
Data specific for result type
Article name in the collection
Proceedings of Recent Advances in Slavonic Natural Language Processing, RASLAN 2011
ISBN
978-80-263-0077-9
ISSN
—
e-ISSN
—
Number of pages
11
Pages from-to
97-108
Publisher name
Tribun EU
Place of publication
Brno
Event location
Karlova Studánka, Czech Republic
Event date
Dec 2, 2011
Type of event by nationality
EUR - European event
UT code for WoS article
—