Website Properties in Relation to the Quality of Text Extracted for Web Corpora

The result's identifiers

  • Result code in IS VaVaI

    RIV/00216224:14330/21:00123254 - isvavai.cz (https://www.isvavai.cz/riv?ss=detail&h=RIV%2F00216224%3A14330%2F21%3A00123254)

  • Result on the web

    https://nlp.fi.muni.cz/raslan/2021/paper19.pdf

  • DOI - Digital Object Identifier

Alternative languages

  • Result language

    English

  • Original language name

    Website Properties in Relation to the Quality of Text Extracted for Web Corpora

  • Original language description

    In this paper we present our research on the relation between two properties of websites and the quality of the text extracted from them, in the context of crawling the web and building large web corpora. A manual classification of the text quality of 18 thousand websites in 21 European languages was used to verify our assumption that certain web domain properties can be used to identify potential sources of poor-quality content. The first property is the distance of a web domain from the seed domains in a web crawl. The second property studied in this work is the length of the website name. Although these properties were recommended to help identify good-quality websites in our previous work, in this paper we show that there is only a small difference between the quality of text-rich web domains with various seed distances or name lengths. This conclusion holds for post-crawling text processing when the web crawl is started with a large number of seed domains.

  • Czech name

  • Czech description

Classification

  • Type

    D - Article in proceedings

  • CEP classification

  • OECD FORD branch

    10200 - Computer and information sciences

Result continuities

  • Project

    LM2018101: Digital Research Infrastructure for the Language Technologies, Arts and Humanities

  • Continuities

    P - Research and development project financed from public funds (with a link to CEP)

Others

  • Publication year

    2021

  • Confidentiality

    S - Complete and truthful data on the project are not subject to protection under special legal regulations

Data specific for result type

  • Article name in the collection

    Recent Advances in Slavonic Natural Language Processing (RASLAN 2021)

  • ISBN

    9788026316701

  • ISSN

    2336-4289

  • e-ISSN

  • Number of pages

    9

  • Pages from-to

    167-175

  • Publisher name

    Tribun EU

  • Place of publication

    Brno

  • Event location

    Brno

  • Event date

    Jan 1, 2021

  • Type of event by nationality

    EUR - European event

  • UT code for WoS article