Py_ape: Text Data Acquiring, Extracting, Cleaning and Schema Matching in Python

The result's identifiers

  • Result code in IS VaVaI

    RIV/61989100:27240/20:10246988 - isvavai.cz (https://www.isvavai.cz/riv?ss=detail&h=RIV%2F61989100%3A27240%2F20%3A10246988)

  • Result on the web

    https://link.springer.com/chapter/10.1007%2F978-981-33-4370-2_6

  • DOI - Digital Object Identifier

    10.1007/978-981-33-4370-2_6 (http://dx.doi.org/10.1007/978-981-33-4370-2_6)

Alternative languages

  • Result language

    English

  • Original language name

    Py_ape: Text Data Acquiring, Extracting, Cleaning and Schema Matching in Python

  • Original language description

    Py_ape is a Python package that integrates a number of string and text processing algorithms for collecting, extracting, and cleaning text data from websites, creating frames for text corpora, matching entities, and matching, mapping, and merging two schemas. The functions of Py_ape help the user perform data integration and data preparation step by step, building on some popular Python libraries. In particular, in the entity matching function of the schema matching and merging phase, we used the Hamming distance algorithm to identify similar string pairs and the longest common substring similarity algorithm to map data between the columns of the schemas. These algorithms help to increase the accuracy of the schema matching process. In addition, the article presents experimental results of using Py_ape to scrape, clean, match, and merge two data sets on aviation crashes, taken from two different sources, Kaggle and Wikipedia. The results of the experiment are evaluated in detail in the rest of the paper. © 2020, Springer Nature Singapore Pte Ltd.

  • Czech name

  • Czech description
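
Illustrative note on the original-language description above: it names two string measures, the Hamming distance and a longest common substring similarity. The following is a minimal Python sketch of both, not the Py_ape implementation. The function names, the rejection of unequal-length inputs, and the normalisation by the longer string's length are all assumptions made for illustration; the record does not say how Py_ape handles these details.

```python
def hamming_distance(a: str, b: str) -> int:
    """Count positions at which two equal-length strings differ."""
    if len(a) != len(b):
        # Assumption: unequal lengths are rejected rather than padded.
        raise ValueError("Hamming distance requires equal-length strings")
    return sum(ch_a != ch_b for ch_a, ch_b in zip(a, b))


def longest_common_substring(a: str, b: str) -> str:
    """Return the first longest contiguous substring shared by a and b."""
    best_len = 0  # length of the best match found so far
    best_end = 0  # end index (exclusive) of that match in a
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the common run that ended at a[i-2], b[j-2].
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len = curr[j]
                    best_end = i
        prev = curr
    return a[best_end - best_len:best_end]


def lcs_similarity(a: str, b: str) -> float:
    """Hypothetical normalisation: match length over the longer input."""
    if not a or not b:
        return 0.0
    return len(longest_common_substring(a, b)) / max(len(a), len(b))


if __name__ == "__main__":
    print(hamming_distance("karolin", "kathrin"))
    print(longest_common_substring("crash_date", "date_of_crash"))
    print(lcs_similarity("crash_date", "date_of_crash"))
```

Under these assumptions, two column names such as "crash_date" and "date_of_crash" share the substring "crash", so a score like this could be used to rank candidate column pairs when mapping one schema onto another.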

Classification

  • Type

    D - Article in proceedings

  • CEP classification

  • OECD FORD branch

    10201 - Computer sciences, information science, bioinformatics (hardware development to be 2.2, social aspect to be 5.8)

Result continuities

  • Project

  • Continuities

    S - Specific research at universities

Others

  • Publication year

    2020

  • Confidentiality

    S - Complete and truthful data on the project are not subject to protection under special legal regulations

Data specific for result type

  • Article name in the collection

    Communications in Computer and Information Science. Volume 1306

  • ISBN

    978-981-334-369-6

  • ISSN

    1865-0929

  • e-ISSN

    1865-0937

  • Number of pages

    12

  • Pages from-to

    78-89

  • Publisher name

    Springer

  • Place of publication

    Singapore

  • Event location

    Quy Nhon

  • Event date

    Nov 25, 2020

  • Type of event by nationality

    WRD - Worldwide event

  • UT code for WoS article