Py_ape: Text Data Acquiring, Extracting, Cleaning and Schema Matching in Python
The result's identifiers
Result code in IS VaVaI
<a href="https://www.isvavai.cz/riv?ss=detail&h=RIV%2F61989100%3A27240%2F20%3A10246988" target="_blank" >RIV/61989100:27240/20:10246988 - isvavai.cz</a>
Result on the web
<a href="https://link.springer.com/chapter/10.1007%2F978-981-33-4370-2_6" target="_blank" >https://link.springer.com/chapter/10.1007%2F978-981-33-4370-2_6</a>
DOI - Digital Object Identifier
<a href="http://dx.doi.org/10.1007/978-981-33-4370-2_6" target="_blank" >10.1007/978-981-33-4370-2_6</a>
Alternative languages
Result language
English
Original language name
Py_ape: Text Data Acquiring, Extracting, Cleaning and Schema Matching in Python
Original language description
Py_ape is a Python package that integrates a number of string and text processing algorithms for collecting, extracting, and cleaning text data from websites; creating frames for text corpora; matching entities; and matching, mapping, and merging two schemas. Its functions guide the user step by step through data integration and data preparation, building on popular Python libraries. In particular, in the entity matching function of the schema matching and merging phase, we use the Hamming distance algorithm to identify similar string pairs and the longest common substring similarity algorithm to map data between the columns of the schemas. These algorithms help to increase the accuracy of the schema matching process. In addition, the article presents experimental results of using Py_ape to scrape, clean, match, and merge two sets of data on aviation crashes, taken from two different sources, Kaggle and Wikipedia. The results of the experiment are evaluated in detail in the rest of the paper. (C) 2020, Springer Nature Singapore Pte Ltd.
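The two string measures named in the description can be illustrated with a minimal sketch using only the Python standard library. This is not Py_ape's own code or API: the function names `hamming_distance`, `longest_common_substring`, and `lcs_similarity` are hypothetical, and the normalization of the similarity score (substring length divided by the longer string's length) is an assumption, since the record does not state how the package normalizes it.

```python
from difflib import SequenceMatcher


def hamming_distance(a: str, b: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length strings")
    return sum(ch_a != ch_b for ch_a, ch_b in zip(a, b))


def longest_common_substring(a: str, b: str) -> str:
    """Return the longest contiguous run of characters shared by a and b."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return a[match.a:match.a + match.size]


def lcs_similarity(a: str, b: str) -> float:
    """Assumed normalization: substring length over the longer string's length."""
    if not a or not b:
        return 0.0
    return len(longest_common_substring(a, b)) / max(len(a), len(b))


if __name__ == "__main__":
    # Illustrative values only, not data from the paper's experiment.
    print(hamming_distance("karolin", "kathrin"))
    print(longest_common_substring("operator", "operator_name"))
    print(round(lcs_similarity("operator", "operator_name"), 3))
```

Under these assumptions, a low Hamming distance flags near-identical strings of equal length, while the substring score can suggest which column names in two schemas refer to the same attribute.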
Czech name
—
Czech description
—
Classification
Type
D - Article in proceedings
CEP classification
—
OECD FORD branch
10201 - Computer sciences, information science, bioinformatics (hardware development to be 2.2, social aspect to be 5.8)
Result continuities
Project
—
Continuities
S - Specific university research
Others
Publication year
2020
Confidentiality
S - Complete and truthful data on the project are not subject to protection under special legal regulations
Data specific for result type
Article name in the collection
Communications in Computer and Information Science. Volume 1306
ISBN
978-981-334-369-6
ISSN
1865-0929
e-ISSN
1865-0937
Number of pages
12
Pages from-to
78-89
Publisher name
Springer
Place of publication
Singapore
Event location
Quy Nhon
Event date
Nov 25, 2020
Type of event by nationality
WRD - Worldwide event
UT code for WoS article
—