Topic modeling and classification of scientific disciplines

The result's identifiers

  • Result code in IS VaVaI

    RIV/67985955:_____/22:00566673 - isvavai.cz (https://www.isvavai.cz/riv?ss=detail&h=RIV%2F67985955%3A_____%2F22%3A00566673)

  • Result on the web

    https://doi.org/10.5281/zenodo.6957149

  • DOI - Digital Object Identifier

Alternative languages

  • Result language

    English

  • Original language name

    Topic modeling and classification of scientific disciplines

  • Original language description

    This paper evaluates whether Ph.D. theses can be classified into disciplines using a bottom-up empirical approach based on topic modeling. It examines a dataset of 334810 Ph.D. theses submitted at French universities between 2006 and 2020. In this comprehensive dataset, the variable “discipline” does not rely on any controlled vocabulary or disciplinary ontology. Consequently, the variable has 23057 unique labels, of which 14538 appear only once. This situation makes any full-scale analysis of the data from the perspective of scientific disciplines impossible. Our topic model is built on the abstracts of 285311 theses in French that include a title, keywords, and an abstract. After applying the TopSBM algorithm, we obtained a topic model with 7 levels of hierarchy. The outcomes of our classification experiments suggest that topics derived from purely textual data implicitly capture information about disciplines. This quality of topic modeling can be of great benefit when dealing with datasets where disciplinary information is unavailable or unreliable and where citation records are absent (as remains the case especially in the Humanities).

  • Czech name

  • Czech description

Classification

  • Type

    O - Miscellaneous

  • CEP classification

  • OECD FORD branch

    50803 - Information science (social aspects)

Result continuities

  • Project

    GJ20-01752Y: Funded and Unfunded Research in the Czech Republic: Scientometric Analysis and Topic Modeling

  • Continuities

    I - Institutional support for the long-term conceptual development of a research organization

Others

  • Publication year

    2022

  • Confidentiality

    S - The complete and truthful data on the project are not subject to protection under special legal regulations