MQDD: Pre-training of Multimodal Question Duplicity Detection for Software Engineering Domain
The result's identifiers
Result code in IS VaVaI
<a href="https://www.isvavai.cz/riv?ss=detail&h=RIV%2F49777513%3A23520%2F23%3A43970053" target="_blank" >RIV/49777513:23520/23:43970053 - isvavai.cz</a>
Result on the web
<a href="https://aclanthology.org/2023.ranlp-1.89/" target="_blank" >https://aclanthology.org/2023.ranlp-1.89/</a>
DOI - Digital Object Identifier
<a href="http://dx.doi.org/10.26615/978-954-452-092-2_089" target="_blank" >10.26615/978-954-452-092-2_089</a>
Alternative languages
Result language
English
Original language name
MQDD: Pre-training of Multimodal Question Duplicity Detection for Software Engineering Domain
Original language description
This work proposes a new pipeline that leverages data collected from the Stack Overflow website to pre-train a multimodal model for finding duplicates on question-answering websites. Our multimodal model is trained on question descriptions and source code in multiple programming languages. We design two new learning objectives to improve duplicate detection capabilities. The result of this work is a mature, fine-tuned Multimodal Question Duplicity Detection (MQDD) model, ready to be integrated into a Stack Overflow search system, where it can help users find answers to already answered questions. Alongside the MQDD model, we release two datasets related to the software engineering domain. The first, the Stack Overflow Dataset (SOD), is a massive corpus of paired questions and answers. The second, the Stack Overflow Duplicity Dataset (SODD), contains data for training duplicate detection models.
Czech name
—
Czech description
—
Classification
Type
D - Article in proceedings
CEP classification
—
OECD FORD branch
10201 - Computer sciences, information science, bioinformatics (hardware development to be 2.2, social aspect to be 5.8)
Result continuities
Project
—
Continuities
S - Specific research at universities
Others
Publication year
2023
Confidentiality
S - Complete and truthful project data are not subject to protection under special legal regulations
Data specific for result type
Article name in the collection
Deep Learning for Natural Language Processing Methods and Applications
ISBN
978-954-452-092-2
ISSN
—
e-ISSN
2603-2813
Number of pages
12
Pages from-to
824-835
Publisher name
INCOMA Ltd.
Place of publication
Shoumen
Event location
Varna
Event date
Sep 4, 2023
Type of event by nationality
WRD - Worldwide event
UT code for WoS article
—