An integrated model based on deep learning classifiers and pre-trained transformer for phishing URL detection
The result's identifiers
Result code in IS VaVaI
<a href="https://www.isvavai.cz/riv?ss=detail&h=RIV%2F62690094%3A18450%2F24%3A50021595" target="_blank" >RIV/62690094:18450/24:50021595 - isvavai.cz</a>
Alternative codes found
RIV/29142890:_____/24:00048145
Result on the web
<a href="https://www.sciencedirect.com/science/article/pii/S0167739X24003315?via%3Dihub" target="_blank" >https://www.sciencedirect.com/science/article/pii/S0167739X24003315?via%3Dihub</a>
DOI - Digital Object Identifier
<a href="http://dx.doi.org/10.1016/j.future.2024.06.031" target="_blank" >10.1016/j.future.2024.06.031</a>
Alternative languages
Result language
English
Original language name
An integrated model based on deep learning classifiers and pre-trained transformer for phishing URL detection
Original language description
The unique nature of website URLs makes phishing detection a challenging task. Unlike natural language, URLs are unstructured, with non-linear and sophisticated correlations. Therefore, they should be handled both as natural language and as unstructured data sequences. However, current solutions for phishing URL detection focus on only a single aspect of web page URLs. To address this, this paper proposes an integrated model based on deep learning (DL) classifiers and a pre-trained transformer that examines both the unique nature and the natural language structure of URL sequences simultaneously. The proposed model consists of three modules: RasNet (Keras-ResNet), TCMA (TCN-MHSA), and MPNet (Masked and Permuted Pre-training for Language Understanding). Considering the unique nature of the input data, RasNet combines two Keras embedding techniques to obtain the feature representations of URLs and then fuses them using a Residual Network (ResNet) to balance the weight distribution between the character-level and word-level information. Additionally, TCMA integrates the Temporal Convolutional Network (TCN) with the Multi-Head Self-Attention (MHSA) mechanism to optimize feature extraction and improve classification accuracy. Concurrently, MPNet combines the advantages of Masked Language Modelling and Permuted Language Modelling while eliminating their drawbacks, in order to examine the natural language structure of web page URLs. The proposed model was trained and tested on four different datasets: Ebbu2017, PhishCrawl, 420K-PD, and 1M-PD. The experimental results indicated that the proposed solution outperformed other models in classifying malicious URLs, with the highest detection rate of 99.71% on the 1M-PD dataset, improving on the accuracy of state-of-the-art approaches by 1.37% to 2.01%. © 2024
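The description refers to character-level and word-level information extracted from a URL. As a rough illustration only, the following Python sketch shows one way those two views of a URL could be turned into padded integer sequences before being passed to embedding layers. The tokenization rule, the function names (`char_tokens`, `word_tokens`, `build_vocab`, `encode`), the id scheme, and the example URL are all assumptions made here for illustration; this record does not specify the preprocessing used in the paper.

```python
import re


def char_tokens(url):
    """Character-level view: every character is one token."""
    return list(url)


def word_tokens(url):
    """Word-level view: split on any run of non-alphanumeric characters
    (an assumed delimiter rule, not taken from the paper)."""
    return [w for w in re.split(r"[^A-Za-z0-9]+", url) if w]


def build_vocab(token_lists):
    """Assign integer ids starting at 2, in order of first appearance.
    Ids 0 and 1 are reserved for padding and unknown tokens."""
    vocab = {}
    for tokens in token_lists:
        for token in tokens:
            vocab.setdefault(token, len(vocab) + 2)
    return vocab


def encode(tokens, vocab, max_len):
    """Map tokens to ids (1 = unknown), truncate to max_len, pad with 0."""
    ids = [vocab.get(token, 1) for token in tokens][:max_len]
    return ids + [0] * (max_len - len(ids))


# Hypothetical example URL.
url = "http://example.com/login"
words = word_tokens(url)
chars = char_tokens(url)

word_vocab = build_vocab([words])
char_vocab = build_vocab([chars])

word_ids = encode(words, word_vocab, max_len=6)
char_ids = encode(chars, char_vocab, max_len=32)
```

In a pipeline of the kind the description outlines, two such sequences would each feed a separate embedding layer before fusion; the fusion, TCN, attention, and MPNet stages are not sketched here.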
Czech name
—
Czech description
—
Classification
Type
J<sub>imp</sub> - Article in a specialist periodical, which is included in the Web of Science database
CEP classification
—
OECD FORD branch
10201 - Computer sciences, information science, bioinformatics (hardware development to be 2.2, social aspect to be 5.8)
Result continuities
Project
—
Continuities
S - Specific university research
Others
Publication year
2024
Confidentiality
S - Complete and truthful data on the project are not subject to protection under special legal regulations
Data specific for result type
Name of the periodical
Future Generation Computer Systems
ISSN
0167-739X
e-ISSN
1872-7115
Volume of the periodical
161
Issue of the periodical within the volume
December
Country of publishing house
NL - THE KINGDOM OF THE NETHERLANDS
Number of pages
17
Pages from-to
269-285
UT code for WoS article
001280731500001
EID of the result in the Scopus database
2-s2.0-85199275329