Vše

Co hledáte?

Vše
Projekty
Výsledky výzkumu
Subjekty

Rychlé hledání

  • Projekty podpořené TA ČR
  • Významné projekty
  • Projekty s nejvyšší státní podporou
  • Aktuálně běžící projekty

Chytré vyhledávání

  • Takto najdu konkrétní +slovo
  • Takto z výsledků -slovo zcela vynechám
  • “Takto můžu najít celou frázi”

Spam filtering using integrated distribution-based balancing approach and regularized deep neural networks

Identifikátory výsledku

  • Kód výsledku v IS VaVaI

    <a href="https://www.isvavai.cz/riv?ss=detail&h=RIV%2F00216275%3A25410%2F18%3A39913376" target="_blank" >RIV/00216275:25410/18:39913376 - isvavai.cz</a>

  • Výsledek na webu

    <a href="https://link.springer.com/content/pdf/10.1007/s10489-018-1161-y.pdf" target="_blank" >https://link.springer.com/content/pdf/10.1007/s10489-018-1161-y.pdf</a>

  • DOI - Digital Object Identifier

    <a href="http://dx.doi.org/10.1007/s10489-018-1161-y" target="_blank" >10.1007/s10489-018-1161-y</a>

Alternativní jazyky

  • Jazyk výsledku

    angličtina

  • Název v původním jazyce

    Spam filtering using integrated distribution-based balancing approach and regularized deep neural networks

  • Popis výsledku v původním jazyce

    Rapid growth in the volume of unsolicited and unwanted messages has inspired the development of many anti-spam methods. Supervised anti-spam filters using machine-learning methods have been particularly effective in categorizing spam and non-spam messages. These automatically integrate spam corpora pre-processing, appropriate word lists selection, and the calculation of word weights, usually in a bag-of-words fashion. To develop an accurate spam filter is challenging because spammers attempt to decrease the probability of spam detection by using legitimate words. Complex models are therefore needed to solve such a problem. However, existing spam filtering methods usually converge to a poor local minimum, cannot effectively handle high-dimensional data and suffer from overfitting issues. To overcome these problems, we propose a novel spam filter integrating an N-gram tf.idf feature selection, modified distribution-based balancing algorithm and a regularized deep multi-layer perceptron NN model with rectified linear units (DBB-RDNN-ReL). As demonstrated on four benchmark spam datasets (Enron, SpamAssassin, SMS spam collection and Social networking), the proposed approach enables capturing more complex features from high-dimensional data by additional layers of neurons. Another advantage of this approach is that no additional dimensionality reduction is necessary and spam dataset imbalance is addressed using a modified distribution-based algorithm. We compare the performance of the approach with that of state-of-the-art spam filters (Minimum Description Length, Factorial Design using SVM and NB, Incremental Learning C4.5, and Random Forest, Voting and Convolutional Neural Network) and several machine learning algorithms commonly used to classify text. We show that the proposed model outperforms these other methods in terms of classification accuracy, with fewer false negatives and false positives. Notably, the proposed spam filter classifies both major (legitimate) and minor (spam) classes well on personalized / non-personalized and balanced / imbalanced spam datasets. In addition, we show that the proposed model performs better than the results reported by previous studies in terms of accuracy. However, the high computational expenses related to additional hidden layers limit its application as an online spam filter and make it difficult to overcome the problem of concept drift.

  • Název v anglickém jazyce

    Spam filtering using integrated distribution-based balancing approach and regularized deep neural networks

  • Popis výsledku anglicky

    Rapid growth in the volume of unsolicited and unwanted messages has inspired the development of many anti-spam methods. Supervised anti-spam filters using machine-learning methods have been particularly effective in categorizing spam and non-spam messages. These automatically integrate spam corpora pre-processing, appropriate word lists selection, and the calculation of word weights, usually in a bag-of-words fashion. To develop an accurate spam filter is challenging because spammers attempt to decrease the probability of spam detection by using legitimate words. Complex models are therefore needed to solve such a problem. However, existing spam filtering methods usually converge to a poor local minimum, cannot effectively handle high-dimensional data and suffer from overfitting issues. To overcome these problems, we propose a novel spam filter integrating an N-gram tf.idf feature selection, modified distribution-based balancing algorithm and a regularized deep multi-layer perceptron NN model with rectified linear units (DBB-RDNN-ReL). As demonstrated on four benchmark spam datasets (Enron, SpamAssassin, SMS spam collection and Social networking), the proposed approach enables capturing more complex features from high-dimensional data by additional layers of neurons. Another advantage of this approach is that no additional dimensionality reduction is necessary and spam dataset imbalance is addressed using a modified distribution-based algorithm. We compare the performance of the approach with that of state-of-the-art spam filters (Minimum Description Length, Factorial Design using SVM and NB, Incremental Learning C4.5, and Random Forest, Voting and Convolutional Neural Network) and several machine learning algorithms commonly used to classify text. We show that the proposed model outperforms these other methods in terms of classification accuracy, with fewer false negatives and false positives. Notably, the proposed spam filter classifies both major (legitimate) and minor (spam) classes well on personalized / non-personalized and balanced / imbalanced spam datasets. In addition, we show that the proposed model performs better than the results reported by previous studies in terms of accuracy. However, the high computational expenses related to additional hidden layers limit its application as an online spam filter and make it difficult to overcome the problem of concept drift.

Klasifikace

  • Druh

    J<sub>imp</sub> - Článek v periodiku v databázi Web of Science

  • CEP obor

  • OECD FORD obor

    10201 - Computer sciences, information science, bioinformathics (hardware development to be 2.2, social aspect to be 5.8)

Návaznosti výsledku

  • Projekt

  • Návaznosti

    S - Specificky vyzkum na vysokych skolach

Ostatní

  • Rok uplatnění

    2018

  • Kód důvěrnosti údajů

    S - Úplné a pravdivé údaje o projektu nepodléhají ochraně podle zvláštních právních předpisů

Údaje specifické pro druh výsledku

  • Název periodika

    Applied Intelligence

  • ISSN

    0924-669X

  • e-ISSN

  • Svazek periodika

    48

  • Číslo periodika v rámci svazku

    10

  • Stát vydavatele periodika

    NL - Nizozemsko

  • Počet stran výsledku

    19

  • Strana od-do

    3538-3556

  • Kód UT WoS článku

    000443262400021

  • EID výsledku v databázi Scopus

    2-s2.0-85044370954