Comparison of robust machine learning algorithms on outliers and imbalanced spam data

Dodo Zaenal Abidin; Jasmir Jasmir; Errisya Rasywir; Agus Siswanto

doi:10.11591/ijeecs.v39.i2.pp1130-1144

Abstract

Effective spam detection is essential for data security, user experience, and organizational trust. However, outliers and class imbalance can degrade the performance of machine learning models for spam classification. Previous studies have focused on feature selection and ensemble learning but have not explicitly examined the combined effects of outliers and class imbalance. This study evaluates the performance of random forest (RF), gradient boosting (GB), and extreme gradient boosting (XGBoost) under four experimental scenarios: (i) without synthetic minority over-sampling technique (SMOTE) and without outliers, (ii) without SMOTE but with outliers, (iii) with SMOTE and without outliers, and (iv) with SMOTE and with outliers. Results show that XGBoost achieves the highest accuracy (96%), an area under the receiver operating characteristic curve (AUC-ROC) of 0.9928, and the fastest computation time (0.6184 seconds) in the scenario with SMOTE and without outliers. RF attained an AUC-ROC of 0.9920, while GB achieved 0.9876 but required more processing time. These findings emphasize the need to address class imbalance and outliers in spam detection models. This study contributes to the development of more robust spam filtering techniques and provides a benchmark for future improvements. By systematically evaluating these factors, it lays a foundation for designing more effective spam detection frameworks that can adapt to imbalanced and noisy real-world data.
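The four scenarios in the abstract form a 2x2 grid over two data-preparation choices: whether SMOTE is applied and whether outliers are kept. The following is a minimal sketch of how such a grid could be built, not the authors' pipeline. The abstract does not say how outliers were identified, so the 1.5 x IQR rule used here is an assumption, as are the simplified SMOTE routine, the toy data, and every function name (`iqr_filter`, `smote`, `build_scenarios`).

```python
import itertools
import math
import random
import statistics


def iqr_filter(rows, labels, k=1.5):
    """Drop rows with any feature outside [Q1 - k*IQR, Q3 + k*IQR].

    Assumption: the abstract does not name an outlier rule; IQR is a stand-in.
    """
    bounds = []
    for j in range(len(rows[0])):
        q1, _, q3 = statistics.quantiles([r[j] for r in rows], n=4)
        iqr = q3 - q1
        bounds.append((q1 - k * iqr, q3 + k * iqr))
    kept = [
        (r, y) for r, y in zip(rows, labels)
        if all(lo <= v <= hi for v, (lo, hi) in zip(r, bounds))
    ]
    return [r for r, _ in kept], [y for _, y in kept]


def smote(rows, labels, minority=1, k=3, seed=0):
    """Simplified SMOTE for binary labels: interpolate between a minority
    row and one of its k nearest minority neighbours until classes balance."""
    rng = random.Random(seed)
    minority_rows = [r for r, y in zip(rows, labels) if y == minority]
    n_needed = (len(rows) - len(minority_rows)) - len(minority_rows)
    if n_needed <= 0 or len(minority_rows) < 2:
        return list(rows), list(labels)
    synthetic = []
    for _ in range(n_needed):
        base = rng.choice(minority_rows)
        neighbours = sorted(
            (r for r in minority_rows if r is not base),
            key=lambda r: math.dist(r, base),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return list(rows) + synthetic, list(labels) + [minority] * n_needed


def build_scenarios(rows, labels):
    """Return {(use_smote, keep_outliers): (X, y)} for the four scenarios.

    Assumption: outliers are removed before oversampling. In a real
    experiment SMOTE should be fitted on the training split only.
    """
    scenarios = {}
    for use_smote, keep_outliers in itertools.product([False, True], repeat=2):
        X, y = (rows, labels) if keep_outliers else iqr_filter(rows, labels)
        if use_smote:
            X, y = smote(X, y)
        scenarios[(use_smote, keep_outliers)] = (X, y)
    return scenarios


# Hypothetical toy data: 40 "ham" rows on a grid, 8 "spam" rows, 1 extreme row.
majority = [[(i % 8) * 0.5, (i // 8) * 0.5] for i in range(40)]
minority = [[1.0 + 0.25 * i, 1.0 + 0.1 * i] for i in range(8)]
rows = majority + minority + [[100.0, 100.0]]
labels = [0] * 40 + [1] * 8 + [0]

scenarios = build_scenarios(rows, labels)
for (use_smote, keep_outliers), (X, y) in scenarios.items():
    print(f"SMOTE={use_smote} outliers={keep_outliers}: "
          f"n={len(X)} ham={y.count(0)} spam={y.count(1)}")
```

Each `(X, y)` pair would then be split into training and test sets and passed to the three classifiers, recording accuracy, AUC-ROC, and training time per scenario. Keeping the two preparation choices as independent flags makes it easy to attribute a change in a metric to SMOTE, to outlier handling, or to their interaction.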

Keywords

Comparison; Imbalanced data; Machine learning; Outliers; Spam

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Indonesian Journal of Electrical Engineering and Computer Science (IJEECS)
p-ISSN: 2502-4752, e-ISSN: 2502-4760
This journal is published by the Institute of Advanced Engineering and Science (IAES).
