Hybrid Outlier Detection Framework Based on Optimized KMeans and HDBSCAN Using Bat Algorithm and LSTM Autoencoder

  • Mai Abdelsamie, Information System Department, Faculty of Computers and Information, Suez Canal University, Ismailia, Egypt
  • Hosam Refaat, Information System Department, Faculty of Computers and Information, Suez Canal University, Ismailia, Egypt
  • Mohammed Abdallah, Information System Department, Faculty of Computers and Information, Suez Canal University, Ismailia, Egypt
  • Osama Farouk, Information System Department, Faculty of Computers and Information, Suez Canal University, Ismailia, Egypt
Keywords: Outlier Detection, Hybrid KMeans-HDBSCAN (BAT-optimized), IS-DBSCAN, Autoencoder, Cluster Catch Digraphs (CCDs), Outlyingness Score, Inbound/Outbound Score (IOS/OOS), Local Coulomb Outlier Factor (LCOF)

Abstract

Outlier detection is a critical task in data mining, especially in domains such as healthcare, cybersecurity, and fraud detection, where abnormal instances can carry crucial insights. Traditional approaches, including DBSCAN, Isolation Forest, and statistical techniques such as Z-Score and IQR, often suffer from sensitivity to parameters, limited adaptability, and reduced effectiveness on high-dimensional or complex data. To overcome these limitations, this paper proposes a hybrid outlier detection framework that combines KMeans clustering with HDBSCAN, enhanced through Bat Algorithm-based optimization for dynamic selection of the clustering parameters (eps and min_samples). The proposed method is evaluated alongside IS-DBSCAN, Autoencoders, and advanced graph-based approaches such as Cluster Catch Digraphs (CCDs) with Outbound and Inbound Outlyingness Scores (OOS and IOS). The study compares two outlier detection approaches on two real-world datasets, Online Retail and Diabetes 130-US hospitals: the first is a scalable Spark-based implementation of DBSCAN, and the second integrates KMeans clustering with HDBSCAN, optimized via the Bat Algorithm (KMeans + HDBSCAN (BAT)). Both are evaluated using Silhouette Score (SII) and classification Accuracy (Acc) as performance metrics, with a reported performance of F1 = 0.972 and AUPRC = 0.947.
Experimental results demonstrate that the proposed hybrid approach significantly outperforms the Spark-based DBSCAN in both clustering quality and classification performance. On the Diabetes dataset it achieves a Silhouette score of 0.67 and an accuracy of 66.8% (F1 = 0.66.2, AUC = 0.72.26%); on the Online Retail dataset it achieves a Silhouette score of 0.59 and an accuracy of 97.35%. On the MNIST dataset, the model achieved high performance (F1 = 0.92, AUC = 0.96), outperforming Isolation Forest, with notable improvements in clustering quality as the number of BAT iterations increased. These results highlight the effectiveness of integrating KMeans for initialization, HDBSCAN for density-based clustering, and the Bat Optimization algorithm for fine-tuning key parameters.
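As a rough illustration of the parameter-tuning step described above, the following is a minimal sketch of a basic Bat Algorithm searching over two clustering parameters. It is not the authors' implementation: the function names (bat_search, toy_objective), the search bounds, the step scale of the local random walk, and all default settings are assumptions made for this example. In particular, toy_objective is a hypothetical smooth stand-in for the real objective, which would presumably be a clustering-quality measure such as the negative Silhouette score of the KMeans + HDBSCAN pipeline.

```python
import math
import random


def bat_search(objective, bounds, n_bats=15, n_iter=60, f_min=0.0, f_max=2.0,
               alpha=0.9, gamma=0.9, r0=0.5, seed=0):
    """Minimise `objective` over the box `bounds` with a basic Bat Algorithm.

    All defaults are illustrative assumptions, not values from the paper.
    """
    rng = random.Random(seed)
    dim = len(bounds)

    def clip(x):
        # Keep every coordinate inside its [lo, hi] interval.
        return [min(max(v, lo), hi) for v, (lo, hi) in zip(x, bounds)]

    # Random initial positions, zero velocities, full loudness, zero pulse rate.
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_bats)]
    vel = [[0.0] * dim for _ in range(n_bats)]
    fit = [objective(p) for p in pos]
    loud = [1.0] * n_bats
    rate = [0.0] * n_bats

    best_i = min(range(n_bats), key=fit.__getitem__)
    best, best_fit = pos[best_i][:], fit[best_i]

    for t in range(1, n_iter + 1):
        mean_loud = sum(loud) / n_bats
        for i in range(n_bats):
            # Frequency-tuned velocity update relative to the current best.
            freq = f_min + (f_max - f_min) * rng.random()
            vel[i] = [v + (x - b) * freq
                      for v, x, b in zip(vel[i], pos[i], best)]
            cand = [x + v for x, v in zip(pos[i], vel[i])]

            if rng.random() > rate[i]:
                # Local random walk around the best solution; the 0.1 * width
                # step scale is an assumption chosen for this sketch.
                cand = [b + rng.uniform(-1.0, 1.0) * mean_loud * 0.1 * (hi - lo)
                        for b, (lo, hi) in zip(best, bounds)]

            cand = clip(cand)
            cand_fit = objective(cand)

            # Accept improvements with probability equal to the loudness,
            # then lower the loudness and raise the pulse rate.
            if cand_fit <= fit[i] and rng.random() < loud[i]:
                pos[i], fit[i] = cand, cand_fit
                loud[i] *= alpha
                rate[i] = r0 * (1.0 - math.exp(-gamma * t))

            if cand_fit < best_fit:
                best, best_fit = cand[:], cand_fit

    return best, best_fit


def toy_objective(params):
    """Hypothetical stand-in for a clustering-quality objective.

    A real objective would run the clustering with these parameters and
    return, for example, the negative Silhouette score.
    """
    eps, min_samples = params
    return (eps - 0.5) ** 2 + ((min_samples - 10.0) / 10.0) ** 2


# Assumed search ranges for (eps, min_samples); min_samples is treated as
# continuous here and would be rounded to an integer before clustering.
BOUNDS = [(0.05, 2.0), (2.0, 50.0)]

best, best_fit = bat_search(toy_objective, BOUNDS)
print("best (eps, min_samples):", best)
print("objective value:", best_fit)
```

Swapping toy_objective for a function that actually fits the clustering and scores the result would turn this sketch into a parameter search of the kind the abstract describes, at the cost of one clustering run per candidate evaluation.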
Published
2025-07-23
How to Cite
Abdelsamie, M., Refaat, H., Abdallah, M., & Farouk, O. (2025). Hybrid Outlier Detection Framework Based on Optimized KMeans and HDBSCAN Using Bat Algorithm and LSTM Autoencoder. Statistics, Optimization & Information Computing, 14(2), 970-1017. https://doi.org/10.19139/soic-2310-5070-2581
Section
Research Articles