Hybrid Outlier Detection Framework Based on Optimized KMeans and HDBSCAN Using Bat Algorithm and LSTM Autoencoder
Keywords:
Outlier Detection, hybrid KMeans HDBSCAN (BAT optimized), IS-DBSCAN, Autoencoder, Cluster Catch Digraphs (CCDs), Outlyingness Score, Inbound/Outbound Score (IOS/OOS), Local Coulomb Outlier Factor (LCOF)
Abstract
Outlier detection is a critical task in data mining, especially in domains such as healthcare, cybersecurity, and fraud detection, where abnormal instances can carry crucial insights. Traditional approaches, including DBSCAN, Isolation Forest, and statistical techniques such as Z-Score and IQR, often suffer from sensitivity to parameters, limited adaptability, and reduced effectiveness on high-dimensional or complex data. To overcome these limitations, this paper proposes a hybrid outlier detection framework that combines KMeans clustering with HDBSCAN, enhanced through Bat Algorithm-based optimization for dynamic selection of the clustering parameters (eps and minsamples). The proposed method is evaluated alongside IS-DBSCAN, Autoencoders, and advanced graph-based approaches such as Cluster Catch Digraphs (CCDs) with Outbound and Inbound Outlyingness Scores (OOS and IOS). The study compares two advanced outlier detection approaches on two real-world datasets: the Online Retail dataset and the Diabetes 130-US hospitals dataset. The first approach uses a scalable Spark-based DBSCAN algorithm, while the second integrates KMeans clustering with HDBSCAN, optimized via the Bat Algorithm (KMeans + HDBSCAN (BAT)). Both methods are evaluated using Silhouette Score (SII) and classification Accuracy (Acc) as performance metrics, with reported performance of F1 = 0.972 and AUPRC = 0.947.
Experimental results demonstrate that the proposed hybrid approach significantly outperforms the Spark-based DBSCAN in both clustering quality and classification performance, achieving a Silhouette score of 0.67 and an accuracy of 66.8% on the Diabetes dataset (F1 = 0.66.2, AUC = 0.72.26%), and 0.59 and 97.35%, respectively, on the Online Retail dataset. On the MNIST dataset, the model achieved high performance (F1 = 0.92, AUC = 0.96), outperforming Isolation Forest, with notable improvements in clustering quality as BAT iterations increased. These results highlight the effectiveness of integrating KMeans for initialization, HDBSCAN for density-based clustering, and the Bat Optimization algorithm for fine-tuning key parameters.
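To make the parameter-tuning step concrete, the following is a minimal sketch of a basic Bat Algorithm search over two clustering parameters. It is not the authors' implementation: the function name `bat_search`, all default values (population size, frequency range, `alpha`, `gamma`, `r0`), the parameter bounds, and the objective are assumptions chosen for illustration. In the setting the abstract describes, the objective would presumably be a clustering-quality measure such as the negative Silhouette score of the KMeans + HDBSCAN result for a candidate (eps, minsamples) pair; here a hypothetical quadratic stands in so the sketch runs on the Python standard library alone.

```python
import math
import random


def bat_search(objective, bounds, n_bats=30, n_iter=200,
               f_min=0.0, f_max=2.0, alpha=0.9, gamma=0.9, r0=0.5, seed=0):
    """Minimise `objective` over the box `bounds` with a basic Bat Algorithm.

    Returns (best_position, best_value, history), where history[t] is the
    best value known at the end of iteration t. All defaults are
    illustrative assumptions, not values taken from the paper.
    """
    rng = random.Random(seed)
    dim = len(bounds)

    def clip(x):
        # Keep every coordinate inside its [lo, hi] interval.
        return [min(max(v, lo), hi) for v, (lo, hi) in zip(x, bounds)]

    # Random initial positions, zero velocities, full loudness, zero pulse rate.
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_bats)]
    vel = [[0.0] * dim for _ in range(n_bats)]
    fit = [objective(p) for p in pos]
    loud = [1.0] * n_bats
    rate = [0.0] * n_bats

    best_i = min(range(n_bats), key=fit.__getitem__)
    best, best_val = pos[best_i][:], fit[best_i]
    history = []

    for t in range(1, n_iter + 1):
        mean_loud = sum(loud) / n_bats
        for i in range(n_bats):
            # Frequency-tuned velocity update relative to the current best.
            freq = f_min + (f_max - f_min) * rng.random()
            vel[i] = [v + (x - b) * freq
                      for v, x, b in zip(vel[i], pos[i], best)]
            cand = [x + v for x, v in zip(pos[i], vel[i])]
            # With probability (1 - pulse rate), walk locally around the best.
            if rng.random() > rate[i]:
                cand = [b + rng.uniform(-1.0, 1.0) * mean_loud for b in best]
            cand = clip(cand)
            cand_val = objective(cand)
            # Accept a non-worsening move with probability equal to loudness,
            # then lower the loudness and raise the pulse rate toward r0.
            if cand_val <= fit[i] and rng.random() < loud[i]:
                pos[i], fit[i] = cand, cand_val
                loud[i] *= alpha
                rate[i] = r0 * (1.0 - math.exp(-gamma * t))
            # Track the best candidate seen so far, accepted or not.
            if cand_val < best_val:
                best, best_val = cand[:], cand_val
        history.append(best_val)
    return best, best_val, history


def toy_objective(params):
    # Hypothetical stand-in for "negative clustering quality": smallest at
    # eps = 0.5 and minsamples = 10. A real objective would run the
    # clustering pipeline and score its output.
    eps, minsamples = params
    return (eps - 0.5) ** 2 + ((minsamples - 10.0) / 10.0) ** 2


# Assumed search ranges for (eps, minsamples).
search_bounds = [(0.05, 2.0), (2.0, 50.0)]
best, best_val, history = bat_search(toy_objective, search_bounds)
print("best (eps, minsamples):", best, "objective:", best_val)
```

Because the search is continuous, an integer-valued parameter such as minsamples would need to be rounded inside the objective before it is passed to a clustering routine. The returned `history` gives a simple way to check how the best objective value changes as BAT iterations increase.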
Published
2025-07-23
How to Cite
Abdelsamie, M., Refaat, H., Abdallah, M., & Farouk, O. (2025). Hybrid Outlier Detection Framework Based on Optimized KMeans and HDBSCAN Using Bat Algorithm and LSTM Autoencoder. Statistics, Optimization & Information Computing, 14(2), 970-1017. https://doi.org/10.19139/soic-2310-5070-2581
Section
Research Articles
Authors who publish with this journal agree to the following terms:
- Authors retain copyright and grant the journal right of first publication with the work simultaneously licensed under a Creative Commons Attribution License that allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.
- Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgement of its initial publication in this journal.
- Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work (See The Effect of Open Access).