A Machine Learning Framework for Identifying Sources of AI-Generated Text

Md. Sadiq Iqbal; Mohammod Abul Kashem

doi:10.19139/soic-2310-5070-2225

Md. Sadiq Iqbal, Bangladesh University
Mohammod Abul Kashem, Dhaka University of Engineering and Technology

Keywords: Text Classification, Manual Extraction, Deep Learning, Feature Optimization, Explainable Artificial Intelligence, Natural Language Processing

Abstract

The rise of AI-generated text calls for efficient methods to identify its origin. This research presents a dataset of responses to various questions posed to AI models (ChatGPT, Gemini, DeepAI, and Bing) and to human respondents. We preprocessed the dataset and converted the text into features using both manual methods, namely Count Vector (CV), Bag of Words (BoW), and Hashing Vectorization (HV), and automated Deep Learning (DL) models, namely Bidirectional Encoder Representations from Transformers (BERT), Extreme Language Understanding Network (XLNet), Enhanced Representation through Knowledge Integration (ERNIE), and Generative Pre-Trained Transformers (GPT). These features were then used to train multiple Machine Learning (ML) classifiers: Support Vector Machines (SVM), Logistic Regression (LR), Decision Trees (DT), Random Forests (RF), Naive Bayes (NB), and Extreme Gradient Boosting (XGB). We also applied Linear Discriminant Analysis (LDA) and Principal Component Analysis (PCA) to maximize the classification accuracy of the ML models. The combination of HV with LDA and XGB achieved the highest accuracy, 99.4%. Further evaluation using precision, recall, F1 score, and specificity, together with the Confusion Matrix (CM) and the Receiver Operating Characteristic (ROC) curve, confirmed its superior performance. Explainable Artificial Intelligence (XAI) tools, namely Shapley Additive Explanations (SHAP) and Local Interpretable Model-Agnostic Explanations (LIME), were employed to explain the model's outputs and support transparency and interpretability.
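To make the Hashing Vectorization (HV) step named in the abstract more concrete, the following is a minimal sketch of the general hashing trick, not the authors' implementation. The abstract does not say which library, hash function, tokenizer, or feature dimension was used, so the function name `hashing_vectorize`, the whitespace tokenizer, the CRC32 hash, and `n_features = 16` are all assumptions chosen for illustration.

```python
import zlib
from typing import List


def hashing_vectorize(text: str, n_features: int = 16) -> List[int]:
    """Map a text to a fixed-length token-count vector using the hashing trick.

    Hypothetical helper for illustration only; the tokenizer, hash function,
    and default dimension are assumptions, not details taken from the paper.
    """
    vector = [0] * n_features
    for token in text.lower().split():
        # CRC32 is deterministic across runs, unlike Python's built-in hash().
        index = zlib.crc32(token.encode("utf-8")) % n_features
        vector[index] += 1  # colliding tokens share a slot and add up
    return vector


# Two made-up example responses (not drawn from the paper's dataset).
docs = ["the model wrote this answer", "a person wrote this answer"]
vectors = [hashing_vectorize(d) for d in docs]
```

Each text becomes a vector of the same length without storing a vocabulary, which is what lets the resulting features feed a later dimensionality-reduction step and classifier. Production implementations such as scikit-learn's `HashingVectorizer` differ from this sketch: that class uses a signed MurmurHash3 hash and normalizes the output by default.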

Published

2025-04-23

How to Cite

Md. Sadiq Iqbal, & Mohammod Abul Kashem. (2025). A Machine Learning Framework for Identifying Sources of AI-Generated Text. Statistics, Optimization & Information Computing, 13(5), 2186-2204. https://doi.org/10.19139/soic-2310-5070-2225

Issue

Vol 13 No 5 (2025)

Section

Research Articles

Authors who publish with this journal agree to the following terms:

Authors retain copyright and grant the journal right of first publication with the work simultaneously licensed under a Creative Commons Attribution License that allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.
Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgement of its initial publication in this journal.
Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work (See The Effect of Open Access).