A Machine Learning Framework for Identifying Sources of AI-Generated Text

  • Md. Sadiq Iqbal, Bangladesh University
  • Mohammod Abul Kashem, Dhaka University of Engineering and Technology
Keywords: Text Classification, Manual Extraction, Deep Learning, Feature Optimization, Explainable Artificial Intelligence, Natural Language Processing

Abstract

The rise of AI-generated text calls for efficient methods to identify its origin. This research presents a comprehensive dataset built from responses to various questions posed to AI models, including ChatGPT, Gemini, DeepAI, and Bing, and to human respondents. We preprocess the dataset and convert the text into features using both manual methods, such as Count Vector (CV), Bag of Words (BoW), and Hashing Vectorization (HV), and automated Deep Learning (DL) models, such as Bidirectional Encoder Representations from Transformers (BERT), Extreme Language understanding Network (XLNet), Enhanced Representation through Knowledge Integration (ERNIE), and Generative Pre-Trained Transformers (GPT). These features are then used to train multiple Machine Learning (ML) classifiers, including Support Vector Machines (SVM), Logistic Regression (LR), Decision Trees (DT), Random Forests (RF), Naive Bayes (NB), and Extreme Gradient Boosting (XGB). Linear Discriminant Analysis (LDA) and Principal Component Analysis (PCA) are also applied to maximize the classification accuracy of the ML models. The combination of HV with LDA and XGB achieved the highest accuracy, 99.40%. Further evaluation using precision, recall, F1 score, and specificity, together with the Confusion Matrix (CM) and Receiver Operating Characteristic (ROC) curve, confirmed its superior performance. Explainable Artificial Intelligence (XAI) tools such as Shapley Additive Explanations (SHAP) and Local Interpretable Model-Agnostic Explanations (LIME) are employed to explain the model's outputs, ensuring transparency and interpretability.
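
As a rough illustration of the Hashing Vectorization (HV) step named in the abstract, the sketch below maps each text to a fixed-length count vector by hashing its tokens into a set number of buckets. It is a simplified, standard-library-only example and not the authors' implementation: the function name `hashing_vectorize`, the tokenization rule, the MD5 hash, the bucket count of 16, and the two sample sentences are all assumptions made for illustration.

```python
import hashlib
import re


def hashing_vectorize(text, n_features=16):
    """Map a text to a fixed-length count vector using the hashing trick.

    Hypothetical helper for illustration; not taken from the paper.
    """
    vec = [0] * n_features
    # Assumed tokenization: lowercase runs of letters and digits.
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        # Use a stable hash; Python's built-in hash() is salted per process.
        digest = hashlib.md5(token.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % n_features
        vec[index] += 1  # colliding tokens share a bucket
    return vec


# Made-up sample texts, not drawn from the paper's dataset.
docs = ["The model answers the question.", "A human wrote this answer."]
vectors = [hashing_vectorize(d) for d in docs]
```

Because no vocabulary is stored, every text yields a vector of the same length regardless of which words it contains, at the cost of occasional collisions between unrelated tokens. In the pipeline the abstract describes, vectors of this kind would then pass through a dimensionality-reduction step such as LDA before being given to a classifier such as XGB.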
Published
2025-04-23
How to Cite
Md. Sadiq Iqbal, & Mohammod Abul Kashem. (2025). A Machine Learning Framework for Identifying Sources of AI-Generated Text. Statistics, Optimization & Information Computing, 13(5), 2186-2204. https://doi.org/10.19139/soic-2310-5070-2225
Section
Research Articles