Forecasting Scientific Impact: A Model for Predicting Citation Counts

  • Bao T. Nguyen, Institute of Intelligent and Interactive Technology, College of Technology & Design, University of Economics Ho Chi Minh City, Vietnam
  • Thinh T. Nguyen, Institute of Intelligent and Interactive Technology, College of Technology & Design, University of Economics Ho Chi Minh City, Vietnam
Keywords: Citation Prediction, LSTM, GRU, Deep Learning, NLP

Abstract

Forecasting the citation counts of scientific papers is a challenging task, particularly when the inputs are textual data such as author names, paper titles, abstracts, and affiliations. The task differs from conventional regression problems with numerical or categorical inputs because it requires processing complex, high-dimensional text features. Traditional regression techniques, including Linear Regression, Polynomial Regression, and Decision Tree Regression, often fail to capture the semantics of textual data and are prone to overfitting because of the large feature space. These limitations are especially pronounced in Vietnam, where research output is growing rapidly but has rarely been studied with predictive models. To address these issues, we apply Natural Language Processing (NLP) techniques based on Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) networks. These deep learning models handle sequential data, capture long-range dependencies, and preserve context, which makes them well suited to text-based citation prediction. We conducted experiments on a dataset of academic papers with Vietnamese authors across diverse disciplines. The dataset includes author names, titles, abstracts, and affiliations, reflecting the characteristics of Vietnam’s research landscape. We compared LSTM and GRU models against traditional machine learning approaches, evaluating prediction accuracy with Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE). The LSTM and GRU models substantially outperform the traditional models. The LSTM model achieved an RMSE of 8.54 and an MAE of 8.1, while the GRU model achieved an RMSE of 8.32 and an MAE of 7.83. In contrast, traditional models such as Decision Tree Regression and Linear Regression exhibited higher error rates, with RMSEs exceeding 12.0. These findings support the use of deep learning for forecasting citation counts from textual data, particularly for Vietnamese research outputs, and highlight the potential of LSTM and GRU models to uncover the patterns that drive scientific impact in emerging research ecosystems.
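For readers unfamiliar with the two evaluation metrics named in the abstract, the following is a minimal sketch of how MAE and RMSE are conventionally computed. It is not the authors' evaluation code; the function names and the small `actual` and `predicted` lists are illustrative assumptions and do not come from the paper's dataset.

```python
import math


def mae(y_true, y_pred):
    """Mean Absolute Error: the average of |actual - predicted|."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root Mean Squared Error: the square root of the mean squared error."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse)


# Hypothetical citation counts (illustrative values only, not from the paper).
actual = [0, 3, 12, 40]
predicted = [2, 3, 8, 30]

print("MAE: ", mae(actual, predicted))
print("RMSE:", rmse(actual, predicted))
```

Because RMSE squares each error before averaging, it penalizes large misses more heavily than MAE and is never smaller than MAE on the same predictions. The reported figures are consistent with this property (for example, an RMSE of 8.32 against an MAE of 7.83 for the GRU model).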
Published
2025-05-28
How to Cite
Nguyen, B. T., & Nguyen, T. T. (2025). Forecasting Scientific Impact: A Model for Predicting Citation Counts. Statistics, Optimization & Information Computing, 13(6), 2601-2615. https://doi.org/10.19139/soic-2310-5070-2524
Section
Research Articles