Forecasting Scientific Impact: A Model for Predicting Citation Counts
Keywords:
Citation Prediction, LSTM, GRU, Deep Learning, NLP
Abstract
Forecasting the citation counts of scientific papers is a challenging task, particularly when utilizing textual data such as author names, paper titles, abstracts, and affiliations. This task diverges from conventional regression problems involving numerical or categorical inputs, as it demands the processing of complex, high-dimensional text features. Traditional regression techniques, including Linear Regression, Polynomial Regression, and Decision Tree Regression, often fail to capture the semantic intricacies of textual data and are susceptible to overfitting due to the expansive feature space. In the context of Vietnam, where research output is rapidly growing yet underexplored in predictive modeling, these limitations are especially pronounced. To tackle these issues, we leverage advanced Natural Language Processing (NLP) techniques, employing Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) networks. These deep learning models are adept at handling sequential data, capturing long-range dependencies, and preserving contextual nuances, rendering them well-suited for text-based citation prediction. We conducted experiments using a dataset of academic papers, spanning diverse disciplines, with contributions from Vietnamese authors. The dataset includes features such as author names, titles, abstracts, and affiliations, reflecting the unique characteristics of Vietnam’s research landscape. We compared the performance of LSTM and GRU models against traditional machine learning approaches, evaluating prediction accuracy with Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE). The results reveal that LSTM and GRU models substantially outperform their traditional counterparts. The LSTM model achieved an RMSE of 8.54 and an MAE of 8.1, while the GRU model yielded an RMSE of 8.32 and an MAE of 7.83, demonstrating robust predictive capabilities.
In contrast, traditional models such as Decision Tree Regression and Linear Regression exhibited higher error rates, with RMSEs exceeding 12.0. These findings underscore the efficacy of deep learning in forecasting citation counts from textual data, particularly for Vietnamese research outputs, and highlight the potential of LSTM and GRU models to uncover intricate patterns driving scientific impact in emerging research ecosystems.
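For readers unfamiliar with the two evaluation metrics named in the abstract, the following is a minimal sketch of how MAE and RMSE are conventionally computed. It is not the authors' evaluation code; the function names and the sample citation counts are hypothetical and chosen only for illustration.

```python
import math


def mae(y_true, y_pred):
    """Mean Absolute Error: the average magnitude of the prediction errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root Mean Squared Error: square root of the mean squared error.

    Squaring weights large errors more heavily than MAE does.
    """
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))


# Hypothetical citation counts, for illustration only
# (not taken from the paper's dataset).
actual = [0, 3, 12, 40]
predicted = [2, 3, 9, 30]

print("MAE: ", mae(actual, predicted))
print("RMSE:", rmse(actual, predicted))
```

On any set of predictions, RMSE is at least as large as MAE, and the gap widens when a few papers are mispredicted by a large margin. The reported figures are consistent with this relationship (8.54 versus 8.1 for LSTM, 8.32 versus 7.83 for GRU).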
Published
2025-05-28
How to Cite
Nguyen, B. T., & Nguyen, T. T. (2025). Forecasting Scientific Impact: A Model for Predicting Citation Counts. Statistics, Optimization & Information Computing, 13(6), 2601-2615. https://doi.org/10.19139/soic-2310-5070-2524
Section
Research Articles
Authors who publish with this journal agree to the following terms:
- Authors retain copyright and grant the journal right of first publication with the work simultaneously licensed under a Creative Commons Attribution License that allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.
- Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgement of its initial publication in this journal.
- Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work (See The Effect of Open Access).