Selecting Key Features of Online Behaviour on South African Informative Websites Prior to Unsupervised Machine Learning
Abstract
The main aim of this study was to explore feature selection for online web data prior to fitting unsupervised machine learning models. At the time of writing, no literature could be found that reported the use of feature selection in this context. Features were selected by inspecting the variability of each feature and the association between features. The variability of numeric features was quantified using the variance, mean absolute difference and dispersion ratio, whilst the coefficient of unalikeability was employed for categorical features. To assess association, correlation matrices were used for pairs of numeric features, chi-squared tests of independence for pairs of categorical features, and box-and-whisker plots for mixed (numeric and categorical) pairs. The main findings showed that the variance, mean absolute difference, dispersion ratio and coefficient of unalikeability successfully highlighted features with very low variability within the observed data, whilst the correlation matrix, chi-squared test of independence and box-and-whisker plots highlighted possible redundancy, natural relationships and insightful relationships between features, thereby suggesting features to be considered for omission prior to unsupervised modelling. The proposed methods and findings can be applied to other applications of feature selection and exploration.
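The four variability measures named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that "mean absolute difference" means the average absolute deviation from the arithmetic mean, that the dispersion ratio is the arithmetic mean divided by the geometric mean (defined here only for strictly positive values), and that the coefficient of unalikeability is one minus the sum of squared category proportions. The feature names and values are hypothetical and were invented for illustration.

```python
import math
from collections import Counter
from statistics import fmean, pvariance


def mean_absolute_difference(values):
    """Average absolute deviation of each value from the arithmetic mean (assumed definition)."""
    centre = fmean(values)
    return fmean([abs(v - centre) for v in values])


def dispersion_ratio(values):
    """Arithmetic mean divided by geometric mean; this sketch requires values > 0."""
    if any(v <= 0 for v in values):
        raise ValueError("dispersion ratio needs strictly positive values")
    geometric_mean = math.exp(fmean([math.log(v) for v in values]))
    return fmean(values) / geometric_mean


def unalikeability(categories):
    """One minus the sum of squared category proportions (0 means every value is identical)."""
    n = len(categories)
    counts = Counter(categories)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


# Hypothetical session-level features, not taken from the study's data.
pages_per_session = [1, 1, 1, 1, 2]
session_seconds = [5, 40, 310, 12, 95]
device = ["mobile", "mobile", "mobile", "mobile", "desktop"]

for name, values in [("pages_per_session", pages_per_session),
                     ("session_seconds", session_seconds)]:
    print(name,
          "variance:", pvariance(values),
          "MAD:", mean_absolute_difference(values),
          "dispersion ratio:", dispersion_ratio(values))

print("device unalikeability:", unalikeability(device))
```

Under these assumptions, a numeric feature with variance and mean absolute difference near zero, or a dispersion ratio near one, varies little across observations. The same is true of a categorical feature with unalikeability near zero. Such features are candidates for omission before clustering.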