Selecting Key Features of Online Behaviour on South African Informative Websites Prior to Unsupervised Machine Learning

Judah Soobramoney; Retius Chifurira; Temesgen Zewotir

doi:10.19139/soic-2310-5070-1139

Judah Soobramoney, University of KwaZulu-Natal
Retius Chifurira, University of KwaZulu-Natal
Temesgen Zewotir, University of KwaZulu-Natal

Keywords: feature selection; Google Analytics tracking; online behaviour; unsupervised machine learning

Abstract

The main aim of the study was to explore feature selection for online web data prior to fitting unsupervised machine learning models. At the time of writing, no literature could be found reporting the use of feature selection in this context. Features were selected by inspecting the variability of, and the association between, features. The variability of numeric features was quantified using the variance, mean absolute difference and dispersion ratio metrics, whilst the coefficient of unalikeability was employed for categorical features. To quantify association, correlation matrices were used for numeric features, chi-squared tests of independence for categorical features, and box-and-whisker plots for mixed (numeric and categorical) features. The main findings showed that the variance, mean absolute difference, dispersion ratio and coefficient of unalikeability metrics successfully highlighted features with very low variability within the observed data. The correlation matrix, chi-squared test of independence and box-and-whisker plots highlighted possible redundancy, natural relationships and insightful relationships between the features, thereby suggesting features to be considered for omission prior to unsupervised modelling. The proposed methods and findings can be applied to other applications of feature selection and exploration.
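
The four variability metrics named in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation. The abstract does not give formulas, so the definitions are assumptions: the mean absolute difference is taken as the mean absolute deviation from the arithmetic mean, the dispersion ratio as the arithmetic mean divided by the geometric mean (strictly positive values only), and the coefficient of unalikeability as one minus the sum of squared category proportions. All function names and the toy data are hypothetical.

```python
import math
from collections import Counter


def variance(values):
    """Sample variance (n - 1 denominator)."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


def mean_absolute_difference(values):
    """Mean absolute deviation of the values from their arithmetic mean (assumed definition)."""
    n = len(values)
    mean = sum(values) / n
    return sum(abs(x - mean) for x in values) / n


def dispersion_ratio(values):
    """Arithmetic mean divided by geometric mean (assumed definition).

    Equals 1 for a constant feature and grows with dispersion.
    Defined here only for strictly positive values.
    """
    if any(x <= 0 for x in values):
        raise ValueError("dispersion ratio requires strictly positive values")
    n = len(values)
    arithmetic_mean = sum(values) / n
    geometric_mean = math.exp(sum(math.log(x) for x in values) / n)
    return arithmetic_mean / geometric_mean


def unalikeability(categories):
    """Coefficient of unalikeability: 1 minus the sum of squared category proportions."""
    n = len(categories)
    counts = Counter(categories)
    return 1.0 - sum((count / n) ** 2 for count in counts.values())


# Hypothetical toy data, not taken from the study.
session_seconds = [30.0, 30.0, 30.0, 30.0]  # constant numeric feature
pages_viewed = [1.0, 2.0, 4.0, 8.0]         # varying numeric feature
device = ["mobile", "mobile", "mobile", "mobile"]     # constant categorical feature
channel = ["organic", "direct", "organic", "direct"]  # evenly split categorical feature

for name, feature in [("session_seconds", session_seconds), ("pages_viewed", pages_viewed)]:
    print(name, variance(feature), mean_absolute_difference(feature), dispersion_ratio(feature))

for name, feature in [("device", device), ("channel", channel)]:
    print(name, unalikeability(feature))
```

Under these assumed definitions, a feature that barely changes across observations scores near zero on the variance, mean absolute difference and unalikeability, and near one on the dispersion ratio, which is what would flag it as a candidate for omission before clustering.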
