The k-Nearest Neighbor Classification of Histogram- and Trapezoid-Valued Data
Abstract
A histogram-valued observation is a type of symbolic object that represents its value as a list of bins (intervals) together with their corresponding relative frequencies or probabilities. In the literature, the raw data within each bin of a histogram-valued observation have been assumed to be uniformly distributed. This paper proposes a new representation of such observations, called trapezoid-valued data, which instead assumes that the raw data in each bin are linearly distributed. Moreover, new definitions of the union and intersection of trapezoid-valued observations are introduced. This study proposes the k-nearest neighbor technique for classifying histogram-valued data using various dissimilarity measures. Further, the limiting behavior of the computational complexities of these dissimilarity measures is compared. Simulation studies are conducted to assess the performance of the proposed procedures, and the methods are applied to three real data sets. Finally, some conclusions are drawn.
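To make the classification setup concrete, the following is a minimal sketch of k-nearest neighbor voting over histogram-valued observations. It assumes, for illustration only, that all observations share a common set of bins and uses a simple city-block (L1) dissimilarity between frequency vectors; the paper itself studies several dissimilarity measures and the richer trapezoid representation, neither of which is reproduced here.

```python
import numpy as np
from collections import Counter

def l1_dissimilarity(h1, h2):
    """City-block distance between two relative-frequency vectors defined on
    common bins. Illustrative stand-in for the paper's dissimilarity measures."""
    return np.abs(np.asarray(h1) - np.asarray(h2)).sum()

def knn_classify(train_hists, train_labels, query_hist, k=3):
    """Classify a histogram-valued query by majority vote among its k nearest
    training histograms under the chosen dissimilarity measure."""
    d = [l1_dissimilarity(h, query_hist) for h in train_hists]
    nearest = np.argsort(d)[:k]
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy example: each observation is a vector of relative frequencies
# over 4 shared bins (hypothetical data).
train = [[0.7, 0.2, 0.1, 0.0],
         [0.6, 0.3, 0.1, 0.0],
         [0.0, 0.1, 0.3, 0.6],
         [0.1, 0.1, 0.2, 0.6]]
labels = ["A", "A", "B", "B"]
print(knn_classify(train, labels, [0.65, 0.25, 0.1, 0.0], k=3))  # → A
```

In practice the choice of dissimilarity drives both accuracy and cost: swapping `l1_dissimilarity` for a more expensive measure changes the per-query complexity, which is the limiting behavior the paper compares across measures.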