The k-Nearest Neighbor Classification of Histogram- and Trapezoid-Valued Data
Abstract
A histogram-valued observation is a type of symbolic object that represents its value as a list of bins (intervals) together with their corresponding relative frequencies or probabilities. In the literature, the raw data within each bin of a histogram-valued observation have been assumed to be uniformly distributed. This paper proposes a new representation of such observations, called trapezoid-valued data, which instead assumes that the raw data in each bin are linearly distributed. Moreover, new definitions of the union and intersection of trapezoid-valued observations are introduced. This study proposes the k-nearest neighbor technique for classifying histogram-valued data using various dissimilarity measures. Further, the limiting behavior of the computational complexities of these dissimilarity measures is compared. Simulation studies are conducted to assess the performance of the proposed procedures, and the methods are applied to three real data sets. Finally, some conclusions are drawn.
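To make the classification setup concrete, the following is a minimal sketch of k-nearest neighbor voting over histogram-valued observations. It assumes, for illustration only, that all observations share a common set of bins and uses a simple city-block (L1) dissimilarity between frequency vectors; the paper itself studies several dissimilarity measures and the richer trapezoid representation, neither of which is reproduced here.

```python
import numpy as np
from collections import Counter

def l1_dissimilarity(h1, h2):
    """City-block distance between two relative-frequency vectors defined on
    common bins. Illustrative stand-in for the paper's dissimilarity measures."""
    return np.abs(np.asarray(h1) - np.asarray(h2)).sum()

def knn_classify(train_hists, train_labels, query_hist, k=3):
    """Classify a histogram-valued query by majority vote among its k nearest
    training histograms under the chosen dissimilarity measure."""
    d = [l1_dissimilarity(h, query_hist) for h in train_hists]
    nearest = np.argsort(d)[:k]
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy example: each observation is a vector of relative frequencies
# over 4 shared bins (hypothetical data).
train = [[0.7, 0.2, 0.1, 0.0],
         [0.6, 0.3, 0.1, 0.0],
         [0.0, 0.1, 0.3, 0.6],
         [0.1, 0.1, 0.2, 0.6]]
labels = ["A", "A", "B", "B"]
print(knn_classify(train, labels, [0.65, 0.25, 0.1, 0.0], k=3))  # → A
```

In practice the choice of dissimilarity drives both accuracy and cost: swapping `l1_dissimilarity` for a more expensive measure changes the per-query complexity, which is the limiting behavior the paper compares across measures.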