Two-Step Semidefinite Programming approach to clustering and dimensionality reduction
Abstract
Inspired by the recently proposed statistical technique called clustering and disjoint principal component analysis (CDPCA), we present in this paper a new algorithm for clustering objects and dimensionality reduction, based on Semidefinite Programming (SDP) models. The Two-Step-SDP algorithm is based on SDP relaxations of two clustering problems and on a K-means step in a reduced space. The Two-Step-SDP algorithm was implemented and tested in R, a widely used open source software. Besides returning clusters of both objects and attributes, the Two-Step-SDP algorithm returns the variance explained by each component and the component loadings. The numerical experiments on different data sets show that the algorithm is quite efficient and fast. Comparing to other known iterative algorithms for clustering, namely, the K-means and ALS algorithms, the computational time of the Two-Step-SDP algorithm is comparable to the K-means algorithm, and it is faster than the ALS algorithm.References
Abadir, K.M. and Magnus, J.R., Matrix Algebra, Cambridge University Press, 2005.
Aloise, D. and Hansen, P., A branch-and-cut sdp-based algorithm for minimum sum-of-squares clustering, Pesquisa Operacional, vol.29, n.3, pp. 503-516, 2009.
Ames, Brendan P.W., Guaranteed clustering and biclustering via semidefinite programming, Mathematical Programming, vol. 147, Issue 1-2, Springer Berlin Heidelberg, pp. 429-465, 2014.
Akteke-Ozturk, B., Weber, G.-W. and Kropat, E., Continuous Optimization Approaches for Clustering via Minimum Sum of Squares, Proc. 20th Mini-EURO Conf. Continuous Optimization and Knowledge-Based Technologies, 2007.
Anjos, M.F. and Lasserre, J.B. (Eds.), Handbook of Semidefinite, Conic and Polynomial Optimization: Theory, Algorithms, Software and Applications, International Series in Operational Research and Management Science, vol. 166, Springer, 2012.
Bagirov, A.M., Modified global k-means algorithm for minimum sum-of-squares clustering problems, Pattern Recognition, 41, pp. 3192-3199, 2008.
Bagirov, A.M., Rubinov, A.M., Soukhoroukova, N.V. and Yearwood, J., Unsupervised and Supervised Data Classification via Nonsmooth and Global Optimization, Top, vol. 11, pp. 1-93, 2003.
Bronson, R. and Costa, G.B., Matrix Methods: Applied Linear Algebra,3rd Ed., Academic Press, Elsevier, 2009.
d'Aspremont, A., El Ghaoui, L., Jordan, M.I., Lanckriet, G.R.G.: A Direct Formulation for Sparse PCA Using Semidefinite Programming. SIAM Rev., 49(3), pp.434-448 2007.
Enki, D.G., Trendafilov, N.T. and Jolliffe, I.T., A clustering approach to interpretable principal components, Journal of Applied Statistics, 40(3), pp. 583–599, 2013.
Fiala, J., Kocvara, M. and Stingl, M., PENLAB: A MATLAB solver for nonlinear semidefinite optimization, Nov 2013, http://arxiv.org/pdf/1311.5240v1.pdf,
Gartner, B. and Matousek, J., Approximation Algorithms and Semidefinite Programming, Springer-Verlag, 2012.
Goemans, M.X. and Williamson, D.P., Improved Approximation Algorithms for Maximum Cut and Satisfability Problems Using Semidefinite Programming, J. ACM, vol. 42(6), pp: 1115-1145, 1995.
Hastie, T., Tibshirani, R. and Friedman, J., The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed., Springer Series in Statistics, 2009
Jain, A. K., Data Clustering: 50 Years Beyond K-means, Pattern Recogn. Lett., vol. 31(8), Elsevier Science Inc., pp. 651-666, 2010
Jolliffe, I.T., Principal Component Analysis Second edition, Springer-Verlag, New York, 2002.
Jollie, I.T., Trendalov, N.T., Uddin, M.: A modied principal componente technique based on the lasso. J. of Computational and Graphical Statistics 12(3), pp. 531-547 (2003)
Kogan, Jacob, Nicholas, Charles, Teboulle, Marc (Eds.), Grouping Multidimensional Data: Recent Advances in Clustering, Springer, XII, 2006.
Kulis, B., Surendran, A. C. and Platt, J. C., Fast low-rank semidefinite programming for embedding and clustering, in Proceedings of the International Workshop on Artificial Intelligence and Statistics, San Juan, Puerto Rico, 2007.
Laurent, M. and Rendl, F., Semidefinite programming and integer programming, In Nemhauser, G., Aardal, K. and Weismantel, R., editors, Handbook on Discrete Optimization, Elsevier, pp. 393-514, Elsevier, 2005.
Ma, Z.: Sparse principal component analysis and iterative thresholding. The Annals of Statistics 41(2), pp. 772-801 (2013)
Macedo, E., Freitas, A., Statistical Methods and Optimization in Data Mining, In: III International Conference of Optimization and Applications OPTIMA2012, pp. 164-169 (2012)
Macedo, E. and Freitas, A., The Alternating Least-Squares Algorithm for CDPCA, In: Plakhov, A., Tchemisova, T. and Freitas, A. (eds.) Optimization in the Natural Sciences, Communications in Computer and Information Science (CCIS), Springer Verlag, 499, pp. 173–191, 2015.
Overton, M.L. and Womersley, R.S., Optimality Conditions and Duality Theory for Minimizing Sums of the Largest Eigenvalues of Symmetric Matrices, Mathematical Programming, 62, pp. 321-357, 1993.
Peng, J. and Wei, Y., Approximating k-means-type clustering via semidefinite programming, SIAM J. OPTIM., vol. 18, N. 1, pp. 186-205, 2007.
Peng, J. and Xia, Y., A New Theoretical Framework for K-Means-Type Clustering, Foundations and Advances in Data Mining Studies in Fuzziness and Soft. Computing, Vol 180, 2005, pp. 79-96 .
Pinto Da Costa, J.F., Alonso, H. and Roque, L., A weighted principal componente analysis and its application to gene expression data, IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 8, no. 1, pp. 246-252, 2011.
R Development Core Team, R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria, http://www.R-project.org, 2015.
UCI Repository: http://archive.ics.uci.edu/ml/
Vandenberghe, L. and Boyd, S., Semidefinite Programming, SIAM REVIEW, Vol. 38, No. 1, pp. 49-95, 1996.
Vichi, M. and Kiers, H.A.L., Factorial k-means analysis for two-way data, Computational Statistics & Data Analysis, 37(1), pp. 49–64, 2001.
Vichi, M., Saporta, G., Clustering and Disjoint Principal Component Analysis, Computational Statistics & Data Analysis 53, pp. 3194-3208, 2009.
Weber, G.-W., Taylan, P., Ozogur, S. and Akteke-Ozturk, B., Statistical Learning and Optimization Methods in Data Mining, in Ayhan, H. O. and Batmaz, I. (eds.): Recent Advances in Statistics, Turkish Statistical Institute Press, Ankara, pp. 181-195, 2007.
Xu, R. and Wunsch, D., Survey of Clustering Algorithms, IEEE TRANSACTIONS ON NEURAL NETWORKS, Vol. 16, NO. 3, pp. 645-648, 2005.
Yanai, H., Takeuchi, K. and Takane, Y., Projection Matrices, Generalized Inverse Matrices, and Singular Value Decomposition, Statistics for Social and Behavioral Sciences, Springer, 2011.
Zhao, D. and Zhang, M., A primal-dual large-update interior-point algorithm for semi-definite optimization based on a new kernel function, Statistics, Optimization & Information Computing, 1(1), pp. 41–61, 2013.
Zou, H., Hastie, T., Tibshirani, R., Sparse principal component analysis, J. of Computational and Graphical Statistics, vol. 15(2), pp. 262-286, 2006.
- Authors retain copyright and grant the journal right of first publication with the work simultaneously licensed under a Creative Commons Attribution License that allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.
- Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgement of its initial publication in this journal.
- Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work (See The Effect of Open Access).