Two-Step Semidefinite Programming  approach to clustering and dimensionality reduction

Eloisa Macedo

doi:10.19139/soic.v3i3.145

Eloisa Macedo University of Aveiro

DOI: https://doi.org/10.19139/soic.v3i3.145

Keywords: Data Mining, Clustering, PCA, Semidefinite Programming

Abstract

Inspired by the recently proposed statistical technique called clustering and disjoint principal component analysis (CDPCA), we present in this paper a new algorithm for clustering objects and dimensionality reduction, based on Semidefinite Programming (SDP) models. The Two-Step-SDP algorithm is based on SDP relaxations of two clustering problems and on a K-means step in a reduced space. The Two-Step-SDP algorithm was implemented and tested in R, a widely used open source software. Besides returning clusters of both objects and attributes, the Two-Step-SDP algorithm returns the variance explained by each component and the component loadings. The numerical experiments on different data sets show that the algorithm is quite efficient and fast. Comparing to other known iterative algorithms for clustering, namely, the K-means and ALS algorithms, the computational time of the Two-Step-SDP algorithm is comparable to the K-means algorithm, and it is faster than the ALS algorithm.

References

Abadir, K.M. and Magnus, J.R., Matrix Algebra, Cambridge University Press, 2005.

Aloise, D. and Hansen, P., A branch-and-cut sdp-based algorithm for minimum sum-of-squares clustering, Pesquisa Operacional, vol.29, n.3, pp. 503-516, 2009.

Ames, Brendan P.W., Guaranteed clustering and biclustering via semidefinite programming, Mathematical Programming, vol. 147, Issue 1-2, Springer Berlin Heidelberg, pp. 429-465, 2014.

Akteke-Ozturk, B., Weber, G.-W. and Kropat, E., Continuous Optimization Approaches for Clustering via Minimum Sum of Squares, Proc. 20th Mini-EURO Conf. Continuous Optimization and Knowledge-Based Technologies, 2007.

Anjos, M.F. and Lasserre, J.B. (Eds.), Handbook of Semidefinite, Conic and Polynomial Optimization: Theory, Algorithms, Software and Applications, International Series in Operational Research and Management Science, vol. 166, Springer, 2012.

Bagirov, A.M., Modified global k-means algorithm for minimum sum-of-squares clustering problems, Pattern Recognition, 41, pp. 3192-3199, 2008.

Bagirov, A.M., Rubinov, A.M., Soukhoroukova, N.V. and Yearwood, J., Unsupervised and Supervised Data Classification via Nonsmooth and Global Optimization, Top, vol. 11, pp. 1-93, 2003.

Bronson, R. and Costa, G.B., Matrix Methods: Applied Linear Algebra,3rd Ed., Academic Press, Elsevier, 2009.

d'Aspremont, A., El Ghaoui, L., Jordan, M.I., Lanckriet, G.R.G.: A Direct Formulation for Sparse PCA Using Semidefinite Programming. SIAM Rev., 49(3), pp.434-448 2007.

Enki, D.G., Trendafilov, N.T. and Jolliffe, I.T., A clustering approach to interpretable principal components, Journal of Applied Statistics, 40(3), pp. 583–599, 2013.

Fiala, J., Kocvara, M. and Stingl, M., PENLAB: A MATLAB solver for nonlinear semidefinite optimization, Nov 2013, http://arxiv.org/pdf/1311.5240v1.pdf,

Gartner, B. and Matousek, J., Approximation Algorithms and Semidefinite Programming, Springer-Verlag, 2012.

Goemans, M.X. and Williamson, D.P., Improved Approximation Algorithms for Maximum Cut and Satisfability Problems Using Semidefinite Programming, J. ACM, vol. 42(6), pp: 1115-1145, 1995.

Hastie, T., Tibshirani, R. and Friedman, J., The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed., Springer Series in Statistics, 2009

Jain, A. K., Data Clustering: 50 Years Beyond K-means, Pattern Recogn. Lett., vol. 31(8), Elsevier Science Inc., pp. 651-666, 2010

Jolliffe, I.T., Principal Component Analysis Second edition, Springer-Verlag, New York, 2002.

Jollie, I.T., Trendalov, N.T., Uddin, M.: A modied principal componente technique based on the lasso. J. of Computational and Graphical Statistics 12(3), pp. 531-547 (2003)

Kogan, Jacob, Nicholas, Charles, Teboulle, Marc (Eds.), Grouping Multidimensional Data: Recent Advances in Clustering, Springer, XII, 2006.

Kulis, B., Surendran, A. C. and Platt, J. C., Fast low-rank semidefinite programming for embedding and clustering, in Proceedings of the International Workshop on Artificial Intelligence and Statistics, San Juan, Puerto Rico, 2007.

Laurent, M. and Rendl, F., Semidefinite programming and integer programming, In Nemhauser, G., Aardal, K. and Weismantel, R., editors, Handbook on Discrete Optimization, Elsevier, pp. 393-514, Elsevier, 2005.

Ma, Z.: Sparse principal component analysis and iterative thresholding. The Annals of Statistics 41(2), pp. 772-801 (2013)

Macedo, E., Freitas, A., Statistical Methods and Optimization in Data Mining, In: III International Conference of Optimization and Applications OPTIMA2012, pp. 164-169 (2012)

Macedo, E. and Freitas, A., The Alternating Least-Squares Algorithm for CDPCA, In: Plakhov, A., Tchemisova, T. and Freitas, A. (eds.) Optimization in the Natural Sciences, Communications in Computer and Information Science (CCIS), Springer Verlag, 499, pp. 173–191, 2015.

Overton, M.L. and Womersley, R.S., Optimality Conditions and Duality Theory for Minimizing Sums of the Largest Eigenvalues of Symmetric Matrices, Mathematical Programming, 62, pp. 321-357, 1993.

Peng, J. and Wei, Y., Approximating k-means-type clustering via semidefinite programming, SIAM J. OPTIM., vol. 18, N. 1, pp. 186-205, 2007.

Peng, J. and Xia, Y., A New Theoretical Framework for K-Means-Type Clustering, Foundations and Advances in Data Mining Studies in Fuzziness and Soft. Computing, Vol 180, 2005, pp. 79-96 .

Pinto Da Costa, J.F., Alonso, H. and Roque, L., A weighted principal componente analysis and its application to gene expression data, IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 8, no. 1, pp. 246-252, 2011.

R Development Core Team, R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria, http://www.R-project.org, 2015.

UCI Repository: http://archive.ics.uci.edu/ml/

Vandenberghe, L. and Boyd, S., Semidefinite Programming, SIAM REVIEW, Vol. 38, No. 1, pp. 49-95, 1996.

Vichi, M. and Kiers, H.A.L., Factorial k-means analysis for two-way data, Computational Statistics & Data Analysis, 37(1), pp. 49–64, 2001.

Vichi, M., Saporta, G., Clustering and Disjoint Principal Component Analysis, Computational Statistics & Data Analysis 53, pp. 3194-3208, 2009.

Weber, G.-W., Taylan, P., Ozogur, S. and Akteke-Ozturk, B., Statistical Learning and Optimization Methods in Data Mining, in Ayhan, H. O. and Batmaz, I. (eds.): Recent Advances in Statistics, Turkish Statistical Institute Press, Ankara, pp. 181-195, 2007.

Xu, R. and Wunsch, D., Survey of Clustering Algorithms, IEEE TRANSACTIONS ON NEURAL NETWORKS, Vol. 16, NO. 3, pp. 645-648, 2005.

Yanai, H., Takeuchi, K. and Takane, Y., Projection Matrices, Generalized Inverse Matrices, and Singular Value Decomposition, Statistics for Social and Behavioral Sciences, Springer, 2011.

Zhao, D. and Zhang, M., A primal-dual large-update interior-point algorithm for semi-definite optimization based on a new kernel function, Statistics, Optimization & Information Computing, 1(1), pp. 41–61, 2013.

Zou, H., Hastie, T., Tibshirani, R., Sparse principal component analysis, J. of Computational and Graphical Statistics, vol. 15(2), pp. 262-286, 2006.