Supersaturated plans for variable selection in large databases
Abstract
Over the last few decades, advances in technology have made the collection and storage of data massive, and variable selection has become a fundamental tool for high-dimensional statistical modelling problems. In this study we implement data mining techniques and metaheuristics, and use experimental designs in databases, in order to determine the variables most relevant to classification and regression problems in cases where the observations and labels of a large database are available. We propose a database-driven scheme for the encryption of specific fields of a database in order to select an optimal supersaturated design consisting of those variables of the large database that have been found to influence the response outcome significantly. The proposed design selection approach is quite promising, since we are able to retrieve an optimal supersaturated plan using only a very small percentage of the available runs, a fact that makes the statistical analysis of a large database computationally feasible and affordable.
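
To make the idea concrete, the following minimal Python sketch illustrates one way such a design selection could proceed; it is an illustration under stated assumptions, not the authors' implementation. The function names, the genetic-algorithm settings, and the choice of E(s^2) (a standard optimality criterion for two-level supersaturated designs, the average of the squared off-diagonal column inner products) as the fitness are our own assumptions. The sketch picks a small subset of rows from a large, +/-1-coded database so that the retained rows form a supersaturated plan with columns as close to orthogonal as possible.

    # A minimal sketch, assuming a +/-1-coded database X: use a simple genetic
    # algorithm (one of many possible metaheuristics) to select n_runs rows of X
    # with small E(s^2), the mean of the squared off-diagonal entries of D'D
    # for the resulting design D.
    import numpy as np

    rng = np.random.default_rng(0)

    def e_s2(design: np.ndarray) -> float:
        # E(s^2): mean of s_ij^2 over all column pairs i < j of the design.
        s = design.T @ design
        off = s[np.triu_indices(s.shape[0], k=1)]
        return float(np.mean(off ** 2))

    def ga_select_rows(X: np.ndarray, n_runs: int, pop=40, gens=200, mut=0.2):
        # Evolve subsets of row indices of X that minimise E(s^2).
        N = X.shape[0]
        population = [rng.choice(N, size=n_runs, replace=False) for _ in range(pop)]
        for _ in range(gens):
            population.sort(key=lambda idx: e_s2(X[idx]))   # smaller is better
            survivors = population[: pop // 2]
            children = []
            while len(survivors) + len(children) < pop:
                # Crossover: draw a child from the union of two parents' rows.
                a, b = rng.choice(len(survivors), size=2, replace=False)
                gene_pool = np.union1d(survivors[a], survivors[b])
                child = rng.choice(gene_pool, size=n_runs, replace=False)
                if rng.random() < mut:                      # mutation: swap in a fresh row
                    outside = np.setdiff1d(np.arange(N), child)
                    child[rng.integers(n_runs)] = rng.choice(outside)
                children.append(child)
            population = survivors + children
        best = min(population, key=lambda idx: e_s2(X[idx]))
        return best, e_s2(X[best])

    # Toy usage: 5,000 candidate runs on 12 two-level factors; keeping only
    # 8 runs (fewer runs than factors, hence supersaturated) uses 0.16% of rows.
    X = rng.choice(np.array([-1, 1]), size=(5000, 12))
    rows, score = ga_select_rows(X, n_runs=8)
    print(sorted(rows.tolist()), round(score, 3))

In this toy run only 8 of the 5,000 available rows are retained, in line with the claim above that an optimal supersaturated plan can be retrieved from a very small percentage of the available runs; the selected rows could then be analysed with any of the usual variable selection methods for supersaturated designs.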