Supersaturated plans for variable selection in large databases
Abstract
Over the last few decades, advances in technology have made the collection and storage of data massive, and variable selection has become a fundamental tool for high-dimensional statistical modelling problems. In this study we implement data mining techniques and metaheuristics, and use experimental designs in databases, in order to determine the variables most relevant to classification and regression problems in cases where the observations and labels of a large database are available. We propose a database-driven scheme for the encryption of specific fields of a database in order to select an optimal supersaturated design consisting of those variables of the large database that have been found to influence the response outcome significantly. The proposed design selection approach is quite promising, since we are able to retrieve an optimal supersaturated plan using only a very small percentage of the available runs, a fact that makes the statistical analysis of a large database computationally feasible and affordable.
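
To make the idea concrete, the following minimal Python sketch illustrates one way such a design selection could proceed; it is an illustration under stated assumptions, not the authors' implementation. The function names, the genetic-algorithm settings, and the choice of E(s^2) (a standard optimality criterion for two-level supersaturated designs, the average of the squared off-diagonal column inner products) as the fitness are our own assumptions. The sketch picks a small subset of rows from a large, +/-1-coded database so that the retained rows form a supersaturated plan with columns as close to orthogonal as possible.

    # A minimal sketch, assuming a +/-1-coded database X: use a simple genetic
    # algorithm (one of many possible metaheuristics) to select n_runs rows of X
    # with small E(s^2), the mean of the squared off-diagonal entries of D'D
    # for the resulting design D.
    import numpy as np

    rng = np.random.default_rng(0)

    def e_s2(design: np.ndarray) -> float:
        # E(s^2): mean of s_ij^2 over all column pairs i < j of the design.
        s = design.T @ design
        off = s[np.triu_indices(s.shape[0], k=1)]
        return float(np.mean(off ** 2))

    def ga_select_rows(X: np.ndarray, n_runs: int, pop=40, gens=200, mut=0.2):
        # Evolve subsets of row indices of X that minimise E(s^2).
        N = X.shape[0]
        population = [rng.choice(N, size=n_runs, replace=False) for _ in range(pop)]
        for _ in range(gens):
            population.sort(key=lambda idx: e_s2(X[idx]))   # smaller is better
            survivors = population[: pop // 2]
            children = []
            while len(survivors) + len(children) < pop:
                # Crossover: draw a child from the union of two parents' rows.
                a, b = rng.choice(len(survivors), size=2, replace=False)
                gene_pool = np.union1d(survivors[a], survivors[b])
                child = rng.choice(gene_pool, size=n_runs, replace=False)
                if rng.random() < mut:                      # mutation: swap in a fresh row
                    outside = np.setdiff1d(np.arange(N), child)
                    child[rng.integers(n_runs)] = rng.choice(outside)
                children.append(child)
            population = survivors + children
        best = min(population, key=lambda idx: e_s2(X[idx]))
        return best, e_s2(X[best])

    # Toy usage: 5,000 candidate runs on 12 two-level factors; keeping only
    # 8 runs (fewer runs than factors, hence supersaturated) uses 0.16% of rows.
    X = rng.choice(np.array([-1, 1]), size=(5000, 12))
    rows, score = ga_select_rows(X, n_runs=8)
    print(sorted(rows.tolist()), round(score, 3))

In this toy run only 8 of the 5,000 available rows are retained, in line with the claim above that an optimal supersaturated plan can be retrieved from a very small percentage of the available runs; the selected rows could then be analysed with any of the usual variable selection methods for supersaturated designs.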