Course “Data Mining”
Vladimir Panov - HSE

Short description

This course is intended for those interested in processing data with Data Mining techniques and in the effective use of statistical software. The current version of the course is based on the STATISTICA software (developed by StatSoft, Inc.). The course is divided into 3 parts. In the first part, we consider data preparation for efficient statistical analysis and discuss the main aspects of the Data Mining methodology. The second part is devoted to different methods for solving two popular statistical problems known as the classification and regression tasks. In the third part, we turn to other tasks such as clustering and dimension reduction.

Duration of the course: 32 academic hours.

Plan

1. Introduction to Data Mining, data preparation and preliminary remarks
General concepts of Data Mining and their implementation in the STATISTICA software.
Overview of the problems that can be solved by applying Data Mining techniques.
Data import and export, interaction with databases.
Preliminary data treatment (data cleaning) and data transformations: missing data, outliers, sparse data, duplicate values, incorrect values.
Descriptive statistics and preliminary data analysis, concept of the tool Drill down.
Visualization of the input data, interactive analysis of the plots.
Selection of the most important factors, tool Feature selection.
Search for regularities in data, concepts of Link analysis and Association rules.
Analysis of the division into categories, tool Weights of evidence.

2. Classification and regression tasks
Formulation of the problem, key concepts and definitions.
Concept of Classification and regression trees: graphical representation, analysis of the importance of predictors, general methodology, quality parameters, division into training and control samples, cross-validation methods.
Other methods for building classification and regression trees: Generalized CHAID models, boosted trees, random forests.
Support vector machines (SVM), notion of the optimal hyperplane.
Probabilistic approach to the classification task, naive Bayes models.
Nonparametric regression, Generalized additive models (GAM).
Spline approach to regression problems, Multivariate adaptive regression splines (MARS).
Comparison of different models with the tool Goodness of fit, visual analysis of the lift and gain charts.
Combining different models, ensemble learning (boosting and bagging).
Application of the models to new data, tool Rapid Deployment, “vote” between models.
Classical methods of regression analysis: multivariate and logistic regression, variable selection, Akaike information criterion.
Multivariate normal distribution, Fisher's discriminant analysis.
Analysis of censored data, survival analysis.

3. Other tasks and methods for data analysis
Cluster analysis: formulation of the problem, key concepts and definitions, k-means clustering, tree clustering, two-way joining, and the EM algorithm.
Dimension reduction: formulation of the problem, curse of dimensionality, principal component analysis, multidimensional scaling, factor analysis, and independent component analysis.
Neural networks: methodology of the neural network approach, network structures, optimal choice of complexity and architecture.
Automation of data analysis, creation of automated reports, tools Data Miner Workspace and Data Miner Recipes.

Literature

1. Duda, R., Hart, P., and Stork, D. Pattern classification. John Wiley, 2001.
2. Hastie, T.J., Tibshirani, R., and Friedman, J. The elements of statistical learning: Data Mining, inference and prediction. Springer, 1996.
3. Härdle, W. and Simar, L. Applied multivariate statistical analysis. Springer, 2012.
4. Hyvärinen, A., Karhunen, J., and Oja, E. Independent component analysis. John Wiley & Sons, 2001.
5. Nisbet, R., Elder, J., and Miner, G. Handbook of statistical analysis and Data Mining applications. Elsevier, 2009.
6. Wasserman, L. All of nonparametric statistics. Springer, 2007.