Course “Data Mining” Vladimir Panov
Vladimir Panov - HSE
Short description
This course is intended for those interested in processing data with Data
Mining techniques and in the effective use of statistical software. The current version of the
course is built around the STATISTICA software (developed by StatSoft, Inc.).
The course is divided into three parts. The first part covers data
preparation for efficient statistical analysis and the main aspects of the Data
Mining methodology. The second part is devoted to methods for
solving two popular statistical problems: classification and regression.
The third part turns to other tasks, such as clustering and dimension reduction.
Duration of the course: 32 academic hours.
Plan
1. Introduction to Data Mining, data preparation and preliminary remarks
• General concepts of Data Mining and their implementation in the STATISTICA
software.
• Overview of the problems that can be solved by applying Data Mining
techniques.
• Data import and export, interaction with databases.
• Preliminary data treatment (data cleaning) and data transformations:
missing data, outliers, sparse data, duplicate values, incorrect values.
• Descriptive statistics and preliminary data analysis, the concept of the
Drill down tool.
• Visualization of the input data, interactive analysis of plots.
• Selection of the most important factors, the Feature selection tool.
• Search for regularities in data, concepts of Link analysis and Association
rules.
• Analysis of the division into categories, the Weights of evidence tool.
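As an illustration of the last item, the weight of evidence of a category can be computed by hand. The sketch below is not the STATISTICA tool; it is a minimal Python example under stated assumptions: the target is binary (1 = event, 0 = non-event), the sign convention is ln(share of non-events / share of events), and the function name `weights_of_evidence` is hypothetical.

```python
import math


def weights_of_evidence(categories, targets):
    """Return {category: WoE} for a binary target (1 = event, 0 = non-event).

    Assumed convention: WoE = ln(share of non-events / share of events).
    """
    events, non_events = {}, {}
    for cat, y in zip(categories, targets):
        bucket = events if y == 1 else non_events
        bucket[cat] = bucket.get(cat, 0) + 1

    total_events = sum(events.values())
    total_non_events = sum(non_events.values())

    woe = {}
    for cat in dict.fromkeys(categories):  # unique categories, original order
        n_event = events.get(cat, 0)
        n_non_event = non_events.get(cat, 0)
        if n_event == 0 or n_non_event == 0:
            # The logarithm is undefined when a category lacks one of the classes.
            raise ValueError(f"category {cat!r} must contain both classes")
        woe[cat] = math.log(
            (n_non_event / total_non_events) / (n_event / total_events)
        )
    return woe
```

A positive value marks a category in which non-events are over-represented relative to the whole sample, and a negative value marks the opposite.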
2. Classification and regression tasks
• Formulation of the problem, key concepts and definitions.
• Concept of Classification and regression trees: graphical representation,
analysis of the importance of predictors, general methodology, quality
parameters, division into training and control samples, cross-validation
methods.
• Other methods for building classification and regression trees: Generalized
CHAID models, boosted trees, random forests.
• Support vector machines (SVM), notion of the optimal hyperplane.
• Probabilistic approach to solving classification tasks, naive Bayes models.
• Nonparametric regression, Generalized additive models (GAM).
• Spline approach to solving regression problems, Multivariate adaptive
regression splines (MARS).
• Comparison of different models with the Goodness of fit tool, visual
analysis of lift and gain charts.
• Combining different models, ensemble learning (boosting and bagging).
• Application of the models to new data, the Rapid Deployment tool, «vote»
between models.
• Classical methods of regression analysis: multivariate and logistic
regression, variable selection, Akaike information criterion.
• Multivariate normal distribution, Fisher's discriminant analysis.
• Analysis of censored data, survival analysis.
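The division into training and control samples with cross-validation, mentioned in this part, can be sketched as follows. This is a minimal Python illustration, not the procedure used in STATISTICA; the function name `k_fold_indices` and its `seed` parameter are hypothetical.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (training, control) index lists for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # reproducible shuffle
    # Every k-th shuffled index goes to the same fold, so fold sizes differ by at most 1.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        control = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, control
```

Each observation lands in the control sample exactly once, so a model fitted on each training part can be scored on data it has not seen, and the k scores can then be averaged.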
3. Other tasks and methods for data analysis
• Cluster analysis: formulation of the problem, key concepts and definitions,
k-means clustering, tree clustering, two-way joining, and the EM algorithm.
• Dimension reduction: formulation of the problem, curse of dimensionality,
principal component analysis, multidimensional scaling, factor analysis, and
independent component analysis.
• Neural networks: methodology of the neural network approach,
network structures, optimal choice of complexity and architecture.
• Automation of data analysis, creation of automated reports, the Data
Miner Workspace and Data Miner Recipes tools.
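To give a flavour of the k-means clustering listed in this part, here is a minimal sketch of the standard iteration (often called Lloyd's algorithm) in plain Python. It is an illustration only and not the STATISTICA implementation; the function names `assign` and `k_means`, the fixed number of iterations, and the user-supplied initial centers are assumptions.

```python
def assign(points, centers):
    """Index of the nearest center (squared Euclidean distance) for each point."""
    return [
        min(
            range(len(centers)),
            key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centers[i])),
        )
        for p in points
    ]


def k_means(points, centers, n_iter=10):
    """Alternate assignment and center-update steps n_iter times."""
    centers = [tuple(c) for c in centers]
    for _ in range(n_iter):
        labels = assign(points, centers)
        new_centers = []
        for i, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:
                # New center is the coordinate-wise mean of the cluster members.
                new_centers.append(
                    tuple(sum(c) / len(members) for c in zip(*members))
                )
            else:
                new_centers.append(old)  # keep the center of an empty cluster
        centers = new_centers
    return centers, assign(points, centers)
```

The result depends on the initial centers, which is one reason practical tools usually try several starting configurations.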
Literature
1. Duda, R., Hart, P., and Stork, D. Pattern classification. John Wiley, 2001.
2. Hastie, T., Tibshirani, R., and Friedman, J. The elements of statistical learning:
Data Mining, inference, and prediction. Springer, 2001.
3. Härdle, W. and Simar, L. Applied multivariate statistical analysis. Springer, 2012.
4. Hyvärinen, A., Karhunen, J., and Oja, E. Independent component analysis. John
Wiley & Sons, 2001.
5. Nisbet, R., Elder, J., and Miner, G. Handbook of statistical analysis and Data Mining
applications. Elsevier, 2009.
6. Wasserman, L. All of nonparametric statistics. Springer, 2007.