Intelligent Data Analysis and Probabilistic Inference
Data Mining Tutorial 1: Overview & Data Cleaning

1. Basic Concepts
   a. Give a brief definition of the term “Data Mining”.
   b. Briefly explain the difference between “Data Mining”, “OLAP” and traditional “Database Querying”.
   c. Explain the difference between “Explorative Data Mining” and “Predictive Data Mining” and give one example of each.
   d. State three different applications for which data mining techniques seem appropriate. Informally explain each application.
   e. Explain what is meant by “Data Integration” and describe why it is an important pre-processing step for data mining.

2. Data Mining Techniques and Applications
   a. Explain briefly the difference between “Classification” and “Clustering” and give an informal example of an application that would benefit from each technique.
   b. Explain briefly the difference between “Regression” and “Classification”.
   c. Explain briefly what is meant by “Association Rule Analysis” and describe the difference between it and “Sequence Rule Analysis”.

3. Clustering: You are given the task of clustering (i.e. dividing into similar groups) the students attending this tutorial based on their physical appearance.
   a. Devise a feature representation scheme that allows each student in the class to be described as a record. Make sure that you have at least 5 features to describe each student. For each feature, describe the type of variable it denotes (Numerical, Categorical, etc.) and state the valid range of values for that variable.
   b. Fill in the feature table for six students, i.e. build a table containing 6 rows and 5 columns and provide the values for each cell in the table.
   c. Describe why you believe your feature representation scheme will produce good results when applied to grouping the students in the tutorial.
   d. Explain what is meant by an “outlier”.
      Add a new record to the table that you believe would be an outlier compared to the whole data set and also to the different clusters, and explain why it is indeed an outlier.

4. Classification: You are now given the task of deriving a model that can predict whether a student will pass the data mining course or not (PASS/FAIL decision).
   a. Devise a feature representation scheme with five features that can help derive such a model. Make sure you choose features you believe may be good predictors of a student’s grade, and describe why you believe they are better predictors than the features you chose in question 3.
   b. Fill in the table with six different records for six hypothetical students from the class of 2000. This table should contain six columns (one column for each of your chosen features, and one column for the Pass/Fail result) and six rows (one for each student).
   c. Which columns (variables) of this table are “independent” (“input”) variables and which are “dependent” (“output” or “class”) variables?
   d. A decision rule is of the form “If FeatureA = FeatureValue1 then ClassValue = ClassValue1”. Informally derive at least 4 “Decision Rules” that can be inferred from your data table. Is there any inconsistency between your rules?
   e. Explain informally how you can test the accuracy of your decision rules based on the data set you have provided. What is the accuracy of each rule? What is the accuracy of the overall model (i.e. the 4 rules together)? Testing the accuracy of the rules on your data set may be biased: the rules probably over-fit your data, since they were derived and tested using only this data set. What would be a better way to assess the accuracy of your rules?
   f. Explain how your decision rules can be applied to predict whether you yourself will PASS or FAIL the data mining course in 2003.

yg@doc.ic.ac.uk, mmg@doc.ic.ac.uk 25th Nov 2003

5. Classification/Prediction/Feature Selection: There are many applications of data mining in finance.
Explain why and how it can be dangerous to naively use predictive data mining techniques to predict stock price movements.
   Hint: Consider what features you would choose to describe each stock, and also consider what really makes stock prices move. Can you find a good feature set that can be presented to a data mining algorithm?

6. Data Cleaning:
   a. Explain what is meant by “Data Cleaning” and why it may be required before mining a large data set.
   b. Describe three commonly used data cleaning operations.
   c. Explain three methods for handling missing data in a data set.

7. Data Cleaning: Given the following data set
   [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
   a. Divide the data set into 3 equi-depth bins.
   b. Divide the data set into 3 bins that are smoothed by their means.
   c. Normalize the data set using min-max normalization.
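
The three operations named in question 7 can be sketched in a few lines of Python. This is only an illustrative sketch, not a model answer from the tutorial authors: the function names are hypothetical, the binning assumes the number of values divides evenly by the number of bins (12 / 3 here), and the target range [0, 1] for min-max normalization is an assumption, since the sheet does not state one.

```python
# Data set from question 7 (already sorted on the tutorial sheet).
data = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]


def equi_depth_bins(values, n_bins):
    """Split the sorted values into n_bins bins of equal size (equal depth).

    Assumption: len(values) is divisible by n_bins, as it is for this data.
    """
    ordered = sorted(values)
    depth = len(ordered) // n_bins
    return [ordered[i * depth:(i + 1) * depth] for i in range(n_bins)]


def smooth_by_means(bins):
    """Replace every value in a bin with the mean of that bin."""
    return [[sum(b) / len(b)] * len(b) for b in bins]


def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so the minimum maps to new_min and the
    maximum maps to new_max (the [0, 1] default is an assumption)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min
            for v in values]


bins = equi_depth_bins(data, 3)
smoothed = smooth_by_means(bins)
normalized = min_max_normalize(data)

print("equi-depth bins:  ", bins)
print("smoothed by means:", smoothed)
print("min-max [0, 1]:   ", [round(x, 3) for x in normalized])
```

Equi-depth binning puts the same number of values in each bin, whereas equi-width binning would instead split the value range 4 to 34 into intervals of equal length. The two approaches generally give different bins for the same data.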