Data Mining

Questions we would like to answer:
1. What is data mining from a statistical point of view?
2. How and why is it different (and is it different) from the data analysis we already know?
3. What EXTRA steps are involved?
4. What is the end product?

Essentially, in my opinion, the idea of data mining is letting the data speak for itself, without an a-priori hypothesis. Generally, up to this point when we have analyzed data we had a hypothesis in mind and really collected the data to verify that hypothesis. In data mining the idea is that we want to see what the data tells us, and sort of create hypotheses along the way. So the focus is on WHAT the data tells us, and the general end product of data mining is PREDICTION. So the big difference from what we have done before is that the MODEL isn't as important as the PREDICTION.

In this day and age data is being constantly collected in all avenues of life. For example:
- Blockbuster Entertainment collected its video rental history, and then MINED the database to recommend rentals to individual customers. Amazon does this ALL the time.
- American Express can suggest products to its cardholders based on analysis of their monthly expenditures.
- WalMart is pioneering massive data mining to transform its supplier relationships. WalMart captures point-of-sale transactions from thousands of stores in multiple countries and continuously transmits this data to its massive data warehouse.
- The National Basketball Association (NBA) is exploring a data mining application that can be used in conjunction with image recordings of basketball games.

So, data is constantly being collected without a SPECIFIC hypothesis in mind. When we use this data to PREDICT patterns, we are essentially DATA MINING.

According to http://www.anderson.ucla.edu/faculty/jason.frand/teacher/technologies/palace/datamining.htm:

One Midwest grocery chain used the data mining capacity of Oracle software to analyze local buying patterns.
They discovered that when men bought diapers on Thursdays and Saturdays, they also tended to buy beer. Further analysis showed that these shoppers typically did their weekly grocery shopping on Saturdays; on Thursdays, however, they only bought a few items. The retailer concluded that they purchased the beer to have it available for the upcoming weekend. The grocery chain could use this newly discovered information in various ways to increase revenue. For example, they could move the beer display closer to the diaper display, and they could make sure beer and diapers were sold at full price on Thursdays.

The Advanced Scout software analyzes the movements of players to help coaches orchestrate plays and strategies. For example, an analysis of the play-by-play sheet of the game played between the New York Knicks and the Cleveland Cavaliers on January 6, 1995 reveals that when Mark Price played the Guard position, John Williams attempted four jump shots and made each one! Advanced Scout not only finds this pattern, but explains that it is interesting because it differs considerably from the average shooting percentage of 49.30% for the Cavaliers during that game.

So STATISTICALLY the idea is prediction. We would like to divide the class into two parts:
1. Regression
2. Classification

In each case we are predicting: in regression we are predicting a numerical outcome, and in classification we are predicting a categorical outcome. In each case we talk about:
1. Data splitting for cross-validation
2. Pre-processing (including transformation, centering, scaling and outlier issues)
3. Feature extraction (sort of model building)
4. Model tuning (based on cross-validation)

I am going to use the following as a motivating example. It is various body measurements of a person, including weight. The variables are:

pelvic.brdth, waistgirth, thighgirth, bicepgirth, calfgirth, height, gender, head, age, weight

I am posting a second data set; see if you can write relevant code for this data.
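The first of the four steps above, data splitting (together with the fold assignment used for cross-validation in model tuning), can be sketched in base R. This is a minimal sketch on the built-in mtcars data so it runs stand-alone; substitute your own data frame and outcome variable.

```r
# Step 1 (data splitting) and the fold assignment used in step 4
# (model tuning via cross-validation), sketched on built-in mtcars.
set.seed(123)                               # reproducible split
df <- mtcars
n  <- nrow(df)

# Hold out ~20% of the rows as a test set; the rest is for training.
test.idx  <- sample(n, size = round(0.2 * n))
train.set <- df[-test.idx, ]
test.set  <- df[test.idx, ]

# Assign each training row to one of k folds for k-fold cross-validation.
k     <- 10
folds <- sample(rep(1:k, length.out = nrow(train.set)))

# Example: cross-validated RMSE of a linear model for mpg.
cv.err <- sapply(1:k, function(i) {
  fit  <- lm(mpg ~ ., data = train.set[folds != i, ])
  pred <- predict(fit, newdata = train.set[folds == i, ])
  sqrt(mean((train.set$mpg[folds == i] - pred)^2))
})
mean(cv.err)
```

The caret package (used below) automates this with createDataPartition() and train(), but the underlying mechanics are exactly these: one held-out test set, and fold labels that rotate through the training data.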
It is data from car sales. When a used car is sold, certain info about the car is noted: Price, Mileage, Make, Model, Trim, Type, Cylinder, Liter, Doors, Cruise, Sound, Leather.

So, here we can try and use the EXISTING data to PREDICT weight. R code for the data: let us read the data and explore it.

#reading in data
my.data1=read.table("weight.csv",header=TRUE,sep=",")

#looking at skewness
library(e1071)
skew=apply(my.data1,2,skewness)
skew
pelvic.brdth   waistgirth    thighgrth    bicepgrth     calfgrth       height
 -0.24563916   0.55194739   0.65056220   0.21630837   0.12608718   0.19931443
      gender         head          age       weight
  0.04982819   1.01998534  -0.08651051   0.98839201

#plotting histograms
hist(my.data1$weight)

[Figure: histogram of my.data1$weight, clearly right-skewed, with weights ranging from about 40 to 160.]

#looking at transformations
library(caret)
weight.tr=BoxCoxTrans(my.data1$weight)
weight.tr
Box-Cox Transformation

400 data points used to estimate Lambda

Input data summary:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  42.00   58.20   68.50   69.21   79.20  163.20

Largest/Smallest: 3.89
Sample Skewness: 0.988

Estimated Lambda: -0.3

#checking data quality
nearZeroVar(my.data1)
integer(0)

#looking at correlation matrix
corr=cor(my.data1)
corr
             pelvic.brdth waistgirth   thighgrth  bicepgrth     calfgrth
pelvic.brdth  1.000000000 0.42381471  0.36365333 0.27494587  0.373358323
waistgirth    0.423814715 1.00000000  0.40798891 0.79807575  0.621871864
thighgrth     0.363653325 0.40798891  1.00000000 0.40773259  0.609362496
bicepgrth     0.274945870 0.79807575  0.40773259 1.00000000  0.632878549
calfgrth      0.373358323 0.62187186  0.60936250 0.63287855  1.000000000
height        0.388133873 0.56863605  0.11634735 0.59122767  0.484758342
gender        0.115209559 0.67522559 -0.07325898 0.74293782  0.404531109
head          0.002753601 0.01542238  0.02557778 0.01035976 -0.002296497
age           0.002094842 0.01912470  0.05944882 0.02746919  0.023530915
weight        0.448448308 0.84059319  0.49411519 0.81088742  0.721588269
                    height       gender         head           age      weight
pelvic.brdth  0.3881338732  0.115209559  0.002753601  0.0020948419  0.44844831
waistgirth    0.5686360544  0.675225594  0.015422383  0.0191246961  0.84059319
thighgrth     0.1163473481 -0.073258984  0.025577778  0.0594488157  0.49411519
bicepgrth     0.5912276720  0.742937822  0.010359756  0.0274691933  0.81088742
calfgrth      0.4847583424  0.404531109 -0.002296497  0.0235309148  0.72158827
height        1.0000000000  0.687109407 -0.024881032 -0.0001059692  0.68226836
gender        0.6871094068  1.000000000 -0.008462887 -0.0299288380  0.62374433
head         -0.0248810317 -0.008462887  1.000000000 -0.0476271640 -0.02773139
age          -0.0001059692 -0.029928838 -0.047627164  1.0000000000  0.04000786
weight        0.6822683607  0.623744327 -0.027731391  0.0400078584  1.00000000

#correlation plots
library(corrplot)
corrplot(corr,order="original")

[Figure: correlation plot of the ten variables; weight is strongly correlated with waistgirth, bicepgrth, calfgrth and height, while head and age are essentially uncorrelated with everything else.]

# Multiple Linear Regression Example
fit <- lm(weight ~ .,
          data = my.data1)
summary(fit)   # show results

Call:
lm(formula = weight ~ ., data = my.data1)

Residuals:
   Min     1Q Median     3Q    Max
-6.111 -1.713 -0.413  1.325 99.943

Coefficients:
               Estimate Std. Error t value Pr(>|t|)
(Intercept)  -107.97833    7.25294 -14.888  < 2e-16 ***
pelvic.brdth    0.20094    0.16415   1.224  0.22164
waistgirth      0.49686    0.05017   9.904  < 2e-16 ***
thighgrth       0.31268    0.10918   2.864  0.00441 **
bicepgrth       0.80092    0.14482   5.531 5.86e-08 ***
calfgrth        0.74996    0.16171   4.638 4.82e-06 ***
height          0.37624    0.04646   8.098 7.24e-15 ***
gender         -1.50312    1.31795  -1.140  0.25478
head           -0.11695    0.07432  -1.573  0.11642
age             0.02563    0.03950   0.649  0.51685
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 5.641 on 390 degrees of freedom
Multiple R-squared: 0.8394, Adjusted R-squared: 0.8357
F-statistic: 226.5 on 9 and 390 DF, p-value: < 2.2e-16

[Figure: scatterplots of weight against each of the predictors (pelvic.brdth, waistgirth, thighgrth, bicepgrth, calfgrth, height, gender, head, age).]

# diagnostic plots
layout(matrix(c(1,2,3,4),2,2))   # optional 4 graphs/page
plot(fit)

[Figure: the four lm diagnostic plots (Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage with Cook's distance contours); observations 13, 62 and 216 are repeatedly flagged as outlying/influential points.]
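Since BoxCoxTrans estimated Lambda = -0.3 for the skewed weight variable, it is worth seeing what that transformation actually does. The following is a minimal sketch of the Box-Cox formula applied by hand; it uses simulated right-skewed data so it runs stand-alone, and a simple moment-based skewness estimate (e1071::skewness computes a similar quantity).

```r
# Applying the Box-Cox transformation y -> (y^lambda - 1)/lambda by hand,
# with the lambda (-0.3) estimated above. Simulated right-skewed data
# stands in for the weights so the sketch is self-contained.
set.seed(1)
y <- rlnorm(400, meanlog = 4.2, sdlog = 0.25)   # skewed, weight-like values

lambda <- -0.3
y.tr   <- (y^lambda - 1) / lambda               # Box-Cox (lambda != 0)

# The transform is invertible, so predictions made on the transformed
# scale can be mapped back to the original scale:
y.back <- (lambda * y.tr + 1)^(1 / lambda)
max(abs(y.back - y))                            # ~0 (round-off only)

# The transformed values are much less skewed than the raw ones:
skew <- function(x) mean((x - mean(x))^3) / sd(x)^3
c(raw = skew(y), transformed = skew(y.tr))
```

In practice you would fit the model to the transformed response (or use predict() on the BoxCoxTrans object from caret) and back-transform the predictions; the huge maximum residual (99.943) in the summary above is exactly the kind of symptom the transformation, together with a look at the flagged outliers, is meant to address.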