Data Mining:
Questions we would like to answer:
1. What is data mining from a Statistical point of view?
2. How and why is it different (and is it different) from data
analysis that we already know?
3. What EXTRA steps are involved?
4. What is the end product?
Essentially, in my opinion, the idea of data mining is letting the
data speak for itself, without an a priori hypothesis.
Generally, up to this point, when we have analyzed data we had a
hypothesis in mind and really collected the data to verify that
hypothesis. In data mining the idea is that we want to see what the
data tells us, and create hypotheses along the way. So the focus is
on WHAT the data tells us, and the general end product of data
mining is PREDICTION. So the big difference from what we have done
before is that the MODEL isn't as important as the PREDICTION.
In this day and age data is being constantly collected in all
avenues of life. For example,
 Blockbuster Entertainment collected its video rental
history, and then MINED the database to recommend
rentals to individual customers.
 Amazon does this ALL the time.
 American Express can suggest products to its cardholders
based on analysis of their monthly expenditures.
 WalMart is pioneering massive data mining to transform
its supplier relationships. WalMart captures point-of-sale
transactions from thousands of stores in multiple
countries and continuously transmits this data to its
massive data warehouse.
 The National Basketball Association (NBA) is exploring a
data mining application that can be used in conjunction
with image recordings of basketball games.
So, data is constantly being collected, without having a SPECIFIC
hypothesis in mind. When we use this data to PREDICT
patterns we are essentially DATA MINING.
According to
http://www.anderson.ucla.edu/faculty/jason.frand/teacher/tec
hnologies/palace/datamining.htm
One Midwest grocery chain used the data mining capacity of
Oracle software to analyze local buying patterns. They
discovered that when men bought diapers on Thursdays and
Saturdays, they also tended to buy beer. Further analysis
showed that these shoppers typically did their weekly grocery
shopping on Saturdays. On Thursdays, however, they only
bought a few items. The retailer concluded that they purchased
the beer to have it available for the upcoming weekend. The
grocery chain could use this newly discovered information in
various ways to increase revenue. For example, they could
move the beer display closer to the diaper display. And, they
could make sure beer and diapers were sold at full price on
Thursdays.
The Advanced Scout software analyzes the movements of
players to help coaches orchestrate plays and strategies. For
example, an analysis of the play-by-play sheet of the game
played between the New York Knicks and the Cleveland
Cavaliers on January 6, 1995 reveals that when Mark Price
played the Guard position, John Williams attempted four jump
shots and made each one! Advanced Scout not only finds this
pattern, but explains that it is interesting because it differs
considerably from the average shooting percentage of 49.30%
for the Cavaliers during that game.
So STATISTICALLY the idea is prediction.
We would like to divide the class into two parts:
1. Regression
2. Classification
So in each case we are predicting. In Regression we are
predicting a numerical outcome and in Classification we are
predicting a categorical outcome.
In each case we talk about:
1. Data splitting for cross validation
2. Pre-processing (including transformation, centering,
scaling and outlier issues)
3. Feature extraction (sort of model building)
4. Model tuning (based on cross-validation)
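As a preview, the four steps above can be sketched with the caret package. This is only a sketch, not the code we will build up in class; it assumes a data frame my.data1 with a numeric outcome column named weight, like the example below.

```r
# Sketch of the four steps using caret; assumes a data frame my.data1
# with a numeric outcome column named weight
library(caret)

# 1. Data splitting: hold out 25% of the rows for validation
set.seed(1)
in.train   <- createDataPartition(my.data1$weight, p = 0.75, list = FALSE)
train.data <- my.data1[in.train, ]
test.data  <- my.data1[-in.train, ]

# 2. Pre-processing: center and scale the predictors
# 3./4. Feature extraction and model tuning: train() fits the model
#       and tunes it by 10-fold cross-validation
cv.fit <- train(weight ~ ., data = train.data, method = "lm",
                preProcess = c("center", "scale"),
                trControl = trainControl(method = "cv", number = 10))
cv.fit
```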
I am going to use the following example as a motivating
example. It contains various measurements of a person,
including weight:
pelvic.brdth, waistgirth, thighgrth, bicepgrth, calfgrth,
height, gender, head, age, weight.
I am posting a second data set; see if you can write relevant
code for it. It is data from car sales: when a used car is sold,
certain information about the car is noted:
Price, Mileage, Make, Model, Trim, Type, Cylinder, Liter, Doors,
Cruise, Sound, Leather.
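A starting point for exploring the car data might look like the following sketch (the file name carsales.csv is an assumption; use whatever name you saved the posted file under).

```r
# Read the used-car data and look at its structure
# (file name "carsales.csv" is an assumption)
car.data <- read.table("carsales.csv", header = TRUE, sep = ",")
str(car.data)            # check which variables are numeric vs categorical
summary(car.data$Price)  # Price is the natural outcome to predict
# Make, Model, Trim and Type are categorical, so restrict skewness and
# correlation calculations to the numeric columns only
```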
So, for the weight data, we can try to use the EXISTING body
measurements to PREDICT weight.
R code for the data:
Let us read the data and explore it:
#reading in data
my.data1=read.table("weight.csv",header=TRUE,sep=",")
#looking at skewness
library(e1071)
skew=apply(my.data1,2,skewness)
skew
pelvic.brdth  waistgirth   thighgrth   bicepgrth    calfgrth      height
 -0.24563916  0.55194739  0.65056220  0.21630837  0.12608718  0.19931443
      gender        head         age      weight
  0.04982819  1.01998534 -0.08651051  0.98839201
#plotting histograms
hist(my.data1$weight)
[Histogram of my.data1$weight: weight from about 40 to 160 on the
x-axis, frequency up to about 100 on the y-axis; the distribution
is clearly right-skewed.]
#looking at transformations
library(caret)
weight.tr=BoxCoxTrans(my.data1$weight)
weight.tr
Box-Cox Transformation
400 data points used to estimate Lambda
Input data summary:
Min. 1st Qu. Median Mean 3rd Qu. Max.
42.00 58.20 68.50 69.21 79.20 163.20
Largest/Smallest: 3.89
Sample Skewness: 0.988
Estimated Lambda: -0.3
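The estimated lambda of -0.3 can then be applied with predict() on the BoxCoxTrans object; a sketch:

```r
# Apply the estimated Box-Cox transformation (lambda = -0.3) to weight
library(caret)
library(e1071)
weight.transformed <- predict(weight.tr, my.data1$weight)
skewness(my.data1$weight)     # about 0.99 before the transformation
skewness(weight.transformed)  # should be much closer to 0
hist(weight.transformed)
```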
#checking data quality
nearZeroVar(my.data1)
integer(0)
#looking at correlation matrix
corr=cor(my.data1)
corr
(correlations rounded to 3 decimals)
             pelvic.brdth waistgirth thighgrth bicepgrth calfgrth height gender   head    age weight
pelvic.brdth        1.000      0.424     0.364     0.275    0.373  0.388  0.115  0.003  0.002  0.448
waistgirth          0.424      1.000     0.408     0.798    0.622  0.569  0.675  0.015  0.019  0.841
thighgrth           0.364      0.408     1.000     0.408    0.609  0.116 -0.073  0.026  0.059  0.494
bicepgrth           0.275      0.798     0.408     1.000    0.633  0.591  0.743  0.010  0.027  0.811
calfgrth            0.373      0.622     0.609     0.633    1.000  0.485  0.405 -0.002  0.024  0.722
height              0.388      0.569     0.116     0.591    0.485  1.000  0.687 -0.025  0.000  0.682
gender              0.115      0.675    -0.073     0.743    0.405  0.687  1.000 -0.008 -0.030  0.624
head                0.003      0.015     0.026     0.010   -0.002 -0.025 -0.008  1.000 -0.048 -0.028
age                 0.002      0.019     0.059     0.027    0.024  0.000 -0.030 -0.048  1.000  0.040
weight              0.448      0.841     0.494     0.811    0.722  0.682  0.624 -0.028  0.040  1.000
#correlation plots
library(corrplot)
corrplot(corr,order="original")
[corrplot output: 10 x 10 correlation plot of the variables above,
with a colour scale from -1 to 1; the girth measurements, height,
gender, and weight form a block of strong positive correlations,
while head and age are essentially uncorrelated with everything
else.]
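Several of the girth measurements are strongly correlated with each other (e.g. waistgirth and bicepgrth at about 0.80), so before modeling we could let caret's findCorrelation() flag predictors to consider dropping; a sketch:

```r
# Flag predictor columns whose pairwise correlation exceeds 0.75
# (the outcome weight is excluded from the search)
library(caret)
pred.corr <- cor(my.data1[, names(my.data1) != "weight"])
high.corr <- findCorrelation(pred.corr, cutoff = 0.75)
colnames(pred.corr)[high.corr]   # candidate columns to remove
```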
# Multiple Linear Regression Example
fit <- lm(weight~ ., data=my.data1)
summary(fit) # show results
Call:
lm(formula = weight ~ ., data = my.data1)
Residuals:
Min 1Q Median 3Q Max
-6.111 -1.713 -0.413 1.325 99.943
Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -107.97833    7.25294 -14.888  < 2e-16 ***
pelvic.brdth    0.20094    0.16415   1.224  0.22164    
waistgirth      0.49686    0.05017   9.904  < 2e-16 ***
thighgrth       0.31268    0.10918   2.864  0.00441 ** 
bicepgrth       0.80092    0.14482   5.531 5.86e-08 ***
calfgrth        0.74996    0.16171   4.638 4.82e-06 ***
height          0.37624    0.04646   8.098 7.24e-15 ***
gender         -1.50312    1.31795  -1.140  0.25478    
head           -0.11695    0.07432  -1.573  0.11642    
age             0.02563    0.03950   0.649  0.51685    
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 5.641 on 390 degrees of freedom
Multiple R-squared: 0.8394, Adjusted R-squared: 0.8357
F-statistic: 226.5 on 9 and 390 DF, p-value: < 2.2e-16
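Since the end product is PREDICTION, the fitted model should be judged by its predictions rather than its coefficients. A minimal sketch, using the in-sample data (an optimistic estimate; cross-validation gives a more honest one):

```r
# Predictions from the fitted linear model and a simple error summary
pred.weight <- predict(fit, newdata = my.data1)
rmse <- sqrt(mean((my.data1$weight - pred.weight)^2))
rmse   # in-sample root mean squared error, in the units of weight
```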
[Feature plots: weight (vertical axis, roughly 40 to 160) plotted
against each predictor: pelvic.brdth, waistgirth, thighgrth,
bicepgrth, calfgrth, height, gender, head, age.]
# diagnostic plots
layout(matrix(c(1,2,3,4),2,2)) # optional 4 graphs/page
plot(fit)
[Diagnostic plots from plot(fit): residuals vs fitted values,
Normal Q-Q plot of standardized residuals, and Residuals vs
Leverage with Cook's distance contours; observations 13, 62,
and 216 are flagged in several panels.]