a2datadive-student-presentation-r-120221150915

Author(s): Alex Ocampo, Chong Zhang, Sangwon Hyun, Yiqun Hu, 2012
License: Unless otherwise noted, this material is made available under the terms of the Creative Commons Attribution – Noncommercial – Share Alike 3.0 License: http://creativecommons.org/licenses/by-nc-sa/3.0/
We have reviewed this material in accordance with U.S. Copyright Law and have tried to maximize your ability to use, share, and adapt it. The citation key on the following slide provides information about how you may share and adapt this material.
Copyright holders of content included in this material should contact open.michigan@umich.edu with any questions, corrections, or clarification regarding the use of content.
For more information about how to cite these materials visit http://open.umich.edu/education/about/terms-of-use.
Attribution Key
for more information see: http://open.umich.edu/wiki/AttributionPolicy
Use + Share + Adapt
{ Content the copyright holder, author, or law permits you to use, share and adapt. }
Public Domain – Government: Works that are produced by the U.S. Government. (17 USC § 105)
Public Domain – Expired: Works that are no longer protected due to an expired copyright term.
Public Domain – Self Dedicated: Works that a copyright holder has dedicated to the public domain.
Creative Commons – Zero Waiver
Creative Commons – Attribution License
Creative Commons – Attribution Share Alike License
Creative Commons – Attribution Noncommercial License
Creative Commons – Attribution Noncommercial Share Alike License
GNU – Free Documentation License
Make Your Own Assessment
{ Content Open.Michigan believes can be used, shared, and adapted because it is ineligible for copyright. }
Public Domain – Ineligible: Works that are ineligible for copyright protection in the U.S. (17 USC § 102(b)) *laws in your jurisdiction may differ
{ Content Open.Michigan has used under a Fair Use determination. }
Fair Use: Use of works that is determined to be Fair consistent with the U.S. Copyright Act. (17 USC § 107) *laws in your jurisdiction may differ
Our determination DOES NOT mean that all uses of this 3rd-party content are Fair Uses and we DO NOT guarantee that your use of the content is Fair.
To use this content you should do your own independent analysis to determine whether or not your use will be Fair.
Descriptive Statistics
Quantitatively describe the main features of a collection of data.
[Illustration: a manager asks "How do salaries vary across the company?"; an employee (Staff. Jones, HR) replies "What should I make of all this???!!!"]
Descriptive Statistics in R
Mean
> mean(x)
> mean(x, trim=a)
Median
> median(x)
Mode
> sort(table(x))
Standard deviation
> sd(x)
Variance
> var(x)
Median absolute deviation
> mad(x)
Interquartile range
> IQR(x)
Range
> range(x)
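As a quick illustration of the functions above, here is a minimal sketch on a small made-up vector. The values in x and the helper name mode_x are assumptions chosen for the example, not data from the presentation.

```r
# Hypothetical sample: eight salaries in thousands, with one large value
x <- c(42, 45, 45, 48, 51, 53, 58, 120)

mean(x)               # arithmetic mean, pulled up by the 120
mean(x, trim = 0.25)  # drops the lowest and highest 25% before averaging
median(x)             # middle value, robust to the 120

# Mode: tabulate the values and take the most frequent one
counts <- table(x)
mode_x <- as.numeric(names(counts)[which.max(counts)])
mode_x

sd(x)                 # standard deviation
var(x)                # variance, equal to sd(x)^2
mad(x)                # median absolute deviation (scaled by 1.4826 by default)
IQR(x)                # distance between the 1st and 3rd quartiles
range(x)              # smallest and largest value
```

Comparing mean(x) with median(x) or the trimmed mean is one simple way to see how much a single extreme value moves the average.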
Data Dimensions
> length(x)
[1] 1000
Matrix X
> nrow(X)
[1] 2034
> ncol(X)
[1] 100000
> dim(X)
[1] 2034 100000
Vectorization in R
Matrix X
> apply(X, MARGIN=1, FUN=mean)   # mean of each row
> apply(X, MARGIN=2, FUN=mean)   # mean of each column
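To make the MARGIN argument concrete, the following sketch applies mean() across a tiny matrix. The 3 x 4 matrix is a made-up stand-in for the much larger X on the slide.

```r
# Hypothetical 3 x 4 matrix filled column by column with 1..12
X <- matrix(1:12, nrow = 3, ncol = 4)

dim(X)  # number of rows, then number of columns

row_means <- apply(X, MARGIN = 1, FUN = mean)  # one mean per row
col_means <- apply(X, MARGIN = 2, FUN = mean)  # one mean per column
row_means
col_means

# rowMeans() and colMeans() are built-in shortcuts for these two cases
all.equal(row_means, rowMeans(X))
all.equal(col_means, colMeans(X))
```

For plain row or column means the dedicated functions are usually faster; apply() is the more general tool because FUN can be any function.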
Boxplots: > boxplot(X)
• Good for small data sets
• Easy to compare groups side by side
• Points more than 1.5*IQR beyond the box are drawn as outliers
[Figure: side-by-side boxplots of epiE, epiS, epiImp, epilie, epiNeur]
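The 1.5*IQR rule can be inspected without drawing anything. This is a minimal sketch on made-up values; note that boxplot() measures the spread between the hinges, which can differ slightly from IQR(x) on small samples.

```r
# Hypothetical scores with one unusually large value
x <- c(2, 4, 5, 5, 6, 7, 8, 9, 30)

# boxplot.stats() returns the numbers that boxplot() would draw
stats <- boxplot.stats(x, coef = 1.5)
stats$stats  # lower whisker, lower hinge, median, upper hinge, upper whisker
stats$out    # points more than 1.5 * (hinge spread) away from the box

# Draw the plot itself; the flagged point appears as a separate dot
boxplot(x, horizontal = TRUE)
```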
The Big Six
• Minimum, 1st Q, Median, Mean, 3rd Q, Maximum
> summary(X)
R tries to understand you
> summary(X)
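"R tries to understand you" refers to summary() choosing its output based on the type of object it receives. A small sketch with made-up data (the names salary, dept, and staff are illustrative only):

```r
# Hypothetical data: a numeric vector, a factor, and a data frame holding both
salary <- c(42, 45, 45, 48, 51, 53, 58, 120)
dept   <- factor(c("HR", "IT", "IT", "HR", "IT", "HR", "IT", "IT"))
staff  <- data.frame(salary = salary, dept = dept)

num_summary <- summary(salary)  # the six numbers listed above
fac_summary <- summary(dept)    # counts per level instead
num_summary
fac_summary

summary(staff)                  # one block per column, chosen by column type
```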
Histograms: > hist(X)
[Figure: grid of histograms (Frequency on the y-axis) for epiE, epiS, epiImp, epilie, epiNeur, bfagree, bfcon, bfext, bfneur, bfopen, bdi]
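In base R, hist() expects a single numeric vector, so one way to get a grid like the one above is to loop over the columns. The sketch below uses the built-in iris measurements as a stand-in, since the presentation's data set is not included here.

```r
# Stand-in data: the four numeric columns of the built-in iris data set
X <- iris[, 1:4]

# Arrange a 2 x 2 grid of panels, then draw one histogram per column
old_par <- par(mfrow = c(2, 2))
for (name in names(X)) {
  hist(X[[name]], main = name, xlab = name)
}
par(old_par)  # restore the previous layout

# plot = FALSE returns the bin counts without drawing
h <- hist(X$Sepal.Length, plot = FALSE)
sum(h$counts)  # every observation falls in exactly one bin
```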
Correlation
Scatterplot Example
> cor(wt,mpg)
[1] -0.8676594
> plot(x=wt,y=mpg)
[Figure: scatterplot of Miles Per Gallon (y-axis) against Car Weight (x-axis)]
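The slide does not say where wt and mpg come from. The sketch below assumes they are columns of R's built-in mtcars data set, which gives the same correlation.

```r
# Assumption: wt and mpg are taken from mtcars, which ships with R
# (wt is weight in 1000 lbs, mpg is miles per gallon)
wt  <- mtcars$wt
mpg <- mtcars$mpg

r <- cor(wt, mpg)  # Pearson correlation by default
r

# Heavier cars tend to have lower mileage, hence the negative sign
plot(x = wt, y = mpg, xlab = "Car Weight", ylab = "Miles Per Gallon")
```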
Scatterplot Matrix
• Iris dataset
• 150 flowers
• 5 variables
Goingslo, flickr
Scatterplot Matrix
> pairs(data)
[Figure: scatterplot matrix of Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, and Species (setosa, versicolor, virginica)]
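A runnable version of the scatterplot matrix, assuming data on the slide is R's built-in iris data frame (the colour-by-species variation is an addition for illustration):

```r
# iris is built in: 150 rows, 4 measurements plus the Species factor
data <- iris
dim(data)

# Scatterplot of every pair of columns; Species is plotted via its integer codes
pairs(data)

# Optional variation: measurements only, points coloured by species
pairs(data[, 1:4], col = as.integer(data$Species))
```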
> coplot(lat ~ long | depth)
[Figure: conditioning plot of lat (y-axis) against long (x-axis), one panel per range of depth]
Linear Regression
• Why?
- Prediction of future or unknown observations
- Assessment of relationship between variables
- General description of data structure
• What?
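Variable Selection
• Why?
- Simplification
- Elimination of multicollinearity and noise
- Time and money saving
• How?
- Testing-based Variable Selection Methods: Backward, Forward, Stepwise
- Criterion-based Procedures
• What?
- AIC = n ln(RSS/n) + 2(p), where n is the number of observations, RSS is the residual sum of squares, and p is the number of estimated coefficients (including the intercept)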
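The AIC formula can be checked by hand against extractAIC(), the function step() relies on. This sketch assumes a data frame built from R's built-in state.x77 matrix; the names statedata, rss, and aic_by_hand are illustrative.

```r
# Assumption: the example data are R's built-in state.x77 matrix
statedata <- data.frame(state.x77)
g <- lm(Life.Exp ~ ., data = statedata)

n   <- nrow(statedata)      # number of observations
rss <- sum(residuals(g)^2)  # residual sum of squares
p   <- length(coef(g))      # estimated coefficients, intercept included

aic_by_hand <- n * log(rss / n) + 2 * p
aic_by_hand

# extractAIC() returns the equivalent degrees of freedom, then this same quantity
extractAIC(g)
```

Note that AIC(g) reports a different number because it keeps the constant terms of the log-likelihood; differences between models are unaffected, so the ranking is the same.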
Example: U.S. State Facts and Figures
• Response: Life Expectancy
• Predictors: Population, Income, Illiteracy, Murder, HS Grad, Frost, Area
• Selected R code
- Linear Regression
> g <- lm(Life.Exp ~ Population + Income + Illiteracy + Murder + HS.Grad + Frost + Area, data = statedata)
> summary(g)
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  7.094e+01  1.748e+00  40.586  < 2e-16 ***
Population   5.180e-05  2.919e-05   1.775   0.0832 .
Income      -2.180e-05  2.444e-04  -0.089   0.9293
Illiteracy   3.382e-02  3.663e-01   0.092   0.9269
Murder      -3.011e-01  4.662e-02  -6.459 8.68e-08 ***
HS.Grad      4.893e-02  2.332e-02   2.098   0.0420 *
Frost       -5.735e-03  3.143e-03  -1.825   0.0752 .
Area        -7.383e-08  1.668e-06  -0.044   0.9649
> anova(g)
Analysis of Variance Table
Response: Life.Exp
           Df  Sum Sq Mean Sq F value    Pr(>F)
Population  1  0.4089  0.4089  0.7372  0.395434
Income      1 11.5946 11.5946 20.9028 4.218e-05 ***
Illiteracy  1 19.4207 19.4207 35.0116 5.228e-07 ***
Murder      1 27.4288 27.4288 49.4486 1.308e-08 ***
HS.Grad     1  4.0989  4.0989  7.3895  0.009494 **
Frost       1  2.0488  2.0488  3.6935  0.061426 .
Area        1  0.0011  0.0011  0.0020  0.964908
Residuals  42 23.2971  0.5547
- AIC
> step(g)
AIC = n ln(RSS/n) + 2(p)
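The slides do not show how statedata was created. A plausible reconstruction, offered as an assumption, builds it from R's built-in state.x77 matrix and refits the model:

```r
# Assumption: statedata comes from the built-in state.x77 matrix.
# data.frame() turns "Life Exp" and "HS Grad" into Life.Exp and HS.Grad.
statedata <- data.frame(state.x77)

g <- lm(Life.Exp ~ Population + Income + Illiteracy + Murder +
          HS.Grad + Frost + Area, data = statedata)

# Estimates, standard errors, t values and p-values as a numeric matrix
coef_table <- coef(summary(g))
round(coef_table, 4)

# anova() adds terms one at a time, so its p-values depend on term order
anova_table <- anova(g)
anova_table
```

This explains why Income looks highly significant in the anova() table but not in summary(g): the sequential test only adjusts for Population, while the t test adjusts for every other predictor.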
Continued: U.S. State Facts and Figures
Start: AIC=-22.18
Life.Exp ~ Population + Income + Illiteracy + Murder + HS.Grad + Frost + Area
             Df Sum of Sq    RSS     AIC
- Area        1    0.0011 23.298 -24.182
- Income      1    0.0044 23.302 -24.175
- Illiteracy  1    0.0047 23.302 -24.174
<none>                    23.297 -22.185
- Population  1    1.7472 25.044 -20.569
- Frost       1    1.8466 25.144 -20.371
- HS.Grad     1    2.4413 25.738 -19.202
- Murder      1   23.1411 46.438  10.305
Step: AIC=-24.18
Life.Exp ~ Population + Income + Illiteracy + Murder + HS.Grad + Frost
             Df Sum of Sq    RSS     AIC
- Illiteracy  1    0.0038 23.302 -26.174
- Income      1    0.0059 23.304 -26.170
<none>                    23.298 -24.182
- Population  1    1.7599 25.058 -22.541
- Frost       1    2.0488 25.347 -21.968
- HS.Grad     1    2.9804 26.279 -20.163
- Murder      1   26.2721 49.570  11.569
Continued: U.S. State Facts and Figures
Step: AIC=-28.16
Life.Exp ~ Population + Murder + HS.Grad + Frost
             Df Sum of Sq    RSS     AIC
<none>                    23.308 -28.161
- Population  1     2.064 25.372 -25.920
- Frost       1     3.122 26.430 -23.877
- HS.Grad     1     5.112 28.420 -20.246
- Murder      1    34.816 58.124  15.528
Coefficients:
(Intercept)  Population      Murder     HS.Grad       Frost
  7.103e+01   5.014e-05  -3.001e-01   4.658e-02  -5.943e-03
[Figure: "Effect on Response Variable of One Unit Change of Predict Variable"; Life Expectancy (y-axis) against Predict Variables Intercept, x1, x4, x5, x6, annotated with 71.03, 0.00005014, 0.3001, 0.04658, 0.005943]
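The whole backward search shown above can be run in one call. As before, building statedata from state.x77 is an assumption, and the names final and kept are illustrative.

```r
# Assumption: statedata is built from R's built-in state.x77 matrix
statedata <- data.frame(state.x77)
g <- lm(Life.Exp ~ ., data = statedata)

# trace = 0 suppresses the step-by-step tables and returns only the final fit
final <- step(g, direction = "backward", trace = 0)

formula(final)                             # model that survived the search
kept <- attr(terms(final), "term.labels")  # names of the retained predictors
kept
coef(final)                                # change in Life.Exp per unit of each predictor
```

At each step the term whose removal lowers AIC the most is dropped; the search stops when <none> has the lowest AIC, as in the last table above.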
What is Principal Component Analysis (PCA)?
• Two general approaches to reducing the number of variables: feature selection and feature extraction
- Feature selection: Akaike Information Criterion (AIC), BIC, or backward elimination
- Feature extraction: Principal Component Analysis (PCA) is the most widely used
• Creates several artificial variables
• Built-in functions in R = Convenient!
Actual Pima Data
  pregnant glucose diastolic triceps insulin  bmi diabetes age test
1        6     148        72      35       0 33.6    0.627  50    1
2        1      85        66      29       0 26.6    0.351  31    0
3        8     183        64       0       0 23.3    0.672  32    1
4        1      89        66      23      94 28.1    0.167  21    0
5        0     137        40      35     168 43.1    2.288  33    1
6        5     116        74       0       0 25.6    0.201  30    0
….
(Imagine a data set with many more (~1000) columns)
(Imagine a linear regression: which variables affect diabetes, and in what ways?)
PCA Example: Pima Indians
• The National Institute of Diabetes and Digestive and Kidney Diseases conducted a study on 768 adult female Pima Indians living near Phoenix.
• 9 Variables (8 continuous, 1 categorical)
- pregnant : Number of times pregnant
- glucose : Plasma glucose concentration at 2 hours in an oral glucose tolerance test
- diastolic : Diastolic blood pressure (mm Hg)
- triceps : Triceps skin fold thickness (mm)
- insulin : 2-Hour serum insulin (mu U/ml)
- bmi : Body mass index (weight in kg/(height in metres squared))
- diabetes : Diabetes pedigree function
- age : Age (years)
- test : diabetes (coded 0 if negative, 1 if positive)
• Next Slide: PCA Implementation
What principal components might look like:
• PC1 : 1*Insulin + 0.01*Glucose + ..
• PC2 : 1*Glucose + 0.12*Age + 0.12*DiastolicBP + ..
• PC3 : 0.92*DiastolicBP + 0.31*Triceps
Principal components : What are they composed of? (less important)
Difference with Linear Regression
- How many dimensions?
- Goal: obtain summary about data in lower dimensions
- R code in the next slide:
[Figure: biplot of the Pima data on PC1 (x-axis) and PC2 (y-axis); observations drawn as "+", with arrows for pregnant, glucose, diastolic, triceps, insulin, bmi, age]
Brief : R-Code
> data.pca <- prcomp(data[,-9])
> summary(data.pca)
Importance of components:
                           PC1     PC2     PC3     PC4     PC5     PC6     PC7
Standard deviation     116.002 30.5411 19.7630 14.0777 10.6155 6.76973 2.78575
Proportion of Variance   0.889  0.0616  0.0258  0.0131 0.00744 0.00303 0.00051
Cumulative Proportion    0.889   0.950   0.976  0.9890   0.996   0.999 1.00000
> data.pca
Rotation:
             PC1   PC2   PC3   PC4    PC5    PC6    PC7
pregnant   0.002 -0.02  0.02  0.05  2e-01 -0.005 -1e+00
glucose   -0.098 -0.97 -0.14 -0.12 -9e-02  0.051 -9e-04
diastolic -0.016 -0.14  0.92  0.26 -2e-01  0.076  1e-03
triceps   -0.061  0.06  0.31 -0.88  3e-01  0.221  4e-04
insulin   -0.993  0.09 -0.02  0.07 -2e-04 -0.006 -1e-03
bmi       -0.014 -0.05  0.13 -0.19  2e-02 -0.971  3e-03
age        0.004 -0.14  0.13  0.30  9e-01 -0.015  2e-01
> barplot(totalrep, main="Representation of Principal Components", xlab="Principal Component", ylab="% of Total Variance")
> biplot(data.pca, xlabs=rep('+',768), xlim = c(-0.05,0.3), ylim = c(-0.15,0.12))
> abline(h=0,v=0)
[Figure: bar plot "Representation of Principal Components"; % of Total Variance (y-axis) by Principal Component (x-axis)]
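The same workflow can be tried end to end on data that ships with R. This sketch swaps in the four numeric iris columns for the Pima data, and prop_var is a hypothetical stand-in for the undefined totalrep on the slide, computed here as each component's share of the total variance.

```r
# Stand-in data: the four numeric iris columns (not the Pima data)
X <- iris[, 1:4]

pca <- prcomp(X)  # centres the columns; scale. = FALSE by default
summary(pca)      # standard deviations and proportions of variance

# Share of total variance carried by each component
prop_var <- pca$sdev^2 / sum(pca$sdev^2)
barplot(prop_var, names.arg = colnames(pca$rotation),
        main = "Representation of Principal Components",
        xlab = "Principal Component", ylab = "% of Total Variance")

pca$rotation      # loadings: weight of each variable in each component
biplot(pca)       # observations and variable arrows on PC1 vs PC2
abline(h = 0, v = 0)

# scale. = TRUE standardises each column first, so units matter less
pca_scaled <- prcomp(X, scale. = TRUE)
summary(pca_scaled)
```

In the Pima output above, PC1 is almost entirely insulin (loading -0.993), which is typical when one unscaled variable has a much larger spread than the rest; comparing the scaled and unscaled fits is one way to check for that.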