Author(s): Alex Ocampo, Chong Zhang, Sangwon Hyun, Yiqun Hu, 2012

License: Unless otherwise noted, this material is made available under the terms of the Creative Commons Attribution – Noncommercial – Share Alike 3.0 License: http://creativecommons.org/licenses/by-nc-sa/3.0/

We have reviewed this material in accordance with U.S. Copyright Law and have tried to maximize your ability to use, share, and adapt it. The citation key on the following slide provides information about how you may share and adapt this material.

Copyright holders of content included in this material should contact open.michigan@umich.edu with any questions, corrections, or clarification regarding the use of content.

For more information about how to cite these materials visit http://open.umich.edu/education/about/terms-of-use.

Attribution Key
For more information see: http://open.umich.edu/wiki/AttributionPolicy

Use + Share + Adapt
{ Content the copyright holder, author, or law permits you to use, share and adapt. }
Public Domain – Government: Works that are produced by the U.S. Government. (17 USC § 105)
Public Domain – Expired: Works that are no longer protected due to an expired copyright term.
Public Domain – Self Dedicated: Works that a copyright holder has dedicated to the public domain.
Creative Commons – Zero Waiver
Creative Commons – Attribution License
Creative Commons – Attribution Share Alike License
Creative Commons – Attribution Noncommercial License
Creative Commons – Attribution Noncommercial Share Alike License
GNU – Free Documentation License

Make Your Own Assessment
{ Content Open.Michigan believes can be used, shared, and adapted because it is ineligible for copyright. }
Public Domain – Ineligible: Works that are ineligible for copyright protection in the U.S. (17 USC § 102(b)) *laws in your jurisdiction may differ
{ Content Open.Michigan has used under a Fair Use determination. }
Fair Use: Use of works that is determined to be Fair consistent with the U.S. Copyright Act. (17 USC § 107) *laws in your jurisdiction may differ
Our determination DOES NOT mean that all uses of this 3rd-party content are Fair Uses and we DO NOT guarantee that your use of the content is Fair. To use this content you should do your own independent analysis to determine whether or not your use will be Fair.

Descriptive Statistics
Descriptive statistics quantitatively describe the main features of a collection of data.
[Cartoon: a manager asks, "How do salaries vary across the company?" Jones, an HR staff employee, thinks, "What should I make of all this???!!!"]

Descriptive Statistics in R
Mean
> mean(x)
> mean(x, trim=a)
Median
> median(x)
Mode
> sort(table(x))
Standard deviation
> sd(x)
Variance
> var(x)
Median absolute deviation
> mad(c(x))
Interquartile range
> IQR(x)
Range
> range(x)

Data Dimensions
> length(x)
[1] 1000
> nrow(X)
[1] 2030
> ncol(X)
[1] 100000
> dim(X)
[1] 2034 100000
[Diagram: matrix X]

Vectorization in R
> apply(X, MARGIN=1, FUN=mean)   # mean of each row
> apply(X, MARGIN=2, FUN=mean)   # mean of each column
[Diagram: matrix X]

Boxplots
> boxplot(X)
• Good for small data sets
• Easy to compare groups side by side
• Points more than 1.5*IQR beyond the quartiles are drawn as outliers
[Figure: side-by-side boxplots of epiE, epiS, epiImp, epilie, epiNeur]

The Big Six
Minimum, 1st Quartile, Median, Mean, 3rd Quartile, Maximum
> summary(X)

R tries to understand you
summary() is generic: its output depends on the class of its argument.
> summary(X)

Histograms
> hist(X)
[Figure: histograms of epiE, epiS, epiImp, epilie, epiNeur, bfagree, bfcon, bfext, bfneur, bfopen, bdi; y-axis: Frequency]

Correlation: Scatterplot Example
> cor(wt, mpg)
[1] -0.8676594
> plot(x=wt, y=mpg)
[Figure: scatterplot of Miles Per Gallon against Car Weight]

Scatterplot Matrix
• Iris dataset
• 150 flowers
• 5 variables
(Photo: Goingslo, flickr)

Scatterplot Matrix plot
> pairs(data)
[Figure: scatterplot matrix of the iris variables Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, and Species]
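The descriptive-statistics, vectorization, and plotting calls on the preceding slides can be tried end to end on data that ships with R. The sketch below is a minimal illustration, not the authors' script. It assumes the built-in mtcars and iris data frames as stand-ins for the unspecified x, X, and data (the correlation of -0.8676594 quoted above is what mtcars gives for wt and mpg). The names mode_x, row_means, col_means, and r are chosen here for illustration.

```r
# Stand-in data (assumption): mtcars and iris ship with R.
x <- mtcars$cyl                                   # a numeric vector
X <- as.matrix(mtcars[, c("mpg", "wt", "hp")])    # a numeric matrix

# Centre and spread of a vector
mean(x); mean(x, trim = 0.1); median(x)
sd(x); var(x); mad(x); IQR(x); range(x)

# Mode: the most frequent value (returned as a character label)
mode_x <- names(sort(table(x), decreasing = TRUE))[1]

# Dimensions of a vector and a matrix
length(x); dim(X)

# apply() over rows (MARGIN = 1) and over columns (MARGIN = 2)
row_means <- apply(X, MARGIN = 1, FUN = mean)
col_means <- apply(X, MARGIN = 2, FUN = mean)

# Minimum, quartiles, median, mean, and maximum for each column
summary(X)

# Correlation and the plots from the slides
r <- cor(mtcars$wt, mtcars$mpg)
plot(x = mtcars$wt, y = mtcars$mpg,
     xlab = "Car Weight", ylab = "Miles Per Gallon")
boxplot(mtcars[, c("mpg", "qsec")])
pairs(iris)
```

For column and row means specifically, colMeans(X) and rowMeans(X) give the same numbers as the apply() calls; apply() is the general tool when FUN is something other than a mean.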
> coplot(lat ~ long | depth)
[Figure: conditioning plot of lat against long, given depth]

Linear Regression
Why?
• Prediction of future or unknown observations
• Assessment of the relationship between variables
• General description of data structure
What?

Variable Selection
Why?
• Simplification
• Elimination of multicollinearity and noise
• Time and money saving
How?
• Testing-based variable selection methods
   - Backward, Forward, Stepwise
• Criterion-based procedures
What?
AIC = n ln(RSS/n) + 2p
where n is the number of observations, RSS is the residual sum of squares, and p is the number of estimated coefficients (including the intercept).

Example: U.S. State Facts and Figures
Response: Life Expectancy
Predictors: Population, Income, Illiteracy, Murder, HS Grad, Frost, Area

Selected R code

Linear Regression
> g <- lm(Life.Exp ~ Population + Income + Illiteracy + Murder + HS.Grad + Frost + Area, data = statedata)
> summary(g)
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  7.094e+01  1.748e+00  40.586  < 2e-16 ***
Population   5.180e-05  2.919e-05   1.775   0.0832 .
Income      -2.180e-05  2.444e-04  -0.089   0.9293
Illiteracy   3.382e-02  3.663e-01   0.092   0.9269
Murder      -3.011e-01  4.662e-02  -6.459 8.68e-08 ***
HS.Grad      4.893e-02  2.332e-02   2.098   0.0420 *
Frost       -5.735e-03  3.143e-03  -1.825   0.0752 .
Area        -7.383e-08  1.668e-06  -0.044   0.9649

> anova(g)
Analysis of Variance Table
Response: Life.Exp
           Df  Sum Sq Mean Sq F value    Pr(>F)
Population  1  0.4089  0.4089  0.7372  0.395434
Income      1 11.5946 11.5946 20.9028 4.218e-05 ***
Illiteracy  1 19.4207 19.4207 35.0116 5.228e-07 ***
Murder      1 27.4288 27.4288 49.4486 1.308e-08 ***
HS.Grad     1  4.0989  4.0989  7.3895  0.009494 **
Frost       1  2.0488  2.0488  3.6935  0.061426 .
Area        1  0.0011  0.0011  0.0020  0.964908
Residuals  42 23.2971  0.5547

AIC
> step(g)

Continued: U.S. State Facts and Figures
Start:  AIC=-22.18
Life.Exp ~ Population + Income + Illiteracy + Murder + HS.Grad + Frost + Area

             Df Sum of Sq    RSS     AIC
- Area        1    0.0011 23.298 -24.182
- Income      1    0.0044 23.302 -24.175
- Illiteracy  1    0.0047 23.302 -24.174
<none>                    23.297 -22.185
- Population  1    1.7472 25.044 -20.569
- Frost       1    1.8466 25.144 -20.371
- HS.Grad     1    2.4413 25.738 -19.202
- Murder      1   23.1411 46.438  10.305

Step:  AIC=-24.18
Life.Exp ~ Population + Income + Illiteracy + Murder + HS.Grad + Frost

             Df Sum of Sq    RSS     AIC
- Illiteracy  1    0.0038 23.302 -26.174
- Income      1    0.0059 23.304 -26.170
<none>                    23.298 -24.182
- Population  1    1.7599 25.058 -22.541
- Frost       1    2.0488 25.347 -21.968
- HS.Grad     1    2.9804 26.279 -20.163
- Murder      1   26.2721 49.570  11.569

Continued: U.S. State Facts and Figures
Step:  AIC=-28.16
Life.Exp ~ Population + Murder + HS.Grad + Frost

             Df Sum of Sq    RSS     AIC
<none>                    23.308 -28.161
- Population  1     2.064 25.372 -25.920
- Frost       1     3.122 26.430 -23.877
- HS.Grad     1     5.112 28.420 -20.246
- Murder      1    34.816 58.124  15.528

Coefficients:
(Intercept)   Population       Murder      HS.Grad        Frost
  7.103e+01    5.014e-05   -3.001e-01    4.658e-02   -5.943e-03

[Figure: "Effect on Response Variable of One Unit Change of Predictor Variable"; y-axis: Life Expectancy; x-axis: Predictor Variables (Intercept, x1, x4, x5, x6); labelled values: 71.03, 0.00005014, 0.3001, 0.04658, 0.005943]

What is Principal Component Analysis (PCA)?
• Two general approaches to reducing the number of variables: feature selection and feature extraction
• Feature selection: Akaike Information Criterion (AIC), BIC, or backward elimination
• Feature extraction: Principal Component Analysis (PCA) is the most widely used
   - Creates several artificial variables
   - Built-in functions in R = Convenient!

Actual Pima Data
  pregnant glucose diastolic triceps insulin  bmi diabetes age test
1        6     148        72      35       0 33.6    0.627  50    1
2        1      85        66      29       0 26.6    0.351  31    0
3        8     183        64       0       0 23.3    0.672  32    1
4        1      89        66      23      94 28.1    0.167  21    0
5        0     137        40      35     168 43.1    2.288  33    1
6        5     116        74       0       0 25.6    0.201  30    0
….
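The two routes named on the "What is PCA?" slide can be contrasted on the state data from the regression example. The sketch below is hedged: it assumes statedata was built from R's built-in state.x77 matrix (the slides do not say so, though the Start AIC of -22.18 is consistent with that). The names n, rss, p, aic_manual, selected, predictors, and state_pca are introduced here for illustration. It checks the AIC formula by hand, lets step() perform feature selection, and then runs prcomp() on the same predictors as feature extraction.

```r
# Assumption: statedata comes from the built-in state.x77 matrix;
# data.frame() turns "Life Exp" and "HS Grad" into Life.Exp and HS.Grad.
statedata <- data.frame(state.x77)

g <- lm(Life.Exp ~ Population + Income + Illiteracy + Murder +
          HS.Grad + Frost + Area, data = statedata)

# AIC = n ln(RSS/n) + 2p, with p counting the intercept
n   <- nrow(statedata)
rss <- sum(residuals(g)^2)
p   <- length(coef(g))
aic_manual <- n * log(rss / n) + 2 * p
aic_manual                      # the "Start: AIC" value that step() reports
extractAIC(g)[2]                # the same quantity, computed by R

# Feature selection: backward elimination while AIC keeps decreasing
selected <- step(g, trace = 0)  # trace = 0 suppresses the step-by-step log
formula(selected)

# Feature extraction: replace the predictors with principal components
predictors <- setdiff(names(statedata), "Life.Exp")
state_pca  <- prcomp(statedata[, predictors], scale. = TRUE)
summary(state_pca)
```

Selection keeps a subset of the original columns, so the fitted coefficients stay interpretable in the original units. Extraction keeps every column but mixes them into new variables, which is why the loadings have to be inspected to see what each component means.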
(Imagine a data set with many more (~1000) columns.)
(Imagine a linear regression: which variables affect diabetes, and in what ways?)

PCA Example: Pima Indians
The National Institute of Diabetes and Digestive and Kidney Diseases conducted a study on 768 adult female Pima Indians living near Phoenix.
9 variables (8 continuous, 1 categorical):
pregnant : Number of times pregnant
glucose : Plasma glucose concentration at 2 hours in an oral glucose tolerance test
diastolic : Diastolic blood pressure (mm Hg)
triceps : Triceps skin fold thickness (mm)
insulin : 2-hour serum insulin (mu U/ml)
bmi : Body mass index (weight in kg/(height in metres squared))
diabetes : Diabetes pedigree function
age : Age (years)
test : Diabetes (coded 0 if negative, 1 if positive)

Next slide: PCA implementation

What principal components might look like:
PC1 : 1*Insulin + 0.01*Glucose + ..
PC2 : 1*Glucose + 0.12*Age + 0.12*DiastolicBP + ..
PC3 : 0.92*DiastolicBP + 0.31*Triceps

- Principal components: what are they composed of? (less important)
- Difference from linear regression
- How many dimensions?
- Goal: obtain a summary of the data in lower dimensions
- R code in the next slide

[Figure: biplot of PC1 against PC2; observations plotted as "+", with arrows for pregnant, glucose, diastolic, triceps, insulin, bmi, and age]

Brief: R Code
> data.pca <- prcomp(data[,-9]); summary(data.pca);
Importance of components:
                           PC1     PC2     PC3     PC4     PC5     PC6     PC7
Standard deviation     116.002 30.5411 19.7630 14.0777 10.6155 6.76973 2.78575
Proportion of Variance   0.889  0.0616  0.0258  0.0131 0.00744 0.00303 0.00051
Cumulative Proportion    0.889  0.950   0.976   0.9890 0.996   0.999   1.00000

> data.pca
Rotation:
             PC1   PC2   PC3   PC4    PC5    PC6    PC7
pregnant   0.002 -0.02  0.02  0.05  2e-01 -0.005 -1e+00
glucose   -0.098 -0.97 -0.14 -0.12 -9e-02  0.051 -9e-04
diastolic -0.016 -0.14  0.92  0.26 -2e-01  0.076  1e-03
triceps   -0.061  0.06  0.31 -0.88  3e-01  0.221  4e-04
insulin   -0.993  0.09 -0.02  0.07 -2e-04 -0.006 -1e-03
bmi       -0.014 -0.05  0.13 -0.19  2e-02 -0.971  3e-03
age        0.004 -0.14  0.13  0.30  9e-01 -0.015  2e-01

> barplot(totalrep, main="Representation of Principal Components", xlab="Principal Component", ylab="% of Total Variance")
> biplot(data.pca, xlabs=rep('+',768), xlim = c(-0.05,0.3), ylim = c(-0.15,0.12)); abline(h=0,v=0);
[Figure: bar chart "Representation of Principal Components"; x-axis: Principal Component; y-axis: % of Total Variance]
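The barplot() call above uses totalrep without showing how it was built. One plausible construction, offered as an assumption rather than the authors' code, is the share of total variance carried by each component, computed from the sdev element that prcomp() returns. Because the Pima data do not ship with base R, the sketch uses the built-in USArrests data as a stand-in; pca_raw, pca_scaled, and dominant are illustrative names. It also shows the effect of scale. = TRUE: without scaling, the variable with the largest variance dominates PC1, which mirrors insulin's loading of -0.993 in the rotation above.

```r
# Stand-in data (assumption): USArrests ships with R; the Pima data do not.
pca_raw <- prcomp(USArrests)          # centred but not scaled (the default)

# Share of total variance per component; this is what summary(pca_raw)
# prints in its "Proportion of Variance" row.
totalrep <- pca_raw$sdev^2 / sum(pca_raw$sdev^2)
names(totalrep) <- colnames(pca_raw$rotation)

barplot(totalrep, main = "Representation of Principal Components",
        xlab = "Principal Component", ylab = "Proportion of Total Variance")

# Which variable has the largest absolute loading on PC1?
dominant <- names(which.max(abs(pca_raw$rotation[, "PC1"])))
dominant    # "Assault", the column with by far the largest variance

# Standardising each variable first puts them on an equal footing.
pca_scaled <- prcomp(USArrests, scale. = TRUE)
round(pca_scaled$rotation[, "PC1"], 2)

biplot(pca_scaled); abline(h = 0, v = 0)
```

Whether to scale is a modelling choice: unscaled PCA is reasonable when all variables share a unit, while variables measured in different units (as in the Pima data) are usually standardised first so that no single column drives the first component.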