Stats Homework (R) Part 1 of 2
• Read section 11 of handbook_statsV13
• What is the difference between these two models?
  – g1 = glm(admit ~ ., data=mydata)
  – g2 = glm(admit ~ ., data=mydata, family="binomial")
• Compute MSE (mean squared error)
  – mean((g1$fitted.values - mydata$admit)^2)
  – mean((g2$fitted.values - mydata$admit)^2)
• Which model is better in terms of MSE?

Stats Homework (R) Part 2 of 2
• Consider two loss functions
  – MSE.loss = function(y, yhat) mean((y - yhat)^2)
  – log.loss = function(y, yhat) -mean(y * log(yhat) + (1-y) * log(1-yhat))
• Which model (g1 or g2) is better in terms of log.loss?
  – Hint: log.loss assumes yhat is a probability between 0 and 1
  – What are range(g1$fitted.values) and range(g2$fitted.values)?
• Let s = g1$fitted.values > 0 & g1$fitted.values < 1
  – What is log.loss(mydata$admit[s], g1$fitted.values[s])?
• Let baseline.fit be rep(mean(admit), length(admit))
  – How do the two models above compare to this baseline?
  – Compare both g1 and g2 to this baseline using both loss functions

Functions
• fact = function(x) if(x < 2) 1 else x*fact(x-1)
• fact(5)
• fact2 = function(x) gamma(x+1)
• fact(5)
• fact(1:5)
• fact(seq(1,5,0.5))

help(swiss)

What are statistics packages good for?
• Plotting
  – Scatter plots, histograms, boxplots
• Visualization
  – Dendrograms, heatmaps
• Modeling
  – Linear Regression
  – Logistic Regression
• Hypothesis Testing

Objects

Data Types
• Numbers: 1, 1.5
• Strings
• Functions:
  – sqrt, help, summary
• Vectors:
  – 3:30
  – c(1:5, 25:30)
  – rnorm(100, 0, 1), rbinom(100, 10, .5)
  – rep(0,10)
  – scan(filename, 0)
• Matrices
  – matrix(rep(0,100), ncol=10)
  – diag(rep(1,10))
  – cbind(1:10, 21:30)
  – rbind(1:10, 21:30)
• Data Frames
  – swiss
  – read.table(filename, header=T)

Operations
• All: summary, help, print, cat
• Numbers: +,*,sqrt
• Strings: grep, substr, paste
• Vectors: x = (3:30)
  – x+5
  – plot(x)
  – mean(x), var(x), median(x)
  – mean((x - mean(x))^2)
  – x[3], x[3:6], x[-1], x[x>20]
• Matrices: m = matrix(1:100,ncol=10)
  – m+3, m*3, m %*% m
  – m[3,3], m[1:3,1:3], m[3,], m[m < 10]
  – diag(m), t(m)
• Data Frames
  – names(swiss)
  – plot(swiss$Education, swiss$Fertility)
  – plot(swiss[,1], swiss[,2])
  – pairs(swiss)

plot(swiss$Education, swiss$Fertility)
plot(swiss$Education, swiss$Fertility, col = 3 + (swiss$Catholic > 50))
boxplot(split(swiss$Fertility, swiss$Catholic > 50))
boxplot(split(swiss$Fertility, swiss$Education > median(swiss$Education)))
pairs(swiss)

Correlations
> cor(swiss$Fertility, swiss$Catholic)
[1] 0.4636847
> cor(swiss$Fertility, swiss$Education)
[1] -0.6637889
> round(100*cor(swiss))
                 Fertility Agriculture Examination Education Catholic Infant.Mortality
Fertility              100          35         -65       -66       46               42
Agriculture             35         100         -69       -64       40               -6
Examination            -65         -69         100        70      -57              -11
Education              -66         -64          70       100      -15              -10
Catholic                46          40         -57       -15      100               18
Infant.Mortality        42          -6         -11       -10       18              100
• Many functions in R are polymorphic:
  – cor works on vectors as well as matrices

Regression
> plot(swiss$Education, swiss$Fertility)
> g = glm(swiss$Fertility ~ swiss$Education)
> abline(g)
> summary(g)

Call:
glm(formula = swiss$Fertility ~ swiss$Education)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-17.036   -6.711   -1.011    9.526   19.689

Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)      79.6101     2.1041  37.836  < 2e-16 ***
swiss$Education  -0.8624     0.1448  -5.954 3.66e-07 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

plot(swiss$Education, swiss$Fertility, type='n'); abline(g)
text(swiss$Education, swiss$Fertility, dimnames(swiss)[[1]])

g = glm(Fertility ~ . , data=swiss); summary(g)

Call:
glm(formula = Fertility ~ ., data = swiss)

Deviance Residuals:
     Min        1Q    Median        3Q       Max
-15.2743   -5.2617    0.5032    4.1198   15.3213

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)      66.91518   10.70604   6.250 1.91e-07 ***
Agriculture      -0.17211    0.07030  -2.448  0.01873 *
Examination      -0.25801    0.25388  -1.016  0.31546
Education        -0.87094    0.18303  -4.758 2.43e-05 ***
Catholic          0.10412    0.03526   2.953  0.00519 **
Infant.Mortality  1.07705    0.38172   2.822  0.00734 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for gaussian family taken to be 51.34251)

http://www.nytimes.com/2009/01/07/technology/business-computing/07program.html

Plot Example
• help(ldeaths)
• plot(mdeaths, col="blue", ylab="Deaths", sub="Male (blue), Female (pink)", ylim=range(c(mdeaths, fdeaths)))
• lines(fdeaths, lwd=3, col="pink")
• abline(v=1970:1980, lty=3)
• abline(h=seq(0,3000,1000), lty=3, col="red")

Periodicity
• plot(1:length(fdeaths), fdeaths, type='l')
• lines((1:length(fdeaths))+12, fdeaths, lty=3)

Type Argument
par(mfrow=c(2,2))
plot(fdeaths, type='p', main="points")
plot(fdeaths, type='l', main="lines")
plot(fdeaths, type='b', main="b")
plot(fdeaths, type='o', main="o")

Scatter Plot
• plot(as.vector(mdeaths), as.vector(fdeaths))
• g=glm(fdeaths ~ mdeaths)
• abline(g)
• g$coef
(Intercept)     mdeaths
-45.2598005   0.4050554

Hist & Density
• par(mfrow=c(2,1))
• hist(fdeaths/mdeaths, nclass=30)
• plot(density(fdeaths/mdeaths))

Data Frames
• help(cars)
• names(cars)
• summary(cars)
• plot(cars)

• cars2 = cars
• cars2$speed2 = cars$speed^2
• cars2$speed3 = cars$speed^3
• summary(cars2)
• names(cars2)
• plot(cars2)
• options(digits=2)
• cor(cars2)

Normality
par(mfrow=c(2,1))
plot(density(cars$dist/cars$speed))
lines(density(rnorm(1000000, mean(cars$dist/cars$speed), sqrt(var(cars$dist/cars$speed)))), col="red")
qqnorm(cars$dist/cars$speed)
abline(mean(cars$dist/cars$speed), sqrt(var(cars$dist/cars$speed)))

Stopping Distance Increases Quickly With Speed
• plot(cars$speed, cars$dist/cars$speed)
• boxplot(split(cars$dist/cars$speed, round(cars$speed/10)*10))

Quadratic Model of Stopping Distance
plot(cars$speed, cars$dist)
cars$speed2 = cars$speed^2
g2 = glm(cars$dist ~ cars$speed2)
lines(cars$speed, g2$fitted.values)

Bowed Residuals
g1 = glm(dist ~ poly(speed, 1), data=cars)
g2 = glm(dist ~ poly(speed, 2), data=cars)
par(mfrow=c(1,2))
boxplot(split(g1$resid, round(cars$speed/5))); abline(h=0)
boxplot(split(g2$resid, round(cars$speed/5))); abline(h=0)

Help, Demo, Example
• demo(graphics)
  – example(plot)
  – example(lines)
  – help(cars)
  – help(WWWusage)
  – example(abline)
  – example(text)
  – example(par)
• pairs
  – example(pairs)
  – help(quakes)
  – help(airquality)
  – help(attitude)
  – Anorexia
    • utils::data(anorexia, package="MASS")
    • pairs(anorexia, col=c("red", "green", "blue")[anorexia$Treat])
• boxplots
  – example(boxplot)
  – help(chickwts)
• demo(plotmath)
• counting
  – example(table)
  – example(quantile)
  – example(hist)
  – help(faithful)
• Randomness
  – example(rnorm)
  – example(rbinom)
  – example(rt)
• Normality
  – example(qqnorm)
• Regression
  – help(cars)
  – example(glm)
  – demo(lm.glm)
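
Worked sketch for the Stats Homework slides. The homework's mydata is not included in these slides, so the block below simulates a hypothetical stand-in with a 0/1 admit column and two made-up predictors (gre, gpa); the simulation coefficients are assumptions chosen only for illustration. It fits the gaussian and binomial glm, then scores both against the constant baseline with the two loss functions from Part 2.

```r
# Hypothetical stand-in for `mydata` (NOT the homework data): simulated admissions.
set.seed(1)
n <- 400
mydata <- data.frame(gre = rnorm(n, 580, 115), gpa = rnorm(n, 3.4, 0.4))
p.true <- plogis(-8 + 0.004 * mydata$gre + 1.5 * mydata$gpa)  # assumed true model
mydata$admit <- rbinom(n, 1, p.true)

# The two models from Part 1: least squares vs. logistic regression.
g1 <- glm(admit ~ ., data = mydata)
g2 <- glm(admit ~ ., data = mydata, family = "binomial")

# The two loss functions from Part 2.
MSE.loss <- function(y, yhat) mean((y - yhat)^2)
log.loss <- function(y, yhat) -mean(y * log(yhat) + (1 - y) * log(1 - yhat))

# log.loss needs probabilities strictly between 0 and 1, so check the ranges
# and keep only the g1 fitted values that qualify.
print(range(g1$fitted.values))
print(range(g2$fitted.values))
s <- g1$fitted.values > 0 & g1$fitted.values < 1

# Constant baseline: predict the overall admit rate for everyone.
y <- mydata$admit
baseline.fit <- rep(mean(y), length(y))

results <- rbind(
  MSE = c(g1 = MSE.loss(y, g1$fitted.values),
          g2 = MSE.loss(y, g2$fitted.values),
          baseline = MSE.loss(y, baseline.fit)),
  log = c(g1 = log.loss(y[s], g1$fitted.values[s]),
          g2 = log.loss(y, g2$fitted.values),
          baseline = log.loss(y, baseline.fit)))
print(round(results, 4))
```

Two comparisons hold by construction rather than by luck: least squares with an intercept cannot have a larger MSE than the constant baseline, and logistic regression with an intercept cannot have a larger log loss than the constant baseline. Note that the g1 log loss is computed on the subset s only, so it is not strictly comparable to the other two entries in that row whenever s drops observations.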
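
Worked sketch for the Functions slide. The recursive fact tests a single value with if, so it is not vectorized: as far as I know, recent versions of R (4.2 and later) stop with an error when the if condition has length greater than one, and older versions only warned and used the first element. The gamma-based fact2 is vectorized and also accepts non-integers. The helper fact.vec is a hypothetical name introduced here, not part of the slides.

```r
fact  <- function(x) if (x < 2) 1 else x * fact(x - 1)  # recursive, scalar only
fact2 <- function(x) gamma(x + 1)                       # vectorized via gamma

fact(5)                  # 120
fact2(5)                 # 120

# fact2 handles whole vectors and non-integer arguments in one call.
fact2(1:5)
fact2(seq(1, 5, 0.5))

# One way to apply the scalar recursive version to a vector:
# call it once per element (fact.vec is a hypothetical helper).
fact.vec <- function(x) sapply(x, fact)
fact.vec(1:5)

# The two definitions agree on positive integers but not in between:
# fact(2.5) is 2.5 * fact(1.5) = 2.5, while fact2(2.5) is gamma(3.5).
c(recursive = fact(2.5), gamma = fact2(2.5))
```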
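
Worked sketch for the Correlations slide, which notes that cor is polymorphic. The same function returns a single number for two vectors and a full matrix for a data frame, and the corresponding matrix entry matches the vector call. It uses only the built-in swiss data.

```r
# Two vectors in, one number out.
r.vec <- cor(swiss$Fertility, swiss$Education)

# A data frame in, a correlation matrix out.
r.mat <- cor(swiss)

# The matrix entry for the same pair of columns is the same value.
r.vec
r.mat["Fertility", "Education"]
dim(r.mat)
```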
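
Worked sketch for the Bowed Residuals slide. The boxplots there compare residuals of the linear and quadratic fits by speed bin; the block below gives a rough numeric version of the same check, printing the mean residual per bin and the residual deviance of each model. The names bins and bin.means are introduced here for illustration. Because the linear model is nested in the quadratic one, the quadratic deviance cannot be larger.

```r
# Linear and quadratic fits of stopping distance on speed (built-in cars data).
g1 <- glm(dist ~ poly(speed, 1), data = cars)
g2 <- glm(dist ~ poly(speed, 2), data = cars)

# Same binning as the boxplots on the slide: speed rounded to multiples of 5.
bins <- round(cars$speed / 5)

# Mean residual per bin; a bowed pattern shows up as means that sit on one
# side of zero at the ends and the other side in the middle.
bin.means <- rbind(linear    = tapply(residuals(g1), bins, mean),
                   quadratic = tapply(residuals(g2), bins, mean))
print(round(bin.means, 2))

# Residual deviance (sum of squared residuals for the gaussian family).
print(c(linear = deviance(g1), quadratic = deviance(g2)))

# F test for whether the quadratic term is worth adding.
print(anova(g1, g2, test = "F"))
```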