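R Language Lecture Notes
Li Dawei (李大玮)
With respect and thanks to Professor Wu Xizhi (吴喜之)

Advantages of R
Free of charge.
Portable: runs on Windows, Mac, and the various Unix systems.
Open source (neither a black box nor a miser).
Easy-to-learn syntax; programmable, so complex tasks can be carried out.
Extensible: thousands of packages available online, covering different fields, purposes, and methods, help you reach your goal. You can also contribute your own methods.
Powerful graphics.
An excellent built-in help system.
Community support: continually updated and corrected.

R: the language that the vast majority of statistics graduate students in the United States know
Berkeley offers R courses to undergraduates in both statistics and applied mathematics.
Most applied statisticians in the United States implement their methods in R first and try to post them on the R website.
In a little over a year, the number of packages on the R website tripled, from nearly 1000 to nearly 3000. Most of them include functions for computation, demonstration, and input/output, together with example data.
All code is open and can be modified.
Transparency is the best safeguard against "corruption".

Downloading R (http://www.r-project.org/)
Click CRAN to get a list of mirror sites.
Click a mirror site, for example Berkeley.
Choose "base".
One link on that page downloads the installer; another downloads packages.

Packages (each contains plenty of data and functions/programs that can be read, written, and modified)
base: The R Base Package
boot: Bootstrap R (S-Plus) Functions (Canty)
class: Functions for Classification
cluster: Cluster Analysis Extended Rousseeuw et al.
concord: Concordance and reliability
datasets: The R Datasets Package
exactRankTests: Exact Distributions for Rank and Permutation Tests
foreign: Read Data Stored by Minitab, S, SAS, SPSS, Stata, Systat, dBase, ...
graphics: The R Graphics Package
grDevices: The R Graphics Devices and Support for Colours and Fonts
grid: The Grid Graphics Package
KernSmooth: Functions for kernel smoothing for Wand & Jones (1995)
lattice: Lattice Graphics Interface
tools: Tools for Package Development
utils: The R Utils Package

Packages (continued)
MASS: Main Package of Venables and Ripley's MASS
methods: Formal Methods and Classes
mgcv: GAMs with GCV smoothness estimation and GAMMs by REML/PQL
multtest: Resampling-based multiple hypothesis testing
nlme: Linear and nonlinear mixed effects models
nnet: Feed-forward Neural Networks and Multinomial Log-Linear Models
nortest: Tests for Normality
outliers: Tests for outliers
pls: Partial Least Squares Regression (PLSR) and Principal Component Regression (PCR)
pls.pcr: PLS and PCR functions
rpart: Recursive Partitioning
SAGx: Statistical Analysis of the GeneChip
sma: Statistical Microarray Analysis
spatial: Functions for Kriging and Point Pattern Analysis
splines: Regression Spline Functions and Classes
stats: The R Stats Package
stats4: Statistical Functions using S4 Classes
survival: Survival analysis, including penalised likelihood.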
tcltk: Tcl/Tk Interface
tools: Tools for Package Development
utils: The R Utils Package

Packages (online)
Many more are available online.
All of these packages can be downloaded freely.
The packages in base contain commonly used functions and data,
while the other packages contain methods and data developed by statisticians working in many different areas.
We hope you will be the next author to join them.

Installing packages

Some useful functions
A function has the form f(x): name(argument).
getwd()
setwd(dir = "f:/2010stat")  # or setwd("f:/2010stat")
getwd()
x=rnorm(100)
ls()
?rnorm  # or help(rnorm)
apropos("norm")

Assignment and arithmetic
z = rnorm(1000000,4,0.1)
median(z)
Assignment: "=" can be replaced by "<-"
x<-z->y->w
Simple arithmetic operators: +, -, *, /, ^, %*%, %% (mod), %/% (integer division), and so on.
Common mathematical functions: abs, sign, log, log2, log10, logb, expm1, log1p(x), sqrt, exp, sin, cos, tan, acos, asin, atan, cosh, sinh, tanh

Assignment and arithmetic
round, floor, ceiling
gamma, lgamma, digamma and trigamma
sum, prod, cumsum, cumprod
max, min, cummax, cummin, pmax, pmin, range
mean, length, var, duplicated, unique
union, intersect, setdiff
>, >=, <, <=, &, |, !
Operators are evaluated in order of precedence, from highest to lowest (see ?Syntax).

Some basic examples
x=1:100
(x=1:100)
sample(x,20)
set.seed(0);sample(1:10,3)  # set the random seed!
z=sample(1:200000,10000)
z[1:10]  # vector indexing
y=c(1,3,7,3,4,2)
z[y]

Some basic examples
z=sample(x,20,rep=T)
z
(z1=unique(z));length(z1)
z=sample(x,100,rep=T)
xz=setdiff(x,z)
sort(union(xz,z))
sort(union(xz,z))==x
setequal(union(xz,z),x)
intersect(1:10,7:50)
sample(1:100,20,prob=1:100)

Some basic examples
pi * 10^2  # use ?"*" to read the help on the basic arithmetic operators
"*"(pi, "^"(10, 2))
pi * (1:10)^2
x <- pi * 10^2
x
print(x)
(x=pi * 10^2)
pi^(1:5)
print(x, digits = 12)
class(x)
typeof(x)

Some basic examples
class(cars)
typeof(cars)
names(cars)
summary(cars)
str(cars)
row.names(cars)
class(dist ~ speed)
plot(dist ~ speed,cars)

Some basic examples
head(cars)  # cars[1:6,]
tail(cars)
ncol(cars);nrow(cars)
dim(cars)
lm(dist ~ speed, data = cars)
cars$qspeed =cut(cars$speed, breaks =quantile(cars$speed), include.lowest = TRUE)
names(cars)
cars[3]
table(cars[3])
is.factor(cars$qspeed)
plot(dist ~ qspeed, data = cars)
(a=lm(dist ~ qspeed, data = cars))
summary(a)

Some basic examples
x <- round(runif(20,0,20), digits=2)
summary(x)
min(x);max(x)
median(x)  # median
mean(x)  # mean
var(x)  # variance
sd(x)  # standard deviation
sqrt(var(x))
rank(x)  # rank
order(x)
x[order(x)]
sort(x)
sort(x,decreasing=T)  # or sort(x,dec=T)
sum(x);length(x)
round(x)

Some basic examples
fivenum(x)  # quantiles
quantile(x)  # quantiles (different convention); several definitions exist
quantile(x, c(0,.33,.66,1))
mad(x)  # median absolute deviation from the median (rescaled by default); see ?mad
cummax(x)
cummin(x)
cumprod(x)
cor(x,sin(x/20))  # correlation

Some basic examples
# histogram
x <- rnorm(200)
hist(x, col = "light blue")
rug(x)
# stem-and-leaf plot
stem(x)
# scatterplot
N <- 500
x <- rnorm(N)
y <- x + rnorm(N)
plot(y ~ x)
a=lm(y~x)
abline(a,col="red")  # or abline(lm(y~x),col="red")
print("Hello World!")
paste("The minimum of x = ", min(x))
#cat("\\end{document}\n", file="RESULT.tex", append=TRUE)
demo(graphics)  # graphics demonstration

Some basic examples
# complex arithmetic
x=2+3i
(z <- complex(real=rnorm(10), imaginary =rnorm(10)))
complex(re=rnorm(3),im=rnorm(3))
Re(z)
Im(z)
Mod(z)
Arg(z)
choose(3,2);factorial(6)
# solving an equation
f =function(x) x^3-2*x-1
uniroot(f,c(0,2))  # iterative
# if the root is known to be an extremum
f =function(x) x^2+2*x+1
optimize(f,c(-2,2))

Distributions and random number generation
Normal distribution: pnorm(1.2,2,1); dnorm(1.2,2,1); qnorm(.7,2,1); rnorm(10,0,1)  # rnorm(10)
t distribution: pt(1.2,1); dt(1.2,2); qt(.7,1); rt(10,1)
Also available: the exponential, F, chi-squared, Beta, binomial, Cauchy, Gamma, geometric, hypergeometric, log-normal, logistic, negative binomial, Poisson, uniform, Weibull, and Wilcoxon distributions, among others.
Arguments can be vectors!
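The d/p/q/r naming pattern and the remark that arguments can be vectors can be illustrated with a short sketch. The variable names p and q below are chosen only for this illustration; every function used is in base R.

```r
# d = density, p = cumulative probability, q = quantile, r = random draws
p <- c(0.025, 0.5, 0.975)
q <- qnorm(p, mean = 2, sd = 1)      # one quantile per probability
pnorm(q, mean = 2, sd = 1)           # applying the CDF recovers p

# parameters can be vectors too: one density value per mean
dnorm(0, mean = c(-1, 0, 1))

set.seed(1)
rpois(3, lambda = c(1, 10, 100))     # one draw per rate
```

Because the functions are vectorized, tables of quantiles or probabilities need no explicit loop.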
Data input and output
x=scan()
1.5 2.6 3.7 2.1 8.9 12 -1.2 -4

# (an empty line ends the input) equivalent to x=c(1.5,2.6,3.7,2.1,8.9,12,-1.2,-4)
setwd("f:/2010stat")  # or setwd("f:\\2010stat")
(x=rnorm(20))
write(x,"f:/2010stat/test.txt")
y=scan("f:/2010stat/test.txt");y
y=iris;y[1:5,];str(y)
write.table(y,"f:/2010stat/test.txt")
w=read.table("f:/2010stat/test.txt",header=T)
str(w)
write.csv(y,"f:/2010stat/test.csv")
v=read.csv("f:/2010stat/test.csv")
str(v)
data=read.table("clipboard")
write.table(data,"clipboard")

Sequences and vectors
z=seq(-1,10,length=100)  # z=seq(-1,10, len=100)
z=seq(10,-1,-1)  # z=10:-1
x=rep(1:3,3)
x=rep(3:5,1:3)
x=rep(c(1,10),c(4,5))
w=c(1,3,x,z);w[3]
x=rep(0,10);z=1:3;x+z
x*z
rev(x)
z=c("no cat","has ","nine","tails")
z[1]=="no cat"
z=1:5
z[7]=8;z
z=NULL
z[c(1,3,5)]=1:3; z
rnorm(10)[c(2,5)]
z[-c(1,3)]  # drop elements 1 and 3
z=sample(1:100,10);z
which(z==max(z))  # gives the index

Vectors and matrices
x=sample(1:100,12);x
all(x>0);all(x!=0);any(x>0);(1:10)[x>0]
diff(x)
diff(x,lag=2)
x=matrix(1:20,4,5);x
x=matrix(1:20,4,5,byrow=T);x
t(x)
x=matrix(sample(1:100,20),4,5)
2*x
x+5
y=matrix(sample(1:100,20),5,4)
x+t(y)
(z=x%*%y)
z1=solve(z)  # solve(a,b) solves the equation ax=b
z1%*%z
round(z1%*%z,14)

Matrices
nrow(x); ncol(x);dim(x)  # numbers of rows and columns
x=matrix(rnorm(24),4,6)
x[c(2,1),]  # rows 2 and 1
x[,c(1,3)]  # columns 1 and 3
x[2,1]  # element [2,1]
x[x[,1]>0,1]  # elements of column 1 that are greater than 0
sum(x[,1]>0)  # number of elements of column 1 greater than 0
sum(x[,1]<=0)  # number of elements of column 1 not greater than 0
x[,-c(1,3)]  # x without columns 1 and 3
diag(x)
diag(1:5)
diag(5)
x[-2,-c(1,3)]  # x without row 2 and without columns 1 and 3
x[x[,1]>0&x[,3]<=1,1]  # elements of column 1 that are > 0 and whose column-3 entry is <= 1
x[x[,2]>0|x[,1]<.51,1]  # elements of column 1 that are < .51 or whose column-2 entry is > 0 ("or")
x[!x[,2]<.51,1]  # elements of column 1 whose column-2 entry is not < .51 ("not")
apply(x,1,mean);apply(x,2,sum)

Matrices / higher-dimensional arrays
# upper and lower triangular matrices
x=matrix(rnorm(24),4,6)
diag(x)
diag(1:5)
diag(5)
x[lower.tri(x)]=0  # or x[upper.tri(x)]=0; diag(x)=0
x=array(runif(24),c(4,3,2));x
is.matrix(x)  # dim(x) gives the dimensions (4,3,2)
is.matrix(x[1,,])
x=array(1:24,c(4,3,2))
x[c(1,3),,]
x=array(1:24,c(4,3,2))
apply(x,1,mean)
apply(x,1:2,sum)
apply(x,c(1,3),prod)

Matrices / higher-dimensional arrays / scale
# operations between a matrix and a vector
x=matrix(1:20,5,4)
sweep(x,1,1:5,"*")
x*1:5
sweep(x,2,1:4,"+")
(x=matrix(sample(1:100,24),6,4));(x1=scale(x))
(x2=scale(x,scale=F)); (x3=scale(x,center=F))
round(apply(x1,2,mean),14)
apply(x1,2,sd)
round(apply(x2,2,mean),14);apply(x2,2,sd)
round(apply(x3,2,mean),14);apply(x3,2,sd)

Data frames (data.frame)
x=matrix(1:6,2,3)
z=data.frame(x);z
z$X2
attributes(z)
names(z)=c("TOYOTA","GM","HUNDA")
row.names(z)=c("2001","2002")
z
attach(z)
GM
detach(z)
GM  # no longer found once z is detached
sapply(z,is.numeric)  # apply(z,2,is.numeric)

Missing values and related topics
airquality
complete.cases(airquality)  # which rows have no missing values
which(complete.cases(airquality)==F)
sum(complete.cases(airquality))
na.omit(airquality)
# append, cbind, rbind
x=1:10;x[12]=3
(x1=append(x,77,after=5))
cbind(1:3,4:6);rbind(1:3,4:6)
# remove duplicated rows of a matrix
(x=rbind(1:5,runif(5),runif(5),1:5,7:11))
x[!duplicated(x),]
unique(x)

List
# a list can hold any collection of objects (including other lists)
z=list(1:3,Tom=c(1:2, a=list("R",letters[1:5]),w="hi!"))
z[[1]];z[[2]]
z$T
z$T$a2
z$T[[3]]
z$T$w
attributes(airquality)  # attributes!
airquality$Ozone
attributes(matrix(1:6,2,3))

Categorical data
A survey asks people if they smoke or not. The data are Yes, No, No, Yes, Yes.
x=c("Yes","No","No","Yes","Yes")
table(x);x
factor(x)
Barplot: Suppose a group of 25 people are surveyed as to their beer-drinking preference. The categories were (1) Domestic can, (2) Domestic bottle, (3) Microbrew and (4) Import.
The raw data are
3 4 1 1 3 4 3 3 1 3 2 1 2 1 2 3 2 3 1 1 1 1 4 3 1
beer = scan()
3 4 1 1 3 4 3 3 1 3 2 1 2 1 2 3 2 3 1 1 1 1 4 3 1

barplot(beer)  # this isn't correct
barplot(table(beer))  # Yes, call with summarized data
barplot(table(beer)/length(beer))  # divide by n for proportion
table(beer)/length(beer)

Table / categorical data
library(MASS)
quine
attach(quine)
table(Age)
table(Sex, Age); tab=xtabs(~ Sex + Age, quine); unclass(tab)
tapply(Days, Age, mean)
tapply(Days, list(Sex, Age), mean)
# apply, sapply, tapply, lapply
smokes = c("Y","N","N","Y","N","Y","Y","Y","N","Y")
amount = c(1,2,2,3,3,1,2,1,3,2)
(tmp=table(smokes,amount))  # store the table
options(digits=3)  # only print 3 decimal places
prop.table(tmp,1)  # the rows sum to 1 now
prop.table(tmp,2)  # the columns sum to 1 now
# the two lines above are equivalent to the two lines below
sweep(tmp, 1, margin.table(tmp, 1), "/")
sweep(tmp, 2, margin.table(tmp, 2), "/")
prop.table(tmp)  # all the numbers sum to 1
options(digits=7)  # restore the number of digits

array/matrix, table, data.frame
## Start with a contingency table.
ftable(Titanic, row.vars = 1:3)
ftable(Titanic, row.vars = 1:2)
(w=data.frame(Titanic))  # turn the array into a data.frame
a=xtabs(Freq~Survived+Sex, w)
biplot(corresp(a, nf=2))  # one application (corresp is in MASS)
## Start with a data frame.
str(mtcars)
x <- ftable(mtcars[c("cyl", "vs", "am", "gear")])
x  # an array whose dimensions are ordered ("cyl", "vs", "am", "gear")
ftable(x, row.vars = c(2, 4))  # choose the row variables of the table from x (array)
## Start with expressions, use table()'s "dnn" to change labels
ftable(mtcars$cyl, mtcars$vs, mtcars$am, mtcars$gear, row.vars = c(2, 4), dnn = c("Cylinders", "V/S", "Transmission", "Gears"))
ftable(vs~carb,mtcars)  # vs gives the columns, carb the rows; or ftable(mtcars$vs~mtcars$carb)
ftable(carb~vs,mtcars)  # vs gives the rows, carb the columns
ftable(mtcars[,c(8,11)])  # equivalent to ftable(carb~vs,mtcars) above
ftable(breaks~wool+tension,warpbreaks)
# as.data.frame
(DF <- as.data.frame(UCBAdmissions))  # equivalent to data.frame(UCBAdmissions)
xtabs(Freq ~ Admit+ Gender + Dept, DF)  # turn the data frame back into the original contingency table
(a=xtabs( Freq~ Admit + Gender, data=DF))  # if there are no frequencies (weights), leave the left-hand side empty

Writing functions
# primes up to n; i==2 is handled separately because 2:(i-1) counts down when i is 2
ss=function(n=100){z=NULL;for (i in 2:n)if(i==2||all(i%%2:(i-1)!=0))z=c(z,i);return(z) }
fix(ss)  # opens the function in an editor
ss()
t1=Sys.time()
ss(10000)
Sys.time()-t1
system.time(ss(10000))
# A function need not call return; the last value evaluated is then the return value. To return several values, a list is best.

About plotting
# several plots together:
par(mfrow=c(2,4))  # or par(mfcol=c(2,4))
layout(matrix(c(1,1,1,2,3,4,2,3,4),nr=3,byrow=T))
hist(rnorm(100),col="Red",10)
hist(rnorm(100),col="Blue",8)
hist(rnorm(100),col="Green")
hist(rnorm(100),col="Brown")
# par(mar = c(bottom, left, top, right)) sets the margins
# default c(5, 4, 4, 2) + 0.1 (in lines of text; par(mai=...) uses inches)
spring= data.frame(compression=c(41,39,43,53,42,48,47,46), distance=c(120,114,132,157,122,144,137,141))
attach(spring)  # (Hooke's law: f=.5ks)
par(mfcol=c(2,2))
plot(distance ~ compression)
plot(distance ~ compression,type="l")
plot(compression, distance,type="o")
plot(compression, distance,type="b")

About plotting
par(mfrow=c(2,2))  # prepare a 2x2 layout for 4 plots
plot(compression, distance,main= "Hooke's Law")  # title only
plot(compression, distance,main= "Hooke's Law", xlab= "x",ylab= "y")  # title plus x and y labels
identify(compression,distance)  # label clicked points with their index
plot(compression, distance,main="Hooke's Law")  # title only
text(46,120, expression(f==frac(1,2)*k*s))  # write text at a given position
plot(compression, distance,main="Hooke's Law")  # title only
text(locator(2), c("I am here!","you are there!"))  # write text at the two clicked positions
par(mfrow=c(1,1))
plot(1:10,sin(1:10),type="l",lty=2,col=4,main=paste(strwrap("The title is too long, and I hate to make it shorter, !@#$%^&*",width=50),collapse="\n"))
legend(1.2,1.0,"Just a sine",lty=2,col=4)

About plotting
library(MASS);data(Animals);attach(Animals)
par(mfrow=c(2,2))
plot(body, brain)
plot(sqrt(body), sqrt(brain))
plot((body)^0.1, (brain)^0.1)
plot(log(body),log(brain))  # or plot(brain~body,log="xy")
par(mfrow=c(1,1))
par(cex=0.7,mex=0.7)  # character (cex) & margin (mex) expansion
plot(log(body),log(brain))
text(x=log(body), y=log(brain),labels=row.names(Animals), adj=1.5)  # adj=0 implies left adjusted text
plot(log(body),log(brain))
identify(log(body),log(brain),row.names(Animals))

About plotting (symbols, colors, sizes, shapes)
plot(1,1,xlim=c(1,7.5),ylim=c(0,5),type="n")  # Do not plot points
points(1:7,rep(4.5,7),cex=seq(1,4,l=7),col=1:7, pch=0:6)
text(1:7,rep(3.5,7),labels=paste(0:6,letters[1:7]),cex=seq(1,4,l=7), col=1:7)
points(1:7,rep(2,7), pch=(0:6)+7)  # Plot symbols 7 to 13
text((1:7)+0.25, rep(2,7), paste((0:6)+7))  # Label with symbol number
points(1:7,rep(1,7), pch=(0:6)+14)  # Plot symbols 14 to 20
text((1:7)+0.25, rep(1,7), paste((0:6)+14))  # Labels with symbol number
# palettes
par(mfrow=c(2,4))
palette(); barplot(rnorm(15,10,3),col=1:15)
palette(rainbow(15));barplot(rnorm(15,10,3),col=1:15)
palette(heat.colors(15));barplot(rnorm(15,10,3),col=1:15)
palette(terrain.colors(15));barplot(rnorm(15,10,3),col=1:15)
palette(topo.colors(15));barplot(rnorm(15,10,3),col=1:15)
palette(cm.colors(15));barplot(rnorm(15,10,3),col=1:15)
palette(gray.colors(15));barplot(rnorm(15,10,3),col=1:15)
palette(grey(0:14/14));barplot(rnorm(15,10,3),col=1:15)  # grey() needs levels in [0,1]
palette("default")
par(mfrow=c(1,1))

About plotting
# matplot
sines=outer(1:20,1:4,function(x, y) sin(x/20*pi*y))
matplot(sines, pch = 1:4, type = "o", col = rainbow(ncol(sines)))
# legend
x <- seq(-pi, pi, len = 65)
plot(x, sin(x), type = "l", ylim = c(-1.2, 1.8), col = 3, lty = 2)
points(x, cos(x), pch = 3, col = 4)
lines(x, tan(x), type = "b", lty = 1, pch = 4, col = 6)
title("legend(..., lty = 
c(2, -1, 1), pch = c(-1,3,4), merge = TRUE)", cex.main = 1.1)
legend(-1, 1.9, c("sin", "cos", "tan"), col = c(3,4,6), lty = c(2, -1, 1), pch = c(-1, 3, 4), merge = TRUE, bg='gray90')

About plotting
# barplot and table
par(mfrow=c(2,2))
tN=table(Ni=rpois(100, lambda=5));tN
r=barplot(tN, col='gray')
lines(r, tN, type='h', col='red', lwd=2)  # type = "h" plotting *is* `bar'plot
barplot(tN, space = 1.5, axisnames=FALSE, sub = "barplot(..., space=1.5, axisnames = FALSE)")  # with space=1.5 there are gaps between the bars
barplot(tN, space = 0, axisnames=FALSE, sub = "barplot(..., space=0, axisnames = FALSE)")
pie(tN)  # pie plot
par(mfrow=c(1,1))
# add a grid
plot (1:3)
grid(10, 5 , lwd = 2)
# graphics device functions: dev.set(), dev.off(), dev.list()

About plotting (pairs / three dimensions)
# pairs
# data(iris)
pairs(iris[1:4], main = "Anderson's Iris Data -- 3 species", pch = 21, bg = c("red", "green3", "blue")[unclass(iris$Species)])
# iris is a 150x5 data set; this is a scatterplot matrix of its 4 quantitative variables (the last column, iris$Species, is categorical)
# stars
# data(mtcars)
stars(mtcars[, 1:7], key.loc = c(14, 1.5), main = "Motor Trend Cars : full stars()",flip.labels=FALSE)
# mtcars is a 32x11 data set; only the first 7 quantitative variables are plotted here
# persp
x <- seq(-10, 10, length= 30)
y <- x
f <- function(x,y) { r <- sqrt(x^2+y^2); 10 * sin(r)/r }
z <- outer(x, y, f)
z[is.na(z)] <- 1
persp(x, y, z, theta = 30, phi = 30, expand = 0.5, col = "lightblue")
data(volcano)
par(mfrow=c(2,2))
z <- 2 * volcano  # Exaggerate the relief
x <- 10 * (1:nrow(z))  # 10 meter spacing (S to N)
y <- 10 * (1:ncol(z))  # 10 meter spacing (E to W)
## Don't draw the grid lines : border = NA
#par(bg = "slategray")
persp(x, y, z, theta = 135, phi = 30, col = "green3", scale = FALSE, ltheta = -120, shade = 0.75, border = NA, box = FALSE)
par(bg= "white")

About plotting (three dimensions)
# contour
rx <- range(x <- 10*1:nrow(volcano))
ry <- range(y <- 10*1:ncol(volcano))
ry <- ry + c(-1,1) * (diff(rx) - diff(ry))/2
tcol <- terrain.colors(12)
opar <- par(pty = "s", bg = "lightcyan");par(opar)
plot(x = 0, y = 0,type = "n", xlim = rx, ylim = ry, xlab = "", ylab = "")
u <- par("usr")
rect(u[1], u[3], u[2], u[4], col = tcol[8], border = "red")  # rect draws a rectangle
contour(x, 
y, volcano, col = tcol[2], lty = "solid", add = TRUE, vfont = c("sans serif", "plain"))
title("A Topographic Map of Maunga Whau", font = 4)
abline(h = 200*0:4, v = 200*0:4, col = "lightgray", lty = 2, lwd = 0.1);par(opar)
# image
x <- 10*(1:nrow(volcano))
y <- 10*(1:ncol(volcano))
image(x, y, volcano, col = terrain.colors(100), axes = FALSE)
contour(x, y, volcano, levels = seq(90, 200, by=5), add = TRUE, col = "peru")
axis(1, at = seq(100, 800, by = 100))
axis(2, at = seq(100, 600, by = 100))
box()
title(main = "Maunga Whau Volcano", font.main = 4)
par(mfrow=c(1,1))

Working with several graphics windows
x11()
plot(1:10)
x11()
plot(rnorm(10))
dev.set(dev.prev())
abline(0,1)  # through the 1:10 points
dev.set(dev.next())
abline(h=0, col="gray")  # for the residual plot
dev.set(dev.prev())
dev.off(); dev.off()  # close the two X devices
#dev.list()

Plotting miscellany
# simulate Brownian motion
n=100;x=cumsum(rnorm(n));y=cumsum(rnorm(n));plot(x,y,type="l")
x=0;y=0;plot(0,0,type="n",ylim=c(-15,15),xlim=c(-15,15))  # in slow motion
for(i in 1:200){x1=x+rnorm(1);y1=y+rnorm(1); segments(x,y,x1,y1);x=x1;y=y1
Sys.sleep(.05)}
# point size proportional to the value of the response
x=1:10;y=runif(10)
symbols(x,y,circles=y/2,inches=F,bg=x)
# a Q-Q plot for every column of a data frame
table=data.frame(x1=rnorm(100),x2=rnorm(100,1,1))
par(ask=TRUE)  # wait for confirmation before each new page
results=apply(table,2,qqnorm)
par(ask=FALSE)
# add a small plot inside a plot
x=rnorm(100)
hist(x)
op=par(fig=c(.02,.5,.5,.98),new=TRUE)
boxplot(x)
par(op)  # restore the previous settings
# mathematical symbols
x=1:10;plot(x,type="n")
text(3,2,expression(paste("Temperature(",degree,"C) in 2003")))
text(4,4,expression(bar(x)==sum(frac(x[i],n),i==1,n)))
text(6,6,expression(hat(beta)==(X^t*X)^{-1}*X^t*y))
text(8,8,expression(z[i]==sqrt(x[i]^2+y[i]^2)))

Changing letter case
x=c("I","am","A","BIG", "Cat")
tolower(x)
toupper(x)

R Statistical Models Lecture Notes
# basics
x=rnorm(20,10)
t.test(x,m=9,alt="greater")
t.test(x[1:10],m=9,alt="greater")$p.value
t.test(x,con=.90)$conf
x=rnorm(30,10);y=rnorm(30,10.1)
t.test(x,y,alt="less")
library(TeachingDemos)
ci.examp()
run.ci.examp()
vis.boxcox()
vis.boxcoxu()

Regression: correlation
# correlation
x=rnorm(20);y=rnorm(20); cor(x,y)
cor(x,y,method="kendall"); 
cor(x,y,method="spearman")
cor.test(x,y); cor.test(x,y,method="kendall"); cor.test(x,y,method="spearman")
cor.test(x,y,method="kendall")$p.value
# correlated or not? (three points only)
x=rnorm(3);y=rnorm(3);cor(x,y);cor.test(x,y)$p.value
library(TeachingDemos)
put.points.demo()

Basic principles
# basic principles
set.seed(100)
x1=rnorm(100);x2=rnorm(100);eps=rnorm(100)
y=5+2*x1-3*x2+eps
a=lm(y~x1+x2)
(lm(y~0+x1+x2))  # no intercept; equivalent to (lm(y~-1+x1+x2))
summary(a);anova(a)
names(a)
shapiro.test(a$res)
qqnorm(a$res);qqline(a$res)
# the mathematics
x=cbind(1,x1,x2)
dim(x)
b=solve(t(x)%*%x)%*%t(x)%*%y
b
a$coe

Example 1: cross.txt
[Figure: scatterplot of y against x]
w=read.table("cross.txt",header=T)
head(w)
plot(y~x,w);summary(w)
a=lm(y~x+z,w)
summary(a)
anova(a)
qqnorm(a$res);qqline(a$res)
shapiro.test(a$res)
a1=lm(y~x*z,w)
summary(a1);anova(a1)
qqnorm(a1$res);qqline(a1$res)
shapiro.test(a1$res)
anova(a,a1)
library(party)  # a simpler way
wt=mob(y~x|z,data=w)
coef(wt);plot(wt)
plot(y~x,w);abline(coef(wt)[1,],col=2);abline(coef(wt)[2,],col=4)

Regression equations

Poison Experiment
The data give the survival times (in 10 hour units) in a 3 x 4 factorial experiment, the factors being (a) three poisons and (b) four treatments. Each combination of the two factors is used for four animals, the allocation to animals being completely randomized.
Box, G. E. P., and Cox, D. R. (1964). An analysis of transformations (with Discussion). J. R. Statist. Soc. B, 26, 211-252.
http://www.statsci.org/data/general/poison.html

Example 2: poison.txt: 3 poisons and 4 treatments in an animal experiment, 48 observations
[Figure: scatterplot matrix of Poison, Treatment, and Time]
setwd("f:/2010stat")
w=read.table("poison.txt",head=T)
head(w);tail(w)
str(w);summary(w)
dim(w)
w$Poison=factor(w$Poison)
w$Treatment=factor(w$Treatment)
pairs(w)
# direct regression
a=lm(Time~Poison*Treatment,w)
anova(a)
a=lm(Time~.,w)
anova(a)
qqnorm(a$res);qqline(a$res)
shapiro.test(a$res)
# transformation
a=lm(1/Time~Poison+Treatment,w)
anova(a)
qqnorm(a$res);qqline(a$res)
shapiro.test(a$res)
summary(a)

Regression: transform the response and fit the main effects
Interpreting the results

Polynomial regression
# polynomial regression
y <- cars$dist;x <- cars$speed
o = order(x)
plot( y~x )
do.it <- function (model, col) {
r <- lm( model ); yp <- predict(r)
lines( yp[o] ~ x[o], col=col, lwd=3 )}
do.it(y~x, col="red")
do.it(y~x+I(x^2), col="blue")
do.it(y~-1+I(x^2), col="green")
legend(par("usr")[1], par("usr")[4], c("affine function", "degree-2 polynomial", "degree 2 monomial"), lwd=3, col=c("red", "blue", "green"))
n <- 100
x <- runif(n,min=-4,max=4)
x <- x + sign(x)*.2  # keep x away from 0
y <- 1/x + rnorm(n)  # hyperbola
plot(y~x)
lm( 1/y ~ x )
n <- 100
x <- rlnorm(n)^3.14  # a log-normal distribution is the distribution of a random variable whose logarithm is normally distributed
y <- x^-.1 * rlnorm(n)
plot(y~x)
lm(log(y) ~ log(x))

Orthogonal polynomials p and q
# on orthogonal polynomials
y <- cars$dist;x <- cars$speed
# non-orthogonal: terms are added one at a time (they affect each other, and significant coefficients become non-significant)
summary( lm(y~x) )
summary( lm(y~x+I(x^2)) )
summary( lm(y~x+I(x^2)+I(x^3)) )
summary( lm(y~x+I(x^2)+I(x^3)+I(x^4)) )
summary( lm(y~x+I(x^2)+I(x^3)+I(x^4)+I(x^5)) )
# orthogonal: coefficients that were significant at the start stay that way
# poly: Compute Orthogonal Polynomials
summary( lm(y~poly(x,1)) )
summary( lm(y~poly(x,2)) )
summary( lm(y~poly(x,3)) )
summary( lm(y~poly(x,4)) )
summary( lm(y~poly(x,5)) )
# plot the p-values of the coefficients for orthogonal polynomials
n <- 5
p <- matrix( nrow=n, ncol=n+1 )
for (i in 1:n) {
p[i,1:(i+1)] <- summary(lm( y ~ poly(x,i) ))$coefficients[,4] }
matplot(p, type='l', lty=1, lwd=3)
legend( par("usr")[1], par("usr")[4], as.character(1:n), lwd=3, lty=1, col=1:n )
title(main="Evolution of the p-values (orthonormal polynomials)")
# plot the p-values of the coefficients for non-orthogonal polynomials
p <- matrix( nrow=n, ncol=n+1 )
p[1,1:2] <- summary(lm(y ~ x) )$coefficients[,4]
p[2,1:3] <- summary(lm(y ~ x+I(x^2)) )$coefficients[,4]
p[3,1:4] <- summary(lm(y ~ x+I(x^2)+I(x^3)) )$coefficients[,4]
p[4,1:5] <- summary(lm(y ~ x+I(x^2)+I(x^3)+I(x^4)) )$coefficients[,4]
p[5,1:6] <- summary(lm(y ~ x+I(x^2)+I(x^3)+I(x^4)+I(x^5)) )$coefficients[,4]
matplot(p, type='l', lty=1, lwd=3)
legend( par("usr")[1], par("usr")[4], as.character(1:n), lwd=3, lty=1, col=1:n )
title(main="Evolution of the p-values (non orthonormal polynomials)")
# example
data(beavers)
y <- beaver1$temp
x <- 1:length(y)
plot(y~x)
for (i in 1:10) {
r <- lm( y ~ poly(x,i) )
lines( predict(r), type="l", col=i ) }
summary(r)
[Figure: beaver temperature y against index x, with the fitted polynomials]

Nonparametric regression
# nonparametric: splines
plot(quakes$long, quakes$lat)
lines( smooth.spline(quakes$long, quakes$lat), col='red', lwd=3)
library(rms)  # provides rcs (restricted cubic splines); rms supersedes the older Design package
# 4-knot spline
r3 <- lm( quakes$lat ~ rcs(quakes$long) )
plot( quakes$lat ~ quakes$long )
o <- order(quakes$long)
lines( quakes$long[o], predict(r3)[o], col='red', lwd=3 )
r <- lm( quakes$lat ~ rcs(quakes$long,10) )
lines( quakes$long[o], 
predict(r)[o], col='blue', lwd=6, lty=3 )
title(main="Regression with rcs")
legend( par("usr")[1], par("usr")[3], yjust=0, c("4 knots", "10 knots"), lwd=c(3,3), lty=c(1,3), col=c("red", "blue") )
# more splines
library(splines)
data(quakes)
x <- quakes[,2]
y <- quakes[,1]
o <- order(x)
x <- x[o]
y <- y[o]
r1 <- lm( y ~ bs(x,df=10) )
r2 <- lm( y ~ ns(x,df=6) )
plot(y~x)
lines(predict(r1)~x, col='red', lwd=3)
lines(predict(r2)~x, col='green', lwd=3)
# kernel smoothing
plot(cars$speed, cars$dist)
lines(ksmooth(cars$speed, cars$dist, "normal", bandwidth=2), col='red')
lines(ksmooth(cars$speed, cars$dist, "normal", bandwidth=5), col='green')
lines(ksmooth(cars$speed, cars$dist, "normal", bandwidth=10), col='blue')
# Weighted Local Least Squares: loess
# various kernel functions
curve(dnorm(x), xlim=c(-3,3), ylim=c(0,1.1))
x <- seq(-3,3,length=200)
D.Epanechikov <- function (t) {
ifelse(abs(t)<1, 3/4*(1-t^2), 0) }
lines(D.Epanechikov(x) ~ x, col='red')
D.tricube <- function (t) {  # tricube kernel (not normalized)
ifelse(abs(t)<1, (1-abs(t)^3)^3, 0) }
lines(D.tricube(x) ~ x, col='blue')
legend( par("usr")[1], par("usr")[4], yjust=1, c("Gaussian kernel", "Epanechnikov kernel", "tricube kernel"), lwd=1, lty=1, col=c(par('fg'),'red', 'blue'))
title(main="Different kernels")
# local polynomial regression
library(KernSmooth)
data(quakes)
x <- quakes$long;y <- quakes$lat
plot(y~x)
bw <- dpill(x,y)  # .2
lines( locpoly(x,y,degree=0, bandwidth=bw), col='red' )
lines( locpoly(x,y,degree=1, bandwidth=bw), col='green' )
lines( locpoly(x,y,degree=2, bandwidth=bw), col='blue' )
legend( par("usr")[1], par("usr")[3], yjust=0, c("degree = 0", "degree = 1", "degree = 2"), lwd=1, lty=1, col=c('red', 'green', 'blue'))
title(main="Local Polynomial Regression")
# larger bandwidth
plot(y~x);bw <- .5
lines( locpoly(x,y,degree=0, bandwidth=bw), col='red' )
lines( locpoly(x,y,degree=1, bandwidth=bw), col='green' )
lines( locpoly(x,y,degree=2, bandwidth=bw), col='blue' )
legend( par("usr")[1], par("usr")[3], yjust=0, c("degree = 0", "degree = 1", "degree = 2"), lwd=1, lty=1, 
col=c('red', 'green', 'blue'))
title(main="Local Polynomial Regression (wider window)")

Nonlinear regression
# nonlinear regression; nls() is in the stats package, so no extra library is needed
f <- function (x,p) {
u <- p[1]
v <- p[2]
u/(u-v) * (exp(-v*x) - exp(-u*x)) }
n <- 100
x <- runif(n,0,2)
y <- f(x, c(3.14,2.71)) + .1*rnorm(n)
r <- nls( y ~ f(x,c(a,b)), start=c(a=3, b=2.5) )
plot(y~x)
xx <- seq(0,2,length=200)
lines(xx, f(xx,coef(r)), col='red', lwd=3)
lines(xx, f(xx,c(3.14,2.71)), lty=2)

Quantile regression
Model: $y = f(x, \theta) + \varepsilon$, with a loss function $\rho(u)$.
Find the parameter $\theta$ (possibly a vector) that minimizes $\sum_i \rho(y_i - f(x_i, \theta))$.
For the linear regression model, $f(x, \theta) = x^\top \theta$.
In least squares regression, $\rho(u) = u^2$.
In $\tau$-quantile regression, $\rho_\tau(u) = u\,(\tau - I(u < 0))$; $\tau = 0.5$ gives least absolute deviation regression.
Definition of the $\tau$-quantile: $Q(\tau) = \inf\{y : F(y) \ge \tau\}$.
Shape of the loss function $\rho_\tau(u)$
# quantile regression
library(quantreg)
data(engel);head(engel);plot(engel)
plot(engel, log = "xy",main = "'engel' data (log - log scale)")
plot(log10(foodexp) ~ log10(income), data = engel,main = "'engel' data (log10 - transformed)")
taus <- c(.15, .25, .50, .75, .95, .99)
rqs <- as.list(taus)
for(i in seq(along = taus)) {
rqs[[i]] <- rq(log10(foodexp) ~ log10(income), tau = taus[i], data = engel)
lines(log10(engel$income), fitted(rqs[[i]]), col = i+1)}
legend("bottomright", paste("tau = ", taus), inset = .04, col = 2:(length(taus)+1), lty=1)
abline(lm(log10(foodexp)~log10(income),engel),lwd=5)  # least squares fit: thick black line
plot(summary(rq(log10(foodexp)~log10(income),tau = 1:49/50,data=engel)))  # plot the coefficients
# untransformed data
plot(foodexp~income, data = engel, main = "'engel' data")
for(i in seq(along = taus)) {
rqs[[i]] <- rq(foodexp ~ income, tau = taus[i], data = engel)
lines(engel$income, fitted(rqs[[i]]), col = i+1)}
legend("bottomright", paste("tau = ", taus), inset = .04, col = 2:(length(taus)+1), lty=1)
abline(lm(foodexp~income,engel),lwd=5)  # least squares fit: thick black line
plot(summary(rq(foodexp~income,tau = 1:49/50,data=engel)))  # plot the coefficients
N <- 2000
x <- runif(N)
y <- rnorm(N)
y <- -1 + 2 * x + ifelse(y>0, y+5*x^2, y-x^2)
plot(x,y)
abline(lm(y~x), col="red")
library(quantreg)
plot(y~x)
for (a in seq(.1,.9,by=.1)) {
abline(rq(y~x, tau=a), col="blue", lwd=3) }
# locally polynomial quantile regression
plot(y~x)
for (a in seq(.1,.9,by=.1)) {
r <- lprq(x,y, h=bw.nrd0(x),  # See ?density
tau=a)
lines(r$xx, r$fv, 
col="blue", lwd=3) }

Logistic & Probit regression
Generalized linear model (GLM): $g(\mu_i) = x_i^\top \beta$, where $g$ is called the link function.
Log-likelihood function of the GLM; score function.
Logistic regression: $\log\frac{p}{1-p} = x^\top \beta$. Probit regression: $\Phi^{-1}(p) = x^\top \beta$.

Example: ModeChoice
In fact, mode takes only two values, 0 and 1; the other variables are quantitative.

Consider two logistic models
Model b: mode ~ ttme + invt + gc
Model c: mode ~ ttme + invc*invt + gc

ANOVA
library(Ecdat);data(ModeChoice)  # binary response
w=ModeChoice
# two logistic models
b=glm(factor(mode)~ttme+invt+gc,data=w,family="binomial")
c=glm(factor(mode)~ttme+invc*invt+gc,data=w,family="binomial")
anova(c,test="Chi");summary(c);anova(b,c,test="Chi")

Fit of model b
Fit of model c

Consider two probit models
Model bb and model cb: the same linear predictors as models b and c, with the probit link.

ANOVA
# two probit models
bb=glm(factor(mode)~ttme+invt+gc,data=w,family=binomial(link=probit))
cb=glm(factor(mode)~ttme+invc*invt+gc,data=w,family=binomial(link=probit))
anova(cb,test="Chi");summary(cb);anova(bb,cb,test="Chi")
anova(b,bb,test="Chi");anova(c,cb,test="Chi")

Fit of model b

Dispersion
library(dglm)
library(statmod)
clotting <- data.frame(
u = c(5,10,15,20,30,40,60,80,100),
lot1 = c(118,58,42,35,27,25,21,19,18),
lot2 = c(69,35,26,21,18,16,13,12,12))
a1=glm(lot1 ~ log(u), data=clotting, family=Gamma)
summary(a1)
# The same example as in glm: the dispersion is modelled as constant
# However, dglm used ml not reml, so results slightly different:
out <- dglm(lot1 ~ log(u), ~1, data=clotting, family=Gamma)
summary(out)
# Try a double glm
out2 <- dglm(lot1 ~ log(u), ~u, data=clotting, family=Gamma)
summary(out2)
anova(out2)
# Summarize the mean model as for a glm
summary.glm(out2)
# Summarize the dispersion model as for a glm
summary(out2$dispersion.fit)
# Examine goodness of fit of dispersion model by plotting residuals
plot(fitted(out2$dispersion.fit),residuals(out2$dispersion.fit))

Poisson log-linear model: dispersion and offset
Consider n independent responses $y_i \sim \mathrm{Poisson}(\mu_i)$. The Poisson distribution has $E(y_i) = \mathrm{Var}(y_i) = \mu_i$, but it may happen that the actual variance exceeds the nominal variance under the assumed probability model. Suppose now that $\theta_i = \lambda_i n_i$ is a random variable with $E(\theta_i) = \mu_i$ and $\mathrm{Var}(\theta_i) = \phi \mu_i^2$, and that $y_i \mid \theta_i \sim \mathrm{Poisson}(\theta_i)$. Thus, it can be shown that $E(y_i) = \mu_i$ and $\mathrm{Var}(y_i) = \mu_i (1 + \phi \mu_i)$. Hence, for $\phi > 0$ we have overdispersion.
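A small simulation can make the overdispersion claim concrete. The sketch below assumes, purely for illustration, that the Poisson mean is drawn from a Gamma distribution with mean mu and variance phi*mu^2; the names mu, phi, theta, y, and y0 are hypothetical and the values 5 and 0.5 are arbitrary. Under that assumption the counts should show a variance close to mu*(1 + phi*mu), well above the mean, while plain Poisson counts have variance close to the mean.

```r
set.seed(1)
n   <- 100000
mu  <- 5       # assumed mean (illustrative value)
phi <- 0.5     # assumed dispersion parameter (illustrative value)

# Gamma with shape 1/phi and scale mu*phi has mean mu and variance phi*mu^2
theta <- rgamma(n, shape = 1/phi, scale = mu*phi)
y  <- rpois(n, theta)   # mixed (overdispersed) counts
y0 <- rpois(n, mu)      # ordinary Poisson counts for comparison

c(mean = mean(y),  var = var(y),  theoretical.var = mu*(1 + phi*mu))
c(mean = mean(y0), var = var(y0), theoretical.var = mu)
```

Comparing the two printed rows shows the same mean but a much larger variance for the mixed counts.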
It is interesting to note that the same mean and variance arise also if we assume a negative binomial distribution for the response variable.

Poisson log-linear model: dispersion
library(dispmod)
data(salmonellaTA98)
attach(salmonellaTA98)
log.x10 <- log(x+10)
mod <- glm(y ~ log.x10 + x, family=poisson(log))
summary(mod)
mod.disp <- glm.poisson.disp(mod)
summary(mod.disp)
mod.disp$dispersion
# compute predictions on a grid of x-values...
x0 <- seq(min(x), max(x), length=50)
eta0 <- predict(mod, newdata=data.frame(log.x10=log(x0+10), x=x0), se=TRUE)
eta0.disp <- predict(mod.disp, newdata=data.frame(log.x10=log(x0+10), x=x0), se=TRUE)
# ... and plot the mean functions with variability bands
plot(x, y)
lines(x0, exp(eta0$fit))
lines(x0, exp(eta0$fit+2*eta0$se), lty=2)
lines(x0, exp(eta0$fit-2*eta0$se), lty=2)
lines(x0, exp(eta0.disp$fit), col=2)
lines(x0, exp(eta0.disp$fit+2*eta0.disp$se), lty=2, col=2)
lines(x0, exp(eta0.disp$fit-2*eta0.disp$se), lty=2, col=2)

Poisson log-linear model: dispersion
##-- Holford's data
data(holford)
attach(holford)
mod <- glm(incid ~ offset(log(pop)) + Age + Cohort, family=poisson(log))
summary(mod)
mod.disp <- glm.poisson.disp(mod)
summary(mod.disp)
mod.disp$dispersion
# another approach (using Tweedie distributions; look up the literature yourself); tweedie() is in statmod
tt= glm(incid ~ offset(log(pop)) + Age + Cohort, family=tweedie(var.power=4,link.power=0))

Ridge regression
$$\hat{\beta}^{\mathrm{ridge}} = \arg\min_{\beta} \left\{ \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \right\}$$
or, equivalently,
$$\hat{\beta}^{\mathrm{ridge}} = \arg\min_{\beta} \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} \beta_j^2 \le s$$

library(perturb);data(consumption)
A data frame with 28 observations on the following 5 variables.
•year: 1947 to 1974
•c: total consumption, 1958 dollars
•r: the interest rate (Moody's Aaa)
•dpi: disposable income, 1958 dollars
•d_dpi: annual change in disposable income
library(perturb);data(consumption)
head(consumption)
library(MASS)
attach(consumption)  # so that c, dpi, r and d_dpi can be used directly
ct1<-c(NA,c[-length(c)])  # consumption lagged by one year
a<-lm.ridge(c~ct1+dpi+r+d_dpi, lambda=seq(0, 0.1,length=100), model =TRUE)
names(a)  # "coef" "scales" "Inter" "lambda" "ym" "xm" "GCV" "kHKB" "kLW"
a$lambda[which.min(a$GCV)]  # the lambda at which GCV is smallest (0.014)
a$coef[,which.min(a$GCV)]  # the coefficients at that lambda
par(mfrow=c(1,2))
plot(a)  # ridge trace; the red line marks the GCV-minimizing lambda (about 0.01)
abline(v=a$lambda[which.min(a$GCV)],col="red")
plot(a$lambda,a$GCV,type="l")  # GCV as a function of lambda
abline(v=a$lambda[which.min(a$GCV)],col="green")
[Figure: ridge trace t(x$coef) against x$lambda (left) and a$GCV against a$lambda (right)]

For orthogonal (independent) data
[Figure: t(x$coef) against x$lambda and a$GCV against a$lambda, for lambda from 0 to 100]

# another example
longley  # not the same as the S-PLUS dataset
names(longley)[1] <- "y"
a0=lm.ridge(y ~ ., longley)  # lambda = 0
plot(lm.ridge(y ~ ., longley,lambda = seq(0,0.1,0.001)))
select(lm.ridge(y ~ ., longley,lambda = seq(0,0.1,0.0001)))
a1=lm.ridge(y ~ ., longley,lambda=0.0057)
a1$coe

Partial least squares regression
PLSR (Partial Least Squares and Principal Component Regression)
oliveoil {pls}
Sensory and physico-chemical data of olive oils
Description
A data set with scores on 6 attributes from a sensory panel and measurements of 5 physico-chemical quality parameters on 16 olive oil samples. The first five oils are Greek, the next five are Italian and the last six are Spanish.
data(oliveoil)
Format: A data frame with 16 observations on the following 2 variables.
Sensory: a matrix with 6 columns. Scores for attributes 'yellow', 'green', 'brown', 'glossy', 'transp', and 'syrup'.
Chemical: a matrix with 5 columns. Measurements of acidity, peroxide, K232, K270, and DK.
Source
Massart, D. L., Vandeginste, B. G. M., Buydens, L. M.
C., de Jong, S., Lewi, P. J., Smeyers-Verbeke, J. (1998) Handbook of Chemometrics and Qualimetrics: Part B. Elsevier. Tables 35.1 and 35.4.

sensory ~ chemical
# partial least squares regression (principal component regression first)
library(pls); data(oliveoil);head(oliveoil);dim(oliveoil)
oliveoil$sensory  # a 16x6 matrix
oliveoil$chemical  # a 16x5 matrix
# PCR
sens.pcr <- pcr(sensory ~ chemical, ncomp = 4, scale = TRUE, data = oliveoil)
summary(sens.pcr);names(sens.pcr)
#  [1] "coefficients" "scores" "loadings" "Yloadings" "projection" "Xmeans"
#  [7] "Ymeans" "fitted.values" "residuals" "Xvar" "Xtotvar" "ncomp"
# [13] "method" "scale" "call" "terms" "model"
sens.pcr$loadings
sens.pcr$coefficients
sens.pcr$scores
sens.pcr$Yloadings
sens.pcr$projection
sens.pcr$residuals
# PLSR
sens.pls <- plsr(sensory ~ chemical, ncomp = 4, scale = TRUE, data = oliveoil)
summary(sens.pls);names(sens.pls)
#  [1] "coefficients" "scores" "loadings" "loading.weights" "Yscores"
#  [6] "Yloadings" "projection" "Xmeans" "Ymeans" "fitted.values"
# [11] "residuals" "Xvar" "Xtotvar" "ncomp" "method"
# [16] "scale" "call" "terms" "model"
sens.pls$loadings
sens.pls$coef
sens.pls$scores
sens.pls$loading.weights
sens.pls$Yscores
sens.pls$Yloadings
sens.pls$Xvar
sens.pls$Xtotvar

library(pls);data(yarn)
A training set consisting of 21 NIR spectra of PET yarns, measured at 268 wavelengths, and 21 corresponding densities; the data frame also holds 7 further samples flagged by train == FALSE. (Erik Swierenga).
library(pls);data(yarn)
names(yarn)  # [1] "NIR" "density" "train"
dim(yarn$NIR)  # 28 268; the predictors
yarn$density  # the response
summary(yarn$train)
#    Mode   FALSE    TRUE    NA's
# logical       7      21       0
yarn.pls <- plsr(density ~ NIR, ncomp = 4, scale = TRUE, data = yarn)
summary(yarn.pls);names(yarn.pls)
#  [1] "coefficients" "scores" "loadings" "loading.weights" "Yscores"
#  [6] "Yloadings" "projection" "Xmeans" "Ymeans" "fitted.values"
# [11] "residuals" "Xvar" "Xtotvar" "ncomp" "method"
# [16] "scale" "call" "terms" "model"
yarn.pls$loadings
yarn.pls$coef
yarn.pls$scores
yarn.pls$loading.weights
yarn.pls$Yscores
yarn.pls$Yloadings
yarn.pls$Xvar
yarn.pls$Xtotvar
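To make the idea behind principal component regression more concrete, here is a minimal base-R sketch. It is not the pls implementation and it does not use the yarn data: it assumes the built-in mtcars data, standardizes four predictors, takes principal component scores from prcomp(), and regresses the response on the first k scores. The names X, pc, and pcr_fit are hypothetical helpers introduced only for this illustration.

```r
X <- as.matrix(mtcars[, c("disp", "hp", "wt", "qsec")])   # predictors (illustrative choice)
y <- mtcars$mpg                                           # response

# principal components of the centered and scaled predictors
pc <- prcomp(X, center = TRUE, scale. = TRUE)

# regress y on the first k principal component scores
pcr_fit <- function(k) {
  scores <- pc$x[, 1:k, drop = FALSE]
  lm(y ~ scores)
}

fit2 <- pcr_fit(2)    # two components
fit4 <- pcr_fit(4)    # all four components
ols  <- lm(y ~ X)     # ordinary least squares on the original predictors

summary(fit2)$r.squared
summary(fit4)$r.squared
# with all components kept, the fitted values coincide with ordinary least squares
all.equal(unname(fitted(fit4)), unname(fitted(ols)))
```

Keeping fewer components trades some fit for stability when predictors are strongly correlated, which is the situation the spectra in yarn present; PLSR differs in that it chooses its components using the response as well.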