Orthogonal (symmetric, total) regression

Which variable is y and which is x?
Should height predict weight, or weight predict height?
Should the father's height predict the child's height, or should the child's height be used to infer the father's height?

Data sets

    # heights {alr3}
    # Karl Pearson organized the collection of data on over 1100 families in England
    # in the period 1893 to 1898. This particular data set gives the heights
    # in inches of mothers and their daughters, with up to two daughters per mother.
    # All daughters are at least age 18, and all mothers are younger than 65.
    # Data were given in the source as a frequency table to the nearest inch.
    # Rounding error has been added to remove discreteness from the graph.

    # Davis {car}
    # The Davis data frame has 200 rows and 5 columns. The subjects were men and
    # women engaged in regular exercise. There are some missing data.

    # father.son {UsingR}
    # 1078 measurements of a father's height and his son's height.

    library(UsingR)  # provides the father.son data frame

    # Regress each height on the other; the two fits give different lines
    summary(lm(fheight~sheight,father.son))
    summary(lm(sheight~fheight,father.son))
    o1=lm(fheight~sheight,father.son)
    o2=lm(sheight~fheight,father.son)
    plot(fheight~sheight,father.son)
    # Fitted line of fheight~sheight over a grid of son heights
    s.prid=expand.grid(sheight=seq(50,90,1))
    s.prid$fheight=predict(o1,s.prid)
    # Fitted line of sheight~fheight, drawn in the same (sheight, fheight) axes
    s.prid2=expand.grid(fheight=s.prid$fheight)
    s.prid2$sheight=predict(o2,s.prid2)
    lines(fheight~sheight,s.prid,col="red")
    lines(fheight~sheight,s.prid2,col="blue")
    legend("topleft",c("fheight~sheight","sheight~fheight"),lty=1,col=c("red","blue"))

Symmetric regression

If it is hard to decide which of x and y is the response variable, how should the functional relationship between the two be established?
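The asymmetry behind this question can be checked numerically. The sketch below uses simulated heights (the means, standard deviations, and slope are made-up assumptions, not estimates from father.son) and illustrates a standard identity: the product of the two ordinary least squares slopes equals r^2, so the two fitted lines coincide only when |r| = 1.

```r
set.seed(1)
n <- 500

# Hypothetical father/son heights in inches (simulated, NOT the father.son data)
fheight <- rnorm(n, mean = 68, sd = 2.7)
sheight <- 34 + 0.5 * fheight + rnorm(n, sd = 2.4)

b_sf <- unname(coef(lm(sheight ~ fheight))[2])  # slope of sheight ~ fheight
b_fs <- unname(coef(lm(fheight ~ sheight))[2])  # slope of fheight ~ sheight
r    <- cor(fheight, sheight)

# In the (fheight, sheight) plane the two fitted lines have slopes
# b_sf and 1 / b_fs; they differ unless |r| = 1.
print(c(b_sf = b_sf, inverse_line = 1 / b_fs,
        product = b_sf * b_fs, r_squared = r^2))
```

Because the inverse line is always at least as steep as the direct one, picking the response variable amounts to picking one of two different answers.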
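If x and y play equivalent (symmetric) roles, neither y~x nor x~y is appropriate. A symmetric regression method should be used instead. These include major axis regression (or orthogonal regression), reduced major axis regression (or impartial regression), and bisector regression (or double regression).

Major axis regression (also called orthogonal regression): the method given by Pearson; it is a symmetric regression method.
Reduced major axis regression (impartial regression): the SD line.

Other symmetric regressions

Bisector regression (double regression): bisects the angle between the y~x and x~y regression lines.

Bivariate normal distribution: regression and inverse regression

Program

    ol<-function(x,y)
    {
      s_xy=sum((x-mean(x))*(y-mean(y)))
      s_xx=sum((x-mean(x))^2)
      s_yy=sum((y-mean(y))^2)
      b1=s_xy/s_xx   # slope of the y~x line
      b2=s_yy/s_xy   # slope of the x~y line, expressed as dy/dx
      r=cor(x,y)
      # major axis (orthogonal) slope; note the leading +(b2-1/b1)
      b_ol=((b2-1/b1)+sign(r)*sqrt(4+(b2-1/b1)^2))/2
      # reduced major axis slope (SD line)
      b_sd=sign(r)*sqrt(b1*b2)
      # bisector of the y~x and x~y lines
      b_bi=(b1*b2-1+sqrt((1+b1^2)*(1+b2^2)))/(b1+b2)
      B=list(b_xy=b1,b_yx=b2,b_ol=b_ol,b_sd=b_sd,b_bi=b_bi)
      return(B)
    }

Data

    IQ=c(90,92,93,95,97,98,100)
    P=c(39,42,36,45,39,45,42)

Analysis

    B=as.numeric(ol(IQ,P))
    # every line passes through the point (mean(IQ), mean(P))
    A=mean(P)-B*mean(IQ)
    plot(IQ,P)
    lines(IQ,A[1]+B[1]*IQ)
    lines(IQ,A[2]+B[2]*IQ,col="purple")
    lines(IQ,A[3]+B[3]*IQ,col="red")
    lines(IQ,A[4]+B[4]*IQ,col="blue")
    lines(IQ,A[5]+B[5]*IQ,col="green")
    legend("topleft",c("y~x","x~y","ol","sd","bi"),lty=1,col=c("black","purple","red","blue","green"))

Some special issues

1. Outliers / standardization
2. Centering

1. Outliers / standardization

Example 1.1. IQ scores of young people are normally distributed; those above the 99% quantile can be defined as intellectually gifted (outliers).

Example 1.2. Body mass index. An inappropriate definition of obesity: anyone whose weight exceeds the 95% quantile of the population is obese. People of different height, sex, and age are not comparable; that is, μ is a function of several factors.

A simple but tedious approach is stratification: for a given group, find the distribution of W, and define as obese those who exceed C (for example, the 95% quantile of the standard normal distribution).

Another approach is to remove the effect of log(H) on log(W) (while controlling for sex G and age), that is, to assume a regression model for log(W) with log(H), G, and age as predictors.

2. Centering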
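As a minimal sketch of what centering does, assuming it means subtracting the sample mean from each variable, the IQ and P data from the example above can be refitted after centering. The variable names with the `_c` suffix are introduced here for illustration only. The slope is unchanged and the intercept becomes zero, which matches the fact that each fitted line above passes through (mean(IQ), mean(P)).

```r
# Data from the IQ / P example
IQ <- c(90, 92, 93, 95, 97, 98, 100)
P  <- c(39, 42, 36, 45, 39, 45, 42)

# Center both variables at their sample means (illustrative names)
IQ_c <- IQ - mean(IQ)
P_c  <- P - mean(P)

fit_raw <- lm(P ~ IQ)
fit_cen <- lm(P_c ~ IQ_c)

slope_raw     <- unname(coef(fit_raw)[2])
slope_cen     <- unname(coef(fit_cen)[2])
intercept_cen <- unname(coef(fit_cen)[1])

# The reduced major axis (SD line) slope is also unaffected by centering
sd_raw <- sign(cor(IQ, P)) * sd(P) / sd(IQ)
sd_cen <- sign(cor(IQ_c, P_c)) * sd(P_c) / sd(IQ_c)

print(c(slope_raw = slope_raw, slope_cen = slope_cen,
        intercept_cen = intercept_cen))
```

After centering, the competing lines differ only in slope, so they can be compared without tracking separate intercepts.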