Symmetric regression models

Orthogonal (symmetric, total) regression
Which is y, and which is x?

Does height predict weight, or does weight predict height?

Does the father's height predict the child's height, or is the child's
height used to infer the father's height?
#heights {alr3}
#Karl Pearson organized the collection of data on over 1100 families in England
#in the period 1893 to 1898. This particular data set gives the heights
# in inches of mothers and their daughters, with up to two daughters per mother.
# All daughters are at least age 18, and all mothers are younger than 65.
#Data were given in the source as a frequency table to the nearest inch.
#Rounding error has been added to remove discreteness from graph


#Davis {car} The Davis data frame has 200 rows and 5 columns. The subjects were
#men and women engaged in regular exercise. There are some missing data.
# father.son {UsingR}
#1078 measurements of a father's height and his son's height

father.son {UsingR}
library(UsingR)   # provides the father.son data set
# Regress each height on the other; the two fitted lines do not coincide
summary(lm(fheight~sheight,father.son))
summary(lm(sheight~fheight,father.son))
o1=lm(fheight~sheight,father.son)
o2=lm(sheight~fheight,father.son)
plot(fheight~sheight,father.son)
# Fitted line of fheight~sheight over a grid of son heights
s.prid=expand.grid(sheight=seq(50,90,1))
s.prid$fheight=predict(o1,s.prid)
# Fitted line of sheight~fheight, evaluated at those fitted father heights
s.prid2=expand.grid(fheight=s.prid$fheight)
s.prid2$sheight=predict(o2,s.prid2)
lines(fheight~sheight,s.prid,col="red")
lines(fheight~sheight,s.prid2,col="blue")
legend("topleft",c("fheight~sheight","sheight~fheight"),lty=1,col=c("red","blue"))
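The two lines differ because each regression minimizes errors in a different direction. The sketch below uses simulated data, not the father.son data; the variable names, seed, and parameter values are illustrative assumptions. It checks that the product of the two lm slopes equals the squared correlation, so the lines coincide only when |r| = 1.

```r
set.seed(1)                               # arbitrary seed, for reproducibility only
x <- rnorm(200, mean = 68, sd = 3)        # simulated "father" heights (assumed values)
y <- 34 + 0.5 * x + rnorm(200, sd = 2.5)  # simulated "son" heights (assumed values)

slope_y_on_x <- unname(coef(lm(y ~ x))[2])  # slope of y regressed on x
slope_x_on_y <- unname(coef(lm(x ~ y))[2])  # slope of x regressed on y

# The product of the two slopes equals r^2
c(product = slope_y_on_x * slope_x_on_y, r2 = cor(x, y)^2)

# Drawn in the (x, y) plane, the x ~ y line has slope 1 / slope_x_on_y,
# which is at least as steep as the y ~ x line in absolute value
c(y_on_x = slope_y_on_x, x_on_y_in_xy_plane = 1 / slope_x_on_y)
```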

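Symmetric regression
If it is hard to decide which of x and y is the response variable, how
should the functional relationship between them be established?
 If x and y play equivalent (symmetric) roles, neither y~x nor x~y is
appropriate. A symmetric regression method should be used instead. These
include major-axis regression (or orthogonal regression), reduced
major-axis regression (or impartial regression), and bisector regression
(or double regression).
 Pearson proposed major-axis regression (also called orthogonal
regression), which is one such symmetric regression method.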
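One way to see what major-axis regression does: the fitted line is the direction of largest variance of the point cloud, that is, the leading eigenvector of the sample covariance matrix. The sketch below is an illustration and not necessarily Pearson's original computation. It reuses the IQ and P values from the Data section of these notes; `b_major` and `a_major` are names introduced here.

```r
x <- c(90, 92, 93, 95, 97, 98, 100)   # IQ values from the Data section
y <- c(39, 42, 36, 45, 39, 45, 42)    # P values from the Data section

S <- cov(cbind(x, y))                 # 2 x 2 sample covariance matrix
v <- eigen(S)$vectors[, 1]            # eigenvector of the largest eigenvalue
b_major <- v[2] / v[1]                # slope of the major axis
a_major <- mean(y) - b_major * mean(x)  # the line passes through the means

# Closed form for comparison: d = (s_yy - s_xx) / s_xy
d <- (S[2, 2] - S[1, 1]) / S[1, 2]
b_closed <- (d + sign(S[1, 2]) * sqrt(4 + d^2)) / 2

c(intercept = a_major, slope = b_major, closed_form = b_closed)
```

Unlike y~x and x~y, this line is unchanged when the roles of x and y are swapped. It does depend on the measurement units of the two variables.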

Reduced major axis regression
(impartial regression): the SD line
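For reference, the SD line passes through the point of means with slope equal to the ratio of the standard deviations, signed by the correlation. In the notation of the program in these notes, where b_1 is the y~x slope and b_2 is the x~y slope drawn in the (x, y) plane:

```latex
b_{sd} = \operatorname{sign}(r)\,\sqrt{\frac{s_{yy}}{s_{xx}}}
       = \operatorname{sign}(r)\,\sqrt{b_1 b_2},
\qquad
b_1 = \frac{s_{xy}}{s_{xx}},\quad
b_2 = \frac{s_{yy}}{s_{xy}}
```

So the reduced major axis slope is the geometric mean of the two ordinary regression slopes.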
Other symmetric regressions

Bisector regression (double regression): bisects the angle between
the y~x and x~y regression lines
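The bisector slope can be checked numerically: take the two regression lines in the (x, y) plane, average their angles, and compare the result with the closed-form expression. This sketch uses the IQ and P values from the Data section of these notes; `b_angle` and `b_closed` are names introduced here.

```r
x <- c(90, 92, 93, 95, 97, 98, 100)   # IQ values from the Data section
y <- c(39, 42, 36, 45, 39, 45, 42)    # P values from the Data section

s_xy <- sum((x - mean(x)) * (y - mean(y)))
s_xx <- sum((x - mean(x))^2)
s_yy <- sum((y - mean(y))^2)

b1 <- s_xy / s_xx   # slope of y ~ x
b2 <- s_yy / s_xy   # slope of x ~ y, drawn in the (x, y) plane

# Slope of the line whose angle is the average of the two angles
b_angle <- tan((atan(b1) + atan(b2)) / 2)

# Closed-form bisector slope
b_closed <- (b1 * b2 - 1 + sqrt((1 + b1^2) * (1 + b2^2))) / (b1 + b2)

c(by_angle = b_angle, closed_form = b_closed)
```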
Bivariate normal distribution: regression and inverse regression
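As a reminder of the standard result, if (X, Y) is bivariate normal with means \mu_x, \mu_y, standard deviations \sigma_x, \sigma_y, and correlation \rho, the two conditional means are both linear:

```latex
E[Y \mid X = x] = \mu_y + \rho\,\frac{\sigma_y}{\sigma_x}\,(x - \mu_x),
\qquad
E[X \mid Y = y] = \mu_x + \rho\,\frac{\sigma_x}{\sigma_y}\,(y - \mu_y)
```

Drawn in the (x, y) plane, the first line has slope \rho\sigma_y/\sigma_x and the second has slope \sigma_y/(\rho\sigma_x). They agree only when |\rho| = 1.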
Program
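# Slopes of the two ordinary regressions and of three symmetric regressions
ol<-function(x,y)
{
s_xy=sum((x-mean(x))*(y-mean(y)))
s_xx=sum((x-mean(x))^2)
s_yy=sum((y-mean(y))^2)
b1=s_xy/s_xx   # slope of y~x
b2=s_yy/s_xy   # slope of x~y, expressed as dy/dx in the (x,y) plane
r=cor(x,y)
# orthogonal (major axis) slope; b2-1/b1 equals (s_yy-s_xx)/s_xy
b_ol=((b2-1/b1)+sign(r)*sqrt(4+(b2-1/b1)^2))/2
# reduced major axis slope (the SD line)
b_sd=sign(r)*sqrt(b1*b2)
# bisector slope
b_bi=(b1*b2-1+sqrt((1+b1^2)*(1+b2^2)))/(b1+b2)
B=list(b_xy=b1,b_yx=b2,b_ol=b_ol,b_sd=b_sd,b_bi=b_bi)
return(B)
}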
Data
IQ=c(90,92,93,95,97,98,100)
P=c(39,42,36,45,39,45,42)

Analysis
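B=as.numeric(ol(IQ,P))   # five slopes, in the order returned by ol()
A=mean(P)-B*mean(IQ)     # intercepts: every line passes through the means
plot(IQ,P)
lines(IQ,A[1]+B[1]*IQ)
lines(IQ,A[2]+B[2]*IQ,col="purple")
lines(IQ,A[3]+B[3]*IQ,col="red")
lines(IQ,A[4]+B[4]*IQ,col="blue")
lines(IQ,A[5]+B[5]*IQ,col="green")
# B[1] is the y~x slope and B[2] the x~y slope (here y=P, x=IQ)
legend("topleft",c("y~x","x~y","ol","sd","bi"),lty=1,col=c("black","purple","red","blue","green"))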
Some special problems
1. Outliers / standardization
 2. Centering

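1. Outliers / standardization
Example 1.1. IQ scores of young people are normally distributed. Those
above the 99% quantile can be defined as intellectually gifted (outliers).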
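The 99% rule in Example 1.1 amounts to standardizing each score and comparing it with the 99% quantile of the standard normal distribution. The sketch below is illustrative only: the mean of 100 and standard deviation of 15 are the conventional IQ scale, assumed here and not taken from the source, and the scores are made up.

```r
mu    <- 100            # assumed population mean (conventional IQ scale)
sigma <- 15             # assumed population SD (conventional IQ scale)

z99    <- qnorm(0.99)            # 99% quantile of the standard normal
cutoff <- mu + z99 * sigma       # the same threshold on the IQ scale

iq <- c(95, 110, 128, 136, 142)  # hypothetical scores, for illustration only
z  <- (iq - mu) / sigma          # standardized scores
data.frame(iq = iq, z = round(z, 2), outlier = z > z99)
```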
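Example 1.2. Body mass index. An inappropriate definition of obesity:
a person whose weight exceeds the 95% quantile of the population is obese.
People of different height, sex, and age are not comparable; that is, μ
is a function of several factors.
A simple but tedious remedy is stratification: for each given group,
find the distribution of W and define as obese anyone who exceeds C (for
example, the 95% quantile of the standard normal distribution).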
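A minimal sketch of the stratified rule, on simulated weights. The groups, means, and standard deviations are invented for illustration, and only sex is used as a stratum to keep the example short. Weights are standardized within each group and then compared with C.

```r
set.seed(2)                                  # arbitrary seed
n <- 400
G <- factor(sample(c("F", "M"), n, replace = TRUE))   # stratum: sex
W <- rnorm(n,
           mean = ifelse(G == "M", 75, 62),  # assumed group means (kg)
           sd   = ifelse(G == "M", 11, 9))   # assumed group SDs (kg)

# Standardize W within each stratum
z <- ave(W, G, FUN = function(w) (w - mean(w)) / sd(w))

C <- qnorm(0.95)          # cutoff: 95% quantile of the standard normal
flagged <- z > C

tapply(flagged, G, mean)  # share flagged in each stratum, roughly 5% each
```

With several stratifying factors the number of cells grows quickly and each cell holds few observations, which is why the approach becomes tedious.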

Another approach is to remove the effect of log(H) on log(W) (while
also controlling for sex G and age), that is, to assume a regression
model.
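The model formula itself is not preserved in these notes. The sketch below assumes the simplest linear form, log(W) = b0 + b1*log(H) + b2*G + b3*age + error, fits it to simulated data, and flags people whose standardized residual exceeds the 95% standard normal quantile. All coefficients and distributions are invented for illustration.

```r
set.seed(3)                                   # arbitrary seed
n   <- 500
G   <- factor(sample(c("F", "M"), n, replace = TRUE))
age <- runif(n, 20, 60)
H   <- rnorm(n, mean = ifelse(G == "M", 175, 162), sd = 7)   # height, cm

# Simulated weights from an assumed log-linear model
W <- exp(-6.2 + 2 * log(H) + 0.05 * (G == "M") + 0.003 * age +
         rnorm(n, sd = 0.15))

fit <- lm(log(W) ~ log(H) + G + age)   # assumed form of the regression model
z   <- rstandard(fit)                  # standardized residuals

flagged <- z > qnorm(0.95)             # heavy for one's height, sex, and age
mean(flagged)                          # proportion flagged
```

Compared with stratification, a single fitted model uses all observations at once, at the cost of relying on the assumed functional form.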
2. Centering