Uploaded by 王端凝

H1

A1
1
a)
1. When people decide to buy a house, they consider many factors, such as the price, the location, the
area and so on. To find the factor that influences buyers' decisions most, a regression analysis
is needed. This will help the land agent increase sales.
predictors: price, location, area and house type
response: whether people buy or not
The goal is inference. What the land agent wants to know is which feature matters most to most buyers.
2. Since the number of team members is limited, the coach will use regression analysis to find each member's
most outstanding strength and put each member in the right position.
predictors: speed, response rate, vigor
response: overall performance
The goal is inference. What the coach wants to know is each member's strong suit.
3. Based on people's browsing and purchase records, Taobao can use regression analysis to find
people's interests.
predictors: browsing records and purchase records
response: people's interests
The goal is prediction. Taobao uses people's activity records to recommend items they may be interested in.
b)
1. The entertainment company will market each actor as a specific type according to the actor's characteristics,
appearance and abilities.
predictors: characteristics, appearance and abilities
response: the type the entertainment company chooses
The goal is prediction. Classification analysis is used to find the type each actor belongs to.
2. An educational institution can help high school students find the university major that suits them
best according to their characteristics, interests and the subjects they are good at.
predictors: characteristics, interests, strongest subjects
response: recommended major
The goal is prediction. Classification analysis is used to find a suitable major for each student.
3. A luxury brand can find the main features that make its goods feel luxurious by selling some samples
to customers and collecting their feedback.
predictors: the features of the goods
response: whether the goods are luxurious enough or not
The goal is inference. The brand does not need to predict; it wants to find the key features that keep its
goods at a high level.
c)
1. Group animals and plants, or genes, to gain insight into the inherent structure of the population.
2. Group online documents for information discovery.
3. Discover distinct customer groups and characterize them by their purchase patterns.
2
a) Better.
A more flexible method can capture more information from the data.
b) Worse.
A more flexible method is very likely to over-fit.
c) Better.
A highly non-linear relationship needs more degrees of freedom, so a more flexible method can fit it better.
d) Worse.
A more flexible method gives a model with larger variance and smaller bias. In the given situation
the variance of the error terms is already high, so a flexible method would fit the noise: the irreducible
error would be captured as if it were a feature of the data, causing over-fitting.
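The trade-off behind these four answers can be illustrated with a small simulation. This is only a sketch under assumed settings: the true function sin(2x), the noise level, the sample sizes and the polynomial degrees are hypothetical choices made for illustration and are not part of the question.

```r
# Hypothetical setup: a non-linear truth, noisy observations, a small training set.
set.seed(1)
n_train <- 30
x <- runif(n_train, -2, 2)
train <- data.frame(x = x, y = sin(2 * x) + rnorm(n_train, sd = 0.5))

x_new <- runif(1000, -2, 2)
test <- data.frame(x = x_new, y = sin(2 * x_new) + rnorm(1000, sd = 0.5))

# Fit polynomials of increasing flexibility and record both errors.
degrees <- c(1, 3, 15)
mse <- sapply(degrees, function(d) {
  fit <- lm(y ~ poly(x, d), data = train)
  c(train = mean(residuals(fit)^2),
    test  = mean((test$y - predict(fit, newdata = test))^2))
})
colnames(mse) <- paste0("degree_", degrees)
print(round(mse, 3))
```

The training MSE can only go down as the degree grows, because each model contains the previous one. The test MSE typically falls and then rises again once the fit starts to follow the noise, which is the over-fitting described in b) and d).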
3
See the last page.
4
See the last page.
5
a)
speedX=c(20,20,30,30,40,40,50,50,60,60)
stopDisY=c(16.9, 25.7, 38.2, 63.5, 65.7, 96.4, 103.1, 155.6, 218.2, 160.8)
lm.fit=lm(stopDisY~speedX)
summary(lm.fit)
##
## Call:
## lm(formula = stopDisY ~ speedX)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -32.80 -16.12   3.73  13.35  40.81
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -71.5500    23.2258  -3.081   0.0151 *
## speedX        4.1490     0.5474   7.579 6.43e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 24.48 on 8 degrees of freedom
## Multiple R-squared:  0.8778, Adjusted R-squared:  0.8625
## F-statistic: 57.44 on 1 and 8 DF,  p-value: 6.431e-05
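The numbers in this summary can also be pulled out of the fitted object rather than read off the printed table. A minimal sketch, re-creating the fit so that it runs on its own; the names s, r2, p_slope and rse_ratio are introduced here for illustration only.

```r
speedX   <- c(20, 20, 30, 30, 40, 40, 50, 50, 60, 60)
stopDisY <- c(16.9, 25.7, 38.2, 63.5, 65.7, 96.4, 103.1, 155.6, 218.2, 160.8)
lm.fit   <- lm(stopDisY ~ speedX)

s <- summary(lm.fit)

# Share of the variance in stopDisY explained by speedX (about 0.878).
r2 <- s$r.squared

# p-value of the t-test on the slope (about 6.43e-05).
p_slope <- coef(s)["speedX", "Pr(>|t|)"]

# Residual standard error relative to the mean response.
rse_ratio <- s$sigma / mean(stopDisY)

print(c(r2 = r2, p_slope = p_slope, rse_ratio = rse_ratio))
```

In simple linear regression R^2 equals the squared correlation between predictor and response, so cor(speedX, stopDisY)^2 gives the same value as r2.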
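1. Is there a relationship between the predictor and the response?
Yes. The p-value for the slope of speedX is 6.43e-05, well below 0.05, so the null hypothesis of no relationship is rejected.
2. How strong is the relationship between the predictor and the response?
The R^2 is 0.8778, so speed explains about 88% of the variance in stopping distance, and the p-value of 6.43e-05 shows the relationship is highly significant. The relationship is strong.
3. Is the relationship between the predictor and the response positive or negative?
Positive, since the coefficient of the predictor (4.149) is positive.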
4. What is the predicted stop distance associated with a speed of 55? For this value of speed, what are
the associated 95% confidence and prediction intervals for the prediction of stop distance?
predict(lm.fit, newdata = data.frame(speedX=55), interval = "confidence", level=0.95)
##       fit      lwr      upr
## 1 156.645 130.6201 182.6699
predict(lm.fit, newdata = data.frame(speedX=55), interval = "prediction", level=0.95)
##       fit      lwr      upr
## 1 156.645 94.47933 218.8107
The predicted stopping distance at a speed of 55 is 156.645.
The associated 95% confidence interval is [130.6201, 182.6699].
The associated 95% prediction interval is [94.47933, 218.8107].
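Both intervals can be reproduced from the textbook formulas for simple linear regression, which shows why the prediction interval is wider: it adds the variance of a new observation to the variance of the fitted mean. A sketch under the usual assumptions (independent normal errors with constant variance); the variable names are illustrative.

```r
speedX   <- c(20, 20, 30, 30, 40, 40, 50, 50, 60, 60)
stopDisY <- c(16.9, 25.7, 38.2, 63.5, 65.7, 96.4, 103.1, 155.6, 218.2, 160.8)
lm.fit   <- lm(stopDisY ~ speedX)

x0    <- 55
n     <- length(speedX)
fit0  <- sum(coef(lm.fit) * c(1, x0))        # point prediction at x0
s     <- summary(lm.fit)$sigma               # residual standard error
sxx   <- sum((speedX - mean(speedX))^2)
tcrit <- qt(0.975, df = n - 2)               # two-sided 95% critical value

# Standard error of the fitted mean, and of a single new observation.
se_mean <- s * sqrt(1 / n + (x0 - mean(speedX))^2 / sxx)
se_new  <- s * sqrt(1 + 1 / n + (x0 - mean(speedX))^2 / sxx)

conf_int <- fit0 + c(-1, 1) * tcrit * se_mean
pred_int <- fit0 + c(-1, 1) * tcrit * se_new

print(rbind(confidence = conf_int, prediction = pred_int))
```

The two rows should agree with the lwr and upr columns returned by predict() with interval = "confidence" and interval = "prediction".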
b)
plot(speedX,stopDisY,pch=20)
abline(lm.fit,lwd =3,col="red")
[Figure: scatter plot of stopDisY against speedX with the fitted regression line in red]
c)
plot(speedX, residuals(lm.fit))
[Figure: residuals(lm.fit) plotted against speedX]
comment: The residuals show a clear U-shaped pattern, which indicates non-linearity in the data. Their spread
also grows with speed, so the variance of the error depends on the input.
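Because each speed was measured twice, the pattern in the residual plot can be summarized numerically per speed level. A minimal sketch; group_mean and group_spread are names introduced here for illustration.

```r
speedX   <- c(20, 20, 30, 30, 40, 40, 50, 50, 60, 60)
stopDisY <- c(16.9, 25.7, 38.2, 63.5, 65.7, 96.4, 103.1, 155.6, 218.2, 160.8)
lm.fit   <- lm(stopDisY ~ speedX)

res <- residuals(lm.fit)

# Mean residual at each speed: positive at the ends and negative in the
# middle points to curvature that the straight line misses.
group_mean <- tapply(res, speedX, mean)

# Gap between the two residuals at each speed: a gap that grows with
# speed points to non-constant error variance.
group_spread <- tapply(res, speedX, function(r) diff(range(r)))

print(round(rbind(mean = group_mean, spread = group_spread), 2))
```

With only two observations per speed this is a rough diagnostic rather than a formal test.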
d)
stopDisY_2=sqrt(stopDisY)
lm.fit2=lm(stopDisY_2~speedX)
plot(speedX,stopDisY_2)
[Figure: stopDisY_2 plotted against speedX]
We take stopDisY_2 = sqrt(stopDisY) because the residual plot suggests that stopDisY is roughly proportional to speedX^2, so its square root should be roughly linear in speedX.
e)
summary(lm.fit2)
##
## Call:
## lm(formula = stopDisY_2 ~ speedX)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -1.23066 -0.89158 -0.04093  0.98606  1.13601
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  0.12895    0.98160   0.131    0.899
## speedX       0.22511    0.02314   9.730 1.04e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.035 on 8 degrees of freedom
## Multiple R-squared:  0.9221, Adjusted R-squared:  0.9123
## F-statistic: 94.67 on 1 and 8 DF,  p-value: 1.041e-05
plot(predict(lm.fit2),residuals(lm.fit2))
[Figure: residuals(lm.fit2) plotted against predict(lm.fit2)]
The R^2 for this model is 0.922, which is larger than the 0.878 of the first model. The two values are
measured on different response scales, so the residual plot is the more reliable evidence: for this model
the residuals show a random pattern, while for the original model they show a quadratic pattern. So this
model better describes the relationship between x and y.
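Since the two models use different response scales, one way to compare them directly is to square the fitted values of the square-root model and measure both fits in the original units. A sketch only: squaring the fitted values is a simple back-transformation chosen here for illustration, and the names rss_linear, rss_sqrt and pred55_sqrt are not from the original analysis.

```r
speedX   <- c(20, 20, 30, 30, 40, 40, 50, 50, 60, 60)
stopDisY <- c(16.9, 25.7, 38.2, 63.5, 65.7, 96.4, 103.1, 155.6, 218.2, 160.8)

lm.fit     <- lm(stopDisY ~ speedX)
stopDisY_2 <- sqrt(stopDisY)
lm.fit2    <- lm(stopDisY_2 ~ speedX)

# Residual sum of squares of each model, both measured on the stopDisY scale.
rss_linear <- sum(residuals(lm.fit)^2)
rss_sqrt   <- sum((stopDisY - predict(lm.fit2)^2)^2)

# Prediction at a speed of 55 from the transformed model, back on the original scale.
pred55_sqrt <- unname(predict(lm.fit2, newdata = data.frame(speedX = 55)))^2

print(c(rss_linear = rss_linear, rss_sqrt = rss_sqrt, pred55_sqrt = pred55_sqrt))
```

A smaller residual sum of squares for the transformed model on the original scale supports the conclusion drawn from the residual plots.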
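3, 4.
a)
With $y = f(x) + \varepsilon$, $E[\varepsilon] = 0$, $\mathrm{Var}(\varepsilon) = \sigma^2$, and $\varepsilon$ independent of $\hat f(x)$:

$$
\begin{aligned}
E\big[(y - \hat f(x))^2\big]
&= E\big[(f(x) + \varepsilon - \hat f(x))^2\big] \\
&= E\big[(f(x) - \hat f(x))^2\big] + 2\,E\big[(f(x) - \hat f(x))\,\varepsilon\big] + E[\varepsilon^2] \\
&= E\big[(f(x) - \hat f(x))^2\big] + \sigma^2
\end{aligned}
$$

$$
\begin{aligned}
E\big[(f(x) - \hat f(x))^2\big]
&= E\big[(f(x) - E\hat f(x) + E\hat f(x) - \hat f(x))^2\big] \\
&= \big(f(x) - E\hat f(x)\big)^2 + E\big[(E\hat f(x) - \hat f(x))^2\big]
   + 2\,\big(f(x) - E\hat f(x)\big)\,E\big[E\hat f(x) - \hat f(x)\big] \\
&= \big[\mathrm{Bias}(\hat f(x))\big]^2 + \mathrm{Var}(\hat f(x))
\end{aligned}
$$

The cross term is 0 because $E\big[E\hat f(x) - \hat f(x)\big] = 0$. Substituting into the original formula:

$$
E\big[(y - \hat f(x))^2\big] = \big[\mathrm{Bias}(\hat f(x))\big]^2 + \mathrm{Var}(\hat f(x)) + \sigma^2
$$

b)
Bias refers to the error that is introduced by modeling a real-life problem by a much simpler problem. So the more flexible (complex) a method is, the less bias it generally has.
Variance refers to how much your estimate would change if you had a different training data set. So the more flexible a method is, the larger its variance.

[Figure: hand-drawn sketch of the bias, variance, training error and irreducible error curves]

For the k-nearest-neighbor estimate $\hat f_{kNN}(x) = \frac{1}{k}\sum_{i=1}^{k} y_i$, with the neighbors $x_i$ held fixed and $y_i = f(x_i) + \varepsilon_i$:

$$
\begin{aligned}
E\big[(y - \hat f_{kNN}(x))^2\big]
&= E\big[(f(x) + \varepsilon - \hat f_{kNN}(x))^2\big] \\
&= E\big[(f(x) - \hat f_{kNN}(x))^2\big] + \sigma^2 \\
&= \big(f(x) - E\hat f_{kNN}(x)\big)^2 + E\big[(\hat f_{kNN}(x) - E\hat f_{kNN}(x))^2\big] + \sigma^2 \\
&= \Big(f(x) - \frac{1}{k}\sum_{i=1}^{k} f(x_i)\Big)^2 + \frac{\sigma^2}{k} + \sigma^2
\end{aligned}
$$

since $E\hat f_{kNN}(x) = \frac{1}{k}\sum_{i=1}^{k} f(x_i)$ and $\mathrm{Var}(\hat f_{kNN}(x)) = \frac{1}{k^2}\sum_{i=1}^{k}\mathrm{Var}(\varepsilon_i) = \frac{\sigma^2}{k}$.

d)
When k increases, $\frac{1}{k}\sum_{i=1}^{k} f(x_i)$ becomes more biased, because the prediction at x depends on more points. But at the same time $\frac{\sigma^2}{k}$ decreases, so the variance decreases. When k decreases, vice versa.
k increases: 1/k decreases, variance decreases.
k decreases: 1/k increases, variance increases.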
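The k-nearest-neighbor trade-off in d) can be checked numerically from the squared-bias term and the sigma^2/k variance term. This is a sketch under assumed settings: the true function f(x) = x^2, the fixed grid of training points, the query point x0 = 0.5 and sigma = 1 are all hypothetical choices made for illustration.

```r
# Hypothetical truth, fixed design and noise level.
f       <- function(x) x^2
x_train <- seq(0, 1, length.out = 21)
x0      <- 0.5
sigma   <- 1
n       <- length(x_train)

# Training points ordered from nearest to farthest from x0.
nn_order <- order(abs(x_train - x0))

ks <- 1:n
# Squared bias: (f(x0) - average of f over the k nearest neighbors)^2.
bias2 <- sapply(ks, function(k) {
  nbrs <- x_train[nn_order[1:k]]
  (f(x0) - mean(f(nbrs)))^2
})
# Variance of the average of k independent noisy responses.
variance <- sigma^2 / ks

result <- data.frame(k = ks, bias2 = bias2, variance = variance,
                     total = bias2 + variance + sigma^2)
print(round(result[c(1, 5, 11, 21), ], 4))
```

With k = 1 the only neighbor is x0 itself, so the bias is essentially zero and the variance is at its largest; with k = n the estimate averages over the whole grid, so the variance is smallest and the bias is largest.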