A1

1
a)
1. When people decide to buy a house, they consider many factors, such as the price, the location, and the area. A regression analysis can identify the factor that influences the decision most, which helps real-estate agents increase sales.
Predictors: price, location, area, and house type.
Response: whether people buy or not.
The goal is inference. The agent wants to know which feature matters most to most buyers.
2. Since the number of team members is limited, a coach can use regression analysis to find each member's most outstanding strength and put each member in the right position.
Predictors: speed, response rate, vigor.
Response: overall performance.
The goal is inference. The coach wants to know each member's strong suit.
3. From people's browsing and purchase records, Taobao can use regression analysis to find people's interests.
Predictors: browsing records and purchase records.
Response: people's interests.
The goal is prediction. Taobao uses these records to recommend items that people may be interested in.
b)
1. An entertainment company markets each of its actors as a specific type according to the actor's characteristics, appearance, and abilities.
Predictors: characteristics, appearance, and abilities.
Response: the type the entertainment company chooses.
The goal is prediction. Classification is used to find the type that the actor belongs to.
2. An educational institution can help high school students find the university major that suits them best according to their characteristics, interests, and strongest subjects.
Predictors: characteristics, interests, strongest subjects.
Response: recommended major.
The goal is prediction. Classification is used to find a suitable major for the student.
3. Luxury brands can find the main features that make their goods count as luxurious by selling samples to customers and collecting their feedback.
Predictors: the features of the goods.
Response: whether the goods are luxurious enough or not.
The goal is inference. The brands do not need to predict; they want to find the key features that keep their goods at a high level.
c)
1. Group animals and plants, or genes, to gain insight into the inherent structure of populations.
2. Group documents on the web to support information discovery.
3. Discover distinct customer groups and characterize them by their purchase patterns.

2
a) Better. A more flexible method can capture more information from the data.
b) Worse. A more flexible method is very likely to overfit.
c) Better. A highly nonlinear relationship requires more flexibility, so a more flexible method fits it better.
d) Worse. A more flexible method has larger variance and smaller bias. Here the variance of the errors is already high, so a flexible method would fit the irreducible error as if it were a feature of the data, which causes overfitting.

3 See the last page.

4 See the last page.

5
a)
# speed and measured stopping distance (two observations per speed)
speedX = c(20, 20, 30, 30, 40, 40, 50, 50, 60, 60)
stopDisY = c(16.9, 25.7, 38.2, 63.5, 65.7, 96.4, 103.1, 155.6, 218.2, 160.8)
# simple linear regression of stopping distance on speed
lm.fit = lm(stopDisY ~ speedX)
summary(lm.fit)
##
## Call:
## lm(formula = stopDisY ~ speedX)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -32.80 -16.12   3.73  13.35  40.81
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -71.5500    23.2258  -3.081   0.0151 *
## speedX        4.1490     0.5474   7.579 6.43e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 24.48 on 8 degrees of freedom
## Multiple R-squared:  0.8778, Adjusted R-squared:  0.8625
## F-statistic: 57.44 on 1 and 8 DF,  p-value: 6.431e-05

1. Is there a relationship between the predictor and the response?
Yes. The coefficient of speedX is significantly different from zero.
2. How strong is the relationship between the predictor and the response?
The p-value is 6.43e-05.
The relationship is very strong: R-squared is 0.8778, so speedX explains about 88% of the variance in stopDisY.
3. Is the relationship between the predictor and the response positive or negative?
Positive, since the coefficient of the predictor is positive.
4. What is the predicted stop distance associated with a speed of 55? For this value of speed, what are the associated 95% confidence and prediction intervals for the prediction of stop distance?
predict(lm.fit, newdata = data.frame(speedX = 55), interval = "confidence", level = 0.95)
##       fit      lwr      upr
## 1 156.645 130.6201 182.6699
predict(lm.fit, newdata = data.frame(speedX = 55), interval = "prediction", level = 0.95)
##       fit      lwr      upr
## 1 156.645 94.47933 218.8107
The predicted stop distance associated with a speed of 55 is 156.645.
The associated 95% confidence interval is [130.6201, 182.6699].
The associated 95% prediction interval is [94.47933, 218.8107].

b)
plot(speedX, stopDisY, pch = 20)
abline(lm.fit, lwd = 3, col = "red")
[Figure: scatter plot of stopDisY against speedX with the fitted regression line]

c)
plot(speedX, residuals(lm.fit))
[Figure: residuals(lm.fit) against speedX]
Comment: the strong pattern in the residuals indicates non-linearity in the data. The variance of the error also depends on the input: the spread of the residuals grows with speedX.

d)
# square-root transform of the response, then refit
stopDisY_2 = sqrt(stopDisY)
lm.fit2 = lm(stopDisY_2 ~ speedX)
plot(speedX, stopDisY_2)
[Figure: stopDisY_2 against speedX]
We take stopDisY_2 = sqrt(stopDisY) because the residual plot suggests that stopDisY is roughly proportional to speedX^2.

e)
summary(lm.fit2)
##
## Call:
## lm(formula = stopDisY_2 ~ speedX)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -1.23066 -0.89158 -0.04093  0.98606  1.13601
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  0.12895    0.98160   0.131    0.899
## speedX       0.22511    0.02314   9.730 1.04e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.035 on 8 degrees of freedom
## Multiple R-squared:  0.9221, Adjusted R-squared:  0.9123
## F-statistic: 94.67 on 1 and 8 DF,  p-value: 1.041e-05
plot(predict(lm.fit2), residuals(lm.fit2))
[Figure: residuals(lm.fit2) against predict(lm.fit2)]
The R^2 for this model is 0.922, which is larger than that of the first model. The residual plot for this model shows a random pattern, while the one for the original model has a quadratic shape. So this model better describes the relationship between x and y.

3

4
a) Let $y = f(x) + \varepsilon$ with $E[\varepsilon] = 0$ and $\mathrm{Var}(\varepsilon) = \sigma^2$, where $\varepsilon$ is independent of $\hat f(x)$. Then
\[ E[(y - \hat f(x))^2] = E[(f(x) - \hat f(x))^2] + 2E[(f(x) - \hat f(x))\varepsilon] + E[\varepsilon^2] = E[(f(x) - \hat f(x))^2] + \sigma^2 . \]
Adding and subtracting $E[\hat f(x)]$ in the first term,
\[ E[(f(x) - \hat f(x))^2] = (f(x) - E[\hat f(x)])^2 + E[(E[\hat f(x)] - \hat f(x))^2] + 2(f(x) - E[\hat f(x)])\,E[E[\hat f(x)] - \hat f(x)] , \]
and the last term is 0. Substituting back into the original formula,
\[ E[(y - \hat f(x))^2] = \mathrm{Bias}^2(\hat f(x)) + \mathrm{Var}(\hat f(x)) + \sigma^2 . \]
b) Bias refers to the error that is introduced by modeling a real-life problem by a much simpler problem. So the more flexible (complex) a method is, the less bias it generally has. Variance refers to how much your estimate would change if you had a different training data set. So the more flexible a method is, the larger its variance.
[Figure: hand-drawn sketch with curves labeled bias, variance, training error, and irreducible error]
c) For the KNN estimate $\hat f_{kNN}(x) = \frac{1}{k}\sum_{i=1}^{k} y_i$, where $y_i = f(x_i) + \varepsilon_i$ are the responses of the $k$ nearest neighbours of $x$,
\[ E[(y - \hat f_{kNN}(x))^2] = (f(x) - E[\hat f_{kNN}(x)])^2 + E[(\hat f_{kNN}(x) - E[\hat f_{kNN}(x)])^2] + \sigma^2 = \Big(f(x) - \frac{1}{k}\sum_{i=1}^{k} f(x_i)\Big)^2 + \frac{\sigma^2}{k} + \sigma^2 . \]
d) When k increases, $\frac{1}{k}\sum_{i=1}^{k} f(x_i)$ becomes more biased, because the prediction at x depends on more points. At the same time, $\sigma^2/k$ decreases, so the variance decreases. When k decreases, the opposite holds. In short: larger k gives higher bias and lower variance; smaller k gives lower bias and higher variance.
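
The effect of k on the squared bias $(f(x) - \frac{1}{k}\sum f(x_i))^2$ and on the variance $\sigma^2/k$ can be checked numerically. The following is a minimal simulation sketch, not part of the original solution: the true function f(x) = x^2, the noise level sigma = 1, the training grid, the query point x0 = 1, and the helper name knn_decomp are all assumptions chosen for illustration.

```r
# Minimal simulation sketch under assumed settings (not from the assignment):
# true function f(x) = x^2, noise sd sigma = 1, fixed 1-D inputs, query point x0.
set.seed(1)
f     <- function(x) x^2                # hypothetical true regression function
sigma <- 1                              # hypothetical noise standard deviation
x     <- seq(0, 2, length.out = 41)     # fixed training inputs
x0    <- 1                              # point at which we predict
n_sim <- 5000                           # number of simulated training sets

# Hypothetical helper: squared bias and variance of the KNN average at x0
knn_decomp <- function(k) {
  # indices of the k training inputs closest to x0 (same in every simulation)
  nn <- order(abs(x - x0))[seq_len(k)]
  # each replicate draws fresh noise and averages the k neighbouring responses
  preds <- replicate(n_sim, mean(f(x[nn]) + rnorm(k, sd = sigma)))
  c(k         = k,
    bias2_sim = (mean(preds) - f(x0))^2,      # simulated squared bias
    bias2_thy = (f(x0) - mean(f(x[nn])))^2,   # closed-form squared bias
    var_sim   = var(preds),                   # simulated variance
    var_thy   = sigma^2 / k)                  # closed-form variance
}

result <- t(sapply(c(1, 5, 15, 41), knn_decomp))
print(round(result, 3))
```

Under these assumptions the var_thy column equals sigma^2 / k and shrinks as k grows, while bias2_thy grows because more distant neighbours enter the average; the simulated columns should track the closed-form ones up to Monte Carlo error.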