HW1 Solution

#1.6)

(a) Plot the marginal dot diagrams for all the variables.

[Figure 1: Dot plots of the variables in the air pollution dataset. Panels: wind, solar radiation, CO, NO, NO2, O3, HC; vertical axis: count.]

(b) Construct the x̄, Sn, and R arrays, and interpret the entries in R.

           Wind    Solar       CO       NO      NO2       O3       HC
x̄          7.50    73.86     4.55     2.19    10.05     9.40     3.10

Table 1: The sample means of variables.

           Wind    Solar       CO       NO      NO2       O3       HC
Wind       2.50    -2.78    -0.38    -0.46    -0.59    -2.23     0.17
Solar     -2.78   300.52     3.91    -1.39     6.76    30.79     0.62
CO        -0.38     3.91     1.52     0.67     2.31     2.82     0.14
NO        -0.46    -1.39     0.67     1.18     1.09    -0.81     0.18
NO2       -0.59     6.76     2.31     1.09    11.36     3.13     1.04
O3        -2.23    30.79     2.82    -0.81     3.13    30.98     0.59
HC         0.17     0.62     0.14     0.18     1.04     0.59     0.48

Table 2: The sample variance-covariance matrix of the variables.

           Wind    Solar       CO       NO      NO2       O3       HC
Wind       1.00    -0.10    -0.19    -0.27    -0.11    -0.25     0.16
Solar     -0.10     1.00     0.18    -0.07     0.12     0.32     0.05
CO        -0.19     0.18     1.00     0.50     0.56     0.41     0.17
NO        -0.27    -0.07     0.50     1.00     0.30    -0.13     0.23
NO2       -0.11     0.12     0.56     0.30     1.00     0.17     0.45
O3        -0.25     0.32     0.41    -0.13     0.17     1.00     0.15
HC         0.16     0.05     0.17     0.23     0.45     0.15     1.00

Table 3: The sample correlation matrix of the variables.

The majority of the variables have only weak linear associations, with correlations close to zero.
The pollutants are mostly positively correlated with each other. Wind is negatively correlated with every pollutant except HC (0.16), while solar radiation is positively correlated with every pollutant except NO (-0.07).

[Figure 2: Star plots of the air pollution variables, in four panels: Windy and Sunny; Windy and Not Sunny; Not Windy and Sunny; Not Windy and Not Sunny. Each star is labeled with its observation number.]

To investigate whether wind and sun affect air pollution, we can split the wind and solar radiation variables at their medians and then make the star plots in Figure 2. From the stars, we can see that solar radiation has some effect on air pollution: when it is sunny, most pollutants are at relatively low levels. Within each group, however, the patterns are quite different. There is a fair amount of variation in each of the four groups, since each group contains days with very little pollution as well as days with quite a bit of air pollution.

#1.17)

In this dataset the first three variables are measured in seconds, while the last four are measured in minutes.

               100m      200m      400m      800m     1500m     3000m  Marathon
x̄             11.62     23.64     53.41      2.08      4.33      9.45    173.25

Table 4: The sample means for the track record.

               100m      200m      400m      800m     1500m     3000m  Marathon
100m           0.20      0.48      1.01      0.04      0.11      0.28      9.44
200m           0.48      1.23      2.55      0.09      0.26      0.65     23.18
400m           1.01      2.55      7.17      0.26      0.70      1.72     57.49
800m           0.04      0.09      0.26      0.01      0.03      0.08      2.57
1500m          0.11      0.26      0.70      0.03      0.11      0.27      8.88
3000m          0.28      0.65      1.72      0.08      0.27      0.68     22.57
Marathon       9.44     23.18     57.49      2.57      8.88     22.57    925.96

Table 5: The sample variance-covariance matrix for the track record variables.
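As a check on how a correlation entry follows from the covariance matrix, each correlation is the covariance divided by the two standard deviations. The worked instance below uses the 400m and Marathon entries of Table 5; the symbols $s_{ik}$ and $r_{ik}$ are the usual textbook notation and are an assumption here, since the solution does not name them. Because the covariance entries are rounded to two decimals, other pairs recomputed this way may differ from the tabulated correlations in the second decimal.

```latex
\[
r_{ik} = \frac{s_{ik}}{\sqrt{s_{ii}}\,\sqrt{s_{kk}}},
\qquad
r_{400\mathrm{m},\,\mathrm{Marathon}}
= \frac{57.49}{\sqrt{7.17 \times 925.96}}
\approx \frac{57.49}{81.48}
\approx 0.71 .
\]
```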
               100m      200m      400m      800m     1500m     3000m  Marathon
100m           1.00      0.95      0.83      0.73      0.73      0.74      0.69
200m           0.95      1.00      0.86      0.72      0.70      0.71      0.69
400m           0.83      0.86      1.00      0.90      0.79      0.78      0.71
800m           0.73      0.72      0.90      1.00      0.90      0.86      0.78
1500m          0.73      0.70      0.79      0.90      1.00      0.97      0.88
3000m          0.74      0.71      0.78      0.86      0.97      1.00      0.90
Marathon       0.69      0.69      0.71      0.78      0.88      0.90      1.00

Table 6: The sample correlation matrix for the track record variables.

All seven variables are strongly positively correlated, and the correlations tend to be larger when the distances are close to each other. For example, the correlation between 100m and 200m is 0.95, while the correlation between 100m and Marathon is 0.69. This makes sense, since runners tend to be good at races of similar length.

#1.18)

               100m      200m      400m      800m     1500m     3000m  Marathon
x̄              8.62      8.48      7.51      6.44      5.81      5.33      4.15

Table 7: The sample mean track records measured in meters/second.

               100m      200m      400m      800m     1500m     3000m  Marathon
100m           0.11      0.12      0.10      0.08      0.10      0.10      0.13
200m           0.12      0.15      0.13      0.09      0.11      0.12      0.16
400m           0.10      0.13      0.14      0.11      0.12      0.12      0.15
800m           0.08      0.09      0.11      0.11      0.12      0.12      0.15
1500m          0.10      0.11      0.12      0.12      0.16      0.16      0.20
3000m          0.10      0.12      0.12      0.12      0.16      0.17      0.21
Marathon       0.13      0.16      0.15      0.15      0.20      0.21      0.32

Table 8: The sample variance-covariance matrix of track records measured in meters/second.

               100m      200m      400m      800m     1500m     3000m  Marathon
100m           1.00      0.95      0.84      0.73      0.74      0.75      0.72
200m           0.95      1.00      0.86      0.73      0.72      0.72      0.71
400m           0.84      0.86      1.00      0.90      0.80      0.78      0.71
800m           0.73      0.73      0.90      1.00      0.92      0.87      0.79
1500m          0.74      0.72      0.80      0.92      1.00      0.96      0.86
3000m          0.75      0.72      0.78      0.87      0.96      1.00      0.89
Marathon       0.72      0.71      0.71      0.79      0.86      0.89      1.00

Table 9: The sample correlation matrix of track records measured in meters/second.

The results are similar to those obtained in Exercise 1.17. All variables have strong, positive linear relationships with each other, and races that are closer together in distance have a stronger relationship. The correlations differ only slightly from those in Exercise 1.17. A change of units alone (for example, minutes to seconds) would leave a correlation unchanged, so the small differences come from the reciprocal transformation from time to speed.
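The conversion to meters per second in Exercise 1.18 can be sketched as follows. This is a minimal illustration in Python, not the code used to produce the tables above. The function names `to_speed` and `pearson` are hypothetical, and the four toy records are made up for the example rather than taken from the dataset. The sketch also shows that rescaling minutes to seconds leaves a correlation unchanged, whereas moving from time to speed can shift it slightly.

```python
import math

def to_speed(distance_m, time, unit):
    """Convert a record time to average speed in meters per second.

    unit is "s" for seconds or "min" for minutes.
    """
    seconds = time * 60.0 if unit == "min" else time
    return distance_m / seconds

def pearson(x, y):
    """Sample correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical toy records (NOT from the dataset):
# 100m times in seconds, 800m times in minutes.
t100 = [11.0, 11.5, 12.0, 12.8]
t800 = [1.95, 2.05, 2.10, 2.30]

r_time = pearson(t100, t800)
# Minutes -> seconds is a linear rescaling, so the correlation is unchanged.
r_seconds = pearson(t100, [60.0 * t for t in t800])
# Speed is a reciprocal function of time, so the correlation can shift.
r_speed = pearson([to_speed(100, t, "s") for t in t100],
                  [to_speed(800, t, "min") for t in t800])

print(r_time, r_seconds, r_speed)
```

On real data the same two helpers would be applied column by column before recomputing the mean vector and the covariance and correlation matrices.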
Compute the sample variance-covariance matrix (call it S). Obtain the spectral decomposition (also called the eigenvalue decomposition; use eigen if you are using R) of the variance-covariance matrix, S = P Λ P', where the columns of P are the eigenvectors of S. Next, post-multiply the observation matrix (call it X) by P. Plot the pairwise scatter plots of the first three columns.

[Figure 3: Pairwise scatter plots of the first three columns of Y = X*P. Panels: var 1, var 2, var 3.]

From the pairwise scatter plots, we could not find any obvious relationships among these three variables. This is what the construction predicts: the sample covariance matrix of Y = XP is P'SP = Λ, which is diagonal, so the columns of Y are uncorrelated. The first two columns may appear to have a slight negative relationship and the last two a slight positive one, but neither is a linear association in the sample.

#1.26)

(a)

              Breed      SaleP YearlingHT     FFBody  PctFFBody      Frame
x̄              4.38    1742.43      50.52     995.95      70.88       6.32

Table 10: The sample means of the variables in the bulls dataset.
           Back.fat     SaleHT     Salewt
x̄              0.20      54.13    1555.29

Table 10 (continued): The sample means of the remaining variables in the bulls dataset.

                 Breed      SaleP YearlingHT     FFBody  PctFFBody      Frame   Back.fat     SaleHT     Salewt
Breed             9.68    -434.74       2.83     117.83       4.80       1.25      -0.17       3.04      46.94
SaleP          -434.74  388133.66     456.47    5890.60    -229.47     276.42      15.44     486.97   25645.89
YearlingHT        2.83     456.47       3.00     100.13       2.96       1.51      -0.05       2.98      82.81
FFBody          117.83    5890.60     100.13    8594.34     209.50      51.95      -1.40     129.94    6680.31
PctFFBody         4.80    -229.47       2.96     209.50      10.69       1.46      -0.14       3.41      83.93
Frame             1.25     276.42       1.51      51.95       1.46       0.86      -0.02       1.49      44.32
Back.fat         -0.17      15.44      -0.05      -1.40      -0.14      -0.02       0.01      -0.05       2.41
SaleHT            3.04     486.97       2.98     129.94       3.41       1.49      -0.05       4.02     147.29
Salewt           46.94   25645.89      82.81    6680.31      83.93      44.32       2.41     147.29   16850.66

Table 11: The sample variance-covariance matrix of the variables in the bulls dataset.

                 Breed      SaleP YearlingHT     FFBody  PctFFBody      Frame   Back.fat
Breed             1.00      -0.22       0.52       0.41       0.47       0.43      -0.62
SaleP            -0.22       1.00       0.42       0.10      -0.11       0.48       0.28
YearlingHT        0.52       0.42       1.00       0.62       0.52       0.94      -0.34
FFBody            0.41       0.10       0.62       1.00       0.69       0.60      -0.17
PctFFBody         0.47      -0.11       0.52       0.69       1.00       0.48      -0.49
Frame             0.43       0.48       0.94       0.60       0.48       1.00      -0.26
Back.fat         -0.62       0.28      -0.34      -0.17      -0.49      -0.26       1.00
SaleHT            0.49       0.39       0.86       0.70       0.52       0.80      -0.28
Salewt            0.12       0.32       0.37       0.56       0.20       0.37       0.21

Table 12: The sample correlation matrix of the variables in the bulls dataset (columns Breed through Back.fat).

Only a few variables (Frame, Yearling height, and Sale Height) have strong relationships with each other. I do not think the breeds are well separated in this system, since none of the correlations between breed and the other variables is strong. The best potential variable for distinguishing between breeds is back fat, which has the strongest linear relationship with breed (-0.62).

(b) I did not find any obvious outliers in Figure 4. From the three-dimensional plot, we can observe that most bulls of breed 8 (Simental) have less back fat and a larger frame. The values of back fat and frame in breed 1 (Angus) are more spread out.
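                SaleHT     Salewt
Breed             0.49       0.12
SaleP             0.39       0.32
YearlingHT        0.86       0.37
FFBody            0.70       0.56
PctFFBody         0.52       0.20
Frame             0.80       0.37
Back.fat         -0.28       0.21
SaleHT            1.00       0.57
Salewt            0.57       1.00

Table 12 (continued): The SaleHT and Salewt columns of the sample correlation matrix.

[Figure 4: A three dimensional plot.]

(c) This time the points are more closely clustered, so the three breeds are easier to separate. Bulls of breed 8 (Simental) have higher fat free body weight and higher sale height. The values of fat free body weight and sale height in breed 1 (Angus) are more spread out.

[Figure 5: A three dimensional plot.]

#2.20)

\[
A = \begin{pmatrix} 2 & 1 \\ 1 & 3 \end{pmatrix}
\]

\[
A^{1/2} = P \Lambda^{1/2} P'
= \begin{pmatrix} 0.526 & -0.851 \\ 0.851 & 0.526 \end{pmatrix}
  \begin{pmatrix} 1.902 & 0 \\ 0 & 1.176 \end{pmatrix}
  \begin{pmatrix} 0.526 & 0.851 \\ -0.851 & 0.526 \end{pmatrix}
= \begin{pmatrix} 1.376 & 0.325 \\ 0.325 & 1.701 \end{pmatrix}
\]

\[
A^{-1/2} = P \Lambda^{-1/2} P'
= \begin{pmatrix} 0.526 & -0.851 \\ 0.851 & 0.526 \end{pmatrix}
  \begin{pmatrix} 0.526 & 0 \\ 0 & 0.851 \end{pmatrix}
  \begin{pmatrix} 0.526 & 0.851 \\ -0.851 & 0.526 \end{pmatrix}
= \begin{pmatrix} 0.761 & -0.145 \\ -0.145 & 0.616 \end{pmatrix}
\]

\[
A^{1/2} A^{-1/2}
= \begin{pmatrix} 1.376 & 0.325 \\ 0.325 & 1.701 \end{pmatrix}
  \begin{pmatrix} 0.761 & -0.145 \\ -0.145 & 0.616 \end{pmatrix}
= \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} = I,
\qquad
A^{-1/2} A^{1/2}
= \begin{pmatrix} 0.761 & -0.145 \\ -0.145 & 0.616 \end{pmatrix}
  \begin{pmatrix} 1.376 & 0.325 \\ 0.325 & 1.701 \end{pmatrix}
= \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} = I
\]

#2.23)

\[
V^{1/2} \rho V^{1/2}
= \begin{pmatrix}
    \sqrt{\sigma_{11}} & & & \\
    & \sqrt{\sigma_{22}} & & \\
    & & \ddots & \\
    & & & \sqrt{\sigma_{pp}}
  \end{pmatrix}
  \begin{pmatrix}
    1 & \rho_{12} & \cdots & \rho_{1p} \\
    \rho_{12} & 1 & \cdots & \rho_{2p} \\
    \vdots & \vdots & \ddots & \vdots \\
    \rho_{1p} & \rho_{2p} & \cdots & 1
  \end{pmatrix}
  \begin{pmatrix}
    \sqrt{\sigma_{11}} & & & \\
    & \sqrt{\sigma_{22}} & & \\
    & & \ddots & \\
    & & & \sqrt{\sigma_{pp}}
  \end{pmatrix}
\]

\[
= \begin{pmatrix}
    \sigma_{11} & \rho_{12}\sqrt{\sigma_{11}}\sqrt{\sigma_{22}} & \cdots & \rho_{1p}\sqrt{\sigma_{11}}\sqrt{\sigma_{pp}} \\
    \rho_{12}\sqrt{\sigma_{11}}\sqrt{\sigma_{22}} & \sigma_{22} & \cdots & \rho_{2p}\sqrt{\sigma_{22}}\sqrt{\sigma_{pp}} \\
    \vdots & \vdots & \ddots & \vdots \\
    \rho_{1p}\sqrt{\sigma_{11}}\sqrt{\sigma_{pp}} & \rho_{2p}\sqrt{\sigma_{22}}\sqrt{\sigma_{pp}} & \cdots & \sigma_{pp}
  \end{pmatrix}
\]

\[
= \begin{pmatrix}
    \sigma_{11} & \sigma_{12} & \cdots & \sigma_{1p} \\
    \sigma_{12} & \sigma_{22} & \cdots & \sigma_{2p} \\
    \vdots & \vdots & \ddots & \vdots \\
    \sigma_{1p} & \sigma_{2p} & \cdots & \sigma_{pp}
  \end{pmatrix}
= \Sigma,
\]

where the last step uses $\rho_{ik}\sqrt{\sigma_{ii}}\sqrt{\sigma_{kk}} = \sigma_{ik}$.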
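The square-root matrices in #2.20 can be checked numerically. The sketch below is a minimal pure-Python illustration, assuming a symmetric positive definite 2x2 matrix with a nonzero off-diagonal entry. The helper names `sym2x2_power` and `matmul2` are hypothetical and not part of the original solution. It rebuilds P Λ^{power} P' one eigenpair at a time.

```python
import math

def sym2x2_power(a, b, d, power):
    """Return [[a, b], [b, d]] raised to `power` via its spectral decomposition.

    Assumes the matrix is symmetric positive definite with b != 0.
    """
    # Eigenvalues of a symmetric 2x2 matrix from the characteristic polynomial.
    mean = (a + d) / 2.0
    radius = math.sqrt(((a - d) / 2.0) ** 2 + b ** 2)
    result = [[0.0, 0.0], [0.0, 0.0]]
    for lam in (mean + radius, mean - radius):
        # (b, lam - a) solves (A - lam*I) v = 0 when b != 0; normalize it.
        norm = math.hypot(b, lam - a)
        v = (b / norm, (lam - a) / norm)
        # Accumulate lam**power * v v', one term of P Lambda^power P'.
        for i in range(2):
            for j in range(2):
                result[i][j] += lam ** power * v[i] * v[j]
    return result

def matmul2(x, y):
    """Product of two 2x2 matrices given as nested lists."""
    return [[sum(x[i][k] * y[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

A_half = sym2x2_power(2.0, 1.0, 3.0, 0.5)       # A^{1/2}
A_neg_half = sym2x2_power(2.0, 1.0, 3.0, -0.5)  # A^{-1/2}
product = matmul2(A_half, A_neg_half)           # should be close to I

for name, m in (("A^(1/2)", A_half), ("A^(-1/2)", A_neg_half), ("product", product)):
    print(name, [[round(entry, 3) for entry in row] for row in m])
```

Summing over eigenpairs avoids forming P explicitly, and it also sidesteps the sign ambiguity of the eigenvectors, since each term v v' is unchanged when v is replaced by -v.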