MA930
UNIVERSITY OF WARWICK
CLASS TEST 2014
Data Analysis
Time Allowed: 1.5 Hours

Full marks may be gained by correctly answering three complete questions. Candidates may attempt all questions. Marks will be awarded for the best three answers only.

Please write your name and student number on the answer booklet.

1. Summary statistics: Let $X_1, \dots, X_n$ denote a collection of numbers gathered during an experiment. We will model them as independent, identically distributed random variables with p.d.f. $f$ and c.d.f. $F$.
(a) What is a statistic?
(b) Name and briefly describe 6 statistics commonly used to summarize the distribution of a collection of numbers.
(c) What are the order statistics for $X_1, \dots, X_n$? What is their joint distribution (i.e. the multi-dimensional pdf)?
(d) Describe a simple test for checking if the $X_1, \dots, X_n$ seem to be samples from a given distribution (for example, the $N(0, 1)$ distribution).
(e) If $X_1, \dots, X_n$ pass the test from part (d), does that prove that they are independent? Briefly justify your answer.
(f) Assume that the $X_i$ are independent samples from the Uniform(0, 1) distribution. Show that $X_{(1)}$ converges to zero in probability.
(g) Assume $n$ is odd (i.e. $n = 2k + 1$). State a statistic $S$ that can be used to approximate the median of the distribution of the $(X_i)$. Using the law of large numbers, or otherwise, show that if the $X_i \sim$ Uniform(0, 1) then $S$ converges in probability.

2. Last year, 24 students took a “data analysis” viva, 12 in the morning and 12 in the afternoon. You want to determine if the time of the viva (morning or afternoon) affects the outcome. Let $X_1, \dots, X_{12}$ denote the morning scores and $Y_1, \dots, Y_{12}$ denote the afternoon scores. Assume that the students' scores can be treated as independent random variables, that the morning scores $X_i \sim N(\mu_1, \sigma^2)$ and the afternoon scores $Y_i \sim N(\mu_2, \sigma^2)$. You want to find out if $\mu_1 < \mu_2$ or $\mu_1 = \mu_2$ or $\mu_1 > \mu_2$.
(a) What is the distribution of $\bar{X} = \frac{1}{12} \sum_{i=1}^{12} X_i$ and $\bar{Y} = \frac{1}{12} \sum_{i=1}^{12} Y_i$?
(b) What is the distribution of $Z_1 = \frac{1}{\sigma^2} \sum_{i=1}^{12} (X_i - \bar{X})^2$ and $Z_2 = \frac{1}{\sigma^2} \sum_{i=1}^{12} (Y_i - \bar{Y})^2$?
(c) Use your answers to parts (a) and (b) to form 95% confidence intervals for $\mu_1$ and $\mu_2$.
(d) If the two confidence intervals from (c) overlap, is there any special statistical significance to that? (Hint: read the rest of the question before answering :-)
(e) What are the distributions of $(\bar{X} - \bar{Y})/\sigma$ and $Z_1 + Z_2$?
(f) Use your answer to part (e) to form an (exact) 95% confidence interval for $\mu_1 - \mu_2$. Does the confidence interval containing zero have any special significance?
(g) Suppose your two confidence intervals from part (c) overlap, but that your confidence interval from part (f) does not include zero. How would you interpret this outcome?

3. An experiment produces independent pairs of observations $(X_i, Y_i)$, $i = 1, \dots, n$. Assume that $Y_i = A + B X_i + e_i$ where the $e_i$ are random variables with mean zero.
(a) Define and derive least squares estimators for $A$ and $B$.
(b) Let
$$X := \begin{pmatrix} 1 & X_1 \\ \vdots & \vdots \\ 1 & X_n \end{pmatrix}, \qquad \beta := \begin{pmatrix} A \\ B \end{pmatrix},$$
and let $I_n$ denote the $n \times n$ identity matrix. Suppose now that the $e_i$ are independent $N(0, \sigma^2)$ random variables, so that $Y = (Y_i)_{i=1}^n$ has the multivariate normal distribution $N(X\beta, \sigma^2 I_n)$. Stating clearly any results you use, show that $\hat{\beta} = (X^t X)^{-1} X^t Y$ is an unbiased estimator for $\beta$ and give the distribution of $\hat{\beta}$.
(c) You may assume without proof that $\frac{1}{\sigma^2} \| Y - X\hat{\beta} \|^2 \sim \chi^2_{n-2}$ and is independent of $\hat{\beta}$. You suspect $B = \frac{1}{2}$. How can you test the hypothesis $H_0 : B = \frac{1}{2}$ against the alternative hypothesis $H_1 : B \neq \frac{1}{2}$?
(d) Pearson’s famous “fathers and sons” dataset contains the heights of 1078 pairs of fathers and sons. Here is a linear model for the heights of the sons in terms of their fathers’ heights, fitted in R.

    # Y = sons.height
    # X = fathers.height
    # Y = A + BX + e
    Estimator   Estimate   Std. Error   t value   Pr(>|t|)
    A-hat       33.88660   1.83235      18.49     <2e-16 ***
    B-hat        0.51409   0.02705      19.01     <2e-16 ***

The standard errors correspond to $S \sqrt{(X^t X)^{-1}_{11}}$ and $S \sqrt{(X^t X)^{-1}_{22}}$ with
$$S^2 = \frac{1}{1078 - 2} \sum_{i=1}^{1078} (Y_i - \hat{A} - \hat{B} X_i)^2.$$
Briefly explain how to calculate confidence intervals for $A$ and $B$.

4. Consider two collections of random variables $X_1, \dots, X_n$ and $Y_1, \dots, Y_n$. The $X_i$ all have p.d.f. $f$ and the $Y_i$ all have p.d.f. $g$.
(a) What does it mean to say that $X_1$ and $X_2$ are independent?
(b) What does it mean to say that $X_1, X_2, \dots, X_n$ are independent?
(c) If $X_1, \dots, X_n$ are independent, and $Y_1, \dots, Y_n$ are independent, is it generally the case that $X_1, \dots, X_n, Y_1, \dots, Y_n$ are independent? Give a brief proof or a counterexample.
(d) Define $E[X_i]$ and $\mathrm{Var}(X_i)$ in terms of $f$.
(e) Use the properties of expectation to show that for random variables $X$, $Y$
$$E[(X - E[X])(Y - E[Y])] = E[XY] - E[X]E[Y].$$
What is this quantity called?
(f) If $\mathrm{Var}(X_i) < \infty$ and $\mathrm{Var}(Y_i) < \infty$, and $X_i$, $Y_i$ are independent, is it always true that $\mathrm{Var}(X_i Y_i) < \infty$? Give a proof or a counterexample. Can you express $\mathrm{Var}(X_i Y_i)$ in terms of the means and variances of the $X$ and $Y$ distributions?
(g) Using theorems mentioned during the course, what can you say about the distribution of
$$C := \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y}) = \frac{1}{n} \sum_{i=1}^{n} X_i Y_i - \bar{X}\bar{Y}, \qquad \bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i, \quad \bar{Y} = \frac{1}{n} \sum_{i=1}^{n} Y_i,$$
in the case that both the $X_i$ and $Y_i$ have mean zero, variance 1, and with full independence for $X_1, \dots, X_n, Y_1, \dots, Y_n$? For large $n$, for which values of $C$ would you reject a hypothesis that the $X_i$ and $Y_i$ are independent? Your answer does not need to be exact, but you should try to be as accurate as possible in an asymptotic sense.

End

Common distributions

• $X \sim \mathrm{Bin}(n, p)$ if $P(X = k) = \frac{n!}{k!(n - k)!} p^k (1 - p)^{n-k}$.
• $X \sim \mathrm{Uniform}(a, b)$ if $f(x) = \begin{cases} 1/(b - a), & a < x < b \\ 0, & \text{otherwise.} \end{cases}$
• Exponential: $X \sim \mathrm{Exp}(\lambda)$ if $f(x) = \begin{cases} \lambda \exp(-\lambda x), & x > 0 \\ 0, & x < 0. \end{cases}$
• $X \sim N(\mu, \sigma^2)$ if it has p.d.f. $\frac{1}{\sigma \sqrt{2\pi}} \exp\left(-(x - \mu)^2 / (2\sigma^2)\right)$.
• If $X \sim N(\mu_X, \sigma_X^2)$ and $Y \sim N(\mu_Y, \sigma_Y^2)$ are independent then $aX + bY \sim N(a\mu_X + b\mu_Y, a^2 \sigma_X^2 + b^2 \sigma_Y^2)$.
• If $X_1, \dots, X_n$ are independent $N(0, 1)$ random variables then $\sum_{i=1}^{n} X_i^2 \sim \chi^2_n$.
• Student's t-distribution: if $Z \sim N(0, 1)$ and $U \sim \chi^2_n$ are independent then $\frac{Z}{\sqrt{U/n}} \sim t_n$.
• Beta: pdf $f(x) = \frac{x^{\alpha - 1} (1 - x)^{\beta - 1}}{B(\alpha, \beta)}$, mean $\frac{\alpha}{\alpha + \beta}$, variance $\frac{\alpha\beta}{(\alpha + \beta)^2 (\alpha + \beta + 1)}$.
• Gamma: pdf $f(x) = \frac{\beta^{\alpha}}{\Gamma(\alpha)} x^{\alpha - 1} \exp(-\beta x)$, mean $\frac{\alpha}{\beta}$, variance $\frac{\alpha}{\beta^2}$.
• Multivariate normal $N(\mu, \Sigma)$, $\mu \in \mathbb{R}^n$, $\Sigma$ an $n \times n$ matrix:
$$f(x) = \frac{1}{\sqrt{(2\pi)^n \det \Sigma}} \exp\left(-\frac{1}{2} (x - \mu)^t \Sigma^{-1} (x - \mu)\right).$$
If $X \sim N(\mu, \Sigma)$ and $M$ is an $m \times n$ matrix then $MX \sim N(M\mu, M \Sigma M^t)$.
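Two of the rules listed above lend themselves to a quick numerical sanity check: the rule for $aX + bY$ with independent normal $X$ and $Y$, and the sum-of-squares rule for $\chi^2_n$. The following Python sketch is an illustration only and is not an answer to any question on the paper. The parameter values, the seed and the function names are hypothetical choices, and the $\chi^2_n$ check relies on the standard fact, not listed above, that a $\chi^2_n$ variable has mean $n$.

```python
import random
import statistics


def simulate_linear_combination(a, b, mu_x, sd_x, mu_y, sd_y, n_draws, rng):
    """Draw aX + bY for independent X ~ N(mu_x, sd_x^2) and Y ~ N(mu_y, sd_y^2)."""
    return [a * rng.gauss(mu_x, sd_x) + b * rng.gauss(mu_y, sd_y)
            for _ in range(n_draws)]


def simulate_sum_of_squares(df, n_draws, rng):
    """Draw sums of squares of df independent N(0, 1) variables."""
    return [sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(df))
            for _ in range(n_draws)]


rng = random.Random(930)  # fixed seed (arbitrary) so the run is reproducible

# Hypothetical parameter values, chosen only for illustration.
a, b = 2.0, -1.0
mu_x, sd_x, mu_y, sd_y = 1.0, 0.5, 3.0, 2.0
df = 5
n_draws = 50_000

combo = simulate_linear_combination(a, b, mu_x, sd_x, mu_y, sd_y, n_draws, rng)
theory_mean = a * mu_x + b * mu_y                # a*mu_X + b*mu_Y
theory_var = a**2 * sd_x**2 + b**2 * sd_y**2     # a^2*sigma_X^2 + b^2*sigma_Y^2

chi = simulate_sum_of_squares(df, n_draws, rng)

combo_mean = statistics.fmean(combo)
combo_var = statistics.pvariance(combo)
chi_mean = statistics.fmean(chi)

print(f"aX + bY: sample mean {combo_mean:.3f}, theory {theory_mean:.3f}")
print(f"aX + bY: sample variance {combo_var:.3f}, theory {theory_var:.3f}")
print(f"sum of {df} squared N(0,1): sample mean {chi_mean:.3f}, theory {df}")
```

With this many draws the sample moments should sit close to the theoretical values. Agreement of the first two moments is only a consistency check; it does not establish the distributional claims themselves.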