Group #4
AMS 572 – Data Analysis
Professor: Wei Zhu

Lin Wang (Lana), Xian Lin (Ben), Zhide Mo (Jeff), Miao Zhang, Yuan Bian, Juan E. Mojica, Ruofeng Wen, Hemal Khandwala, Lei Lei, Xiaochen Li (Joe)

ANCOVA — Analysis of Covariance: a merge of ANOVA (Analysis of Variance) and Linear Regression.

Analysis of Variance (ANOVA)
• First described by R. A. Fisher to assist in the analysis of data from agricultural experiments.
• Compares the means of any number of experimental conditions without any increase in the Type 1 error rate (rejecting H0 when it is true).
• In psychology, ANOVA is a way of determining whether the average scores of groups differ significantly: it assesses the average effect of different experimental conditions on subjects in terms of a particular dependent variable.

R. A. Fisher (Feb. 17, 1890 – July 29, 1962): an English statistician, evolutionary biologist, and geneticist. Known for the analysis of variance (ANOVA), maximum likelihood, the F-distribution, etc.

Linear Regression
• Developed and applied in different areas than ANOVA; it grew up in biology and psychology.
• The term "regression" was coined by Francis Galton in the nineteenth century to describe a biological phenomenon. Galton studied the heights of parents and their adult children and found that short parents' children are usually shorter than average, but still taller than their parents (and, conversely, tall parents' children tend to be shorter than their parents yet taller than average): "regression toward the mean."
• Regression is applied to data obtained from correlational or non-experimental research; it helps us understand how changing an independent variable changes the value of the dependent variable.

Francis Galton (Feb. 16, 1822 – Jan. 17, 1911): an English anthropologist, eugenicist, and statistician.
• Widely promoted regression toward the mean.
• Created the statistical concept of correlation.
• A pioneer of eugenics; he coined the term in 1883.
• The first to apply statistical methods to the study of human differences.

ANCOVA
• A statistical technique that combines regression and ANOVA (analysis of variance).
• Originally developed by R. A.
Fisher to increase the precision of experimental analysis.
• Applied most frequently in quasi-experimental research, which involves variables that cannot be controlled directly.

One-way ANOVA: the data layout

Treatment 1: y_11, y_12, …, y_1n₁ (sample mean ȳ_1., sample SD s_1)
Treatment 2: y_21, y_22, …, y_2n₂ (sample mean ȳ_2., sample SD s_2)
…
Treatment a: y_a1, y_a2, …, y_an_a (sample mean ȳ_a., sample SD s_a)

The design is balanced if n_i ≡ n.

The model:
• y_ij = μ_i + ε_ij, where i = 1, 2, …, a; j = 1, 2, …, n_i
• y_ij ~ N(μ_i, σ²), ε_ij ~ N(0, σ²)
• μ_i = μ + α_i, where μ is the grand mean, so y_ij = μ + α_i + ε_ij

Estimates: μ̂ = ȳ.. (the grand mean), α̂_i = ȳ_i. − ȳ..

• The factor A sum of squares: SSA = Σᵢ n_i (ȳ_i. − ȳ..)²
• The factor A mean square, with a − 1 d.f.: MSA = SSA / (a − 1)
• The error sum of squares: SSE = Σᵢ Σⱼ (y_ij − ȳ_i.)², with N − a d.f.
• The total sum of squares: SST = Σᵢ Σⱼ (y_ij − ȳ..)²
• The ANOVA identity: SST = SSA + SSE, i.e.
  Σᵢ Σⱼ (y_ij − ȳ..)² = Σᵢ n_i (ȳ_i. − ȳ..)² + Σᵢ Σⱼ (y_ij − ȳ_i.)²

The ANOVA table:

Source     | Sum of Squares             | d.f.  | Mean Square       | F
Treatments | SSA = Σ n_i (ȳ_i. − ȳ..)²  | a − 1 | MSA = SSA/(a − 1) | F = MSA/MSE
Error      | SSE = ΣΣ (y_ij − ȳ_i.)²    | N − a | MSE = SSE/(N − a) |
Total      | SST = ΣΣ (y_ij − ȳ..)²     | N − 1 |                   |

Three models side by side:

ANOVA: Y_ij = μ + α_i + ε_ij
  Y_ij: the data, the jth observation of the ith group; μ: the grand mean of Y; α_i: the effect of the ith group (we focus on whether α_i = 0, i = 1, …, a); ε_ij: the error, N(0, σ²).

Regression: Y_ij = β_1 X_ij + β_0 + ε_ij
  X_ij: the predictor; β_1 and β_0: the slope and intercept (we focus on their estimates); ε_ij: the error.

ANCOVA: Y_ij = μ + α_i + β(X_ij − X̄..) + ε_ij
  α_i: the effect of the ith group (we still focus on whether α_i = 0, i = 1, …, a); X_ij: a known covariate (what is this guy doing here?).

Covariate adjustment: Y_ij(adjust) = Y_ij − β(X_ij − X̄..), so that
Y_ij(adjust) = μ + α_i + ε_ij — this is just the ANOVA model!

Estimating the slope: within each group, consider α_i a constant, and notice that we actually only desire the estimate of the slope β, not the intercept.
• Within each group, do least squares:
  β̂_i = Σⱼ (X_ij − X̄_i.)(Y_ij − Ȳ_i.) / Σⱼ (X_ij − X̄_i.)²
• Assume the group slopes are equal (β_1 = … = β_a).
• Then use the pooled estimate of β:
  β̂ = Σᵢ β̂_i Σⱼ (X_ij − X̄_i.)² / Σᵢ Σⱼ (X_ij − X̄_i.)²
    = Σᵢ Σⱼ (X_ij − X̄_i.)(Y_ij − Ȳ_i.) / Σᵢ Σⱼ (X_ij − X̄_i.)²

ANCOVA, step by step:
1. In each group, find the slope estimate β̂_i via linear regression on Y_ij = μ + α_i + β(X_ij − X̄..) + ε_ij.
2. Pool them together into β̂.
3. Get rid of the covariate: Y_ij(adjust) = Y_ij − β̂(X_ij − X̄..).
4. Do ANOVA on the model Y_ij(adjust) = μ + α_i + ε_ij.
5. Go home and have dinner. Yummy = Cheeseburger² + iced(Coke) + ε ?

The General Linear Model covers both Regression and ANOVA/ANCOVA.

Simple linear regression: Y = β_0 + βX + ε, where Y is the response variable, X the predictor, β_0 the intercept, β the slope, and ε the error — all of them scalars!
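The step-by-step ANCOVA recipe above can be sketched in plain Python. The numbers below are made up for illustration (they are not the slides' data set), and the final step runs a plain one-way ANOVA on the adjusted scores, following the slides' simplified recipe rather than a full ANCOVA F test:

```python
# A sketch of the ANCOVA recipe above, on made-up data:
# three groups, each with a covariate X and a response Y.
groups = {
    1: {"x": [24, 18, 21, 24], "y": [39, 36, 30, 42]},
    2: {"x": [20, 14, 12, 22], "y": [33, 30, 24, 36]},
    3: {"x": [15, 18, 9, 6], "y": [36, 30, 27, 24]},
}

def mean(v):
    return sum(v) / len(v)

# Steps 1-2: per-group least squares, pooled into one slope estimate.
# Pooling numerators and denominators is algebraically the same as
# weighting each beta_i by its group's sum of squares in X.
num = den = 0.0
for g in groups.values():
    xbar, ybar = mean(g["x"]), mean(g["y"])
    num += sum((x - xbar) * (y - ybar) for x, y in zip(g["x"], g["y"]))
    den += sum((x - xbar) ** 2 for x in g["x"])
beta = num / den

# Step 3: remove the covariate, Y_adj = Y - beta * (X - grand mean of X).
xbar_all = mean([x for g in groups.values() for x in g["x"]])
adj = {i: [y - beta * (x - xbar_all) for x, y in zip(g["x"], g["y"])]
       for i, g in groups.items()}

# Step 4: one-way ANOVA on the adjusted scores.  (A full ANCOVA would
# also charge one error d.f. for the estimated slope.)
all_adj = [y for v in adj.values() for y in v]
grand = mean(all_adj)
ssa = sum(len(v) * (mean(v) - grand) ** 2 for v in adj.values())
sse = sum((y - mean(v)) ** 2 for v in adj.values() for y in v)
a, n_total = len(adj), len(all_adj)
f_stat = (ssa / (a - 1)) / (sse / (n_total - a))
print(round(beta, 3), round(f_stat, 3))
```

The pooled slope and the F statistic printed at the end correspond to steps 2 and 4 of the recipe.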
37/85 Y ο½ Xο’ ο«ο₯ ο¦ y1 οΆ ο§ ο· ο§ ο· ο§y ο· ο¨ mοΈ ο¦ x11 ο§ ο§ ο§ xm1 ο¨ x1,( n ο1) xm ,( n ο1) 1οΆ ο· ο· 1ο·οΈ ο¦ ο’1 οΆ ο§ ο· ο§ ο· ο§ο’ ο· ο¨ nοΈ ο¦ ο₯1 οΆ ο§ ο· ο§ ο· ο§ο₯ ο· ο¨ nοΈ 38/85 Yi ο½ ο’0 ο« ο’1Zi ο« ο₯ i Outcome of the ith unit coefficient for the intersect coefficient for the slope More about the Zi : Zi =1 if unit is the treatment group Zi =0 if unit is the control group Residual for the ith unit Categorical variable (binary) 39/85 Overall mean response Yijk ο½ ο ο« ο‘i ο« ο’ j ο« ο§ ij ο« ο₯ Residual for the ith unit ijk Response variable effect due to the ith level of factor A effect due to the jth level of factor B the effect due to any interaction between the ith level of A and the jth level of B 40 The ith response variable Random Error yi ο½ ο’0 ο« ο’1 X i1 ο« ο’2 X i 2 ο« ...ο’ p1 X p1 ο« ο’ p 2 X p 2 ο« ο₯i Categorical Variables Categorical Variables Continuous Variable Continuous Variable The above formula can be simply denoted as: Y ο½ Xο’ ο«ο₯ What can this X be? Before we see an example of X, we have learned that General Linear Model covers (1) Simple Linear Regression; (2) Multiple Linear Regression; (3) ANOVA; (4) 2-way/n-way ANOVA. 41/85 X in the GLM might be expanded as Y ο½ ο’0 ο« ο’1 X1 ο« ο’2 X 2 ο« ο’3 X 3 Where X3 in the above formula could be the INTERACTION between X1 and X2 Y ο½ ο’0 ο« ο’1 X1 ο« ο’2 X 2 ο« ο’3 X1 * X 2 Did you see the tricks? Next, let us see what assumptions shall be satisfied before using ANCOVA. 42/85 Before using ANCOVA… 1. Test the homogeneity of variance 2. Test the homogeneity of regression whether H0: ο’1 ο½ ... ο½ ο’i ο½ ... ο½ ο’a 3. Test whether there is a linear relationship between the dependent variable and covariate. 43/85 For each i, calculate the MSEi MSEi ο½ SSEi / df ο½ SSEi / n ο 2 Utilize Max( MSEi )and Min( MSEi ) to do a Fmax test i i to make sure ο³ is a constant under each different levels. F=Max(MSEi ) / Min(MSEi ) 44/85 ο’1 ο½ ... ο½ ο’i ο½ ... ο½ ο’a (1) 45/85 ο’1 ο½ ... 
ο½ ο’i ο½ ... ο½ ο’a (2) a (1) Define SSE G ο½ ο₯ SSEi i ο½1 SSE G Sum of Square of Errors within Groups SSEi Is calculated based on ο’ˆi AND, SSE G is generated by the random error ο₯ . 46/85 ο’1 ο½ ... ο½ ο’i ο½ ... ο½ ο’a (3) (2) SSE is generated by • Random Error ο₯ • Difference between distinct ο’ˆi We can calculate SSE based on a common ο’ˆ (3) Let SSB=SSE – SSEG. SSB Sum of Square between Groups SSB is constituted by the difference between different ο’ˆ i 47/85 ο’1 ο½ ... ο½ ο’i ο½ ... ο½ ο’a (4) df b ο½ df e ο df G e ο½ [a (n ο 1) ο 1] ο a (n ο 2) ο½ a ο 1 MSB ο½ SSB / df b ο½ SSB / a ο 1 MSE G ο½ SSE G / df eG ο½ SSE G / a (n ο 2) MSB MSE G Mean Square between Groups Mean Square within Groups Do F test on MSB and MSEG to see whether we can reject our HO F=MSB / MSEG 48/85 Assumption 3: Test a linear relationship between the dependent variable and covariate. Ho: ο’ = 0 How to do it? F test on SSR and SSE Sum of Square of Regression 49/85 How to calculate SSR and MSR? ˆ ο«ο’ ˆ x ˆi ο½ ο’ y 0 1 i From each xi yˆ i SSR is the difference obtained from the summation of the square of the differences between yˆ i and y . n SSR ο½ ο₯ ( yˆi ο y )2 i ο½1 MSR ο½ SSR /1 50/85 How to calculate SSE and MSE? ˆ ο«ο’ ˆ x ˆi ο½ ο’ y 0 1 i From each xi yˆ i SSE is the error obtained from the summation of the square of the differences between yi and y ˆ i. n SSE ο½ ο₯ ( yi ο yˆi )2 i ο½1 MSE ο½ SSE /(n ο 2) 51/85 MSR Fο½ MSE Based on the T.S. we determine whether to accept H0 ( ο’ ο½ 0 ) or not. Assume Assumptions 01 and 02 are already passed. • If H0 is true ( ο’ ο½ 0 ),we do ANOVA. • Otherwise, we do ANCOVA. So, anytime we want to use ANCOVA, we need to test the three assumptions first! 52/85 53/85 • In this hypothetical study, a sample of 36 teams (id in the data set) of 12-year-old children attending a summer camp participated in a study to determine which one of three different tree-watering techniques worked best to promote tree growth. 
Techniques Frequency Code Watering the base with a hose 10 minutes once per day 1 Watering the ground surrounding (drip system) 2 hours each day 2 Deep watering (sunk pipe) 10 minutes every 3 days 3 54/85 • From a large set of equally sized and equally healthy fast-growing trees, each team was given a tree to plant at the start of the camp. • Each team was responsible for the watering and general care of their trees • At the end of the summer, the height of each tree was measured. 60/85 • that some children might have had more gardening experience than others, and • that any knowledge gained as a result of that prior experience might affect the way the tree was planted and perhaps even the way in which the children cared for the tree and carried out the watering regime. How to approach? Create a indicator for that knowledge. (i.e. a 40 point scale gardering experience) 61/85 Grouping (1,2,3) Dependend Variable id watering technique 1 2 3 4 ……. 32 33 34 35 36 1 1 1 1 ……… 3 3 3 3 3 tree growth dv 39 36 30 42 ……….. 36 30 39 27 24 Covariate Variable gardening exp cov 24 18 21 24 ……… 15 18 18 9 6 Real Data 62/85 Grouping (1,2,3) Dependend Variable id 1 2 3 4 ……. 32 33 34 35 36 Overall Mean tree Response watering technique 1 1 1 1 ……… 3 3 3 3 3 growth dv 39 36 30i 42 ……….. 36 30 39 27 24 Covariate Variable gardening exp cov Residual error 24 18 21 ij 24 ……… 15 Regression coefficient parameter. 18 18 9 6 Yij ο½ ο ο« ο‘ ο« ο’ ( X ο X ..) ο« ο₯ij Real Data 63/85 Homogenity of Regression Homogenity of Variance and dv is Normal Linearity of Regression ANCOVA SAS 64/85 ο² X ,Y ο½ cov( X , Y ) ο³ Xο³Y ο½ E[( X ο ο X )(Y ο οY )] ο³ Xο³Y ο½ ο₯ ο₯ n n i ο½1 ( X i ο X )(Yi ο Y ) 2 ( X ο X ) i i ο½1 ο₯ n 2 ( Y ο Y ) i i ο½1 The Pearson correlation coefficient between the covariate and the dependent var.is .81150. 65/85 Assumptions Clearly a strong linear component to the relationship. 
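The sample correlation formula above can be checked directly against the data rows that are actually listed (ids 1–4 and 32–36). Note the slides' value of .81150 uses all 36 rows, so this visible subset gives a somewhat different number:

```python
# Sample Pearson correlation, following the slides' formula, computed
# on the nine data rows listed above (ids 1-4 and 32-36 only).
cov = [24, 18, 21, 24, 15, 18, 18, 9, 6]    # gardening experience
dv = [39, 36, 30, 42, 36, 30, 39, 27, 24]   # tree growth

n = len(cov)
xbar = sum(cov) / n
ybar = sum(dv) / n
sxy = sum((x - xbar) * (y - ybar) for x, y in zip(cov, dv))
sxx = sum((x - xbar) ** 2 for x in cov)
syy = sum((y - ybar) ** 2 for y in dv)
r = sxy / (sxx * syy) ** 0.5
print(round(r, 4))
```

Even on this subset, r is large and positive, consistent with the strong linear relationship seen in the scatter plot.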
The linearity-of-regression assumption appears to be met by the data set.

The assumption of homogeneity of regression is tested by examining the interaction of the covariate and the independent variable. If it is not statistically significant, as is the case here, then the assumption is met.

The model contains the effects of both the covariate and the independent variable; the two effects are evaluated separately in the summary table.

Watering techniques coded as 1 (hose watering) and 3 (deep watering) are the only two groups whose means differ significantly.

• We can assert that prior gardening experience and knowledge were quite influential in how well the trees fared under the attention of the young campers.
• When we statistically control for, or equate, the gardening experience and knowledge of the children, the watering technique was a relatively strong factor in how much growth was seen in the trees.
• On the basis of the adjusted means, we may therefore conclude that, when we statistically control for gardening experience, deep watering is more effective than hose watering but is not significantly more effective than drip watering.

Running it in SAS (Enterprise Guide):
• Identify the GROUP VARIABLE, DEPENDENT VARIABLE, and COVARIATE.
• Tasks -> Graph -> Scatter Plot (to check linearity).
• Tasks -> ANOVA -> Linear Models (to fit the ANCOVA model).
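For readers without SAS, the model that the Linear Models task fits (growth = intercept + covariate + group effects) can be reproduced by ordinary least squares. This is a plain-Python sketch on made-up rows patterned after the example data; the dummy coding uses group 3 as the reference level, mirroring SAS's convention of treating the last class level as the reference:

```python
# OLS fit of the ANCOVA model  dv = b0 + b1*cov + group effects,
# on made-up (group, cov, dv) rows patterned after the example.
rows = [
    (1, 24, 39), (1, 18, 36), (1, 21, 30), (1, 24, 42),
    (2, 20, 33), (2, 14, 30), (2, 12, 24), (2, 22, 36),
    (3, 15, 36), (3, 18, 30), (3, 9, 27), (3, 6, 24),
]

# Design matrix: intercept, covariate, dummies for groups 1 and 2
# (group 3 is the reference level).
X = [[1.0, c, 1.0 if g == 1 else 0.0, 1.0 if g == 2 else 0.0]
     for g, c, _ in rows]
y = [float(d) for _, _, d in rows]

# Solve the normal equations (X'X) beta = X'y by Gaussian elimination
# with partial pivoting.
p = len(X[0])
m = len(X)
A = [[sum(X[i][r] * X[i][c] for i in range(m)) for c in range(p)]
     for r in range(p)]
b = [sum(X[i][r] * y[i] for i in range(m)) for r in range(p)]
for col in range(p):
    piv = max(range(col, p), key=lambda r: abs(A[r][col]))
    A[col], A[piv] = A[piv], A[col]
    b[col], b[piv] = b[piv], b[col]
    for r in range(col + 1, p):
        f = A[r][col] / A[col][col]
        for c in range(col, p):
            A[r][c] -= f * A[col][c]
        b[r] -= f * b[col]
beta = [0.0] * p
for r in range(p - 1, -1, -1):
    beta[r] = (b[r] - sum(A[r][c] * beta[c]
                          for c in range(r + 1, p))) / A[r][r]
print([round(v, 3) for v in beta])  # [intercept, slope, g1, g2 effects]
```

The fitted slope is the pooled within-group regression coefficient, and the group coefficients are the covariate-adjusted differences from the reference group, which is what the adjusted-means comparison in the conclusions is based on.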