Coding

Coding: With class variables, the simplest model is $Y = \mu + \alpha_i$, where $\mu$ is an overall constant (possibly a mean) and each $\alpha_i$ is the effect of one level ($i$) of the class variable. Now suppose you know (or have estimates of)
$\mu + \alpha_1 = 12$
$\mu + \alpha_2 = 21$
$\mu + \alpha_3 = 15$
From middle school algebra or just common sense, you know that many choices of the Greek letters will work, for example $\mu = 0$, $\alpha_1 = 12$, $\alpha_2 = 21$, $\alpha_3 = 15$. The three equations above show that these values are really estimates of $\mu + \alpha_1$, $\mu + \alpha_2$, and $\mu + \alpha_3$. This solution, in which $\mu$ is arbitrarily set to 0, is called “cell means coding,” but it only takes care of data sets with one classificatory variable.
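The claim that many solutions exist can be checked numerically. The following is a minimal sketch using NumPy; the second candidate solution (with $\mu = 10$) is an arbitrary value chosen here for illustration and does not come from the text.

```python
import numpy as np

# One row per group; columns are mu, alpha_1, alpha_2, alpha_3.
A = np.array([[1, 1, 0, 0],
              [1, 0, 1, 0],
              [1, 0, 0, 1]], dtype=float)
means = np.array([12.0, 21.0, 15.0])

# Three independent equations in four unknowns: no unique solution.
rank = np.linalg.matrix_rank(A)

# Two candidate solutions, written as (mu, alpha_1, alpha_2, alpha_3).
cell_means_solution = np.array([0.0, 12.0, 21.0, 15.0])  # mu set to 0
shifted_solution = np.array([10.0, 2.0, 11.0, 5.0])      # mu set to 10 (illustrative)

print("rank:", rank, "unknowns:", A.shape[1])
print(A @ cell_means_solution)  # reproduces the three group values
print(A @ shifted_solution)     # so does the shifted solution
```

Any constant can be moved between $\mu$ and the alphas without changing the three sums, which is why one extra restriction is needed to pin the parameters down.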
Cell means coding was one specific solution to the system. When we add the three equations above and divide both sides by 3 we see that $\mu + \bar{\alpha} = 16$, where $\bar{\alpha}$ is the average of the three alphas. Another solution comes from

$$
\begin{pmatrix} 1 & 1 & 0 \\ 1 & 0 & 1 \\ 1 & -1 & -1 \end{pmatrix}
\begin{pmatrix} \mu \\ \alpha_1 \\ \alpha_2 \end{pmatrix}
=
\begin{pmatrix} 12 \\ 21 \\ 15 \end{pmatrix}
$$

where under the restriction that $\sum_i \alpha_i = 0$ (which means the alphas sum to 0) we have $\mu = 16$, $\alpha_1 = -4$, $\alpha_2 = 5$, $\alpha_3 = -1$. This has the nice property that $\mu$ is the average of 12, 21, and 15 and the deviations, -4, 5, -1, add up to 0. The restriction that $\alpha_3 = -\alpha_1 - \alpha_2$ produces the last row of the matrix above, and this particular solution is known as “deviations coding” or sometimes “effects coding.” For estimation, this solution can be obtained by repeating each row of the left matrix once for every observation in the corresponding group and entering the responses Y in the right-hand vector. Regression is then used to estimate the parameters. The parameters (or parameter estimates) now are 16, -4 and 5, where 16-4=12, 16+5=21, and 16-(-4+5)=15 are the expected means for the three groups. In other words we see that
1 1 0  16  12 

   
1 0 1   4    21 .
1  1  1 5  15 

   
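The estimation recipe described above (repeat each design row once per observation, then run a regression) might look like the sketch below. The responses are hypothetical, invented only so that the three group means come out to 12, 21, and 15; the variable names are likewise illustrative.

```python
import numpy as np

# Effects (deviations) coding: one design row per group.
rows = np.array([[1, 1, 0],
                 [1, 0, 1],
                 [1, -1, -1]], dtype=float)

# Hypothetical responses, made up so the group means are 12, 21 and 15.
y_by_group = [[11.0, 13.0], [20.0, 21.0, 22.0], [14.0, 16.0]]

# Repeat each group's row once per observation and stack the responses.
X = np.vstack([np.tile(rows[g], (len(ys), 1))
               for g, ys in enumerate(y_by_group)])
y = np.concatenate(y_by_group)

# Ordinary least squares estimate of (mu, alpha_1, alpha_2).
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# The third effect follows from the sum-to-zero restriction.
alpha_3 = -(beta[1] + beta[2])

print(np.round(beta, 6))         # estimates of mu, alpha_1, alpha_2
print(round(float(alpha_3), 6))  # implied alpha_3
print(np.round(rows @ beta, 6))  # fitted group means
```

Because the model has as many parameters as groups, the fitted values equal the group means, so the estimates should agree with 16, -4, and 5 up to floating-point error even though the made-up groups have different sizes.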
1 1 0     3  12 


  
In GLM coding, the equation for the particular solution becomes 1 0 1  1   3    21 and the
1 0 0      15 
3

 2
 
1 1 0  15  12 

   
solution vector entries are 15, -3, and 6 because 1 0 1   3    21 . In GLM coding, the arbitrary
1 0 0  6  15 

   
assumption that 3 = 0 renders the other parameters estimable as can be seen by setting 3 to 0
everywhere in the parameter vector. In other words because +3 is (or is estimating) 12, the
(arbitrary) assumption that 3 =0 (a fourth equation in the 4 unknowns) means that we know what  is
(or that  is estimable in the case of data).
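As a rough check of the GLM-coding arithmetic, the sketch below solves the 3-by-3 system directly and compares the result with the same three combinations computed from the effects-coding values 16, -4, 5, -1. It assumes the last group is the reference level, as in the text; the variable names are illustrative.

```python
import numpy as np

# GLM (reference-level) coding: the third group is the reference.
G = np.array([[1, 1, 0],
              [1, 0, 1],
              [1, 0, 0]], dtype=float)
means = np.array([12.0, 21.0, 15.0])

# Solves for (mu + alpha_3, alpha_1 - alpha_3, alpha_2 - alpha_3).
glm_params = np.linalg.solve(G, means)

# The same combinations built from the effects-coding solution.
mu, a1, a2, a3 = 16.0, -4.0, 5.0, -1.0
from_effects = np.array([mu + a3, a1 - a3, a2 - a3])

print(glm_params)
print(from_effects)
```

Both vectors should agree, which illustrates that the two codings describe the same three group means and differ only in the restriction used to make the parameters identifiable.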