BIVARIA'TE REGRESSION ANALYSIS A. I n performing t - t e s t s , one compares two groups on an a t t r i b u t e (e.g., vs. Coke d r i n k e r s on number o f c a v i t i e s ) . Pepsi L e t us c o n s i d e r t h e f o l l o w i n g variable: x = { 1 i f a Coke d r i n k e r 2 i f a Pepsi d r i n k e r We can p l a c e t h i s v a r i a b l e along t h e h o r i z o n t a l a x i s below: 1 (Coke) X = 2 (Pepsi) favorite soft drink Now we can sketch d i s t r i b u t i o n s o f a dependent v a r i a b l e (e.g., number o f c a v i t i e s i n f i v e years) along v e r t i c a l axes f o r each t y p e o f s o f t d r i n k . To i l l u s t r a t e , l e t us assume t h a t fi2 > fil (i.e., more c a v i t i e s than Coke d r i n k e r s ) . t h a t Pepsi d r i n k e r s have Thus i f t h e d a t a were l i k e t h i s , we might conclude t h a t " t a k i n g t h e Pepsi challenge" ( i . e . , d r i n k i n g Pepsi i n s t e a d o f Coke) y i e l d s a n e t increase o f 2 c a v i t i e s p e r year. We know t h i s because t h e r e i s an average increase o f 10 (25-15) c a v i t i e s p e r 5 years and (10/5)=2. B. Now l e t ' s assume t h a t we have done a chemical a n a l y s i s o f Coke and Pepsi and f i n d t h a t Coke has 2 mg. sugar whereas Pepsi has 6 mg. sugar (per 16 oz. can). With t h i s new i n f o r m a t i o n we can examine t h e r e l a t i o n between c a v i t i e s and sugar i n sodas: Instead o f 1 = Coke and 2 - Pepsi on t h e X-axis, we have 2 mg. sugar f o r Coke d r i n k e r s and 6 mg. sugar f o r Pepsi d r i n k e r s . Once t h i s i s done we can phrase t h e s u g a r - b y - c a v i t i e s a s s o c i a t i o n more g e n e r a l l y as 26 c a v i t i e s i n 5 years p e r a d d i t i o n a l m i l l i g r a m o f sugar (per 16 oz. can). Here's t h e math: (25-15) (6-2 increase i n Y increase i n x - C. Note t h a t we can a l s o use t h e d a t a t o p r e d i c t t h e number o f c a v i t i e s per f i v e years f o r someone who d r i n k s a soda w i t h 4 mg. o f sugar p e r 16 oz. can (i.e., more sugar than Coke has, b u t l e s s than Pepsi does). To f i g u r e t h i s o u t you must s t a r t w i t h a b a s e l i n e value l i k e t h e 15 c a v i t i e s i n 5 years t h a t Coke d r i n k e r s have on average. We know (from above) t h a t we would expect an a d d i t i o n a l 2 t c a v i t i e s f o r each a d d i t i o n a l m i l l i g r a m sugar. Thus i f t h e soda has 2 more m i l l i g r a m s than Coke does, one would p r e d i c t 5 more c a v i t i e s i n 5 years than t h e average f o r Coke d r i n k e r s . More math: 1 5 + ( 2 t x [ 4 - 2 ] ) = 2 0 D. Now l e t ' s p r e d i c t t h e number o f c a v i t i e s p e r year f o r someone who d r i n k s a suciar-free brand o f soda. S u b s t i t u t i n g zero f o r 4 i n t h e above equation, t h e associated p r e d i c t i o n equals 10: 1 5 + ( 2 t x [ O - 2 ] ) = 1 0 1. Given a s u g a r - f r e e soda's p r e d i c t e d v a l u e f o r t h e number o f c a v i t i e s (Y) p e r 5 years, one can c o n s t r u c t a PREDICTION EQUATION f o r t h e p r e d i c t e d value o f Y given X, where X i s Y's corresponding number o f m i l l i g r a m s 128 T h i s equation takes t h e f o l l o w i n g sugar c o n t e n t p e r 16 oz. o f soda. form: p r e d i c t e d Y-value = (a constant) + 2tX 2. N o t i c e t h a t X=O f o r any s u g a r - f r e e soda, l e a v i n g t h e soda's p r e d i c t e d v a l u e o f Y equal t o t h e constant. More g e n e r a l l y put, t h e constant i n t h e equation equals t h e p r e d i c t e d value o f Y when X equals zero. The r e s u l t i n g r e q r e s s i o n eauation i s as f o l l o w s : A Y 10 = + 2tX E. SOME COMMENTS ABOUT THIS EQUATION: 1. We can p l o t t h i s equation as f o l l o w s : Y axis = Dependent v a r i a b l e X axis = Independent v a r i a b l e NOTE: I n o u r discussions, t h e l e t t e r s Y and X w i l l g e n e r a l l y be assigned i n t h i s way!!! (You might t r y p l o t t i n g these X and Y coordinates w i t h i n t h e graph on page 127 o f these Lecture Notes.) 2. The equation i s o f t h e form, Y = a + bX a. "a" i s c a l l e d t h e y - i n t e r c e p t , because i t i s t h e p o i n t a t which t h e r e g r e s s i o n l i n e " i n t e r c e p t s " t h e Y-axis ( i . e , when X=O). b. "b" i s c a l l e d t h e SLOPE. The l a r g e r "b" i s , t h e " s t e e p e r " t h e slope and thus t h e more u n i t s o f Y a r e p r e d i c t e d by a u n i t - s h i f t i n X. NOTICE t h a t t h e "magnitude" o f t h e slope depends upon t h e UNITS you are t a l k i n g about: We can j u s t as e a s i l y c o n s i d e r Y ' = # c a v i t i e s A l p e r year, which would render t h e r e g r e s s i o n equation as 10 + fX . Clearly both eauations p r e d i c t e a u a l l v w e l l . 129 Y = In this regard, c o n s i d e r t h e f o l l o w i n g two p o i n t s : 1) Since a l i n e w i t h ZERO slope suggests no a s s o c i a t i o n , i t appears t h a t l a r g e r slopes imply l a r g e r a s s o c i a t i o n s ( i f u n i t s a r e consistent). 2) Ift h e v a r i a b l e s had "standard u n i t s " (as would standardized v a r i a b l e s ) , then slopes' magnitudes would r e f l e c t t h e magnitudes o f associations. 3. The slope o f 2; i s POSITIVE. I f a slope i s p o s i t i v e , t h e r e g r e s s i o n l i n e r i s e s t o t h e r i g h t ; i f i t i s NEGATIVE, t h e r e g r e s s i o n l i n e f a l l s t o the r i g h t . (For example, one might f i n d a n e g a t i v e a s s o c i a t i o n between c i t i e s ' instances o f leukemia and t h e number o f m i l e s they a r e l o c a t e d from t o x i c waste dump s i t e s . ) Thus i t i s w i t h t h e s i g n o f t h e s l o p e t h a t you can express t h e d i r e c t i o n o f a l i n e a r s t a t i s t i c a l r e l a t i o n s h i p . There are " f l o o r " and " c e i l i n g " e f f e c t s . 0 i i sawm "Sure yourpoflents haw 5096kwer cavities. That's b a u s e h e y houe 50% Jeruor Loah!" a. It makes no sense to speak of "minus" cavities or "minus" mg. sugar. b. Sooner or later you have no teeth left to get cavities in. 5. The prediction equation is a formula for a straight line. relation between X +Y If the is not "linear," you might try a curved line. (Later in the course we shall be fitting curved lines to data.) NOTE how a curved line might better take into account the ceiling effects just mentioned: X = mg. sugar consumption per day F. That was easy, right? Just find the means, draw a line, and write down an equation. Well, not quite. Note, for example, that the mean numbers of cavities for drinkers of these and other types of soda are not likely to "line up" into a straight line! Let's add a few other soft drinks into the analysis: Dr. Pepper (21 mg./16 Diet Dr. Pepper (7 mg./16 Tab (1 mg./16 etc. 02.) 02.) 02.) And if we were to add "switchers," who drink diet sodas when they're feeling guilty and Pepsi when they are not? FINALLY, we c o u l d work up a measure o f sugar consumption and f o r g e t sodas I f we d i d t h i s t h e s c a t t e r p l o t would probably l o o k l i k e t h i s : altogether. many . . . . . ,. .. . .. ' Y = # cavities i n 5 years . .. . . ... . . .. I few 1ow high X = mg. sugar consumption per day G. The problem now i s t o decide what i s t h e b e s t c r i t e r i o n f o r choosinq one of t h e many p o s s i b l e p r e d i c t i o n l i n e s t h a t c o u l d be drawn through these data. One c r i t e r i o n s e l e c t e d by s t a t i s t i c i a n s i s c a l l e d t h e O r d i n a r y Least Squares ( o r OLS) c r i t e r i o n o f s e l e c t i n g a r e g r e s s i o n l i n e . This c r i t e r i o n s p e c i f i e s t h a t you CHOOSE THAT LINE WHICH MINIMIZES THE SQUARED DEVIATIONS OF YOUR OBSERVATIONS FROM THE LINE. That i s , you want min [ X A ( Yi - Y i ) 2 ] , v^ RECALL ( f r o m p. 9 o f these L e c t u r e Notes) t h a t minimizes X - ( Xi - X )' . where A A Y 1. = a + A bX i ' T( i s t h e number t h a t What i s being minimized here i s s i m i l a r , except i n t h a t we a r e now m i n i m i z i n g t h e squared d e v i a t i o n s from a LINE ( i .e., from a "mean" w i t h a v a l u e t h a t depends upon t h e v a l u e o f X). o f you who have had c a l c u l u s can v e r i f y q u i c k l y t h a t t h i s c r i t e r i o n r e q u i r e s t h a t two c o n d i t i o n s a r e met: 1. Such a v e r i f i c a t i o n i s made as f o l l o w s : Those Abducted by an d e n circus c a n p w , Roluror Doyle is forced w write dculus quatlonr in center rin#. A A A l a. F i r s t l e t Y 1. = a + b X i ' where b. then f i n d a aa E ( Yi A - Yi )2 XI ( Xi - = a a; and - E X ) , ( Y i - Y iA )2 c. These two d e r i v a t i v e s w i l l g i v e you r e s p e c t i v e l y t h e f i r s t and second c r i t e r i a r e q u i r e d b~ an OLS r e q r e s s i o n line. A 2. The f i r s t c r i t e r i o n i s t h a t E ( Yi - Yi ) = 0 . these L e c t u r e Notes] t h e p a r a l l e l equivalence t h a t - A (RECALL [ f r o m p. 9 o f E ( Xi - X ) = 0 .) A- =Y-a-bX A OR a = - ;X A - which i s t h e computation formula f o r a ! ! ! Y b. N o t i c e a l s o t h a t = + X; i m p l i e s t h a t t h e p o i n t , (X, V), must T h i s i s a good check t o make when f a l l on t h e r e g r e s s i o n l i n e . p l o t t i n g a regression l i n e . c. IMPLICATIONS o f t h e f i r s t OLS requirement: 1) Say we a r e comparing income and education and we have t h e f o l l o w i n g two p a i r s o f observations: (12-years education, $15,000 ) and (16-years education, $25,000) For t h e f i r s t person i n t h e sample, we can choose a p r e d i c t e d A A value ( v i z . , Y1) A ( Yi - Yi X constraint, and then determine what Y 2 would be under t h e ) = 0 . Since a t t h i s p o i n t we can t a k e any value, l e t ' s a r b i t r a r i l y t a k e . A Y1 t h e value o f ( Yl - $5,000 ) ( $15,000 OR = + Y2 = $35,000 According t o t h e c o n s t r a i n t , A ( Y2 - Y2 ) = 0 - $5,000 A Thus $5,000 <- ) + ($25,000 - q2 ) = 0 . a BAD p r e d i c t i o n ! We can GRAPH t h e observations and p r e d i c t i o n s as f o l l o w s : 20 Y = income 15 line 2 Y1 12 14 X = years o f education 16 Choosing another p r e d i c t i o n f o r Y1, say $20,000, and evoking t h e constraint gives a d i f f e r e n t r e s u l t : ( Yl - $20,000 ) + A l ( Y2 - Y2 ) + ( $15,000 - $20,000 ) OR A 1 Al S o l v i n g f o r Y2, we g e t Y2 = = . 0 ($25,000 $20,000 . . ) = 0 NOTICE t h a t i f y i s t h e p r e d i c t o r f o r any value o f X ( o t h e r t h a n - I n t h i s example t h i s X), t h e n i t i s t h e p r e d i c t o r f o r ALL Ys. suggests t h a t no m a t t e r what one's education i s , t h e b e s t p r e d i c t o r o f income i s t h e o v e r a l l mean, $20,000. education i s of no value i n p r e d i c t i n g income. I n o t h e r words, This uninformative r e l a t i o n s h i p i s d e p i c t e d i n t h e graph as l i n e 2. Also be sure t o note t h a t t h e p o i n t a t which t h e two l i n e s i n t e r s e c t i s ( R, y ). Ti As a m a t t e r o f f a c t , ANY l i n e t h a t passes through t h e p o i n t , ( = 14-years education, Y s a t i s f i e s t h e f i r s t OLS $20,000 ) = constraint!!! E ( Y - 2) CONCLUSION: The c o n s t r a i n t t h a t ! )' = 0 only requires X, 7 t h a t t h e r e g r e s s i o n l i n e passes through t h e p o i n t , ( A 3) NOTICE a l s o t h a t s i n c e that X' = X - X , a then = Y- X' = 2~ , 0 ). t h a t i f we t r a n s f o r m X such and A Y. a = T h i s can be a A u s e f u l t r a n s f o r m a t i o n , because you then know t h a t when - A b > 0 the t p r e d i c t e d value, Y, w i l l be above t h e mean, Y, when X and w i l l be below t h e mean when X' i s negative. (When i s positive 2<0 , t h e converse w i l l be t r u e . ) 3. The second OLS c r i t e r i o n i s t h a t X (X - X) (Y A - Y) = 0 . Let's i n v e s t i g a t e what t h i s means: A a. Whereas t h e f i r s t c r i t e r i o n (viz., 135 X ( Y - Y) = 0) does no more than r e q u i r e t h a t t h e r e g r e s s i o n l i n e pass through t h e p o i n t , t h e second c r i t e r i o n (viz., - X) (Y Z (X ( X, V ) , A - Y) = 0) determines t h e " b e s t " r e g r e s s i o n l i n e from a l l p o s s i b l e l i n e s through t h i s p o i n t . b. Consider t h e t h r e e l i n e s discussed p r e v i o u s l y : line 1 x-X Y Person 1 : -2 15 Person 2 : 2 25 T o t a l (Z): 0 (-9) ( x ) 10 4. O.K. 9 9 -20 x -5 -Y ( Y - 9 ) (x-X) ( Y - 9 ) 10 0 0 -10 -20 5 10 0 0 0 -40 0 20 0 0 I' Recall t h a t line 3 line 2 C ( Xi - X ) = 0 . C closer to Zero! o3 (And i t makes sense t o o ! ) Now t h a t we understand what t h e second c o n d i t i o n means, l e t ' s see how t o use i t : T h i s can be s i m p l i f i e d as f o l l o w s : a. Numerator: C (X-X)(Y -V) = ZXY - Z X Y - C X Y = CXY-VZX-T(ZY+nW = CXY Z Y C X - 7 Z X - CY n + E n + CX n-- n CY n :x = I:XY - I: I = I:XY - (I: Y ) ( Z X) - (I: X)(C Y) n + n I: XY = n (I: Y) (I: X) n - nrjV b. Denominator: I:(X - x12 = x2) + I:x2 I: (x2 - 2xx + = I:x2 - = E x 2 - 2- 2xcx I:X n + n-- I : X n c. Thus t h e computation formula f o r E X n i s as f o l l o w s : (I: X)(I: Y ) - H. T h i s formula becomes even s i m p l e r i f we use STANDARDIZED VARIABLES. 1. R e c a l l t h a t standardized v a r i a b l e s are obtained by s u b t r a c t i n g a v a r i a b l e from i t s mean and d i v i d i n g t h i s d i f f e r e n c e by i t s standard deviation. Z = x-S( A ax That i s , , where A ox = J I: (X - x12 n - 1 2. We have shown e a r l i e r (Lecture Notes, p. 25) t h a t and t h a t That i s , i f Z i s a standardized random v a r i a b l e , A A I.N o t i c e what happens t o a and b ( i n A A Zy = a + Z = "2 0 and oZ = 1 . A bZX ) when X and Y have been standardized: 1. When v a r i a b l e s a r e standardized, we s h a l l no l o n g e r use t h e l e t t e r "b", b u t s w i t c h t o t h e Greek l e t t e r beta (8) which stands f o r t h e standardized r e g r e s s i o n c o e f f i c i e n t . (On SPSS r e g r e s s i o n o u t p u t , you w i l l n o t i c e t h a t t h e r e i s a column f o r "b" and one f o r "beta". t h a t t h e r e i s no constant i n t h e beta-column. 2. It i s u n f o r t u n a t e t h a t t h e symbol, w i l l recall that 8 8, Notice Now you know why.) has two meanings. That i s , you a l s o stands f o r t h e p r o b a b i l i t y o f a Type I 1 e r r o r Be sure n o t t o confuse these two uses o f t h e ( L e c t u r e Notes, p. ? ) . same symbol. J. NOTATION on r e g r e s s i o n c o e f f i c i e n t s unstandardized A estimate b parameter b standardized Note how t h i s r e t a i n s t h e convention t o designate an e s t i m a t e by p l a c i n g a " h a t " over t h e parameter t h a t i t estimates. K. I n t e r p r e t i n g constants and unstandardized r e g r e s s i o n slopes i n b i v a r i a t e r e g r e s s i o n equations 1. Computers a r e adept a t g e n e r a t i n g t h e r e g r e s s i o n c o e f f i c i e n t s you request. Yet i t i s up t o you, t h e s t a t i s t i c i a n , t o i n t e r p r e t these numbers. P o s s i b l y t h e best way t o e x p l a i n how a r e g r e s s i o n s l o p e should be i n t e r p r e t e d i s by g i v i n g an i n c o r r e c t r e n d e r i n g and t h e n by p o i n t i n g o u t what i s missing. a. An i n c o r r e c t l y rendered r e g r e s s i o n slope: "The slope i n d i c a t e s t h a t a change o f t h r e e u n i t s on t h e income measure would be estimated f o r each 1 u n i t change on t h e education measure." b. What i s missing: 1) No i n d i c a t i o n i s g i v e n whether t h e slope i s p o s i t i v e o r negative. 2) No i n d i c a t i o n i s g i v e n o f what t h e u n i t s are on t h e independent and dependent v a r i a b l e s . 3) No i n d i c a t i o n i s g i v e n o f what t h e u n i t o f a n a l y s i s (here, U.S. residents) i s . c. A c o r r e c t l r rendered r e g r e s s i o n slope: "The slope i n d i c a t e s t h a t an increase o f t h r e e thousand d o l l a r s o f annual income would be estimated f o r each a d d i t i o n a l year o f a U.S. r e s i d e n t ' s education." 2 . Likewise, t h e constant i n a r e g r e s s i o n equation i s more than " t h e y intercept." Thus, i t would be incomplete t o r e f e r t o a c o n s t a n t as, f o r example, " t h e value on t h e income v a r i a b l e when t h e education measure 139 equals zero." A c o r r e c t r e n d e r i n g would be one l i k e t h e f o l l o w i n g : "One would e s t i m a t e U.S. r e s i d e n t s w i t h no educations t o have annual incomes o f $15,000." L. IMPORTANT: A regression l i n e explains p a r t o f the " v a r i a b i l i t y " ( o r variance) i n t h e dependent v a r i a b l e . 1. Please note t h e d i s t i n c t i o n here between t h e v a r i a n c e " i n " a v a r i a b l e ( i . . SSy = C ( Y -y )2 ) and t h e v a r i a n c e " o f " a v a r i a b l e 2. When s t a t i s t i c i a n s speak o f t h e v a r i a n c e " i n " a v a r i a b l e , t h e y a r e u s u a l l y r e f e r r i n g t o something c a l l e d a SUM OF SQUARES. the variance i n t h e variable Y i s For example, C (Y - v ) ~. Whenever Y i s t h e dependent v a r i a b l e , t h i s v a r i a n c e i s c a l l e d t h e " T o t a l Sum o f Squares." 3. I f you t h i n k o f s c i e n t i f i c research as a game w i t h s p e c i f i c r u l e s t h a t s p e c i f y (among o t h e r t h i n g s ) which s t a t i s t i c s a r e a p p r o p r i a t e under which circumstances, t h e GOAL o f t h i s game i s t o f i n d evidence o f causal r e l a t i o n s among phenomena. T h i s evidence always c o n s i s t s ( i n p a r t ) o f a c o v a r i a t i o n between measures o f cause and e f f e c t . Thus, i t i s reasonable t o t h i n k o f t h e goal i n t h e " s t a t i s t i c s game" as being one o f e x ~ l a i n i n qas much o f t h e v a r i a n c e i n t h e dependent v a r i a b l e as you can. T h i s i s done by p a r t i t i o n i n q t h e t o t a l sum o f squares i n t o an e x p l a i n e d p a r t ( t h a t c o v a r i e s w i t h some o r a l l o f t h e independent v a r i a b l e s ) and an unexplained p a r t ( t h e r e s i d u a l variance-the variance l e f t over). 4. The p a r t i t i o n i n g o f t h e t o t a l sum o f squares begins as f o l l o w s : C (Y - V 1 2 = 1 [ (? - V) + (Y - ?) l2 = 1 ( ? - ~ ) 2 t C (Y - ?)2 t 2 1 (? -Y)(Y - $1 Now j u s t c o n s i d e r t h e term (0 - Y ) ( Y - 0) 2 2 T h i s term can be shown t o equal zero. that A A Y = a + A bX v and = , To demonstrate t h i s , f i r s t r e c a l l :+ 6~ . S u b s t i t u t i n g these e q u a l i t i e s i n t o t h e above term y i e l d s t h e f o l l o w i n g : c 0 - -0) Y = = = = = X ( [: + Gx] 6x1 - [: + )(Y - 0) (G + 6x - - ~ x ) ( Y- 0) 2 (tx - & ) ( Y - 0) X 6(x - X ) ( Y - 0) G c (X - X ) ( Y - 0 ) 2 X (X - X)(Y - But according t o t h e 2nd OLS c r i t e r i o n , 0) = 0 . Thus, t h e p r o o f concludes as f o l l o w s : A = b * O = O . Consequently, t h e t o t a l sum o f squares can be p a r t i t i o n e d as f o l l o w s : 2 (Y S c (0 - v12 - Y12 S ~ ~ ~S Total variance i n Y M. AN ASIDE: ~ + S~ 2 (Y ~ S Variance explained - t12 S ~ ~ ~~ ~ Variance unexplained The d i s t a n c e t o t h e r e g r e s s i o n l i n e i s always drawn p a r a l l e l t o A t h e Y-axis. X. - That i s , d e v i a t i o n s o f Y from Y are always f o r f i x e d values of It i s t o convey t h i s c o n d i t i o n a l i t y o f Y on X t h a t one r e f e r s t o t h e r e q r e s s i o n of X, never t o t h e r e g r e s s i o n o f X on Y. N. A demonstration o f t h e n o n i n t u i t i v e n a t u r e o f t h e i n e q u a l i t y , = c (0 - Y12 + X (Y - v ) ~ X ( Y - "Y ) 2 . . I f you c a l c u l a t e A a = - 6~ A and b = c (x - X ) ( Y - v) 2 (X - X12 , then n o t o n l y ~~ f o r each person i s [Y - Y] A [Y - = T] + [Y - t] , BUT t h e squares o f t h e 3 p a r t s of t h i s equivalence add up t o an equivalence as w e l l . 7 For instance, t h e above graphic d e p i c t s a case i n which A f o r t h e ith observation, when A Yi - -Y = 3 and A Y .1 - Y 1. Yet, i t i s c l e a r t h a t (yi - -y ) 2 = X (Y - 712 THIS I S * ( -2 ) 2 = 4 BUT i f you &J Yi = 10 and = A (Yi -5 , Y .1 = then - T)2 + 5 . = 7 and where Note, f o r example, t h a t Yi-T=3+ " 2 2 (Yi - Yi) = 3 [-51 =-2. + ( -5 ) 2 = 34 . up a l l o f these squares you DO f i n d t h a t = X ( t - Y ) ~+ NOT INTUITIVELY TRUE. X (Y - $2 . Nonetheless i t a true! 0. One important CONSEQUENCE o f t h i s equivalence i s an improved e s t i m a t e o f t h e " t r u e " variance of Y . The i d e a here i s t h a t t h e variance j~ Y explained by X makes t h e variance is. of Y appear much 1a r g e r than i t " r e a l l y " T h i s improved estimate i s c a l l e d t h e MEAN SQUARE ERROR ( o r MSE) and i s "2 assigned t h e symbol, o case sigma.) . (Note t h e absence o f a s u b s c r i p t on t h i s lower- NOTE: T h i s i s t h e general formula f o r t h e MSE i n which k equals t h e number o f independent v a r i a b l e s i n t h e r e g r e s s i o n equation. Since t h e r e i s o n l y one independent v a r i a b l e i n a b i v a r i a t e regression, t h e denominator equals n - 2 i n t h i s case. AN ASIDE: Regression o u t p u t from SPSS (e.g., L e c t u r e Notes, p. 164) l a b e l s t h e MSE as t h e MEAN SQUARE RESIDUAL and l a b e l s i t s square r o o t as t h e "Std. E r r o r o f t h e Estimaten-an e s t i m a t e o f t h e " t r u e " standard d e v i a t i o n o f Y. T h i s i s a m i s l e a d i n q use o f t h e term, standard e r r o r ! ! ! The standard e r r o r o f 3 s t a t i s t i c i s t h e standard d e v i a t i o n e s t i m a t e f o r t h e sampling distribution o f that statistic. What i s l a b e l e d a standard e r r o r on your r e g r e s s i o n o u t p u t i s a standard d e v i a t i o n e s t i m a t e f o r t h e d i s t r i b u t i o n o f average r e s i d u a l s ( o r , i f you p r e f e r , o f " t r u e " d e v i a t i o n s o f Y from t h e i r mean a f t e r t h e i r adjustment f o r t h e e f f e c t s o f t h e independent variable[s]) . P. A second i m p o r t a n t CONSEQUENCE o f t h e f a c t t h a t v a r i a n c e can be p a r t i t i o n e d i n t o e x p l a i n e d and unexplained components i s t h a t r e g r e s s i o n a n a l y s i s a f f o r d s a measure o f LINEAR a s s o c i a t i o n between two v a r i a b l e s , namely t h e p r o p o r t i o n o f v a r i a n c e i n t h e dependent v a r i a b l e l i n e a r l y e x p l a i n e d by t h e independent v a r i a b l e : Coefficient o f determination = r2 = S -- S ~ ~ THE CORRELATION COEFFICIENT ( o r Pearson c o r r e l a t i o n ) , "r", i s t h e square r o o t o f t h e c o e f f i c i e n t o f d e t e r m i n a t i o n w i t h t h e same s i g n as t h e slope (g) between t h e c o r r e l a t e d v a r i a b l e s . Two u s e f u l formulas: ~ A 1. The r e l a t i o n between b and r: r a. Recall t h a t b. And t h a t and 2 = 2 (Y - A Yi (G - n2 2 A - A + bX i + br( . a = A Y = a V12 c. Taking t h e numerator o f t h e formula f o r r2 , we g e t z ( $ - V ) ~ = ~ ( G + b x - G - b r ( ) ~= x ( b [ x - K ] = 2 b2 (X - r(12 = b2 2 (X -%12 )2 . d. T h i s i m p l i e s t h a t Therefore b"2 i s p r o p o r t i o n a l t o r2-the proportion o f variance explained i n one v a r i a b l e by another. A e. Moreover, r e c a l l i n g t h a t r and b t a k e t h e same sign, i t f o l l o w s t h a t A 2. Given t h i s r e l a t i o n between r and b, you w i l l n o t e t h a t when X and Y are standardized (i.e., when ax = ay = 1 ) t h e c o r r e l a t i o n between them i s t h e same t h i n g as t h e i r slope. Thus, f o r example, i f w i t h i n a random sample o f Canadians t h e c o r r e l a t i o n between education and income were .3 , one c o u l d express t h i s i n words as, "One would e s t i m a t e a . 3 standard d e v i a t i o n increase i n a Canadian's income f o r each 1 standard d e v i a t i o n increase i n h i s o r h e r education." (WARNING: The next s e c t i o n o f these L e c t u r e Notes i n t r o d u c e s p a r t i a l c o r r e l a t i o n s , which u n l i k e c o r r e l a t i o n c o e f f i c i e n t s may NOT be i n t e r p r e t e d as slopes.) 3. It i s w i t h t h e c o r r e l a t i o n c o e f f i c i e n t and i t s square-the o f determination-that coefficient we reach t h e boundary between a w o r l d understood by "normal people" and a w o r l d o n l y understood by s t a t i s t i c i a n s . With some e f f o r t most "normal people" can come t o understand t h e meaning o f standardized u n i t s , and w i t h i t t h e i d e a o f a slope between two standardized v a r i a b l e s . However, i t i s o n l y s t a t i s t i c i a n s who understand t h e phrase, "Education l i n e a r l y e x p l a i n s 9% o f t h e v a r i a n c e i n income." To understand what t h i s means, one must know what i s meant by t h e v a r i a n c e i n a v a r i a b l e and what i t means t o " e x p l a i n " a proportion o f t h i s . Once you understand t h e meanings o f b o t h t h e c o r r e l a t i o n c o e f f i c i e n t ( a slope between two standardized v a r i a b l e s ) and t h e c o e f f i c i e n t o f d e t e r m i n a t i o n ( t h e p r o p o r t i o n o f v a r i a n c e i n one v a r i a b l e t h a t i s l i n e a r l y explained by another v a r i a b l e ) , you have begun t o understand a language t h a t i s e x c l u s i v e l y shared among s t a t i s t i c i a n s . Q. The assumptions o f l i n e a r r e g r e s s i o n 1. HOMOSCEDASTICITY: The variance o f Y i s t h e same w i t h i n d i f f e r e n t c a t e g o r i e s o f X. 2. LINEARITY: The means o f Yi l i e i n a s t r a i g h t l i n e across d i f f e r e n t c a t e g o r i e s o f X. 3. RANDOMNESS: 4. NORMALITY: The random v a r i a b l e s Yi a r e s t a t i s t i c a l l y independent. For any f i x e d v a l u e o f X, Y i s n o r m a l l y d i s t r i b u t e d . There are TWO WAYS o f expressing t h i s l a s t assumption: A a. Y - N( a + bX, a2 ) , where a2 i s t h e v a r i a n c e o f Y around t h e regression l i n e . b. Y = a + bX + E , A where E A NOTE: means. - N( 0, a2 ) . A The d i s t r i b u t i o n s o f Y and E are i d e n t i c a l , except f o r t h e i r Also n o t e t h a t i n each case t h e e s t i m a t e o f t h e variance i s t h e MEAN SQUARE ERROR term from t h e r e g r e s s i o n . R. Using a BOX PLOT t o check t h e assumptions o f l i n e a r r e g r e s s i o n 1. What i s a box p l o t ? a. A box p l o t i s a p l o t o f t h e d a t a t h a t has t h e dependent v a r i a b l e (Y) on t h e v e r t i c a l a x i s and t h e independent v a r i a b l e (X) on t h e horizontal axis. b. Values on t h e dependent v a r i a b l e are i n terms o f t h e u n i t s i t has. (For example, t h e i l l u s t r a t i o n on t h e n e x t page was generated o n l y a f t e r an income measure (rincome) was recoded i n t o u n i t s o f d o l l a r s . ) c. Values on t h e independent v a r i a b l e should be c o l l a p s e d i n t o groups o f a t l e a s t 30 d a t a p o i n t s . ( I n any case, r o u g h l y equal numbers o f d a t a p o i n t s should belong t o each group.) d. A box i s drawn above t h a t p o i n t on t h e X - a x i s t h a t corresponds t o each group. The box's t o p and bottom d e l i n e a t e t h e i n t e r q u a r t i l e range o f values on t h e dependent v a r i a b l e f o r u n i t s o f a n a l y s i s w i t h i n t h e box's corresponding group. A l i n e across t h e box i n d i c a t e s t h e median value w i t h i n t h e group. "Whiskers" o u t each end o f t h e box show t h e range o f values w i t h i n t h e group ( e x c l u d i n g outliers). e. O u t l i e r s (namely, values w i t h i n a group t h a t exceed t h e box's i n t e r q u a r t i l e bounds by lt box l e n g t h s ) a r e i n d i c a t e d by c i r c l e s and t h e i r corresponding s u b j e c t numbers. f. The p l o t below i l l u s t r a t e s a box p l o t f o r t h e l i n e a r r e l a t i o n between RINCOME ( Y ) and PRESTIGE ( X ) from t h e 1984 NORC survey o f t h e U.S. adult population. R'S OCCUPATIONAL PRESTIGE SCORE The program used t o generate t h i s o u t p u t : r - - --, .s e l e c t i f ((rincome ne 13) and (rincome ne 99)). recode rincome (1=500) (2=2000) (3=3500) (4=4500) (5-5500) (6=6500) (7=7500) (8=9000) (9=12500)(10=17500) (11=22500) (12=35000) (13=99) recode o r e i t i q e ( l 2 ' t h r u 16-17)(18 t h r u 23=24)(25 t h r u 27-28] (29,30,31=3i) (33=34) (35=36) (37 t h r u 39=40) iii t h r u 44=45)(46=47) (48=49) (51 t h r u 56=57) (58,60=61) (62 t h r u 78=82). examine variables=rincome by p r e s t i g e / p l o t = b o x p l o t / s t a t i s t i c s = n o n e / n o t o t a l . 2. Checks made u s i n g a box p l o t . a. H e t e r o s c e d a s t i c i t y (i.e., when t h e IQRs ( i .e., t h e absence o f homoscedasticity) i s e v i d e n t t h e boxes) i n a box p l o t a r e o f d i f f e r e n t heights. b. The l i n e a r i t y assumption i s v i o l a t e d i f no monotonic i n c r e a s e ( o r decrease) i n t h e Ys can be d e t e c t e d f o r i n c r e a s i n g l y l a r g e r Xs. This can be checked by seeing i f a s i n g l e ( b u t n o t a h o r i z o n t a l ) s t r a i g h t l i n e can be drawn between upper and l o w e r q u a r t i l e bounds ( i . e . , the t o p and bottom) o f a l l boxes w i t h i n t h e p l o t . c. Randomness i s u s u a l l y b u i l t i n t o one's research design. Statistical independence among observations i s ensured by randomly sampling subjects. Nonetheless, i f your d a t a were c o l l e c t e d over TIME, t h e assumption i s probably n o t met. (E.g., i n 1999 i s n o t independent o f t h e U.S. t h e GNP o f t h e U n i t e d S t a t e s GNP i n 1998.) d a t a by t i m e ( t h a t i s , l e t t i m e be t h e X - v a r i a b l e ) , I f you p l o t t h e such a v i o l a t i o n o f t h e randomness assumption may be d e t e c t e d as TRACKING o f t h e data. T r a c k i n g i s e v i d e n t when s i m i l a r values on t h e dependent v a r i a b l e are found f o r observations t h a t f o l l o w each o t h e r i n sequence. d. V i o l a t i o n s o f t h e n o r m a l i t y assumption may r e s u l t i n biased r e g r e s s i o n c o e f f i c i e n t s as w e l l as i n m i s l e a d i n g t e s t s o f 148 significance. However, t h e data must depart r a d i c a l l y from n o r m a l i t y f o r these t e s t s t o be n o t i c e a b l y e f f e c t e d . aberrant observation (a.k.a. Sometimes an extremely an " o u t l i e r " ) can skew t h e c o n d i t i o n a l d i s t r i b u t i o n o f Y f o r a p a r t i c u l a r X-value. When t h i s happens one u s u a l l y i d e n t i f i e s t h e o u t l i e r , determines why i t i s a t y p i c a l o f t h e r e s t o f t h e data, drops t h e o u t l i e r from t h e a n a l y s i s , and acknowledges t h e o u t l i e r and i t s a t y p i c a l n a t u r e i n t h e w r i t t e n r e s u l t s (usually i n a footnote). "He m y we've ndned hk p d l w msor* bcMn -1 and weight" A S. T e s t i n g hypotheses about b A A d e r i v a t i o n o f t h e d i s t r i b u t i o n o f b i s beyond t h e scope o f t h e course, so here i t i s : A - N( b b, a,2 ) b , ? : where 1. Note t h a t t h e variance o f 2 b - MSE z ( X - x ) 2 i s t h e MEAN SQUARED ERROR standardized f o r t h e amount o f v a r i a t i o n i n X. Thus i f a sample has l a r g e d e v i a t i o n s from T( and a small corresponding e r r o r i n estimates o f Y, t h e variance 6 The converse o f t h i s i s a l s o t r u e . o f i s small. 2. Because t h e u n i t s o f a slope are " u n i t s Y p e r u n i t X" (e.g., dollars income p e r y e a r o f education), t h e u n i t s o f t h i s v a r i a n c e a r e t h e square o f t h i s (e.g., sauared d o l l a r s p e r sauared y e a r o f education). 3. T h i s v a r i a n c e can be used t o t e s t Ho: b = 0 HA: b l e v e l (e.g., a=.05). + 0 a t a given significance Here y o u r r e j e c t i o n r u l e would be as f o l l o w s : R e j e c t Ho when (61> 1.96 i n s t e a d o f 1.96 when t h e degrees o f (NOTE: You should use tn-2,a,2 freedom associated w i t h MSE [namely, df = n-21 are l e s s than 30.) h T. F i n d i n g a 95% confidence i n t e r v a l f o r b S i m i l a r l y (and again assuming a l a r g e enough sample such t h a t t h e MSE's degrees o f freedom equal 30 o r more), f i n d i n g a 95% confidence i n t e r v a l A about b i s done u s i n g t h e formula, U. AN ILLUSTRATION: Imagine t h a t you wish t o know whether g r e a t e r use o f knee braces by f o o t b a l l p l a y e r s has an e f f e c t upon t h e number o f s e r i o u s f o o t b a l l - r e l a t e d i n j u r i e s s u f f e r e d by members o f c o l l e g e f o o t b a l l teams. Because f o o t b a l l teams u s u a l l y have p o l i c i e s concerning knee braces, you f i n d t h a t use o f knee braces i s NOT a random v a r i a b l e among members o f t h e same f o o t b a l l team. Since use o f knee braces i s ( l e s s arguably) random among f o o t b a l l teams, you use " t h e f o o t b a l l team" as y o u r u n i t o f a n a l y s i s . You have d a t a on t h i r t y f o o t b a l l teams. Your SPSS d a t a s e t i n c l u d e s two v a r i a b l e s , namely "pbrace" ( p r o p o r t i o n o f t h e f o o t b a l l team comprised o f 150 players t h a t r e g u l a r l y use a knee brace) and " i n j u r i e s " (number o f serious f o o t b a l l - r e l a t e d i n j u r i e s per 100 players s u f f e r e d by t h e f o o t b a l l team l a s t year). You r u n t h e f o l l o w i n g SPSS i n s t r u c t i o n s : compute j = pbrace injuries. frequencies general = pbrace, i n j u r i e s , j / statistics = mean, variance. Parts o f your output l o o k as f o l l o w s : Statistics Mean Variance PBRACE .682 ,013 INJURIES 2.97 28.17 J 2.23 18.84 1. What i s known a t t h i s p o i n t : 2. Give t h e unstandardized regression equation a p p r o p r i a t e t o your research question and say i n words what each regression c o e f f i c i e n t means. I n words: For every a d d i t i o n a l 10% of a f o o t b a l l team's p l a y e r s who r e g u l a r l y used a knee brace, one would e s t i m a t e an a d d i t i o n a l 1.627 s e r i o u s f o o t b a l l - r e l a t e d i n j u r i e s p e r 100 p l a y e r s s u f f e r e d by t h e team l a s t year. A a = V - $1 = 2.97 - ( 16.27 * .682 ) = -8.126 I n words: One would e s t i m a t e minus ( ? ) 8.126 s e r i o u s f o o t b a l l - r e l a t e d i n j u r i e s p e r 100 p l a y e r s s u f f e r e d l a s t y e a r by a f o o t b a l l team on which none o f t h e team's p l a y e r s r e g u l a r l y used a knee brace. NOTE: An a l t e r n a t i v e formula f o r f i n d i n g t h e unstandardized r e g r e s s i o n c o e f f i c i e n t i s ( w i t h rounding e r r o r ) as f o l l o w s : 3. Would you recommend use o f a knee brace t o prevent s e r i o u s f o o t b a l l - related injuries? No. ( J u s t i f y your answer by r e f e r r i n g t o y o u r d a t a . ) The s i g n o f t h e slope suggests t h a t r a t h e r than p r e v e n t i n g s e r i o u s f o o t b a l l - r e l a t e d i n j u r i e s , knee brace use enhances such i n j u r i e s . 4. Test t h e n u l l hypothesis t h a t t h e p r o p o r t i o n o f f o o t b a l l teams comprised o f p l a y e r s t h a t r e g u l a r l y use a knee brace ( l i n e a r l y ) i n f l u e n c e s t h e number o f s e r i o u s f o o t b a l l - r e l a t e d i n j u r i e s s u f f e r e d by t h e teams. t h e .05 l e v e l o f s i g n i f i c a n c e . ) Ho: b = 0 HA: b # 0 Note t h a t t h e standard e r r o r o f The r e j e c t i o n r u l e i s t h u s $ is (Use A A "Reject Ho i f l b l > t28,.025 u; = 2.048 * ." 8.242 = 16.88 A b = 16.27 < 16.88 Because we f a i l t o r e j e c t Ho. A 5. F i n d t h e 95% confidence i n t e r v a l f o r b. A The 95% confidence i n t e r v a l f o r b i s or 16.27 + 2.048 * 8.242 or 16.27 + 16.27 t 16.88 A t28,.025 a; or (-0.61, 33.15) . V. T e s t i n g hypotheses about r 1. We know t h a t A b - N( 2 ) b, uA b , where "2 = MSE . uA b S S ~ 2. R e f e r r i n g back t o t h e e a r l i e r d i s c u s s i o n [ L e c t u r e Notes, p. 241 r e g a r d i n g conversions among normal d i s t r i b u t i o n s , you may r e c a l l t h a t i f X - N( 2 2 k ox ) . 2 ) p, ox then t h e d i s t r i b u t i o n o f * X' = k For example, i f X i s i n u n i t s o f f e e t then u n i t s o f inches, and i f - N( X A 3 . Taking i n t o account t h a t b - 5, 10 ) N( b, 2 b "A) then X' is X - X ' = 12X - N( and s e t t i n g X' N( kp, is in 60, 1440 ) . OX k = , it "Y follows that .. 4 . I f we approximate "b "x using r "Y = A b "x 7, "Y e s t i m a t i n g t h e variance o f r i s as f o l l o w s : MSE MSE t h e formula f o r 5. Now consider t h i s l a s t q u o t i e n t : Recalling t h a t C ( Y -V )2 = C ( Y - YA ) 2 + C ( ?-Y )2 , Thus t h e numerator equals t h e p r o p o r t i o n o f t h e v a r i a n c e NOT e x p l a i n e d and we are l e d t o t h e conclusion t h a t I.e., we would conclude t h a t t h e v a r i a n c e around p depends on t h e magnitude o f p. Consequently, t h e estimated v a r i a n c e o f r w i l l not be constant f o r d i f f e r e n t estimates o f p. 6. AN ASIDE: Although n o t dependent on b. - gf A i s dependent on p, t h e estimated v a r i a n c e o f b i s The former dependence i s i n t r o d u c e d when t h e above A approximation o f "x Ox -w i t h 7 is made. "2 on p presents no problem i n t e s t s o f n u l l 7. The dependence of or hypotheses o f t h e form, Ho: p = 0 . A PROBLEM does r e s u l t as one attempts t o s e t up confidence i n t e r v a l s around estimates o f p, however. a. A confidence i n t e r v a l around r i s an i n t e r v a l t h a t i n c l u d e s t h e estimated value o f p between an upper and a lower confidence bound. b. The l a r g e r t h e absolute value o f p, t h e s m a l l e r t h e variance e s t i m a t e t h a t should be used. c. These two statements i m p l y t h a t " t h e bound o f a confidence i n t e r v a l t h a t i s f u r t h e r from zero" should be c l o s e r t o t h e e s t i m a t e o f p than t h e o t h e r bound, s i n c e i t r e q u i r e s a s m a l l e r v a r i a n c e estimate. d. I n c o n t r a s t , t h i s problem does n o t e x i s t when t e s t i n g t h e n u l l hypothesis, Ho: p = 0 , s i n c e i n a t w o - t a i l e d t e s t each c r i t i c a l v a l u e w i l l be e q u i d i s t a n t from zero. Now t o an i l l u s t r a t i o n . 8. L e t ' s now r e t u r n t o t h e knee brace i l l u s t r a t i o n . a. F i r s t , we can ask how much o f t h e variance i n i n j u r i e s i s explained by use o f knee braces. (Please remember t h a t a l l p r o p o r t i o n s o f v a r i a n c e e x p l a i n e d are some form o f "squared r . " Here i t i s t h e c o r r e l a t i o n c o e f f i c i e n t t h a t i s squared.) 2 = ('35) 2 = . I 2 2 rxy b. Second, we can t e s t i f t h e amount o f v a r i a n c e found i n p a r t a i s a s t a t i s t i c a l l y s i g n i f i c a n t amount o f variance a t t h e .05 l e v e l . Ho: p2 = 0 HA: p 2 > O Conclusion: No, i s equivalent t o t e s t i n g 2 = .I22 rXy Ho: p = 0 HA: P 0 + i s not s i g n i f i c a n t l y large a t a = .05 9. A FEW COMMENTS: a. The degrees o f freedom a r e n - 2 155 , s i n c e a degree o f freedom i s . l o s t when t h e mean o f EACH v a r i a b l e i s c a l c u l a t e d . b. A c o r r e l a t i o n o f t.35 i s n o t very l i k e l y t o occur by chance! (In t h i s sample o f o n l y 30 cases i t was almost s u f f i c i e n t l y l a r g e t o r e j e c t Ho.) More s p e c i f i c a l l y , i f one v a r i a b l e e x p l a i n s ( l i n e a r l y ) 15% o r more o f t h e variance i n another, a sample o f 30 w i l l be l a r g e enough t o d e t e c t i t as s t a t i s t i c a l l y s i g n i f i c a n t a t t h e .05 l e v e l . This should g i v e you some i n t u i t i v e " f e e l " f o r t h e r e l a t i v e s t r e n g t h of lrl=.35. W. F i n d i n g a 95% confidence i n t e r v a l f o r r RECALL t h a t or2 = 1- 2 . Thus when l p l i s small, you would have a l a r g e r variance than when l p l i s large! As a r e s u l t , your confidence i n t e r v a l must be " l o p s i d e d " w i t h a l a r g e r d e v i a t i o n from r toward zero and a s m a l l e r d e v i a t i o n from r toward +1. To estimate these d e v i a t i o n s we use Table E, which can be found a t t h e end o f t h e Syllabus. Here are THE MECHANICS: 1. Table E provides a t r a n s f o r m a t i o n (i.e., v a r i a b l e w i t h a normal d i s t r i b u t i o n - a u n r e l a t e d t o p ' s magnitude. T(r) - N( T( P 1, ,- 3 r to a variable with a d i s t r i b u t i o n Specifically, 1 . 2. I n t h e knee brace problem we have r confidence i n t e r v a l i s a "mapping") o f = .35 and n = 30 , t h u s t h e 95% 3. From Table E we f i n d t h a t T(.35) = . .3654 Thus t h e confidence interval i s 4. BUT t h i s i s a confidence i n t e r v a l around T ( r ) ! these two values back t o r ' s . T- 1(.743) (-.012, = .631 . This y i e l d s So we must c o n v e r t ~-l(-.012) -.012 = Thus t h e 95% confidence i n t e r v a l f o r r = and .35 is .631). 5. N o t i c e how these confidence bounds l o o k on a number l i n e : A1=. 362 A2=.281 A I 4 I -1 I n p a r t i c u l a r , n o t i c e t h a t t h e bounds o f t h e i n t e r v a l are "lopsided." That i s , n o t e t h a t X. The d i s t r i b u t i o n o f 1 .35 - when T( = (-.012) 0 1 = .362 > .281 = 1 .35 - .631 ( . . A 1. L e t a, equal t h e constant i n a b i v a r i a t e r e g r e s s i o n equation i n which When X i s transformed t h e mean on t h e independent v a r i a b l e equals zero. by s u b t r a c t i n g o u t i t s mean ( i . e . , when T( = 0 ), then A a, =Y and A t h u s a, i s an e s t i m a t e o f fiy. NOTE: V a r i a b l e s are "centered" i f t h e i r means equal zero. A s u b s c r i p t s on ac and Xc are reminders t h a t A 2. The standard e r r o r o f a, Xc I 0 The " c " . i s u s u a l l y much s m a l l e r than t h a t associated w i t h t h e usual estimate o f y, however. A a, - N( I n particular, 2 u 7 ) , a, b u t i n t h i s case 2: = MSE . And t h e MSE i s a b e t t e r estimate o f t h e " t r u e " variance o f Y-a variance n e t o f t h e e f f e c t s o f X. A A 3. Note t h a t a, and b a r e independent. a. Recall t h a t t h e f i r s t OLS c r i t e r i o n f i x e s ( X, 7 ) as a p o i n t on t h e r e g r e s s i o n l i n e . Once one's independent v a r i a b l e has been - A centered such t h a t Xc r 0 (and consequently, such t h a t a, A t h i s p o i n t i s f i x e d a t ( 0, a, ) = Y ), . b. T h i s p o i n t w i l l be on t h e r e g r e s s i o n l i n e i n d e ~ e n d e n to f t h e value o f A b as determined by t h e second OLS c r i t e r i o n . A c. Since a (namely, - LX , whenever X = A A a i s dependent upon b under any o t h e r c o n d i t i o n f 0 ) . A Y . The d i s t r i b u t i o n o f Ye ( t h e estimated value o f Y when X = Xe): 1. We begin w i t h knowledge o f t h e f o l l o w i n g f o u r t h i n g s : X a. I f Xc A d. a, -X , Xc then = 0 . A and b are s t a t i s t i c a l l y independent. 2. I f X and Y are normally d i s t r i b u t e d and s t a t i s t i c a l l y independent, t h e distribution of W = X + Y is N( pX + py, ox2 + oy2 ). (See t h e e a r l i e r d i s c u s s i o n [Lecture Notes, p. 1071 o f t h e d i s t r i b u t i o n o f t h e d i f f e r e n c e between two p r o p o r t i o n s . ) 3. Conclusion: A t h e variance o f t h i s NOTE: Since b i s m u l t i p l i e d by t h e constant, Xc, product equals t h e variance o f b times Xc.2 A 4. I n t h e more general case i n which d a t a on t h e independent v a r i a b l e have A n o t been centered, t h e d i s t r i b u t i o n o f Y would be as f o l l o w s : A which, a f t e r "uncentering" t h e data by l e t t i n g a = aC - b , A a = ac - and can be r e w r i t t e n as f o l l o w s : Yet n o t e t h a t t h i s i s r e a l l y a distribution i f ? class o f distributions, since the i s d i f f e r e n t f o r every value o f X. Accordingly, i t A now becomes meaningful t o speak o f t h e d i s t r i b u t i o n o f Ye ( i . e . , A d i s t r i b u t i o n o f Y f o r a f i x e d value o f X Xe ) . = the For example, i f Y measures " s t a r t i n g income on one's f i r s t j o b " and i f X measures " l a s t year o f education completed," one might wish t o e s t i m a t e t h e f i r s t - j o b s t a r t i n g - i n c o m e o f h i g h school graduates" ( f o r whom Xe = 12 ) . And so, A 5. T h i s formula y i e l d s a more general formula f o r t h e standard e r r o r o f a than t h e one j u s t given-a A Since (as always) A a = Ye formula t h a t a p p l i e s when I Xe = 0 (i.e., A X+ 0 : a equals t h e estimated value o f Ye when Xe = 0 ) , t h e formula f o r t h e standard e r r o r o f t h e constant i s t h u s as f o l l o w s : Xe = 0 Note t h a t t h i s formula i s j u s t f o r t h e s p e c i a l case i n which Xe The standard e r r o r f o r t h e more general case i n which = k . i s found by s u b s t i t u t i n g k f o r t h e 0 i n t h e formula. A Z. The d i s t r i b u t i o n o f Ye a l l o w s you t o estimate t h e parameter, pylXe. For example, you may wish an i n t e r v a l e s t i m a t e f o r t h e mean income (py) o f Americans w i t h a Ph.D. degree (X,). That i s , you might want a confidence A i n t e r v a l around Ye. A U n t i l now Y has been r a t h e r f r e e l y r e f e r r e d t o as a " p r e d i c t i o n " i n s t e a d o f A I n f a c t , you may use Y as both as an e s t i m a t e o f a p o p u l a t i o n parameter. A A p r e d i c t o r (Y ) and e s t i m a t o r (Ye). You w i l l f i n d o n l y a s l i q h t d i f f e r e n c e P in t h e formula f o r i t s variance when i t i s used i n these two ways. The A A v a r i a n c e i n t r o d u c e d above i s t h e variance o f Ye-the v a r i a n c e o f Y when i t i s used as an e s t i m a t o r . When ? i s used t o make a p r e d i c t i o n based upon d a t a p r e v i o u s l y c o l l e c t e d , a d i f f e r e n t v a r i a n c e estimate i s needed. how much money yo^ L e t ' s say t h a t you wish t o p r e d i c t would earn i f you earned a Ph.D. I f you wish t o make an i n t e r v a l estimate o f a d a t a p o i n t f o r a p a r t i c u l a r u n i t o f a n a l y s i s , y o u r PREDICTION INTERVAL must t a k e i n t o account t h e a d d i t i o n a l v a r i a n c e due t o A t h e i n t r o d u c t i o n o f t h i s new datum. L i k e Ye, A Y A P A = a + bX P . However, Adding a2 i n t o t h e variance term increments t h e v a r i a n c e by t h e v a r i a t i o n 160 c o n t r i b u t e d by a p a r t i c u l a r u n i t o f a n a l y s i s ( i n t h i s case, you-a potential NOTE: A t i s s u e i n d i s t i n g u i s h i n g a p r e d i c t i o n from an Ph.D. r e c i p i e n t ) . A estimate i s t h a t i n t h e former case Y i s being viewed as t h e dependent variable's value f o r a p a r t i c u l a r u n i t o f a n a l v s i s ( i . one w i t h a name: Nebraska, t h e Ford Foundation, you, e t c . ) having a s p e c i f i c value(s) on an A independent v a r i a b l e ( s ) . When Y i s an estimate, i t i s viewed as t h e dependent v a r i a b l e ' s value f o r a t v ~ eo f u n i t o f a n a l v s i s (again, having a s p e c i f i c v a l u e [ s ] on t h e independent v a r i a b l e [ s ] ) w i t h i n t h e p o p u l a t i o n from I n t h e former case one i s p r e d i c t i n g a s i n g l e which t h e sample was drawn. d a t a p o i n t ; i n t h e l a t t e r case one i s e s t i m a t i n g a c o n d i t i o n a l mean. AA. AN ILLUSTRATION: You have a random sample o f 100 American c i t i e s and have f i t t h e f o l l o w i n g r e g r e s s i o n model t o d a t a on ROBBEREZ ( t h e annual number o f r o b b e r i e s i n a c i t y ) and POPPTHOU ( c i t y p o p u l a t i o n i n thousands): ROBBEREZ Also assume t h a t :Z = MSE + -1 = = 4 .1 POPPTHOU + c and t h a t you have t h e f o l l o w i n g o u t p u t : StatisUcs I Mean Variance ROBBEREZ 9.00 4.21 I POPPTHW lW.W 25.00 1. Assuming t h a t Ames has a p o p u l a t i o n o f 60,000, one can use these d a t a t o p r e d i c t t h e number o f r o b b e r i e s t h a t w i l l occur i n Ames n e x t y e a r t o equal 5 ( i . e . , -1 + 6). 2. Assuming t h a t Los Angeles has a p o p u l a t i o n o f 10 m i l l i o n , one would p r e d i c t t h e number o f r o b b e r i e s t h a t w i l l occur t h e r e n e x t year t o equal 999 ( i . e . , -1 + 1,000). 3. Now, l e t us f i n d a 95% p r e d i c t i o n i n t e r v a l around t h e p r e d i c t i o n t h a t 5 r o b b e r i e s would occur i n Ames next year. We begin w i t h t h e general formula, I.e., assuming t h a t t h e f a c t o r s t h a t determined t h e number o f r o b b e r i e s i n c i t i e s w i t h i n our sample a r e t h e same as those i n p l a y i n Ames next year, we p r e d i c t t h a t next y e a r from 0 t o 11 r o b b e r i e s w i l l t a k e p l a c e i n Ames. Thus i f t h e r e were 12 r o b b e r i e s next year, you might suspect t h a t t h e h i g h r a t e i n d i c a t e d something p e c u l i a r about t h a t year i n Ames. One can begin graphing t h i s and o t h e r p r e d i c t i o n i n t e r v a l s as f o l l o w s : . # robberies per year . . . Ames - X C i t y s i z e ( i n thousands) 4. We can now add t o t h i s graph t h e 95% p r e d i c t i o n i n t e r v a l around t h e p r e d i c t i o n o f 999 robberies i n LA n e x t year: <- OR (218 t o 1780) a VERY wide p r e d i c t i o n i n t e r v a l ! 5. F i n a l l y , note t h a t t h e s m a l l e s t p r e d i c t i o n i n t e r v a l i s a t OR ( 5.06, 12.94 ) OR X X = : from 5 t o 13 r o b b e r i e s AB. CONCLUSIONS: a. The f u r t h e r t h e basis o f your p r e d i c t i o n ( i . e . , b. t h e s m a l l e r t h e sample s i z e X P ) i s from TI AND AND c. t h e smaller t h e estimated p o p u l a t i o n variance o f X, THE LESS PRECISE YOUR PREDICTION w i l l be. These conclusions f o l l o w as a d i r e c t consequence o f t h e formula f o r a prediction interval: NOTE: A c r i t i c a l value o f t ( r a t h e r than 2) w i l l be needed i n computing both confidence and p r e d i c t i o n i n t e r v a l s whenever k = n - k - 1 < 30 , t h e number o f independent v a r i a b l e s i n t h e r e g r e s s i o n equation. i n t h e b i v a r i a t e case, t h e number o f degrees o f freedom f o r t i s n where Thus - 2 . HANDOUT Some SPSS Regress1 on Output The program: d a t a l i s t r e c o r d s = l / a t t e n d 1-2 p r e j u d 4 - 5 . compute newx = ( a t t e n d - 28)**2. begin d a t a . 11 3 6 46 33 3 6 16 42 41 49 2 1 51 23 6 1 10 23 34 57 48 18 28 6 5 55 3 end d a t a . r e g r e s s i o n v a r i ables=newx, prejud/dependent=prejud/enter. The output: Yodel 1 I 1 R 1 .971421 RSquan ,94366 I 1 Square I th?Mlmale ,93803 1 S.Xl0BB a. Redldon: (Corrtsnt). NEWX Sum of squares Modd Regression 4544.67558 ResYuaI 271.32442 Total 4816.00000 a. PreddoR: &%n&ntl. NEWX b DependenlVaflable: PREJUD dl 1 1 I Mean Square 1 10 11 I F 4544.67558 167.49969 27.13244 Sb. .WW I Isndardi UrIShrdardiied Coeffcknls CmHiiien 25.953 NEWX a. Dependent Variable: PREJUD ,0000 .OOW