A Technical Note Accompanying the Excel Spreadsheet “creating_scatterplot_with_desired_r.xls”
Hungwen Chang
Department of Mathematics
Laney College
May 7, 2011
Revision history:
[v.1.01, April 1, 2012] In the paragraph following the first displayed formula, an erroneous statement on cosine, not used elsewhere, is
removed from one sentence.
[v.1.02, April 1, 2012] A typo was fixed.
This note describes the strategy taken by the spreadsheet “creating_scatterplot_with_desired_r.xls”
(starting with version 2.0 of March 2011) to create a bivariate data set with prescribed values for $\bar{x}$, $\bar{y}$,
$s_x$, $s_y$, and $r$. The strategy relies on a formulation of linear regression in terms of linear algebra,
which we first review below.
Given a bivariate data set $(x_1, y_1), \dots, (x_n, y_n)$, the regression line $y_p = b + ax$ is the least squares
line, i.e. the line that minimizes the sum of squares of the residuals.
Denote by $\mathbf{1}$ the vector $(1, 1, \dots, 1) \in \mathbb{R}^n$, by $X$ the vector $(x_1, x_2, \dots, x_n) \in \mathbb{R}^n$, and by $Y$ the vector
$(y_1, y_2, \dots, y_n) \in \mathbb{R}^n$. Assume that the $x_i$'s are not all identical, so that $\mathbf{1}$ and $X$ are linearly
independent. Endow $\mathbb{R}^n$ with the usual Euclidean inner product. Then finding the least squares line
$y_p = b + ax$ is the same as finding $b$ and $a$ to minimize $\|Y - (b\mathbf{1} + aX)\|$. This $b\mathbf{1} + aX$ is therefore
the orthogonal projection of $Y$ onto the 2-dimensional linear subspace of $\mathbb{R}^n$ spanned by $\mathbf{1}$ and $X$.
For any $V = (v_1, v_2, \dots, v_n) \in \mathbb{R}^n$, let $\bar{V}$ stand for the mean of the components, i.e.
$\bar{V} = (v_1 + v_2 + \cdots + v_n)/n$. The orthogonal projection of $V$ onto the linear span of $\mathbf{1}$ is thus $\bar{V}\mathbf{1}$.
Shifting all components of $V$ together to get $V - \bar{V}\mathbf{1}$, in order to force the mean to 0, is nothing other
than extracting the part of $V$ orthogonal to $\mathbf{1} \in \mathbb{R}^n$.
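This centering step can be seen in a couple of lines of Python (a sketch of my own, not taken from the spreadsheet): subtracting the mean from every component leaves a vector whose dot product with $\mathbf{1}$, i.e. the sum of its components, is 0.

```python
def center(v):
    """Subtract the mean from each component: returns V - Vbar*1,
    the part of V orthogonal to the all-ones vector."""
    vbar = sum(v) / len(v)
    return [c - vbar for c in v]

v0 = center([2.0, 4.0, 9.0])   # mean is 5, so this gives [-3.0, -1.0, 4.0]
```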
The 2-dimensional linear span of $\mathbf{1}$ and $X$ has, as an orthogonal basis, $\mathbf{1}$ and $X - \bar{X}\mathbf{1}$. So the orthogonal
projection of $Y$ onto this 2-dimensional linear subspace is

$$\bar{Y}\mathbf{1} + \frac{(X - \bar{X}\mathbf{1}) \cdot (Y - \bar{Y}\mathbf{1})}{\|X - \bar{X}\mathbf{1}\|^2}\,(X - \bar{X}\mathbf{1}).$$

Comparing this with $b\mathbf{1} + aX$, we see

$$a = \frac{(X - \bar{X}\mathbf{1}) \cdot (Y - \bar{Y}\mathbf{1})}{\|X - \bar{X}\mathbf{1}\|^2} \qquad \text{and} \qquad b = \bar{Y} - a\bar{X}.$$

Moreover, $b\mathbf{1} + aX$ can be written as $\bar{Y}\mathbf{1} + a(X - \bar{X}\mathbf{1})$. The vector $Y$ is thus decomposed as

$$Y = \bar{Y}\mathbf{1} + a(X - \bar{X}\mathbf{1}) + Y^{\perp},$$

with the three vectors on the right-hand side mutually orthogonal. The first two pieces
$\bar{Y}\mathbf{1} + a(X - \bar{X}\mathbf{1})$ encode the regression line, whereas $Y^{\perp}$ encodes the $n$ residuals. The correlation
coefficient $r$, written as

$$r = \frac{(X - \bar{X}\mathbf{1}) \cdot (Y - \bar{Y}\mathbf{1})}{\|X - \bar{X}\mathbf{1}\|\,\|Y - \bar{Y}\mathbf{1}\|},$$

is nothing other than the cosine of the angle between $Y - \bar{Y}\mathbf{1}$ and $X - \bar{X}\mathbf{1}$. Since $Y - \bar{Y}\mathbf{1}$ is
orthogonally decomposed as $Y - \bar{Y}\mathbf{1} = a(X - \bar{X}\mathbf{1}) + Y^{\perp}$, it follows that the square of the cosine
of this angle is

$$r^2 = \frac{\|a(X - \bar{X}\mathbf{1})\|^2}{\|a(X - \bar{X}\mathbf{1}) + Y^{\perp}\|^2},$$

and that

$$1 - r^2 = \frac{\|Y^{\perp}\|^2}{\|a(X - \bar{X}\mathbf{1}) + Y^{\perp}\|^2}.$$

So

$$\|Y^{\perp}\| = \sqrt{1 - r^2}\,\|a(X - \bar{X}\mathbf{1}) + Y^{\perp}\| = \sqrt{1 - r^2}\,\|Y - \bar{Y}\mathbf{1}\| = s_y \sqrt{1 - r^2}\,\sqrt{n - 1}.$$
Also observe that

$$a = \frac{(X - \bar{X}\mathbf{1}) \cdot (Y - \bar{Y}\mathbf{1})}{\|X - \bar{X}\mathbf{1}\|^2} = \frac{r\,\|Y - \bar{Y}\mathbf{1}\|}{\|X - \bar{X}\mathbf{1}\|} = \frac{r s_y}{s_x}.$$
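Both identities can be checked numerically on any data set. The sketch below (plain Python; names are my own) computes each side of $a = r s_y / s_x$ and of $\|Y^{\perp}\| = s_y\sqrt{1 - r^2}\,\sqrt{n - 1}$, the latter by measuring the actual residual vector of the fitted line:

```python
import math

def verify_identities(x, y):
    """Numerically check a = r*sy/sx and ||Yperp|| = sy*sqrt(1-r^2)*sqrt(n-1);
    returns the two sides of each identity for comparison."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sx = math.sqrt(sum((v - mx) ** 2 for v in x) / (n - 1))
    sy = math.sqrt(sum((v - my) ** 2 for v in y) / (n - 1))
    sxy = sum((p - mx) * (q - my) for p, q in zip(x, y)) / (n - 1)
    r = sxy / (sx * sy)                 # correlation coefficient
    a = sxy / sx ** 2                   # least-squares slope
    b = my - a * mx                     # intercept
    # ||Yperp||: norm of the residual vector of the fitted line.
    resid = math.sqrt(sum((q - (b + a * p)) ** 2 for p, q in zip(x, y)))
    return (a, r * sy / sx), (resid, sy * math.sqrt(1 - r * r) * math.sqrt(n - 1))
```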
This geometric picture makes it easier to see how to generate a data set with prescribed values for $\bar{x}$,
$\bar{y}$, $s_x$, $s_y$, and $r$. First, calculate $a = r s_y / s_x$. Then choose $X = (x_1, x_2, \dots, x_n)$ with the desired
mean $\bar{X} = \bar{x}$ and the desired standard deviation $s_x$. Then let

$$Y = \bar{y}\mathbf{1} + a(X - \bar{x}\mathbf{1}) + Y^{\perp} = \bar{y}\mathbf{1} + (r s_y / s_x)(X - \bar{x}\mathbf{1}) + Y^{\perp},$$

where $Y^{\perp}$ can be any vector orthogonal to $\mathbf{1}$ and $X - \bar{x}\mathbf{1}$, with length $\|Y^{\perp}\| = s_y \sqrt{1 - r^2}\,\sqrt{n - 1}$. The $n$ components of $Y$ will be
our $y_1, y_2, \dots, y_n$.
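The recipe can be sketched as follows in plain Python (a hypothetical illustration, not the spreadsheet's implementation; the starting $X$ values and the seed vector for $Y^{\perp}$ are arbitrary choices):

```python
import math

def make_dataset(xbar, ybar, sx, sy, r, n=8):
    """Build (x_i, y_i) with exactly the prescribed means, standard
    deviations, and correlation, following the recipe in the text:
    Y = ybar*1 + (r*sy/sx)*(X - xbar*1) + Yperp."""
    # Step 1: any X with mean xbar and standard deviation sx.
    raw = [float(i) for i in range(n)]          # arbitrary starting values
    m = sum(raw) / n
    s = math.sqrt(sum((v - m) ** 2 for v in raw) / (n - 1))
    x = [xbar + sx * (v - m) / s for v in raw]

    # Step 2: a vector orthogonal to both 1 and X - xbar*1 (Gram-Schmidt).
    xc = [v - xbar for v in x]                  # X - xbar*1
    seed = [float(i * i) for i in range(n)]     # arbitrary seed vector
    w = [v - sum(seed) / n for v in seed]       # remove component along 1
    c = sum(wi * xi for wi, xi in zip(w, xc)) / sum(xi * xi for xi in xc)
    w = [wi - c * xi for wi, xi in zip(w, xc)]  # remove component along xc

    # Step 3: rescale to the required length sy*sqrt(1-r^2)*sqrt(n-1).
    target = sy * math.sqrt(1.0 - r * r) * math.sqrt(n - 1)
    wnorm = math.sqrt(sum(wi * wi for wi in w))
    yperp = [target * wi / wnorm for wi in w]

    # Step 4: assemble Y = ybar*1 + (r*sy/sx)*(X - xbar*1) + Yperp.
    a = r * sy / sx
    y = [ybar + a * xi + p for xi, p in zip(xc, yperp)]
    return x, y
```

Because every step is exact, the resulting sample statistics match the prescribed values up to floating-point rounding.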
Suppose, in addition, we would like the data set $(x_1, y_1), \dots, (x_n, y_n)$ to mimic the result of repeating
$n$ times a random procedure with two random variables $x$ and $y$ following jointly the bivariate
normal distribution with parameters $\mu_x$, $\mu_y$, $\sigma_x$, $\sigma_y$, and $\rho$. Then the variable $x$ follows
$N(\mu_x, \sigma_x^2)$, and the variable $y$, when conditioned on $x = x_0$, follows
$N(\mu_y + \beta(x_0 - \mu_x),\ (1 - \rho^2)\sigma_y^2)$, where $\beta = \rho \sigma_y / \sigma_x$. To create $X = (x_1, x_2, \dots, x_n)$, we first
generate $n$ numbers one by one using a random generator modeled on $N(0, 1)$. Denote the result by
$Z' = (z_1, z_2, \dots, z_n)$. Then extract the part of $Z'$ orthogonal to $\mathbf{1}$, arriving at $Z_0 = Z' - \bar{Z'}\mathbf{1}$, with
$Z_0 \perp \mathbf{1}$. Scale it properly (to ensure the desired standard deviation $s_x$) and then add $\bar{x}\mathbf{1}$ (to
ensure the desired mean $\bar{x}$) to get

$$X = \bar{x}\mathbf{1} + \bigl(s_x \sqrt{n - 1} / \|Z_0\|\bigr) Z_0.$$

Conditioned on $(x_1, x_2, \dots, x_n)$, the vector $(y_1, y_2, \dots, y_n)$ follows

$$\mu_y \mathbf{1} + \beta (X - \mu_x \mathbf{1}) + \sqrt{1 - \rho^2}\,\sigma_y\,(z'_1, z'_2, \dots, z'_n),$$

where $z'_1, z'_2, \dots, z'_n$ are independent, each being $N(0, 1)$. We first generate $n$ numbers one by one
using a random generator modeled on $N(0, 1)$. This gives $Z'' = (z'_1, z'_2, \dots, z'_n)$. Then $Z''$ can be
decomposed into three mutually orthogonal pieces as

$$Z'' = \bar{Z''}\mathbf{1} + \frac{Z_0 \cdot (Z'' - \bar{Z''}\mathbf{1})}{\|Z_0\|^2}\,Z_0 + Z^{\perp},$$

with the first piece parallel to $\mathbf{1}$ and the second parallel to $Z_0$ (and thus to $X - \bar{x}\mathbf{1}$). Now we keep the direction of
each of the three pieces while fine-tuning their lengths to the desired values. This leads to

$$Y = \bar{y}\mathbf{1} + (r s_y / s_x)(X - \bar{x}\mathbf{1}) + Y^{\perp}, \qquad \text{where} \quad Y^{\perp} = \bigl(s_y \sqrt{1 - r^2}\,\sqrt{n - 1} / \|Z^{\perp}\|\bigr) Z^{\perp},$$

which has length $\|Y^{\perp}\| = s_y \sqrt{1 - r^2}\,\sqrt{n - 1}$. The $n$ components of $Y$ are taken to be $y_1, y_2, \dots, y_n$. We now have the
desired $(x_1, y_1), \dots, (x_n, y_n)$.
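The full procedure, using standard normal draws for both $Z'$ and $Z''$, might be sketched as follows in plain Python with `random.gauss` (the function name and seeding are my own choices, not the spreadsheet's):

```python
import math
import random

def _mean(v):
    return sum(v) / len(v)

def _norm(v):
    return math.sqrt(sum(c * c for c in v))

def normal_like_dataset(xbar, ybar, sx, sy, r, n, seed=0):
    """Generate (x, y) with exact sample mean/sd/correlation while mimicking
    an i.i.d. sample from a bivariate normal, per the procedure above."""
    rng = random.Random(seed)

    # X: center N(0,1) draws (Z' -> Z0), rescale to sd sx, shift to mean xbar.
    zp = [rng.gauss(0.0, 1.0) for _ in range(n)]      # Z'
    m1 = _mean(zp)
    z0 = [c - m1 for c in zp]                         # Z0, orthogonal to 1
    x = [xbar + sx * math.sqrt(n - 1) * c / _norm(z0) for c in z0]

    # Z'': strip the components along 1 and Z0, keeping only Zperp.
    zpp = [rng.gauss(0.0, 1.0) for _ in range(n)]     # Z''
    m2 = _mean(zpp)
    w = [c - m2 for c in zpp]
    coef = sum(wi * zi for wi, zi in zip(w, z0)) / sum(zi * zi for zi in z0)
    zperp = [wi - coef * zi for wi, zi in zip(w, z0)]

    # Rescale Zperp and assemble Y = ybar*1 + (r*sy/sx)*(X - xbar*1) + Yperp.
    target = sy * math.sqrt(1.0 - r * r) * math.sqrt(n - 1)
    yperp = [target * c / _norm(zperp) for c in zperp]
    a = r * sy / sx
    y = [ybar + a * (xi - xbar) + p for xi, p in zip(x, yperp)]
    return x, y
```

The sample statistics come out exact by construction; only the "shape" of the scatter is random.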
The implementation of the strategy laid out here appears mainly in the sheet with tab “Norm Gen” in
version 2.0 or newer of the Excel spreadsheet “creating_scatterplot_with_desired_r.xls”, which is part of
a collection of spreadsheets for use in an introductory Statistics course, a collection originally
created by Bill Lepowsky of the Math Department of Laney College in 2010.