A Technical Note Accompanying the Excel Spreadsheet “creating_scatterplot_with_desired_r.xls”
Hungwen Chang
Department of Mathematics
Laney College
May 7, 2011
Revision history:
[v.1.01, April 1, 2012] In the paragraph following the first displayed formula, an erroneous statement on cosine, not used elsewhere, is
removed from one sentence.
[v.1.02, April 1, 2012] A typo was fixed.
This note describes the strategy taken by the spreadsheet “creating_scatterplot_with_desired_r.xls”
(starting with version 2.0 of March 2011) to create a bivariate data set with prescribed values for $\bar{x}$, $\bar{y}$,
$s_x$, $s_y$, and $r$. The strategy relies on a formulation of linear regression in terms of linear algebra,
which we first review below.
Given a bivariate data set $(x_1, y_1), \dots, (x_n, y_n)$, the regression line $y_p = b + ax$ is the least squares
line, i.e. the line that minimizes the sum of squares of the residuals.
Denote by $\mathbf{1}$ the vector $(1, 1, \dots, 1) \in \mathbb{R}^n$, by $X$ the vector $(x_1, x_2, \dots, x_n) \in \mathbb{R}^n$, and by $Y$ the vector
$(y_1, y_2, \dots, y_n) \in \mathbb{R}^n$. Assume that the $x_i$'s are not all identical, so that $\mathbf{1}$ and $X$ are linearly
independent. Endow $\mathbb{R}^n$ with the usual Euclidean inner product. Then finding the least squares line
$y_p = b + ax$ is the same as finding $b$ and $a$ to minimize $\|Y - (b\mathbf{1} + aX)\|$. This $b\mathbf{1} + aX$ is therefore
the orthogonal projection of $Y$ onto the 2-dimensional linear subspace of $\mathbb{R}^n$ spanned by $\mathbf{1}$ and $X$.
For any $V = (v_1, v_2, \dots, v_n) \in \mathbb{R}^n$, let $\bar{V}$ stand for the mean of the components, i.e.
$\bar{V} = (v_1 + v_2 + \cdots + v_n)/n$. The orthogonal projection of $V$ onto the linear span of $\mathbf{1}$ is thus $\bar{V}\mathbf{1}$.
Shifting all components of $V$ together to get $V - \bar{V}\mathbf{1}$, in order to force the mean to 0, is nothing other
than extracting the part of $V$ orthogonal to $\mathbf{1} \in \mathbb{R}^n$.
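This centering step can be seen in a couple of lines of Python (a sketch of my own, not taken from the spreadsheet): subtracting the mean from every component leaves a vector whose dot product with $\mathbf{1}$, i.e. the sum of its components, is 0.

```python
def center(v):
    """Subtract the mean from each component: returns V - Vbar*1,
    the part of V orthogonal to the all-ones vector."""
    vbar = sum(v) / len(v)
    return [c - vbar for c in v]

v0 = center([2.0, 4.0, 9.0])   # mean is 5, so this gives [-3.0, -1.0, 4.0]
```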
The 2-dimensional linear span of $\mathbf{1}$ and $X$ has, as an orthogonal basis, $\mathbf{1}$ and $X - \bar{X}\mathbf{1}$. So the orthogonal
projection of $Y$ onto this 2-dimensional linear subspace is

$$\bar{Y}\mathbf{1} + \frac{(X - \bar{X}\mathbf{1}) \cdot (Y - \bar{Y}\mathbf{1})}{\|X - \bar{X}\mathbf{1}\|^2}\,(X - \bar{X}\mathbf{1}).$$

Comparing this with $b\mathbf{1} + aX$, we see

$$a = \frac{(X - \bar{X}\mathbf{1}) \cdot (Y - \bar{Y}\mathbf{1})}{\|X - \bar{X}\mathbf{1}\|^2} \qquad \text{and} \qquad b = \bar{Y} - a\bar{X}.$$

Moreover, $b\mathbf{1} + aX$ can be written as $\bar{Y}\mathbf{1} + a(X - \bar{X}\mathbf{1})$. The vector $Y$ is thus decomposed as

$$Y = \bar{Y}\mathbf{1} + a(X - \bar{X}\mathbf{1}) + Y^{\perp},$$

with the three vectors on the right-hand side mutually orthogonal. The first two pieces
$\bar{Y}\mathbf{1} + a(X - \bar{X}\mathbf{1})$ encode the regression line, whereas $Y^{\perp}$ encodes the $n$ residuals. The correlation
coefficient $r$, written as

$$r = \frac{(X - \bar{X}\mathbf{1}) \cdot (Y - \bar{Y}\mathbf{1})}{\|X - \bar{X}\mathbf{1}\|\,\|Y - \bar{Y}\mathbf{1}\|},$$

is nothing other than the cosine of the angle between $Y - \bar{Y}\mathbf{1}$ and $X - \bar{X}\mathbf{1}$. Since $Y - \bar{Y}\mathbf{1}$ is
orthogonally decomposed as $Y - \bar{Y}\mathbf{1} = a(X - \bar{X}\mathbf{1}) + Y^{\perp}$, it follows that the square of the cosine
of this angle is

$$r^2 = \frac{\|a(X - \bar{X}\mathbf{1})\|^2}{\|a(X - \bar{X}\mathbf{1}) + Y^{\perp}\|^2},$$

and that

$$1 - r^2 = \frac{\|Y^{\perp}\|^2}{\|a(X - \bar{X}\mathbf{1}) + Y^{\perp}\|^2}.$$

So

$$\|Y^{\perp}\| = \sqrt{1 - r^2}\,\|a(X - \bar{X}\mathbf{1}) + Y^{\perp}\| = \sqrt{1 - r^2}\,\|Y - \bar{Y}\mathbf{1}\| = s_y \sqrt{1 - r^2}\,\sqrt{n - 1}.$$
Also observe that

$$a = \frac{(X - \bar{X}\mathbf{1}) \cdot (Y - \bar{Y}\mathbf{1})}{\|X - \bar{X}\mathbf{1}\|^2} = \frac{r\,\|Y - \bar{Y}\mathbf{1}\|}{\|X - \bar{X}\mathbf{1}\|} = \frac{r s_y}{s_x}.$$
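Both identities can be checked numerically on any data set. The sketch below (plain Python; names are my own) computes each side of $a = r s_y / s_x$ and of $\|Y^{\perp}\| = s_y\sqrt{1 - r^2}\,\sqrt{n - 1}$, the latter by measuring the actual residual vector of the fitted line:

```python
import math

def verify_identities(x, y):
    """Numerically check a = r*sy/sx and ||Yperp|| = sy*sqrt(1-r^2)*sqrt(n-1);
    returns the two sides of each identity for comparison."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sx = math.sqrt(sum((v - mx) ** 2 for v in x) / (n - 1))
    sy = math.sqrt(sum((v - my) ** 2 for v in y) / (n - 1))
    sxy = sum((p - mx) * (q - my) for p, q in zip(x, y)) / (n - 1)
    r = sxy / (sx * sy)                 # correlation coefficient
    a = sxy / sx ** 2                   # least-squares slope
    b = my - a * mx                     # intercept
    # ||Yperp||: norm of the residual vector of the fitted line.
    resid = math.sqrt(sum((q - (b + a * p)) ** 2 for p, q in zip(x, y)))
    return (a, r * sy / sx), (resid, sy * math.sqrt(1 - r * r) * math.sqrt(n - 1))
```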
This geometric picture makes it easier to see how to generate a data set with prescribed values for $\bar{x}$,
$\bar{y}$, $s_x$, $s_y$, and $r$. First, calculate $a = r s_y / s_x$. Then choose $X = (x_1, x_2, \dots, x_n)$ with the desired
mean $\bar{X} = \bar{x}$ and the desired standard deviation $s_x$. Then let

$$Y = \bar{y}\mathbf{1} + a(X - \bar{x}\mathbf{1}) + Y^{\perp} = \bar{y}\mathbf{1} + (r s_y / s_x)(X - \bar{x}\mathbf{1}) + Y^{\perp},$$

where $Y^{\perp}$ can be any vector orthogonal to $\mathbf{1}$ and $X - \bar{x}\mathbf{1}$, with length $\|Y^{\perp}\| = s_y \sqrt{1 - r^2}\,\sqrt{n - 1}$. The $n$ components of $Y$ will be
our $y_1, y_2, \dots, y_n$.
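The recipe can be sketched as follows in plain Python (a hypothetical illustration, not the spreadsheet's implementation; the starting $X$ values and the seed vector for $Y^{\perp}$ are arbitrary choices):

```python
import math

def make_dataset(xbar, ybar, sx, sy, r, n=8):
    """Build (x_i, y_i) with exactly the prescribed means, standard
    deviations, and correlation, following the recipe in the text:
    Y = ybar*1 + (r*sy/sx)*(X - xbar*1) + Yperp."""
    # Step 1: any X with mean xbar and standard deviation sx.
    raw = [float(i) for i in range(n)]          # arbitrary starting values
    m = sum(raw) / n
    s = math.sqrt(sum((v - m) ** 2 for v in raw) / (n - 1))
    x = [xbar + sx * (v - m) / s for v in raw]

    # Step 2: a vector orthogonal to both 1 and X - xbar*1 (Gram-Schmidt).
    xc = [v - xbar for v in x]                  # X - xbar*1
    seed = [float(i * i) for i in range(n)]     # arbitrary seed vector
    w = [v - sum(seed) / n for v in seed]       # remove component along 1
    c = sum(wi * xi for wi, xi in zip(w, xc)) / sum(xi * xi for xi in xc)
    w = [wi - c * xi for wi, xi in zip(w, xc)]  # remove component along xc

    # Step 3: rescale to the required length sy*sqrt(1-r^2)*sqrt(n-1).
    target = sy * math.sqrt(1.0 - r * r) * math.sqrt(n - 1)
    wnorm = math.sqrt(sum(wi * wi for wi in w))
    yperp = [target * wi / wnorm for wi in w]

    # Step 4: assemble Y = ybar*1 + (r*sy/sx)*(X - xbar*1) + Yperp.
    a = r * sy / sx
    y = [ybar + a * xi + p for xi, p in zip(xc, yperp)]
    return x, y
```

Because every step is exact, the resulting sample statistics match the prescribed values up to floating-point rounding.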
Suppose, in addition, we would like the data set $(x_1, y_1), \dots, (x_n, y_n)$ to mimic the result of repeating
$n$ times a random procedure with two random variables $x$ and $y$ following jointly the bivariate
normal distribution with parameters $\mu_x$, $\mu_y$, $\sigma_x$, $\sigma_y$, and $\rho$. Then the variable $x$ follows
$N(\mu_x, \sigma_x^2)$, and the variable $y$, when conditioned on $x = x_0$, follows
$N(\mu_y + \beta(x_0 - \mu_x),\ (1 - \rho^2)\sigma_y^2)$, where $\beta = \rho \sigma_y / \sigma_x$. To create $X = (x_1, x_2, \dots, x_n)$, we first
generate $n$ numbers one by one using a random generator modeled on $N(0, 1)$. Denote the result by
$Z' = (z_1, z_2, \dots, z_n)$. Then extract the part of $Z'$ orthogonal to $\mathbf{1}$, arriving at $Z_0 = Z' - \bar{Z'}\mathbf{1}$, with
$Z_0 \perp \mathbf{1}$. Scale it properly (to ensure the desired standard deviation $s_x$) and then add $\bar{x}\mathbf{1}$ (to
ensure the desired mean $\bar{x}$) to get

$$X = \bar{x}\mathbf{1} + \bigl(s_x \sqrt{n - 1} / \|Z_0\|\bigr) Z_0.$$

Conditioned on $(x_1, x_2, \dots, x_n)$, the vector $(y_1, y_2, \dots, y_n)$ follows

$$\mu_y \mathbf{1} + \beta (X - \mu_x \mathbf{1}) + \sqrt{1 - \rho^2}\,\sigma_y\,(z'_1, z'_2, \dots, z'_n),$$

where $z'_1, z'_2, \dots, z'_n$ are independent, each being $N(0, 1)$. We first generate $n$ numbers one by one
using a random generator modeled on $N(0, 1)$. This gives $Z'' = (z'_1, z'_2, \dots, z'_n)$. Then $Z''$ can be
decomposed into three mutually orthogonal pieces as

$$Z'' = \bar{Z''}\mathbf{1} + \frac{Z_0 \cdot (Z'' - \bar{Z''}\mathbf{1})}{\|Z_0\|^2}\,Z_0 + Z^{\perp},$$

with the first piece parallel to $\mathbf{1}$ and the second parallel to $Z_0$ (and thus to $X - \bar{x}\mathbf{1}$). Now we keep the direction of
each of the three pieces while fine-tuning their lengths to the desired values. This leads to

$$Y = \bar{y}\mathbf{1} + (r s_y / s_x)(X - \bar{x}\mathbf{1}) + Y^{\perp}, \qquad \text{where} \quad Y^{\perp} = \bigl(s_y \sqrt{1 - r^2}\,\sqrt{n - 1} / \|Z^{\perp}\|\bigr) Z^{\perp},$$

which has length $\|Y^{\perp}\| = s_y \sqrt{1 - r^2}\,\sqrt{n - 1}$. The $n$ components of $Y$ are taken to be $y_1, y_2, \dots, y_n$. We now have the
desired $(x_1, y_1), \dots, (x_n, y_n)$.
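The full procedure, using standard normal draws for both $Z'$ and $Z''$, might be sketched as follows in plain Python with `random.gauss` (the function name and seeding are my own choices, not the spreadsheet's):

```python
import math
import random

def _mean(v):
    return sum(v) / len(v)

def _norm(v):
    return math.sqrt(sum(c * c for c in v))

def normal_like_dataset(xbar, ybar, sx, sy, r, n, seed=0):
    """Generate (x, y) with exact sample mean/sd/correlation while mimicking
    an i.i.d. sample from a bivariate normal, per the procedure above."""
    rng = random.Random(seed)

    # X: center N(0,1) draws (Z' -> Z0), rescale to sd sx, shift to mean xbar.
    zp = [rng.gauss(0.0, 1.0) for _ in range(n)]      # Z'
    m1 = _mean(zp)
    z0 = [c - m1 for c in zp]                         # Z0, orthogonal to 1
    x = [xbar + sx * math.sqrt(n - 1) * c / _norm(z0) for c in z0]

    # Z'': strip the components along 1 and Z0, keeping only Zperp.
    zpp = [rng.gauss(0.0, 1.0) for _ in range(n)]     # Z''
    m2 = _mean(zpp)
    w = [c - m2 for c in zpp]
    coef = sum(wi * zi for wi, zi in zip(w, z0)) / sum(zi * zi for zi in z0)
    zperp = [wi - coef * zi for wi, zi in zip(w, z0)]

    # Rescale Zperp and assemble Y = ybar*1 + (r*sy/sx)*(X - xbar*1) + Yperp.
    target = sy * math.sqrt(1.0 - r * r) * math.sqrt(n - 1)
    yperp = [target * c / _norm(zperp) for c in zperp]
    a = r * sy / sx
    y = [ybar + a * (xi - xbar) + p for xi, p in zip(x, yperp)]
    return x, y
```

The sample statistics come out exact by construction; only the "shape" of the scatter is random.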
The implementation of the strategy laid out here appears mainly in the sheet with tab “Norm Gen” in
version 2.0 or newer of the Excel spreadsheet “creating_scatterplot_with_desired_r.xls”, which is part of
a collection of spreadsheets for use in an introductory Statistics course, a collection originally
created by Bill Lepowsky of the Math Department of Laney College in 2010.