Lecture 2

Econ 496/895
Introduction to Design and Analysis of Economics Experiments
Professor Daniel Houser
Background
A. Useful definitions and results from probability theory.
1. A “random experiment” is an experiment for which the outcome cannot be
predicted with certainty, although the set of all possible outcomes can be
described.
2. The sample space S of a random experiment is the set of all possible outcomes.
3. Any function X : S → ℝ is a “random variable.” Note that the choice of function X generally depends on the purpose of the experiment. For example, in an experiment on brain activation one could look for activation in a single area, for the average activation over several areas, or for certain activation paths.
Ex: Tossing a six-sided die and observing the number of spots facing up is a random experiment. The sample space is S = {1, 2, 3, 4, 5, 6} and the random variable is X(s) = s, the identity map. An alternative random variable would be X(s) = 1 if s > 4 and X(s) = 0 otherwise, which provides an indicator for whether the realized outcome exceeded four.
4. Let A be a subset of S, A ⊆ S. Suppose P is a set function defined on the set of all subsets of S, assigning a real number P(A) to every A ⊆ S. P is called a “probability” if and only if it satisfies the following.
(i) For all subsets A of S, P(A) ≥ 0,
(ii) P(S) = 1,
(iii) If A_1, A_2, ..., A_k are such that A_m ∩ A_n = ∅ for m ≠ n, then
P(A_1 ∪ ... ∪ A_k) = P(A_1) + P(A_2) + ... + P(A_k), for any k.
Ex: In the die tossing example P({1}) = P({2}) = ... = P({6}) = 1/6, P({1, 2}) = 1/3, and P({1, 2} ∪ {2, 3}) = 0.5.
5. The “conditional probability” of A given that B has occurred is
P(A | B) = P(A ∩ B) / P(B).
Ex: In the die tossing example, the probability of a “2” given that a “1” or “2” has been observed is P(2 | 1 or 2) = P(2 ∩ [1 or 2]) / P(1 or 2) = (1/6) / (1/3) = 0.5.
6. Events A and B are “independent” if and only if P(A ∩ B) = P(A)P(B).
From the definition of conditional probability we see that if two events are
independent then knowledge about the occurrence of one does not help to improve
predictions with regard to the likelihood of the other.
7. Events A, B and C are mutually independent if they are pairwise independent and P(A ∩ B ∩ C) = P(A)P(B)P(C). This can be extended easily to four or more events: each pair, triplet, etc. must satisfy independence.
Ex: Suppose three coins are tossed simultaneously, but these are trick coins: with equal probability, one of the four outcomes {H,T,H}, {H,H,T}, {T,T,T}, or {T,H,H} occurs, where H is heads, T is tails, and the order is {Coin 1, Coin 2, Coin 3}. Then it is easy to see that the realizations are pairwise independent, e.g. P(Coin 1 = H | Coin 2 = T) = 0.5, although there is obviously dependence in their joint realization since P(H,H,H) is zero, not 1/8.
8. From the definition of conditional probability one can deduce Bayes’ Formula:
P(A | B) = P(A)P(B | A) / P(B).
This is often referred to as the posterior probability of A given B.
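For concreteness, here is a small Python sketch of Bayes’ Formula with purely illustrative numbers (a hypothetical prior P(A), likelihood P(B | A), and P(B) obtained from the law of total probability; none of the values come from the notes).

    # Bayes' Formula: P(A | B) = P(A) * P(B | A) / P(B).
    # The numbers below are purely illustrative.
    p_A = 0.3             # prior probability of A
    p_B_given_A = 0.8     # likelihood of B when A holds
    p_B_given_notA = 0.2  # likelihood of B when A does not hold

    # Law of total probability for the denominator.
    p_B = p_A * p_B_given_A + (1 - p_A) * p_B_given_notA

    # Posterior probability of A given B.
    p_A_given_B = p_A * p_B_given_A / p_B
    print(f"P(A | B) = {p_A_given_B:.3f}")  # 0.24 / 0.38, about 0.632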
9. If X is a random variable, we will often denote its “probability density function” (pdf) by f(·). If X is discrete, then f(x) = P(X = x). If X is continuous then P(X ∈ A) = ∫_A f(x) dx. A pdf is nonnegative and integrates to unity over the sample space.
10. The “cumulative distribution function” (cdf) of a random variable X is defined by F(x) = P(X ≤ x), which in the discrete case equals Σ_{t ≤ x} f(t). A cdf lies between zero and one and is nondecreasing.
11. The “expected value” of a random variable X, E(X), is ∫ x f(x) dx, providing that the value of this integral is finite. This is the weighted average of all possible values of X, where the weights are interpreted as the chance that the observation will occur. If the value of the integral is infinite, then the expectation is said not to exist.
Expected values satisfy:
- E(c) = c for every constant c.
- E(cX) = cE(X) for constant c and rv X.
- E(u(X) + v(X)) = E(u(X)) + E(v(X)) for functions u and v.
12. The variance of a rv X is its expected squared deviation from its mean, σ_X² = ∫ (x − E(X))² f(x) dx. The positive square root of the variance is called the standard deviation. The variance is a measure of the dispersion of the distribution of X about its mean. Chebyshev’s inequality shows that the probability that a realization of X is within k > 1 standard deviations of its mean is at least 1 − 1/k². Hence, at least 84% of the mass of any rv lies within 2.5 sd’s of its mean.
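Chebyshev’s bound can be checked empirically; the sketch below uses an exponential distribution purely as an illustrative (non-normal) example and compares the observed fraction of draws within k standard deviations of the mean to the bound 1 − 1/k².

    import numpy as np

    rng = np.random.default_rng(seed=0)

    # Draws from a (deliberately non-normal) exponential distribution.
    x = rng.exponential(scale=1.0, size=100_000)
    mu, sd = x.mean(), x.std()

    for k in (1.5, 2.0, 2.5):
        within = np.mean(np.abs(x - mu) <= k * sd)
        print(f"k = {k}: fraction within k sd's = {within:.3f}, "
              f"Chebyshev lower bound = {1 - 1/k**2:.3f}")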
13. The covariance of two rv’s X and Y is Cov(X, Y) = E([X − μ_x][Y − μ_y]). Their correlation coefficient is ρ_xy = Cov(X, Y) / (σ_x σ_y). The correlation coefficient lies between −1 and +1. Intuition about the meaning of covariance and correlation is gained by noting that the least squares regression of Y on X is y = μ_y + ρ_xy (σ_y / σ_x)(x − μ_x).
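The link between the correlation coefficient and the least squares slope can be verified numerically. The sketch below uses simulated data with illustrative parameters (a true slope of 0.5) and compares ρ σ_y/σ_x to the slope recovered by np.polyfit.

    import numpy as np

    rng = np.random.default_rng(seed=0)

    # Simulate correlated (X, Y) pairs: Y = 2 + 0.5 * X + noise (illustrative).
    n = 50_000
    x = rng.normal(loc=10.0, scale=3.0, size=n)
    y = 2.0 + 0.5 * x + rng.normal(scale=2.0, size=n)

    cov_xy = np.cov(x, y)[0, 1]
    rho = np.corrcoef(x, y)[0, 1]
    slope_from_rho = rho * y.std() / x.std()   # rho * sigma_y / sigma_x
    slope_ols = np.polyfit(x, y, deg=1)[0]     # least squares slope

    print(f"Cov(X,Y) = {cov_xy:.3f}, rho = {rho:.3f}")
    print(f"rho*sd_y/sd_x = {slope_from_rho:.3f}, OLS slope = {slope_ols:.3f}")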
14. A rv X has a “normal” (or Gaussian) distribution if its pdf is
f(x) = [1 / √(2πσ_X²)] exp(−(x − μ_x)² / (2σ_X²)), −∞ < x < ∞.
If X follows a normal distribution we write X ~ N(μ_x, σ_X²).
- If X ~ N(μ, σ²) then Z = (X − μ)/σ ~ N(0,1).
- The finite sum of independent normally distributed rv’s follows a normal distribution.
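As a quick numerical check (a sketch with arbitrary parameter choices), standardizing a normal sample should produce mean ≈ 0 and variance ≈ 1, and a sum of independent normals should have the mean and variance implied by the bullet above.

    import numpy as np

    rng = np.random.default_rng(seed=0)

    mu, sigma = 5.0, 2.0                     # illustrative parameters
    x = rng.normal(loc=mu, scale=sigma, size=200_000)

    # Standardize: Z = (X - mu) / sigma should be approximately N(0, 1).
    z = (x - mu) / sigma
    print(f"mean(Z) = {z.mean():.3f}, var(Z) = {z.var():.3f}")

    # Sum of two independent normals: N(1, 1) + N(2, 4) should be N(3, 5).
    s = rng.normal(1, 1, 200_000) + rng.normal(2, 2, 200_000)
    print(f"mean(S) = {s.mean():.3f}, var(S) = {s.var():.3f}")  # about 3 and 5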
B. Useful definitions and results from sampling distribution theory.
1. A “statistic” is a function of a random sample of realizations of random variables.
The mean and variance of a sample are both statistics. Since statistics are
functions of random variables, they are themselves random variables.
2. “Sampling distribution theory” is concerned with determining the distribution of
statistics.
3. Example: Finding the sampling distribution of a function of random variables in
a random sample.
An unbiased four-sided die is cast twice with outcomes X1 and X2. Find the
expected value and variance of Y = X1 + X2.
Y takes values in {2, 3, …, 8}. Let g(Y) denote the pdf of Y.
g(2) = P(Y=2) = P(X1=1 and X2=1) = (1/4)(1/4)=1/16.
g(3) = P(Y=3) = P(X1=1 and X2=2) + P(X1=2 and X2=1) = 2/16.
similarly:
g(4) = 3/16.
g(5) = 4/16.
g(6) = 3/16.
g(7) = 2/16.
g(8) = 1/16.
Hence: E[Y] = 2(1/16)+3(2/16)+4(3/16)+5(4/16)+6(3/16)+7(2/16)+8(1/16)=5.
One may then calculate Var(Y) = E([Y − 5]²) = 5/2.
- Note that in many cases sampling distributions can be simulated more quickly and easily than they can be computed (see the sketch below).
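In that spirit, here is a short Python sketch that simulates the sampling distribution of Y = X1 + X2 for the four-sided die example and compares the simulated frequencies, mean, and variance to the exact values derived above.

    import numpy as np

    rng = np.random.default_rng(seed=0)

    # Two independent casts of a fair four-sided die, repeated many times.
    n_sims = 200_000
    x1 = rng.integers(1, 5, size=n_sims)
    x2 = rng.integers(1, 5, size=n_sims)
    y = x1 + x2

    # Simulated pdf of Y versus the exact g(y) = {1, 2, 3, 4, 3, 2, 1}/16.
    values, counts = np.unique(y, return_counts=True)
    for v, c in zip(values, counts):
        print(f"g({v}) ~ {c / n_sims:.4f}")

    print(f"simulated E[Y]   = {y.mean():.3f} (exact: 5)")
    print(f"simulated Var(Y) = {y.var():.3f} (exact: 2.5)")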
4. In many empirical applications the sampling distributions of the statistics of
interest turn out to be chi-square, t or F. We next discuss each of these
distributions.
- The chi-square distribution.
(i) A rv X is said to have a chi-square distribution with r degrees of freedom, χ²(r), if its pdf is
f(x) = [1 / (Γ(r/2) 2^(r/2))] x^(r/2 − 1) e^(−x/2), 0 < x < ∞.
E[X] = r, Var(X) = 2r.
(ii) If the rv X is N(μ, σ²) then the rv V = [(X − μ)/σ]² is χ²(1). That is, the square of a standard normal rv follows a chi-square distribution with one degree of freedom.
(iii) If X_1 is χ²(r_1) and X_2 is χ²(r_2) and if they are independent, then X_1 + X_2 ~ χ²(r_1 + r_2).
(iv) Let Z_1, ..., Z_k be independent standard normal rv’s. Then the sum of the squares of these rv’s has a chi-square distribution with k degrees of freedom: W = Z_1² + ... + Z_k² ~ χ²(k).
(v) If X_1, ..., X_n have mutually independent normal distributions N(μ_i, σ_i²) then the distribution of W = Σ_{i=1}^{n} (X_i − μ_i)²/σ_i² is χ²(n).
(vi) Suppose X_1, ..., X_n are independent random draws from a N(μ_X, σ_X²) distribution. Define
X̄ = (1/n) Σ_{i=1}^{n} X_i (the sample mean),
S² = (1/(n − 1)) Σ_{i=1}^{n} (X_i − X̄)² (estimate of the variance),
then
(a) X̄ and S² are independent.
(b) (n − 1)S²/σ_X² = Σ_{i=1}^{n} (X_i − X̄)²/σ_X² is χ²(n − 1).
Use of the sample mean instead of the true mean reduces the degrees of freedom by one. This reduces the expected value of the statistic relative to its value under the true mean. The reason is that the sum of squared deviations of the observations from the sample mean can never exceed the sum of squared deviations from the population mean, since the sample mean minimizes the sum of squared deviations.
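The claim in (b), and the loss of one degree of freedom discussed above, can be checked by simulation. The sketch below uses arbitrary values of μ, σ, and n; it compares the simulated mean of (n − 1)S²/σ² to the χ²(n − 1) mean of n − 1, and shows that using the true mean restores the full n degrees of freedom.

    import numpy as np

    rng = np.random.default_rng(seed=0)

    mu, sigma, n = 3.0, 2.0, 10          # illustrative parameters
    n_sims = 100_000

    samples = rng.normal(mu, sigma, size=(n_sims, n))
    s2 = samples.var(axis=1, ddof=1)     # sample variance with divisor n - 1

    stat = (n - 1) * s2 / sigma**2       # should be chi-square(n - 1)
    print(f"mean of (n-1)S^2/sigma^2  = {stat.mean():.3f} (chi-square mean: {n - 1})")
    print(f"variance of the statistic = {stat.var():.3f} (chi-square variance: {2 * (n - 1)})")

    # Using the true mean instead of the sample mean restores the full n degrees of freedom.
    stat_true_mean = ((samples - mu) ** 2).sum(axis=1) / sigma**2
    print(f"mean using true mean      = {stat_true_mean.mean():.3f} (chi-square mean: {n})")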
- The t distribution
(i) If Z is a rv that is N(0,1), if U is a rv that is chi-square with r degrees of freedom, and if Z and U are independent, then
T = Z / √(U/r)
has a t distribution with r degrees of freedom.
- This distribution was discovered by W. Gosset, who published under the name of “Student.”
(ii) The pdf of a t distribution is
g(t) = Γ((r + 1)/2) / [√(πr) Γ(r/2) (1 + t²/r)^((r+1)/2)], −∞ < t < ∞.
(iii) When r = 1 the t distribution is the Cauchy distribution, hence its mean and variance do not exist. The mean exists and is zero when r ≥ 2, and the variance exists and is equal to r/(r − 2) when r ≥ 3.
(iv) The t distribution is symmetric about the origin but has fatter tails than
the normal distribution. The tails shrink as r grows larger: as r
approaches infinity the t distribution approaches the normal
distribution.
(v) Suppose X_1, ..., X_n is a random sample from a N(μ_X, σ_X²) distribution. Then (X̄ − μ_X)/(σ_X/√n) is N(0,1), (n − 1)S²/σ_X² is χ²(n − 1), and the two are independent. Thus,
T = [(X̄ − μ_X)/(σ_X/√n)] / √[(n − 1)S²/σ_X² / (n − 1)] = (X̄ − μ_X)/(S/√n) is t(n − 1).
- This provides a convenient way to test whether a mean is statistically different from, say, zero when the variance of the underlying normal distribution is not known but must be estimated (see the sketch below).
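As an illustration of this use of the t statistic (a sketch with simulated data, not taken from the notes), the hand-computed statistic (X̄ − μ_0)/(S/√n) can be cross-checked against scipy's built-in one-sample t test.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(seed=0)

    # Simulated sample from a normal distribution with unknown variance.
    x = rng.normal(loc=0.5, scale=2.0, size=20)
    n = x.size
    mu0 = 0.0                                   # null hypothesis: mean equals zero

    # Hand-computed t statistic: (xbar - mu0) / (S / sqrt(n)), which is t(n - 1) under H0.
    t_stat = (x.mean() - mu0) / (x.std(ddof=1) / np.sqrt(n))
    p_value = 2 * stats.t.sf(abs(t_stat), df=n - 1)   # two-sided p-value
    print(f"by hand : t = {t_stat:.3f}, p = {p_value:.3f}")

    # Cross-check against scipy's implementation (two-sided by default).
    print("scipy   :", stats.ttest_1samp(x, popmean=mu0))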
- The F distribution
(i) If U and V are independent chi-square variables with q and r degrees of freedom, respectively, then F = (U/q)/(V/r) has an F distribution with q and r degrees of freedom.
- The F distribution is named in honor of R. A. Fisher.
(ii) The F statistic can take only positive values.
(iii) E(F) = r/(r − 2) when r > 2, Var(F) = 2r²(q + r − 2) / [q(r − 2)²(r − 4)] when r > 4.
(iv) Let X_1, ..., X_n and Y_1, ..., Y_m be random samples of size n and m from
two independent normal distributions N(μ_X, σ_X²) and N(μ_Y, σ_Y²). Then (n − 1)S_X²/σ_X² is χ²(n − 1) and (m − 1)S_Y²/σ_Y² is χ²(m − 1). Since the distributions, and hence S_X² and S_Y², are independent we have
h = [(m − 1)S_Y²/σ_Y² / (m − 1)] / [(n − 1)S_X²/σ_X² / (n − 1)] = (S_Y²/σ_Y²) / (S_X²/σ_X²) ~ F(m − 1, n − 1).
- This statistic is often useful when testing for the statistical significance of treatment effects (see the sketch below).
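A sketch of this idea on simulated data (illustrative sample sizes and parameters): under the null hypothesis of equal variances the σ² terms cancel, so the ratio S_Y²/S_X² is referred to an F(m − 1, n − 1) distribution; the p-value below is one-sided, against the alternative that σ_Y² exceeds σ_X².

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(seed=0)

    # Two independent normal samples (illustrative sizes and parameters).
    x = rng.normal(loc=0.0, scale=1.0, size=25)   # treatment X, n = 25
    y = rng.normal(loc=0.0, scale=1.5, size=20)   # treatment Y, m = 20
    n, m = x.size, y.size

    # Under H0: sigma_X^2 = sigma_Y^2 the ratio S_Y^2 / S_X^2 is F(m - 1, n - 1).
    f_stat = y.var(ddof=1) / x.var(ddof=1)
    p_one_sided = stats.f.sf(f_stat, dfn=m - 1, dfd=n - 1)
    print(f"F = {f_stat:.3f}, one-sided p-value = {p_one_sided:.3f}")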
5. The central limit theorem
Suppose X_1, ..., X_n is a random sample from a N(μ_X, σ_X²) distribution. What is the sampling distribution of the mean X̄? Note that:
E[X̄] = (1/n) Σ_{i=1}^{n} E[X_i] = (1/n) n μ_X = μ_X,
Var(X̄) = Var((1/n) Σ_{i=1}^{n} X_i) = (1/n²) Σ_{i=1}^{n} Var(X_i) = (1/n²) n σ_X² = σ_X²/n.
Moreover, since X̄ is a scaled sum of independent normal rv’s it also follows a normal distribution. Thus,
X̄ ~ N(μ_X, σ_X²/n) and W = (X̄ − μ_X)/(σ_X/√n) ~ N(0,1).
As n → ∞ the distribution of X̄ − μ_X degenerates to zero. However, the distribution of W does not degenerate. The scale factor σ_X/√n spreads out the mass so that the variance of W is unity for all n.
The statistic W is well defined even if the underlying sample is not normal.
Suppose X_1, ..., X_n is a random sample from an arbitrary distribution. What is the
sampling distribution of the W statistic in this case? It is easy to show (and you
should) that E[W] = 0 and Var(W) = 1 in this case as well. However, since the
distribution of the X_i is not normal the distribution of W will not generally be
normal. Nevertheless, the central limit theorem (clt) assures us that a standard
normal distribution closely approximates the true sampling distribution of W, as
long as fairly weak regularity conditions hold and n is sufficiently large.
Typically, n = 15 is large enough for practical purposes.
Central Limit Theorem: If X̄ is the mean of a random sample X_1, ..., X_n of size n from a distribution with a finite mean μ_X and a finite, positive variance σ_X², then the distribution of
W = (X̄ − μ_X)/(σ_X/√n)
is N(0,1) in the limit as n → ∞.
- It is often the case that assessing treatment contrasts involves comparing means from two treatment groups. One way to determine the significance of the contrasts is to use parametric tests that assume the means are normally distributed. The clt can be invoked to justify this normality assumption, particularly if the number of observations in each treatment is 15 or more (see the sketch below).
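The sketch below illustrates the clt numerically for a clearly non-normal parent distribution (an exponential, chosen arbitrarily): with n = 15 observations per sample, the simulated W statistic already has tail probabilities close to those of a standard normal.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(seed=0)

    n = 15                     # observations per sample, as suggested above
    n_sims = 100_000
    mu, sigma = 1.0, 1.0       # mean and sd of the exponential(1) parent distribution

    # Draw many samples from a skewed (non-normal) distribution and form W for each.
    samples = rng.exponential(scale=1.0, size=(n_sims, n))
    w = (samples.mean(axis=1) - mu) / (sigma / np.sqrt(n))

    print(f"mean(W) = {w.mean():.3f}, var(W) = {w.var():.3f}")
    print(f"P(W > 1.645) ~ {np.mean(w > 1.645):.3f}  (standard normal: {stats.norm.sf(1.645):.3f})")
    print(f"P(W > 1.960) ~ {np.mean(w > 1.960):.3f}  (standard normal: {stats.norm.sf(1.960):.3f})")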
C. Comments on Tests of Statistical Hypotheses
1. Definitions
- A statistical hypothesis is an assertion about the distribution of one or more rv’s.
- A test of a statistical hypothesis is a procedure, based on the observed values of the rv’s, that leads to the acceptance or rejection of the hypothesis.
- The critical region C is the set of points in the sample space that leads to the rejection of the null hypothesis.
- A statistic used to define the critical region is referred to as a test statistic. The critical region can be defined as the set of values of the test statistic that leads to rejection of the null.
- A type I error is made when the null is rejected even though it is true. A type II error is made when the null is accepted even though it is false.
- The significance level of the test is the probability of a type I error.
- The power function of a test of a null H_0 against an alternative H_1 is the function that gives the probability of rejecting H_0 for each parameter point in H_0 and H_1. The value of the power function at a particular point is called the “power” of the test at that point.
- The p-value is the probability, under H_0, of all values of the test statistic that are as extreme (in the direction of rejection of H_0) as the observed value of the test statistic.
2. Example
Suppose we have a random sample of 25 observations X_i from an unknown distribution with a known variance of 100. We believe the mean of this distribution is 60, but we want to entertain the possibility that it exceeds 60. Hence, we will test the null hypothesis
H_0: μ = 60
against the alternative hypothesis
H_1: μ > 60.
Note that each of these is a statistical hypothesis, since they make assertions about the distribution of X.
We decide to reject the null in favor of the alternative if the mean of the observations is at least 62, or X̄ ≥ 62. This defines the critical region for the test:
C = { (X_i)_{i=1,...,25} : (1/25) Σ_{i=1}^{25} X_i ≥ 62 }.
If the realized sample does not fall in the critical region we will accept the null.
Since the critical region is known we may now determine the significance level (or size) and power of this test. The power function is
π(μ) = P(X̄ ≥ 62; μ).
Since by the clt the distribution of X̄ should be well approximated by a N(μ, 100/25 = 4) distribution, we have that
π(μ) = P((X̄ − μ)/2 ≥ (62 − μ)/2; μ) = 1 − Φ((62 − μ)/2), 60 ≤ μ,
where Φ(·) denotes the standard normal cdf.
Some values of the power function follow.
μ:     60    61    62    63    64    65    66
π(μ):  0.16  0.31  0.50  0.69  0.84  0.93  0.98
Hence, the significance level of this test is 0.16: there is a 16% chance of
rejecting the null when it is true. The power of the test when the true mean is 65
is 93%. The p-value associated with a realized mean of 63 is
p-value = P(X̄ ≥ 63; μ = 60) = P((X̄ − 60)/2 ≥ (63 − 60)/2) = 1 − Φ((63 − 60)/2) = 0.067,
so we would not reject at a 5% significance level, but would reject at a 10% level (the sketch below reproduces these calculations).
- Generally, we would like a power of 100% and a size of 0%. This is not usually attainable. In fact, for a fixed sample size, more powerful tests usually have greater chances of generating a type I error.
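The power function and p-value from this example can be reproduced in a few lines of Python (a sketch; the cutoff of 62 and the standard deviation of 2 for the sample mean follow the calculation above).

    from scipy import stats

    sigma_xbar = 10 / 25**0.5          # sd of the sample mean: sqrt(100 / 25) = 2
    cutoff = 62                        # reject H0 when the sample mean is at least 62

    def power(mu):
        """P(reject H0) = P(Xbar >= 62) when the true mean is mu."""
        return stats.norm.sf((cutoff - mu) / sigma_xbar)

    for mu in range(60, 67):
        print(f"pi({mu}) = {power(mu):.2f}")

    # p-value for a realized mean of 63 under H0: mu = 60.
    p_value = stats.norm.sf((63 - 60) / sigma_xbar)
    print(f"p-value = {p_value:.3f}")   # about 0.067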
3. The power function and sample size
The power function can be used to help determine an appropriate sample size.
Ex: Let X have a N(μ_X, 36) distribution. We want to use a sample of size n from this distribution to test the hypothesis that μ_X = 50 against the alternative that μ_X > 50. We want the size of this test (the probability of a type I error) to be 0.05 and the power of the test at μ_X = 55 to be 0.90. This requires us to solve simultaneously the following two equations.
0.05 = P(X̄ ≥ c; H_0) = P((X̄ − 50)/(6/√n) ≥ (c − 50)/(6/√n)),
0.10 = P(X̄ < c; μ_X = 55) = P((X̄ − 55)/(6/√n) < (c − 55)/(6/√n)).
Thus, we must solve
(c − 50)/(6/√n) = 1.645,
(c − 55)/(6/√n) = −1.282,
which has solution {c, n} ≈ {52.8, 12}.
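The same two equations can be solved directly (a sketch; it reproduces the values above, with n ≈ 12.3 before rounding).

    from scipy import stats

    z_alpha = stats.norm.ppf(0.95)     # 1.645 for a size of 0.05
    z_beta = stats.norm.ppf(0.90)      # 1.282 for power 0.90 at mu = 55
    sigma, mu0, mu1 = 6, 50, 55

    # From c = mu0 + z_alpha*sigma/sqrt(n) and c = mu1 - z_beta*sigma/sqrt(n):
    n = (sigma * (z_alpha + z_beta) / (mu1 - mu0)) ** 2
    c = mu0 + z_alpha * sigma / n**0.5
    print(f"n = {n:.1f} (rounded to 12 in the notes), c = {c:.1f}")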