Statistics for Finance

1. Lecture 5: Confidence Intervals, Hypothesis Testing.
1.1. Confidence Intervals.
Suppose that we have a Normal N(µ, σ²) distribution with unknown mean and
standard deviation. We have seen so far how to produce estimators for these
quantities. These estimators tell us the most likely value of each parameter.
However, it is very unlikely that an estimator will produce the exact value of
the unknown parameter. It would be more desirable to produce a range of values,
such that the unknown parameter will lie within these values with high probability.
This is achieved by the construction of confidence intervals. We will defer a formal
definition until later, after we see the philosophy behind the construction of the
confidence interval.
1.1.1. Confidence Interval for Mean with Known Standard Deviation.
Suppose we have a distribution, not necessarily normal, with known variance σ²,
but unknown mean µ. Suppose that we form a sample X1, X2, . . . , Xn from this
distribution. Then the sample mean
X̄ = (X1 + · · · + Xn)/n

will have variance Var(X̄) = σ²/n and mean E[X̄] = µ, while the Central Limit
Theorem will tell us that the distribution of

(X̄ − µ)/(σ/√n)
is approximately standard normal, that is
P( −z < (X̄ − µ)/(σ/√n) < z ) ≃ Φ(z) − Φ(−z) = 2Φ(z) − 1,

where Φ(z) = ∫_{−∞}^{z} e^{−x²/2} dx / √(2π). By a simple manipulation in the above we get that
P( X̄ − z σ/√n < µ < X̄ + z σ/√n ) ≃ 2Φ(z) − 1.
Suppose now that we choose z = zα/2 such that 2Φ(zα/2) − 1 = 1 − α (this equation
cannot be solved explicitly, but there are tables that give you the values zα/2 for
different values of α, or you can use some statistical software). Then

P( X̄ − zα/2 σ/√n < µ < X̄ + zα/2 σ/√n ) ≃ 1 − α.
So, for example, if α = 0.05, we obtain from the tables that z0.025 ≃ 1.96 and
therefore the interval [X̄ − 1.96 σ/√n, X̄ + 1.96 σ/√n] will contain the unknown value of the
mean µ with probability 0.95.
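The notes do not include code, but the construction above is easy to sketch in Python (the sample values and σ = 3 below are made up for illustration), using the standard library's NormalDist to look up the quantile zα/2:

```python
from statistics import NormalDist, mean

def z_confidence_interval(sample, sigma, alpha=0.05):
    """(1 - alpha)-confidence interval for the mean when sigma is known."""
    n = len(sample)
    xbar = mean(sample)
    z = NormalDist().inv_cdf(1 - alpha / 2)   # z_{alpha/2}; about 1.96 for alpha = 0.05
    half_width = z * sigma / n ** 0.5
    return xbar - half_width, xbar + half_width

# Hypothetical sample from a distribution with known sigma = 3
lo, hi = z_confidence_interval([9.1, 10.4, 8.7, 11.2, 10.0, 9.5], sigma=3.0)
```

For α = 0.05 the quantile is z0.025 ≃ 1.96, so the half-width is 1.96 σ/√n, exactly as in the text.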
1.1.2. Confidence Interval for Mean with Unknown Standard Deviation.
Suppose now that we want to construct a confidence interval for the mean of a
distribution, but this time we do not know its standard deviation. In this case the
variance σ² in the (1 − α)-confidence interval [X̄ − zα/2 σ/√n, X̄ + zα/2 σ/√n] should be
replaced by an estimator, which we choose to be the sample variance
ŝ² = Σ_{i=1}^n (Xi − X̄)²/(n − 1). In other words, we would tend to say that the (1 − α)-confidence interval
for the mean µ is
[ X̄ − zα/2 ŝ/√n , X̄ + zα/2 ŝ/√n ].
This is not exactly correct, though. The reason is that when we replace σ by ŝ, i.e.
when we consider the fraction

(X̄ − µ)/(ŝ/√n),
the correct approximation of the distribution of this random variable is not the
standard normal distribution, but rather tn−1, the t-distribution with (n − 1)
degrees of freedom. Therefore the zα/2 normal quantiles should be replaced with the
corresponding tn−1,α/2 quantiles of the t-distribution with (n − 1) degrees of freedom.
The correct (1 − α)-confidence interval in this case is
[ X̄ − tn−1,α/2 ŝ/√n , X̄ + tn−1,α/2 ŝ/√n ].
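As a sketch of this recipe in Python (not part of the notes): the standard library has no t-quantile function, so the values tn−1,α/2 below are hard-coded from standard t-tables for a few sample sizes, and the sample data are made up.

```python
from statistics import mean, stdev

# t_{n-1, 0.025} quantiles taken from standard t-tables (two-sided, alpha = 0.05)
T_975 = {8: 2.306, 13: 2.160, 19: 2.093}

def t_confidence_interval(sample, alpha=0.05):
    """(1 - alpha)-CI for the mean when sigma is unknown (alpha = 0.05 only here)."""
    n = len(sample)
    xbar = mean(sample)
    s_hat = stdev(sample)        # divides by n - 1, matching the estimator s-hat
    t = T_975[n - 1]             # t_{n-1, alpha/2} from the table above
    half_width = t * s_hat / n ** 0.5
    return xbar - half_width, xbar + half_width

# Hypothetical sample of size 9, so n - 1 = 8 degrees of freedom
lo, hi = t_confidence_interval([9.1, 10.4, 8.7, 11.2, 10.0, 9.5, 10.8, 9.9, 8.9])
```

Note that for the same α the t quantile (2.306 for 8 degrees of freedom) is larger than z0.025 ≃ 1.96, so the interval is wider: replacing σ by the estimate ŝ costs us some precision.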
We will now attempt to give an explanation as to why the t-distribution is the
correct distribution to consider, rather than the normal distribution.
Assume that the underlying distribution is normal N(µ, σ²), with unknown mean
and standard deviation. Suppose that a sample of size n, X1, X2, . . . , Xn, is drawn
from this distribution and consider the fraction
(1)    (X̄ − µ)/(ŝ/√n) = ( √n (X̄ − µ)/σ ) / ( ŝ/σ ).
Then we claim that the distribution of the above random variable is exactly the
tn−1-distribution with (n − 1) degrees of freedom.
To prove this we need the following very interesting lemma.
Lemma 1. Consider a sequence X1 , X2 , . . . , Xn of i.i.d. standard normal variables.
Then the sample average X is independent of the random vector (X1 − X, X2 −
X, . . . , Xn − X).
The proof of this fact is not difficult, but we will skip it since it is a bit lengthy.
The detailed proof can be found in the book of Rice, Section 6.3. Let us just say
that to prove the statement it is enough to prove that for any numbers u, u1, . . . , un
it holds that

E[ exp( uX̄ + Σ_{i=1}^n ui(Xi − X̄) ) ] = E[ exp(uX̄) ] · E[ exp( Σ_{i=1}^n ui(Xi − X̄) ) ].
Then, clearly, √n(X̄ − µ)/σ is a standard normal random variable. On the other hand
we have
Lemma 2. If X1 , X2 , . . . , Xn are i.i.d. normal N (µ, σ 2 ) then the distribution of
(n − 1) ŝ²/σ² = (1/σ²) Σ_{i=1}^n (Xi − X̄)²
is a χ2n−1 distribution, with (n − 1) degrees of freedom.
Proof. Note that

(1/σ²) Σ_{i=1}^n (Xi − µ)² = Σ_{i=1}^n ( (Xi − µ)/σ )² ∼ χ²n
as a sum of the squares of n i.i.d. standard normals. Moreover,
(1/σ²) Σ_{i=1}^n (Xi − µ)² = (1/σ²) Σ_{i=1}^n ( (Xi − X̄) + (X̄ − µ) )²
                           = (1/σ²) Σ_{i=1}^n (Xi − X̄)² + ( (X̄ − µ)/(σ/√n) )²,
where we also used the fact that Σ_{i=1}^n (Xi − X̄) = 0. The above equation is of the
form W = U + V, where U, V are independent by the previous lemma. Also, W, V
have distributions χ²n, χ²1, respectively. If MW(t) denotes the moment generating
function of W, and similarly for U, V, we have by independence that
MU(t) = MW(t)/MV(t) = (1 − 2t)^{−n/2} / (1 − 2t)^{−1/2} = (1 − 2t)^{−(n−1)/2},

where we used the fact that the moment generating function of a χ² distribution with n degrees
of freedom is (1 − 2t)^{−n/2}.
□
From the above Lemma, as well as the definition of the t-distribution (recall that
if Z is standard normal and U ∼ χ²r then Z/√(U/r) ∼ tr), it follows that the
distribution of (1) is exactly the tn−1-distribution.
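A quick simulation (not in the notes; Python with a fixed seed) makes the point concrete: for a small sample, the statistic (1) exceeds ±1.96 far more often than the 5% that a normal approximation would suggest, in line with the heavier tails of tn−1.

```python
import random
from statistics import mean, stdev

random.seed(0)
n, trials, mu, sigma = 5, 20000, 0.0, 1.0
exceed = 0
for _ in range(trials):
    x = [random.gauss(mu, sigma) for _ in range(n)]
    t = (mean(x) - mu) / (stdev(x) / n ** 0.5)   # the statistic in (1)
    if abs(t) > 1.96:                             # the normal quantile z_{0.025}
        exceed += 1
freq = exceed / trials
# For n = 5 the statistic has the t_4 distribution, whose tails beyond +-1.96
# carry roughly 12% of the probability, not the 5% of the standard normal.
```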
1.1.3. Confidence Intervals for the Variance.
Let us consider the particular case of an i.i.d. Normal sample X1 , X2 , . . . , Xn .
Let σ̂² be the maximum likelihood estimator of the variance, i.e.

σ̂² = (1/n) Σ_{i=1}^n (Xi − X̄)².

Then by Lemma 2 we have that

nσ̂²/σ² ∼ χ²n−1.
Let us denote by χ²m,α the chi-square quantile, i.e. the point beyond which the chi-square
distribution with m degrees of freedom has probability α. Then we have

P( χ²n−1,1−α/2 < nσ̂²/σ² < χ²n−1,α/2 ) = 1 − α

and solving for σ² we get that

P( nσ̂²/χ²n−1,α/2 < σ² < nσ̂²/χ²n−1,1−α/2 ) = 1 − α.
Therefore the (1 − α)-confidence interval for the variance is

[ nσ̂²/χ²n−1,α/2 , nσ̂²/χ²n−1,1−α/2 ].
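A minimal Python sketch of this interval (not part of the notes; the sample is hypothetical and the χ² quantiles are hard-coded from standard tables for n − 1 = 8 degrees of freedom):

```python
from statistics import mean

# chi-square quantiles for n - 1 = 8 degrees of freedom, from standard tables
CHI2_8_UPPER_025 = 17.53   # chi^2_{8, 0.025}: upper 2.5% point
CHI2_8_LOWER_025 = 2.18    # chi^2_{8, 0.975}: lower 2.5% point

def variance_confidence_interval(sample):
    """95% confidence interval for sigma^2; quantiles fixed for a sample of size 9."""
    n = len(sample)
    assert n == 9, "the quantiles above are for n - 1 = 8 degrees of freedom"
    xbar = mean(sample)
    n_sigma_hat_sq = sum((x - xbar) ** 2 for x in sample)  # n * (MLE of variance)
    return n_sigma_hat_sq / CHI2_8_UPPER_025, n_sigma_hat_sq / CHI2_8_LOWER_025

lo, hi = variance_confidence_interval([9.1, 10.4, 8.7, 11.2, 10.0, 9.5, 10.8, 9.9, 8.9])
```

Note that the larger quantile goes in the denominator of the lower endpoint, since dividing by a larger number gives a smaller value.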
1.1.4. Confidence Intervals for General Parameters.
Suppose now that we want to construct confidence intervals for some parameter θ
of a distribution. For a general parameter, other than the mean or the variance, the
construction of confidence intervals as above is more difficult, since rather detailed
information on the distribution is required. We can get around this difficulty by constructing approximate confidence intervals with the help of maximum
likelihood.
In particular, we know from Theorem 1 of Lecture 3 that for a parameter θ, with
MLE θ̂, it holds that √(nI(θ)) (θ̂ − θ) is approximately standard normal. Therefore, if
zα/2 is the corresponding quantile for the standard normal, we have that

P( −zα/2 < √(nI(θ)) (θ̂ − θ) < zα/2 ) ≃ 1 − α.

The difficulty in this equation is to solve for θ. We can make our life easier by
assuming that the distribution of √(nI(θ̂)) (θ̂ − θ) is also approximately standard
normal. Therefore, it follows that the (1 − α)-confidence interval is approximately

[ θ̂ − zα/2/√(nI(θ̂)) , θ̂ + zα/2/√(nI(θ̂)) ].
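As an illustration of this recipe (an example chosen here, not taken from the notes), consider the rate λ of an exponential distribution, for which the MLE is λ̂ = 1/X̄ and the Fisher information is I(λ) = 1/λ², so that zα/2/√(nI(λ̂)) = zα/2 λ̂/√n:

```python
from statistics import NormalDist, mean

def exponential_rate_ci(sample, alpha=0.05):
    """Approximate (1 - alpha)-CI for the rate of an exponential distribution.

    MLE: lambda-hat = 1 / X-bar; Fisher information: I(lambda) = 1 / lambda^2,
    so the half-width z / sqrt(n I(lambda-hat)) equals z * lambda-hat / sqrt(n).
    """
    n = len(sample)
    lam_hat = 1.0 / mean(sample)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = z * lam_hat / n ** 0.5
    return lam_hat - half_width, lam_hat + half_width

# Hypothetical exponential waiting times
lo, hi = exponential_rate_ci([0.8, 1.5, 0.3, 2.1, 0.9, 1.2, 0.6, 1.8])
```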
1.1.5. What is the Confidence Interval?
We have so far seen several ways of constructing confidence intervals. Let us now discuss
how we should interpret a confidence interval.
The confidence interval should itself be interpreted as a random object. It is a
random interval, e.g., in the case of the mean, of the form

(2)    [ X̄ − tα/2 ŝ/√n , X̄ + tα/2 ŝ/√n ],

since X̄ and ŝ are functions of the sample and so they should be considered as
random variables.
An interval like the above one should be interpreted as a realization of a
random interval, which with probability (1 − α) contains the unknown parameter (in
this case the mean).
As an example we do the following experiment. We generate 20 independent
samples of size 9 each, from a normal distribution with mean µ = 10 and variance
σ² = 9. For each one of these samples we form the resulting 0.9-confidence interval
for the mean, which will be of the form

[ X̄ − zα/2 σ/√n , X̄ + zα/2 σ/√n ] = [ X̄ − 1.64 · 3/√9 , X̄ + 1.64 · 3/√9 ]
                                   = [ X̄ − 1.64 , X̄ + 1.64 ],
where X̄ is the corresponding sample mean in each one of the 20 samples. Once we
generate these intervals, we expect that 90% of them, that is about 18 of them, will
contain the value 10, which corresponds to the real population mean. Be careful: we
expect that about 18 will contain the mean! This does not mean that for sure 18
of the intervals will contain the actual mean, since, as we said, the outcome of the
interval is itself random and depends on the realization of the sample.
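This experiment is easy to reproduce in Python (a sketch, not part of the notes; the seed is arbitrary):

```python
import random
from statistics import mean

random.seed(1)
mu, sigma, n, n_samples = 10.0, 3.0, 9, 20
covered = 0
for _ in range(n_samples):
    x = [random.gauss(mu, sigma) for _ in range(n)]
    xbar = mean(x)
    # 90% interval with known sigma: half-width z_{0.05} * sigma / sqrt(n) = 1.64 * 3 / 3
    if xbar - 1.64 < mu < xbar + 1.64:
        covered += 1
# We expect roughly 18 of the 20 intervals to cover mu = 10, but the exact
# count varies from run to run.
```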
1.2. Hypothesis Testing. Let us start with an example. Suppose that X1, X2, . . . , Xn
is a sample drawn from a normal distribution with unknown mean µ and variance σ².
Consider testing the following hypotheses:

H0 : µ = µ0
HA : µ ≠ µ0
The hypothesis H0 is called the null hypothesis, while the hypothesis HA is called
the alternative hypothesis. The idea is that one starts by assuming that the mean of
the normal distribution is µ0 and then proceeds to check whether this assumption
should be accepted as true, or whether it should be rejected in favor of
the alternative hypothesis HA, which claims that µ ≠ µ0.
We would like to construct a test, based on which we will reject or accept
the null hypothesis. Of course, since we deal with random events, there will always
be a probability of a false decision, that is, to accept the null hypothesis as correct
when it is not, or to accept the alternative hypothesis as correct when it is not. The
former type of error is called a Type II error, while the latter is called a Type
I error. We will come back to this point in a minute. First, we need to construct
the test. Again, as on many occasions so far, there are several ways to construct an
appropriate test. Here we will present the test that is dual to confidence intervals.
We start with the assumption that the null hypothesis is correct, i.e. that the
mean of the distribution is µ0. Then, as before, the random variable

Z = (X̄ − µ0)/(σ/√n)

is standard normal. The random variable Z is called the test statistic that we
use. Suppose that the actual value of the random variable Z, as this emerges from
the sampling, is such that

(3)    |Z| = | (X̄ − µ0)/(σ/√n) | > zα/2,
where zα/2 is the α/2 standard normal quantile. The probability that something
like this happens is

P(|Z| > zα/2) = α.

Therefore, if α is sufficiently small, the probability of obtaining sample data that
result in a test statistic satisfying (3) is very small (and equal to α). It is, therefore,
unlikely that we got “strange data”, and we prefer to say that our null hypothesis
was wrong and reject it in favor of the alternative hypothesis. Of course there is
always the possibility that we really did get “strange data” and we falsely rejected the
null hypothesis. In this case we fall into a Type I error. The probability of this
happening is α, and it is called the significance level of our test.
We finally say that the region

{ x : | (x − µ0)/(σ/√n) | > zα/2 }

is the rejection region for the test statistic (3) at significance level α. In other
words, we will reject the null hypothesis if the data form a sample mean that falls
into the rejection region.
The above type of hypothesis testing is called two-sided. We could also have a
one-sided hypothesis test, which would consist of

H0 : µ = µ0
H1 : µ > µ0

In this case it is easy to see that the rejection region at significance level α should
be

{ x : (x − µ0)/(σ/√n) > zα }.
One proceeds similarly in the case where > is replaced by <.
Since we were dealing with normal distributions, the computations of the above
probabilities were exact. In the case where we want to test the mean of a general
distribution, we make use of the Central Limit Theorem and then proceed similarly. The only thing that will change is that the equation P(|Z| > zα/2) = α will
be replaced by P(|Z| > zα/2) ≃ α.
As in the case of confidence intervals with unknown variance, when the variance
of the distribution is unknown we will have to replace it with the sample variance
ŝ². Then we also need to make use of the t-distribution, instead of the normal. In
this case the test statistic that we will be using is

t = (X̄ − µ0)/(ŝ/√n).
The rejection region at significance level α (in the case of a two-sided hypothesis
test) will be

{ x : | (x − µ0)/(ŝ/√n) | > tn−1,α/2 },

where tn−1,α/2 is the α/2 quantile of the t-distribution with (n − 1) degrees of freedom (if
our sample size is n).
The smallest significance level at which the null hypothesis would be rejected is
called the p-value.
Example 1. A stock trading company institutes a new system, in order to reduce the
trade time of a stock. The mean waiting time under the specific conditions with the
previous system was 6.1 minutes. A sample of 14 stock trades is taken. The times
are measured at widely separated times so as to eliminate the possibility of dependent
observations. The resulting sample mean is 5.043 and the sample standard deviation is 2.266. Test the null hypothesis of no change against an appropriate research
hypothesis using α = .10.
We are interested in the value of the mean trading time and whether the new system
reduces it. Since the current mean waiting time is 6.1, we can formulate the null and
alternative hypotheses as

H0 : µ = 6.1
H1 : µ < 6.1

Since we use the sample standard deviation, we will use the quantiles of the t-distribution. We form the t-test statistic

t = (X̄ − µ0)/(s/√n) = (5.043 − 6.1)/(2.266/√14) = −1.75.
For α = .10 and 13 degrees of freedom the quantile is t.10 = 1.350, and
the rejection region is

{ t : t < −1.350 }.

This is because we are dealing with a one-sided hypothesis test. Since the observed
value belongs to the rejection region, we reject the null hypothesis in favor of the
alternative hypothesis.
The p-value is equal to P(T13 < −1.75), where T13 is a random variable with a
t-distribution with 13 degrees of freedom. The exact value is found using the appropriate tables or software.
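The computation in this example can be checked with a few lines of Python (a sketch, not in the notes; the quantile 1.350 is taken from t-tables, as in the text):

```python
from math import sqrt

# Data from Example 1
n, xbar, s, mu0 = 14, 5.043, 2.266, 6.1
t = (xbar - mu0) / (s / sqrt(n))   # t-test statistic with 13 degrees of freedom

# One-sided rejection region at alpha = .10: t < -t_{13, .10} = -1.350 (from tables)
reject = t < -1.350
# t comes out to about -1.75, which lies in the rejection region, so we
# reject H0: mu = 6.1 in favor of H1: mu < 6.1.
```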
We summarise the hypothesis testing for the mean when the standard deviation
is unknown in the following table:

H0 : µ = µ0
H1 : 1. µ > µ0    2. µ < µ0    3. µ ≠ µ0
T.S. : t = (X̄ − µ0)/(s/√n)
R.R. : For a given probability α of Type I error, reject H0 if
       1. t > tα    2. t < −tα    3. |t| ≥ tα/2
       where tα cuts off a right-tail area of α
       in a t-distribution with n − 1 degrees of freedom
p-value : 1. P(Tn−1 > tactual)    2. P(Tn−1 < tactual)    3. 2P(Tn−1 > |tactual|)
1.3. Exercises.
1. A random sample of 20 vice executives of Fortune 500 firms is taken. The
amount each executive paid in federal income taxes as a percentage of gross income
is determined. The data are
16.0 18.1 18.6 20.2 21.7 22.4 22.4 23.1 23.2 23.5
24.1 24.3 24.7 25.2 25.9 26.3 27.9 28.0 30.4 33.7
A. Compute the sample mean and the sample standard deviation. B. Calculate a
95% confidence interval for the (population) mean. C. Calculate a 99% confidence
interval for the (population) mean. D. Calculate a 95% confidence interval for the
(population) variance. E. Calculate a 99% confidence interval for the (population)
variance. F. Give a careful verbal interpretation of the above confidence intervals.
2. In the above exercise repeat question A., B. assuming that you know that the
population standard deviation is σ = 4.0.
3. Use Minitab to compute the confidence intervals in Exercise 1.
4. Often we are interested in how large a sample we need to take in order to have
an appropriate confidence interval. This is outlined in the following statement:
The sample size required to obtain a 100(1 − α)% confidence interval for a population mean µ of the form X̄ ± E (assuming that the population standard deviation
σ is known) is

n = z²α/2 σ² / E².

A. Derive the above statement, i.e. prove it! B. What would be the corresponding
statement if the population standard deviation σ is unknown? Prove it!
Note: Often 2E is called the width of the confidence interval.
5. Union officials are concerned about reports of inferior wages being paid to
employees of a company under its jurisdiction. How large a sample is needed to
obtain a 90% confidence interval for the population mean hourly wage µ with width
equal to 1.00£? Assume that σ = 4.00£.
6. The manager of a health maintenance organization has set as a target that the
mean waiting time of nonemergency patients not exceed 30 minutes. In spot checks
the manager finds the waiting times for 22 patients. The patients are selected
randomly on different days. Assume that the population standard deviation of
waiting times is 10 minutes.
A. What is the relevant parameter to be tested?
B. Formulate the null and alternative hypotheses.
C. State the test statistic and the rejection region corresponding to α = .05.
7. The battery pack of a hand calculator is supposed to perform 20,000 calculations before needing a recharge. The quality control manager for the manufacturer is
concerned that the pack may not be working for as long as the specifications state.
A test of 114 battery packs gives an average of 19,695 calculations and a standard
deviation of 1103.
A. Formulate the null and alternative hypotheses.
B. Calculate the appropriate test statistic and p-value.
C. Calculate a 95% confidence interval.
8. Use Minitab to confirm the example given in Section 1.1.5. That is, generate
20 samples of size 9 each from a normal with mean 10 and variance 9. Construct
(using Minitab or by hand) the 20 corresponding 90% confidence intervals for the
mean. How many contain the actual value µ = 10?
Do the same thing by constructing the confidence intervals for the variance.