Hypothesis Testing

advertisement
Econ 413
Parks
Hypothesis Testing
Page 2 of 32
Econ 413
Hypothesis Testing
Hypothesis Testing
A statistical hypothesis is a set of assumptions about a model of observed data.
Example 1 (coin toss): The number of heads of 11 coin flips are random and distributed as a
binomial with success rate 0.5 and n=11 Recall the binomial distribution has two parameters, the
probability of success and the number of trials. We call the first parameter the success rate to
distinguish it from probabilities we calculate using the binomial.
Example 2 (income): Income is distributed as a normal random variable with mean μ and
variance σ2
Example 3 (univariate regression): Y = a + b*X + ε and the seven classical assumptions are
true.
Example 1 specifies the exact distribution of the data (number of heads). Example 1 has no
unknowns. A statistical hypothesis (about data) which has no unknowns is called a simple
hypothesis. Examples 2 and 3 have unknown parameters and do not specify the exact distribution
of the data. They are called complex hypotheses.
A statistical hypothesis test is a decision about a statistical hypothesis. The decision is to
accept or reject the hypothesis. The statistical hypothesis we test is called the maintained. The
alternate hypothesis is a different specification of the distribution of the data. Either or both can
be simple or complex.
Most books use the term null hypothesis. I have three reasons to use the word maintained
rather than null:
1.
Null is defined as amounting to nothing, having no value, and being 0 (among
other definitions). Often the labeling of the null hypothesis is H0 and I suppose
null hypothesis was preferred to zero hypothesis or naught hypothesis. .
2.
You have learned things about the null hypothesis which may or may not be true.
Using maintained hypothesis starts us off on a neutral path.
3.
Maintained hypothesis may, I hope, remind you that the maintained hypothesis
usually has many assumptions. For our regression tests, the maintained hypothesis
assumes A1 to A7 and possibly other assumptions.
A statistical hypothesis test specifies a critical region – a set of numbers. If the observed
data (or a function of the data) is in the critical region, then reject the maintained hypothesis.
If the observed data is NOT in the critical region, then accept the maintained hypothesis. I
think REJECTION region would be a better name than critical region. Alas, the literature has
critical region. I will use critical/rejection to help solidify the concept. I use ACCEPTANCE
region rather than the cumbersome 'not in the rejection region'.
Example 1 test: Let the critical (rejection) region be {0, 1, 2, 9, 10, 11} heads. If you flip
the coin 11 times, reject the maintained hypothesis: the number of heads is a binomial distribution
with success rate .5 of heads and n=11 if you observe {0, 1, 2, 9, 10, 11} heads.
Accepting the maintained hypothesis does not prove it to be true and rejecting the
maintained hypothesis does not prove it to be false. Similarly, accepting the alternate hypothesis
does not prove it to be true and rejecting the alternative does not prove it to be false. A statistical
test can prove nothing.
I believe many authors use 'fail to reject' so students will not think the hypothesis was
proved with a statistical hypothesis. But the only meaning that 'fail to reject' can have in statistical
hypothesis testing is accept. The outcome of a statistical hypothesis test is BINARY – only two
outcomes. The data is either in the critical (rejection) region or the data is not in the critical
(rejection) region. The wording 'fail to reject' connotatively conveys something different than
Page 3 of 32
Econ 413
Hypothesis Testing
'accept' because in English we often use a double negative to convey something other than a binary
outcome.
A statistical test has exactly two outcomes. The data is either in the critical (rejection)
region or it is not in the critical (rejection) region. If the data is NOT in the critical (rejection)
region, you accept the maintained. You reject the alternative. Reject the alternative must mean
accept the maintained. Fail to reject the maintained must mean accept the maintained.
If 'fail to reject' had any real meaning other than accept, then 'fail to accept' would also have
a different meaning. Now you would have four outcomes:
accept the maintained,
reject the maintained,
fail to reject the maintained,
fail to accept the maintained.
A statistical test has exactly two outcomes: the data is either in the critical (rejection) region or it is
not in the critical (rejection) region. The only outcomes are to accept the maintained (reject the
alternative) or accept the alternative (reject the maintained). Fail to reject must mean accept and fail
to accept must mean reject.
'failed to reject' may have a connotation that you are trying to reject (and failed). Whether
you want to accept or reject a statistical hypothesis is outside of the discussion of statistical
hypotheses. Want is a normative concept. I never use 'failed to accept' (except in moments of
brain failure). I never want to accept or reject a hypothesis unless someone is paying me money,
reputation, or other reward (which then makes me want). You will not want to accept or reject a
hypothesis in this course. Your grade does not depend on whether the hypothesis is accepted or
rejected, but rather on what you do with the acceptance or rejection.
A third reason authors use 'fail to reject' is Karl Popper's influence on scientific method.
Popper touted falsification of theories. Specifically, "Logically, no number of experimental testing
can confirm (read prove) a scientific theory, but a single counterexample is logically decisive: it
shows the theory … to be false." For Popper, experimental evidence would either fail to reject the
theory or would reject the theory.
For Popper, reject requires one data point which is inconsistent with the theory. For
example the Cobb-Douglas production function (be an economist for a moment) requires 0 output if
either labor or capital = 0. We can reject Cobb-Douglas if we observe positive output with 0 labor
or 0 capital. Most econometric models do not have the property of rejection by one observation.
Many statistical tests exist for some statistical hypotheses. In example 1 (coin test) we may
reject the maintained hypothesis if we observed 5 heads and then 6 tails. (which is not in the
rejection region {0,1,2,9,10,11}). With regressions, we have homoscedasticity tests, serial
correlation tests, endogeneity tests, model specification tests, and normality tests. Each test has the
same hypothesis – all seven assumptions. ‘fail to reject’ one of many statistical tests of the same
hypothesis means that the current test accepts the maintained but some other test remaining to be
done might reject the maintained. Then ‘fail to reject’ is not about a hypothesis test, but about
many hypothesis tests. In such a case the many statistical tests have many critical (rejection)
regions (as many as there are tests). ACCEPT or REJECT is about one single critical (rejection)
region. We will discuss distinguishing among hypothesis tests, but we will not use ‘fail to reject’. I
never use fail to reject and never use fail to accept.
If accepting a hypothesis does not prove the hypothesis, then what does accepting a
hypothesis do? Acceptance allows one to proceed as if the hypothesis were true.
We may either accept a true hypothesis or accept a false hypothesis. Accepting a true
hypothesis would be a correct decision and rejecting a true hypothesis would be an incorrect
decision – that is an error.
Page 4 of 32
Econ 413
Hypothesis Testing
Type I and Type II errors
Type I error: Reject a true maintained hypothesis = accept a false alternative
hypothesis.
2.
Type II error: Reject a true alternative hypothesis = accept a false maintained
hypothesis.
In classical statistical hypothesis testing a hypothesis is true or false. Hypotheses do not
have a probability of being true or false.
The probability of making a Type I error is the probability the data is in the critical
(rejection) region conditional upon assuming the data is distributed by the maintained hypothesis.
The probability of making a Type II error is the probability the data is NOT in the critical
(rejection) region conditional upon assuming the data is distributed by the alternative hypothesis.
Or the probability of making a Type II error is the probability the data is in the ACCEPTANCE
region conditional upon assuming the data is distributed by the alternative hypothesis
Example 1: Return to the coin flip. A critical (rejection) region is {0, 1, 2, 9, 10, 11}. The
probability of {0, 1, 2, 9, 10, or 11} heads occurring given the number of heads is a Binomial (0.5,
11) is 0.0005 + 0.0054 + 0.0269 + 0.0269 + 0.0054 + 0.0005 = 0.0654. I used the Excel function
BINOMDIST to calculate the probabilities – e.g., for two heads I used =BINOMDIST(2,11,0.5,FALSE) .
The probability of a Type I error for the critical (rejection) region {0, 1, 2, 9, 10, 11} is 0.0654. It is
the probability we observe {0, 1, 2, 9, 10, or 11} heads in 11 flips assuming the flips are a binomial
distribution with n=11 and p=0.5 - the maintained distribution. If we observe 0, 1, 2, 9, 10, or 11
heads we reject the maintained hypothesis and accept the alternative hypothesis. If we observe 3, 4,
5, 6, 7, or 8 heads we accept the maintained hypothesis and reject the alternative hypothesis.
See lecture8.pptx near slides 28-39 for the calculation of the probability of Type I errors for
the following critical regions:
CR1. {0,1,2,9,10,11}
P=0.06543
CR2. {0,1,10,11}
P=0.01172
CR3. {0,1,2}
P=0.03271
CR4. {0,1,2,3}
P=0.11328
CR5. {8,9,10,11}
P=0.11328
CR6. {9,10,11}
P=0.03271
CR7. {1,3,7,9}
P=0.27393
CR8. {2,10,11}
P=0.03271
What is the alternative hypothesis? Unspecified. One alternative is the data was generated
by a different distribution. For example, the data is generated by flipping the coin until 2 heads
were observed and it took 11 trials. The distribution (flipping until a certain number of successes is
observed) is called the negative binomial.
Another alternative hypothesis in example 1 is the distribution is binomial, n=11 and the
success rate is any number zero to one.
Unless the alternative hypothesis is specified we can not know the probability of a Type II
error (reject a true alternative). In most real life cases, the alternative is complex and the probability
of Type II error is unknown unless we specify a particular alternative.
Sometimes we can calculate the probability of a Type II error. In example 1, if we specify
the alternative is a binomial distribution, then we can calculate the probability of Type II error for
each success rate from 0 to 1. The following table exhibits probabilities of Type II errors eight
different critical (rejection) regions.
1.
Page 5 of 32
Econ 413
0.545
0.541
0.632
0.721
0.726
0.651
0.576
0.594
0.771
0.03
0,1,3,4,5,6,7,8,9
1.000
1.000
0.999
0.994
0.967
0.881
0.687
0.383
0.090
0,2,4,5,6,7,8,10,11
1.000
1.000
0.996
0.971
0.887
0.704
0.430
0.161
0.019
0,1,2,3,4,5,6,7,8
0.019
0.161
0.430
0.704
0.887
0.971
0.996
1.000
1.000
0,1,2,3,4,5,6,7
0.090
0.383
0.687
0.881
0.967
0.994
0.999
1.000
1.000
4,5,6,7,8,9,10,11
0.303
0.678
0.887
0.969
0.988
0.969
0.887
0.678
0.303
3,4,5,6,7,8,9,10,11
Alternative
success
rates
P(Type II) various
alternative hypotheses
0.090
0.383
0.687
0.875
0.935
0.875
0.687
0.383
0.090
2,10,11
1,3,7,9
P(Type I) 0.065 0.012 0.033 0.113 0.113 0.033 0.274
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
8
7
6
9,10,11
8,9,10,11
2,3,4,5,6,7,8,9
0,1,2,3
0,1,10,11
3,4,5,6,7,8
0,1,2
0,1,2,9,10,11
CR
Critical Region
5
4
3
2
1
Hypothesis Testing
0.787
0.705
0.800
0.911
0.967
0.965
0.886
0.678
0.303
See lecture 8.
The last eight columns of the table are different critical regions. The probability of a Type I
error is the same for critical regions 3, 6 and 8 and the same for critical regions 4 and 5. Comparing
critical regions 1 and 2 critical region 1 has a larger Type I error and a smaller Type II error than
critical region 2 for each value of the alternative success rate of the binomial (the graph makes the
comparison easy).
1.000
CR 2 0.012
0.900
0.800
Probability
Of
Type II
Error
0.700
0.600
0.500
CR 1 0.065
0.400
0.300
0.200
0.100
Alternative success rates
0.000
0
0.2
0.4
0.6
0.8
Probability of Type II error (vertical axis)
Success rates of the binomial (horizontal access).
1
Page 6 of 32
Econ 413
Hypothesis Testing
Suppose that two hypothesis tests had identical Prob(Type I error), say .05. Suppose also
that one test had a greater Prob(Type II error) for every specification of the alternative than the
other. The hypothesis test with the larger probability of Type II error is dominated by the one with
the smaller probability of Type II error.
Among UNDOMINATED hypothesis tests, decreasing probability of Type I error increases
probability of Type II error. A trade off exists between probability of Type I error and probability
of Type II error – decrease one and the other increases.
A theoretical result is: for testing a simple hypothesis against a simple hypothesis, there
exists a critical region with no lower probability of a Type II error given a fixed probability of Type
I error. This is a beautiful result. One test is dominant for a simple versus simple situation.
Unfortunately, in econometrics, both the maintained and the alternative are usually complex
hypotheses and we have no such result.
The probability of a Type I error is called the size (or significance level) of a statistical test.
The POWER of a test is 1 minus the probability of a Type II error. For a given size, we
want a statistical test with greatest power. For most tests we encounter, we specify a size, we obtain
a critical region and theoretical results indicate what alternatives have relatively large power and
what alternatives may not. For most tests we encounter, we never know the probability of a Type II
error. We rely on prior research to tell us what tests are powerful against what alternatives.
Both the size (sometimes called the significance level) and the power are probabilities of the
critical region. The difference is the assumption made to compute the probability. For the size, the
probability is computed assuming the maintained hypothesis. For the power, the probability is
computed assuming some alternative hypothesis.
Size = Prob(CR| maintained)=Prob(Type I error)
Power=Prob(CR| alternative) = 1 – Prob(Acceptance|alternative)=1-Prob(TypeII error)
The following table shows the powers for the eight critical region.
1
2
Critical Region
3
4
5
6
7
0.03
3,4,5,6,7,8
2,3,4,5,6,7,8,9
3,4,5,6,7,8,9,10,11
4,5,6,7,8,9,10,11
0,1,2,3,4,5,6,7
0,1,2,3,4,5,6,7,8
0,2,4,5,6,7,8,10,11
0,1,3,4,5,6,7,8,9
Power for various
alternative hypotheses
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
2,10,11
1,3,7,9
9,10,11
8,9,10,11
0,1,2,3
0,1,2
0,1,10,11
0,1,2,9,10,11
CR
P(Type I) 0.065 0.012 0.033 0.113 0.113 0.033 0.274
8
CR1
0.910
0.617
0.313
0.125
0.065
0.125
0.313
0.617
0.910
CR2
0.697
0.322
0.113
0.031
0.012
0.031
0.113
0.322
0.697
CR3
0.910
0.617
0.313
0.119
0.033
0.006
0.001
0.000
0.000
CR4
0.981
0.839
0.570
0.296
0.113
0.029
0.004
0.000
0.000
CR5
0.000
0.000
0.004
0.029
0.113
0.296
0.570
0.839
0.981
CR6
0.000
0.000
0.001
0.006
0.033
0.119
0.313
0.617
0.910
CR7
0.455
0.459
0.368
0.279
0.274
0.349
0.424
0.406
0.229
CR8
0.213
0.295
0.200
0.089
0.033
0.035
0.114
0.322
0.697
We can graph the power of the test used in Example 1 just as we graphed the probability of
Type I error.
Page 7 of 32
Econ 413
Hypothesis Testing
1.000
CR3
0.900
P(Type I)=.033
CR6
0.800
0.700
0.600
CR2
CR3
0.500
CR6
CR7
0.400
0.300
CR2
P(T I)=.012
0.200
CR7 P(T I)=.274
0.100
0.000
0
0.2
0.4
0.6
0.8
1
The graph shows two power curves with the same size – namely CR3 ={0,1,2} and CR6={9.10,11}.
CR3 is more powerful for alternative success rates of heads less than .5 and CR6 is more powerful
for alternative probabilities of heads greater than .5.
Test CR2={0,1,10,11} has a smaller size (0.012) than CR3 or CR6 (.033). CR2 is less powerful
for some alternatives and more powerful for other alternatives than either CR3 or CR6. For
alternatives .45 to .55, CR2 is less powerful than either CR3 or Cr6. CR7 has greater power for
some alternatives but also has greater size. Smaller size => larger power = less Prob(Type II error).
1.000
0.900
P(Type I)=.065
0.800
0.700
P(Type I)=0.012
0.600
CR1
0.500
CR2
0.400
0.300
0.200
0.100
0.000
0
0.2
0.4
0.6
0.8
1
To illustrate that power decreases as size decreases, compare the graph of CR1={0,1,2,9,10,11} and
CR2={0,1,10,11}. The size of CR1 is .065 and the size of CR2 is .012 . For every alternative
success rate of heads, CR1 has greater power but its size is also greater. The graph shows the trade
off between size (we want smaller size) and power (we want greater power). Smaller size comes
with smaller power for a given test.
The important concepts are:
1.
Hypothesis tests are critical (rejection of maintained) regions.
2.
A type I error is rejecting a true maintained. A type II error is accepting a false
maintained (rejecting a true alternative).
Page 8 of 32
Econ 413
Hypothesis Testing
3.
The Probability of a Type I error is the probability of the critical region using the
maintained distribution. The Probability of a Type II error is the probability of the
acceptance region using an alternative distribution.
4.
Size is the probability of Type I error (rejecting a true maintained). Power is 1
minus probability of Type II error.
5.
Every test has some power for some alternative hypotheses and less power for
other alternative hypotheses.
6.
A smaller size results in a smaller power (or larger Prob(Type II error). Illustrated
by CR1 and CR2, or CR3 and CR4, or CR5 and CR6.
7.
For some statistical hypothesis tests, two tests exist. One has higher power for
some alternatives and the other has higher power for the remaining alternatives.
CR3 and CR6 have the same size. CR3 is more powerful for alternative success
rate of heads less than .5 and CR6 is more powerful for alternative success rate of
heads greater than .5. CR4 and CR5 have smaller size than CR3 and CR6 but
have a similar comparison for alternatives less or greater than .5.
The critical regions CR3, CR4, CR5 and CR6 are often called one sided. The critical
regions contain only small or only large number of heads. Such one sided critical regions are
powerful for only large or only small alternative success rate of heads. For example, CR3={0,1,2}
is more powerful for the alternative hypotheses of small probabilities of heads while
CR6={9,10,11} is more powerful for the alternative hypotheses of large probabilities of heads.
Summary of POWER. Understanding POWER explains why we would use more than one
test. For example, the Ramsey test may have 1,2,3,4,… terms. Why use more than just 2 terms?
To increase the power of the specification test albeit at changing the size (since doing 1,2,3,4…
terms means you are doing sequential statistical testS not just one test). A Ramsey test with 2 terms
will be more powerful for some alternative hypotheses than a Ramsey test with 4 terms will be more
powerful for some other alternatives. Explaining which statistical test(s) to use is our only use of
POWER.
REGRESSION TESTS – THE T TEST
For a regression, we might wish to test whether some independent variable has a statistically
significant effect on the dependent variable: Income on Consumption, rebounds on percent win,
number of competitors on sales, gender on wages, high school rank on financial aid, etc. We
usually test statistical significance by testing whether the coefficient of the variable is equal to 0.
In OLS regression, with all 7 classical assumptions true, and the additional assumption that
the corresponding coefficient is ZERO, the reported T-statistic for a coefficient is an observation of
a random variable that has a T-distribution. The T-distribution was authored by W.S. Gosset, who
worked for Guinness brewery and wrote under the name Student. Often the T is called Student's T
distribution.
The maintained hypothesis does not specify the remaining coefficients nor the variance of
the error of the equation σ2 – they can be any value. The maintained hypothesis is complex and the
alternative hypothesis is complex.
With the 7 classical assumptions in a simple, one variable regression, the OLS estimator

2
1 is distribute d as Normal ( 1 ,
) . σ2 is unknown!
2
x
 i

We derive a random variable based on 1 which has a T-distribution and does not depend
on any unknown parameters.
Page 9 of 32
Econ 413
Hypothesis Testing

( 1   ) / 
1

2
(

x
2
i
is distribute d as T (n  K  1)
) / 2
Note the σ in the numerator cancels with the square root of the σ2 in the denominator and the
only unknown in the formula is β1 .
The T-distribution has one parameter called Degrees of Freedom (DOF). For most tests, the
value of the DOF parameter is number of observations minus number of estimated coefficients.
In a simple regression there are two coefficient estimates – the intercept and the coefficient of the
single variable. The DOF is n-2. In a K variable regression, there are K+1 coefficients to estimate:
β0 β1 β2 β3 … βK and the DOF is n-(K+1) = n –K - 1.
The display above has a T-distribution if the maintained hypothesis (all 7 classical
assumptions) is true. It does not have a T-distribution if any of 7 assumptions is not true.

( 1   )
1

(
cannot be reported by a statistics program because β1 is unknown.
2
)
x
2
i
The reported T-statistic is

(  1  0)

(
2
x
2
i
)
It will have a T-distribution if all 7 classical assumptions are true and β1=0.
The T-test is a critical/rejection region for the T-statistic – values of the T-statistic for which
you REJECT the maintained hypothesis that all 7 classical assumptions are true and β1=0. T-tests
can have one sided or two sided critical regions. To determine the critical region, you must choose
a size for the test – the probability of a Type I error – the probability you reject a true maintained
hypothesis. What size you choose is your own choice. It is common to have sizes of 0.01, 0.05 or
0.10. In fact in reporting regression results, generally one reports whether the reported t-statistic is
in a 10%, or 5% or 1% critical region. If the reported t-statistic is in the 1% region, it is in the 5%
and 10%.
In most cases, we report significance rather than stating ‘we reject the maintained hypothesis
at the 5% level’. We state ‘the coefficient is statistically significant at 5%’. The meaning is the
same – namely the reported T-statistic is in the 5% critical region. You would report significant at
1% understanding it is also significant at 5% and 10% (and 15% and …).
Below is a plot of the density of a T-distribution for 10 degrees of freedom. For a 5% size,
two sided test, we split the 5% into each tail - 2.5% of the distribution is below – 2.228 and 2.5% of
Page 10 of 32
Econ 413
Hypothesis Testing
the distribution is above 2.228. I found 2.228 on page 585 Studenmund 6th edition (Critical values
of the t-distribution) in row 10 observations and column 2.5% one sided.
Page 11 of 32
Econ 413
Hypothesis Testing
The blue area illustrates a two sided critical/rejection region 5% test..
For 31 degrees of freedom, the probability you would observe a random variable (with a Tdistribution) below -2.0395134464 is 0.025 and similarly above +2.0395134464 is 0.025. so
there is a 5% chance that you would observe a T-random variable below -2.0395134464 or
above +2.0395134464. If you sampled 1,000,000 T-random variables with 31 degrees of freedom,
then approximately 25,000 would be below -2.0395134464 and approximately 25,000 would be
above +2.0395134464.
These calculations used
http://surfstat.anu.edu.au/surfstat-home/tables/t.php very easy with good graphics
http://www.tutor-pages.com/Statistics-Calculator/statistics_tables.html similar graphics
http://socr.ucla.edu/htmls/SOCR_Distributions.html I had to use with IE
http://www.distributome.org/js/calc/index.html
http://www.distributome.org/js/calc/StudentCalculator.htmld for the T-distribution
http://bcs.whfreeman.com/ips4e/cat_010/applets/statsig_ips.html Java Security error used to work
All of these pages use JAVA. JAVA has security issues. Some browsers will not run the
JAVA required. They all used to work.
http://www.danielsoper.com/statcalc3/calc.aspx?id=10 does not use JAVA and calculates to
8 decimals! See http://www.danielsoper.com/statcalc3/default.aspx for other distributions.
Example of T-test: Gender discrimination
To be more explicit, consider a gender discrimination case. The plaintiff contends males are
discriminated against while the defense contends males are not discriminated against. Below is a
(partial) estimation output in the case:
Variable
GENDER
Coefficient
-3.848931
Std. Error t-Statistic
1.863662 -2.065251
Prob.
0.0473
The Degrees of Freedom equals 31.
The variable GENDER is 1 for males, and 0 for females. The negative coefficient indicates
if the individual is male (GENDER=1) then the dependent variable is estimated to be -3.848931 less
than if the individual is female.
The reported Prob. of 0.0473 is the size of a critical/rejection region [-∞,-2.065251]
[+2.065251,+∞] which uses the reported T-statistic to determine the critical/rejection region. The
reported Prob. value is called a p-value. With DOF=31, the probability that you would observe a T
Page 12 of 32
Econ 413
Hypothesis Testing
random variable in [-∞,-2.065251] is 0.02365 (=.0473/2). The probability that you would observe a
T random variable in [+2.065251,+∞] is 0.02365 (=.0473/2).
http://surfstat.anu.edu.au/surfstat-home/tables/t.php shows the two sided critical/rejection
regions.
A 5% critical/rejection region is [-∞,-2.04] [+2.04,+∞]
For accuracy but no picture, http://www.danielsoper.com/statcalc3/calc.aspx?id=10
Page 13 of 32
Econ 413
Hypothesis Testing
The observed T = -2.065251 is in the critical/rejection region [-∞,-2.03951345]. REJECT
the maintained hypothesis at 5% size – REJECT the conjunction of all 7 classical assumptions plus
β=0.
An easier but identical critical/rejection region is the p-value space. If the reported P-value
(Prob. in the output) is LESS THAN the chosen (by you) size of the test, REJECT. For example,
0.0473 is less than .05 and we reject at a size=5% test.
Pvalue or Size
Reported T or Tabled T
.0473
-2.065251
Is less than
Is greater in absolute value
Tabled
.0500
-2.03951345
The critical/rejection region for a 5% test in p-value space is 0.05 . Reported p-values less
than 0.05 REJECT the maintained exactly as reported T-values smaller or larger than the p-value
corresponding to the reported T statistic.
In our example, 0.0473 is greater than .01 => ACCEPT. The .01 critical/rejection region is
For a 1% test, we accept the maintained hypothesis. For a 1% test, the critical/rejection
region is [-∞,-2.744] [+2.744,+∞] . Our reported T-statistic is not in the critical/rejection region.
Reported
Always use the reported p-value to test unless you love extra work!.
If your chosen size of the test is greater than the p-value, REJECT,
If your chosen size of the test is less than the p-value, ACCEPT.
No need to look up in a table of numbers, no need to use an internet calculator. If the pvalue is small, reject. If the p-value is large, accept. You can use the p-value for all the tests we do.
For an test, if the reported p-value is small, say less than .01, REJECT and if the reported p-value is
large, say .20, ACCEPT. How easy can your life get?
The t-test for a coefficient = 0 is theoretically proved to be powerful against alternative
hypotheses in which the classical 7 assumptions are true but the particular coefficient is not 0. The
reported T-statistic has a non-central T-distribution if all 7 classical assumptions are true and the
coefficient ≠ 0. But we do not know the distribution unless we specify a particular value for the
coefficient which then determines the non-centrality parameter of the non-central T-distribution.
If the alternative value for the coefficient is a large absolute value of the coefficient (say
1,000) then the power is greater than if the alternative value for the coefficient is a smaller absolute
value of a coefficient (say 10). We also know increases in size increase power and smaller sizes
have less power. A 1% test has less power for any specific alternative than does a 5% test.
Page 14 of 32
Econ 413
Hypothesis Testing
The formula for the T-statistic provides intuition for the power. The reported T-statistic is

(  1  0)

(
2
 xi2
. If 𝛽̂1 is large, the reported T-statistic is large and the test will reject.
)
One sided tests:
The critical/rejection region [-∞,-2.065251] [+2.065251,+∞] is two sided. Two one sided
5% critical/rejection regions are:
MINUS=[-∞,-1.696] and
PLUS=[1.696,∞] .
A one sided test must specify which side. For our example, the reported T-statistic =
in the critical/rejection region MINUS and is not in the critical/rejection region PLUS.
The critical/rejection region PLUS is more powerful for alternatives with β>0 and the
critical/rejection region MINUS is more powerful for alternatives with β<0.
Often, the one sided tests are phrased with the maintained hypothesis one side and the
alternative the other side. E.g., Using the labels HM and HA for the maintained and alternative
hypothesis:
Test 1. HM1: β≥0 versus HA1: β<0. Use the critical/rejection region MINUS. Relatively
large negative values of the estimated coefficient reject the maintained and accept the alternative.
Test 2. HM2: β≤0 versus HA2: β>0. Use the critical/rejection region PLUS. Relatively large
positive values of the estimated coefficient reject the maintained and accept the alternative.
Using the critical/rejection region MINUS, TEST 1, implies any positive coefficient
estimate (positive reported T-statistic) will accept the maintained hypothesis. The maintained is
all 7 classical assumptions and no NEGATIVE effect. The one sided test, TEST 1, is testing NO
NEGATIVE effect versus the alternative NEGATIVE effect.
Using the critical/rejection region PLUS, TEST 2, implies any negative coefficient estimate
(negative reported T-statistic) will accept the maintained hypothesis. The maintained is all 7
-2.065251is
Page 15 of 32
Econ 413
Hypothesis Testing
classical assumptions and no POSITIVE effect. The one sided test, TEST 2, is testing NO
POSITIVE effect versus the alternative POSITIVE effect.
One sided tests are used if the a priori evidence or theory predicts a positive or a negative
coefficient. For example, testing the slope of the demand curve would be one sided HM: β≥0
versus HA: β<0. Theory and a huge amount of prior evidence indicates demand slopes downward.
We use TEST 1 (critical/rejection region MINUS). If we estimate a positive slope, we accept the
maintained - no negative relationship between price and quantity plus the classical 7 assumptions.
One sided tests have more power on one side compared to a two sided test. Which test we
use (TEST 1 or TEST 2) depends on which side we want more power. Power is 1 minus probability
of Type II error. Type II error is rejecting a true alternative. With a demand function, the prior
evidence is a negative relationship. TEST 1 has more power, less probability of Type II error, for
alternatives with β<0 while TEST 2 has more power, less probability of Type II error, for
alternatives with β>0. TEST 2 has greater probability of Type II error for negative β while TEST 1
has less probability of Type II error for negative β. You use the test which has less probability of
Type II error for the prior beliefs – less probability of rejecting a true prior evidence. Using TEST 2
for a demand function has great probability of Type II error for β<0 versus TEST 1.
The left graph illustrates a one sided test size = 0.05 while the right illustrates a two sided
size=0.05. The one sided power is 0.113 versus the two sided power 0.0078.
CR
CR
CR
The power for this one sided alternative is 0.113 while for the two sided alternative is
0.00708 – about 20 times smaller.
The further the alternative is from the maintained the smaller the difference in power
between one sided and two sided. Above the alternative was 0.1. Below, the graphs illustrate an
alternative equal to 1.2 and the power for both tests is 1.0 – at least to 4 significant digits.
Page 16 of 32
Econ 413
Hypothesis Testing
CR
CR
CR
For a one sided test, divide the reported p-value by 2 to report the significance level for
the one sided test. Recall the two sided p-value is the percent of the distribution below the reported
t-Statistic plus the percent of the distribution above the reported t-Statistic. Two sides! Dividing by
two yields one side.
Variable
GENDER
Coefficient
-3.848931
Std. Error t-Statistic
1.863662 -2.065251
Prob.
0.0473
The reported p-value is 0.0473. For the one sided test, the significance is 0.0473/2 =
0.02365 (the percent of the distribution in one tale, e.g., below the reported t-Statistic). You reject
the maintained hypothesis (β≥0) at 3% but not at 2%. For one sided TEST 2 (no positive effect),
accept the maintained hypothesis at any significance level.
The main points are:
1.
If a priori theory or evidence indicates a sign of the coefficient, use a one sided
test because it is more powerful. The alternative is determined by prior evidence.
2.
The further away the alternative is from the maintained, the greater the power.
One side and two sided tests obtain identical power for alternatives far from the
maintained.
STATISTICAL SIGNIFICANCE
For most regression coefficients, economists state whether the coefficient is statistically
significant at some level (10%, 5%, 1%). Statistically significant means statistically different from
zero, i.e., the maintained hypothesis β=0 is rejected. They (economists) do not say 'REJECT the
coefficient is 0'. They say the coefficient is statistically significant! In many reports, the
coefficients are labeled with a '*', '**', or '***' to indicate significance at 10%, 5%, or 1%. My
tabling program displays coefficients in GREEN, BLUE or RED to indicate significant at 10%,
5%, or 1%
For some analyses, such as a gender discrimination case, rather than state significance,
economists may revert to the 'accept/reject' language. For example,
Variable
GENDER
Coefficient
-3.848931
Std. Error t-Statistic
1.863662 -2.065251
Prob.
0.0473
we would say accept gender discrimination at the 5% level or reject no gender discrimination at the
5% level. The coefficient is statistically significant at 5% is identical to rejecting the coefficient is 0
at 5% or accepting the coefficient not 0 at 5%.
Page 17 of 32
Econ 413
Hypothesis Testing
When you REJECT the maintained hypothesis, you are REJECTing the conjunction of the 7
assumptions and the assumption the particular coefficient is ZERO. The alternative is a large place
– assumption 1, 2, 3, 4, 5, 6 or 7 could be false while the particular coefficient is ZERO or
1,2,3,4,5,6, and 7 may be true while the particular coefficient is not ZERO or every assumption may
be false. The T-test has relatively large power for the alternative all 7 classical assumptions are true
but β=0 is not true.
F-TEST
The T-test maintained is all 7 classical assumptions and one β=0. The coefficient F-test
maintained hypothesis is all 7 classical assumptions and two or more β=0. In Eviews the coefficient
F-test is called the Wald coefficient restriction test. The reported F-statistic in regression output
has an F distribution if all 7 classical assumptions and all of the coefficients in the regression,
except the intercept, are equal to 0 simultaneously. Sometimes, if we reject the maintained (all 7
classical assumptions and all coefficients are 0) we say the regression is significant (at some size).
The reported F-statistic coefficient F-test is powerful for the alternative that all 7 classical
assumptions are true and some or all of the coefficients are not 0 .
The reported F-statistic is computed as:
Explained sum of squares/K
Residual sum of squares/(N-K-1)
The reported F-statistic may be computed from R^2:
𝑅 2 /K
̂ =
𝐹𝑠𝑡𝑎𝑡
(1 − 𝑅 2 )/(N-K-1)
Because
Explained sum of squares
𝑅2 =
Total sum of squares
and for least squares with an intercept
Residual sum of squares = Total sum of squares minus Explained sum of squares
̂ =
𝐹𝑠𝑡𝑎𝑡
The FINAID example:
Dependent Variable: FINAID
Method: Least Squares
Sample: 1 50
Included observations: 50
Variable
Coefficient Std. Error
t-Statistic
C
9813.022
1743.1 5.629638
HSRANK
83.26124
20.14795 4.132492
MALE
-1570.143
784.2971 -2.001975
PARENT
-0.342754
0.031505 -10.87921
R-squared
0.764613 Mean dependent var
Adjusted R-squared
0.749262 S.D. dependent var
S.E. of regression
2686.575 Akaike info criterion
Sum squared resid
3.32E+08 Schwarz criterion
Log likelihood
-463.6635 Hannan-Quinn criter.
F-statistic
49.80764 Durbin-Watson stat
Prob(F-statistic)
0
Prob.
0
0.0002
0.0512
0
11676.26
5365.233
18.70654
18.8595
18.76479
2.301406
Page 18 of 32
Econ 413
Hypothesis Testing
The reported F-statistic has an F distribution if all 7 classical assumptions and the
coefficients of HSRANK, MALE and PARENT are equal to 0. The F-distribution's graph from
http://www.statdistributions.com/f/
The R^2 copied unformatted from Eviews is 0.764613059688
0.764613059688 46
*
 3.248324052 *15.33333  49.80764 the reported F-statistic value.
0.235386940312 3
The reported p-value is 0 (rounded to 5 decimals). The reported p-value is small, less than
all but the tiniest size. REJECT the maintained which is all 7 classical assumptions and the
coefficients of HSRANK, MALE and PARENT equal 0.
The reported F-statistic has a non-central F distribution if all 7 classical assumptions are
true but one or more of the coefficients of HSRANK, MALE and PARENT is not zero. In the
following graph, the non-centrality parameter λ depends on the value of the coefficients. The larger
in absolute value the alternative coefficient values, the larger value of λ.
See
http://www.boost.org/doc/libs/1_36_0/libs/math/doc/sf_and_dist/html/math_toolkit/dist/dist_ref/dist
s/nc_f_dist.html
Page 19 of 32
Econ 413
Hypothesis Testing
Rarely do econometricians compute the power of an F-test. The formula for the reported Fstatistic based on R^2 provides intuition about its power:
𝑅 2 /K
̂ =
𝐹𝑠𝑡𝑎𝑡
(1 − 𝑅 2 )/(N-K-1)
2
If R is large, the reported F-statistic will be large rejecting the maintained that all 7 classical
assumptions and the all the coefficients equal zero. R2 is the correlation squared between the actual
and predicted. R2 can not be large if ALL the estimated coefficients are near zero. If some or all of
the coefficients are not zero, we expect R2 is relatively large rejecting the (false) maintained
hypothesis – large power.
The Wald coefficient restriction F-statistic is used to test a subset of the coefficients equal to
zero. A subset of the coefficients are equal to 0 restricts the equation. Consider a house price
equation from Studenmund chapter 11:
c(1)
c(2)
c(3)
c(4)
c(5)
c(6)
c(7)
c(8)
c(9)
Dependent Variable: P
Method: Least Squares
Sample: 1 43
Included observations: 43
Coefficient Std. Error t-Statistic Prob.
Variable
0
153.4732 32.72537 4.689731
C
0.1261
-0.41988 0.267725 -1.56833
AGE
0.4787
-10.7504 15.00847 -0.71629
BATH
0.8843
9.37739 -0.14666
-1.37532
BED
0.8512
-2.58036 13.65486 -0.18897
CA
0
-5.5561
-30.698 5.525107
N
0
0.10713 0.020243 5.292093
S
0.3986
-10.1114 11.82685 -0.85495
SP
0.0058
0.004618 0.001569 2.944022
Y
Mean dependent var 242.3023
R-squared 0.915694
79.2415
S.D. dependent var
0.895858
Adjusted R-squared
Akaike info criterion 9.504646
25.5721
S.E. of regression
9.873269
Schwarz criterion
22233.7
Sum squared resid
46.16177
F-statistic
Log likelihood -195.35
0
Prob(F-statistic)
stat
1.502265
Durbin-Watson
AGE, BATH, BED, CA (central air) and SP (swimming pool) are statistically insignificant
at the 10% level individually. But the estimators for the coefficients of AGE, BATH, BED, CA
and SP have a joint distribution. The estimators are not independent. If we want to test whether the
coefficients of AGE, BATH, BED, CA and SP are jointly equal to 0, we may use the test that they
are all insignificant (equal to 0) simultaneously or jointly. This is done in Eviews by clicking
View in the regression output window, select COEFFICIENT TESTS and Wald Coefficient
restrictions. In Eviews, you have to enter the restrictions by forming an equation with C(i) where i
is the number of the coefficient. In this case, c(2)=0,c(3)=0,c(4)=0,c(5)=0,c(8)=0 or equally well
c(2)=c(3)=c(4)=c(5)=c(8)=0
Page 20 of 32
Econ 413
Hypothesis Testing
We accept the maintained. The p-value (reported probability) is large. If our size is .10,
.505181> .10. The reported F-statistic is 0.979647. The probability of the critical region
[0.879647,∞] under the maintained is .505181.
Critical region
Page 21 of 32
Econ 413
Hypothesis Testing
We accept the maintained hypothesis.
The size=0.01 (1%) critical region
Critical region
The reported F-statistic is calculated:
(Residual sums of squaresrestricted − Residual of sums of squaresunrestricted )/𝑟
Residual of sums of squaresunrestricted /(N-K-1)
where r is the number of coefficients = 0, the number of restrictions.
The 'restricted' residual sums of squares is obtained by a regression without the variables
whose coefficients are tested to be equal to 0. The unrestricted sums of squares is from the
regression with all the variables. In Eviews
unrestricted regression: LS P C AGE BATH BED CA N S SP Y
restricted regression: LS P C
N S
Y
The alternate method to calculation the F-statistic is change in R^2.
2
^ R2
 RRestricted
n  K 1
Unrestrict
ed
F
*
2
r
1 RUnrestrict
ed
Page 22 of 32
Econ 413
Hypothesis Testing
The R^2 from the unrestricted is 0.915694.
Dependent Variable: P
Method: Least Squares
Sample: 1 43
Included observations: 43
Variable Coefficient Std. Error t-Statistic Prob.
C
117.4655 19.98546 5.877548
0
N
-29.1998 5.139596 -5.68134
0
S
0.102644 0.009319 11.01483
0
Y
0.004117 0.00144 2.858676
0.0068
R-squared 0.904789 Mean dependent var 242.3023
Adjusted R-squared
0.897465 S.D. dependent var
79.2415
S.E. of regression
25.37404 Akaike info criterion 9.393739
Sum squared25109.84
resid
Schwarz criterion
9.557571
Log likelihood-197.965 F-statistic
123.5382
Durbin-Watson
1.67979
stat
Prob(F-statistic)
0
The R^2 from the restricted equation is 0.904789.
^ 0.915694  0.904789 43  8 1
F
*
 .879581524446663
5
1 0.915694
Compare to the Wald coefficient restriction report:
Exact to 4 decimals. If we use the unformatted values rather than the displayed values for R^2 in
the Eviews output:
^ 0.915694295982435  0.904788522663947 43  8 1
F
*
 0.879646987471538
5
1 0.915694295982435
0.879646987471578
and the unformatted value from the Wald coefficient F-test =
which differs at the 15th decimal!
The number of restrictions (coefficients equal to 0) is 5, and N-K-1 is 43-8-1= 34. The
reported F-statistic is 0. 0.879646987471578. We look up the p-value in a statistical calculator
http://rockem.stat.sc.edu/prototype/calculators/index.php3?dist=F
Page 23 of 32
Econ 413
Hypothesis Testing
The blue area is the critical/rejection region for a 50.52% test.
Why use the R2 formula? It may avoid much typing and possible typing mistakes. Why use
the Eviews WALD test? It is easy and avoids looking up the p-value for an F-statistic.
I have an Excel spreadsheet which does the R^2 calculation for you. On
http://econ413.wustl.edu/adata/functionalform.xlsx the sheet r-square will calculate the F
Unrest. R^2
0.91569
Rest. R^2
0.90479
Num Obs
TOTAL Num coef
Num restrictions
Change R^2
1-R^2
Ratio
DOF
DOF/restrictions
F statistic
P-value
43
9
5
0.010905
0.084306
0.12935
34
6.8
0.879582
0.505223
and the p-value.
We do not calculate the power of the test. We know that the power is greater for
alternative coefficient values which are farther from the maintained two or more β=0 and the power
is greater for greater sizes. We do not know its numerical value. Most empirical work will never
report the power.
Another example of the Wald coefficient restriction F-test tests whether some non-linear
variable has an effect on the dependent variable. In the Woody example (page 76 Studenmund),
SALES is explained by Income, Population and competitioN. In Chapter 7 we will consider nonlinear variables in depth but consider the following regression:
Page 24 of 32
C(1)
C(2)
C(3)
C(4)
C(5)
Dependent Variable: SALES
Method: Least Squares
Sample: 1 33
Included observations: 33
Variable Coefficient Std. Error
C
1.04E+08 97707989
I
494.632 484.5582
I^2
-0.003929 0.003701
1/I
-6.64E+10 7.09E+10
LOG(I)
-10042435 10296971
Econ 413
t-Statistic Prob.
1.063095
0.3004
1.02079
0.3195
-1.06163
0.3011
-0.93576
0.3606
-0.97528
0.3411
Hypothesis Testing
Variable CoefficientStd. Error t-Statistic Prob.
N
149544.3 347002.6 0.43096
0.6711
N^2
-5517.122 12273.27 -0.44952
0.6579
1/N
-767689 1971341 -0.38943
0.7011
LOG(N)
-661337.1 1500117 -0.44086
0.664
P
8.472545 5.806703 1.459097
0.1601
P^2
-1.24E-05 8.96E-06 -1.38794
0.1804
1/P
-2.41E+10 1.61E+10 -1.49728
0.1499
LOG(P)
-797545.6 560319.3 -1.42338
0.17
R-squared 0.727247 Mean dependent var
125634.6
Adjusted R-squared
0.563595 S.D. dependent var22404.09
S.E. of regression
14800.35 Akaike info criterion
22.32979
Sum squared4.38E+09
resid
Schwarz criterion 22.91933
Log likelihood
-355.4416 Hannan-Quinn criter.
22.52815
F-statistic 4.443866 Durbin-Watson stat1.961389
Prob(F-statistic)
0.001662
To test whether Income affects SALES we have to test whether the coefficients of I, I^2, 1/I
and LOG(I) are jointly equal to zero, i.e., C(2)= C(3)=C(4)=C(5)=0. If we accept the maintained
(all 7 classical assumptions and β1= β2= β3= β4=0), Income has no effect on SALES.
Wald Test:
Equation: EQ01
Test Statistic Value
df
Probability
F-statistic
1.236889 (4, 20)
0.3271
Chi-square
4.947556
4
0.2927
Null Hypothesis: C(2)= C(3)=C(4)=C(5)=0
We accept Income has no statistically significant effect on SALES in the regression.
We might also test the effect of competitioN and Population. CompetitioN is similar to
Income because each coefficient is individually insignificant. For the joint Wald coefficient
restriction test:
Wald Test:
Equation: EQ01
Test Statistic
Value
df
Probability
F-statistic 5.599061 (4, 20)
0.0034
Chi-square22.39624
4
0.0002
Null Hypothesis: C(6)=C(7)=C(8)=C(9)=0
Null Hypothesis Summary:
Normalized Restriction
Value
(= 0) Std. Err.
C(6)
149544.3 347002.6
C(7)
-5517.12 12273.27
C(8)
-767689 1971341
C(9)
-661337 1500117
The individual effects are statistically insignificant but jointly they are significant.
CompetitioN has a statistically significant effect on SALES. Population also has a statistically
significant effect on SALES.
Page 25 of 32
Econ 413
Hypothesis Testing
Wald Test:
Equation: EQ01
Test Statistic
Value
df
Probability
F-statistic 7.21772 (4, 20)
0.0009
Chi-square28.87088
4
0
Null Hypothesis: C(10)=C(11)=C(12)=C(13)=0
Null Hypothesis Summary:
Normalized Restriction
Value
(= 0) Std. Err.
C(10)
8.472545
5.806703
C(11)
-1.24E-05
8.96E-06
C(12)
-2.41E+10
1.61E+10
C(13)
-797545.6
560319.3
Restrictions are linear in coefficients.
RAMSEY RESET TEST
An econometric specification test is the RAMSEY RESET (page 202 in the text). The
Ramsey test is not a coefficient test. Few examples of the Ramsey test exist in the empirical
literature and the few that do exist always exhibit a Ramsey which accepts the maintained, all 7
classical assumptions. Few packages calculate the Ramsey. The main problem in using the
Ramsey is that if the Ramsey rejects your model, no prescribed fixes are known. The Ramsey test
is a specification test and if the Ramsey rejects your model, you have a problem with no easy cure.
The maintained hypothesis for the Ramsey is all 7 classical assumptions – nothing more.
RESET stands for REgression Specification Error Test. I believe it is less confusing to exclude the
word Error because Error used in RESET does not mean the error of the equation. Error in RESET
means incorrect specification. REIST would exclude the 'error' – REgression Incorrect
Specification Test. Or even REgression Specification Test – REST. I doubt you would rest while
doing empirical econometrics (a joke).
The RAMSEY RESET is powerful against omitted variables (assumption 1), incorrect
functional form (assumption 1), and correlation between X and the error of the equation
(assumption 3). We use the Ramsey RESET test on all equations we estimate. If we reject the
maintained, we are rejecting one or more of the classical 7 assumptions.
Similar to many econometric tests, the Ramsey RESET is an asymptotic test – unlike the Ttest and F-test for coefficient restrictions which are exact finite sample tests. An asymptotic test
means that the distribution of the test statistic is known if N is infinite.
The RAMSEY test is calculated via a two stage proceedure. The first stage is a standard
regression. The predicteds from the standard regression are used in a second regression, called an
auxiliary regression. The first regression regresses the dependent on the independents. The
predicteds are calculated, and then the auxiliary regression regresses the dependent on all the
independents and the square, cube, fourth power, etc. of the predicted dependent variable.
Page 26 of 32
Econ 413
Hypothesis Testing
The first regression obtains






Y i   0   1* X 1i   2 * X 2i   3 * X 3i   4 * X 41i
The auxiliary regression is


Y i   0   1* X1i   2 * X 2i   3* X 3i   4 * X 41i  5 * Yi 2  6 * Yi 3  ...   i
If all 7 classical assumptions are true (in particular assumption 1), then β5 and β6 and … are
theoretically 0. If we have the correct specification, the predicteds squared, the predicteds
cubed, etc. have no explanatory value in the regression. Their coefficients are 0 if we have the
correct specification.
2
If the regression has a misspecified the function form, say X2𝑖
is excluded from the
regression, then because






Y i   0   1* X 1i   2 * X 2i   3 * X 3i   4 * X 41i
^

Y i depends on X 2i and Yi 2 will be correlated with X 22i . The estimate of β5 will be non-zero and
statistically significant.
The Ramsey RESET test jointly tests whether the coefficients on the predicted squared, the
predicted cubed, etc. are equal to 0. The predicted squared, the predicted cubed, etc are not
independent variables in the correctly specified model. They are observations of random variables
which depend on the error of the model. The auxiliary regression with the predicted squared, the
predicted cubed, etc. fails assumption 3 and the standard F-test for coefficients jointly equal to 0 is
not correct – the F-statistic, even if all 7 classical assumptions are true, is not distributed as an Fdistribution. The report of the Ramsey test has a test statistic called the Log Likelihood Ratio
statistic. The Log Likelihood Ratio statistic is an asymptotic test and the reported statistic has a
Chi-square distribution with infinite data, i.e., the Log Likelihood Ratio statistic has an asymptotic
Chi-square distribution. No need to look up the Chi-square distribution - view the p-values reported
for the test to determine whether to accept or reject the model specification.
Page 27 of 32
Econ 413
Hypothesis Testing
The panel on the left details the test and the panel on the right exhibits the auxillary
regression.
Ramsey RESET Test
Equation: EQ01
Specification: P C A BA BE CA N S SP Y
Omitted Variables: Squares of fitted values
Value
df
t-statistic
3.108585
F-statistic
9.663303 (1,
Likelihood ratio
11.04376
F-test summary:
Sum of Sq.df
Test SSR
5035.966
Restricted SSR
22233.7
Unrestricted SSR
17197.73
Unrestricted SSR
17197.73
LR test summary:
Value
df
Restricted LogL
-195.35
Unrestricted LogL
-189.828
Unrestricted Test Equation:
Dependent Variable: P
Method: Least Squares
Sample: 1 43
Included observations: 43
33
33)
1
1
34
33
33
34
33
Unrestricted Test Equation:
Dependent Variable: P
Method: Least Squares
Sample: 1 43
Probability
Included observations: 43
0.0039
Variable
Coefficient
0.0039
C
131.0733
0.0009
A
0.050553
BA
14.45407
Mean Squares BE
3.712888
5035.966
CA
1.51516
653.9322
N
-13.892
521.1433
S
-0.00247
521.1433
SP
4.933024
Y
0.000985
FITTED^2
0.001481
R-squared
0.93479
Adjusted R-squared 0.917005
S.E. of regression
22.82856
Sum squared resid 17197.73
Log likelihood
-189.828
F-statistic
52.56164
Prob(F-statistic)
0
Std. Error t-Statistic
30.08994
4.35605
0.282885
0.178706
15.66057
0.922959
8.529846
0.435282
12.26087
0.123577
7.318212
-1.898281
0.039618
-0.062239
11.61436
0.424735
0.001824
0.539897
0.000476
3.108585
Mean dependent var
S.D. dependent var
Akaike info criterion
Schwarz criterion
Hannan-Quinn criter.
Durbin-Watson stat
Prob.
0.0001
0.8593
0.3627
0.6662
0.9024
0.0664
0.9507
0.6738
0.5929
0.0039
242.3023
79.2415
9.294326
9.703907
9.445367
1.813869

For one fitted term, namely Yi 2 , both the T and the F are reported. Because Assumption 3 is false,
the reported T-statistic does not have a T-distribution and the reported F-statistic does not have an F
distribution. The reported Likelihood ratio has an asymptotic Chi-square distribution and the
reported Probability (p-value) is derived from the Chi-square distribution.
From
http://www.boost.org/doc/libs/1_57_0/libs/math/doc/html/math_toolkit/dist_ref/dists/chi_squared_dist.html
Page 28 of 32
Econ 413
Hypothesis Testing
From http://www.statdistributions.com/chisquare/
The reported Chi-square statistic 11.04376 is in the rejection region for any size > .001.


A two term, Yi 2 , Yi 3 Ramsey test:
Ramsey RESET Test
Equation: EQ01
Specification: P C A BA BE CA N S SP Y
Omitted Variables: Powers of fitted values from 2 to 3
Value
df
Probability
F-statistic 5.892869 (2, 32)
0.0066
Likelihood 13.48361
ratio
2
0.0012
F-test summary:
Sum of Sq.df
Mean Squares
Test SSR 5984.609
2 2992.305
Restricted SSR
22233.7
34 653.9322
Unrestricted16249.09
SSR
32 507.784
Unrestricted16249.09
SSR
32 507.784
LR test summary:
Value
df
Restricted LogL
-195.35
34
Unrestricted -188.608
LogL
32
Dependent Variable: P
Method: Least Squares
Sample: 1 43
Included observations: 43
Variable
Coefficient
C
57.5061
A
0.447802
BA
16.27631
BE
2.55429
CA
-0.96894
N
17.38725
S
-0.10351
SP
14.53782
Y
-0.00438
FITTED^2
0.005835
FITTED^3
-5.33E-06
R-squared
0.938387
Adjusted R-squared 0.919133
S.E. of regression
22.53406
Sum squared resid
16249.09
Log likelihood
-188.608
F-statistic
48.73686
Prob(F-statistic)
0
Std. Error t-Statistic
61.47493
0.93544
0.403041
1.111058
15.51592
1.049007
8.462367
0.301841
12.2384 -0.079172
23.99774
0.724537
0.083635 -1.237683
13.44676
1.081139
0.004318 -1.014324
0.00322
1.812175
3.90E-06 -1.366822
Mean dependent var
S.D. dependent var
Akaike info criterion
Schwarz criterion
Hannan-Quinn criter.
Durbin-Watson stat
The p-value is .0012 – reject for any size test > .0012.
My suggestion is to use 1, 2, 3, and/or 4 fitted items. If Ramsey rejects at 1 term, i.e., a
small reported p-value for the Likelihood ratio, reject all 7 classical assumptions and stop. If
Ramsey accepts at 1 fitted term, try 2 fitted terms. If Ramsey rejects, reject and stop. If Ramsey
Prob.
0.3566
0.2748
0.302
0.7647
0.9374
0.474
0.2248
0.2877
0.318
0.0794
0.1812
242.3023
79.2415
9.284097
9.734636
9.450242
1.974655
Page 29 of 32
Econ 413
Hypothesis Testing
accepts, try 3 terms. If the Ramsey rejects at 3 terms, reject and stop. If it accepts at 3 terms, then
try 4 terms. If the Ramsey accepts at 4 terms, ACCEPT and stop. If it rejects at 4 terms, reject and
stop. Accept the maintained only if Ramsey does not reject at 2, 3, or 4 fitted items (well, you
could do 5, 6, 7, …).
The test composed of 1 term Ramsey, then 2 term Ramsey, then 3 term Ramsey, then 4 term
Ramsey is called a sequential test. I will call the test a sequential Ramsey test. The sequential
Ramsey test does not have to be done sequentially. My tabling program exhibits the results of all 4
tests at once. The sequential Ramsey test rejects if any one of the Ramsey tests (1,2,3 or 4) term
tests rejects.
Using multiple Ramsey tests increases power (reduces the probability of Type II error). The
intuition is 4 opportunities to reject will reject more models than just one opportunity. The size
indicates the probability of rejecting a true maintained.
The literature indicates using a large size for Ramsey test – say 10%. Better to reject a true
maintained (Type I) than to accept a false maintained (Type II). For undominated tests, larger size
obtains larger power. Using 4 tests rather than just one increases the power and increases the size.
Four opportunities rejects more models, some of which are true models and some of which are false
models. Using a stated size equal 10% but using 1 term then 2 terms then 3 terms then 4 terms
means the size of the sequential Ramsey test is greater than the stated 10% size for each test.
The problem is determining the size of the sequential Ramsey test. The 1 term Ramsey test
is not independent of the 2 term Ramsey test which is not independent of the 3 term Ramsey test,
etc. If the tests were independent, we could determine the size to use for the sequential Ramsey
test. The analogous problem is what success rate for heads obtains a .10 success rate for 4 heads.
With one coin flip, if the success rate of heads is 0.10, then the Probability(at least one
heads) is 0.10. With 2 coins, the
Probability(at least one heads in two flips) = 1 – Probability(2 tails) = 1 – 0.81 = 0.19.
A success rate = 0.050954940282302 for heads obtains
Probability(at least one heads in two flips) = 0.10.
A success rate = 0.0260002333539697 for heads obtains
Probability(at least one heads in four flips)=0.10
If the 1 term Ramsey and the 2 term Ramsey sizes are 0.10 and they are independent, then a
two test (1 term and 2 terms) Ramsey would have 0.19 size. For the 4 part sequential Ramsey test
(1 term, 2 terms, 3 terms, 4 terms), if the tests were independent, size= 2.6% obtains size = 10% for
the sequential Ramsey test.
The 1 term Ramsey and the 2 term Ramsey and the 3 term Ramsey and the 4 term Ramsey
are not independent. Determining the size for the sequential Ramsey test requires a Monte Carlo
study. My preliminary study indicates using a 5% size for each Ramsey test to obtain a size = 10%
sequential Ramsey test.
The three tests we use are the Ramsey test (assumption 1 and 3), the homoskedasticity tests
(assumption 5) and the serial correlation tests (assumption 4). If the homoskedasticity tests or the
serial correlation tests reject the model, we fix the model with known proceedures. If the Ramsey
rejects a model, the procedure is find excluded relevant variables or change the functional form. If
we do not have data for the excluded relevant variable, we are up the creek without a paddle.
For functional form, and for this class, I recommend using the fractional polynomial forms
X, X^2, 1/X and LOG(X).
The financial aid example:
Page 30 of 32
Econ 413
Hypothesis Testing
Dependent Variable: FINAID
Method: Least Squares
Sample: 1 50
Included observations: 50
Variable CoefficientStd. Error t-Statistic Prob.
C
9813.022
1743.1 5.629638
0
HSRANK 83.26124 20.14795 4.132492
0.0002
MALE
-1570.14 784.2971 -2.00198
0.0512
PARENT
-0.34275 0.031505 -10.8792
0
R-squared 0.764613 Mean dependent var
11676.26
Adjusted R-squared
0.749262 S.D. dependent var 5365.233
S.E. of regression
2686.575 Akaike info criterion18.70654
Sum squared
3.32E+08
resid
Schwarz criterion
18.8595
Log likelihood
-463.664 Hannan-Quinn criter.
18.76479
F-statistic 49.80764 Durbin-Watson stat 2.301406
Prob(F-statistic) 0
Ramsey RESET Test
Value
df
Probability
Likelihood ratio
21.05309
1
0
Likelihood ratio
28.00845
2
0
Likelihood ratio
28.02948
3
0
Likelihood ratio
28.20443
4
0
All 4 Ramsey tests reject the model. We did not need to do 2 terms or 3 terms or 4 terms
because the model was rejected with a 1 term Ramsey test.
The model is mispecified. One or more of the 7 classical assumptions is false.
We discussed missing variables such as SAT or ALUMNI etc. We do not have data on
those variables. My suggestion is functional form.
Dependent Variable: FINAID
Method: Least Squares
Sample (adjusted): 2 50
Included observations: 42 after adjustments
Variable
Coefficient Std. Error
t-Statistic
C
-1908106
890201.9 -2.143454
HSRANK
-11414.32
5055.282 -2.257899
HSRANK^2
34.30128
14.81629
2.315105
1/HSRANK
8718296
3990332
2.184855
LOG(HSRANK)
582893
262338.2
2.221914
PARENT
0.009082
0.281495
0.032262
PARENT^2
-2.02E-07
2.87E-06 -0.070425
1/PARENT
-5333221
3287665 -1.622191
LOG(PARENT)
-5548.827
2621.593 -2.116586
R-squared
0.838643 Mean dependent var
Adjusted R-squared
0.799527 S.D. dependent var
S.E. of regression
2024.552 Akaike info criterion
Sum squared resid
1.35E+08 Schwarz criterion
Log likelihood
-374.2814 Hannan-Quinn criter.
F-statistic
21.43949 Durbin-Watson stat
Prob(F-statistic)
0
Ramsey RESET Test
Value
df
Probability
Likelihood ratio
0.715182
1
0.3977
Likelihood ratio
2.720409
2
0.2566
Likelihood ratio
4.104027
3
0.2504
Likelihood ratio
4.189997
4
0.3809
Prob.
0.0395
0.0307
0.027
0.0361
0.0333
0.9745
0.9443
0.1143
0.0419
10235.83
4521.689
18.25149
18.62385
18.38798
1.814212
Page 31 of 32
Econ 413
Hypothesis Testing
The model passes the sequential Ramsey test.
Is the model correct? Are all 7 classical assumptions true? We do not know. We accept the
model on the basis of this test. We must also test for heteroskedasticity.
The individual Ramsey test has greater power the more variables are on the RHS of the
equation. To illustrate, consider the Woody example:
Dependent Variable: SALES
Method: Least Squares
Sample: 1 33
Included observations: 33
Variable CoefficientStd. Error t-Statistic Prob.
C
102192.4 12799.83 7.983891
0
I
1.287923 0.543294 2.370584
0.0246
N
-9074.67 2052.674 -4.4209
0.0001
P
0.354668 0.072681 4.87981
0
R-squared 0.618154 Mean dependent var
125634.6
Adjusted R-squared
0.578653 S.D. dependent var 22404.09
S.E. of regression
14542.78 Akaike info criterion22.12079
Sum squared
6.13E+09
resid
Schwarz criterion 22.30218
Log likelihood
-360.993 Hannan-Quinn criter.
22.18182
F-statistic 15.64894 Durbin-Watson stat 1.758193
Prob(F-statistic)
0.000003
Ramsey RESET Test
Equation: EQ02
Specification: SALES C I N P
Value
Likelihood ratio
2.28535
Likelihood ratio 6.619518
Likelihood ratio 7.219245
Likelihood ratio 7.509908
df Probability
1
0.1306
2
0.0365
3
0.0652
4
0.1113
Using a 5% size, the 2 term Ramsey test rejects but if we used a 3.5% size, the 2 term Ramsey
accepts. The example is a 'knife edge' case. Without other information, I would accept the model.
But the model only has 3 variables. Each Ramsey has low power because there are only 3
variables. Consider:
Dependent Variable: SALES
Method: Least Squares
Sample: 1 33
Included observations: 33
Variable CoefficientStd. Error t-Statistic Prob.
C
1.04E+08 97707989 1.063095
0.3004
I
494.632 484.5582 1.02079
0.3195
I^2
-0.00393 0.003701 -1.06163
0.3011
1/I
-6.64E+10 7.09E+10 -0.93576
0.3606
LOG(I)
-1E+07 10296971 -0.97528
0.3411
N
149544.3 347002.6 0.43096
0.6711
N^2
-5517.12 12273.27 -0.44952
0.6579
1/N
-767689 1971341 -0.38943
0.7011
LOG(N)
-661337 1500117 -0.44086
0.664
P
8.472545 5.806703 1.459097
0.1601
P^2
-1.24E-05 8.96E-06 -1.38794
0.1804
1/P
-2.41E+10 1.61E+10 -1.49728
0.1499
LOG(P)
-797546 560319.3 -1.42338
0.17
R-squared 0.727247 Mean dependent var
125634.6
Adjusted R-squared
0.563595 S.D. dependent var 22404.09
S.E. of regression
14800.35 Akaike info criterion22.32979
Sum squared
4.38E+09
resid
Schwarz criterion 22.91933
Log likelihood
-355.442 Hannan-Quinn criter.
22.52815
F-statistic 4.443866 Durbin-Watson stat 1.961389
Prob(F-statistic)
0.001662
Ramsey RESET Test
Value
Likelihood ratio 4.145313
Likelihood ratio 10.07778
Likelihood ratio 10.17031
Likelihood ratio 11.47602
df Probability
1
0.0417
2
0.0065
3
0.0172
4
0.0217
The 2 term Ramsey rejects at size 1%. The model must be rejected by the Ramsey test. More
variables obtain a more powerful Ramsey test. The predicteds have more 'information' – more variables are
Page 32 of 32
Econ 413
Hypothesis Testing
used to calculate the predicteds and hence the predicted squared will be correlated with more variables
compared to the situation with fewer variables in the mode.
The (marginal) acceptance of the 3 variable model is a LOW POWER acceptance, large probability
of Type II error. My suggestion is to start with a large model with many variables and with the fractional
polynomials. If the large model passes the Ramsey, then analyze models with fewer variables. If the large
model is rejected by the sequential Ramsey test, drop back 15 yards and punt (joke). For this class, you
state "I am doing a project. The model fails the Ramsey. I am going to pretend that it passed and do the
project as if it passed. My results are suspect!"
A final warning: Do not exclude variables to obtain a passing Ramsey. Excluding variables
obtains a low power Ramsey. For most data sets I have encountered, if you exclude variables, you
will be able to get a passing Ramsey, albeit a LOW POWER passing Ramsey.
Download