Reading Assignment 12 - Department of Agricultural Economics

Statistics Review
AGEC 317
Both MATH 141 and STAT 303 provide a statistical foundation for AGEC 317.
Econometrics, the application of statistical methods to economic problems, is a basic tool all
economists need to possess. AGEC 317 will introduce econometrics and build on material from
MATH 141 and STAT 303. Econometrics will be used in upper division AGEC classes.
1.
Three general types of economic variables are continuous, discrete, and categorical. A
continuous variable can take any value on a continuum in the sample space, such as all
points on a line or all real numbers. A discrete variable takes on a finite number of
values or a countably infinite number, such as all positive integers. Categorical data are
grouped according to some quality or attribute, such as sex or type of automobile.
a)
Is it possible to have another observation between any two adjacent values of a
discrete variable? A continuous variable? A categorical variable?
b)
Determine the type of data for observations on each of the following: i) flips of a
coin, ii) distances from the earth to stars, and iii) the number of people
entering Blocker on any given day.
2.
The population is defined as the total set of elements of interest. A sample is a
subset of the population. In econometrics, we usually work with samples because it is
usually too costly to observe the entire population. For example, if you are interested in
determining the average number of computers per household in the United States, the
population is one figure per household for all households in the U.S. A sample might
consist of 1,000 households.
a)
If you were interested in determining the average number of drinks served per
customer at the Chicken on Friday nights in January, what would be the
population? What would be a sample?
3.
There are many definitions of probability, but a good working definition is that probability
is the relative frequency of occurrence of an event over repeated trials or experiments.
a)
What is the range for probabilities? What is the interpretation of the endpoints of
this range?
b)
What is the probability of obtaining a head (tail) from flipping a fair coin?
c)
For the following sample observations on water requirement (gal/day/head) for
swine, determine the probability of each water requirement (assume each
observation is equally likely):
Water Requirements for Swine
(gal/day/head)
1.0
1.2
1.4
1.5
2.1
2.2
1.9
1.8
2.1
2.3
2.4
2.5
1.9
1.8
2.0
2.0
2.3
2.0
1.8
2.0
Modified from Hoshmand, A.R.
Statistical Methods for Ag. Sciences
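The relative-frequency definition of probability can be applied directly to the table above. The following is a minimal sketch in Python, assuming each observation is equally likely; the function name `relative_frequencies` is illustrative and not part of the assignment.

```python
from collections import Counter

# Swine water requirements (gal/day/head) from the table above.
water = [1.0, 1.2, 1.4, 1.5, 2.1, 2.2, 1.9, 1.8, 2.1, 2.3,
         2.4, 2.5, 1.9, 1.8, 2.0, 2.0, 2.3, 2.0, 1.8, 2.0]


def relative_frequencies(observations):
    """Return {value: probability}, treating each observation as equally likely."""
    n = len(observations)
    counts = Counter(observations)
    # Each distinct value gets (number of times it occurs) / (number of observations).
    return {value: count / n for value, count in sorted(counts.items())}


probabilities = relative_frequencies(water)
for value, p in probabilities.items():
    print(f"{value:.1f}: {p:.2f}")
```

The probabilities sum to one because every observation is counted exactly once.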
4.
A probability distribution function is a function that associates each value of a discrete
random variable with the probability that this value will occur. The usual notation uses
symbols such as p(x) or f(x) to denote the probability distribution of variable x. A
cumulative probability distribution function gives the probability that the value of the
random variable is less than or equal to x. CDFs are usually denoted by a capital letter,
such as P(x) or F(x). For continuous variables, probability density function (pdf) and
cumulative distribution function (cdf) are the proper terms. Other than the continuous
nature, there is little difference in the use of the two types of functions. In fact, many
people do not make a distinction. In AGEC 317, pdf and cdf will be used to refer to
either type of probability function.
a)
Create a histogram for the data in question 3c assuming each observation is
equally likely. A histogram is a series of rectangles with areas proportional to the
probabilities of a probability distribution; therefore, histograms are a form of
probability distribution. Histograms are normally bar charts with categories on
the x-axis and probability on the y-axis. They are normally used for discrete and
categorical data, along with sample data. Use the categories of water
requirements for the x-axis.
b)
Create a discrete CDF for the data in 3c. Use the categories of water requirements
for the x-axis.
c)
Draw a hypothetical continuous pdf and cdf.
5.
Besides probability functions, descriptive statistical measures are important for describing
a sample or population. Descriptive statistics include measures of central tendency and
measures of dispersion. Measures of central tendency include the mean, median, and
mode. The mean of a sample or population is calculated by the formula
$\bar{x} = \sum_{i} x_i f(x_i)$
where f(x) is the pdf and x is the random variable. In samples, it is usually assumed each
observation is equally likely, and the formula for the mean becomes
$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
where n is the number of observations. Note that with equally likely outcomes, 1/n is the
probability of occurrence of each observation. The median is the middle point or observation
when the data are ordered from smallest to largest. The mode is the value that occurs most
often in a distribution. Note that with categorical data, these definitions may change slightly,
especially the calculation of the mean. Check appropriate statistics books for proper calculation.
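The measures of central tendency described above can be sketched in Python for the data in question 3c. This is an illustration under stated assumptions, not a prescribed method: the function names are made up for this example, the pdf is estimated by relative frequency, and for an even number of observations the median is taken as the average of the two middle values (a common convention the text does not spell out).

```python
from collections import Counter

# Swine water requirements (gal/day/head) from question 3c.
water = [1.0, 1.2, 1.4, 1.5, 2.1, 2.2, 1.9, 1.8, 2.1, 2.3,
         2.4, 2.5, 1.9, 1.8, 2.0, 2.0, 2.3, 2.0, 1.8, 2.0]


def mean_from_pdf(observations):
    """Mean as the sum of x_i * f(x_i), with f(x_i) estimated by relative frequency."""
    n = len(observations)
    return sum(x * count / n for x, count in Counter(observations).items())


def mean_equally_likely(observations):
    """Mean as (1/n) * sum of x_i, assuming equally likely observations."""
    return sum(observations) / len(observations)


def median(observations):
    """Middle observation of the ordered data (average of the two middle ones if n is even)."""
    ordered = sorted(observations)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def modes(observations):
    """All values that occur most often (there may be more than one)."""
    counts = Counter(observations)
    top = max(counts.values())
    return sorted(value for value, count in counts.items() if count == top)


print(mean_from_pdf(water), mean_equally_likely(water))
print(median(water), modes(water))
```

Both mean functions agree up to floating-point rounding, which mirrors the point that 1/n plays the role of f(x) when observations are equally likely.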
a)
Find the mean, mode, and median for the data in 3c.
b)
What is the impact of changing the mean on a bell-shaped pdf (see 4c answers)?
c)
What type of distribution has the mean = mode = median?
6.
Measures of dispersion of interest are the range, maximum, minimum, variance,
standard deviation, and coefficient of variation. The range is the difference between
the largest value in the sample (the maximum) and the smallest value (the minimum), that
is, $R = x_{max} - x_{min}$, where R denotes the range. Variance, $\sigma^2$, is a measure of
deviation from the mean. Interpretation of variance is difficult for a single set of
observations; variance is mainly used to compare distributions. One reason for the difficulty
in interpretation is that variance is in units of the variable squared. To overcome this issue,
the standard deviation, $\sigma$, which is the square root of the variance, is used. Standard
deviation is interpreted as a measure of variability that indicates by how much values of a
distribution typically deviate from its mean. Unfortunately, the standard deviation is in the
units of the variable (e.g., miles, dollars, kilometers). Therefore, it is not proper to compare
standard deviations between samples that are in different units. The coefficient of
variation, CV, is a unitless measure of dispersion. The CV is used as a measure of
relative variation and can be used to compare variation in several data sets. For samples
under the assumption of equally likely observations, these measures are calculated as
follows (where the hat denotes a sample value and not a population value):
$\hat{\sigma}^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
$\hat{\sigma} = \sqrt{\hat{\sigma}^2}$
$CV = \frac{\hat{\sigma}}{\bar{x}} \times 100$
where $\bar{x}$ denotes the mean of the sample. We divide by n - 1 instead of n because only
n - 1 deviations are independent: the deviations sum to zero, so n - 1 deviations
determine the nth deviation. As with all measures, check a statistics book for the
appropriate calculation for frequency data and for observations that are not equally likely.
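The dispersion formulas above translate almost line for line into code. The following Python sketch applies them to the data in question 3c, assuming equally likely observations; the function names are illustrative only.

```python
import math

# Swine water requirements (gal/day/head) from question 3c.
water = [1.0, 1.2, 1.4, 1.5, 2.1, 2.2, 1.9, 1.8, 2.1, 2.3,
         2.4, 2.5, 1.9, 1.8, 2.0, 2.0, 2.3, 2.0, 1.8, 2.0]


def sample_mean(observations):
    return sum(observations) / len(observations)


def sample_variance(observations):
    """Sum of squared deviations from the mean, divided by n - 1."""
    xbar = sample_mean(observations)
    return sum((x - xbar) ** 2 for x in observations) / (len(observations) - 1)


def sample_std(observations):
    """Square root of the sample variance, in the units of the variable."""
    return math.sqrt(sample_variance(observations))


def coefficient_of_variation(observations):
    """Unitless relative variation: standard deviation over mean, times 100."""
    return sample_std(observations) / sample_mean(observations) * 100


data_range = max(water) - min(water)
# The deviations sum to zero (up to rounding), which is why only n - 1 are independent.
deviation_sum = sum(x - sample_mean(water) for x in water)

print(data_range, sample_variance(water), sample_std(water),
      coefficient_of_variation(water), deviation_sum)
```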
a)
Calculate the maximum, minimum, range, variance, standard deviation, and
coefficient of variation for the data in 3c.
b)
For a bell-shaped curve, what is the impact of changing the variance on the pdf?
7.
Besides the bell-shaped curve, pdfs can take on many different shapes. If a curve is not
symmetric, it is skewed. A positively skewed distribution has a longer tail on the right,
whereas a negatively skewed distribution has a longer tail on the left. Examples of other
distributions include the following (from Hoshmand, A.R. Statistical Methods for Agricultural
Sciences).
[Figure: examples of other distribution shapes.]
The impact of skewness on the mean, mode, and median can be illustrated as follows for
the bell-shaped curve (from Hoshmand, A.R. Statistical Methods for Agricultural
Sciences).
[Figure: positions of the mean, mode, and median under skewness.]
8.
Statistical inference usually involves hypothesis testing. Recall that two different
hypotheses, a null and an alternative, are necessary to properly form a statistical test.
The null hypothesis, commonly denoted H0, is the hypothesis of interest, whereas the
alternative hypothesis, HA, is the complement of the null. Properly stated null and
alternative hypotheses cover all alternatives. A common mistake is to use the following
hypotheses: H0: x = 2 and HA: x > 2. In this case, the potential exists for x to be less than
two, which is not covered by the hypotheses. A properly stated test would be H0: x = 2
and HA: x ≠ 2.
a)
State correct null and alternative hypotheses for the following tests: x is greater
than 5; x is equal to zero; and x is less than or equal to a.
b)
Two general forms of a test (using a t-distribution) are one- and two-tailed tests.
A one-tailed test is a test in which we are interested in rejecting the null hypothesis only if
the variable of interest is sufficiently large or sufficiently small, but not both. A two-tailed
test rejects the null hypothesis if the variable is either sufficiently larger or sufficiently
smaller than the hypothesized value. For the three tests given in 8a, state whether they are
one- or two-tailed tests.
9.
One of the most important statistical tests in econometrics is based on the Student
t-distribution. The t-distribution is a symmetric bell-shaped distribution, but the shape
(probabilities) depends on the degrees of freedom of the distribution. As the degrees of
freedom approach infinity, the t-distribution approaches the standard normal distribution.
The general t-test is
$t = \frac{\bar{x} - x_0}{\hat{\sigma} / \sqrt{n}}$
where $\bar{x}$ is the variable of interest, $x_0$ is the null hypothesis
value, $\hat{\sigma}$ is an appropriate estimate of the standard deviation of x, and n is the
number of observations. The null hypothesis is not rejected if the test statistic falls in the
fail to reject (acceptance) region, whereas the null hypothesis is rejected if the statistic
falls in the rejection region. These regions are determined by the level of significance of
the test, $\alpha$. Critical values are obtained from tabulated test values found in most
statistics books.
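As a rough illustration of the general t-test formula, the sketch below computes the statistic in Python for the swine data in question 3c. It assumes the sample mean is the variable of interest and uses the n - 1 sample standard deviation from question 6 as the estimate of the standard deviation; the function name `t_statistic` is hypothetical.

```python
import math

# Swine water requirements (gal/day/head) from question 3c.
water = [1.0, 1.2, 1.4, 1.5, 2.1, 2.2, 1.9, 1.8, 2.1, 2.3,
         2.4, 2.5, 1.9, 1.8, 2.0, 2.0, 2.3, 2.0, 1.8, 2.0]


def t_statistic(observations, x0):
    """t = (sample mean - null value) / (sample std / sqrt(n))."""
    n = len(observations)
    xbar = sum(observations) / n
    # Assumption: sigma-hat is the sample standard deviation with an n - 1 divisor.
    variance = sum((x - xbar) ** 2 for x in observations) / (n - 1)
    sigma_hat = math.sqrt(variance)
    return (xbar - x0) / (sigma_hat / math.sqrt(n))


t_null_two = t_statistic(water, 2.0)   # null hypothesis value of 2
t_null_zero = t_statistic(water, 0.0)  # null hypothesis value of 0
print(t_null_two, t_null_zero)
```

The degrees of freedom for looking up a critical value would be n - 1, which is 19 for this sample.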
Fail to reject and rejection regions for one- and two-tailed tests are:
[Figure: Two-tailed test, with a rejection region of area α/2 in each tail and the fail to reject region between them. One-tailed test, with a single rejection region of area α in one tail and the fail to reject region covering the rest.]
For a one-tailed test, the rejection region could also be on the left side. Fail to reject or
reject the null hypothesis based on where the calculated t-value falls.
Selected Critical Values for the t-distribution
Level of Significance α (see diagrams above)

Degrees of Freedom     .10       .05       .025      .01
 1                    3.078     6.314    12.706    31.821
15                    1.341     1.753     2.131     2.602
19                    1.328     1.729     2.093     2.539
20                    1.325     1.725     2.086     2.528
21                    1.323     1.721     2.080     2.518
 ∞                    1.282     1.645     1.960     2.326
a)
Test the following hypotheses using a t-test for the data in 3c. Be sure to state
the null and alternative hypotheses. Are swine water requirements less than 2 gallons
/ day / head? Test at a level of significance of 0.05. Are swine water requirements
equal to zero gallons / day / head? Test at a level of significance of 0.05 in each tail.
b)
Why do the values in the table increase as the significance level decreases?
c)
Why do the values in the table decrease as the degrees of freedom increase?
Statistical tests are not perfect; they have errors associated with them. A type I error is
rejecting the null hypothesis when it is true, and a type II error is failing to reject a false
null hypothesis.
d)
How can you decrease the chance of a type I error? In this case, what happens to
the chance of a type II error?
e)
How can you decrease the chance of a type II error? In this case, what happens to
the chance of a type I error?
f)
Complete the following table.
Table of Decisions in Hypothesis Decision Making
                                      Status of Null Hypothesis
Decision Regarding Statistical Test   True                False
Fail to reject null
Reject null
10.
The F-distribution and F-test are important in econometrics for testing whether sample or
regression variances differ. A common use is to test hypotheses concerning multiple
coefficients in a regression (more on this in class). The F-test takes on several forms, but
in general, it is a variance divided by a variance. Recall that a variance is a squared term;
therefore, the F-test consists of two positive terms. The F-test is a one-tailed test
associated with the right-hand tail. Further, because both the numerator and denominator
are estimated variances, they have degrees of freedom associated with them.
Critical values for the F-test, therefore, have two degrees of freedom associated with
them: numerator degrees of freedom and denominator degrees of freedom. Tables give
the critical values for the F-test based on the level of significance and the two degrees of
freedom.
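The F-distribution and test are as follows:
[Figure: F-distribution; x-axis: Values of F, starting at 0. The rejection region of area α lies in the right-hand tail beyond the F critical value.]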
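To make "a variance divided by a variance" concrete, here is a small Python sketch of one simple form of the F statistic: the ratio of two sample variances, returned with its numerator and denominator degrees of freedom. The two samples are hypothetical numbers chosen only for illustration, and the function names are assumptions; the forms of the F-test used in regression will be covered in class.

```python
def sample_variance(observations):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(observations)
    xbar = sum(observations) / n
    return sum((x - xbar) ** 2 for x in observations) / (n - 1)


def f_statistic(numerator_sample, denominator_sample):
    """Return (F, numerator df, denominator df) for a ratio of two sample variances."""
    f_value = sample_variance(numerator_sample) / sample_variance(denominator_sample)
    return f_value, len(numerator_sample) - 1, len(denominator_sample) - 1


# Hypothetical samples for illustration only; these are not data from the assignment.
sample_a = [2.0, 4.0, 6.0, 8.0]
sample_b = [3.0, 4.0, 5.0, 6.0]

f_value, df_numerator, df_denominator = f_statistic(sample_a, sample_b)
print(f_value, df_numerator, df_denominator)
```

Because both variances are positive, the ratio is positive, and the calculated value would be compared with the tabulated critical value for the chosen α and the two degrees of freedom.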
Answers
1.
a)
Discrete: no values lie between adjacent values. Continuous: there is always a value
between any two values. Categorical: it makes no sense to talk about adjacent values.
b)
(i) Categorical: takes on the values heads or tails. (ii) Continuous: could take on
any value, including fractional values. (iii) Discrete: takes on only integer values.
2.
The population would be the number of drinks served to every customer at the Chicken on
Friday nights in January. A sample would consist of only some of those customers.
3.
a)
The range is 0 to 1. A zero indicates the event never occurs, and a one indicates the event
always occurs.
b)
0.50 or 50% for heads, and likewise for tails.
c)
Assuming each observation is equally likely, the following probabilities are
obtained: 2.0 has probability 0.2; 1.8 has 0.15; 1.9, 2.1, and 2.3 each have 0.1; and each of
the remaining requirements has probability 0.05 or 1/20.
4.
a)
[Figure: histogram "PDF Water Requirements"; x-axis: Water Requirement (1 to 2.5), y-axis: Probability (0 to 0.25).]
b)
[Figure: "CDF Water Requirements"; x-axis: Water Requirement (1 to 2.5), y-axis: Probability (0 to 1.2).]
c)
[Figure: example of a bell-shaped curve ("Continuous pdf") and a "General CDF"; y-axis: Probability.]
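The values behind a discrete CDF can be obtained by accumulating the relative frequencies from smallest to largest water requirement. The following is a minimal sketch in Python, assuming equally likely observations; `discrete_cdf` is an illustrative name.

```python
from collections import Counter
from itertools import accumulate

# Swine water requirements (gal/day/head) from question 3c.
water = [1.0, 1.2, 1.4, 1.5, 2.1, 2.2, 1.9, 1.8, 2.1, 2.3,
         2.4, 2.5, 1.9, 1.8, 2.0, 2.0, 2.3, 2.0, 1.8, 2.0]


def discrete_cdf(observations):
    """Return {value: P(X <= value)} for each distinct value, in increasing order."""
    n = len(observations)
    counts = Counter(observations)
    values = sorted(counts)
    # Accumulate integer counts first, then divide once, to limit rounding error.
    cumulative_counts = accumulate(counts[value] for value in values)
    return {value: total / n for value, total in zip(values, cumulative_counts)}


cdf = discrete_cdf(water)
for value, probability in cdf.items():
    print(f"{value:.1f}: {probability:.2f}")
```

The CDF never decreases and reaches one at the largest observed value.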
5.
a)
mean = 1.91, median = 2, and mode = 2. In general, median, mode, and mean do
not have to be equal.
b)
Increasing the mean shifts the pdf to the right as shown by the dotted pdf.
[Figure: "Continuous pdf" with a dotted pdf shifted to the right.]
c)
A symmetrical distribution with a single peak, such as the bell-shaped curve, has
mean = mode = median.
6.
a)
max = 2.5, min = 1, range = 1.5, variance = 0.151, st. dev. = 0.389, CV = 20.377
b)
Decreasing the variance makes the pdf more peaked, as in the dotted pdf.
[Figure: "Continuous pdf" with a more peaked dotted pdf.]
7.
No problems.
8.
a)
H0: x > 5 and HA: x ≤ 5; H0: x = 0 and HA: x ≠ 0; H0: x ≤ a and HA: x > a.
b)
One-tailed test; two-tailed test; and one-tailed test.
9.
a)
H0: x < 2 and HA: x ≥ 2. With 19 degrees of freedom, α = .05 gives a critical value of
1.729 for this one-tailed test.
$t = \frac{\bar{x} - x_0}{\hat{\sigma} / \sqrt{n}} = \frac{1.91 - 2}{0.389 / \sqrt{20}} = -1.034$
The calculated value is less than 1.729, so fail to reject the null hypothesis.
H0: x = 0 and HA: x ≠ 0. The calculated value is 21.95, which falls outside the range
of critical values from the table of -1.729 to 1.729 (two-tailed test with 0.05 in each tail),
so reject the null hypothesis.
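The decision step can be written as a simple comparison between the calculated t-value and the tabulated critical value. The Python sketch below is one way to express that rule; the function `decide` and its `tail` argument are hypothetical names, and the t-values passed in are the approximate values for the swine data.

```python
def decide(t_value, critical_value, tail):
    """Return 'reject' or 'fail to reject' for a right-, left-, or two-tailed test.

    critical_value is the positive tabulated value for the area placed in each
    rejection tail; tail is 'right', 'left', or 'two'.
    """
    if tail == "right":
        reject = t_value > critical_value
    elif tail == "left":
        reject = t_value < -critical_value
    elif tail == "two":
        reject = abs(t_value) > critical_value
    else:
        raise ValueError("tail must be 'right', 'left', or 'two'")
    return "reject" if reject else "fail to reject"


# 19 degrees of freedom with 0.05 in the rejection tail: critical value 1.729.
print(decide(-1.034, 1.729, "right"))  # first test: rejection region on the right
print(decide(21.9, 1.729, "two"))      # second test: 0.05 in each tail
```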
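b)
A smaller significance level means less probability in the tail, so the critical value lies
farther out.
c)
Greater degrees of freedom mean more confidence in the estimates, so the critical values
are smaller.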
d)
Decreasing the level of significance (less area in the tail) will decrease the chance
of a type I error, but will increase the chance of a type II error.
e)
Increasing the level of significance (more area in the tail) will decrease the chance
of a type II error, but increase the chance of a type I error.
f)
Table of Decisions in Hypothesis Decision Making
                                      Status of Null Hypothesis
Decision Regarding Statistical Test   True                False
Fail to reject null                   correct decision    Type II error
Reject null                           Type I error        correct decision
10.
More on the F-distribution in class.