Statistics Review
AGEC 317

Both MATH 141 and STAT 303 provide a statistical foundation for AGEC 317. Econometrics, the application of statistical methods to economic problems, is a basic tool all economists need to possess. AGEC 317 will introduce econometrics and build on material from MATH 141 and STAT 303. Econometrics will be used in upper-division AGEC classes.

1. Three general types of economic variables are continuous, discrete, and categorical. A continuous variable can take any value in a continuum, such as all points on a line or all real numbers. A discrete variable takes values from a finite or countably infinite set, such as the positive integers. Categorical data are grouped according to some quality or attribute, such as sex or type of automobile.

a) Is it possible to have another observation between any two values of a discrete variable? Of a continuous variable? Of a categorical variable?
b) Determine the type of data for each of the following: observations on (i) flips of a coin, (ii) distances from the earth to stars, and (iii) the number of people entering Blocker on any given day.

2. The population is defined as the total set of elements of interest. A sample is a subset of the population. In econometrics, we usually work with samples because it is usually too costly to observe the entire population. For example, if you are interested in determining the average number of computers per household in the United States, the population is one figure per household for all households in the U.S. A sample might consist of 1,000 households.

a) If you were interested in determining the average number of drinks served per customer at the Chicken on Friday nights in January, what would be the population? What would be a sample?

3. There are many definitions of probability, but a good working definition is that probability is the relative frequency of occurrence of an event over repeated trials or experiments.

a) What is the range for probabilities?
What is the interpretation of the endpoints of this range?
b) What is the probability of obtaining a head (tail) from flipping a fair coin?
c) For the following sample observations on water requirements (gal/day/head) for swine, determine the probability of each water requirement (assume each observation is equally likely):

Water Requirements for Swine (gal/day/head)
1.0  1.2  1.4  1.5  2.1  2.2  1.9  1.8  2.1  2.3
2.4  2.5  1.9  1.8  2.0  2.0  2.3  2.0  1.8  2.0
Modified from Hoshmand, A.R. Statistical Methods for Ag. Sciences

4. A probability distribution function associates each value of a discrete random variable with the probability that this value will occur. The usual notation uses symbols such as p(x) or f(x) to denote the probability distribution of the variable x. A cumulative probability distribution function gives the probability that the value of the random variable is less than or equal to x. CDFs are usually denoted by a capital letter, such as P(x) or F(x). For continuous variables, probability density function (pdf) and cumulative distribution function (cdf) are the proper terms. Other than the continuous nature, there is little difference in the use of the two types of functions. In fact, many people do not make a distinction. In AGEC 317, pdf and cdf will be used to refer to either type of probability function.

a) Create a histogram for the data in question 3c, assuming each observation is equally likely. A histogram is a series of rectangles with areas proportional to the probabilities of a probability distribution; therefore, a histogram is a form of probability distribution. Histograms are normally bar charts with categories on the x-axis and probability on the y-axis. They are normally used for discrete and categorical data, along with sample data. Use the categories of water requirements for the x-axis.
b) Create a discrete CDF for the data in 3c. Use the categories of water requirements for the x-axis.
c) Draw a hypothetical continuous pdf and cdf.

5. Besides probability functions, descriptive statistical measures are important for describing a sample or population. Descriptive statistics include measures of central tendency and measures of dispersion. Measures of central tendency include the mean, median, and mode. The mean of a sample or population is calculated by the formula $\bar{x} = \sum_i x_i f(x_i)$, where $f(x)$ is the pdf and $x$ is the random variable. In samples, it is usually assumed that each observation is equally likely, and the formula for the mean becomes $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$, where $n$ is the number of observations. Note that with equally likely outcomes, $1/n$ is the probability of occurrence. The median is the middle point or observation when the data are ordered from smallest to largest. The mode is the value that occurs most often in a distribution. Note that with categorical data these definitions may change slightly, especially the calculation of the mean. Check appropriate statistics books for the proper calculation.

a) Find the mean, mode, and median for the data in 3c.
b) What is the impact of changing the mean on a bell-shaped pdf (see 4c answers)?
c) What type of distribution has the mean = mode = median?

6. Measures of dispersion of interest are the range, maximum, minimum, variance, standard deviation, and coefficient of variation. The range is the difference between the largest value in the sample (the maximum) and the smallest value (the minimum), that is, $R = x_{max} - x_{min}$, where $R$ denotes the range. Variance, $\sigma^2$, is a measure of deviation from the mean. Interpretation of variance is difficult for a single set of observations; variance is mainly used to compare distributions. One reason for the difficulty in interpretation is that variance is in units of the variable squared. To overcome this issue, the standard deviation, $\sigma$, which is the square root of the variance, is used.
Standard deviation is interpreted as a measure of variability that indicates by how much values of a distribution typically deviate from its mean. Unfortunately, the standard deviation is in the units of the variable (e.g., miles, dollars, kilometers). Therefore, it is not proper to compare standard deviations between samples that are in different units. The coefficient of variation, CV, is a unitless measure of dispersion. The CV is used as a measure of relative variation and can be used to compare variation across several data sets. For samples under the assumption of equally likely observations, these measures are calculated as follows (where the hat denotes a sample and not a population value):

$\hat{\sigma}^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
$\hat{\sigma} = \sqrt{\hat{\sigma}^2}$
$CV = \frac{\hat{\sigma}}{\bar{x}} \times 100$

where $\bar{x}$ denotes the mean of the sample. We divide by n - 1 instead of n because only n - 1 deviations are independent: the deviations sum to zero, so any n - 1 deviations determine the nth deviation. As with all measures, check a statistics book for the appropriate calculation for frequency data and for observations that are not equally likely.

a) Calculate the maximum, minimum, range, variance, standard deviation, and coefficient of variation for the data in 3c.
b) For a bell-shaped curve, what is the impact of changing the variance on the pdf?

7. Besides the bell-shaped curve, pdfs can take on many different shapes. If a curve is not symmetric, it is skewed. A positively skewed distribution has a longer tail on the right, whereas a negatively skewed distribution has a longer tail on the left.

[Figure: examples of other distributions, from Hoshmand, A.R. Statistical Methods for Agricultural Sciences.]

[Figure: impact of skewness on the mean, mode, and median, illustrated relative to the bell-shaped curve, from Hoshmand, A.R. Statistical Methods for Agricultural Sciences.]

8. Statistical inference usually involves hypothesis testing.
Recall that two different hypotheses, a null and an alternative, are necessary to properly form a statistical test. The null hypothesis, commonly denoted H0, is the hypothesis of interest, whereas the alternative hypothesis, HA, is the complement of the null. Properly stated null and alternative hypotheses cover all alternatives. A common mistake is to use the hypotheses H0: x = 2 and HA: x > 2. In this case, the potential exists for x to be less than two, which is not covered by either hypothesis. A properly stated test would be H0: x = 2 and HA: x ≠ 2.

a) State correct null and alternative hypotheses for the following tests: x is greater than 5; x is equal to zero; and x is less than or equal to a.
b) Two general forms of a test (using a t-distribution) are one- and two-tailed tests. A one-tailed test is a test in which we are interested in rejecting the null hypothesis only if the variable of interest is sufficiently large or sufficiently small, but not both. A two-tailed test rejects the null hypothesis if the variable is either sufficiently larger or sufficiently smaller than the hypothesized value. For the three tests given in 8a, state whether they are one- or two-tailed tests.

9. One of the most important statistical tests in econometrics is based on the Student t-distribution. The t-distribution is a symmetric bell-shaped distribution, but its shape (probabilities) depends on the degrees of freedom of the distribution. As the degrees of freedom approach infinity, the t-distribution approaches the standard normal distribution. The general t-test is

$t = \frac{\bar{x} - x_0}{\hat{\sigma} / \sqrt{n}}$

where $\bar{x}$ is the sample mean of the variable of interest, $x_0$ is the null hypothesis value, $\hat{\sigma}$ is an appropriate estimate of the standard deviation of x, and $n$ is the number of observations. The null hypothesis is not rejected if the test statistic falls in the fail-to-reject (acceptance) region, whereas the null hypothesis is rejected if the statistic falls in the rejection region. These regions are determined by the level of significance of the test, $\alpha$.
Critical values are obtained from tabulated test values found in most statistics books. Fail-to-reject and rejection regions for one- and two-tailed tests are:

[Figure: two-tailed test, with a rejection region of area α/2 in each tail and the fail-to-reject region between them; one-tailed test, with a single rejection region of area α in the right tail and the fail-to-reject region to its left.]

For a one-tailed test, the rejection region could also be on the left side. Fail to reject or reject the null hypothesis based on where the calculated t-value falls.

Selected Critical Values for the t-distribution
                      Level of Significance α (see diagrams above)
Degrees of Freedom    .10      .05      .025     .01
1                     3.078    6.314    12.706   31.821
15                    1.341    1.753    2.131    2.602
19                    1.328    1.729    2.093    2.539
20                    1.325    1.725    2.086    2.528
21                    1.323    1.721    2.080    2.518
∞                     1.282    1.645    1.960    2.326

a) Test the following null hypotheses using a t-test for the data in 3c. Be sure to state the null and alternative hypotheses. Are swine water requirements less than 2 gallons/day/head? Test at a level of significance of 0.05. Are swine water requirements equal to zero gallons/day/head? Test at a level of significance of 0.05 in each tail.
b) Why do the values in the table increase as the significance level decreases?
c) Why do the values in the table decrease as the degrees of freedom increase?

Statistical tests are not perfect; they have errors associated with them. A type I error is rejecting the null hypothesis when it is true, and a type II error is failing to reject a false null hypothesis.

d) How can you decrease the chance of a type I error? In this case, what happens to the chance of a type II error?
e) How can you decrease the chance of a type II error? In this case, what happens to the chance of a type I error?
f) Complete the following table.

Table of Decisions in Hypothesis Decision Making
Decision Regarding       Status of Null Hypothesis
Statistical Test         True                False
Fail to reject null
Reject null

10.
The F-distribution and the F-test are important in econometrics for testing whether sample or regression variances differ. A common use is to test hypotheses concerning multiple coefficients in a regression (more on this in class). The F-test takes on several forms, but in general the test statistic is a variance divided by a variance. Recall that a variance is a squared term; therefore, the F-statistic is a ratio of two positive terms. The F-test is a one-tailed test associated with the right-hand tail. Further, because both the numerator and the denominator are estimated variances, each has degrees of freedom associated with it. Critical values for the F-test, therefore, have two degrees of freedom associated with them: a numerator degrees of freedom and a denominator degrees of freedom. Tables give the critical values for the F-test based on the level of significance and the two degrees of freedom. The F-distribution and test are as follows:

[Figure: F-distribution over values of F starting at 0, with the rejection region of area α in the right tail beyond the F critical value.]

Answers

1. a) Discrete: no, there are no values between adjacent values. Continuous: yes, there is always a value between any two values. Categorical: it makes no sense to talk about adjacent values.
b) (i) Categorical: takes on the values heads or tails. (ii) Continuous: could take on any value, including fractional values. (iii) Discrete: takes on only integer values.

2. The population would be the number of drinks consumed by every customer on Friday nights in January. A sample would consist of only some of the customers.

3. a) The range is 0 to 1; a zero indicates the event never occurs and a one indicates the event always occurs.
b) 0.50 or 50% for heads and for tails.
c) Assuming each observation is equally likely, the following probabilities are obtained: for 2.0 the probability is 0.2; for 1.8 it is 0.15; for 1.9, 2.1, and 2.3 it is 0.1; for the remaining requirements the probability is 0.05 or 1/20.

4. a) [Figure: bar chart "PDF Water Requirements", with water requirement on the x-axis and probability on the y-axis.]
b) [Figure: "CDF Water Requirements", with water requirement on the x-axis and probability on the y-axis.]
c) [Figure: "Continuous pdf", an example of a bell-shaped curve, and "General CDF", with probability on the y-axis rising to 1.]

5. a) mean = 1.91, median = 2, and mode = 2. In general, the median, mode, and mean do not have to be equal.
b) Increasing the mean shifts the pdf to the right, as shown by the dotted pdf.
[Figure: "Continuous pdf", with the shifted pdf drawn as a dotted curve.]
c) If mean = mode = median, the distribution is symmetrical.

6. a) max = 2.5, min = 1, range = 1.5, variance = 0.151, st. dev. = 0.389, CV = 20.377
b) Decreasing the variance makes the pdf more peaked, as in the dotted pdf.
[Figure: "Continuous pdf", with the more peaked pdf drawn as a dotted curve.]

7. No problems.

8. a) H0: x > 5 and HA: x ≤ 5; H0: x = 0 and HA: x ≠ 0; H0: x ≤ a and HA: x > a.
b) one-tailed test; two-tailed test; and one-tailed test

9. a) H0: x < 2 and HA: x ≥ 2. With 19 degrees of freedom, α = .05 gives a critical value of 1.729.
$t = \frac{\bar{x} - x_0}{\hat{\sigma} / \sqrt{n}} = \frac{1.91 - 2}{0.389 / \sqrt{20}} = -1.034$
The calculated value is less than 1.729, so fail to reject the null hypothesis (one-tailed test).
H0: x = 0 and HA: x ≠ 0. The calculated value is 21.95, which falls outside the range of critical values from the table, -1.729 to 1.729 (two-tailed test with 0.05 in each tail), so reject the null hypothesis.
b) A smaller significance level leaves less probability in the tail, so the critical value lies farther from zero.
c) With greater degrees of freedom, we are more confident in the estimates.
d) Decreasing the level of significance (less area in the tail) will decrease the chance of a type I error, but will increase the chance of a type II error.
e) Increasing the level of significance (more area in the tail) will decrease the chance of a type II error, but will increase the chance of a type I error.
f) Table of Decisions in Hypothesis Decision Making
Decision Regarding       Status of Null Hypothesis
Statistical Test         True                False
Fail to reject null      correct decision    Type II error
Reject null              Type I error        correct decision

10. More on the F-distribution in class.
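
The numerical answers to 3c, 4b, 5a, 6a, and 9a can be checked with a short script. The following is a minimal Python sketch, not part of the original handout. It uses only the standard library, the variable and function names are illustrative, and it assumes that "0.05 in each tail" corresponds to the .05 column of the t-table (1.729 at 19 degrees of freedom).

```python
import math
import statistics
from collections import Counter

# Swine water requirements (gal/day/head) from question 3c.
water = [1.0, 1.2, 1.4, 1.5, 2.1, 2.2, 1.9, 1.8, 2.1, 2.3,
         2.4, 2.5, 1.9, 1.8, 2.0, 2.0, 2.3, 2.0, 1.8, 2.0]
n = len(water)

# 3c: with equally likely observations, each value's probability
# is its count divided by n.
pdf = {x: count / n for x, count in sorted(Counter(water).items())}

# 4b: the discrete CDF is the running sum of the pdf.
cdf = {}
running = 0.0
for x, p in pdf.items():
    running += p
    cdf[x] = running

# 5a: measures of central tendency.
mean = statistics.mean(water)
median = statistics.median(water)
mode = statistics.mode(water)

# 6a: measures of dispersion (statistics.variance divides by n - 1).
data_range = max(water) - min(water)
variance = statistics.variance(water)
std_dev = math.sqrt(variance)
cv = 100 * std_dev / mean


def t_statistic(sample_mean, null_value, sigma_hat, n_obs):
    """General t-test: (x_bar - x0) / (sigma_hat / sqrt(n))."""
    return (sample_mean - null_value) / (sigma_hat / math.sqrt(n_obs))


# 9a: critical value read from the handout's table
# (19 degrees of freedom, 0.05 in one tail); this is an assumption
# about which column applies, not a computed quantile.
t_crit = 1.729
t_vs_2 = t_statistic(mean, 2.0, std_dev, n)
t_vs_0 = t_statistic(mean, 0.0, std_dev, n)

# One-tailed test with the rejection region in the right tail.
reject_one_tailed = t_vs_2 > t_crit
# Two-tailed test with 0.05 in each tail.
reject_two_tailed = abs(t_vs_0) > t_crit

print("pdf:", pdf)
print("cdf:", {x: round(p, 2) for x, p in cdf.items()})
print(f"mean={mean:.2f} median={median} mode={mode}")
print(f"range={data_range:.1f} variance={variance:.3f} "
      f"st.dev.={std_dev:.3f} CV={cv:.3f}")
print(f"t (H0 value 2) = {t_vs_2:.3f}, reject: {reject_one_tailed}")
print(f"t (H0 value 0) = {t_vs_0:.3f}, reject: {reject_two_tailed}")
```

Reading the critical value from the table rather than computing it keeps the sketch within the standard library, which has no t-distribution quantile function.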