Variance and Hypothesis Tests
Econ. 201 – Econ. Data Analysis

Difference in Means - Design
• Try to isolate the experimental and control groups from each other while the experiment is conducted
• Amounts to ensuring either that both groups are random samples drawn from the “universe” of such populations, or that each group has equivalent values of any variable that might also (in addition to the experiment) alter its outcome

Example
• “Men have higher cholesterol than women”
• Want to isolate gender as a cause for cholesterol levels to vary. Yet we know many other things affect cholesterol – two are diet and genetics
• Could isolate gender in two ways
  – 1: Draw a random sample from the population
    • Ask, is cm > cw?
  – 2: Draw samples of men and women from populations with similar diets and genetics, say 20-year-old college students
    • Pose the same question as in #1

Example, contd.
• The data in the example take the latter approach
• Then pose the question formally as:
  – H0: cm = cw (the null hypothesis)
    • that which you hope to disprove
  – H1: cm ≠ cw (the alternative hypothesis)

An alternative statement of the hypothesis
• Define the difference in mean cholesterol between the two groups and ask whether it is zero or not.
• Using mean levels of the data collected on each group, form the difference.
• d = cw - cm.
• H0: d = 0 (the null hypothesis)
• H1: d ≠ 0 (the alternative hypothesis)

Construction of mean and variance
• Look at raw data on p. 56.
• Construct the following from this and the sheet “Some Basic Statistical Formulae.”
  – The mean value of cholesterol for each group.
  – The variance and standard deviation of each group.

Standard Error of the Mean
• Another measure of a distribution is the Standard Error of the Mean.
  – Formula is on the sheet: look at it.
• An estimate of the variability or dispersion of the sample mean.
  – Assuming it were itself constructed from repeated samples of size n from a population.
  – Is a measure of our uncertainty over the true, or population, mean, given that we are “estimating” it.

The Central Limit Theorem
• If the underlying experimental design that generated the data is a random one, then the means of various such experiments will be drawn from an approximately normal distribution, whose mean we estimate by (∑x)/n and whose standard deviation we estimate by s/√n.
• Then areas under the standard normal curve (p. 57) correspond to various ranges around the mean. A general rule of thumb says that we have a 95(.4)% confidence level that the true (population) mean lies within +/- 2(s/√n) of the sample mean

Intuition
• Calculate the interval around each group’s mean with the standard errors of the means (see page 57).
• The further apart the means are and the smaller the dispersions around these means (stnd. errors), the more likely we are to determine that the mean levels of the two groups are different.

Alternative formulation, d
• Look at formation and resulting distribution of d on p. 57.
• d = 173.5 – 163.3 = 10.2
• Now form the standard error of this mean difference
• Defined as the square root of the sum of the squared standard errors (the variances) of the individual means – see p. 57 for formulas; here = 6.02

Formation of 95% confidence interval around mean d
• d +/- 2(standard error of the difference), here 10.2 +/- 2(6.02).
• Can be 95% confident that true mean d lies in the range from -1.84 to 22.24.
• Cannot be 95% sure d is not 0.
• This interval includes zero, so at the 95% confidence level, given the data, we cannot reject the null hypothesis, H0, in favor of the alternative, H1.

Cholesterol Example, contd.
• Look at raw data by frequency (p. 57)
• Understand that the two, equivalent, ways of framing the hypothesis concern either:
• 1. The degree of overlap between the confidence interval we construct around the mean of men’s cholesterol observations and the one we construct around the mean of women’s cholesterol observations, or,
• 2. Whether the confidence interval we construct around the mean of d contains 0

Construction of the true confidence level
• We know we can meet the requirement of d +/- 1(stnd. error of the difference)
• Would give us a 68% level of confidence because about 68% of the area under the standard normal curve lies in this range
• Here = 10.2 +/- 6.02: from 4.18 to 16.22
• But we cannot meet the criterion for a 95% level of confidence – the true level is somewhere between 68% and 95%
• So there is weak support for the contention that cholesterol varies by gender

Or could consult a t-statistic
• t = mean/(its standard error)
• “Critical values” of t depend on the size of the sample and give the significance level at which a particular sample mean can be assumed to be different from zero
• here t = 10.2/6.02 = 1.69
• for a sample of 30, a t-statistic of 1.69 is significant at approximately the 90% level

Automate the calculation
• Use Excel
• Convenient for “big” data sets, with many observations
• Use it to calculate:
  – 1. avg. cholesterol,
  – 2. differences from avg.,
  – 3. differences squared,
  – 4. squared differences summed

Excel Computations, contd.
• Use a calculator and formula sheet for the rest
• Calculate the variance and the standard deviations of the two samples
• Calculate the stnd. error of each mean
• Then calculate the stnd. error of the difference in means
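
A sketch of the same calculation in Python
• The steps above (mean, squared differences, variance, stnd. error of each mean, stnd. error of the difference, interval, t) can be sketched in a few lines of Python as an alternative to Excel plus a calculator.
• This is a minimal illustration, not the course's method. All function names are hypothetical, and the variance uses the n - 1 denominator, which is an assumption because the formula sheet is not reproduced here.
• The only numbers used are the ones quoted in the slides (173.5, 163.3 and 6.02); the raw data on p. 56 are not included.

```python
import math


def sample_mean(xs):
    """Step 1: average of the observations."""
    return sum(xs) / len(xs)


def sample_variance(xs):
    """Steps 2-4: differences from the average, squared and summed,
    then divided by n - 1 (assumed denominator; check the formula sheet)."""
    m = sample_mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)


def standard_error(xs):
    """Standard error of the mean: s / sqrt(n)."""
    return math.sqrt(sample_variance(xs) / len(xs))


def se_of_difference(se_a, se_b):
    """Square root of the sum of the two squared standard errors."""
    return math.sqrt(se_a ** 2 + se_b ** 2)


def interval(d, se, k=2.0):
    """d +/- k standard errors (k = 2 is the 95% rule of thumb)."""
    return (d - k * se, d + k * se)


def t_statistic(d, se):
    """Mean difference divided by its standard error."""
    return d / se


# Summary numbers quoted in the slides (not recomputed from raw data).
d = 173.5 - 163.3
se_d = 6.02

low, high = interval(d, se_d)        # rule-of-thumb 95% interval
low1, high1 = interval(d, se_d, 1)   # one-standard-error interval
t = t_statistic(d, se_d)

print(round(low, 2), round(high, 2))    # about -1.84 and 22.24
print(round(low1, 2), round(high1, 2))  # about 4.18 and 16.22
print(low <= 0.0 <= high)               # True: zero is inside the interval
print(round(t, 2))                      # about 1.69
```

• To use it on raw data, pass each group's observations as a list to standard_error, combine the two results with se_of_difference, and feed the difference in means and that standard error to interval and t_statistic.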