Topic 3: Estimation and Hypothesis Testing • Goals: – Review statistical estimation – Review confidence intervals for a single population mean & population proportion – Review confidence intervals for the difference between two population means – Understand formulas for sample size determination – Introduce hypothesis testing 1 Estimation • A major goal of data analysis is to make statistical inferences, using sample mean to estimate the corresponding parameter in the population: – Parameter is a numerical descriptive measure of a population, e.g. µ for population mean; or σ for population standard deviation (Greek letters are used). – Sample statistic is a numerical descriptive measure of a sample, e.g. x for sample mean; or s for sample standard deviation. • Political polls were used to estimate the final election result – used proportions, p for P or P 2 Random Sampling • To be valid, estimation must be based on a representative sample, obtained by random sampling. – Random => every unit has a known probability of being selected in sample – Various techniques Simple: Use random numbers from computer packages Stratified: population divided into non-overlapping subpopulations Systematic: every Kth item is selected Cluster: population divided into non-overlapping areas/clusters then random sample in each area 3 Non-random sampling • Main types: – Convenience, • selected for convenience of researcher – Judgement • Use judgement of researcher – Quota • Aim for certain quota from some subgroups – Snowball • Based on referral from other samples • Sampling can lead to errors i.e. not correctly representing the population • Also can get non-sampling errors – Recording errors, missing data, response errors, … 4 Sampling distribution of the mean • Assume we want to obtain a point estimate of the population mean from a sample statistic, e.g. mean or median – Note: sample statistics are random variables hence have probability distributions • Sampling Dist of the mean is obtained by taking repeated random samples from the population and finding the (different) mean for each sample • This sampling dist will have its own mean, variance or standard error: – Expressed as x or x 5 Confidence Intervals (CI) • Aim to develop CI estimates for the mean and the proportion This is the 2nd type of statistical inference • CI => the estimate covers a range of values (an interval) rather than just a point estimate The interval will have a specified confidence or probability of correctly estimating the true value of the population parameter 6 CI of Population Mean • Depends on whether population variance is known or unknown • If population has normal distribution or sample size large enough – Then the 95% CI of population mean , with known standard deviation (the square-root of variance), is: X Z 2 n 7 • 95% confidence => want area under normal distribution to cover 95% (that is, =0.05) Hence 2.5% at either end is unlikely (/2 =0.025) Use area under normal probability table to find appropriate z value Z-value is the standard normal value, (corresponding to a cumulative area of 1- /2) = 1.96 for 95% confidence level. See figures 8.1 and 8.2, Black z ~ N 0,1 8 Interpretation of CI 95% Confidence Interval If all possible samples of the same size n are taken, and sample means found, 95% of the intervals will include the true population mean (somewhere within the interval around their mean) 5% of the intervals will not include the population mean 9 Example • New Zealand (NZ) companies trading in China were asked for the number of years trading in China: Sample = 44, mean = 10.455 years Assume population standard deviation = = 7.7 years • If only sample standard deviation is known use this formula, as n is >30 Want 90% CI for number of years for NZ companies trading in China 10 Answer • Use: X Z n 2 • Hence 7.7 10.455 1.645 44 10.455 1.91 8.545 12.365 • So we are 90% confident that if all NZ companies were surveyed that the mean years they would have traded in China would between 8.5 to 12.4 years 11 CI of mean with unknown If n < 30 and is unknown, then use Student’s t distribution instead of normal distribution. t distribution is symmetrical about its zero mean, but flatter than standard normal => more area in its tails, and varies with sample size See fig 8.4 Need to know degrees of freedom (df) = n-1 12 95% CI for population mean with unknown standard deviation is found by: X t ( / 2 , n 1) s n s s P X t( 0.025,n 1) X t( 0.025,n1) 0.95 n n Where t refers to the t values such that 2.5% of total area under the curve falls within each tail for correct degrees of freedom 13 Degrees of Freedom • Degrees of freedom represents the number of values that can be ‘freely chosen’ when calculating a statistic • If n= 5 then df = 4 because once we have the first four values, the 5th has to be a certain value – E.g. we want 5 values to have a mean of 20 • 1st 4 values are 18, 24, 19, 16 • then the 5th value must be 23 (=100 -77) 14 Example • Sample of 20 sales invoices, mean $110.27, sample standard deviation =$28.95 • Then 95% Confidence Interval is found by: X t n 1 s n 28.95 110.27 2.093 20 110.27 13.56 or $96.71 $123.83 15 CONFIDENCE INTERVAL FOR DIFFERENCE BETWEEN TWO POPULATION MEANS • Common problem is related to comparing the outcome of two different populations Eg. Who earns the most for 2 sets of graduates What type of car travels further on a tank of petrol 16 If large and independent samples from large or normally distributed population, • Independent vs. paired samples, what is the difference? then CI of differences between means is: 2 2 A A B X A X B z B n A nB 17 Example • What is the difference in savings on groceries using coupons between 2 groups of shoppers based on income levels? • Find 98% Confidence Interval for the difference in mean savings Middle income shoppers Low income shoppers n1 = 60 n2 = 80 x1 $5.84 x2 $2.67 1 $1.41 2 $0.54 18 Answer 1.412 0.54 2 1 2 (5.84 2.67) 2.33 60 80 3.17 0.45 2.72 1 2 3.62 • Hence there is a 98% level of confidence that the actual difference in the population mean coupon savings/week between 2 income groups is between $2.72 and $3.62 19 Formula for when population variance unknown • The formula depends on knowing if population variances are equal or not • If assume equal variance, then: ( a b ) ( X a X b ) t / 2,v 1 1 s ( ) na nb 2 p • Where: df v na nb 2 2 2 ( n 1 ) s ( n 1 ) s a b b s 2p a na nb 2 20 Confidence Interval for Proportions • Confidence Interval for p is given below, with ps = sample proportion When n is large (30 or more) use Z value When n is small (less than 30) use t value P ps z ps 1 ps n 21 Example • Sample of 100 sales invoices, 10 have errors, what is 95% Confidence Interval • Ps = 10/100 =0.1 • 95% CI for P is = 0.1 1.96 0.10.9 100 0.1 0.0588 0.0412 p 0.1588 22 Interpretation • Hence the 95% confidence interval based on this sample is between 4% (4.12%)and 16% (15.88%) of sales invoices will have errors 23 Best Sample Size: variance is known • Sampling can be expensive, so we want sample size as small as possible, subject to: – amount of sampling error that is acceptable, e – Level of confidence desired (1 - ) • For Population mean, z n e 2 24 Example • Want e = $500 (error in actual incomes ) • 95% confidence, => z = 1.96 • Know = $4000 1.964000 • Then need sample size of: n 500 2 n 245.86 n 246 25 Sample size: What if variance is unknown? • If population variance = 2 is unknown use either: • Sample variance, s2 • Or = (Range of values ÷ 4), as an approximation of . PROPORTIONS • Formula for estimating sample size for proportions is: z 2 ps (1 ps ) n e2 26 Hypothesis Testing • 3rd form of statistical inference • Allows us to make inferences about a population parameter by analyzing differences between – the results observed (sample Statistic) and the results expected, based on underlying hypothesis 27 Actual Hypothesis • Need to state the hypothesis that is going to be tested: = null hypothesis = status quo, or old theory H0: = $100 (use population parameters) • Then state the alternative hypo, or the one that is to be proved as new theory H1: $100 =>Two-tailed tests • One tailed tests are also allowed 28 One tailed tests • These allow for the alternative hypo to be one-sided –Rejection can be either on left or right hand side –Possible hypotheses: • H0: ≥ 100 or H0: ≤ 100 (or H0: = 100) • H1: < 100 or H1: > 100 • Note this changes the critical value compared to same level of confidence for a 2-tailed test 29 Rejection rules • It is assumed that Ho is true, aim to test if this is the case or not • Aim to either not reject or reject Ho with certain level of confidence (95%, 90%,) – Can be expressed as (1- ) = 0.95 (or 0.99) • This helps to define the non-rejection or rejection region • identified by critical values (Z or t values) dependent on level of confidence – This depends if 1 or 2 tailed test 30 Critical values vs. Test statistic • Need to determine a test statistic to compare with critical values Test statistic for mean ( known): z x / n Test stat for mean ( unknown): t x s/ n • If test statistic falls within non-rejection region (within boundary of critical values) then Accept H0 • Accept H1 if test statistic falls in rejection region 31 Example • Hubbard’s wants to know with 95% level of confidence, that its cereal boxes contain more than 500gm – Past experience suggests that the weight in cereal boxes is normally distributed. – Firm takes a random sample of 25 boxes and finds, X 520 g s 75 gm 32 Steps required to conduct Hypothesis Test • 1. State hypos – H0: 500 – H1: > 500 (1 tailed test) • Note if only concerned if boxes were 500gm then H1 would be 500 • 2. Select test statistic – Population is normal, but n <30 and unknown => – Use student t-distribution 33 • 3. Calculate critical value based on level of significance = 5% level of significance Critical value from t-tables with df = 24 is 1.711 • 4. Calculate sample statistic X t s/ n 520 500 1.33 75 / 25 34 • 5. Decision: compare test statistic with sample statistic as 1.33 < 1.711 falls in non-rejection (acceptance) region, do not reject H0 And conclude that at the 5% level of significance (or with 95% confidence) the mean fill of cereal boxes is at least 500gm of cereal. ________________________________________ If sample statistic = 1.8 then Reject H0 and conclude that at this level of significance there is significant evidence that boxes contain more than 500gm. 35 Alternative approach • p-value (reported by computer statistical packages) = observed level of significance = smallest level at which Ho can be rejected for a given set of data • Decision rules: –If the p-value is , the null hypo is not rejected –If the p-value is < , the null hypo is rejected 36 Types of errors • There is a risk of the incorrect conclusion being made due to sample chosen • Type 1 error – =>rejecting Ho when it was in fact true – Probability of Type 1 error = • e.g. for 95% confidence level = 0.05 • Type 2 error – => accepting a false hypo, with Prob = – Size of will depend on hypo value of parameter 37 Types of errors 38 • Note: is level of significance (0.05) • Level of confidence is 1- = 0.95 or 95% • Trade off between errors: –For any sample size anything that reduces raises –A larger sample size could reduce both. 39