Estimation & Hypothesis Testing Presentation

Topic 3: Estimation and Hypothesis Testing
• Goals:
  – Review statistical estimation
  – Review confidence intervals for a single population mean & population proportion
  – Review confidence intervals for the difference between two population means
  – Understand formulas for sample size determination
  – Introduce hypothesis testing

Estimation
• A major goal of data analysis is to make statistical inferences, using a sample statistic to estimate the corresponding parameter in the population:
  – Parameter is a numerical descriptive measure of a population, e.g. µ for population mean; or σ for population standard deviation (Greek letters are used).
  – Sample statistic is a numerical descriptive measure of a sample, e.g. \(\bar{x}\) for sample mean; or s for sample standard deviation.
• Political polls are used to estimate the final election result
  – they use proportions: the sample proportion p estimates the population proportion P (or π)

Random Sampling
• To be valid, estimation must be based on a representative sample, obtained by random sampling.
  – Random => every unit has a known probability of being selected in the sample
  – Various techniques:
    • Simple: use random numbers from computer packages
    • Stratified: population divided into non-overlapping subpopulations
    • Systematic: every kth item is selected
    • Cluster: population divided into non-overlapping areas/clusters, then random sample in each area

Non-random sampling
• Main types:
  – Convenience: selected for convenience of researcher
  – Judgement: use judgement of researcher
  – Quota: aim for certain quota from some subgroups
  – Snowball: based on referral from units already sampled
• Sampling can lead to errors, i.e. the sample does not correctly represent the population
• Also can get non-sampling errors
  – Recording errors, missing data, response errors, …

Sampling distribution of the mean
• Assume we want to obtain a point estimate of the population mean from a sample statistic, e.g. mean or median
  – Note: sample statistics are random variables, hence have probability distributions
• Sampling distribution of the mean is obtained by taking repeated random samples from the population and finding the (different) mean for each sample
• This sampling distribution will have its own mean and its own standard deviation (the standard error):
  – Expressed as \(\mu_{\bar{X}}\) and \(\sigma_{\bar{X}}\)

Confidence Intervals (CI)
• Aim to develop CI estimates for the mean and the proportion
  – This is the 2nd type of statistical inference
• CI => the estimate covers a range of values (an interval) rather than just a point estimate
  – The interval will have a specified confidence or probability of correctly estimating the true value of the population parameter

CI of Population Mean
• Depends on whether population variance is known or unknown
• If population has normal distribution or sample size is large enough
  – Then the CI of population mean µ, with known standard deviation σ (the square root of variance), is:

\[ \bar{X} \pm Z_{\alpha/2} \frac{\sigma}{\sqrt{n}} \]

• 95% confidence => want area under normal distribution to cover 95% (that is, α = 0.05)
  – Hence 2.5% at either end is unlikely (α/2 = 0.025)
  – Use area under normal probability table to find appropriate z value
  – Z-value is the standard normal value (corresponding to a cumulative area of 1 − α/2) = 1.96 for 95% confidence level.
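The z-based interval above can be checked numerically. The following is a minimal sketch using only Python's standard library; the function name `z_confidence_interval` and the input figures (mean 50, σ = 10, n = 100) are illustrative assumptions, not values taken from these slides.

```python
from math import sqrt
from statistics import NormalDist


def z_confidence_interval(sample_mean, sigma, n, confidence=0.95):
    """Return (lower, upper) for the population mean when sigma is known."""
    alpha = 1 - confidence
    # z value with cumulative area 1 - alpha/2 under the standard normal curve
    z = NormalDist().inv_cdf(1 - alpha / 2)
    margin = z * sigma / sqrt(n)
    return sample_mean - margin, sample_mean + margin


# Hypothetical figures, chosen only to illustrate the calculation
z_95 = NormalDist().inv_cdf(0.975)
lower, upper = z_confidence_interval(50.0, 10.0, 100)
print(f"z = {z_95:.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")
```

With these inputs the margin of error is 1.96 × 10/√100 = 1.96, so the interval runs from roughly 48.04 to 51.96. Passing a different `confidence` (for example 0.90) changes only the z value.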
See figures 8.1 and 8.2, Black: \(z \sim N(0, 1)\)

Interpretation of CI
• 95% Confidence Interval: if all possible samples of the same size n are taken, and an interval is built around each sample mean, 95% of the intervals will include the true population mean (somewhere within the interval around their mean)
• 5% of the intervals will not include the population mean

Example
• New Zealand (NZ) companies trading in China were asked for the number of years trading in China:
  – Sample = 44, mean = 10.455 years
  – Assume population standard deviation = σ = 7.7 years
• If only the sample standard deviation were known, the same formula could still be used (with s in place of σ), as n > 30
• Want 90% CI for number of years for NZ companies trading in China

Answer
• Use:

\[ \bar{X} \pm Z_{\alpha/2} \frac{\sigma}{\sqrt{n}} \]

• Hence

\[ 10.455 \pm 1.645 \times \frac{7.7}{\sqrt{44}} = 10.455 \pm 1.91 \]
\[ 8.545 \le \mu \le 12.365 \]

• So we are 90% confident that, if all NZ companies were surveyed, the mean number of years they have traded in China would be between 8.5 and 12.4 years

CI of mean with unknown σ
• If n < 30 and σ is unknown, then use Student's t distribution instead of normal distribution.
• t distribution is symmetrical about its zero mean, but flatter than standard normal => more area in its tails, and varies with sample size
  – See fig 8.4
• Need to know degrees of freedom (df) = n − 1

• 95% CI for population mean with unknown standard deviation is found by:

\[ \bar{X} \pm t_{(\alpha/2,\, n-1)} \frac{s}{\sqrt{n}} \]
\[ P\left( \bar{X} - t_{(0.025,\, n-1)} \frac{s}{\sqrt{n}} \le \mu \le \bar{X} + t_{(0.025,\, n-1)} \frac{s}{\sqrt{n}} \right) = 0.95 \]

• Where t refers to the t values such that 2.5% of total area under the curve falls within each tail for the correct degrees of freedom

Degrees of Freedom
• Degrees of freedom represents the number of values that can be 'freely chosen' when calculating a statistic
• If n = 5 then df = 4 because once we have the first four values, the 5th has to be a certain value
  – E.g. we want 5 values to have a mean of 20 (so a total of 100)
    • 1st 4 values are 18, 24, 19, 16 (total 77)
    • then the 5th value must be 23 (= 100 − 77)

Example
• Sample of 20 sales invoices, mean $110.27, sample standard deviation = $28.95
• Then 95% Confidence Interval (df = 19, t = 2.093) is found by:

\[ \bar{X} \pm t_{n-1} \frac{s}{\sqrt{n}} = 110.27 \pm 2.093 \times \frac{28.95}{\sqrt{20}} = 110.27 \pm 13.55 \]

  or $96.72 ≤ µ ≤ $123.82

CONFIDENCE INTERVAL FOR DIFFERENCE BETWEEN TWO POPULATION MEANS
• Common problem is comparing the outcomes of two different populations, e.g.
  – Which of 2 sets of graduates earns the most?
  – What type of car travels further on a tank of petrol?

• If large and independent samples are taken from large or normally distributed populations, then CI of difference between means is:

\[ \mu_A - \mu_B = (\bar{X}_A - \bar{X}_B) \pm z_{\alpha/2} \sqrt{\frac{\sigma_A^2}{n_A} + \frac{\sigma_B^2}{n_B}} \]

• Independent vs. paired samples, what is the difference?

Example
• What is the difference in savings on groceries using coupons between 2 groups of shoppers based on income levels?
• Find 98% Confidence Interval for the difference in mean savings

                        Middle income shoppers    Low income shoppers
  Sample size           n1 = 60                   n2 = 80
  Sample mean           x̄1 = $5.84                x̄2 = $2.67
  Population std dev    σ1 = $1.41                σ2 = $0.54

Answer

\[ \mu_1 - \mu_2 = (5.84 - 2.67) \pm 2.33 \sqrt{\frac{1.41^2}{60} + \frac{0.54^2}{80}} = 3.17 \pm 0.45 \]
\[ 2.72 \le \mu_1 - \mu_2 \le 3.62 \]

• Hence there is a 98% level of confidence that the actual difference in the population mean coupon savings/week between the 2 income groups is between $2.72 and $3.62

Formula for when population variance unknown
• The formula depends on knowing if population variances are equal or not
• If we assume equal variance, then:

\[ \mu_a - \mu_b = (\bar{X}_a - \bar{X}_b) \pm t_{\alpha/2,\, v} \sqrt{s_p^2 \left( \frac{1}{n_a} + \frac{1}{n_b} \right)} \]

• Where:

\[ df = v = n_a + n_b - 2 \]
\[ s_p^2 = \frac{(n_a - 1) s_a^2 + (n_b - 1) s_b^2}{n_a + n_b - 2} \]

Confidence Interval for Proportions
• Confidence Interval for P is given below, with \(p_s\) = sample proportion
  – When n is large (30 or more) use Z value
  – When n is small (less than 30) use t value

\[ P = p_s \pm z \sqrt{\frac{p_s (1 - p_s)}{n}} \]

Example
• Sample of 100 sales invoices, 10 have errors; what is the 95% Confidence Interval?
• \(p_s\) = 10/100 = 0.1
• 95% CI for P is:

\[ 0.1 \pm 1.96 \sqrt{\frac{0.1 \times 0.9}{100}} = 0.1 \pm 0.0588 \]
\[ 0.0412 \le P \le 0.1588 \]

Interpretation
• Hence, based on this sample, we are 95% confident that between 4% (4.12%) and 16% (15.88%) of sales invoices will have errors

Best Sample Size: variance is known
• Sampling can be expensive, so we want sample size as small as possible, subject to:
  – amount of sampling error that is acceptable, e
  – level of confidence desired (1 − α)
• For population mean:

\[ n = \left( \frac{z \sigma}{e} \right)^2 \]

Example
• Want e = $500 (error in actual incomes)
• 95% confidence => z = 1.96
• Know σ = $4000
• Then need sample size of:

\[ n = \left( \frac{1.96 \times 4000}{500} \right)^2 = 245.86 \]

  so n = 246 (always round up)

Sample size: What if variance is unknown?
• If population variance σ² is unknown use either:
  – Sample variance, s²
  – Or (Range of values ÷ 4), as an approximation of σ.

PROPORTIONS
• Formula for estimating sample size for proportions is:

\[ n = \frac{z^2 \, p_s (1 - p_s)}{e^2} \]

Hypothesis Testing
• 3rd form of statistical inference
• Allows us to make inferences about a population parameter by analyzing differences between
  – the results observed (sample statistic) and the results expected, based on underlying hypothesis

Actual Hypothesis
• Need to state the hypothesis that is going to be tested = null hypothesis = status quo, or old theory
  – H0: µ = $100 (use population parameters)
• Then state the alternative hypothesis, or the one that is to be proved as new theory
  – H1: µ ≠ $100 => two-tailed test
• One-tailed tests are also allowed

One tailed tests
• These allow for the alternative hypothesis to be one-sided
  – Rejection can be either on left or right hand side
  – Possible hypotheses:
    • H0: µ ≥ 100 or H0: µ ≤ 100 (or H0: µ = 100)
    • H1: µ < 100 or H1: µ > 100
• Note this changes the critical value compared to the same level of confidence for a 2-tailed test

Rejection rules
• It is assumed that H0 is true; aim to test if this is the case or not
• Aim to either not reject or reject H0 with a certain level of confidence (95%, 90%, …)
  – Can be expressed as (1 − α) = 0.95 (or 0.99)
• This helps to define the non-rejection or rejection region
• identified by critical values (Z or t values) dependent on level of confidence
  – This depends on whether it is a 1- or 2-tailed test

Critical values vs. Test statistic
• Need to determine a test statistic to compare with critical values
  – Test statistic for mean (σ known):

\[ z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}} \]

  – Test statistic for mean (σ unknown):

\[ t = \frac{\bar{x} - \mu}{s / \sqrt{n}} \]

• If test statistic falls within non-rejection region (within boundary of critical values) then do not reject H0
• Reject H0 in favour of H1 if test statistic falls in rejection region

Example
• Hubbard's wants to know, with 95% level of confidence, that its cereal boxes contain more than 500 g
  – Past experience suggests that the weight in cereal boxes is normally distributed.
  – Firm takes a random sample of 25 boxes and finds \(\bar{X}\) = 520 g, s = 75 g

Steps required to conduct Hypothesis Test
• 1. State hypotheses
  – H0: µ ≤ 500
  – H1: µ > 500 (1-tailed test)
  • Note: if only concerned whether boxes were 500 g, then H1 would be µ ≠ 500
• 2. Select test statistic
  – Population is normal, but n < 30 and σ unknown =>
  – Use Student's t-distribution
• 3. Calculate critical value based on level of significance = 5%
  – Critical value from t-tables with df = 24 is 1.711
• 4. Calculate test statistic

\[ t = \frac{\bar{X} - \mu}{s / \sqrt{n}} = \frac{520 - 500}{75 / \sqrt{25}} = 1.33 \]

• 5. Decision: compare test statistic with critical value
  – As 1.33 < 1.711 falls in the non-rejection (acceptance) region, do not reject H0
  – Conclude that at the 5% level of significance (or with 95% confidence) there is insufficient evidence that the mean fill of cereal boxes is more than 500 g.
  – If the test statistic had been 1.8 (> 1.711), then reject H0 and conclude that at this level of significance there is significant evidence that boxes contain more than 500 g.
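The five steps of the cereal-box test can be traced in a few lines of code. This is a sketch only: the function name `one_sample_t_statistic` is an assumption made for illustration, and the critical value 1.711 is hard-coded from the t-table figure quoted in the example rather than computed, because Python's standard library has no t distribution.

```python
from math import sqrt


def one_sample_t_statistic(sample_mean, hypothesised_mean, s, n):
    """t = (x_bar - mu_0) / (s / sqrt(n)), with n - 1 degrees of freedom."""
    return (sample_mean - hypothesised_mean) / (s / sqrt(n))


# Figures from the cereal example: n = 25, x_bar = 520 g, s = 75 g
t_stat = one_sample_t_statistic(520, 500, 75, 25)

# Upper-tail critical value for alpha = 0.05, df = 24, read from t-tables
t_critical = 1.711

# Right-tailed test: reject H0 only if the statistic exceeds the critical value
reject_h0 = t_stat > t_critical
print(f"t = {t_stat:.2f}, critical = {t_critical}, reject H0: {reject_h0}")
```

Here the statistic is 20/15 ≈ 1.33, which is below 1.711, so H0 is not rejected. For a left-tailed or two-tailed test the comparison in `reject_h0` would need to change accordingly.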
Alternative approach
• p-value (reported by computer statistical packages) = observed level of significance = smallest level at which H0 can be rejected for a given set of data
• Decision rules:
  – If the p-value is ≥ α, the null hypothesis is not rejected
  – If the p-value is < α, the null hypothesis is rejected

Types of errors
• There is a risk of the incorrect conclusion being made due to the sample chosen
• Type 1 error
  – => rejecting H0 when it was in fact true
  – Probability of Type 1 error = α
    • e.g. for 95% confidence level α = 0.05
• Type 2 error
  – => failing to reject a false H0, with probability = β
  – Size of β will depend on how far the true value of the parameter is from the hypothesised value

• Note: α is level of significance (0.05)
• Level of confidence is 1 − α = 0.95 or 95%
• Trade off between errors:
  – For a given sample size, anything that reduces α raises β
  – A larger sample size could reduce both.
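The trade-off between α and β can be illustrated numerically. The sketch below makes several assumptions that are not in the slides: it uses a right-tailed z test with σ treated as known (σ = 75), a hypothesised mean of 500, and a hypothetical true mean of 520; the function name `type_2_error` is also invented for illustration.

```python
from math import sqrt
from statistics import NormalDist


def type_2_error(mu_0, mu_true, sigma, n, alpha):
    """Beta for a right-tailed z test of H0: mu <= mu_0 when the true mean is mu_true."""
    std_normal = NormalDist()
    z_critical = std_normal.inv_cdf(1 - alpha)
    # Distance between the true and hypothesised means, in standard-error units
    shift = (mu_true - mu_0) / (sigma / sqrt(n))
    # Probability that the test statistic stays below the critical value
    return std_normal.cdf(z_critical - shift)


# Hypothetical scenario: sigma assumed known, true mean assumed to be 520
beta_a = type_2_error(500, 520, 75, n=25, alpha=0.05)
beta_b = type_2_error(500, 520, 75, n=25, alpha=0.01)   # smaller alpha, same n
beta_c = type_2_error(500, 520, 75, n=100, alpha=0.01)  # smaller alpha, larger n

print(f"n=25,  alpha=0.05: beta = {beta_a:.3f}")
print(f"n=25,  alpha=0.01: beta = {beta_b:.3f}")
print(f"n=100, alpha=0.01: beta = {beta_c:.3f}")
```

In this scenario, lowering α from 0.05 to 0.01 with n fixed at 25 raises β, while increasing n to 100 brings β below its original level even at α = 0.01.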
