Uploaded by Mohammed Kashoob

Topic 3 Estimation Testing

advertisement
Topic 3: Estimation and Hypothesis
Testing
• Goals:
– Review statistical estimation
– Review confidence intervals for a single
population mean & population proportion
– Review confidence intervals for the
difference between two population means
– Understand formulas for sample size
determination
– Introduce hypothesis testing
1
Estimation
• A major goal of data analysis is to make statistical
inferences, using sample mean to estimate the
corresponding parameter in the population:
– Parameter is a numerical descriptive measure of a
population, e.g. µ for population mean; or σ for
population standard deviation (Greek letters are used).
– Sample statistic is a numerical descriptive measure of a
sample, e.g. x for sample mean; or s for sample
standard deviation.
• Political polls were used to estimate the final election
result
– used proportions, p for P or P
2
Random Sampling
• To be valid, estimation must be based on a
representative sample, obtained by random
sampling.
– Random => every unit has a known probability of
being selected in sample
– Various techniques
Simple: Use random numbers from computer
packages
Stratified: population divided into non-overlapping
subpopulations
Systematic: every Kth item is selected
Cluster: population divided into non-overlapping
areas/clusters then random sample in each area
3
Non-random sampling
• Main types:
– Convenience,
• selected for convenience of researcher
– Judgement
• Use judgement of researcher
– Quota
• Aim for certain quota from some subgroups
– Snowball
• Based on referral from other samples
• Sampling can lead to errors i.e. not correctly representing
the population
• Also can get non-sampling errors
– Recording errors, missing data, response errors, …
4
Sampling distribution of the mean
• Assume we want to obtain a point estimate of the
population mean from a sample statistic, e.g. mean
or median
– Note: sample statistics are random variables
hence have probability distributions
• Sampling Dist of the mean is obtained by taking
repeated random samples from the population and
finding the (different) mean for each sample
• This sampling dist will have its own mean, variance
or standard error:
– Expressed as  x or  x
5
Confidence Intervals (CI)
• Aim to develop CI estimates for the mean and
the proportion
This is the 2nd type of statistical inference
• CI => the estimate covers a range of values
(an interval) rather than just a point estimate
The interval will have a specified confidence or
probability of correctly estimating the true value
of the population parameter
6
CI of Population Mean
• Depends on whether population variance is known
or unknown
• If population has normal distribution or sample
size large enough
– Then the 95% CI of population mean  , with known
standard deviation (the square-root of variance), is:

  

X 
Z




 2  n 
7
• 95% confidence => want area under normal
distribution to cover 95% (that is,  =0.05)
Hence 2.5% at either end is unlikely (/2 =0.025)
Use area under normal probability table to find
appropriate z value
Z-value is the standard normal value,
(corresponding to a cumulative area of 1- /2) =
1.96 for 95% confidence level.
See figures 8.1 and 8.2, Black
z ~ N 0,1
8
Interpretation of CI
95% Confidence Interval
If all possible samples of the same size n
are taken, and sample means found, 95%
of the intervals will include the true
population mean (somewhere within the
interval around their mean)
5% of the intervals will not include
the population mean
9
Example
• New Zealand (NZ) companies trading in China
were asked for the number of years trading
in China:
Sample = 44, mean = 10.455 years
Assume population standard deviation =  = 7.7
years
• If only sample standard deviation is known use this
formula, as n is >30
Want 90% CI for number of years for NZ
companies trading in China
10
Answer
• Use:

  

  X 

 Z  
n 
 2 
• Hence
7.7
  10.455  1.645
44
  10.455  1.91
8.545    12.365
• So we are 90% confident that if all NZ
companies were surveyed that the mean
years they would have traded in China would
between 8.5 to 12.4 years
11
CI of mean with unknown 
If n < 30 and  is unknown, then use
Student’s t distribution instead of normal
distribution.
t distribution is symmetrical about its zero
mean, but flatter than standard normal
=> more area in its tails, and varies with
sample size
See fig 8.4
Need to know degrees of freedom (df) = n-1
12
95% CI for population mean with unknown
standard deviation is found by:
X  t ( / 2 , n 1)
s
n
s
s 

P X  t( 0.025,n 1)
   X  t( 0.025,n1)
  0.95
n
n

Where t refers to the t values such that 2.5% of total
area under the curve falls within each tail for correct
degrees of freedom
13
Degrees of Freedom
• Degrees of freedom represents the number
of values that can be ‘freely chosen’ when
calculating a statistic
• If n= 5 then df = 4 because once we have the
first four values, the 5th has to be a certain
value
– E.g. we want 5 values to have a mean of 20
• 1st 4 values are 18, 24, 19, 16
• then the 5th value must be 23 (=100 -77)
14
Example
• Sample of 20 sales invoices, mean $110.27,
sample standard deviation =$28.95
• Then 95% Confidence Interval is found by:
X  t n 1
s
n
28.95
 110.27  2.093
20
 110.27  13.56
or $96.71    $123.83
15
CONFIDENCE INTERVAL FOR DIFFERENCE
BETWEEN TWO POPULATION MEANS
• Common problem is related to
comparing the outcome of two different
populations
Eg. Who earns the most for 2 sets of
graduates
What type of car travels further on a tank
of petrol
16
If large and independent samples from
large or normally distributed population,
• Independent vs. paired samples, what is the
difference?
then CI of differences between means is:
 2 2 
A
 A   B  X A  X B    z
 B
 n A nB 
17
Example
• What is the difference
in savings on groceries
using coupons between
2 groups of shoppers
based on income
levels?
• Find 98% Confidence
Interval for the
difference in mean
savings
Middle
income
shoppers
Low
income
shoppers
n1 = 60
n2 = 80
x1  $5.84
x2  $2.67
 1  $1.41
 2  $0.54
18
Answer
1.412 0.54 2
1   2  (5.84  2.67)  2.33

60
80
 3.17  0.45
2.72  1   2  3.62
• Hence there is a 98% level of confidence that the actual
difference in the population mean coupon savings/week
between 2 income groups is between $2.72 and $3.62
19
Formula for when population variance
unknown
• The formula depends on knowing if population
variances are equal or not
• If assume equal variance, then:
(  a  b )  ( X a  X b )  t / 2,v
1
1
s (  )
na nb
2
p
• Where: df  v  na  nb  2
2
2
(
n

1
)
s

(
n

1
)
s
a
b
b
s 2p  a
na  nb  2
20
Confidence Interval for Proportions
• Confidence Interval for p is given below,
with ps = sample proportion
When n is large (30 or more) use Z value
When n is small (less than 30) use t value
P  ps  z
ps 1  ps 
n
21
Example
• Sample of 100 sales invoices, 10 have errors,
what is 95% Confidence Interval
• Ps = 10/100 =0.1
• 95% CI for P is =
0.1  1.96
0.10.9
100
 0.1  0.0588
 0.0412  p  0.1588
22
Interpretation
• Hence the 95% confidence interval
based on this sample is between 4%
(4.12%)and 16% (15.88%) of sales
invoices will have errors
23
Best Sample Size: variance is
known
• Sampling can be expensive, so we want sample
size as small as possible, subject to:
– amount of sampling error that is acceptable, e
– Level of confidence desired (1 - )
• For Population mean,
 z 
n

 e 
2
24
Example
• Want e = $500 (error in actual incomes )
• 95% confidence, => z = 1.96
• Know  = $4000
 1.964000  

• Then need sample size of: n  
500 

2
n  245.86
n  246
25
Sample size: What if variance is
unknown?
• If population variance = 2 is unknown use either:
• Sample variance, s2
• Or  = (Range of values ÷ 4), as an approximation of .
PROPORTIONS
• Formula for estimating sample size for proportions is:
z 2 ps (1  ps )
n
e2
26
Hypothesis Testing
• 3rd form of statistical inference
• Allows us to make inferences about
a population parameter by
analyzing differences between
– the results observed (sample
Statistic) and the results expected,
based on underlying hypothesis
27
Actual Hypothesis
• Need to state the hypothesis that is going to
be tested:
= null hypothesis
= status quo, or old theory
H0:  = $100 (use population parameters)
• Then state the alternative hypo, or the one
that is to be proved as new theory
H1:   $100
=>Two-tailed tests
• One tailed tests are also allowed
28
One tailed tests
• These allow for the alternative hypo to be
one-sided
–Rejection can be either on left or right hand
side
–Possible hypotheses:
• H0:  ≥ 100 or H0:  ≤ 100 (or H0:  = 100)
• H1:  < 100 or H1:  > 100
• Note this changes the critical value compared
to same level of confidence for a 2-tailed test
29
Rejection rules
• It is assumed that Ho is true, aim to test if
this is the case or not
• Aim to either not reject or reject Ho with
certain level of confidence (95%, 90%,)
– Can be expressed as (1- ) = 0.95 (or 0.99)
• This helps to define the non-rejection or
rejection region
• identified by critical values (Z or t values)
dependent on level of confidence
– This depends if 1 or 2 tailed test
30
Critical values vs. Test statistic
• Need to determine a test statistic to compare with critical
values
 Test statistic for mean ( known): z  x  
 / n
 Test stat for mean (  unknown):
t 
x  
s/
n
• If test statistic falls within non-rejection region (within
boundary of critical values) then Accept H0
• Accept H1 if test statistic falls in rejection region
31
Example
• Hubbard’s wants to know with 95% level of
confidence, that its cereal boxes contain more
than 500gm
– Past experience suggests that the weight in cereal
boxes is normally distributed.
– Firm takes a random sample of 25 boxes and finds,
X  520 g
s  75 gm
32
Steps required to conduct Hypothesis
Test
• 1. State hypos
– H0:   500
– H1:  > 500 (1 tailed test)
• Note if only concerned if boxes were 500gm then H1 would
be  500
• 2. Select test statistic
– Population is normal, but n <30 and  unknown =>
– Use student t-distribution
33
• 3. Calculate critical value based on level of
significance
= 5% level of significance
Critical value from t-tables with df = 24 is 1.711
• 4. Calculate sample statistic
X 
t 
s/ n
520  500

 1.33
75 / 25
34
• 5. Decision:
compare test statistic with sample statistic
 as 1.33 < 1.711 falls in non-rejection
(acceptance) region, do not reject H0
And conclude that at the 5% level of significance
(or with 95% confidence) the mean fill of cereal
boxes is at least 500gm of cereal.
________________________________________
If sample statistic = 1.8 then
Reject H0 and conclude that at this level of
significance there is significant evidence that
boxes contain more than 500gm.
35
Alternative approach
• p-value (reported by computer statistical
packages)
= observed level of significance
= smallest level at which Ho can be rejected
for a given set of data
• Decision rules:
–If the p-value is  , the null hypo is not
rejected
–If the p-value is < , the null hypo is
rejected
36
Types of errors
• There is a risk of the incorrect conclusion being made due
to sample chosen
• Type 1 error
– =>rejecting Ho when it was in fact true
– Probability of Type 1 error = 
• e.g. for 95% confidence level  = 0.05
• Type 2 error
– => accepting a false hypo, with Prob = 
– Size of  will depend on hypo value of parameter
37
Types of errors
38
• Note:  is level of significance (0.05)
• Level of confidence is 1-  = 0.95 or
95%
• Trade off between errors:
–For any sample size anything that
reduces  raises 
–A larger sample size could reduce
both.
39
Download