inferential stats

Data Analysis
Statistics
Inferential statistics
Hypothesis testing
Normal distribution: a
probability distribution
About 99.7% of scores
are within 3 SD
of the mean
Who cares…
The most useful distribution in inferential
statistics.
We can translate any normal variable, X, into
the standardized value, Z, to make
inferences about the whole population.
Use when comparing means or proportions.
Example:
Suppose you were the city police and you
wanted to know how many photo radar
tickets you could expect to collect next year
so that you can develop your budget...
Last year the mean number of tickets for all
locations was 9000 with a standard deviation
of 500 tickets. What is the probability that
you will give out between 7500 tickets (your
lowball guess) and 9625 (your highball
guess)?
Calculate Z score
…what type of scale must you have to
calculate Z scores?
…what reasons can you think of for wanting
to calculate a Z score for your research?
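The photo radar question above can be worked through numerically. The sketch below is a minimal illustration, assuming ticket counts are normally distributed with the stated mean of 9000 and standard deviation of 500. The helper names `z_score` and `normal_cdf` are illustrative and do not come from the slides.

```python
import math

def z_score(x, mean, sd):
    """Standardize x: how many standard deviations it lies from the mean."""
    return (x - mean) / sd

def normal_cdf(z):
    """Cumulative probability of the standard normal distribution up to z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

mean, sd = 9000, 500      # last year's figures from the example
low, high = 7500, 9625    # lowball and highball guesses

z_low = z_score(low, mean, sd)    # (7500 - 9000) / 500 = -3.0
z_high = z_score(high, mean, sd)  # (9625 - 9000) / 500 = 1.25

# Probability that the ticket count falls between the two guesses
probability = normal_cdf(z_high) - normal_cdf(z_low)
print(f"Z(low) = {z_low}, Z(high) = {z_high}, P = {probability:.4f}")
```

Under these assumptions the probability comes out at roughly 89%.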
Z tests, another application
You have been asked to conduct a
survey on customer satisfaction at the
food court. Customers indicate their
perceptions on a 5 point scale where
1=very unfriendly and 5=very friendly.
Assume this is an interval scale and that
previous studies have shown that a
normal distribution of scores is
expected.
Z tests, assumptions about
mean
You think: perhaps customers think that the
service is neither friendly nor unfriendly.
Ho: mean is equal to 3.0
H1: mean is not equal to 3.0
Establish the significance level (0.05), which is
equivalent to 95% confidence; therefore Z = ±1.96
You do a study with a sample of 225
interviews and the mean is 3.78. The
standard deviation is 1.5.
Do we accept or reject the null hypothesis?
A Sampling Distribution
[Figure: sampling distribution of the mean centered at μ = 3.0, with the lower limit and upper limit marked]
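Critical values of μ
Critical value - upper limit
$$\mu + Z S_{\bar{X}} \quad\text{or}\quad \mu + Z\frac{S}{\sqrt{n}} = 3.0 + 1.96\left(\frac{1.5}{\sqrt{225}}\right) = 3.196$$
Critical value - lower limit
$$\mu - Z S_{\bar{X}} \quad\text{or}\quad \mu - Z\frac{S}{\sqrt{n}} = 3.0 - 1.96\left(\frac{1.5}{\sqrt{225}}\right) = 2.804$$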
The sample mean is 3.78, which lies above the
upper limit of 3.196; therefore reject Ho
and say that the sample results are
significant at the .05 level of significance.
[Figure: range of acceptability from 2.804 to 3.196, centered on 3.0]
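The food court test can be reproduced in a few lines. This is a minimal sketch using only the numbers given in the example (hypothesized mean 3.0, sample mean 3.78, standard deviation 1.5, n = 225, Z = 1.96). The variable names are illustrative.

```python
import math

hypothesized_mean = 3.0   # Ho: mean is equal to 3.0
sample_mean = 3.78
sample_sd = 1.5
n = 225
z_critical = 1.96         # two-tailed, 0.05 level of significance

# Standard error of the mean: S / sqrt(n) = 1.5 / 15 = 0.1
standard_error = sample_sd / math.sqrt(n)

# Range of acceptability around the hypothesized mean
lower_limit = hypothesized_mean - z_critical * standard_error
upper_limit = hypothesized_mean + z_critical * standard_error

# Observed Z: distance of the sample mean from 3.0, in standard errors
z_observed = (sample_mean - hypothesized_mean) / standard_error

reject_null = abs(z_observed) > z_critical
print(f"limits = ({lower_limit:.3f}, {upper_limit:.3f}), "
      f"Z = {z_observed:.2f}, reject Ho: {reject_null}")
```

Comparing the observed Z with ±1.96 and comparing the sample mean with the two limits are equivalent ways of stating the same decision.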
Type I and Type II Errors
                 Accept null          Reject null
Null is true     Correct, no error    Type I error
Null is false    Type II error        Correct, no error
If sample is small…
Small usually means less than 30
Do a t test instead
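A small-sample version of the same test might look like the sketch below. The ten ratings are hypothetical data invented for illustration, and the critical value 2.262 is the standard two-tailed t value for the 0.05 level with 9 degrees of freedom, taken from a t table rather than computed.

```python
import math
import statistics

# Hypothetical friendliness ratings from 10 customers (illustrative data only)
ratings = [4, 3, 5, 4, 2, 4, 5, 3, 4, 4]
hypothesized_mean = 3.0
t_critical = 2.262  # two-tailed, 0.05 level, n - 1 = 9 degrees of freedom

n = len(ratings)
sample_mean = statistics.mean(ratings)
sample_sd = statistics.stdev(ratings)      # sample SD (n - 1 in the denominator)
standard_error = sample_sd / math.sqrt(n)

# Same form as the Z statistic, but compared against the t distribution
t_observed = (sample_mean - hypothesized_mean) / standard_error
reject_null = abs(t_observed) > t_critical
print(f"mean = {sample_mean}, t = {t_observed:.3f}, reject Ho: {reject_null}")
```

The only changes from the Z test are the critical value, which depends on the degrees of freedom, and the use of the sample standard deviation with n - 1.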
Is this statistically significant?
Chi-square test: a hypothesis test that allows
for investigation of statistical significance in
the analysis of a frequency distribution
(or cross tab)
Categorical data such as sex, education or
dichotomous answers may be statistically
analyzed
Tests the “goodness of fit” of the sample with
expected population results
Chi-square example
Through observation research we have identified
that, of the sample of 100 people who got photo
radar tickets, 60 were female and 40 were male. We
expected that the proportions should be equal (.5
probability for each sex), so the expected counts are
50 and 50. Our null hypothesis is that the observed
sample data are consistent with these expected
proportions, tested at the 0.05 level of significance.
If the calculated chi-square is above the critical
chi-square for this level (3.84, with 1 degree of
freedom) we reject the null hypothesis. This is the
case: the calculated chi-square is 4.0. The observed
values differ significantly from the expected values.
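The calculation behind the ticket example can be sketched as follows, using the observed counts (60 and 40), the expected equal split, and the critical value 3.84 quoted in the slide. The variable names are illustrative.

```python
# Observed counts from the sample of 100 ticketed drivers
observed = {"female": 60, "male": 40}
total = sum(observed.values())

# Expected counts under the hypothesis of equal proportions (.5 each)
expected = {sex: total * 0.5 for sex in observed}

# Chi-square statistic: sum of (observed - expected)^2 / expected
chi_square = sum(
    (observed[sex] - expected[sex]) ** 2 / expected[sex] for sex in observed
)

critical_value = 3.84  # 0.05 level, 1 degree of freedom (from the slide)
reject_null = chi_square > critical_value
print(f"chi-square = {chi_square}, reject Ho: {reject_null}")
```

Each sex contributes (10)^2 / 50 = 2.0, so the statistic is 4.0, which is above 3.84.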
Estimation of population
parameters: Confidence
The population mean and standard deviation
are unknown; we do know the sample mean
and standard deviation….
We take a sample of a number of students
with children and ask them to identify how
much they would be willing to pay per hour
for on campus childcare . Our sample size is
30. The student population with children is
estimated to be 300.
The sample mean is $2.60.
This is called a point estimate.
How close is this sample mean to the
population mean? How confident are we?
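Confidence level: the percentage
indicating the long run probability that the
results will be correct. Usually 95%.
Confidence interval: the range around the
point estimate within which the population
mean is expected to lie at that confidence level.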
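To make the childcare example concrete, the sketch below builds a 95% interval around the $2.60 point estimate. The slides do not give the sample standard deviation, so the value of $0.50 used here is purely an assumption for illustration. The function name and the optional finite population correction (relevant because the sample of 30 is a tenth of the estimated 300 students) are also illustrative additions, not part of the original example.

```python
import math

def confidence_interval(sample_mean, sample_sd, n, z=1.96, population_size=None):
    """Return (low, high) limits for the population mean."""
    standard_error = sample_sd / math.sqrt(n)
    if population_size is not None:
        # Optional finite population correction for sampling a large
        # share of a small population
        standard_error *= math.sqrt((population_size - n) / (population_size - 1))
    margin = z * standard_error
    return sample_mean - margin, sample_mean + margin

sample_sd = 0.50  # ASSUMED value; not given in the slides

low, high = confidence_interval(2.60, sample_sd, 30)
low_fpc, high_fpc = confidence_interval(2.60, sample_sd, 30, population_size=300)

print(f"95% interval: ${low:.2f} to ${high:.2f}")
print(f"with finite population correction: ${low_fpc:.2f} to ${high_fpc:.2f}")
```

With a sample of 30, some texts would use the t value for 29 degrees of freedom (about 2.045) instead of 1.96, which widens the interval slightly.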
Relationship between
variables
Correlation and regression
analysis
Types of questions
Is employee productivity associated with pay
incentives?
Is salary level correlated with type of degree
or designation?
Is willingness to pay student fee levies for
daycare correlated with whether one has a
child?
Are students' grades influenced by length of
term?
Measures of association
A general term that refers to a number of
bivariate statistical techniques used to
measure the strength of a relationship
between two variables
Correlation coefficient (r): most popular. Is a
measure of the standardized covariation or
association between two variables. It ranges
from +1 to -1
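As a rough illustration of "standardized covariation", the sketch below computes r by dividing the covariation of two variables by the product of their spreads. The function name `pearson_r` and the incentive and output figures are hypothetical, chosen only to show the calculation.

```python
import math

def pearson_r(xs, ys):
    """Correlation coefficient: covariation scaled to the range -1 to +1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariation / (spread_x * spread_y)

# Hypothetical data: pay incentive level and units produced per employee
incentive = [0, 1, 2, 3, 4]
output = [10, 12, 11, 15, 16]

r = pearson_r(incentive, output)
print(f"r = {r:.3f}")
```

A value near +1 indicates that the two variables rise together, a value near -1 that one falls as the other rises, and a value near 0 that there is little linear association.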
Measures of association
Coefficient of determination (r²)
The proportion of the total variance of a
variable that is accounted for by knowing the
value of another variable. Often shown as a
correlation matrix.
We have calculated r = -.65 when investigating
whether the number of years of university is
correlated with unemployment. Then r² = .42, so we
know that about 40% of the variance in
unemployment can be explained by variance
in years of university.
Regression analysis
Bivariate linear regression: a measure of
linear association that investigates a straight
line relationship.
Assuming that there is an association
between students' performance and length of
term, can we predict a student's GPA given
the distribution of their courses across
semesters?
Uses interval data
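A straight-line fit by least squares can be sketched as below. The term lengths and GPA values are hypothetical, and `least_squares_line` is an illustrative helper name, not something defined in the slides.

```python
def least_squares_line(xs, ys):
    """Return (intercept, slope) of the least squares line y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariation of x and y divided by the variation in x
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    # The fitted line always passes through the point of means
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical data: length of term in weeks (IV) and GPA (DV)
weeks = [6, 8, 10, 13, 15]
gpa = [2.5, 2.9, 3.0, 3.2, 3.6]

intercept, slope = least_squares_line(weeks, gpa)
predicted_gpa = intercept + slope * 12   # forecast for a 12-week term
print(f"GPA = {intercept:.2f} + {slope:.3f} * weeks; 12 weeks -> {predicted_gpa:.2f}")
```

Forecasts are most trustworthy for values of the independent variable that fall inside the range of the data used to fit the line.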
Regression analysis
Multiple regression analysis: an analysis
of association that simultaneously
investigates the effect of two or more
variables on a single, interval-scaled
dependent variable
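For two or more independent variables, the least squares coefficients can be found by solving the normal equations. The sketch below does this with plain Python so it needs no extra libraries; in practice a statistics package would normally be used. The data and both function names are hypothetical, and the code assumes the independent variables are not perfectly collinear.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix, copied so the inputs are not modified
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        known = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - known) / m[i][i]
    return x

def multiple_regression(rows, ys):
    """Return [intercept, b1, b2, ...] from the normal equations (X'X) b = X'y."""
    design = [[1.0] + list(row) for row in rows]   # leading 1 for the intercept
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(k)]
    return solve_linear_system(xtx, xty)

# Hypothetical data: (years of experience, incentive pay 0/1) -> productivity
predictors = [(1, 0), (2, 1), (3, 0), (4, 1), (5, 0), (6, 1)]
productivity = [12, 19, 16, 23, 20, 27]

coefficients = multiple_regression(predictors, productivity)
print([round(c, 3) for c in coefficients])
```

Each coefficient estimates the change in the dependent variable for a one-unit change in that independent variable, holding the others constant.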
Summary
Chi-square allows you to test whether an observed sample distribution fits some
given distribution. Are the groups in your cross tab independent?
Z and t tests are used to determine if the means or proportions of two samples
are significantly different.
Simple correlation measures the relationship of one variable to another.
Correlation coefficient (r) indicates the strength of the association and direction
of the association. The coefficient of determination measures the amount of the
total variance in the DV that is accounted for by knowing the value of the
independent variable. The results are often shown in a correlation matrix.
Bivariate regression investigates a straight-line relationship between one IV and
one DV. This can be done by plotting a scatter diagram or by using the least squares
method. This is used to forecast values of the DV given values of the IV. The
goodness of fit may be evaluated by calculating the coefficient of determination.
Multiple regression analysis allows for simultaneous investigation of the effect of
two or more IVs on the DV.
Type of Scale   Numerical Operation                                   Descriptive Statistics
Nominal         Counting                                              Frequency; cross tab; Percentage; mode
Ordinal         Rank ordering                                         (plus…) Median; Range; Percentile
Interval        Arithmetic operations on intervals between numbers    (plus…) Mean; Standard deviation; variance
Ratio           Arithmetic operations on actual quantities            (plus…) Geometric mean; Coefficient of variation
Selecting appropriate
univariate statistical method

Scale: Nominal scale
Business problem: Identify sex of key executives
Statistical question to be asked: Is the number of female executives equal to the number of male executives?
Possible test of statistical significance: Chi-square test

Scale: Nominal scale
Business problem: Indicate percentage of key executives who are male
Statistical question to be asked: Is the proportion of male executives the same as the hypothesized proportion?
Possible test of statistical significance: Z test

Scale: Ordinal scale
Business problem: Compare actual and expected evaluations
Statistical question to be asked: Does the distribution of scores for a scale with categories of poor, good, excellent differ from an expected distribution?
Possible test of statistical significance: Chi-square test

Scale: Interval or Ratio scale
Business problem: Compare actual and hypothetical values of average salary
Statistical question to be asked: Is the sample mean significantly different from the hypothesized population mean?
Possible test of statistical significance: Z-test (sample is large); T-test (sample is small)
Determining Sample Size
What data do you need to consider?
Variance or heterogeneity of population
The degree of acceptable error (confidence
interval)
Confidence level
Generally, we need to make judgments on
all these variables
Determining Sample Size
Variance or heterogeneity of population
Previous studies? Industry expectations?
Pilot study?
Sequential sampling
Rule of thumb: the value of standard
deviation is expected to be 1/6 of the
range.
Determining Sample Size
Formula
$$n = \left(\frac{ZS}{E}\right)^2$$
Z = standardization value indicating
confidence level
S = sample standard deviation
E = acceptable magnitude of error
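The formula, together with the rule of thumb that the standard deviation is about one sixth of the range, can be sketched as below. The 5-point scale and the acceptable error of 0.1 are hypothetical inputs chosen for illustration, and both function names are illustrative.

```python
import math

def sd_from_range(low, high):
    """Rule of thumb: the standard deviation is about 1/6 of the range."""
    return (high - low) / 6

def sample_size(z, s, e):
    """Sample size (ZS/E)^2, rounded up to a whole respondent."""
    return math.ceil((z * s / e) ** 2)

# Hypothetical case: a 5-point scale, 95% confidence, error of +/- 0.1 points
estimated_sd = sd_from_range(1, 5)
n = sample_size(1.96, estimated_sd, 0.1)
print(f"estimated SD = {estimated_sd:.3f}, required sample size = {n}")
```

Because E is squared in the denominator, halving the acceptable error roughly quadruples the required sample size.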
It's not the size that matters….