Inferential Statistics - Data Analysis and Modeling for Public Affairs

Inferential Statistics
Paf 203
Data Analysis and Modeling for
Public Affairs
REVIEW







What is statistics?
What is the difference between a population
and a sample?
What is a parameter? A statistic?
What are the measures of central
tendency?
What are the measures of dispersion?
What is descriptive statistics?
What is inferential statistics?
Statistics Quotations
Statistics is like a bikini, what it reveals is
suggestive, what it conceals is vital.
(Aaron Levenstein)
"Statistics is like a miniskirt, it covers up
essentials but gives you the ideas." (A Paris banker)
There are three kinds of lies – lies, damned
lies and statistics. – Mark Twain
Review
The word statistics can be viewed in two contexts:
Singular sense: statistics as a science
Plural sense: statistics as the actual numbers derived from the data
Review

Statistics is the science of designing studies,
gathering data, and then classifying,
summarizing, interpreting, and presenting these
data to explain, make inferences, and support
the decisions that are reached.
This definition points out four stages in a statistical investigation,
namely:
1) Collection of data 2) Presentation of data
3) Analysis of data 4) Interpretation of data (to draw
valid conclusions)

Review




A population is the complete collection of
measurements, objects, or individuals under
study.
A sample is a portion or subset taken from the
population.
A parameter is a number that describes a
population characteristic.
A statistic is a number that describes a sample
characteristic.
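The distinction between a parameter and a statistic can be sketched in a few lines of Python. The population below is made-up data (hypothetical ages), used only to show that the same measure is a parameter when computed on the whole population and a statistic when computed on a sample.

```python
import random
import statistics

# Hypothetical population: ages of 1,000 residents (made-up data for illustration).
rng = random.Random(42)
population = [rng.randint(18, 80) for _ in range(1000)]

# Parameter: a number that describes the whole population.
population_mean = statistics.mean(population)

# Statistic: the same measure computed on a sample (a subset of the population).
sample = rng.sample(population, 50)
sample_mean = statistics.mean(sample)

print(f"Parameter (population mean): {population_mean:.2f}")
print(f"Statistic (sample mean):     {sample_mean:.2f}")
```

The two numbers will usually be close but not identical, which is why inferences from a sample carry uncertainty.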
Two Broad Categories
of Statistics

Descriptive Statistics

Inferential Statistics
Descriptive Statistics
• Used to describe a mass of data in a clear, concise,
and informative way
• Deals with the methods of organizing, summarizing,
and presenting data

Example
The National Statistics Office (NSO) presented
the Philippine population by age group and
gender using a graph.
Inferential Statistics
Concerned with making
generalizations (drawing
conclusions) about the
characteristics of a larger set
(population) where only a
part (sample) is examined
Inferential Statistics
Larger Set
(N units/observations)
Smaller Set
(n units/observations)
Inferences and Generalizations
Example
A new milk formulation designed to improve the
psychomotor development of infants was tested on
randomly selected infants. Based on the results, it was
concluded that the new milk formulation is effective in
improving the psychomotor development of infants.
METHODS OF DRAWING
CONCLUSIONS
Deductive Method
• It draws conclusions from general to specific.
• It assumes that any part of the population will bear the
observed characteristics of the population.
• Hence, conclusions are stated with certainty.
Population → Inference → Sample
ILLUSTRATION
Statement 1: All UPLB students are
intelligent.
Statement 2: Pedro is a UPLB student.
Conclusion:
Pedro is intelligent.
METHODS OF DRAWING
CONCLUSIONS
Inductive Method
• It draws conclusions from specific to general.
• It assumes that the characteristics observed in a part of the
population are likely to hold true for the whole population.
• Hence, conclusions are subject to uncertainty.
Sample → Inference → Population
ILLUSTRATION
Statement 1: Pedro is a UPLB student.
Statement 2: Pedro is intelligent.
Conclusion:
All UPLB students are intelligent.
INFERENTIAL STATISTICS
It makes use of the inductive method of
drawing conclusions.
Sample
Sampling
Process
Data
Inferences/Generalization
(Subject to Uncertainty)
Population
The Necessary Steps of Inferential
Statistics





1. Specify the question to be answered and
identify the population of interest.
2. Describe how to select the sample.
3. Select the sample and analyze the sample
information using descriptive statistics.
4. Use the information from step 3 to make an
inference about the population.
5. Determine the reliability of the inference.
What do we need to know?


Random variable and its behavior
Sampling is the process by which a sample is drawn from a
much larger body of measurements called the
population.
Some definitions
Inferential statistics generalizes a particular
characteristic of a population using
information generated from a sample.
A variable is a characteristic that changes or varies
over time or across the different individuals or objects
under consideration.
A population is the set of all measurements of interest.
A sample is a subset of measurements selected from
a population of interest.
(cont.) Some definitions

A sampling distribution is a theoretical, probabilistic
distribution of all possible sample outcomes (with
constant sample size n), for the statistic that is to be
generalized to the population.
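As a minimal sketch of this definition, the sampling distribution of the sample mean can be listed exhaustively for a very small population. The population values and the sample size below are hypothetical, and samples are taken without replacement, which is an assumption the slide does not state.

```python
from itertools import combinations
from statistics import mean

population = [2, 4, 6, 8, 10]  # hypothetical population, N = 5
n = 2                          # constant sample size

# Every possible sample of size n (without replacement) and its mean.
sample_means = [mean(s) for s in combinations(population, n)]

print("Number of possible samples:", len(sample_means))
print("Sample means:", sorted(sample_means))
print("Mean of the sample means:", mean(sample_means))
print("Population mean:", mean(population))
```

In this sketch the mean of all possible sample means equals the population mean, which is the property that lets a sample mean stand in for the population mean.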

Collecting data
It is possible to gather data from an entire population;
this is called a census. Usually, data gathered from
experiments and observations come from samples.
(cont.) Some definitions

A sample should be representative of the population,
but there are many ways that samples can be
selected. It is helpful to categorize them into
nonprobability and probability samples.

A nonprobability sample is one where judgment of
the experimenter, the method in which data are
collected, or other factors could affect the results of
the sample.
Variable
A variable is a characteristic that changes or varies
over time or across the different individuals or objects
under consideration.
Broad Classification of Variables:

QUANTITATIVE
  DISCRETE
  CONTINUOUS
QUALITATIVE
Types of Variable
Qualitative
• assumes values that are not numerical but
can be categorized
• categories may be identified by either non-numerical
descriptions or by numeric codes
Types of Variable
Quantitative
• indicates the quantity or amount of a
characteristic
• data are always numeric
• can be discrete or continuous
Types of Quantitative Variables
Discrete – variable with a finite or
countable number of
possible values
Continuous – variable that assumes
any value in a given
interval or continuum of
values
RANDOM VARIABLE
- a rule or function that assigns
exactly one real number to every
possible outcome of a random
experiment
Note: The domain of the function is the
sample space S and the co-domain is the set of
real numbers, $\mathbb{R}$.
TYPES OF RANDOM VARIABLES
Discrete random variables take on a finite set of
distinct possible values or a countably
infinite number of possible values.
TYPES OF RANDOM VARIABLES
Continuous random
variables take on any
value within a specified
interval or continuum of
values.
Basic Concepts in Sampling
• SAMPLING – the process of selecting a sample
• PARAMETER – descriptive measure of the
population
• STATISTIC – descriptive measure of the sample
• INFERENTIAL STATISTICS – concerned with
making generalizations about parameters using
statistics
WHY DO WE USE SAMPLES?
1. Reduced Cost
2. Greater Speed or Timeliness
3. Greater Efficiency and Accuracy
4. Greater Scope
5. Convenience
6. Necessity
7. Ethical Considerations
TWO TYPES OF SAMPLES
• Probability Samples
• Non-Probability Samples
Non-Probability Samples
• Samples are obtained haphazardly, selected
purposively, or taken as volunteers.
• The probabilities of selection are unknown.
• They should not be used for statistical inference.
• They result from the use of judgment sampling,
accidental sampling, purposive sampling, and
the like.
Three commonly employed nonprobability samples are
judgment samples, voluntary samples, and convenience
samples.
1. Judgment samples: sample selection is sometimes
based on the opinion of one or more persons who
feel qualified to identify items for a sample as being
characteristic of the population. Example: a political
campaign manager intuitively picks certain voting
districts as reliable places to measure the public’s
opinion of her candidate. The poll taken in
these districts is a judgment sample based on the
campaign manager’s expertise and experience.
2. Voluntary samples: sometimes questions are
posed to the public by publishing them in
print media or by broadcasting them over
radio or television. Respondents dial one
number to indicate yes and another to
indicate no. Such polls produce voluntary
samples and attract only those who are
interested in the subject matter.
3. Convenience samples: often people want
to take an easy sample. For example, a
surveyor will stand in one location and ask
passersby their question or questions. Or
a student working on a project will ask the
entire class to fill out a survey
questionnaire. A sample taken at the
convenience of the surveyor is called a
convenience sample.
Probability Samples …
• Samples are obtained using some objective chance
mechanism, thus involving randomization.
• They require the use of a complete listing of the
elements of the universe called the sampling frame.
• The probabilities of selection are known.
• They are generally referred to as random samples.
• They allow drawing of valid generalizations about the
universe/population.
Probability Sample

A probability sample is one in which the
chance of selection of each item in the
population is known before the sample is
picked.
Types of probability samples
1. Simple random sample: a probability
sample chosen in such a way that
all possible groupings of a given size have
an equal chance of being picked and
each item in the population has an equal
chance of being selected.
(cont.) Types of probability samples
2. Systematic samples: suppose we have a list of
1,000 registered voters in a community and we want
to pick a probability sample of 50. We can use a
random number table to pick one of the first 20
voters (1,000/50 = 20) on our list. If the table gave
us the number 16, then the 16th voter on the list
would be the first to be selected. We would then
pick every 20th name after this random start (the
36th voter, 56th voter, etc.) to produce a systematic
sample.
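The voter example can be sketched in Python as follows. The function name `systematic_sample` and the numbered voter list are hypothetical, and the sketch assumes the population size is an exact multiple of the sample size, as it is in the example (1,000/50 = 20).

```python
import random

def systematic_sample(frame, n, rng=random):
    """Pick every k-th unit after a random start, where k = N // n.

    Assumes N is an exact multiple of n, as in the voter example.
    """
    N = len(frame)
    k = N // n                    # sampling interval
    start = rng.randint(1, k)     # random start among the first k units (1-based)
    positions = range(start, N + 1, k)
    return [frame[p - 1] for p in positions][:n]

voters = list(range(1, 1001))     # hypothetical list: voter 1 .. voter 1000
sample = systematic_sample(voters, 50, random.Random(0))

print("First selected voters:", sample[:3])
print("Sample size:", len(sample))
```

Whatever the random start, consecutive selected voters are always 20 positions apart, so only the starting point is left to chance.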
(cont.) Types of probability samples
3. Stratified samples: if the population is divided into
relatively homogeneous groups, or strata, and a
sample is drawn from each group to produce an
overall sample, this overall sample is known as a
stratified sample. Stratified sampling is usually
performed when there is large variation within the
population and the researcher has some prior
knowledge of the structure of the population that
can be used to establish the strata. The sample
results from each stratum are weighted and
combined with the sample results of the other strata to
provide the overall estimate.
(cont.) Types of probability samples
4. Cluster samples: a cluster sample is one in which
the individual units to be sampled are actually groups
or clusters of items. It is assumed that the individual
items within each cluster are representative of the population.
Example: consumer surveys in big cities employ
cluster sampling. They divide the city into blocks,
each block containing a cluster of households to be
surveyed. A number of clusters are selected for the
sample, and the households in those clusters are
surveyed.
METHODS OF PROBABILITY SAMPLING
Simple Random Sampling
Stratified Random Sampling
Systematic Random Sampling
Cluster Sampling
SIMPLE RANDOM SAMPLING
(SRS)
• Most basic method of drawing a
probability sample
• Assigns equal probabilities of selection
to each possible sample
• Results in a simple random sample
TYPES OF SIMPLE RANDOM SAMPLE
(SRSWOR)
SRS Without Replacement (SRSWOR) – does not
allow repetitions of selected units in the sample
TYPES OF SIMPLE RANDOM SAMPLE
(SRSWR)
SRS With Replacement (SRSWR) – allows
repetitions of selected units in the sample
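The difference between the two types can be sketched with Python's standard `random` module. The sampling frame of ten numbered units is hypothetical.

```python
import random

rng = random.Random(1)
frame = list(range(1, 11))        # hypothetical sampling frame, N = 10 units

# SRSWOR: a unit can appear in the sample at most once.
srswor = rng.sample(frame, 5)

# SRSWR: each draw is made from the full frame, so repeats are possible.
srswr = rng.choices(frame, k=5)

print("SRSWOR:", srswor)
print("SRSWR: ", srswr)
```

`random.sample` never repeats a unit, while `random.choices` may return the same unit more than once.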
STRATIFIED RANDOM SAMPLING
The universe is divided into L mutually
exclusive sub-universes called strata.
Independent simple random samples
are obtained from each stratum.
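Note: $N = \sum_{h=1}^{L} N_h$ and $n = \sum_{h=1}^{L} n_h$,
where $N_h$ is the number of units in stratum $h$ and $n_h$ is the
number of units sampled from it.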
ILLUSTRATION
Stratified Random Sample
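A minimal sketch of stratified random sampling follows. The slides do not say how the stratum sample sizes are chosen; proportional allocation (each stratum's share of the sample equals its share of the population) is assumed here, and the stratum names and sizes are made up.

```python
import random

def stratified_sample(strata, n, rng):
    """Draw an independent simple random sample from each stratum.

    Assumption: proportional allocation, n_h = round(n * N_h / N).
    In general, rounding can make the n_h sum differ slightly from n.
    """
    N = sum(len(units) for units in strata.values())
    result = {}
    for name, units in strata.items():
        n_h = round(n * len(units) / N)
        result[name] = rng.sample(units, n_h)   # SRS without replacement
    return result

# Hypothetical universe with L = 2 strata.
strata = {
    "urban": [f"U{i}" for i in range(600)],
    "rural": [f"R{i}" for i in range(400)],
}
picked = stratified_sample(strata, 50, random.Random(7))
sizes = {name: len(units) for name, units in picked.items()}
print(sizes)
```

With these made-up stratum sizes, the 50 sampled units split 30 and 20, matching the 600:400 split of the population.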
Advantages of Stratification
1. It gives a better cross-section of the population.
2. It simplifies the administration of the
survey/data gathering.
3. The nature of the population dictates some
inherent stratification.
4. It allows one to draw inferences for various
subdivisions of the population.
5. Generally, it increases the precision of the
estimates.
SYSTEMATIC SAMPLING
Adopts a skipping pattern in the selection of
sample units
Gives a better cross-section if the listing is
linear in trend but has a high risk of bias if there
is periodicity in the listing of units in the
sampling frame
Allows the simultaneous listing and selection
of samples in one operation
ILLUSTRATION
Systematic Sample
Population
Determine the sampling interval, k = N/n
CLUSTER SAMPLING
• It considers a universe divided into N
mutually exclusive sub-groups called
clusters.
• A random sample of n clusters is selected
and their elements are completely
enumerated.
• It has simpler frame requirements.
• It is administratively convenient to implement.
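Cluster sampling can be sketched as below. The city blocks and household labels are hypothetical, and the function name `cluster_sample` is only illustrative.

```python
import random

def cluster_sample(clusters, n_clusters, rng):
    """Select n_clusters at random, then enumerate every element in them."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {c: list(clusters[c]) for c in chosen}

# Hypothetical universe: 20 city blocks (clusters) of 8 households each.
clusters = {
    f"block_{b}": [f"hh_{b}_{i}" for i in range(8)]
    for b in range(1, 21)
}
surveyed = cluster_sample(clusters, 4, random.Random(3))

for block, households in surveyed.items():
    print(block, "->", len(households), "households surveyed")
```

Only a list of clusters is needed as the frame; a list of every household is not, which is the simpler frame requirement noted above.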
ILLUSTRATION
Population
Cluster Sample
What is a sampling error?

If we want to make a judgment about a
population from a sample, we want the sample
results to be as typical of the population as
possible. This is difficult to achieve, so we have
to live with sampling errors. Errors can also come
from coding and recoding of data. Results
obtained from a biased sample are
worthless.
How to determine a sample size

Use the formula
$$n = \frac{PQ}{(d/2)^2}$$
where: P is the proportion in the target population, based
on prior information; Q is (1 − P); and d is the degree of error
defined by the investigator.

Example: with P = 50% and d = 3%, $n = \frac{(0.50)(0.50)}{(0.03/2)^2} \approx 1{,}111$, or about 1,200.
If N is known, then adjust to $n^* = \frac{n}{1 + n/N}$.
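The formula and its adjustment can be checked with a short Python sketch. The function name `sample_size` and the population size N = 10,000 are hypothetical. Dividing d by 2 is consistent with using a z value of about 2 (roughly 95% confidence), although the slides do not state this.

```python
def sample_size(p, d, N=None):
    """Sample size n = PQ / (d/2)^2, optionally adjusted for a known N."""
    q = 1 - p
    n = p * q / (d / 2) ** 2
    if N is not None:
        n = n / (1 + n / N)       # adjustment when the population size is known
    return n

n0 = sample_size(0.50, 0.03)                  # the example from the slide
n_adj = sample_size(0.50, 0.03, N=10_000)     # hypothetical population size

print(f"Unadjusted n: {n0:.0f}")
print(f"Adjusted n* for N = 10,000: {n_adj:.0f}")
```

In practice the result is rounded up to a whole number of respondents.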
Concept of Hypothesis testing

The objective of hypothesis testing is to
determine whether or not the sample data
support some belief or hypothesis about the
population. In hypothesis testing, we make
assumptions about the unknown parameters.
Hypothesis testing has five steps:
1. Formulating the hypothesis
2. Selecting the statistical analysis model to be
used
3. Setting the criteria for rejecting the null
hypothesis
4. Analysis
5. Making a decision
Formulating the hypothesis


There are two types of hypotheses: the null
hypothesis (Ho) and the alternative hypothesis (Ha).
The alternative hypothesis is the hypothesis that the
researcher wants to prove. The purpose of the
test is to determine whether or not
the evidence provided by the sample is enough to
establish that the null hypothesis is not true. If there
is enough such evidence, then we will say that there
is evidence to support the alternative hypothesis.
Examples of types of hypothesis
a. Hypothesis concerning the value of the population
mean: $\mu = \mu_0$
b. Hypothesis concerning the value of the difference
in the means of two populations: $\mu_1 = \mu_2$
c. Hypothesis concerning the relationship between two
nominal-scale variables: $\chi^2 = 0$
There are also three possible forms of
alternative hypothesis:
• Ha: $\mu \neq \mu_0$ (that is, $\mu < \mu_0$ or $\mu > \mu_0$)
• Ha: $\mu > \mu_0$
• Ha: $\mu < \mu_0$
1. Selecting the statistical model to be
used:

Having specified the null and alternative hypotheses,
we then select the appropriate test statistic or
statistical model to be used. The choice of
statistic depends on a number of factors: (1)
the nature of the hypothesis problem, (2) the level of
measurement used, and (3) the assumption of
normality.

Some of the most frequently used statistical tests are:
– the T-test
– the Z test
– the Chi square test
2. Setting the criteria for rejecting the
null hypothesis.

This involves two things: (1) selecting a significance level and (2)
determining the area of rejection.

The level of significance refers to the probability of rejecting the
null hypothesis when it is true. Rejecting a true null hypothesis is
called the Type I error or the α error. (The Type II error or the β error
is accepting the null hypothesis when it is not true.) We select the
level of significance before we compute the test statistic, and we
need to select a level that we think is reasonable. The decision as to
which significance level to use depends on the questions involved.
Social scientists routinely accept a probability of 0.05 of wrongly
rejecting the null hypothesis. If a statistical test would lead to
significant policy recommendations, then you may wish to reduce the
risk of being in error and specify a significance level of 0.01 or
even 0.001.
TWO TYPES OF ERROR
A TYPE I ERROR is committed
when we reject a true null
hypothesis.
A TYPE II ERROR is committed
when we accept a false null
hypothesis.
PROBABILITY OF COMMITTING
ERRORS
The PROBABILITY OF A TYPE I ERROR
is usually denoted by α.
$\alpha = P(\text{Type I error}) = P(\text{Reject Ho} \mid \text{Ho is TRUE})$
It is also known as the level of
significance of a statistical test.
PROBABILITY OF COMMITTING
ERRORS
The PROBABILITY OF A TYPE II ERROR
is usually denoted by β.
$\beta = P(\text{Type II error}) = P(\text{Accept Ho} \mid \text{Ho is FALSE})$
DECISION MATRIX
ACTION    | Ho is true       | Ho is false
Reject Ho | Type I error (α) | Correct decision
Accept Ho | Correct decision | Type II error (β)
Area of Rejection


Based on the significance level we choose, we then
delineate our region of acceptance and region of
rejection. The region of rejection is also called the
critical region. Outcomes falling here mean we reject
the null hypothesis. Our critical region will also
depend on whether we are doing a right tailed test, a
left tailed test or a two-tailed test.
If our alternative hypothesis involves the > sign, we
use the right tailed test. When our alternative
hypothesis involves the < sign, we use the left tailed test.
When our alternative hypothesis involves the ≠ sign,
we use the two tailed test.
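How the critical region depends on the significance level and on the kind of test can be sketched for the Z test using Python's standard library. The function names below are hypothetical, and the sketch assumes the test statistic follows a standard normal distribution.

```python
from statistics import NormalDist

def z_critical_values(alpha, tail):
    """Return the critical z value(s) that bound the region of rejection."""
    std_normal = NormalDist()     # mean 0, standard deviation 1
    if tail == "right":
        return (std_normal.inv_cdf(1 - alpha),)
    if tail == "left":
        return (std_normal.inv_cdf(alpha),)
    if tail == "two":
        upper = std_normal.inv_cdf(1 - alpha / 2)   # alpha is split between tails
        return (-upper, upper)
    raise ValueError("tail must be 'right', 'left', or 'two'")

def in_critical_region(z_stat, alpha, tail):
    """True if the test statistic falls in the region of rejection."""
    critical = z_critical_values(alpha, tail)
    if tail == "right":
        return z_stat > critical[0]
    if tail == "left":
        return z_stat < critical[0]
    return z_stat < critical[0] or z_stat > critical[1]

for tail in ("right", "left", "two"):
    values = ", ".join(f"{c:.3f}" for c in z_critical_values(0.05, tail))
    print(f"{tail}-tailed test at alpha = 0.05: critical z = {values}")
```

For a two-tailed test the significance level is divided between both tails, so the critical values lie farther from zero than in a one-tailed test at the same level.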
3. Analysis

The analysis part is the process of computing
our test statistic based on the
assumptions we made and the data we have.
4. Making a decision

In assessing the null hypothesis, we can accept the
null hypothesis or reject it in favor of the alternative
hypothesis. Our decision will be based on the value
of the test statistic we obtain in the analysis stage. If
the value of the test statistic is located in the critical
region, we reject the null hypothesis in favor of the
alternative hypothesis. Our findings may be taken as
conclusive even if there is the probability that we
may be in error. If the test statistic is located in the
acceptance region, we accept the null hypothesis.
Our findings are not conclusive. We simply do not
have enough evidence to prove our alternative
hypothesis.
Example:

When the judge pronounces the
defendant “guilty”, the defendant serves the sentence
imposed even if there is a probability that he is not
actually guilty. But when the judge hands down a
verdict of not guilty, it is usually not because it has
been proven beyond reasonable doubt that the defendant
is not guilty. There is simply not enough evidence to prove
that the defendant is guilty.
Example of hypothesis testing:
The T distribution

Compare the academic achievement of the foreign students
with that of the total student population.

Given: student body GPA = 2.0, variance is unknown; foreign
student GPA = 2.58, s = 1.23, n = 30

Assumptions: random sampling; the sampling distribution is normal.

Step 1. Stating the null and alternative hypotheses:
Ho: µ = 2.00, H1: µ ≠ 2.00.

Step 2. Selecting the sampling distribution and establishing
the critical region: use the t statistic and define the probability
of error, α = 0.01, for a two tailed test with degrees of
freedom = (n − 1) = 29.
(Cont.) T distribution



Step 3. Critical region: t = ±2.756
Step 4. Computing the test statistic:
$t = \frac{\bar{x} - \mu}{s / \sqrt{n - 1}}$
$t = \frac{2.58 - 2.00}{1.23 / \sqrt{29}}$
$t = 0.58 / 0.23$
$t = +2.52$
Step 5. Making a decision: do not reject Ho, because t = 2.52 does not
exceed the critical value of 2.756.
The difference between the sample mean (2.58) and the
population mean (2.0) is no greater than is expected if only
random chance were operating.
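The arithmetic of this example can be reproduced with a short Python sketch. It uses the slide's formula with n − 1 under the square root and takes the critical value 2.756 from the slide rather than computing it. Without rounding the standard error to 0.23, t comes out at about 2.54 instead of 2.52; the decision is the same.

```python
import math

# Values given in the example.
sample_mean = 2.58     # foreign student GPA
mu = 2.00              # student body GPA under Ho
s = 1.23               # sample standard deviation
n = 30                 # sample size
t_critical = 2.756     # two-tailed, alpha = 0.01, df = 29 (taken from the slide)

standard_error = s / math.sqrt(n - 1)
t = (sample_mean - mu) / standard_error
reject = abs(t) > t_critical

print(f"standard error = {standard_error:.4f}")
print(f"t = {t:.2f}")
print("Decision:", "reject Ho" if reject else "do not reject Ho")
```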
Introduction to the
Chi square test of independence

The Chi square ($\chi^2$) test of independence is
a very general test that is used to evaluate
whether or not frequencies which have been
empirically obtained differ significantly from
those which would be expected if no
relationship between the variables existed.
Chi square test of independence

Suppose we want to look at the relationship between
religious affiliation and political affiliation. Suppose that, for
this purpose, we selected a random sample of 100 Iglesia
ni Cristo (INC) members and 100 Jesus is Lord (JIL)
members. We asked each of them how they voted during
the last Presidential elections, and put the results in a
bivariate table.

      | INC        | JIL        | Total
LAMMP | 80 (80%)   | 40 (40%)   | 120
LAKAS | 20 (20%)   | 60 (60%)   | 80
Total | 100 (100%) | 100 (100%) | 200
Chi square test of independence


If we examine the percentages, we see that of the 100 INC
members, a larger proportion (80%) voted LAMMP, while of the
100 JIL members, a larger proportion (60%) voted LAKAS. It
seems that there was a strong tendency for INC members to
vote LAMMP and JIL members to vote LAKAS. Can we
conclude, based on these sample results, that there is a relationship
between religious affiliation and political affiliation?
The Chi-square test of independence is a technique for testing
the level of statistical significance attained by a bivariate
relationship in a cross tabulation. It can apply to any level of
measurement, as the relationship can always be put in a
bivariate or contingency table.
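The slides stop before the computation, so the following is only a sketch of how the statistic could be obtained from the table, assuming the standard Pearson chi-square formula without a continuity correction. The expected frequency of each cell is (row total × column total) / grand total, and 3.841 is the usual tabled critical value for 1 degree of freedom at the 0.05 level.

```python
# Observed frequencies from the bivariate table (rows: LAMMP, LAKAS; columns: INC, JIL).
observed = [
    [80, 40],
    [20, 60],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

# Frequencies expected if there were no relationship between the variables.
expected = [[r * c / total for c in col_totals] for r in row_totals]

chi_square = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
df = (len(observed) - 1) * (len(observed[0]) - 1)
critical_05 = 3.841    # tabled chi-square critical value, df = 1, alpha = 0.05

print("Expected frequencies:", expected)
print(f"chi-square = {chi_square:.2f} with df = {df}")
print("Exceeds the 0.05 critical value:", chi_square > critical_05)
```

Under these assumptions the statistic is far above the critical value, which would point toward rejecting the hypothesis of no relationship for this sample.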
END – PART 1