Types of Variables - Penn State Department of Statistics

Presentation 2
Summarizing One or Two Categorical Variables
&
Relationships Between Categorical Variables
Types of Variables

Categorical – Possible values define groups or categories, not
necessarily in any apparent order
Ex. Color of M&M’s
Gender
Stat 200 Section

Ordinal – Categorical variable where values or categories have a
natural ordering
Ex. Rate the roller coaster on a scale of 1-5 (1 is terrible and 5 is excellent)
Age groups (child, teen, adult, senior citizen)
Shirt sizes (S, M, L, XL)

Quantitative – Measurements or counts, recorded as numerical
values
Ex. Height
Temperature
# of Red M&M’s
Possible Roles Played by
Variables:

Response Variables – the variables whose outcome
we want to determine.
These are the variables of main interest.

Explanatory Variables – variables that partially
explain the value of the response variable
for the individual.
For each of the following identify the response and the
explanatory variables as well as the variable type:
1. Is there a relationship between a person’s gender and their favorite kind of
   music?
   Response:
   Explanatory:
2. Do men and women listen to the same number of hours of music?
   Response:
   Explanatory:
3. Do people who play musical instruments rate the types of music the same?
   Response:
   Explanatory:
4. Do people who have a CD burner prefer to buy or burn their CDs?
   Response:
   Explanatory:
5. Does a person’s hometown influence the amount they would pay for a
   single CD?
   Response:
   Explanatory:
Summarizing Categorical Variables:

For one variable:
1. Numerical Summaries: counts and percents
2. Graphical Summaries: Pie Chart or Bar Graph

For two variables:
1. Numerical Summaries: 2-way tables with counts and row
   percents. The explanatory variable should be the row
   variable (first variable entered in Minitab) and the response
   variable should be the column variable (second variable
   entered in Minitab).
2. Graphical Summaries: Bar Graph
Example for One Categorical Variable:

Where do Penn State alumni live? The PSU Alumni Association would
like to obtain the answer to this question from all PSU alumni. They
can’t ask all alumni so they take a random sample of 50 alumni from
the directory. They determined the state of residence from the
address. Here are the results:
State    Frequency
PA       25
NJ       10
MD        5
VA        5
OH        2
NY        2
OTHER     1
TOTAL    n=50
What do these descriptive statistics tell us?
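As a rough illustration of the numerical summary for one categorical variable, the counts in the alumni table can be turned into percents. This is a minimal Python sketch, not part of the original presentation (which uses Minitab); the names `state_counts` and `percents` are made up for the example.

```python
# Counts from the alumni sample (n = 50), copied from the frequency table.
state_counts = {"PA": 25, "NJ": 10, "MD": 5, "VA": 5, "OH": 2, "NY": 2, "OTHER": 1}


def percents(counts):
    """Return each category's share of the total, as a percent."""
    total = sum(counts.values())
    return {category: 100 * count / total for category, count in counts.items()}


state_percents = percents(state_counts)

# Print a small table of counts and percents, one row per state.
for state, pct in state_percents.items():
    print(f"{state:<6} {state_counts[state]:>3} {pct:6.1f}%")
```

The percents describe only the 50 sampled alumni; whether they carry over to all alumni is a question of inference.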
Example for Two Categorical Variables:

Do most college students have a credit card? A study would like to
determine if the percentage of students that have at least one credit card
differs based on year in school. Four different samples (Fr, So, Jr, Sr),
each of 100 PSU students, were obtained. Each student was asked one
question, “Do you currently have at least one credit card?”
Identify the response and the explanatory variable in this case:
Response:
Explanatory:
Bar Graph Credit Card Example

               Yes    No    Row total
Freshman        42    58    100
Sophomore       55    45    100
Junior          76    24    100
Senior          81    19    100
Column Total   254   146    400
[Bar graph: Yes and No responses (vertical axis, 0 to 90) for each year in school (Fr, So, Jr, Sr)]
What do these descriptive statistics tell us?
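The row percents for the credit card example can be computed from the counts as in the sketch below. It is written in Python purely for illustration (the presentation itself uses Minitab), and `credit_card` and `row_percents` are hypothetical names. Year in school is the row variable and the credit card answer is the column variable.

```python
# Counts from the credit card example: 100 students sampled per year in school.
credit_card = {
    "Freshman": {"Yes": 42, "No": 58},
    "Sophomore": {"Yes": 55, "No": 45},
    "Junior": {"Yes": 76, "No": 24},
    "Senior": {"Yes": 81, "No": 19},
}


def row_percents(table):
    """Convert each row of counts into percents of that row's total."""
    result = {}
    for row, cells in table.items():
        row_total = sum(cells.values())
        result[row] = {col: 100 * n / row_total for col, n in cells.items()}
    return result


card_percents = row_percents(credit_card)

for year, cells in card_percents.items():
    print(f"{year:<10} Yes {cells['Yes']:5.1f}%   No {cells['No']:5.1f}%")
```

Because every row total is 100 here, the row percents equal the counts; with unequal sample sizes they would differ, which is why percents rather than counts are compared across rows.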
Assessing the Statistical Significance of the
Relationship between two Categorical Variables.
Suppose we ask 15 randomly picked students 2 questions:
1. Do you smoke?
2. Did you have a beer last night?
We summarize the results using the Cross Tabulation function in Minitab:

Tabulated Statistics: smoke, beer
Rows: smoke   Columns: beer

            n        y      All
n           9        2       11
        81.82    18.18   100.00
y           1        3        4
        25.00    75.00   100.00
All        10        5       15
        66.67    33.33   100.00

Cell Contents: Count
               % of Row
Inference about the Population!

How can we tell if there’s a relationship between being
a smoker and drinking beer last night?

Does the relationship seen in the sample data hold in
the population represented by this sample?


Techniques used to make generalizations about the
population using a sample are known as inferential
statistics.
A statistically significant relationship is one that
is large enough to be unlikely to have occurred in the
observed sample if there is no relationship in the
population.
Null and Alternative Hypotheses


Another way to express our objective is that we are
deciding between two possible hypotheses about the
population:
Null Hypothesis: The two variables are not related.
Alternative Hypothesis: The two variables are related.
In our example we have:
Null Hypothesis: Being a smoker and drinking beer last
night are not related.
Alternative Hypothesis: Being a smoker and drinking
beer last night are related.
Chi-square Statistic



We usually use the Chi-square Statistic to answer this
type of question.
The Chi-square Statistic is used to assess the statistical significance
of the association between 2 categorical variables. A
large Chi-square Statistic indicates there is a statistically
significant relationship between the 2 variables.
How does the Chi-square Statistic work?
It measures the difference between the observed counts
and the counts that would be expected if there were no
relationship (under the null hypothesis).
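To make the comparison of observed and expected counts concrete, here is a minimal Python sketch applied to the smoke/beer table (counts 9, 2, 1, 3). It assumes the usual Pearson form of the statistic, in which the expected count for a cell is (row total × column total) / n and the statistic sums (observed − expected)² / expected over all cells. The function names are made up for the example, and the presentation itself obtains these numbers from Minitab.

```python
# Observed counts from the smoke/beer cross tabulation.
# Rows: smoke (n, y); columns: beer (n, y).
observed = [[9, 2],
            [1, 3]]


def expected_counts(table):
    """Counts expected under the null hypothesis of no relationship."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def chi_square(table):
    """Sum of (observed - expected)^2 / expected over every cell."""
    expected = expected_counts(table)
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )


expected = expected_counts(observed)
statistic = chi_square(observed)

print("expected counts:", [[round(e, 2) for e in row] for row in expected])
print("chi-square statistic:", round(statistic, 3))
```

With only 15 students, three of the four expected counts fall below 5, so the chi-square approximation is rough for this table and the result should be read with caution.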
Chi-Square Statistic and p-value




A large Chi-square Statistic indicates there is a statistically
significant relationship between the 2 variables. However,
how large is large?
This is why we need to use “p-value” as an indicator to
tell us if the Chi-square Statistic is “large enough”.
We can obtain the p-value in our Minitab output.
How to use the p-value?
1. The bigger the Chi-square Statistic is, the smaller the p-value will be.
2. Generally, when the p-value is less than 0.05 (5%), we conclude
   that the observed relationship is unlikely to have occurred by
   chance alone, and we call it statistically significant.
3. Generally, when the p-value is larger than 0.05 (5%), we say
   the observed relationship could have occurred just by chance.
   Therefore, we cannot reject the null hypothesis that there is no
   relationship.
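The decision rule can be sketched in Python for the smoke/beer table. Two assumptions are made here that the slides do not state: the statistic is the Pearson chi-square without a continuity correction, and the p-value is the upper-tail area of a chi-square distribution with 1 degree of freedom (the value for a 2×2 table), which can be written with the complementary error function. The function names are hypothetical; in the presentation the p-value is read from the Minitab output.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


def p_value_df1(statistic):
    """Upper-tail probability of a chi-square distribution with 1 degree of freedom."""
    return math.erfc(math.sqrt(statistic / 2))


# Smoke/beer counts: (smoke n, beer n), (smoke n, beer y),
#                    (smoke y, beer n), (smoke y, beer y).
statistic = chi_square_2x2(9, 2, 1, 3)
p_value = p_value_df1(statistic)
significant = p_value < 0.05  # the 5% cutoff used in the rule

print(f"chi-square = {statistic:.3f}, p-value = {p_value:.3f}")
print("statistically significant" if significant else "could be due to chance")
```

The shortcut formula and the `erfc` expression apply only to 2×2 tables; larger tables have more degrees of freedom and need a general chi-square distribution function.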
Example: Part 3 of the activity….