Statistics March 16, 2011 Jobayer Hossain, Ph.D. Nemours Bioinformatics Core Facility

Statistics
Analyses of Categorical Variables
March 16, 2011
Jobayer Hossain, Ph.D.
Nemours Bioinformatics Core Facility
Nemours Biomedical Research
Categorical Variable
• Observations belong to a finite set of discrete
categories or groups.
• Gender, race, severity of a disease are some
examples of categorical variables.
• Descriptive Statistics: Frequencies, percentages and
proportions are usually used to describe data of a
categorical variable.
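As a minimal sketch of these descriptive statistics, the Python snippet below tabulates frequencies, proportions, and percentages for one categorical variable. The values in `severity` are made-up illustration data, not taken from any data set used in this course, and the function name is hypothetical.

```python
from collections import Counter

# Hypothetical observations of a categorical variable (illustration only).
severity = ["mild", "severe", "mild", "moderate",
            "mild", "moderate", "severe", "mild"]


def describe_categorical(values):
    """Return {category: (frequency, proportion, percentage)}."""
    counts = Counter(values)
    n = len(values)
    return {cat: (k, k / n, 100 * k / n) for cat, k in counts.items()}


summary = describe_categorical(severity)
for cat, (freq, prop, pct) in summary.items():
    print(f"{cat:10s} frequency={freq}  proportion={prop:.3f}  percent={pct:.1f}%")
```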
Categorical Variables in Retinoid data
• There are three categorical variables in the Retinoid data set available on the course website:
– trt (treatment groups)
– bmigrp2 (obese or non-obese at baseline)
– bmigrp3 (non-obese, overweight, and obese at baseline)
Characterizing categorical variables
Calculating frequencies and percentages for categories of variables trt, bmigrp2, and bmigrp3 in the Retinoid data set
SPSS output: characterizing categorical
variables
Proportion Tests
• Use
– Test for the value of a single proportion
• E.g., to test whether the proportion of obese patients among the 39 participating patients equals a specified value (say, 0.5).
– Test for equality of two or more proportions
• E.g., to test whether the proportions of obese patients in the two treatment groups are equal.
Test for the value of a single proportion
SPSS output: Test for the value of a
single proportion
We are testing the hypothesis that the proportion of lean subjects is 0.5, i.e.,
H0: the proportion of lean = 0.5, against the alternative
Ha: the proportion of lean ≠ 0.5.
The observed proportion of lean subjects is 0.33, which is 0.17 (0.5 − 0.33) smaller than the hypothesized proportion (0.5). The p-value is 0.053, which is marginally higher than the level of significance (0.05), so H0 is not rejected at the 5% level.
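The single-proportion test can be sketched in Python with an exact two-sided binomial test. This is an illustration under stated assumptions: the count of 13 lean subjects out of 39 is inferred from the reported proportion (0.33 × 39 ≈ 13) and is not given explicitly in the slides, the function name is hypothetical, and the exact binomial test shown here may not be the method SPSS applied.

```python
from math import comb


def binomial_test_two_sided(k, n, p0):
    """Exact two-sided binomial test p-value for k successes in n trials.

    Sums the probabilities of all outcomes that are no more likely
    than the observed one under H0: proportion = p0.
    """
    pmf = [comb(n, i) * p0**i * (1 - p0) ** (n - i) for i in range(n + 1)]
    cutoff = pmf[k] * (1 + 1e-9)  # small tolerance for floating-point ties
    return min(1.0, sum(p for p in pmf if p <= cutoff))


# Assumption: 13 lean subjects out of 39 (0.33 * 39 is about 13).
p_value = binomial_test_two_sided(13, 39, 0.5)
print(f"observed proportion = {13 / 39:.2f}, p-value = {p_value:.3f}")
```

With these assumed counts the p-value rounds to 0.053, in line with the SPSS output discussed above.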
Chi-square Test
• Formula: If $x_i$ ($i = 1, 2, \dots, n$) are independent and normally distributed with mean $\mu$ and variance $\sigma^2$, then
$$\sum_{i=1}^{n} \left( \frac{x_i - \mu}{\sigma} \right)^2$$
follows a $\chi^2$ distribution with n d.f.
• If we don't know $\mu$, then we estimate it using the sample mean $\bar{x}$, and then
$$\sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{\sigma} \right)^2$$
follows a $\chi^2$ distribution with (n − 1) d.f.
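A small simulation can make the loss of one degree of freedom visible. The sketch below relies on the standard fact that a chi-square variable with k d.f. has mean k; the values of `mu`, `sigma`, `n`, and `reps` are arbitrary illustration choices, and the function names are hypothetical.

```python
import random


def sum_sq_known_mean(xs, mu, sigma):
    """Sum of squared standardized deviations from the true mean."""
    return sum(((x - mu) / sigma) ** 2 for x in xs)


def sum_sq_sample_mean(xs, sigma):
    """Sum of squared standardized deviations from the sample mean."""
    xbar = sum(xs) / len(xs)
    return sum(((x - xbar) / sigma) ** 2 for x in xs)


random.seed(0)
mu, sigma, n, reps = 10.0, 2.0, 5, 20000  # arbitrary illustration values

known, estimated = [], []
for _ in range(reps):
    xs = [random.gauss(mu, sigma) for _ in range(n)]
    known.append(sum_sq_known_mean(xs, mu, sigma))
    estimated.append(sum_sq_sample_mean(xs, sigma))

mean_known = sum(known) / reps          # expected to be close to n
mean_estimated = sum(estimated) / reps  # expected to be close to n - 1
print(f"known mean: {mean_known:.2f}, estimated mean: {mean_estimated:.2f}")
```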
The Pearson Chi-squared Test
Consider a contingency table. The number of units that fall
in a cell is the cell’s observed frequency, and the number
predicted by theory to do so is the cell’s expected count.
The Pearson chi-squared test statistic to summarize the difference between observed and expected counts is
$$\chi^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i},$$
distributed as $\chi^2$ with (n − 1) d.f., where
$O_i$ = observed frequency
$E_i$ = expected frequency
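The statistic translates directly into code. The sketch below is a plain-Python illustration; the function name and the observed and expected counts are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def pearson_chi2(observed, expected):
    """Pearson chi-squared statistic: sum of (O - E)^2 / E over all cells."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


observed = [10, 20, 30]  # hypothetical observed frequencies
expected = [20, 20, 20]  # hypothetical expected frequencies (same total)
statistic = pearson_chi2(observed, expected)
print(statistic)  # (100 + 0 + 100) / 20 = 10.0
```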
The Pearson Chi-squared Test
• Use
– Testing the equality of proportions for all categories of a variable
– Testing user-specified proportions for all categories of a variable
– Testing the independence/association of attributes
– Testing the population variance, σ² = σ₀²
• Assumptions
– Sample observations should be independent.
– Expected cell frequencies should be ≥ 5.
– Total observed and expected frequencies are equal.
The Pearson Chi-squared Test– calculation of
expected frequency
• Expected frequency for any cell
– Single variable
• The probability associated with a cell is multiplied by the total number of subjects
– Two variables (contingency table)
• (row total × column total) / grand total
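Both rules for expected frequencies can be sketched in a few lines of Python. The function names, the probabilities, and the 2x2 table below are hypothetical illustration values, not data from the Retinoid study.

```python
def expected_single(probabilities, n_subjects):
    """Single variable: cell probability times total number of subjects."""
    return [p * n_subjects for p in probabilities]


def expected_table(table):
    """Contingency table: (row total * column total) / grand total per cell."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


single = expected_single([0.25, 0.25, 0.5], 40)  # hypothetical probabilities
table = [[10, 20],
         [30, 40]]                               # hypothetical observed counts
two_way = expected_table(table)

print(single)   # [10.0, 10.0, 20.0]
print(two_way)  # [[12.0, 18.0], [28.0, 42.0]]
```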
The Pearson Chi-squared Test– calculation of
degrees of freedom (df)
• Single variable:
(Number of categories − 1)
E.g., the variable bmigrp3 has 3 categories, so the df is (3 − 1) = 2.
• Two variables (contingency table):
(Number of rows − 1) × (Number of columns − 1)
E.g., to compare the distribution of obesity status (bmigrp3) between two treatment groups, the associated df is (3 − 1) × (2 − 1) = 2.
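The two degrees-of-freedom rules are simple enough to write as helper functions. The names `df_single` and `df_table` are hypothetical; the example calls mirror the bmigrp3 cases described above.

```python
def df_single(n_categories):
    """Degrees of freedom for a single categorical variable."""
    return n_categories - 1


def df_table(n_rows, n_cols):
    """Degrees of freedom for a two-way contingency table."""
    return (n_rows - 1) * (n_cols - 1)


print(df_single(3))    # bmigrp3 has 3 categories -> 2
print(df_table(3, 2))  # bmigrp3 by two treatment groups -> 2
```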
The Pearson Chi-squared Test–
Skewness
• The distribution of Chi-square statistic is positively
skewed. That is, it has a long tail to the right.
• As the df increases, the distribution of Chi-square
statistic becomes more symmetric.
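The shrinking skewness can be checked numerically. The sketch below uses the standard result, not stated in the slides, that a chi-square distribution with k degrees of freedom has skewness sqrt(8/k); the function name is hypothetical.

```python
from math import sqrt


def chi2_skewness(df):
    """Skewness of a chi-square distribution with df degrees of freedom."""
    return sqrt(8 / df)


# Skewness stays positive but moves toward 0 (symmetry) as df grows.
for df in (1, 2, 5, 10, 50, 100):
    print(f"df = {df:3d}  skewness = {chi2_skewness(df):.3f}")
```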
Testing the equality of proportions of a variable
H0: The proportions of subjects are equal in all three groups.
Ha: The three proportions are not all equal.
The asymptotic p-value is 0.006, which is much smaller than the level of significance (0.05). It indicates a significant difference in the proportions of subjects among the three groups.
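A test of equal proportions across three categories can be sketched as follows. The counts are hypothetical and are not the Retinoid data, so the result will not reproduce the SPSS p-value above. The p-value uses the closed form exp(−x/2), which is the chi-square upper tail probability for df = 2 only; that restriction is an assumption of this sketch.

```python
from math import exp


def equal_proportion_test(counts):
    """Pearson chi-squared test that three categories are equally likely.

    Returns (statistic, p_value). The p-value formula exp(-x / 2) is the
    chi-square upper tail for df = 2, so exactly three categories are required.
    """
    if len(counts) != 3:
        raise ValueError("this sketch handles exactly three categories (df = 2)")
    expected = sum(counts) / 3
    statistic = sum((o - expected) ** 2 / expected for o in counts)
    return statistic, exp(-statistic / 2)


counts = [5, 13, 21]  # hypothetical frequencies, for illustration only
statistic, p_value = equal_proportion_test(counts)
print(f"chi-squared = {statistic:.3f}, p-value = {p_value:.4f}")
```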
Testing the user specified proportions of a
variable
Let us test the hypothesis that the proportions of subjects are 0.3, 0.4, and 0.3 in the lean, overweight, and obese groups, respectively.
The asymptotic p-value is 0.001, which is much smaller than the level of significance (0.05). It indicates that the proportions of subjects in the three groups differ significantly from the specified proportions.
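The same calculation with user-specified proportions differs only in how the expected frequencies are built: each specified proportion is multiplied by the sample size. In the sketch below the hypothesized proportions 0.3, 0.4, and 0.3 come from the slide, while the observed counts and the function name are hypothetical, so the output does not correspond to the SPSS result. As before, the p-value formula is valid for df = 2 only.

```python
from math import exp


def specified_proportion_test(counts, proportions):
    """Pearson chi-squared test against user-specified proportions (df = 2).

    Returns (statistic, expected, p_value).
    """
    if len(counts) != 3 or len(proportions) != 3:
        raise ValueError("this sketch handles exactly three categories (df = 2)")
    if abs(sum(proportions) - 1.0) > 1e-9:
        raise ValueError("specified proportions must sum to 1")
    n = sum(counts)
    expected = [p * n for p in proportions]
    statistic = sum((o - e) ** 2 / e for o, e in zip(counts, expected))
    return statistic, expected, exp(-statistic / 2)


counts = [10, 20, 20]          # hypothetical lean, overweight, obese counts
proportions = [0.3, 0.4, 0.3]  # hypothesized proportions from the slide
statistic, expected, p_value = specified_proportion_test(counts, proportions)
print(f"expected = {expected}, chi-squared = {statistic:.3f}, p-value = {p_value:.3f}")
```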
Testing the independence/ association of
attributes
SPSS output: Testing the independence/
association of attributes
Testing the distribution of obesity status in the two treatment groups, that is, testing the association between obesity status and treatment group.
The value of the Pearson chi-squared statistic is 4.005 and the df is 2. The asymptotic p-value is 0.135, which is greater than the level of significance (0.05).
Question: Can we reject H0?
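The reported p-value can be checked from the statistic alone. For df = 2 the chi-square upper tail probability has the closed form exp(−x/2), so the sketch below needs no statistics library; the function name is hypothetical, and the formula does not apply to other degrees of freedom.

```python
from math import exp


def chi2_p_value_df2(statistic):
    """Upper tail probability of a chi-square distribution with df = 2."""
    return exp(-statistic / 2)


chi2_statistic = 4.005  # Pearson chi-squared value from the SPSS output
alpha = 0.05            # level of significance
p_value = chi2_p_value_df2(chi2_statistic)
reject_h0 = p_value < alpha
print(f"p-value = {p_value:.3f}, reject H0: {reject_h0}")
```

Because the p-value exceeds 0.05, the comparison evaluates to False: H0 is not rejected at the 5% level.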
Limitations of Chi-square Test
• The only output is a p-value; there is no other parameter to describe the degree or strength of association.
• It may not be appropriate for small sample sizes, especially when any cell has an expected frequency less than 5.
Fisher Exact Test
• An exact test for the analysis of 2x2 contingency tables
• Most useful for small sample sizes, especially when the Pearson chi-squared test is not applicable.
• The exact probability of observing the cell counts is
calculated using the hypergeometric distribution
• No distributional assumption is needed
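The hypergeometric calculation behind the test can be sketched with integer arithmetic only. The function name and the 2x2 table are hypothetical. The two-sided p-value here sums the probabilities of all tables (with the same margins) that are no more likely than the observed one, which is one common convention; statistical software may define the two-sided value differently.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    lowest = max(0, col1 - row2)  # smallest feasible top-left count
    highest = min(row1, col1)     # largest feasible top-left count
    # Integer hypergeometric weights for every feasible top-left cell value.
    weights = {x: comb(row1, x) * comb(row2, col1 - x)
               for x in range(lowest, highest + 1)}
    total = comb(row1 + row2, col1)  # equals the sum of all weights
    extreme = sum(w for w in weights.values() if w <= weights[a])
    return extreme / total


# Hypothetical small-sample table: rows are groups, columns are outcomes.
p_value = fisher_exact_two_sided(3, 1, 1, 3)
print(f"p-value = {p_value:.4f}")  # 34 / 70, about 0.4857
```

Comparing integer weights rather than floating-point probabilities avoids rounding problems when two tables are equally likely.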
Odds and Odds Ratio
• Like a proportion, the odds of an event are also very useful for describing categorical data.
• Odds, odds ratios, and related regressions will be covered in the next class.
Thank you