
Categorical - PV

P. V. Prathyusha
Research Assistant
Dept of Biostatistics
NIMHANS
Bivariate Description

Usually we want to study associations between two or more variables.

• Quantitative variables: show the data using scatterplots and correlation.

• Categorical variables: show the data using contingency tables.

• Mixture of a categorical and a quantitative variable: give numerical
summaries (mean, standard deviation) or side-by-side box plots for the
groups.

General Social Survey (GSS) data
Men:
mean = 7.0, s = 8.4
Women: mean = 5.9, s = 6.0
Categorical data

Variable         Categories                   Scale
---------------------------------------------------
Age (years)      < 15, 15-30, 30-45, 45+     Ordinal
Gender           M, F                        Nominal
Diagnosis        Normal, Abnormal            Nominal
Improvement      Mild, Moderate, Fair        Ordinal
SES              Low, Medium, High           Ordinal
Locality         Rural / Urban               Nominal
Anxiety score    < 13, 13-23, 24-40, 41+     Ordinal
Contingency Tables

• Cross-classifications of categorical variables in which
  rows (typically) represent categories of the explanatory variable and
  columns represent categories of the response variable.

• Counts in the “cells” of the table give the numbers of individuals at the
corresponding combination of levels of the two variables.

• Contingency tables enable us to compare one characteristic of the sample,
e.g. smoking, defined by another categorical variable, e.g. gender.
Happiness and Family Income

                        Happiness
Income         Very   Pretty   Not too   Row Total
--------------------------------------------------
Above Aver.     164      233        26        423
Average         293      473       117        883
Below Aver.     132      383       172        687
--------------------------------------------------
Col Total       589     1089       315       1993

Row and column totals are called marginal counts.
What can a contingency table do?

It can summarize by percentages on the response variable (happiness):

                            Happiness
Income       Very        Pretty       Not too      Total
--------------------------------------------------------
Above        164 (39%)   233 (55%)     26 (6%)       423
Average      293 (33%)   473 (54%)    117 (13%)      883
Below        132 (19%)   383 (56%)    172 (25%)      687
--------------------------------------------------------

Example: Percentage “very happy” is
• 39% for above average income (164/423 = 0.39)
• 33% for average income (293/883 = 0.33)
• 19% for below average income (132/687 = 0.19)
What can a contingency table do?

It can reveal association between two categorical variables. For example:

• Is there any association between gender and headache?

• Is there any association between taking aspirin and the risk of heart
attack in the population?

• Is lung cancer associated with smoking or not?

• Is diabetes associated with type of occupation or not?
Observed frequencies

• Depending on the subjects’ responses, the data can be summarized in a
table like this.
• The observed numbers or counts in the table are:

                 Headache
Gender       Yes      No     Marginal total (row)
-------------------------------------------------
Men           10      30              40
Women         23      17              40
-------------------------------------------------
Marginal      33      47              80
total (column)

This is what we have observed in the random sample of 80 subjects.
Sample to population …

• Knowing the incidence of headache in these 80 subjects with great
certainty is of limited use to us.
• On the basis of observed frequencies (or %), we can make claims about the
sample itself, but we cannot generalize to the population from which we
drew our sample unless we submit our results to a test of statistical
significance.
• A test of statistical significance tells us how confidently we can
generalize to a larger (unmeasured) population from a (measured) sample of
that population.
Steps in a test of hypothesis

1. Find out the type of problem and the question to be answered.
2. State the null & alternative hypotheses.
3. Calculate the standard error.
4. Calculate the critical ratio. Generally this is given by

       Difference between the means (proportions)
       -------------------------------------------
          Standard error of the difference

5. Compare the value observed in the experiment with that given by the
table, at a predetermined significance level.
6. Make inferences.
Testing of hypothesis

We need to measure how different our observed results are from the null
hypothesis. How does chi-square do this? It compares the observed
frequencies in our sample with the expected frequencies. What are expected
frequencies?
Expected frequencies

• If the null hypothesis were true, what would be the frequencies in each
cell?
• If there were no relation between gender & headache, we would expect the
numbers of men and women with and without headache to be the same.
• i.e., if men and women were equally affected by headache, these are the
numbers we would have expected in our sample of 80 people.

• Under the assumption of no association between gender and headache, the
expected count in each cell of the table is

    Expected number = (row total × column total) / table total

                   Headache
Gender       Yes            No           TOTAL
----------------------------------------------
Men          10 (16.5)      30 (23.5)      40
Women        23 (16.5)      17 (23.5)      40
----------------------------------------------
TOTAL        33             47             80

(Observed counts, with expected counts in parentheses.)
Chi-square value

The chi-square value is a single number that adds up all the differences
between the observed data and the expected data:

    χ² = Σ over all cells i,j of (Oij − Eij)² / Eij
       = Σ (Observed − Expected)² / Expected

    χ²cal = (10 − 16.5)²/16.5 + (30 − 23.5)²/23.5
            + (23 − 16.5)²/16.5 + (17 − 23.5)²/23.5 = 8.72

For a 2 × 2 table, an equivalent shortcut formula is

            N (|ad − bc|)²
    χ² = ---------------------
          R1 × R2 × C1 × C2
Theoretical chi-square value

Look up the theoretical chi-square value in the χ² distribution table with
d.f. = (r − 1)(c − 1), to see if the calculated value is big enough to
indicate a significant association of headache & gender.

For a 2 × 2 table like this, d.f. = (2 − 1)(2 − 1) = 1.
Critical value χ²(1, 0.05) = 3.841.
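The expected-count and chi-square calculations above can be sketched in a
few lines of plain Python (cell values taken from the gender × headache
table; no external libraries assumed):

```python
# Chi-square by hand for the gender x headache table.
observed = [[10, 30],   # Men:   headache yes, no
            [23, 17]]   # Women: headache yes, no

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count = (row total * column total) / table total
expected = [[r * c / n for c in col_totals] for r in row_totals]

chi_sq = sum((observed[i][j] - expected[i][j]) ** 2 / expected[i][j]
             for i in range(2) for j in range(2))

print(round(chi_sq, 2))  # 8.72, which exceeds the critical value 3.841
```

In practice one would use a library routine such as
`scipy.stats.chi2_contingency`, which returns the statistic, p-value,
degrees of freedom, and the expected-count table in one call.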
Degrees of freedom

                 Headache
Gender       Yes      No     TOTAL
----------------------------------
Men           10       ?       40
Women          ?       ?       40
----------------------------------
TOTAL         33      47       80

Degrees of freedom are the number of independent pieces of information in
the data set. Once one cell of this 2 × 2 table is fixed, the margins
determine all the remaining cells.

In a contingency table, the degrees of freedom are calculated as the
product of the # of rows − 1 and the # of columns − 1, or (r − 1)(c − 1).
Chi-square values …

If the observed data and expected data are identical (i.e., if there is no
difference), the chi-square value is 0.

Greater differences between expected and observed data produce a larger
chi-square value. The larger the chi-square value, the greater the evidence
that there really is an association.
Assumptions of the χ² test

• The sample must be randomly drawn from the population.
• Data must be reported in raw frequencies (not percentages).
• Categories of the variables must be mutually exclusive & exhaustive.
• Expected frequencies cannot be too small: the expected frequency should
be more than 5 in at least 80% of the cells.
Tables of higher dimensions

                        Happiness
Income         Very   Pretty   Not too    Total
-----------------------------------------------
Above Aver.     164      233        26      423
Average         293      473       117      883
Below Aver.     132      383       172      687
-----------------------------------------------
Total           589     1089       315     1993

The chi-square test can be employed for tables of higher dimensions too.
Higher dimension tables …

                  KNOWLEDGE
OCCUPATION      Poor       Good
-------------------------------
Govt sector     3 (20)       --
Pvt sector         --     6 (40)
Business        5 (33)     6 (40)
Unemployed      7 (47)     3 (20)

Chi-Square Value    df    Asymp. Sig. (2-sided)
10.691(a)            3    .014
(a) 4 cells (50.0%) have expected count < 5.

Combining the Govt and Pvt sector rows:

                  KNOWLEDGE
OCCUPATION      Poor       Good
-------------------------------
Govt / Pvt      3 (20)     6 (40)
Business        5 (33)     6 (40)
Unemployed      7 (47)     3 (20)

Chi-Square Value    df    Asymp. Sig. (2-sided)
2.691(a)             2    .260
(a) 2 cells (33.3%) have expected count < 5.
Higher dimension tables …

                          Depression
Family type      Normal   Borderline   Abnormal
-----------------------------------------------
Nuclear (n=39)   1 (50)     2 (67)      36 (95)
Joint (n=4)      1 (50)     1 (33)       2 (5)

Chi-Square Value    df    Asymp. Sig. (2-sided)
6.715(a)             2    0.035
(a) 5 cells (83.3%) have expected count < 5.

The Mann-Whitney U test would be appropriate here.
Collapsing tables

• Can often combine columns/rows to increase expected counts that are too
low
  ○ may increase or reduce interpretability
  ○ may create or destroy structure in the table
• There are no clear guidelines
  ○ avoid simply trying to identify the combination of cells that produces
    a “significant” result.

• Chi-square is basically a measure of significance.
• It is not a good measure of strength of association.
• It can help you decide if an association exists, but not tell how strong
it is.
Yates’ Correction

The chi-square distribution is a continuous distribution, and the
approximation to it breaks down when any one of the expected frequencies in
a 2 × 2 table is less than 5.

In such cases, Yates’ correction for continuity is applied. The formula for
the chi-square test with Yates’ correction is:

          N (|ad − bc| − N/2)²
    χ² = ----------------------
           R1 × R2 × C1 × C2
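A minimal sketch of the corrected and uncorrected shortcut formulas,
applied to the earlier gender × headache table (cells labelled a, b, c, d
as in the slides):

```python
# Yates-corrected chi-square for the 2x2 headache table.
a, b, c, d = 10, 30, 23, 17
n = a + b + c + d
r1, r2 = a + b, c + d          # row totals
c1, c2 = a + c, b + d          # column totals

chi_plain = n * abs(a * d - b * c) ** 2 / (r1 * r2 * c1 * c2)
chi_yates = n * (abs(a * d - b * c) - n / 2) ** 2 / (r1 * r2 * c1 * c2)

print(round(chi_plain, 2))  # 8.72 (matches the cell-by-cell sum)
print(round(chi_yates, 2))  # 7.43 (the correction pulls the value down)
```

Note how the correction always reduces the statistic, making the test more
conservative.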
Fisher’s Exact Test

The test can be used for 2 × 2 contingency tables when the sample sizes are
small.

           Column 1     Column 2     Total
------------------------------------------
Row 1         a            b         R1 = a+b
Row 2         c            d         R2 = c+d
------------------------------------------
Total     C1 = a+c     C2 = b+d     N = R1+R2

Fisher’s exact probability is given by

         R1! R2! C1! C2!
    p = -----------------, where N! = 1×2×3×…×(N−1)×N
         N! a! b! c! d!
              Age group
Gender    < 20 yrs   > 20 yrs   Total
-------------------------------------
Male          4          1        5
Female        1          5        6
-------------------------------------
Total         5          6       11

    p = (a+b)! (a+c)! (b+d)! (c+d)! / (N! a! b! c! d!)
    p = 5! 5! 6! 6! / (11! 4! 1! 1! 5!)
    p = .065

If the p value is less than or equal to 0.05, the null hypothesis is
rejected and the difference between the rows (or columns) is considered
significant.
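The factorial formula above translates directly into code; this sketch
reproduces the p = .065 of the gender × age-group example:

```python
from math import factorial

# Fisher's exact probability for the 2x2 gender x age-group table,
# using the factorial formula p = R1! R2! C1! C2! / (N! a! b! c! d!).
a, b, c, d = 4, 1, 1, 5
n = a + b + c + d

p = (factorial(a + b) * factorial(c + d)
     * factorial(a + c) * factorial(b + d)) / (
     factorial(n) * factorial(a) * factorial(b)
     * factorial(c) * factorial(d))

print(round(p, 3))  # 0.065 - not significant at the 0.05 level
```

Library routines such as `scipy.stats.fisher_exact` instead sum these
table probabilities over all tables at least as extreme, which is what a
two-sided test reports.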
Cochran (1954) suggests:

The decision regarding use of chi-square should be guided by the following
considerations:

1. When N > 40, use chi-square corrected for continuity.
2. When N is between 20 and 40, the chi-square test may be used if all the
expected frequencies are 5 or more. If any expected frequency is less than
5, use Fisher’s exact probability test.
3. When N < 20, use Fisher’s test in all cases.
McNemar’s Test

Used in the case of two related samples, or when there are repeated
measurements.

Can be used to test for significance of changes in “before-after” designs
in which each person is used as his own control. Thus the test can be used

• to test the effectiveness of a treatment / training program / therapy /
intervention, or
• to compare the ratings of two judges on the same set of individuals.
McNemar’s Test …

                  POST
PRE          Normal   Abnormal   Total
--------------------------------------
Normal          a        b        R1
Abnormal        c        d        R2
--------------------------------------
Total          C1       C2         N

Example 1: severity before and after treatment (Rx)

              Post Rx
Pre Rx    Severe   Mild   Total
-------------------------------
Severe       7      19      26
Mild         4      20      24
-------------------------------
Total       11      39      50

Example 2:

              Post
Pre        Mild   Severe   Total
--------------------------------
Mild        40       8       48
Severe      45       7       52
--------------------------------
Total       85      15      100

The McNemar test statistic uses only the discordant cells b and c and, with
the continuity correction, is given by:

    χ² = (|b − c| − 1)² / (b + c),  with 1 d.f.
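A minimal sketch of the statistic on the Example 1 counts, using the
standard continuity-corrected form of McNemar’s test (the discordant cells
are those whose classification changed between pre and post):

```python
# McNemar's test on the Example 1 before-after counts.
# b and c are the discordant cells: Severe -> Mild and Mild -> Severe.
b, c = 19, 4

# Continuity-corrected McNemar statistic, compared against chi-square
# with 1 degree of freedom (critical value 3.841 at the 0.05 level).
chi_sq = (abs(b - c) - 1) ** 2 / (b + c)

print(round(chi_sq, 2))  # 8.52 > 3.841, so the change is significant
```

Only the discordant pairs carry information about change; the concordant
cells (7 and 20) drop out of the statistic entirely.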
What is probability?

The probability of a favorable event is the fraction of times you expect to
see that event in many trials. In epidemiology and behavioural research, a
“risk” is considered a probability.

For example: you record 25 heads on 50 flips of a coin. What is the
probability of a heads?

    Probability of Heads = # Heads / # Trials = 25/50 = .50 or 50%

Remember: a probability should never exceed 1.0 or 100%.
Relative Risk (RR)

• Relative risk is the risk of developing a disease relative to exposure.
• It is most commonly used in cohort studies to study incidence.
• It is the ratio of the probability of the event occurring in the exposed
group versus the control (unexposed) group:

          Incidence of disease among exposed       a / (a+b)
    RR = -------------------------------------- = -----------
          Incidence of disease among unexposed     c / (c+d)

What is the risk of myocardial infarction (MI) if a patient is taking
aspirin versus a placebo?

• Risk of MI for the aspirin group = 50/1080 = .046 or 4.6%
• Risk of MI for the placebo group = 200/1770 = .11 or 11%

    RR = [a/(a+b)] / [c/(c+d)] = 0.046 / 0.11 = 0.418
RR - Example

               Lung Cancer   No Lung Cancer
Smokers            190            450
Non-smokers         70            700

    RR = [a/(a+b)] / [c/(c+d)] = [190/(190+450)] / [70/(70+700)]
       = 0.297 / 0.09 = 3.27

The risk of lung cancer is 3.27 times higher in smokers than in
non-smokers.

• RR = 1 : association between exposure and disease unlikely to exist
• RR > 1 : increased risk of disease among those that have been exposed
• RR < 1 : decreased risk of disease among those that have been exposed
            DM type II   No DM type II   Total
----------------------------------------------
BMI < 30        25            350          375
BMI > 30        65            200          265
----------------------------------------------
Total           90            550          640

    RR = [a/(a+b)] / [c/(c+d)] = (25/375) / (65/265) = 0.27

• Those who have BMI < 30 have less risk of developing Type II Diabetes.
What are Odds?

An “odds” is the probability of a favorable event occurring vs. not
occurring.

For example: what are the odds you will get a heads when flipping a fair
coin?

    Odds of heads = Probability of heads / (1 − Probability of heads)
                  = .50 / (1 − .50) = 1

“The odds of flipping heads to flipping tails is 1.”

In clinical and epidemiologic research, we use a ratio of two odds, the
Odds Ratio (OR), or a ratio of two risks, the Relative Risk (RR), to
express the strength of the relationship between two variables.
Odds Ratio (OR)

• The odds ratio is the ratio of two odds.
• Generally used in case-control studies to study prevalence.
• It is the ratio of the odds of exposure in cases to the odds of exposure
in controls.
• Provides an estimate of relative risk when the outcome is rare.

             Case   Control
Exposed        a       b
Unexposed      c       d

• Odds of exposure among the cases:    a/c
• Odds of exposure among the controls: b/d

                   Odds of exposure in cases      a/c     ad
    OR(Exposure) = ---------------------------- = ----- = ----
                   Odds of exposure in controls   b/d     bc
OR for cohort and cross-sectional studies

             Outcome Yes   Outcome No
Exposed           a            b
Unexposed         c            d

• Odds of outcome among the exposed:   a/b
• Odds of outcome among the unexposed: c/d

                  Odds of outcome among exposed     a/b     ad
    OR(Outcome) = ------------------------------- = ----- = ----
                  Odds of outcome among unexposed   c/d     bc

• The exposure odds ratio is equal to the disease odds ratio.
OR - Example

           Had MI   No MI
Aspirin      50      1030
Placebo     200      1570

• Odds of myocardial infarction (MI) if a patient is taking aspirin
  = 50/1030
• Odds of myocardial infarction (MI) if a patient is taking placebo
  = 200/1570

    OR = (50 × 1570) / (1030 × 200) = 0.38
OR - Example

                Lung Cancer   No Lung Cancer
Smoking Hx          190            450
No Smoking Hx        70            700

    OR = (190 × 700) / (450 × 70) = 4.22

            DM Type II   No DM Type II
BMI < 30        25            350
BMI > 30        65            200

    OR = (25 × 200) / (350 × 65) = 0.22
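Because the OR reduces to the cross-product ad/bc, it is a one-liner; this
sketch checks it against the three example tables above:

```python
# Odds ratio as the cross-product of a 2x2 table.
def odds_ratio(a, b, c, d):
    """OR = (a*d) / (b*c)."""
    return (a * d) / (b * c)

or_aspirin = odds_ratio(50, 1030, 200, 1570)   # MI: aspirin vs placebo
or_smoking = odds_ratio(190, 450, 70, 700)     # lung cancer: smoking Hx
or_bmi = odds_ratio(25, 350, 65, 200)          # DM II: BMI < 30 vs > 30

print(round(or_aspirin, 2))  # 0.38
print(round(or_smoking, 2))  # 4.22
print(round(or_bmi, 2))      # 0.22
```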
OR values

    OR(Exposure) = Odds of exposure in cases / Odds of exposure in controls

• OR = 1 : no association between exposure and outcome
• OR > 1 : indicates that the exposure is associated with an increased risk
of developing the disease
• OR < 1 : indicates that the exposure is associated with a reduced risk of
(protects against) developing the outcome
When to use RR & OR

             Outcome Yes   Outcome No
Exposed           a            b
Unexposed         c            d

          a / (a+b)            a/c
    RR = -----------    OR = -------
          c / (c+d)            b/d

In a rare condition, a and c will be very small compared to b and d, so the
relative risk becomes approximately

          a / b     ad
    RR ≈ ------- = ---- = OR
          c / d     bc

So, given a rare condition, the odds ratio approximates the relative risk.
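A quick numeric check of the rare-outcome approximation, using hypothetical
cohort counts (1000 exposed, 1000 unexposed, few events; the numbers are
illustrative, not from the slides):

```python
# When the outcome is rare, a << b and c << d, so OR ~= RR.
a, b = 5, 995     # exposed:   outcome yes / no
c, d = 10, 990    # unexposed: outcome yes / no

rr = (a / (a + b)) / (c / (c + d))   # exact relative risk
or_ = (a * d) / (b * c)              # odds ratio

print(round(rr, 3), round(or_, 3))   # the two values nearly coincide
```

Making the outcome common (say a = 400, c = 600) would pull the two values
far apart, which is why the OR-as-RR reading only holds for rare outcomes.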
• Relative Risk can only be calculated for prospective studies.
• Odds Ratio can be calculated for any of the designs:
  ○ case-control
  ○ cross-sectional
  ○ cohort
• A diagnostic procedure or test gives us an answer to the following
question: “How well does this test discriminate between two conditions of
interest (health and disease, two stages of a disease, etc.)?”

• This discriminative ability can be quantified by the following measures
of diagnostic accuracy:
  ○ sensitivity and specificity
  ○ positive and negative predictive values (PPV, NPV)
Sensitivity and Specificity

• These measure how ‘good’ a test is at detecting binary features of
interest (disease / no disease).
• Sensitivity is the ability of a test to correctly classify an individual
as ‘diseased’.
• The ability of a test to correctly classify an individual as disease-free
is called the test’s specificity.
• Usually the ‘true’ disease status is determined by some ‘gold standard’
method.
• For a given test, sensitivity increases as specificity decreases, and
vice versa.

          Disease      Disease
          Present      Absent       Total
-----------------------------------------
Test +       a            b          a+b   (total test positive)
Test −       c            d          c+d   (total test negative)
-----------------------------------------
Total       a+c          b+d
          (total       (total
          diseased)    normals)

    Sensitivity = a / (a+c)
    Specificity = d / (b+d)
          Disease      Disease
          Present      Absent      Total
----------------------------------------
Test +       25           2          27
Test −        5          68          73
----------------------------------------
Total        30          70         100

    Sensitivity = 25/30
    Specificity = 68/70
Positive Predictive Value (PPV)

• It is the percentage of patients with a positive test who actually have
the disease.
• PPV = a / (a+b)
      = Probability (patient has the disease when the test is positive)

Negative Predictive Value (NPV)

• It is the percentage of patients with a negative test who do not have the
disease.
• NPV = d / (c+d)
      = Probability (patient does not have the disease when the test is
negative)