# STAT 460 Lab 10
Lab 10 Turn in Sheet
To receive credit for this lab, turn this sheet in before leaving the lab.
Name: _____________________________________________
Lab Section: ____
1. What is the p-value? What null hypothesis is rejected?
2. What is a one unit change for each of the two variables?
3. Interpret the cross-tabulation.
4. List one or two things that are still unclear to you.
## STAT 460 Lab 10 Instructions
11/22/2004
Goals: In this lab you will learn to perform chi square tests and learn about logistic regression.
Part I. Chi-square test of independence
Summary:
Chi-square tests are used to test whether two variables measured on a group of subjects are
independent. A table of cell counts for each combination of the levels of X and Y, called a
contingency table, is usually produced first. If X and Y are independent, then the probability
distribution of X is the same for each level of Y (and vice versa). The chi-square statistic is the sum of
(observed – expected)²/expected, where "expected" is the expected number of subjects in each
cell if X and Y are independent. The null distribution of the chi-square statistic is the chi-square
distribution with (r-1)*(c-1) df, where r is the number of rows and c is the number of columns in
the contingency table.
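The expected counts and the statistic described above can be sketched in Python (the 2x2 counts below are made up for illustration):

```python
# Chi-square test of independence: a minimal sketch with made-up counts.
table = [[10, 20],
         [30, 40]]

row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
total = sum(row_totals)

# Expected count under independence: (row total * column total) / grand total
expected = [[r * c / total for c in col_totals] for r in row_totals]

# Chi-square statistic: sum of (observed - expected)^2 / expected over all cells
chi_square = sum((obs - exp) ** 2 / exp
                 for obs_row, exp_row in zip(table, expected)
                 for obs, exp in zip(obs_row, exp_row))

# Degrees of freedom: (rows - 1) * (columns - 1)
df = (len(table) - 1) * (len(table[0]) - 1)
```

Comparing `chi_square` to the chi-square distribution with `df` degrees of freedom gives the p-value.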
a) Categorical explanatory and categorical outcome (response) variables or two categorical
outcome variables
b) Null hypothesis: X is independent of Y or Pr(Y=1|X=0)=Pr(Y=1|X=1)=…
c) Construct a contingency table that counts the number of subjects for each combination of
levels of variable X and variable Y.
d) Expected value (under the null hypothesis) = (row count * column count)/(total count)
e) The chi-square statistic is X² = Σ (observed – expected)²/expected, summed over all cells.
f) Under H0, X2 follows a chi-square (χ2) distribution with (r-1)*(c-1) d.f.
g) SAS (Statistics/Table Analysis with Statistics:ChiSquare):
A manufacturer was considering marketing crackers high in a certain kind of edible fiber as
a dieting aid. Dieters would consume some crackers before a meal, filling their stomachs so that
they would feel less hungry and eat less. A laboratory studied whether people would in fact eat
less in this way.
Overweight female subjects ate crackers with different types of fiber (bran fiber, gum fiber,
both, and a control cracker) and were then allowed to eat as much as they wished from a prepared
menu. The amount of food they consumed and their weight were monitored, along with any side
effects they reported. Unfortunately, some subjects developed uncomfortable bloating and gastric
upset from some of the fiber crackers. A contingency table of "Cracker" versus "Bloat" shows the
relationship between the four different types of cracker and the four levels of bloating severity
(1=high, 2=low, 3=medium, 4=none) as reported by the subjects.
We will use a Chi-Square test to see whether Bloating is independent of Cracker (the type of fiber
eaten).
b. Create a contingency table and run the chi-square test as follows. From the menu choose
Statistics/Table Analysis. Enter "cracker" as rows and "bloat" as columns. Under Statistics,
click Chi Square and Exact Test. Under Tables, add expected counts and row and column
percents.
c. In the cross-tabulation results, notice, e.g.,
1. the observed count for no bloating with bran fiber is 7
2. the expected count under the null is 4.25 = (12 x 17)/48
3. the percent of subjects with no bloat in the four different fiber types is (bran) 58.3,
(combo) 16.7, (control) 50.0 and (gum) 16.7%.
4. Row Pct gives conditional distributions for BLOAT given CRACKER; e.g., the
conditional distribution of BLOAT given CRACKER is BRAN is
(0, 0.333, 0.083, 0.583), and these values sum to 1.
Further, e.g., P(bloat=2|cracker=bran)=0.333.
i. What is the conditional distribution P(bloat| cracker= control)?
5. Col Pct gives conditional distributions for CRACKER given BLOAT; e.g., the
conditional distribution of CRACKER given BLOAT is level 2 is
(0.267, 0.333, 0.267, 0.133), and these values sum to 1.
Further, e.g., P(cracker=combo|bloat=2)=0.333.
i. What is the conditional distribution P(cracker|bloat=1)?
6. The odds of no bloating are (number of subjects with no bloating)/(number of
subjects with bloating) to 1, e.g., 17 to 21, or 17/21 to 1.
7. Relative risk of being bloated based on whether you are eating a control cracker
or not: (10/12)/(21/36)=1.43 to 1
i. You can do the above calculation manually, or create a collapsed 2x2 table
to get the summarized counts and then calculate the risk: use
Data/Transform/Recode values to recode BLOAT levels 1, 2, and 3 to 1 and
level 4 to 0, and CRACKER control to 0 and the rest to 1.
ii. Under Reports/Tables, choose Row classes/Column classes.
8. Increased risk is 10/12 – 21/36 = 0.25
9. Odds Ratio = (2*21)/(10*15) = 7/25 = 0.28
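The calculations in steps 6–9 can be sketched in Python, using the collapsed 2x2 counts implied by the relative risk and odds ratio formulas above (control: 10 bloated, 2 not; fiber crackers: 21 bloated, 15 not):

```python
# Risk measures from the collapsed 2x2 table implied by steps 7-9
# (rows: control vs. fiber crackers; columns: bloated vs. not bloated).
control_bloat, control_ok = 10, 2
fiber_bloat, fiber_ok = 21, 15

risk_control = control_bloat / (control_bloat + control_ok)   # 10/12
risk_fiber = fiber_bloat / (fiber_bloat + fiber_ok)           # 21/36

relative_risk = risk_control / risk_fiber    # step 7: about 1.43
increased_risk = risk_control - risk_fiber   # step 8: 0.25
# Step 9, as written in the text: (2 * 21) / (10 * 15)
odds_ratio = (control_ok * fiber_bloat) / (control_bloat * fiber_ok)
```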
d. In the Chi Square tests, the "Value" is the chi-square test statistic. The "Prob" is the
asymptotic p-value. The word asymptotic indicates that the null sampling distribution
used to convert from the statistic to the p-value is only an approximation when the
sample size is not large. Many people consider the approximation inappropriate
when some cells have expected counts below 5, but somewhat smaller values are
usually OK.
e. (♠1) What is the p-value? What null hypothesis is rejected?
f. One solution to the problem of low cell counts is to combine categories. Use
Data/Transform/Recode Values. Create bloat2 (the label for New Column Name) from bloat.
Note that 2 and 4 are the codes for "low" and "none". Use Original and New Values to
code 2 and 4 to 0 (no bloat) and to code 1 and 3 to 1 (bloat). Rerun the analysis and
reinterpret. This p-value is more reliable because few cells have expected counts much
below 5. What kind of cracker would you avoid?
Part II. Binary Logistic Regression
Summary:
Logistic regression is a powerful way to relate one or more explanatory variables to a binary
(categorical) outcome. It mirrors linear regression, but the linear combination of coefficients and
explanatory variables, β0+β1X1+…+βpXp, instead of representing the mean outcome, represents
the log odds of the probability of “success”.
The odds equal probability/(1-probability). Odds run between 0 and infinity. Log odds run
between –infinity and +infinity, so every possible linear combination corresponds to a valid
probability between 0 and 1. To convert from the linear combination (η) to probability, the
formula is exp(η)/(1+exp(η)).
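The conversion between the linear combination (log odds) and probability can be sketched in Python:

```python
import math

def inv_logit(eta):
    """Convert a linear-combination value (log odds) to a probability,
    using exp(eta) / (1 + exp(eta))."""
    return math.exp(eta) / (1 + math.exp(eta))

def logit(p):
    """Log odds of a probability: log(p / (1 - p))."""
    return math.log(p / (1 - p))
```

For example, `inv_logit(0)` is 0.5, so a linear combination of zero corresponds to even odds of success.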
When making predictions with logistic regression equations, we have “additive” effects on the
log odds scale. E.g., b1=2, then we estimate that the log odds of success increase by 2 for each
one-unit increase in x1. If we want to work on the odds scale, the properties of logs tell us that we
now have multiplicative effects. Using the same example, exp(2)=7.39, so the odds of success get
multiplied by 7.39 for each one-unit increase in x1. There is no easy way to express the change in
probability of success for a one-unit increase in x1; the best you can do is to calculate the
probability predicted by the model for several meaningful values of x1.
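The multiplicative effect on the odds can be verified numerically (a sketch; the intercept value of -1 is arbitrary, chosen only for illustration):

```python
import math

b1 = 2.0  # example slope from the text

def prob(eta):
    """Probability of success from the log odds."""
    return math.exp(eta) / (1 + math.exp(eta))

def odds(p):
    """Odds from a probability."""
    return p / (1 - p)

# Probabilities at x1 = 0 and x1 = 1 (arbitrary intercept of -1)
eta0, eta1 = -1.0, -1.0 + b1
ratio = odds(prob(eta1)) / odds(prob(eta0))
# ratio equals exp(b1) = 7.39..., the multiplicative effect on the odds
```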
The assumptions of logistic regression include that, for any group of subjects with some fixed
combination of explanatory variables, the outcome follows the binomial distribution bin(n,p),
where n is the group size and p is the predicted probability of "success". (The binomial
distribution is just that of flipping an unfair coin with heads probability equal to p.) Since we
often don't have any groups of subjects with identical explanatory variables, the Hosmer-Lemeshow
goodness-of-fit test makes groups of similar subjects and then tests whether the groups are
consistent with the binomial distribution. A low p-value suggests that the model is suspect, either
through an inappropriate selection of explanatory variables (e.g., a missing interaction) or through
an outcome probability process that is not inherently binomial.
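The binomial distribution mentioned above can be illustrated directly; a minimal sketch (the group size and success probability below are made up for illustration):

```python
import math

def binomial_pmf(k, n, p):
    """Probability of exactly k 'successes' (heads) in n flips of an
    unfair coin with heads probability p."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

# E.g., a group of 10 similar subjects, each with predicted success
# probability 0.3: the chance that exactly 3 succeed.
p3 = binomial_pmf(3, 10, 0.3)
```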
a. Generalized Linear Model approach looks like regression but the linear
combination of coefficients and explanatory variables is related to the outcome
through a “link function”, usually symbolized as g( ). Explanatory variables can
be continuous, coded categorical, transformed for non-linearity, and multiplied
across variables for interaction.
b. Let η=β0+β1X1+…+βkXk, where η is pronounced "eta" and k is the number of
explanatory variables (including expansions of categorical explanatory variables
with more than two levels). Remember, in ordinary regression, μ(Y|X)=η.
c. Let π=Pr(Y=1|X) be the probability of “success” for any combination of
explanatory variables.
d. The link function for logistic regression is g(π)=log(π/(1-π)) which is the log
odds of success or “logit” of the success probability.
e. Note: Odds(Y)=Pr(Y)/(1-Pr(Y)); e.g., for p=0.2, 0.5, 0.75, odds=0.25, 1, 3. Also,
log(odds)=-1.39, 0, 1.10.
f. In logistic regression g(π)=η, or log(π/(1-π)) = β0+β1X1+…+βkXk. Here is a plot
of the relationship between one explanatory variable and the probability of
success when all of the other explanatory variables are held constant:
g. To estimate the probability of success for any combination of explanatory
variables, calculate the log odds of success, η, then use the logistic formula p =
exp(η) / (1+ exp(η)). Note: we can break this down to odds=exp(log odds) and
probability=odds/(1+odds).
h. Assumptions:
i. binomial outcome,
ii. logistic relationship for Pr(Y=1) vs. each X,
iii. variance(Y|X)=Pr(Y=1|X)*(1-Pr(Y=1|X)),
iv. "fixed" X,
v. independent errors.
i. Summary: Logistic regression handles binary outcomes by working on the scale
of "log odds of success".
j. SAS results (Statistics/Regression/Logistic)
a. Load donner.txt into SAS. Perform some appropriate EDA.
b. Scientific hypothesis: women are better able to survive harsh conditions than men
(children excluded from analysis)
c. Statistical model: LogOdds(survival | age, gender) = β0 + βageAge + βfemaleFemale,
H0: βfemale=0
βfemale is the change in log odds of success when comparing a female to a male of the same age.
βage is the change in log odds of success for each one-year increase in age.
d. Perform the logistic regression to model the log odds of survival on the age and gender of
the adult pioneers. Use Regression/Logistic. Set the Dependent variable to Survived, and
enter both covariates, Female and Age, as Quantitative. Set Model Probability to 1.
e. Output Analysis:
```
                     Model Information

    Data Set                     _PROJ_.DONNER
    Response Variable            SURVIVED
    Number of Response Levels    2
    Number of Observations       45
    Model                        binary logit
    Optimization Technique       Fisher's scoring

                     Response Profile

         Ordered                     Total
           Value    SURVIVED     Frequency
               1    1                   20
               2    0                   25

    Probability modeled is SURVIVED='1'.

                Model Convergence Status

    Convergence criterion (GCONV=1E-8) satisfied.

                  Model Fit Statistics

                     Intercept       Intercept
    Criterion             Only  and Covariates
    AIC                 63.827          57.256
    SC                  65.633          62.676
    -2 Log L            61.827          51.256

          Testing Global Null Hypothesis: BETA=0

    Test                Chi-Square    DF    Pr > ChiSq
    Likelihood Ratio       10.5703     2        0.0051
    Score                   9.0965     2        0.0106
    Wald                    6.8627     2        0.0323
```
f. The Variables in the Equation box has a line for the constant (intercept) and each
explanatory variable. Check the columns for the coefficient, its standard error, and the
Wald statistic, (B/SE(B))². Note that two p-values are significant. The intercept
coefficient represents the log odds of survival for zero-year-old males. Why is this not a
worthwhile quantity to interpret?
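As a check, the Wald statistics can be recomputed from the printed estimates (a Python sketch; small discrepancies arise because the estimates are printed to four decimals):

```python
# Wald chi-square = (estimate / standard error)^2, using the rounded
# SAS estimates reported for this model.
def wald(estimate, se):
    return (estimate / se) ** 2

wald_intercept = wald(1.6331, 1.1102)  # reported as 2.1637
wald_age = wald(-0.0782, 0.0373)       # reported as 4.3988
wald_female = wald(1.5973, 0.7555)     # reported as 4.4699
```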
```
            Analysis of Maximum Likelihood Estimates

                              Standard        Wald
    Parameter    DF  Estimate    Error  Chi-Square  Pr > ChiSq
    Intercept     1    1.6331   1.1102      2.1637      0.1413
    AGE           1   -0.0782   0.0373      4.3988      0.0360
    FEMALE        1    1.5973   0.7555      4.4699      0.0345
```
```
                  The LOGISTIC Procedure

                  Odds Ratio Estimates

                  Point          95% Wald
    Effect     Estimate   Confidence Limits
    AGE           0.925      0.860     0.995
    FEMALE        4.940      1.124    21.716
```
```
    Association of Predicted Probabilities and Observed Responses

    Percent Concordant    73.0    Somers' D    0.492
    Percent Discordant    23.8    Gamma        0.508
    Percent Tied           3.2    Tau-a        0.248
    Pairs                  500    c            0.746
```
g. The Exp(B) represents the change in odds when the explanatory variable goes up one unit.
(♠2) What is a one unit change for each of the two variables?
1. Prediction example: 21 year old male
Eta: η = 1.633 - 0.078(21) = -0.005
Odds(survival) = exp(-0.005)=0.995 (1 survives for every 1 that dies)
Pr(survival) = 0.995 / (1+0.995) = 0.499
2. Prediction example: 21 year old female
Eta: η = 1.633 - 0.078(21) +1.597 = 1.592
Odds(survival) = exp(1.592)=4.914
(5 survive for every 1 that dies)
Pr(survival) = 4.914 / (1+4.914) = 0.831
3. Comparing odds (gender): exp(1.597)=4.938, so the odds for a female are 4.938
times those for a male of the same age; check: 0.995*4.938=4.914.
4. Comparing odds (per year): exp(-0.078)=0.925, so the odds are 0.925 times
as large (for either gender) for each 1 year increase in age.
5. Comparing odds (per decade): exp(-0.78)=0.458, so the odds are 0.458
times as large (for either gender) for each 10 year increase in age.
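The prediction examples above can be sketched in Python, using the rounded coefficients from the output:

```python
import math

# Coefficients rounded as in the prediction examples above
b0, b_age, b_female = 1.633, -0.078, 1.597

def survival_prob(age, female):
    """Predicted survival probability for an adult pioneer
    (female = 1 for women, 0 for men)."""
    eta = b0 + b_age * age + b_female * female  # log odds
    odds = math.exp(eta)
    return odds / (1 + odds)

p_male_21 = survival_prob(21, 0)    # about 0.499
p_female_21 = survival_prob(21, 1)  # about 0.831
```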
h. Now rerun the logistic regression with FEMALE recoded as a class variable. Be sure to
change (recode) the values so that the "Female" coefficient will reflect female relative to
male, not vice versa. Under Model, select Backward Elimination (this performs automatic
model selection for a best model). Under Statistics, choose Goodness-of-fit. Under
Prediction, predict the original sample and save the predictions.
i. First look at the Class Level Information. Design variable 1 is a dummy-type code of 1 for
females and -1 for males. Now, in Analysis of Maximum Likelihood Estimates, look for
the coefficients of the model. Compare them to the first model we ran. How is the
interpretation different?
1. For example, prediction for a 21 year old male:
Eta: η = 2.4318 - 0.078(21) + 0.796(-1) = -0.005
j. The classification table is worthwhile if classification of new cases is your goal (but future
cases tend to be classified less well than current cases). Here our main interest is
interpretation of the coefficients to test the hypothesis that females survive better than
males after correcting for age.
k. Graphical summary of model results
l. Calculate the eta (η) values, which are the log odds of survival. Then calculate exp(η) to
get the odds of survival. Then calculate odds/(1+odds) to find the probability of survival.
| Age | Female | LogOdds | Odds | Probability |
|-----|--------|---------|------|-------------|
| 25  | 0      |         |      |             |
| 25  | 1      |         |      |             |
| 50  | 0      |         |      |             |
| 50  | 1      |         |      |             |
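One way to compute the table entries is sketched below in Python, using the rounded coefficients from the earlier output:

```python
import math

# Coefficients rounded as in the earlier prediction examples
b0, b_age, b_female = 1.633, -0.078, 1.597

rows = []
print(f"{'Age':>4} {'Female':>6} {'LogOdds':>8} {'Odds':>7} {'Probability':>11}")
for age, female in [(25, 0), (25, 1), (50, 0), (50, 1)]:
    log_odds = b0 + b_age * age + b_female * female
    odds = math.exp(log_odds)          # odds = exp(log odds)
    prob = odds / (1 + odds)           # probability = odds / (1 + odds)
    rows.append((age, female, round(log_odds, 3), round(odds, 3), round(prob, 3)))
    print(f"{age:>4} {female:>6} {log_odds:>8.3f} {odds:>7.3f} {prob:>11.3f}")
```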