Statistics Advanced concepts 1
Multiple Choice Question test
This document should be completed during the course and
submitted by the specified date
Student name:
Date submitted:
Other information that may be relevant (optional):
Important information


In the table below please indicate the correct option(s) for each question.
Some questions have more than one correct response. If that is the case, please ensure you indicate
all the correct responses; for example, if a and c are the correct ones, show this by writing a, c.
MCQ number: 1 to 30, with one answer column for each of the three sections: Survival analysis, Logistic regression and Multiple regression.
Red_laptop Document1
1. Survival analysis 1 Multiple choice Questions
1. Burton and Walls 1987 investigated the survival of patients on one of
three types of renal replacement therapy: peritoneal dialysis,
haemodialysis and transplantation (details given opposite). What is the
usual name for the exponential coefficient column? (one correct choice)
a. Hazard Rate (HR)
b. Hazard Ratio (HR)
c. Hazard probability
d. Hazard proportion
e. Hazard logarithm
2. Considering the results from Burton and Walls 1987 given opposite.
Which is the most appropriate way of interpreting the values in the
exponential coefficient column? (one correct choice)
a. Odds
b. Probability
c. Time to event
d. Proportion failing
e. Odds ratio
Burton P R, Walls J 1987 Selection-adjusted comparison of life-expectancy of patients
on continuous ambulatory peritoneal dialysis, haemodialysis, and renal
transplantation
Variables that significantly influenced probability of survival
Variable | Exponential coefficient (risk multiplying factor) | Statistical significance
Adverse:
Age (each additional decade) | 1.68 | p<0.0001
Amyloidosis | 8.26 | p<0.0001
Acute or acute-on-chronic presentation | 2.73 | p<0.005
Ischaemic heart disease | 1.65 | p<0.025
Convulsions | 3.17 | p<0.03
Beneficial:
Male sex | 0.48 | p<0.001
Parenthood | 0.45 | p<0.001
Pyelonephritis | 0.48 | p<0.02
Residence in Leicestershire | 0.64 | p<0.05
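Because each exponential coefficient is described as a risk multiplying factor, the factors for several characteristics multiply together. The following is a minimal sketch of that arithmetic using three values from the table. The dictionary, the function name and the assumption of no interaction between variables are illustrative only and are not taken from the paper.

```python
import math

# Risk multiplying factors copied from the table above.
hazard_ratios = {
    "age_per_decade": 1.68,
    "amyloidosis": 8.26,
    "male_sex": 0.48,
}

def combined_factor(profile):
    """Multiply the factors for a profile given as {variable: amount}.

    Assumes proportional hazards with no interaction terms
    (an assumption made for illustration only).
    """
    # Factors multiply, so their logarithms (the coefficients) add.
    log_total = sum(amount * math.log(hazard_ratios[name])
                    for name, amount in profile.items())
    return math.exp(log_total)

# Two additional decades of age: 1.68 * 1.68
two_decades = combined_factor({"age_per_decade": 2})

# A male patient one decade older than the reference patient: 1.68 * 0.48
older_male = combined_factor({"age_per_decade": 1, "male_sex": 1})

print(round(two_decades, 2), round(older_male, 2))
```

A factor above 1 multiplies the hazard upwards and a factor below 1 multiplies it downwards, which is why the table separates adverse from beneficial variables.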
3. Considering the results from Burton and Walls 1987 given above. Which variable represents the greatest hazard? (one correct choice)
a. Age (in decades)
b. Amyloidosis
c. Convulsions
d. Ischaemic heart disease
e. Acute or acute-on-chronic presentation
4. Considering the results from Burton and Walls 1987 given above. Which variable represents the greatest benefit? (one correct choice)
a. Male sex
b. Parenthood
c. Pyelonephritis
d. Residence in Leicestershire
e. Absence of ischaemic heart disease
5. Considering the results from Burton and Walls 1987 given above. If anyone were considering dropping a variable from the model, which one
would it most likely be? (one correct choice)
a. Male sex
b. Parenthood
c. Pyelonephritis
d. Residence in Leicestershire
e. Absence of ischaemic heart disease
6. Considering the results from Burton and Walls 1987 given above. What is the exponential coefficient value likely to be for the female
sex? (one correct choice)
a. 0
b. 0.5
c. 1
d. 1 - 0.48
e. 1 + 0.48
7. Considering the results from Rait et al 2010 given opposite. What is the
more usual term for the Y axis? (one correct choice)
a. Survival function S(t)
b. Logit
c. Inverse hazard
d. Actuarial survival
e. Proportion censored
8. Considering the results from Rait et al 2010 given opposite. The cohort
details below the x axis . . .? (one correct choice)
a. Are irrelevant and should not be shown
b. Confuse the issues
c. Are more important than the graph
d. Provide useful additional information
e. Can be calculated from the graph
9. When gathering the failure times to calculate the Kaplan Meier plot, which of the following statements is correct? (one correct choice)
a. Their accurate measurement is of minimal importance
b. They can be grouped into equal intervals
c. They can be calculated from other measures
d. Their accurate measurement is of major importance
e. It is best to collect them at the end of the study period only
10. Which of the following are not included in the censored observations? (one correct choice)
a. Those who experience the event during the followup period of the study
b. Those that are lost to followup
c. Those that fail to provide event data
d. Those subjects whose survival time is less than the followup period of the study
e. Those who experience the event after the followup period of the study
11. Censored observations . . .? (one correct choice)
a. Are more important than non-censored ones in survival analysis
b. Are assumed to be normally distributed over time
c. Are assumed to have the same survival chances as uncensored observations
d. Are essential to allow calculation of the Kaplan Meier plot
e. Are allocated to the baseline survival curve
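Questions 9 to 11 concern failure times and censoring in the Kaplan Meier plot. The following is a minimal sketch of the product-limit calculation, showing how a censored subject leaves the risk set without producing a step in the curve. The function name and the five follow-up times are hypothetical and are not taken from any study cited here.

```python
def kaplan_meier(times, events):
    """Return [(time, S(t))] at each distinct event time.

    times  : follow-up time for each subject
    events : 1 if the event was observed, 0 if the subject was censored
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Gather everyone whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths > 0:
            # Product-limit step: multiply by the proportion surviving time t.
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        # Censored subjects leave the risk set without changing S(t).
        at_risk -= leaving
    return curve

# Hypothetical data: five subjects, two of them censored (event = 0).
times = [2, 3, 3, 5, 8]
events = [1, 0, 1, 1, 0]
curve = kaplan_meier(times, events)
for t, s in curve:
    print(t, round(s, 3))
```

The sketch follows the common convention that a subject censored at time t is still counted as at risk for events occurring at that same time.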
12. A Cox regression analysis . . . (one correct choice)
a. Is used to analyse survival data when individuals in the study are followed for varying lengths of time.
b. Can only be used when there are censored data
c. Always assumes that the relative hazard for a particular variable is constant at all times
d. Uses the logrank statistic to compare two survival curves
e. Relies on the assumption that the explanatory variables (covariates) in the model are Normally distributed.
Personal note: I added (taken from p. 210) but can’t find the reference!
2. Logistic Regression Multiple choice Questions
1. In simple logistic regression the predictor (independent variable) . . .? (one correct choice)
a. is always interval/ratio data
b. must undergo a logarithmic transformation before undergoing logistic regression
c. must be in the range of 0 to 1
d. must represent ranked scores
e. must be a binary variable
2. A logistic regression model was used to assess the association between CVD and obesity. P is defined to be the
probability that a person has CVD; obesity was coded as 0 = non obese, 1 = obese.
log(P/(1-P)) = -2 + 0.7(obesity)
What is the log odds ratio for CVD in persons who are obese as compared to not obese? (one correct choice)
a. 0.7
b. -2
c. 2.7
d. Exp(0.7)
e. Exp(2)
Personal note: in the equation above, log(P/(1-P)) = -2 + 0.7(obesity), the left hand side log(P/(1-P)) is the logit (log odds) function
and the right hand side, -2 + 0.7(obesity), is called the linear predictor. Equating it with a + bx, a is the log odds for the reference
group (non obese) and b is the log odds ratio for the variable. Therefore 0.7 is the log odds ratio for obesity. REMEMBER: left hand
side log odds, right hand side a separate log odds ratio for each b.
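The relationship between the linear predictor, the log odds ratio and the probability can be checked numerically. The following is a minimal sketch using the coefficients from the equation log(P/(1-P)) = -2 + 0.7(obesity); the function and variable names are illustrative only.

```python
import math

def linear_predictor(obesity):
    # Coefficients from the equation log(P/(1-P)) = -2 + 0.7(obesity).
    return -2 + 0.7 * obesity

def probability(z):
    # Inverse logit: P = 1 / (1 + exp(-z)).
    return 1 / (1 + math.exp(-z))

log_odds_non_obese = linear_predictor(0)
log_odds_obese = linear_predictor(1)

# The difference in log odds between the two groups is the log odds ratio.
log_odds_ratio = log_odds_obese - log_odds_non_obese
odds_ratio = math.exp(log_odds_ratio)

p_non_obese = probability(log_odds_non_obese)
p_obese = probability(log_odds_obese)

print(round(log_odds_ratio, 3), round(odds_ratio, 3))
print(round(p_non_obese, 3), round(p_obese, 3))
```

The inverse logit can be written equivalently as exp(z) / (1 + exp(z)); both forms return the same probability.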
3. Which of the following formulae produces the correct value for the probability of having CVD (Cardio Vascular
Disease) from the logistic regression equation log(P/(1-P)) = -2 + 0.7(obesity), where Pi = 1/(1 + exp(-zi)) and zi is the
linear predictor (LP)? (one correct choice)
a. Pcvd = exp(-2 + 0.7) / (1 - exp(-(-2 + 0.7)))
b. Pcvd = exp(-2 + 0.7) / (1 + exp(-2 + 0.7))
c. Pcvd = exp(-2 + 0.7) / (1 + exp(-(-2 × 0.7)))
d. Pcvd = exp(-2 × 0.7) / (1 + exp(-(-2 + 0.7)))
e. Pcvd = exp(-2 + 0.7) / (1 + exp(-(-2 + 0.7)))
4. In logistic regression the logit is . . . : (one correct choice)
a. the natural logarithm of the odds.
b. an instruction to record the data.
c. a logarithm of a digit.
d. the cube root of the sample size.
5. In binomial logistic regression the dependent (or criterion) variable: (one correct choice)
a. is a random variable
b. is like the median and splits the data into two equal halves.
c. consists of two categories.
d. is expressed in bits.
6. A model in binomial logistic regression is: (one correct choice)
a. a set of predictors which classify cases on the dependent or criterion variable.
b. another name for a contingency table
c. a miniature version of the analysis based on a small number of participants.
d. the most common score
7. Like linear regression, logistic regression . . .: (one correct choice)
a. has one or more independent variables.
b. provides a value directly from an equation for the dependent variable
c. uses the same method to estimate b weights.
d. has a dependent variable.
8. A classification table: (one correct choice)
a. helps the researcher assess statistical significance.
b. indicates how well a model has predicted group membership.
c. indicates how well the independent variable(s) correlate with the dependent variable.
d. provides a basis for calculating the exp(b) value
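A classification table cross-tabulates observed group membership against the group predicted by the model. The following is a minimal sketch of how such a table can be built from predicted probabilities. The data, the function name and the 0.5 cutoff are assumptions made for illustration and do not reproduce the output of any particular package.

```python
def classification_table(observed, predicted_prob, cutoff=0.5):
    """Cross-tabulate observed outcome against predicted group membership.

    Returns a dict keyed by (observed, predicted) and the proportion
    of cases classified correctly.
    """
    table = {(o, p): 0 for o in (0, 1) for p in (0, 1)}
    for obs, prob in zip(observed, predicted_prob):
        # A case is predicted to be in group 1 if its probability reaches the cutoff.
        pred = 1 if prob >= cutoff else 0
        table[(obs, pred)] += 1
    correct = table[(0, 0)] + table[(1, 1)]
    return table, correct / len(observed)

# Hypothetical observed outcomes and model-predicted probabilities.
observed = [0, 0, 1, 1, 1, 0]
predicted_prob = [0.2, 0.6, 0.7, 0.4, 0.9, 0.1]

table, accuracy = classification_table(observed, predicted_prob)
print(table)
print(round(accuracy, 3))
```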
9. In simple logistic regression the traditional goodness of fit measure, -2(log likelihood of previous model – log
likelihood of current model), is: (one correct choice)
a. a statistic that does not follow a Chi square PDF.
b. indicates the spread of answers to a question.
c. an index of how closely the analysis reaches statistical significance.
d. how close the predicted findings are to actual findings.
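The goodness of fit measure in question 9 compares the log likelihoods of two models. The following is a minimal sketch, assuming two nested models that differ by a single parameter and using made-up log likelihood values; the function name is hypothetical. It is written so that an improved current model gives a positive statistic, and the p value relies on the fact that a chi square variable with 1 degree of freedom is a squared standard Normal variable.

```python
import math

def likelihood_ratio_test_1df(loglik_previous, loglik_current):
    """Likelihood ratio statistic and p value for one added parameter."""
    # Equivalent to (-2LL of previous model) - (-2LL of current model).
    chi_square = -2 * (loglik_previous - loglik_current)
    # Upper tail probability of a chi square distribution with 1 df.
    p_value = math.erfc(math.sqrt(chi_square / 2))
    return chi_square, p_value

# Hypothetical log likelihoods: constant-only model, then model with one predictor.
chi_square, p_value = likelihood_ratio_test_1df(-20.0, -17.5)
print(chi_square, round(p_value, 4))
```

For models that differ by more than one parameter the degrees of freedom change and a general chi square tail function would be needed instead of the 1 df shortcut used here.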
10. Step 0 in simple logistic regression is: (one correct choice)
a. when there is no correlation between the predictors and the outcome variables.
b. when there is 0 spread around the regression line.
c. when there are no predictors, only a constant term in the model.
d. when all the predictor variables are included in the model
11. In simple logistic regression the pseudo R square values . . . (one correct choice)
a. Provide a greater degree of accuracy than those provided in linear regression
b. Should not be thought of as effect size measures
c. Are not based on the -2Log Likelihood values
d. You should only consider the rough magnitude of them
e. Provide the most appropriate method of assessing parameters
12. Likelihood (in the statistical sense) . . . (one correct choice)
a. Is the same as a p value
b. Is the probability of observing a particular parameter value given a set of data
c. attempts to find the parameter value which is the most likely given the observed data.
d. minimises the difference between the model and the data
e. is another name for the probability
13. A Maximum Likelihood Estimator (in the statistical sense) . . . (one correct choice)
a. Is the same as a p value
b. Is the probability of observing a particular parameter value given a set of data
c. attempts to find the parameter value which is the most likely given the observed data.
d. Is the same as R Square
e. is another name for the probability
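Questions 12 and 13 distinguish the likelihood from the estimator that maximises it. The following is a minimal sketch for the simplest case, a single proportion, using a grid search over candidate parameter values. The data (7 events in 20 subjects) and all names are hypothetical; real software maximises the likelihood iteratively rather than by grid search.

```python
import math

def log_likelihood(p, successes, n):
    """Binomial log likelihood of parameter value p (constant term omitted)."""
    return successes * math.log(p) + (n - successes) * math.log(1 - p)

# Hypothetical data: 7 events observed in 20 subjects.
successes, n = 7, 20

# Candidate parameter values strictly between 0 and 1.
grid = [i / 1000 for i in range(1, 1000)]

# The maximum likelihood estimate is the candidate with the largest log likelihood.
p_hat = max(grid, key=lambda p: log_likelihood(p, successes, n))
print(p_hat)
```

The data are held fixed and the parameter value is varied, which is the sense in which the likelihood is a function of the parameter rather than a probability statement about it.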
14. In simple logistic regression analysis in both SPSS and R, which of the following is produced in a standard output?
(one correct choice):
a. Likelihoods (rather than -2log likelihoods)
b. F statistic
c. B (natural log odds ratio) for each parameter
d. T statistic and associated P value
e. Hazard function
The table below is from a simple logistic regression analysis. For each of the boxes pointing to a place in the table,
select the option that correctly names and explains the column.
[Boxes 1, 2, 3 and 4 point to places in the table; the pointers are not reproduced here.]
Variables in the Equation
 | B | S.E. | Wald | df | Sig. | Exp(B) | 95% C.I. for EXP(B) Lower | 95% C.I. for EXP(B) Upper
Step 1a: time | -0.015683 | 0.007256 | 4.671430 | 1.000000 | 0.030668 | 0.984440 | 0.970539 | 0.998540
Constant | 12.727363 | 5.803108 | 4.810116 | 1.000000 | 0.028293 | 336839.852549 | |
a. Variable(s) entered on step 1: time.
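Several columns of this table can be recomputed from B and S.E. alone, which helps when deciding what each column represents. The following is a minimal sketch for the time row. It assumes the Wald statistic is the squared ratio of B to its standard error and that the confidence limits use the Normal multiplier 1.96; small differences from the printed values come from the rounding of B and S.E.

```python
import math

# Values for the "time" row, copied from the table above.
b = -0.015683
se = 0.007256

# Wald statistic: squared ratio of the estimate to its standard error.
wald = (b / se) ** 2

# Exp(B): the estimate on the odds ratio scale.
exp_b = math.exp(b)

# Approximate 95% confidence limits, computed on the log scale then exponentiated.
z = 1.96
lower = math.exp(b - z * se)
upper = math.exp(b + z * se)

print(round(wald, 3), round(exp_b, 6), round(lower, 6), round(upper, 6))
```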
15. Box one is pointing to . . . (one correct choice):
a. -2log likelihoods
b. Akaike's information criterion
c. The value of the natural log odds ratio for the parameter estimate
d. The value of the odds ratio for the parameter estimate
e. The standard deviation of the estimated B sampling distribution
16. Box two is pointing to . . . (one correct choice):
a. -2log likelihoods
b. The standard error of the Logistic function
c. The value of the natural log odds ratio for the parameter estimate
d. The mean of the estimated B sampling distribution
e. The standard deviation of the estimated B sampling distribution
17. Box three is pointing to . . . (one correct choice):
a. P value associated with the linear predictor (LP)
b. P value associated with the Wald statistic for a specific parameter estimate
c. P value associated with the Wald statistic for all parameters combined
d. P value associated with the Wald statistic for the confidence interval for the specific parameter estimate
e. P value associated with none of the above
18. Box four is pointing to . . . (one correct choice):
a. -2log likelihoods
b. Akaike's information criterion
c. The value of the natural log odds ratio for the parameter estimate
d. The value of the odds ratio for the parameter estimate
e. The standard deviation of the estimated B sampling distribution
19. A logistic regression reports an Exp(B) value of .9844 for a time variable along with a p value of <.03. You would
interpret the p value, remembering that the p value is a conditional probability where the null parameter value is
zero, as . . . (one correct choice):
a. You would obtain a result like this or one more extreme three times in a hundred given that there was a strong relationship between time and the outcome variable
b. You would obtain a result like this or one more extreme three times in a hundred given that there was no relationship between time and the outcome variable
c. You would obtain a result like this or one less extreme three times in a hundred given that there was no relationship between time and the outcome variable
d. You would obtain a result like this or one less extreme three times in a hundred given that there was a strong relationship between time and the outcome variable
e. You would obtain a result like this at least 97 times (i.e. 1-.03) in a hundred given that the alternative hypothesis is true
20. This question has been taken from Bland 1996 p.327 (adapted). The following table shows the logistic regression
of vein graft failure on some potential explanatory variables.
Logistic regression of graft failure after 6 months (Thomas et al. 1993)
Variable | Coef. (log odds) | Std. Err. | z=coef/se | p | 95% Conf. Interval (Coef) | Odds ratio exp(coef)
white_cell count | 1.238 | 0.273 | 4.539 | <0.001 | 0.695 to 1.781 | 3.448
Graft type 1 | 0.175 | 0.876 | 0.200 | 0.842 | -1.570 to 1.920 | 1.191
Graft type 2 | 0.973 | 1.030 | 0.944 | 0.348 | -1.080 to 3.025 | 2.645
Graft type 3 | 0.038 | 1.518 | 0.025 | 0.980 | -2.986 to 3.061 | 1.039
female | -0.289 | 0.767 | -0.377 | 0.708 | -1.816 to 1.239 | 0.7486
age | 0.022 | 0.035 | 0.633 | 0.528 | -0.048 to 0.092 | 1.022
smoker | 0.998 | 0.754 | 1.323 | 0.190 | -0.504 to 2.501 | 2.712
diabetic | 1.023 | 0.709 | 1.443 | 0.153 | -0.389 to 2.435 | 2.7815
constant | -13.726 | 3.836 | -3.578 | 0.001 | -21.369 to -6.083 | 0.00001
Number of observations = 84, chi squared = 38.05, d.f. = 8, P < 0.0001
From this analysis, which one of the following statements is FALSE (one correct choice):
a. patients with high white cell counts had over 3 times the odds of having graft failure
b. the log odds of graft failure for a diabetic is between 0.389 less and 2.435 greater than that for a nondiabetic ignoring the statistical significance
c. grafts were more likely to fail in female subjects, though this is not statistically significant
d. there were four types of graft (hint: think reference groups)
e. The relationship between white cell count and graft failure may be due to smokers having higher white cell
counts.
Following MCQs are adapted from Statistics at a glance by Petrie & Sabin 3rd edition website.
21. In logistic regression . . . (one correct choice):
a. The Wald statistic may be used to determine whether the b coefficient for a single explanatory
(independent) variable is statistically significant, where the null hypothesis is that it is equal to zero.
b. The Wald statistic may be used to determine the overall fit of the model.
c. The Wald statistic may be used to determine whether the b coefficient for each explanatory (independent)
variable is statistically significant, where the null hypothesis is that it is equal to zero.
d. The Wald statistic may be used to determine whether the b coefficient for each explanatory (independent)
variable is statistically significant, where the null hypothesis is that it is equal to 1.
e. The Wald statistic may be used to determine whether overall any of the b coefficients are statistically
significant, where the null hypothesis is that they are all equal to zero.
22. In logistic regression . . . (one correct choice):
a. The -2log(likelihood) is a measure of lack of fit for the logistic model, the smaller the value the poorer the fit
between the observed data and the model.
b. The -2log(likelihood) is a measure of lack of fit of a single b coefficient, the smaller the value the poorer the
fit between the observed data and the model.
c. The -2log(likelihood) is a measure of goodness of fit of the logistic model, the smaller the value the closer
the fit between the observed data and the model.
d. The -2log(likelihood) is a measure of goodness of fit of a single b coefficient, the smaller the value the closer
the fit between the observed data and the model.
e. The -2log(likelihood) is only used in linear regression.
23. In logistic regression . . . (one correct choice):
a. The model chi square (traditional fit measure/likelihood ratio test) provides information on overall model fit
where a significant p value indicates that the current model is a better fit than the previous one.
b. The model chi square (traditional fit measure/likelihood ratio test) provides information on overall model fit
where a significant p value indicates that the current model is a worse fit than the previous one.
c. The model chi square (traditional fit measure/likelihood ratio test) provides information on model fit for a
single b coefficient, where a significant p value indicates that the current model is a better fit than the
previous one.
d. The model chi square (traditional fit measure/likelihood ratio test) provides information on model fit for a
single b coefficient, where a non-significant p value indicates that the current model is a better fit than the
previous one.
e. The model chi square (traditional fit measure/likelihood ratio test) provides information on overall model fit
where a non-significant p value indicates that the current model is a better fit than the previous one.
The following table shows the results of a multivariable logistic regression analysis on data from the Framingham
study (1951) in which there were 5209 participants on whom 9 covariates were measured at baseline. The
dependent variable was whether coronary heart disease (CHD) was present (coded as ‘one’) or absent (coded as
‘zero’) after 10 years.
Variable | Definition | bi | P value | RRi | CI for RRi
Sex | M=0, F=1 | -1.588 | <0.001 | 0.20 | 0.14 to 0.29
Age | years | 0.081 | <0.001 | 1.08 | 1.07 to 1.10
Height | inches | -0.053 | <0.05 | 0.95 | 0.95 to 1.00
SBP | mm Hg | 0.009 | <0.02 | 1.01 | 1.00 to 1.02
DBP | mm Hg | 0.006 | >0.05 | 1.01 | 1.01 to 1.02
Cholesterol | mg/ml | 0.007 | <0.001 | 1.01 | 1.00 to 1.01
ECG abnormal | Y=1, N=0 | 0.854 | <0.001 | 2.35 | 1.67 to 3.31
Relative weight | (100 × wt/median wt)% | 1.359 | <0.001 | 3.89 | 1.89 to 8.00
Alcohol consumption | oz/month | -0.059 | >0.05 | 0.94 | 0.88 to 1.01
Constant term | | a = -5.370
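The RRi column can be reproduced by exponentiating the bi column, so each entry is exp(bi), which in a logistic regression is an odds ratio. The following is a minimal sketch for six of the rows, with the percentage change in odds added as a convenience. The dictionary names are illustrative, and agreement with the printed values is only to the two decimal places shown in the table.

```python
import math

# bi coefficients copied from the table above.
coefficients = {
    "Sex": -1.588,
    "Age": 0.081,
    "Height": -0.053,
    "ECG abnormal": 0.854,
    "Relative weight": 1.359,
    "Alcohol consumption": -0.059,
}

# exp(bi) is the odds ratio for a one unit increase in the variable.
odds_ratios = {name: math.exp(b) for name, b in coefficients.items()}

# Percentage change in the odds for a one unit increase (negative means lower odds).
percent_change = {name: 100 * (ratio - 1) for name, ratio in odds_ratios.items()}

for name in coefficients:
    print(name, round(odds_ratios[name], 2), round(percent_change[name], 1))
```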
24. By what percentage are the odds of CHD lower in females than in males? (one correct choice):
a. 20
b. 40
c. 60
d. 80
e. exp(1.08)
25. Which of the following is true (one correct choice):
a. Individuals with an abnormal ECG at baseline had more than three times the odds of suffering from CHD
compared with those with a normal ECG, after adjusting for other factors.
b. Individuals with an abnormal ECG at baseline had more than two times the odds of suffering from CHD
compared with those with a normal ECG, after adjusting for other factors.
c. Individuals with an abnormal ECG at baseline were more than two times as probable to suffer from CHD
as those with a normal ECG, after adjusting for other factors.
d. Individuals with an abnormal ECG at baseline were more than three times as probable to suffer from CHD
as those with a normal ECG, after adjusting for other factors.
e. Individuals with an abnormal ECG at baseline were more than two times as likely to suffer from CHD, if they also
had the other risk factors, as those with a normal ECG.
26. Which of the following is true:
a. Cholesterol level shows a statistically insignificant result and a large effect size
b. Cholesterol level shows a statistically significant result and a large effect size
c. Cholesterol level shows a statistically insignificant result and a small effect size
d. Cholesterol level shows a statistically significant result and a small effect size
e. Cholesterol level shows a statistically significant result with no reported effect size
27. Which of the following might be dropped from a future model:
a. Diastolic blood pressure (DBP), Alcohol consumption, height
b. Diastolic blood pressure (DBP), Alcohol consumption
c. Systolic blood pressure (SBP), Age, Sex, Relative weight
d. Diastolic blood pressure (DBP), Systolic blood pressure, Cholesterol
e. None of the variables
3. Multiple Regression
1. A partial correlation . . . (one correct answer)
a. Controls for influence on the first of the variables being correlated
b. Controls for influence on the second of the variables being correlated
c. Controls for influence on both of the variables being correlated
d. Divides the influence of the specified suppressor variable(s), equally across the X and Y variables
e. Suppresses the influence of the specified suppressor variable(s), equally across the X and Y variables
2. A part correlation . . . (one correct answer)
a. Controls for influence on the first of the variables being correlated
b. Controls for influence on the second of the variables being correlated
c. Controls for influence on both of the variables being correlated
d. Controls for influence on either the first or second of the variables being correlated
e. Suppresses the influence of the specified suppressor variable(s), equally across the X and Y variables
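The difference between a partial and a part (semipartial) correlation shows up in their formulas. The following is a minimal sketch using the standard first-order formulas for three variables x, y and a control variable z. The correlation values are hypothetical, and the part correlation is written here with z removed from x only.

```python
import math

def partial_correlation(r_xy, r_xz, r_yz):
    """Correlation of x and y with z removed from both x and y."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

def part_correlation(r_xy, r_xz, r_yz):
    """Correlation of y with the portion of x that is independent of z."""
    return (r_xy - r_xz * r_yz) / math.sqrt(1 - r_xz ** 2)

# Hypothetical pairwise correlations.
r_xy, r_xz, r_yz = 0.5, 0.4, 0.3

partial = partial_correlation(r_xy, r_xz, r_yz)
part = part_correlation(r_xy, r_xz, r_yz)
print(round(partial, 3), round(part, 3))
```

The two formulas share a numerator; the partial correlation has the extra term for y in the denominator, so its absolute value is never smaller than that of the part correlation.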
3. Linear multiple regression which only involves nominal predictor (input/independent) variables is traditionally
analysed using . . . (one correct answer)
a. Analysis of variance (ANOVA)
b. Analysis of covariance (ANCOVA)
c. Survival analysis
d. Generalised Estimating Equations (gee)
e. Logistic regression
4. Linear multiple regression which involves both nominal and interval/ratio (continuous) predictor
(input/independent) variables is traditionally analysed using . . . (one correct answer)
a. Analysis of variance (ANOVA)
b. Analysis of covariance (ANCOVA)
c. Survival analysis
d. Generalised Estimating Equations (gee)
e. Logistic regression
5. The linear multiple regression approach provides the following advantage over ANOVA . . . (one correct answer)
a. Allows analysis of non normally distributed samples
b. Parameter estimation (B's and β's).
c. Missing data analysis
d. Copes better with smaller sample sizes
e. Reduced error estimates
6. The squared part correlation for each of the parameter estimates in multiple linear regression represents . . .
(one correct answer)
a. increase in accuracy for that particular variable
b. Percentage of error attributed to the parameter
c. increase in R² for that particular parameter (unique contribution to the model)
d. Level of normality exhibited by the parameter estimate
e. Alternative to the R² measure
7. What does the term collinearity or multicollinearity mean with regard to multiple linear regression? (one
correct answer)
a. A desirable situation where there is a low correlation between two or more predictors (input variables)
b. An undesirable situation where there is a low correlation between two or more predictors (input variables)
c. A desirable situation where there is a high correlation between two or more predictors (input variables)
d. An undesirable situation where there is a high correlation between two or more predictors (input variables)
e. An undesirable situation where there is no correlation between the predictors (input variables)
8. Ross NA, Wolfson MC, Dunn JR, Berthelot J-M, Kaplan GA, Lynch JW. Relation between income inequality and
mortality in Canada and in the United States: cross-sectional assessment using census data and vital statistics. BMJ
2000; 320: 898–902
Ross et al. regressed mortality in working aged men against median share of income (i.e. the proportion of total
income accruing to the less well off 50% of households) in 282 USA metropolitan areas and 53 Canadian
metropolitan areas. The median income for the areas was included as an explanatory variable. They found the
difference in slopes significant (p < 0.01), R² = 0.51. Example courtesy of Michael Campbell.
(please select the THREE correct/true answers)
The model is yi = a + b1X1i + b2X2i + b3X3i + b4X4i
yi is the mortality per 100 000 for metropolitan area i, i = 1…335
X1i takes the value 1 for the USA and 0 for Canada
X2i is median share of income for area i (defined above)
X3i = X1i × X2i (the product of X1i and X2i)
X4i is median income for area i
a. Mortality is assumed to have a Normal distribution.
b. The test to compare slopes is a t test with 330 degrees of freedom.
c. The relationship between mortality and median income is assumed to be different for the USA and Canada.
d. The relationship between mortality and median share of income is assumed linear.
e. The variability of the residuals is assumed the same for the USA and Canada.
Personal comments:
a: false. It is the residuals, having allowed for median share of income and country, that are assumed Normal.
b: true. df = 330 = 282 + 53 - 1 - 4 [b1, b2, b3, b4]
c: false. It is the relationship between mortality and median share of income that is assumed different.
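The degrees of freedom and the role of the product term X3 can be checked with a few lines of arithmetic. The following is a minimal sketch: the sample sizes come from the question, while the coefficient values b2 and b3 and the function name are hypothetical and chosen only to show how the slope for median share of income differs between the two countries.

```python
# Sample sizes from the question.
n_usa, n_canada = 282, 53
n_slopes = 4  # b1, b2, b3, b4

# Residual degrees of freedom: total areas minus the intercept minus the slopes.
residual_df = n_usa + n_canada - 1 - n_slopes

def share_slope(is_usa, b2, b3):
    """Slope of mortality on median share of income for one country.

    With X3 = X1 * X2, the slope is b2 when X1 = 0 (Canada)
    and b2 + b3 when X1 = 1 (USA).
    """
    return b2 + b3 * is_usa

# Hypothetical coefficients, for illustration only.
b2, b3 = -10.0, -5.0
slope_canada = share_slope(0, b2, b3)
slope_usa = share_slope(1, b2, b3)

print(residual_df, slope_canada, slope_usa)
```

The test of b3 against zero is therefore the test of a difference in slopes, and median income (X4) enters with a single slope shared by both countries.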
9. Example courtesy of Michael Campbell. In a multiple regression equation y = a + b1X1 + b2X2,
(please select the TWO correct/true answers)
a. The ‘independent’ variables X1 and X2 must be continuous
b. The leverage depends on the values of y.
c. The slope b2 is unaffected by values of X1.
d. If X2 is a categorical variable with three categories, it is modelled by two dummy variables.
e. If there are 100 points in the data set, then there are 97 degrees of freedom for testing b1.
Personal comments:
a: False. X1 and X2 can be discrete.
b: False. Leverage depends on X1 and X2.
c: False. Changing the values of X1 will alter the relationship with X2 and so affect b2.
d: True.
e: True.
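Statements d and e can be illustrated directly. The following is a minimal sketch of dummy coding a three-category variable against a reference category, followed by the residual degrees of freedom for a model with two predictors. The category labels, the choice of reference category and the function name are hypothetical.

```python
def dummy_code(values, reference):
    """Code a categorical variable as 0/1 dummy variables.

    The reference category gets no column of its own, so k categories
    need k - 1 dummy variables.
    """
    levels = sorted(set(values) - {reference})
    rows = [[1 if value == level else 0 for level in levels] for value in values]
    return rows, levels

# Hypothetical three-category variable with "low" as the reference category.
categories = ["low", "medium", "high", "medium", "low"]
rows, levels = dummy_code(categories, reference="low")
print(levels)
print(rows)

# Residual degrees of freedom for y = a + b1X1 + b2X2 fitted to 100 points.
n_points = 100
n_predictors = 2
residual_df = n_points - n_predictors - 1
print(residual_df)
```

A subject in the reference category has zeros in every dummy column, so its expected value is carried by the constant term.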
[end of document]