Richard I. Frederick, Ph.D.

Solving Classification Problems for
Symptom Validity Tests with
Mixed Groups Validation
Richard Frederick, Ph.D., ABPP (Forensic)
US Medical Center for Federal Prisoners
Springfield, Missouri
I am not a neuropsychologist.
My view of brain
Your view of brain
My board certifications:
Forensic Psychology
American Board of Professional Psychology
Assessment Psychology
American Board of Assessment Psychology
My professional goal:
Use tests properly in forensic
psychological assessments
Goals of workshop
Participants in this workshop will be able
to employ Excel graphing methods:
--to evaluate classification characteristics of
symptom validity tests
--to adapt symptom validity test scores to their
individual, local, base rates
--to combine information from local base rate
and multiple symptom validity tests
richardfrederick.com
Something is terribly wrong
1. The SIRS has sensitivity = .485 and specificity = .995.
2. The SIRS was administered to 131 criminal defendants
who were strongly suspected of feigned psychopathology.
68% of them were categorized as feigning by the SIRS
What is a classification test?
A structured routine for determining
which individuals belong to which
of two groups.
(1) There are two groups.
(2) It’s not easy to determine which
group an individual belongs to
without the help of the test.
Real World
The distributions represent our
estimations of how the populations of
the two groups score on the test.
We generally estimate the population
distributions by sampling. We notice
that the populations have two separate,
but overlapping distributions. The extent
of the overlap is of concern to us.
Questions that must be addressed in
research before we can continue:
(1) Are there really two separate groups?
(2) Can we effectively represent the
population distributions by sampling?
Real World
What we notice next.
The mean separation between the
groups is 10 points.
Persons in Population A have a mean
score that is 10 points below persons in
Population B.
The sd for each population is the same. The
mean separation between groups is one sd.
When researchers talk about mean
separation, they often refer to effect size.
Often, Cohen’s d is the statistic used to
refer to standardized mean separation.
Here, Cohen’s d = 1. This is often referred
to as a large, or very large, effect size.
Mean separation = 0
Making tests often means finding those
characteristics that best separate the
distributions of the two groups.
Two distributions of gender
with respect to:
Intelligence
Moderately large mean separation
Two distributions of gender
with respect to:
Longevity
Large mean separation
Two distributions of gender
with respect to:
Hair Length
Very large mean separation
Two distributions of
gender
with respect to:
Body Mass
Real World
Summary:
(1) We have two groups.
(2) We have a test for which the two
groups score differentially.
(3) The difference in mean scores
represents a very large effect.
Foundations of TPR and FPR
More commonly, researchers report
Sensitivity and Specificity. These terms
are common, but not most helpful.
We are going to use the terms:
True Positive Rate (TPR) and
False Positive Rate (FPR).
TPR = Sensitivity
FPR = 1 - Specificity
What are TPR and FPR?
TPR is the proportion of individuals who do have
the condition who generate positive scores. TPR
is the rate of scores beyond the cut in the
direction that indicates the presence of the
condition.
FPR is the proportion of individuals who do NOT
have the condition who generate positive scores.
FPR is the rate of scores beyond the cut in the
direction that indicates the presence of the
condition.
Have nots
Haves
The green line represents
the cut score. Scores to the
LEFT of the line are classified
NEGATIVE. Scores to the right
are classified POSITIVE.
Here, the False Positive Rate is 92.4%.
The True Positive Rate is 100%.
As we move the line to the right, both
rates DECREASE.
To totally eliminate
false positives, we have
to be willing to identify
almost no one as a
positive.
Test / Truth     Have disorder      Don’t              Total
Has disorder     True Positives     False Positives    Positives
Doesn’t          False Negatives    True Negatives     Negatives
Total            Haves              Have Nots
TPR = True Positives/Haves
FPR = False Positives/Have Nots
Haves
Have nots
A positive score will be one that is
associated with Population A
membership. If we set a point at which
a score will be used to say, “This score
represents Population A,” such a score
will be referred to as a “positive score.”
A positive score can be a true positive
or a false positive: unknown to us.
The True Positive Rate is the proportion of
Population A members who generate a
positive score.
In our figure, the point at which we
begin to identify “positive scores” is at 50, the
mean of population A. Scores at or below 50
are called positive, and a person who
generates a positive score is classified as a
Population A member.
We can pick any value to be our “cut score,” but
it’s hard to pick one that doesn’t result in some
Population B members producing “positive
scores.”
In our figure, 50% of the Population A members
have scores at 50 or below. This is the True
Positive Rate. TPR = .50.
In our figure, 16% of the Population B members
have scores at 50 or below. This is the False
Positive Rate. FPR = .16.
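The two rates quoted for the figure can be checked numerically. The sketch below assumes both populations are normally distributed (the slides do not state the distribution shape), with Population A at mean 50, Population B at mean 60, and a common sd of 10, which matches the stated 10-point, one-sd separation. The variable names are illustrative only.

```python
from statistics import NormalDist

# Assumed population models; normality is an assumption for illustration.
pop_a = NormalDist(mu=50, sigma=10)  # Population A (positive = A membership)
pop_b = NormalDist(mu=60, sigma=10)  # Population B

cut = 50  # scores at or below the cut are called positive

# Standardized mean separation (Cohen's d) with a common sd.
cohens_d = (pop_b.mean - pop_a.mean) / pop_a.stdev

tpr = pop_a.cdf(cut)  # proportion of Population A at or below the cut
fpr = pop_b.cdf(cut)  # proportion of Population B at or below the cut

print(f"d = {cohens_d:.2f}, TPR = {tpr:.2f}, FPR = {fpr:.2f}")
```

Changing `cut` shows the point made in the slides: the rates belong to the chosen score, not to the test.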
We note that it is not the test that has
a certain TPR and FPR.
It is the chosen test score that has a
certain TPR and FPR.
A different test score will almost
certainly have different TPR and FPR.
Overcoming limiting factors of “known
groups” validation in determining test
score sensitivity and specificity
We think of a test as a way to characterize a dependency.
As you have more of X, you have more of Y.
Y depends on X.
X predicts Y.
X is some construct. Y is some test score.
There is a relationship that we wish to characterize and
quantify.
Let’s consider feigning.
As you are more likely to feign, you are more likely to
engage in certain behavior.
This behavior might be “providing answers to items on a
test” at a certain rate.
You might choose more items, you might choose fewer
items than “normals.”
We develop the idea that we can identify individuals who
respond at a certain rate as feigners, and we decide to
make a decision point about when we call test takers
feigners and when we don’t.
We call that decision point a cut score.
We call test scores at or beyond the cut score:
positive scores
Some positive scores are correct: true positives
Some positive scores are incorrect: false positives
If our test is any good, and if the relationship between
X and Y is strong, then our rate of true positives is much
higher than our rate of false positives.
Let’s skip to the end. We are now using the test in our
clinic.
We look over our results. We see a number of “positive
scores.”
We know that those “positive scores” are some unknown
mixture of “true positives” and “false positives.”
We’d like to know what the ratio of that mixture is.
Here’s how we do it:
First, we estimate what the true positive rate of the cut
score is.
Then, we estimate what the false positive rate of the
cut score is.
Then, we figure out what percentage of people in our
sample are feigning.
Then we can get the ratio of the mixture of our true
positive and false positives in all the positive scores in our
clinic. (We call this positive predictive power.)
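The three estimates listed above combine into positive predictive power in one step. A minimal sketch follows; the function name is hypothetical, and the clinic values (TPR = .80, FPR = .10, 30% feigning) are chosen only for illustration.

```python
def positive_predictive_power(tpr, fpr, base_rate):
    """Proportion of positive scores that are true positives."""
    true_pos = tpr * base_rate          # feigners who score positive
    false_pos = fpr * (1 - base_rate)   # non-feigners who score positive
    return true_pos / (true_pos + false_pos)

# Hypothetical clinic: TPR = .80, FPR = .10, 30% of examinees feigning.
print(round(positive_predictive_power(0.80, 0.10, 0.30), 3))
```

The function does not guard against the case where no positive scores are expected at all (for example FPR = 0 with a base rate of 0); a real implementation would need to decide how to report that.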
Getting TPR and FPR:
We depend on researchers to tell us what the estimates
of true positive rate and false positive rate are.
They usually do this through a process called
“criterion groups validation.”
People with more confidence than might be called
for refer to this process as “known groups validation.”
The process is seemingly straightforward.
Identify two groups. One group has the condition.
All the positives in this group are “true positives.”
One group doesn’t have the condition. All the positives
in this group are “false positives.”
The rate of “true positives” is the sensitivity of the test.
TPR = sensitivity.
The rate of “false positives” is the non-specificity of
the test. FPR = 1 – specificity.
There are many problems with this process, but let’s
focus on the main two.
Problem 1
In Study 1, for a given cut score, researchers report the
TPR is .67 and the FPR is .12.
In Study 2, for the same cut score, researchers report
TPR = .58 and FPR = .09.
Which values do you use?
Problem 2:
In Study 1, for a given cut score, researchers report the
TPR is .67 and the FPR is .12.
In Study 2, for a different cut score, researchers report
TPR = .58 and FPR = .09.
Which cut score do you use?
“Known” groups validation
Let’s validate a test!
God whispers to us what truth is and
we identify 100 honest responders
and 100 feigners.
We take our best shot at a test.
Test results
                  TRUTH
TEST              Feigning    Honest    Total
Positive             49          1        50
Negative             51         99       150
Total               100        100
We say for our test:
True positive rate = 49/100 = 49% [sens = 49%]
False positive rate = 1/100 = 1% [specificity = 99%]
Because God does not whisper to us anymore,
we take this test, our best test, and we say,
“This is the best we can do.” Let’s call it our
Gold Standard.
We will now make criterion groups with this test,
and we will call the groups “Known Groups.”
We will then validate tests, based on these
Known Groups.
Our move from TRUTH to KNOWN GROUPS
                  “KNOWN” GROUPS
TRUTH             Positive    Negative    Total
Feigning             49          51        100
Honest                1          99        100
Total                50         150
We forget what truth is and develop faith in our gold
standard
“KNOWN” GROUPS: 50 in one group, 150 in the other
Let’s validate a new test, which just happens to be a perfect
test. What test diagnostic efficiencies will we assign our
new, perfect, test?
                  “KNOWN” GROUPS
PERFECT TEST      Positive    Negative    Total
Positive             49          51        100
Negative              1          99        100
Total                50         150
TPR = 49/50 = 98%, FPR = 51/150 = 34%
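The arithmetic behind these two figures can be traced in a few lines. This sketch uses only the counts given in the slides (100 feigners, 100 honest responders, a gold standard with TPR = .49 and FPR = .01); the variable names are illustrative.

```python
# Truth: 100 feigners and 100 honest responders.
feigners, honest = 100, 100

# Gold standard from the slides: TPR = .49, FPR = .01.
gold_tpr, gold_fpr = 0.49, 0.01

# The "known" positive group is everyone the gold standard calls positive.
feigners_in_known_pos = round(gold_tpr * feigners)       # 49
honest_in_known_pos = round(gold_fpr * honest)           # 1
feigners_in_known_neg = feigners - feigners_in_known_pos  # 51
honest_in_known_neg = honest - honest_in_known_pos        # 99

known_pos = feigners_in_known_pos + honest_in_known_pos   # 50
known_neg = feigners_in_known_neg + honest_in_known_neg   # 150

# A perfect new test is positive for every true feigner and for no one else,
# so its "positives" in each known group are exactly the feigners there.
assigned_tpr = feigners_in_known_pos / known_pos   # 49/50
assigned_fpr = feigners_in_known_neg / known_neg   # 51/150

print(f"assigned TPR = {assigned_tpr:.2f}, assigned FPR = {assigned_fpr:.2f}")
```

The perfect test is charged with a 34% false positive rate purely because 51 true feigners were placed in the "known" negative group.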
Our belief that
we can make
perfect criterion
groups from
imperfect criteria
has led us to
misunderstand
tremendously
what we are
doing.
Let’s begin to address these problems in a
non-traditional way.
Table for Computation of Test Characteristics
                  Positive       Negative
                  (Feigners)     (Not Feigning)
Test Positive        80%            10%           Computation for Positive Predictive Power
Test Negative        20%            90%           Computation for Negative Predictive Power
Sensitivity = 80%
Specificity = 90%
Table for Computation of Test Characteristics
                  Positive       Negative
                  (Feigners)     (Not Feigning)
Test Positive        80%            10%           PPP = Ratio of True Positives to All Positives
Test Negative        20%            90%           NPP = Ratio of True Negatives to All Negatives
True Positive Rate (TPR) = 80%
False Positive Rate (FPR) = 10%
Table for Computation of Test Characteristics
Base Rate of Feigning    100%     0%
Test Positive             80%    10%
Test Negative             20%    90%
True Positive Rate (TPR) = 80%
False Positive Rate (FPR) = 10%
NOTE: Calculations of TPR and FPR are INDEPENDENT of Base Rate
Table for Computation of Test Characteristics
Base Rate of Feigning    100%     0%
Test Positive             80%    10%
True Positive Rate (TPR) = 80%
False Positive Rate (FPR) = 10%
Table for Computation of Test Characteristics
Base Rate of Feigning        1.00    0
Proportion Tests Positive     .80    .10
True Positive Rate (TPR) = .80
False Positive Rate (FPR) = .10
The Test Validation Summary
[Figure: Proportion Positive Scores on Classification Test. y-axis: Proportion Positive Scores (0 to 1); x-axis: Base Rate of Feigning (.00 to 1.00). Labeled: False Positive Rate, True Positive Rate.]
REMINDER: Here is what we are working on—figuring out
which positives in our clinic are true positives.
First, we estimate what the true positive rate of the cut
score is. Then, we estimate what the false positive rate of
the cut score is.
Let’s do that part now.
Then, we figure out what percentage of people in our
sample are feigning.
Then we can get the ratio of the mixture of our true
positive and false positives in all the positive scores in our
clinic.
Mixed groups validation
Table for Computation of Test Characteristics
Base Rate of Malingering    1.0    .8     .6     .4     .2     0
Pr + Tests                  .8     .66    .52    .38    .24    .1
TPR = .8
FPR = .1
(.8, .6, .4, .2 are mixed groups, not pure)
Table for Computation of Test Characteristics
Base Rate of Malingering    0      .2     .4     .6     .8     1
Pr + Tests                  .1     .24    .38    .52    .66    .8
FPR = .1
TPR = .8
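Every entry in the "Pr + Tests" row follows from one linear relation: the expected proportion of positive scores is FPR plus BR times (TPR minus FPR), which is the same as TPR × BR + FPR × (1 − BR). A minimal sketch, with a hypothetical function name, reproduces the row for TPR = .8 and FPR = .1.

```python
def proportion_positive(base_rate, tpr=0.8, fpr=0.1):
    """Expected proportion of positive scores in a group with this base rate."""
    return fpr + base_rate * (tpr - fpr)

for br in (0, 0.2, 0.4, 0.6, 0.8, 1.0):
    print(f"BR = {br:.1f}  Pr(+) = {proportion_positive(br):.2f}")
```

Because the relation is a straight line, its two ends (BR = 0 and BR = 1) can be recovered from mixed groups alone, which is the idea mixed groups validation relies on.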
The Test Validation Summary
[Figure: Proportion Positive Scores on Classification Test. y-axis: Proportion Positive Scores (0 to 1); x-axis: Base Rate of Malingering (.00 to 1.00). Labeled: False Positive Rate, True Positive Rate.]
The Test Validation Summary
[Figure: Proportion Positive Scores on Classification Test. y-axis: Proportion Positive Scores (0 to 1); x-axis: Base Rate of Malingering (.00 to 1.00). Labeled: False Positive Rate, True Positive Rate.]
When BR = 1, 80% of scores positive, all true positives.
When 0 < BR < 1, positive scores are some mixture of true positives and false positives. That mixture is easily discernible.
When BR = 0, 10% of scores positive, all false positives.
[Figure: y-axis: Proportion Positive Scores in Sample (0.00 to 1.00); x-axis: Base Rate of Population A in Sample (0.00 to 1.00). FPR = .16, TPR = .50.]
Movement along this line from left to right represents increasing rate of Population A and increasing rate of positive scores.
When we say FPR = .16 and TPR = .50, what we’re saying is that, no matter what samples we test, we expect to see no fewer than 16% positive scores and no more than 50% positive scores.
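This bound gives a quick consistency check for any published TPR and FPR: an observed rate of positive scores should fall between them. The sketch below applies it to the SIRS figures from the opening "Something is terribly wrong" slide (sensitivity = .485, specificity = .995, 68% of 131 defendants categorized as feigning). The function name is hypothetical, and sampling error is ignored.

```python
def within_expected_range(observed_rate, tpr, fpr):
    """True if an observed positive-score rate lies between FPR and TPR."""
    return fpr <= observed_rate <= tpr

# SIRS values quoted in the opening slide.
sirs_tpr = 0.485        # sensitivity
sirs_fpr = 1 - 0.995    # 1 - specificity
observed = 0.68         # proportion of the 131 defendants categorized as feigning

print(within_expected_range(observed, sirs_tpr, sirs_fpr))  # False
```

Even if every one of the 131 defendants were feigning, a TPR of .485 predicts no more than about 48.5% positive scores, so 68% points to a problem with the published estimate.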
The Test Validation Summary
FPR is the proportion of positive scores obtained when BR = 0.
TPR is the proportion of positive scores obtained when BR = 1.
The BR of the condition varies moving along the solid straight line as the proportion of positive scores increases from FPR to TPR.
[Figure: Proportion Positive Scores on Classification Test. y-axis: Proportion Positive Scores (0 to 1); x-axis: Base Rate of Condition (.00 to 1.00). Curves labeled NPP and PPP.]
FPR = .10
TPR = .80
The Test Validation Summary
The mixture of true positives and false positives changes in a linear fashion, moving from 0% true positives to 100% true positives, but the rate of change (PPP) is not linear. PPP changes in a non-linear, or curvilinear, fashion.
[Figure: Proportion Positive Scores on Classification Test. y-axis: Proportion Positive Scores (0 to 1); x-axis: Base Rate of Condition (.00 to 1.00). Curves labeled NPP and PPP.]
FPR = .10
TPR = .80
Table for Computation of Test Characteristics
Base Rate of Malingering    0      .2     .4     .6     .8     1
Pr + Tests                         .24    .38    .52    .66
The Test Validation Summary
[Figure: Proportion Positive Scores on Classification Test. y-axis: Proportion Positive Scores (0 to 1); x-axis: Base Rate of Malingering (.00 to 1.00). Labeled: False Positive Rate, True Positive Rate.]
[Figure: y-axis: Proportion Positive TOMM Scores (0.00 to 1.00); x-axis: Estimated Base Rate of Malingering (0.00 to 1.00).]
FPR = .052, SE = .021
TPR = .777, SE = .061
TOMM
No simulation studies
[Figure: y-axis: Proportion Positive TOMM Scores (0.00 to 1.00); x-axis: Estimated Base Rate of Malingering (0.00 to 1.00). Curves labeled NPP and PPP.]
FPR = .056, SE = .025
TPR = .742, SE = .093
For any
imperfect
test,
PPP ranges
from 0 to 1
as base rate
ranges from
0 to 1
NPP ranges
from 0 to 1
as base rate
ranges from
1 to 0
Using MGV to estimate
test diagnostic efficiencies of
the Reliable Digit Span
Laurie Ragatz, PhD
Richard Frederick, PhD
What is Reliable Digit Span?
RDS is a symptom validity measure for Digit
Span. The value of RDS is derived by adding
longest strings of two trials passed for both
forward and backward Digit Span.
Researched cut scores include 5 or lower, 6 or
lower, 7 or lower, or 8 or lower.
Reliable Digit Span Example:

Forward Digit Span
Directions: Examinee recalls numbers in the same order they were provided by the examiner.
Each item is scored Correct or Incorrect.
  1 4
  2 5
  5 7 1
  8 3 4
  5 9 4 6
  7 2 3 9

Backward Digit Span
Directions: Examinee recalls numbers in the reverse order they were provided by the examiner.
Each item is scored Correct or Incorrect.
  Example    Correct Answer
  1 2        2 1
  7 4        4 7
  5 3 9      9 3 5
  8 2 4      4 2 8

Reliable Digit Span: 4 + 3 = 7
(1) We found all available articles dealing
with RDS and identified the cut scores
investigated. We included simulator studies.
(2) Based on the authors’ decision about
criterion group membership, we
calculated the overall base rate of
malingering in the study.
(3) We observed the overall rate of
positive scores in the study at the
identified cut score.
(4) We did not include any data for persons
with mental retardation. The rate of
positive scores among persons with mental
retardation was exceedingly high for all cut
scores.
Example: Smith (2010) reported 203 TOMMs at cut < 45.

                          Criterion group
Test outcome              Is malingering    Is not malingering    Total
Test score positive            42                 15                57
Test score negative            21                125               146
Total                          63                140               203
We have 63 malingerers in a sample of 203. BR = 63/203 = 0.31.
We have 57 positive scores. Proportion positive scores (PPS) is
57/203 = .28. For this study, we plot (BR, PPS) = (.31, .28)
x = .31, y = .28. Our n for WLS = 203.
RDS = 5 or lower
Cut score = 5 or lower
Study                                             N      BR       PPS
Meyers & Volbrecht                                96     0.490    0.052
Mathias, Greve, Bianchini et al. 2002             54     0.444    0.093
Etherton, Bianchini, Greve et al 2005             157    0.223    0.057
Etherton, Bianchini, Ciota, & Greve 2005          60     0.333    0.083
Axelrod, Fichtenberg, Millis, & Wertheimer, nd    65     0.554    0.215
Ylioga, Baird, Podell (2009)                      62     0.532    0.113
Harrison, Rosenblum, Currie (2010)                133    0.113    0.008
Using weighted least squares regression (with N as the weight),
we regressed Proportion Positive Scores (PPS) on Base Rate (BR)
to generate the Proportion Positive Score Line.
We obtained y-intercept of -.015 (all negative values are
truncated to 0), and slope of .265.
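The regression described here can be reproduced from the seven (N, BR, PPS) rows for RDS = 5 or lower. The sketch below is a plain closed-form weighted least squares fit written for illustration; it is not necessarily the software or exact procedure the authors used, and the function and variable names are hypothetical.

```python
def weighted_least_squares(x, y, w):
    """Fit y = intercept + slope * x, weighting each point by w."""
    total_w = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / total_w
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / total_w
    sxy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# N, BR, and PPS for the seven studies at RDS = 5 or lower.
n = [96, 54, 157, 60, 65, 62, 133]
br = [0.490, 0.444, 0.223, 0.333, 0.554, 0.532, 0.113]
pps = [0.052, 0.093, 0.057, 0.083, 0.215, 0.113, 0.008]

intercept, slope = weighted_least_squares(br, pps, n)
fpr = max(0.0, intercept)            # line height at BR = 0, truncated at 0
height_at_br1 = intercept + slope    # line height at BR = 1

print(f"intercept = {intercept:.3f}, slope = {slope:.3f}")
print(f"FPR = {fpr:.3f}, line height at BR = 1: {height_at_br1:.3f}")
```

Run on these rows, the fit lands within rounding of the reported intercept and slope. This simple version returns only the point estimates; standard errors would need the residual variance as well.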
RDS = 5 or lower
The N, BR, and PPS values for Studies 1 to 7 above are put into WLS to obtain the regression line characteristics.
[Figure: scatterplot of PPS (y-axis, 0 to 1) against BR (x-axis, 0 to 1) with the fitted line.]
RDS: 5 or lower, FPR = 0, TPR = .265
RDS = 6 or lower
Cut score = 6 or lower
Study                                             N      BR       PPS
Duncan & Ausborn, 2002                            187    0.283    0.230
Meyers & Volbrecht                                96     0.490    0.094
Mathias, Greve, Bianchini et al. 2002             54     0.444    0.185
Etherton, Bianchini, Greve et al 2005             157    0.223    0.089
Strauss, Slick, Hunter, et al 2002                74     0.459    0.243
Etherton, Bianchini, Ciota, & Greve 2005          60     0.333    0.117
Axelrod, Fichtenberg, Millis, & Wertheimer, nd    65     0.554    0.354
Ylioga, Baird, Podell (2009)                      62     0.532    0.242
Harrison, Rosenblum, Currie (2010)                133    0.113    0.045
Babikian, Boone, Lu, & Arnold                     154    0.429    0.130
Greiffenstein & Baker (2008)                      87     0.775    0.368
y-intercept = .015, slope = .419
RDS: 6 or lower, FPR = .015, TPR = .434
RDS = 7 or lower
Cut score = 7 or lower
Study                                             N      BR       PPS
Duncan & Ausborn, 2002                            187    0.283    0.394
Meyers & Volbrecht                                96     0.490    0.260
Mathias, Greve, Bianchini et al. 2002             54     0.444    0.333
Etherton, Bianchini, Greve et al 2005             157    0.223    0.270
Inman & Berry, 2002                               92     0.478    0.130
Etherton, Bianchini, Ciota, & Greve 2005          60     0.333    0.133
Axelrod, Fichtenberg, Millis, & Wertheimer, nd    65     0.554    0.554
Ruocco, Swirsky-Sacchetti, Chute et al., 2007     77     0.041    0.338
Merten, Bossink, Schmand (first)                  48     0.500    0.458
Ylioga, Baird, Podell (2009)                      62     0.532    0.452
Harrison, Rosenblum, Currie (2010)                133    0.113    0.083
Greiffenstein, Baker, Gola (1994)                 106    0.406    0.396
Babikian, Boone, Lu, & Arnold                     154    0.429    0.234
Greiffenstein, Gola, Baker (1995)                 177    0.384    0.582
Greiffenstein & Baker (2008)                      602    0.492    0.419
y-intercept = .187, slope = .39
RDS: 7 or lower, FPR = .187, TPR = .618
RDS = 8 or lower
Cut score = 8 or lower
Study                                             N      BR       PPS
Meyers & Volbrecht                                96     0.490    0.458
Mathias, Greve, Bianchini et al. 2002             54     0.444    0.500
Etherton, Bianchini, Greve et al 2005             157    0.223    0.490
Etherton, Bianchini, Ciota, & Greve 2005          60     0.333    0.217
Axelrod, Fichtenberg, Millis, & Wertheimer, nd    65     0.554    0.754
Ylioga, Baird, Podell (2009)                      62     0.532    0.565
Harrison, Rosenblum, Currie (2010)                133    0.113    0.263
Greiffenstein, Baker, Gola (1994)                 106    0.406    0.557
Babikian, Boone, Lu, & Arnold                     154    0.429    0.377
y-intercept = .236, slope = .824
RDS: 8 or lower, FPR = .236, TPR = .824
As we move
from a cut score
of 5 or lower to 6
or lower, we
obtain
substantial
improvement in
TPR estimate
with little cost
in FPR increase.
Our choice for best cut score for RDS
RDS: 6 or lower, FPR = .015, TPR = .434

Cut score     FPR            TPR
5 or lower    0 (.038)       .25 (.07)
6 or lower    .015 (.053)    .434 (.082)
7 or lower    .187 (.102)    .618 (.155)
8 or lower    .236 (.112)    .824 (.190)
By using WLS regression, we can obtain standard errors of
our estimates of FPR and TPR.
So, new researchers can test hypotheses about parametric
values of FPR and TPR.
Overcoming limiting factors of “known
groups” validation in determining test
score sensitivity and specificity
Summary:
1. The TVS and MGV allow powerful research into existing
published data sets. Summary data are used.
2. Understanding of parametric values of TPR and FPR is
facilitated when researchers publish results on a variety of
cut scores that should be considered. A frequency
distribution would be ideal, for example,
RDS    n       RDS    n       RDS    n
0      5       5      7       10     88
1      0       6      51      11     74
2      0       7      68      12     61
3      1       8      79      13     32
4      3       9      98      14     12
3. Combining studies in this way allows us to generate
stable values of TPR and FPR with SE’s so that new research
can test those values.
4. Researchers should focus on the basis for estimating BR’s
in their research groups. All research estimating FPR and TPR
is vulnerable to error when the purity of research groups is
overestimated. Working towards a reliable estimate of
mixed group base rate will facilitate better validation studies.
Reliably estimate local base rates of
feigning for proper allocation of
sensitivity and specificity information
How can the Test Validation Summary help me
determine my local BR?
1. Get the best estimate of the test FPR and TPR
for a certain test score.
2. Find the proportion of test scores in your
sample that are positive scores.
The Test Validation Summary
You review your records and determine that 40% of your patients have a positive score when the score has FPR = .10 and TPR = .80.
From the TVS, you see that this corresponds to a BR = .43. You see that in your clinic, the PPP for a positive score is .86 and the NPP for a negative score is .86.
[Figure: Proportion Positive Scores on Classification Test. y-axis: Proportion Positive Scores (0 to 1); x-axis: Base Rate of Condition (.00 to 1.00). Curves labeled NPP and PPP.]
FPR = .10
TPR = .80
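Reading the base rate off the TVS graph amounts to inverting the straight line that runs from FPR at BR = 0 to TPR at BR = 1. The sketch below does that inversion and then computes PPP and NPP at the estimated base rate. The function names are hypothetical, and clipping the estimate to the range 0 to 1 is an added assumption, not something the slides specify.

```python
def estimate_base_rate(pps, tpr, fpr):
    """Invert PPS = FPR + BR * (TPR - FPR) to recover the base rate."""
    br = (pps - fpr) / (tpr - fpr)
    return min(1.0, max(0.0, br))  # assumption: keep the estimate in [0, 1]

def predictive_powers(br, tpr, fpr):
    """Return (PPP, NPP) at a given base rate."""
    ppp = tpr * br / (tpr * br + fpr * (1 - br))
    npp = (1 - fpr) * (1 - br) / ((1 - fpr) * (1 - br) + (1 - tpr) * br)
    return ppp, npp

# The clinic example: 40% positive scores, FPR = .10, TPR = .80.
br = estimate_base_rate(pps=0.40, tpr=0.80, fpr=0.10)
ppp, npp = predictive_powers(br, tpr=0.80, fpr=0.10)
print(f"BR = {br:.2f}, PPP = {ppp:.2f}, NPP = {npp:.2f}")
# prints: BR = 0.43, PPP = 0.86, NPP = 0.86
```

The same two functions apply to any test score once its FPR and TPR estimates and the local proportion of positive scores are known.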
From a sample, observe
rate of positive scores.
Use TVS to estimate
condition BR in that
sample, PPP and NPP
for that BR.
527 criminal
defendants
who took
RMT and VIP
concurrently
Rate of positive
scores in this
sample was .113
PPP = .814
1 – NPP = .077
TOMM
No simulation studies
[Figure: y-axis: Proportion Positive TOMM Scores (0.00 to 1.00); x-axis: Estimated Base Rate of Malingering (0.00 to 1.00).]
FPR = .056, SE = .025
TPR = .742, SE = .093
Beth A. Caillouet, Bernice A. Marcopulos, Jesse G. Brand,
Julie Ann Kent, & Richard I. Frederick
Question: What are the BRs of malingering in the two samples?
Information needed:
Estimates of TOMM FPR and TPR. From TOMM TVS, we
get FPR = .056, TPR = .742.
Sample 1: Secondary gain present. Proportion positive
scores = 55/220 = .25.
Sample 2: Secondary gain absent. Proportion positive
scores = 34/299 = .11.
Use TOMM TVS to estimate BR of each sample.
When PPS = .25,
BR = .28.
When PPS = .11,
BR = .08.
Defensibly choose symptom validity
cut scores that are ideally suited for
their local base rates
M-FAST
TPR = .93, FPR = .17, BR malingering = 35%, N = 86

Column totals from the base rate, then positive scores within each column:
              Malingering     Genuine         Total
MFAST > 5     .93(30)         .17(56)
MFAST < 6
Total         .35(86) = 30    .65(86) = 56     86

Computed cell values:
              Malingering     Genuine         Total
MFAST > 5        28             9.52
MFAST < 6         2            46.48
Total            30            56              86

Rounded to whole persons:
              Malingering     Genuine         Total
MFAST > 5        28            10
MFAST < 6         2            46
Total            30            56              86
TPR = .93
FPR = 10/56 = .18
BR malingering = 35%, N = 86
BR malingering = .35
              Malingering     Genuine         Total
MFAST > 5        28            10              38
MFAST < 6         2            46              48
Total            30            56              86
TPR = 28/30 = .93
FPR = 10/56 = .18
PPP = 28/38 = .737
NPP = .958
[Figure: curves labeled NPP, PPP, 1 – FPR, and TPR.]
BR malingering = .05
              Malingering     Genuine         Total
MFAST > 5        28           103             131
MFAST < 6         2           467             469
Total            30           570             600
TPR = 28/30 = .93
FPR = 103/570 = .18
PPP = 28/131 = .213
NPP = .996
[Figure: curves labeled NPP, PPP, 1 – FPR, and TPR.]
Test validation
summary for
M-FAST cut
score
recommended
by test manual.
PPP does not even reach 50%
correct decisions until BR > .16
At recommended cut score FPR very high
M-FAST > 5
FPR = .17
TPR = .93
At BR = .05, PPP does not exceed
.50 until cut score adjusted to
> 9 on M-FAST
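The dependence of PPP on base rate for the recommended M-FAST cut score can be computed directly. The sketch below uses the rates TPR = .93 and FPR = .17 rather than the whole-person counts, so its results differ slightly from the table values of .737 and .213, which rest on rounded counts and FPR = .18. The function names are hypothetical.

```python
def ppp(base_rate, tpr, fpr):
    """Positive predictive power at a given base rate."""
    true_pos = tpr * base_rate
    false_pos = fpr * (1 - base_rate)
    return true_pos / (true_pos + false_pos)

def base_rate_where_ppp_is_half(tpr, fpr):
    """PPP = .50 when TPR * BR equals FPR * (1 - BR)."""
    return fpr / (tpr + fpr)

tpr, fpr = 0.93, 0.17  # M-FAST > 5
for br in (0.05, 0.35):
    print(f"BR = {br:.2f}  PPP = {ppp(br, tpr, fpr):.3f}")
print(f"PPP reaches .50 at BR = {base_rate_where_ppp_is_half(tpr, fpr):.3f}")
```

With FPR taken as .17 the crossover falls a little above .15, and with FPR = .18 it falls at about .16, in line with the note that PPP does not reach 50% until BR > .16.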
Combining information from local base
rate and multiple symptom validity tests
You can get estimates of PPP and NPP
for the sample you work with—IF you
can reliably estimate the BR.
737 defendants were administered:
Rey 15 Item Memory Test (RMT)—memorize and
reproduce 15 items—very easy test.
Score is items reproduced (0 to 15)
Word Recognition Test (WRT)—memorize 15 words,
identify those 15 and correctly reject 15 from a list of 30.
Score is number of hits and correct rejections (0 to 30)
RMT validated using MGV with clinical probability judgments (Frederick & Bowden, 2009).
RMT < 9
FPR = .025
TPR = .574
We found 726 defendants who completed BOTH RMT and WRT.
81/726 failed the RMT, a proportion of positive scores of .111.
By observation of the TVS, BR = .16, PPP = .814, NPP = .923.
If PPP = .814, then in this sample, the probability of feigning if RMT
is positive, is .814.
If NPP = .923, then in this sample, the probability of feigning if RMT
is negative is .077,
or 1 - .923.
To conduct MGV, we sampled from two groups:
1. The 645 individuals who passed the RMT—had a negative score.
2. The 81 individuals who failed the RMT—had a positive score.
Example of sampling
645 individuals with
negative scores, p(mal) = .077
Sample n = 360
81 individuals with
positive scores, p(mal) = .814
Sample n = 40
400 cases, 10% failures, 90% passes
Overall p(mal) = (40*.814 + 360*.077)/400 = .151
Sample 25 times, plot x = .151, y = observed rate of
positive WRT scores, n for WLS = 400
Group    Ratio    Failures    Passes    N      BR        Samples
1        0           0         645      645    0.077       1
2        0.1        40         360      400    0.1507     25
3        0.2        40         160      200    0.2244     25
4        0.3        40          93      133    0.2981     25
5        0.4        40          60      100    0.3718     25
6        0.5        40          40       80    0.4455     25
7        0.6        40          27       67    0.5192     25
8        0.7        40          17       57    0.5929     25
9        0.8        40          10       50    0.6666     25
10       0.9        40           4       44    0.7403     25
11       1          81           0       81    0.814       1
For each sample, BR was pre-estimated. Then we observed
rate of positive WRT scores at each potential cut score.
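The pre-estimated BR for each sampled group is the average of the individual probabilities: .814 for each person who failed the RMT and .077 for each person who passed. A minimal sketch, with hypothetical names, reproduces three of the BR values from the table.

```python
P_MAL_FAIL = 0.814  # PPP: p(malingering) given a positive RMT score
P_MAL_PASS = 0.077  # 1 - NPP: p(malingering) given a negative RMT score

def mixed_group_base_rate(n_fail, n_pass):
    """Average the individual probabilities over everyone in the group."""
    return (n_fail * P_MAL_FAIL + n_pass * P_MAL_PASS) / (n_fail + n_pass)

print(round(mixed_group_base_rate(40, 360), 4))  # 10% failures
print(round(mixed_group_base_rate(40, 160), 4))  # 20% failures
print(round(mixed_group_base_rate(40, 40), 4))   # 50% failures
```

Each sampled group then contributes one point to the WLS fit: x is this base rate and y is the observed rate of positive WRT scores in the group.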
Word Recognition Test (WRT)
Range 4 to 30, Mean = 23.2
Within group of RMT < 9, mean = 18.7
Within group of RMT > 8, mean = 23.8
Word Recognition Test (WRT)
For every potential cut score of WRT (4 -30), we plotted all x, y pairs
obtained from sampling
We performed WLS to obtain the FPR and TPR estimates
at every potential cut score.
We plotted the FPR and TPR estimates at every potential cut
score to generate the ROC curve.
AUC = .905, SE = .012, 95% CI for AUC = .881-.930.
Best cut scores:
LTE 18 (TPR = .563, FPR = .034)
LTE 19 (TPR = .620, FPR = .066)
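One common way to turn a set of (FPR, TPR) estimates into an area under the ROC curve is the trapezoid rule. The slides do not say how the reported AUC was computed, so the sketch below is an assumption about method. It uses only the two cut scores quoted above, and its result is therefore not expected to match the reported .905, which used every potential cut score.

```python
def auc_trapezoid(points):
    """Area under an ROC curve given (FPR, TPR) points, by the trapezoid rule."""
    # Anchor the curve at (0, 0) and (1, 1), then sort by FPR.
    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Only the two cut scores quoted above (LTE 18 and LTE 19).
roc_points = [(0.034, 0.563), (0.066, 0.620)]
print(round(auc_trapezoid(roc_points), 3))
```

Adding the estimates for the remaining cut scores would fill in the curve between these two points and the corner at (1, 1), raising the area.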
[Figure: LTE 18. Proportion Positive Scores (y-axis, 0 to 1) against Base Rate (x-axis, 0 to 1).]
[Figure: ROC curve for the WORD RECOGNITION TEST (WRT). TPR (y-axis) against FPR (x-axis).]
Summary:
1. We can use tests to form mixed groups for validation.
2. The best estimates of FPR and TPR for a test cut score
allow us to estimate PPP and NPP at our sample BR.
3. Instead of “known groups” design (which is misleading),
we do not presume to know (or care) about the status of
any individual. We assign individuals “probabilities of
having the condition” based on their test score.
4. Mixed groups have an overall “probability of having the
condition,” which is the average of the individual probabilities.
5. We do not need to be certain about group memberships.
We gain much flexibility by working with probabilities of having
the condition vs. certainties of having the condition.
Another example
Dawes 1967 showed that valid probability judgments are
excellent base rate indicators. His work was
substantiated in Frederick 2000 and Frederick and
Bowden 2009.
To conduct MGV, we formed groups of defendants for
whom individual ratings of likelihood of malingering
psychosis were generated by forensic psychologists,
before any testing took place.
The BR of malingered psychosis for each group was then
the mean of the probability rating. If each member of the
group had been rated as 10% likely to feign psychosis,
then the BR of the group was estimated to be 10%.
We then observed the hit rate (proportion positive scores) for
the groups for a variety of F-family indicators of feigning on the
MMPI-2 and MMPI-2-RF.
We formed 15 groups of 30 individuals. For each group, we
had a static base rate, which was the mean of the probability
judgments assigned before testing.
Within each group, we iteratively observed the hit rate of positive
F-family indicators at each potential cut score. Using the BR
estimate and the proportion positive scores at each potential cut
score, we performed WLS to generate estimates of FPR and TPR.
From these estimates, we generated ROC curves.
15 groups, 30 defendants in each group, 450 defendants
Each defendant rated from 0 to 100 before testing, with respect
to likelihood he would feign psychosis.
Groups were formed after first sorting individuals by ratings, from
lowest to highest.
Mean ratings of groups (each group, n = 30):
0, 0, 1.2, 4.2, 5.0, 5.0, 5.0, 5.0, 8.1, 10, 14.5, 22.2, 30.3, 45.7, 72.3
Rates of positive F-family scores at each potential cut observed.
Scale             AUC    SE     95% CI
F                 .904   .015   .874-.933
Fp                .870   .018   .834-.906
Fp (no L items)   .905   .015   .877-.934
F-r               .940   .011   .919-.962
Fp-r              .926   .013   .901-.950
Estimates by Nicholson, Mouton, Bagby, Buis, Peterson,
and Buigas (1998):
AUCs and SEs:
F (.929, .021)
Fp (.885, .027)
Scale             Cutoff   FPR    TPR
F                 GTE 28   .043   .635
Fp                GTE 8    .054   .484
Fp (no L items)   GTE 7    .055   .537
F-r               GTE 20   .050   .640
Fp-r              GTE 8    .055   .652
(GTE = greater than or equal to)
Summary:
1. Estimates of likelihood of feigning based only
on clinician judgment prior to testing did not produce
random results. We can assume that mean probability
judgments were effective base rate estimates.
2. Our estimates for F and Fp are consistent with estimates
from a large, well-validated analysis.
3. In this study, MMPI-2-RF indicators have higher mean
AUC and lower SE than their MMPI-2 counterparts.
Scale   Cutoff   FPR    TPR
F       GTE 28   .043   .635
Combine information about F with the SIRS-2
131 defendants who took MMPI and SIRS
F    Frequency   Percent   Valid Percent   Cumulative Percent
4    4           2.7       3.1             3.1
6    1           .7        .8              3.8
9    1           .7        .8              4.6
10   2           1.3       1.5             6.1
11   3           2.0       2.3             8.4
12   2           1.3       1.5             9.9
13   3           2.0       2.3             12.2
14   4           2.7       3.1             15.3
15   2           1.3       1.5             16.8
16   3           2.0       2.3             19.1
17   3           2.0       2.3             21.4
18   2           1.3       1.5             22.9
19   4           2.7       3.1             26.0
20   5           3.4       3.8             29.8
21   2           1.3       1.5             31.3
22   7           4.7       5.3             36.6
23   3           2.0       2.3             38.9
24   1           .7        .8              39.7
25   5           3.4       3.8             43.5
26   6           4.0       4.6             48.1
27   6           4.0       4.6             52.7
28   7           4.7       5.3             58.0
52.7% of cases are 27 or lower
47.3% of cases are 28 or higher
What is the base rate of feigned psychopathology?
Scale   Cutoff   FPR    TPR
F       GTE 28   .043   .635
[Figure: chart with labels BR, TPR, FPR, NPP, PPP]
What we say:
Within our sample of 131 defendants, the BR of feigned
psychopathology is .73 (NOT .475).
At BR = .73, the PPP of F GTE 28 is .976.
At BR = .73, the NPP of F GTE 28 is .492, so
p(feigning if F LTE 27) is still .508. (Remember, they’re
being given the SIRS for a reason.)
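The predictive powers quoted here follow from Bayes' rule applied to BR, TPR, and FPR. A minimal sketch of that arithmetic (the function name `predictive_powers` is hypothetical, chosen for illustration):

```python
def predictive_powers(br, tpr, fpr):
    """Return (PPP, NPP) for a test with the given TPR and FPR at base rate br."""
    true_pos = br * tpr               # feigning and test positive
    false_pos = (1 - br) * fpr        # not feigning but test positive
    true_neg = (1 - br) * (1 - fpr)   # not feigning and test negative
    false_neg = br * (1 - tpr)        # feigning but test negative
    ppp = true_pos / (true_pos + false_pos)
    npp = true_neg / (true_neg + false_neg)
    return ppp, npp


# F GTE 28 (FPR = .043, TPR = .635) at the sample base rate of .73
ppp, npp = predictive_powers(br=0.73, tpr=0.635, fpr=0.043)
print(round(ppp, 3), round(npp, 3), round(1 - npp, 3))  # 0.976 0.492 0.508
```

The third printed value, 1 – NPP, is the probability of feigning that remains after a negative score (F of 27 or lower).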
[Figure: chart annotated “F < 28”, “F > 27”, and “NPP about .66”]
Application of MGV to a CGV estimation
of FPR and TPR
Greve, Bianchini, Love, Brennan, & Heinly (2006) articulated six separate
groups with increasing base rates of malingering, based on formal criteria
for malingering (the Slick criteria), to validate the MMPI-2 Fake Bad Scale (FBS):
1. No incentive (no evidence of external incentive and no test
performance suggestive of malingering; n = 18, mean FBS = 15.4)
2. Incentive (external incentive, but no test performance suggestive
of malingering; n = 79, mean FBS = 19.5)
3. Suspect (external incentive and at least one indicator suggestive
of malingering; n = 66, mean FBS = 22.7)
4. Statistically Likely (external incentive; at least two indicators
suggestive of malingering; n = 51, mean FBS = 22.8)
5. Probable (external incentive; strong indicators of malingering;
n = 31, mean FBS = 26.9)
6. Definite (external incentive; very strong indicators of
malingering; n = 14, mean FBS = 29.8)
Even though it is clear that
BR Definite > BR Probable > BR Statistically Likely >
BR Suspect > BR Incentive Only > BR No Incentive
to conduct “Known” groups validation, they were required
to ignore this obvious circumstance and to define
BR No Incentive = BR Incentive Only = 0
BR Statistically Likely = BR Probable = BR Definite = 1.0
and to drop all participants defined as Suspect,
yielding the following ROC.
[Figure: FBS ROC generated by “Known” groups validation by Greve & Bianchini]
If we had estimates of the BR for each of the subgroups
formed by Greve and Bianchini, we could use MGV to
estimate FPR and TPR for each potential cut score.
We have our stable estimate of
TOMM FPR and TPR
TOMM, no simulation studies: FPR = .056, SE = .025; TPR = .742, SE = .093
[Figure: Proportion Positive TOMM Scores (y-axis) vs. Estimated Base Rate of Malingering (x-axis)]
We can get estimates of BRs for those groups
from other work by Greve & Bianchini.
They formed similar groups using the Slick
criteria to investigate the TOMM.
We can use the proportion of positive TOMMs
in each of these subgroups to estimate the BRs
in each of them.
From Greve, Bianchini, Doane (2006)
Group      Positive TOMM Scores (%)
No Inc     0
Inc Only   5
Suspect    20
Probable   47
Definite   78
The Test Validation Summary
TOMM: FPR = .056, TPR = .742
Group      Proportion Positive TOMM Scores   Est BR of malingering
No Inc     0                                 0
Inc Only   .05                               0
Suspect    .20                               .21
Probable   .47                               .633
Definite   .78                               1
[Figure: Proportion Positive Scores on TOMM (y-axis) vs. Base Rate of Malingering (x-axis)]
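One plausible way to read BR estimates off the Test Validation Summary is to invert the linear relation p = FPR + (TPR − FPR) × BR and clip the result to the range 0 to 1. The sketch below assumes that reading; the function name `base_rate_from_positive_rate` is hypothetical and this is not necessarily the exact procedure used for the slide.

```python
def base_rate_from_positive_rate(p_positive, tpr, fpr):
    """Invert p = FPR + (TPR - FPR) * BR and clip the result to [0, 1]."""
    br = (p_positive - fpr) / (tpr - fpr)
    return min(1.0, max(0.0, br))


# TOMM estimates from the slide: FPR = .056, TPR = .742
tomm_positive_rates = {"No Inc": 0.00, "Inc Only": 0.05, "Suspect": 0.20,
                       "Probable": 0.47, "Definite": 0.78}
for group, p in tomm_positive_rates.items():
    br = base_rate_from_positive_rate(p, tpr=0.742, fpr=0.056)
    print(group, round(br, 2))
```

Under this assumption the No Inc, Inc Only, Suspect, and Definite groups come out at 0, 0, .21, and 1, matching the table. The Probable group comes out near .60 from the rounded proportion of .47 rather than the tabled .633, so the tabled value may rest on unrounded data or a different reading of the chart. The same inversion applied to F GTE 28 (FPR = .043, TPR = .635) with 47.3% positive scores gives a BR of about .73.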
We take these BR estimates and reapply them
to the Greve & Bianchini FBS data.
Example of MGV for FBS based on BR estimates for
Greve & Bianchini groups established by Slick criteria
Base Rate of Malingering   Pr + Tests   n
0                          .11          18
0                          .09          79
.21                        .23          66
.633                       .52          31
1                          .79          14
For FBS > 27, using WLS Regression, FPR = .091, TPR = .773
(For WLS, n is the weighted variable)
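The regression can be sketched in plain Python using the five groups listed above. This is a minimal illustration, not the original analysis: the helper `wls_line` is hypothetical, the inputs are the rounded values shown on the slide, and group size n is used directly as the weight.

```python
def wls_line(x, y, w):
    """Weighted least squares fit of y = intercept + slope * x."""
    total = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / total
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / total
    sxy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


base_rates = [0, 0, 0.21, 0.633, 1]              # estimated BR per group
prop_positive = [0.11, 0.09, 0.23, 0.52, 0.79]   # proportion with FBS > 27
n = [18, 79, 66, 31, 14]                         # group sizes, used as weights

intercept, slope = wls_line(base_rates, prop_positive, n)
fpr = intercept           # expected positive rate at BR = 0
tpr = intercept + slope   # expected positive rate at BR = 1
print(round(fpr, 3), round(tpr, 3))
```

With these rounded inputs the fit lands close to the reported FPR = .091 and TPR = .773; small differences in the third decimal are expected because the slide values are rounded.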
Evaluate constructs that underlie
symptom validity tests
10 clinical studies using Rey 15-Item Test (no simulators; all clinical data)
FPR = .054, SE = .037; TPR = .570, SE = .119
[Figure: Probability RMT Score < 9 (y-axis) vs. Estimated Base Rate of Malingering (x-axis)]
RMT validation using MGV with clinical probability judgments
(Frederick & Bowden, 2009).
FPR = .025
TPR = .574
CI: TPR = .574, SE = .044
We will generate a TVS based on these values and find PPP and
1 – NPP to estimate the probability of bad intent represented
by the RMT score.
                                       Low Effort             High Effort
Intends to respond correctly           Inconsistent/Invalid   Compliant/Valid
Does not intend to respond correctly   Irrelevant/Invalid     Suppressed/Invalid
Validity Indicator Profile
VIP Verbal Subtest Items
• Easy:
Baby
Drink
Infant
• Moderate:
People
Ally
Folk
• Difficult:
Nimiety
Conceit
Surfeit
[Figure: VIP performance curve types (Compliant, Irrelevant, Suppression, Malingering) across Sectors 1, 2, and 3. Y-axis: Proportion Correct. X-axis: Running Mean Serial Position.]
[Figure: Compliant VIP performance curve across Sectors 1, 2, and 3, running from easy items to difficult items. Annotations: “Not guessing, knowledgeable responding”; “Guessing is imminent”; “Guessing”. Y-axis: Proportion Correct. X-axis: Running Mean Serial Position.]
[Figure: Careless curves and Inconsistent curves. Y-axis: Proportion Correct. X-axis: Running Mean Serial Position.]
527 criminal defendants who took RMT and VIP concurrently
Rate of positive scores in this sample was .113
PPP = .814
1 – NPP = .077
Here we are matching VIP categories to the construct most likely
captured by the VIP.
FPR = 0, SE = .028; TPR = .859, SE = .130
BR of .42 estimated for this group is the mean of PPP for positive RMT
scores in this group and (1 – NPP) for negative RMT scores in the group.
Points in scatterplot represent groups of 25 individuals.
Sorted defendants by clinical ratings of malingering, then took 20
groups of 25 and one group of 27, for 527 defendants.
[Figure: Probability of Suppression or Irrelevant on VIP (y-axis) vs. Probability Feigned Cognitive Impairment Given by RMT < 9 (x-axis)]
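The chain from the RMT positive rate to a group base rate can be sketched as follows. This is a minimal illustration under stated assumptions: both function names are hypothetical, the RMT values FPR = .025 and TPR = .574 are the Frederick & Bowden (2009) estimates quoted earlier, and the example group is made up rather than taken from the study.

```python
def rmt_probabilities(positive_rate, tpr, fpr):
    """Return (PPP, 1 - NPP) implied by a sample's rate of positive scores."""
    br = (positive_rate - fpr) / (tpr - fpr)   # sample base rate
    ppp = br * tpr / positive_rate             # P(feigning | positive)
    one_minus_npp = br * (1 - tpr) / (1 - positive_rate)  # P(feigning | negative)
    return ppp, one_minus_npp


def group_base_rate(rmt_positive_flags, ppp, one_minus_npp):
    """Mean probability of feigning across the members of one group."""
    probs = [ppp if flag else one_minus_npp for flag in rmt_positive_flags]
    return sum(probs) / len(probs)


# Sample positive rate of .113 with RMT FPR = .025 and TPR = .574
ppp, one_minus_npp = rmt_probabilities(0.113, tpr=0.574, fpr=0.025)
print(round(ppp, 3), round(one_minus_npp, 3))  # 0.814 0.077

# Hypothetical group of 25 with 12 positive RMT scores (not study data)
flags = [True] * 12 + [False] * 13
print(round(group_base_rate(flags, ppp, one_minus_npp), 2))  # 0.43
```

Each defendant contributes PPP when the RMT is positive and 1 – NPP when it is negative, so a group's base rate depends only on how many of its members scored positive.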
Same 21 subgroups, N = 527 defendants
[Figure: Probability Inconsistent or Lower on VIP (y-axis) vs. Probability Feigned Cognitive Impairment (x-axis)]
527 criminal defendants
VRIN was converted to “probability of invalid responding” by
dividing the VRIN raw score by 12. VRIN raw scores > 12 were
assigned p = 1.
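The VRIN conversion described here is a simple capped ratio. A minimal sketch (the function name `vrin_to_probability` is hypothetical):

```python
def vrin_to_probability(raw_score, ceiling=12):
    """Map a VRIN raw score to a probability of invalid responding.

    Scores are divided by the ceiling (12); scores above it are capped at 1.
    """
    return min(raw_score / ceiling, 1.0)


for raw in (0, 6, 12, 15):
    print(raw, vrin_to_probability(raw))
```

A raw score of 6 maps to .5, and any raw score of 12 or more maps to 1.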
We are interested in FPR and TPR for “Invalid”