The Dance between Accuracy and Bias

Issues in Measuring
Individual Differences in
Judgmental Accuracy
David A. Kenny
University of Connecticut
http://davidakenny.net/doc/Ghent14.ppt
Who Am I?
A social psychologist who studies
dyads.
Keenly interested in dyadic questions
of consensus, reciprocity, and
accuracy.
I study dyads in which each person is
paired with the same set of
persons:
-- round robin design
-- block design
I Also Study…
Dyadic designs in which one person is
paired with just one partner.
-- romantic couples
-- supervisor-supervisee
One model that I have investigated is the
Actor-Partner Interdependence Model.
Can be used to study:
Effects of individual differences (e.g.,
EI) on relational outcomes.
Accuracy in interpersonal perception.
Judgmental Accuracy
Does the judge know the
emotional state, thoughts,
intent, or personality of a
target?
A Renewed Interest in
Individual Differences
• Interest in Emotional
Intelligence (EQ)
• Models that provide a
framework for understanding
judge moderators
• Interest in neurological deficits
Strategies To Measure
Individual Differences
• Standardized Scales
– e.g., PONS
– measure the number correct
• Random Effects Models: SAM
– model the judgment
– search for mediators (biases)
(Presuming that all judges evaluate
the same targets/items.)
Standardized Scales
Develop a pool of items
Pick the “good” items
Establish reliability as
measured by internal
consistency
Capitalization
on Chance
“Good” items may not be
so good in another
sample.
Examples …
Follow-up Alphas
Scale     Initial   Follow-up
CARAT     .56       .46
IPT-30    .52       .29
IPT-15    .38       .24
Low Reliability of Scales
Scale     α       IIC
CARAT     .46     .028
IPT-30    .29     .013
IPT-15    .24     .039
PONS      .86     .021
Eyes      .49     .026
Note: IIC = inter-item correlation
Maybe an Inter-Item Correlation
of .03 Is Not All that Bad?
Peabody Picture Vocabulary Test: .08
Beck Depression: .30
Bem M/F Scale: .19
Rosenberg Self-Esteem: .34
I guess it is bad.
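To see what an IIC near .03 implies for reliability, the standardized (Spearman-Brown) form of alpha can be computed from the average inter-item correlation and the number of items. This is a minimal sketch: the function name and the test lengths in the loop are illustrative assumptions, not values from the talk.

```python
def standardized_alpha(mean_iic: float, n_items: int) -> float:
    """Standardized alpha implied by an average inter-item correlation.

    Uses the Spearman-Brown form: k * r / (1 + (k - 1) * r).
    """
    return n_items * mean_iic / (1.0 + (n_items - 1) * mean_iic)


# Hypothetical test lengths, all with the IIC of .03 discussed above.
for k in (15, 30, 100, 200):
    print(k, round(standardized_alpha(0.03, k), 2))
```

With so small an IIC, alpha climbs only slowly with test length, so a short test is stuck with a low alpha.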
Why So Low?
• Bad Ideas
• No individual differences
• Abandon the traditional
psychometric approach
• Better Ideas
• Multidimensionality
• Average item difficulty
No Individual Differences?:
NO!
People perform well above chance.
Difficult to believe that there is a
skill but no individual differences.
Validity evidence
Tests correlate in theoretically
meaningful ways with
antecedents and consequents.
Abandon the Psychometric
Approach
Argument that internal consistency
estimates are inappropriate because
constructs are multidimensional.
Other forms of reliability (test-retest)
are appropriate.
However, some sort of internal
consistency measure (e.g., split-half) is
still desirable.
Multidimensionality
• Tests are often multidimensional.
– Different emotions
– Different aspects of the target
• Channels: auditory vs. visual
• Information: face vs. body
• Items tapping different dimensions will be
weakly correlated which lowers IIC.
• In some cases, split-half reliability is more
sensible.
• Multidimensionality explains some but not
all of the low IICs.
Item Difficulty
• It turns out that average item
difficulty has a dramatic effect on
reliability.
• Obviously if items are either too
difficult or easy, reliability will be
poor.
• What is the optimal difficulty?
Assume
• Two response alternatives.
• Allow for guessing
• What is the ideal average item
difficulty?
• 75%?
• Simulation model that varies
average item difficulty…
CTT vs. IRT
• An internal consistency
measure is based on Classical
Test Theory (CTT)
• Since about 1970, in testing
CTT has been discarded and
Item Response Theory (IRT)
has been adopted.
Item Response Theory (IRT)
• r is judge ability, assumed to have a
normal distribution
• d is item difficulty; let $f = \ln[d/(1 - d)]$
– An f of 0 (d = .5) implies a judge with r = 0 has a 50/50 chance of
being right.
• probability that the judge is correct:
$\frac{e^{r-f}}{1 + e^{r-f}}$
(e approximately equals 2.718)
• allow for guessing, where g is the probability of guessing correctly:
$\frac{e^{r-f}}{1 + e^{r-f}} + g\left[1 - \frac{e^{r-f}}{1 + e^{r-f}}\right]$
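The response model above can be written as a small function. This is a sketch under the slide's definitions (r for judge ability, f for the logit of item difficulty, g for the guessing rate). The function name `p_correct` is an assumption made for illustration.

```python
import math


def p_correct(r: float, f: float, g: float = 0.0) -> float:
    """Probability of a correct answer for ability r and item logit f.

    p_know is the logistic term e^(r-f) / (1 + e^(r-f)); a judge who does
    not know the answer still guesses correctly with probability g.
    """
    p_know = 1.0 / (1.0 + math.exp(-(r - f)))
    return p_know + g * (1.0 - p_know)


# With two response alternatives, a blind guess is right half the time.
print(p_correct(r=0.0, f=0.0))         # no guessing
print(p_correct(r=0.0, f=0.0, g=0.5))  # guessing allowed
```

With g = 0.5, the probability of a correct answer can never fall below one half, however hard the item is.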
[Figure: Alpha (vertical axis, 0.0 to 1.0) plotted against Proportion Correct (horizontal axis, 0.5 to 1.0), with one curve for SD = 1.0 and one for SD = 0.5, where SD describes the variability of judge ability r.]
Interpretation
• Curves peak when proportion correct is in the high .80s
• Predicted by IRT (high .80s)
• Better to design “easy” tests
Why?
• Performance of low ability
judges is almost entirely due to
chance. If you want to
discriminate low ability judges,
you need an easy test.
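A simulation of this kind can be sketched with the standard library alone. This is not the author's simulation. It assumes that every item shares the same logit difficulty f, that ability r is normal with mean 0, that there are two response alternatives (g = 0.5), and that judge and item counts can be chosen arbitrarily. All function names are hypothetical.

```python
import math
import random
from statistics import pvariance


def cronbach_alpha(scores):
    """Cronbach's alpha for a judges-by-items matrix of 0/1 scores."""
    k = len(scores[0])
    item_vars = [pvariance([row[j] for row in scores]) for j in range(k)]
    total_var = pvariance([sum(row) for row in scores])
    return (k / (k - 1)) * (1.0 - sum(item_vars) / total_var)


def simulate(f, sd_r=1.0, n_judges=400, n_items=50, g=0.5, seed=1):
    """Return (proportion correct, alpha) for items of logit difficulty f."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_judges):
        r = rng.gauss(0.0, sd_r)                   # judge ability
        p_know = 1.0 / (1.0 + math.exp(-(r - f)))  # logistic response model
        p = p_know + g * (1.0 - p_know)            # add guessing
        scores.append([1 if rng.random() < p else 0 for _ in range(n_items)])
    mean_correct = sum(sum(row) for row in scores) / (n_judges * n_items)
    return mean_correct, cronbach_alpha(scores)


# Sweep difficulty from hard (large f) to easy (small f).
for f in (3.0, 1.5, 0.0, -1.5, -3.0):
    prop, alpha = simulate(f)
    print(f"f={f:+.1f}  proportion correct={prop:.2f}  alpha={alpha:.2f}")
```

When items are very hard, nearly every response is a guess, so judges differ little in their true scores and alpha collapses.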
[Figure: PONS Subtests. Alpha (vertical axis, 0 to 1) plotted against Proportion Correct (horizontal axis, 0.5 to 1) for the subtests, with a curve for SD = 0.33.]
Modeling Accuracy: Random
Effects Model
Social Accuracy Model (SAM)
Biesanz, J. C. (2010). The social accuracy
model of interpersonal perception:
Assessing individual differences in
perceptive and expressive accuracy.
Multivariate Behavioral Research, 45,
853-885.
The Truth and Bias Model: T&B
West, T. V., & Kenny, D. A. (2011). The
truth and bias model of judgment.
Psychological Review, 118, 357-378.
SAM & T&B Models
• Theoretical and empirical frameworks
designed to address the basic questions
of how accuracy and bias operate, and
the nature of their interdependence
• In studying accuracy, truth (T) becomes
a predictor.
• Accuracy is a slope or an effect, not the
outcome or a summed score.
• Most prior work has not looked at
emotion.
Bias
• T&B: How strongly judgments are
pulled away from the mean level of the
truth toward one of the poles of the
judgment continuum: For example,
perceivers are generally biased to think
others are telling the truth.
• SAM: Components analogous to
Cronbach’s: elevation, differential
elevation, and stereotype accuracy.
Example
• Christensen 1981 dissertation
• 12 targets and 103 judges
• Each target tells three stories, two true
and one false.
• Outcome is dichotomous: the judgment
is that the story is true or false.
Variables
• Judgment
• Is the target telling the truth
(1) or a lie (0)?
• Truth
• Equals 1 if the target is
telling the truth
• Equals -1 if the target is
lying.
SAM Model: Fixed Effects
• Intercept: Overall Bias
–Do people on average tend
to think others are telling
the truth?
• Truth: Overall Skill
–The overall effect of Truth
on the judgment of truth
telling
SAM Model: Random Effects
• Judge: Personal Bias
• Truth x Judge: Individual Differences in
Judges’ Skill
• Target: Demeanor Bias
• Truth x Target: Readability or
Expressiveness
• Judge x Target: Relational Demeanor Bias
• Truth x Judge x Target: Relational Skill
(black and blue effects may be correlated)
Analysis
• R’s glmer (lme4 package) with a logit link
• Analysis can also be done within SAS,
HLM, Mplus, or MLwiN (and likely
Stata)
• Analysis can take a very long time,
especially with Judge x Target as a
random variable.
Results: Fixed Effects
Effect      Estimate   Std. Error   p
Intercept   0.526      0.094        <.001
Truth       0.364      0.126        .004
Intercept is a logit. It implies that when
Truth = 0, judges think that there is .629 chance
the person is telling the truth. People are “biased”
to think targets are telling the truth.
Truth is a logit difference or log of an odds
ratio. It corresponds to a .709 chance of being
right if someone is telling truth and .460 chance if
telling a lie.
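The probabilities quoted above follow from applying the inverse logit to the fixed effects, with Truth coded 1 for truth and -1 for a lie. A short sketch of that arithmetic (the function and variable names are illustrative):

```python
import math


def inv_logit(x: float) -> float:
    """Convert a logit to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


intercept = 0.526  # overall bias (logit), from the fixed effects table
truth = 0.364      # overall skill (logit), from the fixed effects table

# Probability of judging "truth" when Truth = 0 (the bias alone).
p_bias = inv_logit(intercept)

# Probability of being right when the target tells the truth (Truth = 1).
p_right_truth = inv_logit(intercept + truth)

# Probability of being right when the target lies (Truth = -1):
# the judge is right only by NOT judging "truth".
p_right_lie = 1.0 - inv_logit(intercept - truth)

print(round(p_bias, 3), round(p_right_truth, 3), round(p_right_lie, 3))
```

Because Truth is coded 1 and -1, the two conditions differ by twice the Truth coefficient on the logit scale.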
Results: Random Effects
No evidence of a Judge x Target
variance for either skill or bias effects.
Thus, people are not consistently
better at judging some people than
other people and the bias to think that
a person is lying is the same for all
judges.
Also skill and bias variables are
uncorrelated.
Results:
Variances of Random Effects
                 Variance
Component        Absolute   Relative
Skill
  Judge          0.133      .032
  Target         0.156      .038
Bias
  Judge          0.496      .121
  Target         0.031      .008
Error            3.290      .801
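The relative variances are each component's share of the total variance. The sketch below reproduces them from the absolute values. It also notes that the Error value is very close to π²/3, the variance of the standard logistic distribution, which is presumably where it comes from in a logit model. That reading is an assumption and is not stated in the slides.

```python
import math

# Absolute variance components from the table above.
variances = {
    ("Skill", "Judge"): 0.133,
    ("Skill", "Target"): 0.156,
    ("Bias", "Judge"): 0.496,
    ("Bias", "Target"): 0.031,
    ("Error", ""): 3.290,
}

total = sum(variances.values())
relative = {name: value / total for name, value in variances.items()}

for (kind, source), share in relative.items():
    print(f"{kind:6s}{source:8s}{share:.3f}")

# Assumed origin of the error term: variance of the standard logistic.
print(round(math.pi ** 2 / 3, 3))
```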
Follow-up Analyses
• Traditional
–Compute estimates of skill and
correlate them with other
variables.
• Look for mediators and moderators
–A Brunswikian analysis
Traditional
• Can obtain estimates of skill (called
Empirical Bayes).
–Predicted slope adjusted for level of
knowledge (i.e., regression towards
the mean).
• Can correlate these estimates with
other variables for validity studies.
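The shrinkage behind an Empirical Bayes estimate can be illustrated with the textbook formula for a normal random effect. This is a simplified sketch, not the computation a mixed-model program performs for a logistic model. The mean slope (0.364) and the judge skill variance (0.133) come from the results above. The judge's own slope and its sampling variance are hypothetical values chosen for illustration.

```python
def shrunken_slope(own_slope, mean_slope, slope_variance, sampling_variance):
    """Pull a judge's own slope toward the mean slope.

    The weight on the judge's own estimate is its reliability:
    true variance / (true variance + sampling variance).
    """
    reliability = slope_variance / (slope_variance + sampling_variance)
    return mean_slope + reliability * (own_slope - mean_slope)


# Hypothetical judge: raw slope of 1.0 estimated with sampling variance 0.4.
estimate = shrunken_slope(
    own_slope=1.0,
    mean_slope=0.364,
    slope_variance=0.133,
    sampling_variance=0.4,
)
print(round(estimate, 3))
```

The noisier a judge's own estimate, the more it is pulled toward the average slope.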
Brunswikian Analysis
• Look for a cue, e.g., eye contact in
the Christensen study.
• See if the cue explains accuracy.
• T&B treats a cue as a bias, but a
bias can lead to accuracy.
Cue as a Mediator
[Path diagram: Truth → Cue (path a); Cue → Judgment (path b); Truth → Judgment (path c')]
Effects
• Terms
– a: cue validity
– b: cue utilization
– ab: achievement (indirect effect)
– c’: accuracy not explained by the cue (direct effect)
– c = ab + c’: total effect
• Individual Differences: a, b (and so ab), and
c’ can vary by
– Judges
– Targets
• Thus, there might be moderated mediation.
A Surprising Result
• Assume a and b are non-zero and c’ equals
zero. Thus, the effect is completely
mediated, i.e., the total effect or c would
equal indirect effect or ab.
• You would think that the power of the two
tests is about the same.
• It is not! There can be substantially
more power in the test of the indirect effect.
The test of c can need 75 times the number of
cases to have the same power as the test of
ab!
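The power gap can be seen in a small Monte Carlo sketch. It makes several simplifying assumptions that are not from the talk: a linear model with continuous variables, hypothetical paths a = b = 0.3 with c' = 0, n = 100, a joint-significance test (both a and b significant) as the test of the indirect effect, and an approximate two-tailed critical t of 1.985. It shows the direction of the effect and does not reproduce the factor of 75.

```python
import math
import random


def corr(x, y):
    """Pearson correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


def t_from_r(r, df):
    """t statistic for a (partial) correlation with df degrees of freedom."""
    return r * math.sqrt(df / (1.0 - r * r))


def simulate_power(a=0.3, b=0.3, n=100, reps=500, t_crit=1.985, seed=7):
    """Return (power of test of c, power of joint test of a and b)."""
    rng = random.Random(seed)
    hits_c = hits_ab = 0
    for _ in range(reps):
        truth = [rng.gauss(0, 1) for _ in range(n)]
        cue = [a * t + rng.gauss(0, 1) for t in truth]
        judgment = [b * m + rng.gauss(0, 1) for m in cue]  # c' = 0
        r_tj, r_tc, r_cj = corr(truth, judgment), corr(truth, cue), corr(cue, judgment)
        # Partial correlation of cue with judgment, controlling for truth (path b).
        r_b = (r_cj - r_tc * r_tj) / math.sqrt((1 - r_tc ** 2) * (1 - r_tj ** 2))
        sig_c = abs(t_from_r(r_tj, n - 2)) > t_crit
        sig_a = abs(t_from_r(r_tc, n - 2)) > t_crit
        sig_b = abs(t_from_r(r_b, n - 3)) > t_crit
        hits_c += sig_c
        hits_ab += sig_a and sig_b
    return hits_c / reps, hits_ab / reps


power_c, power_ab = simulate_power()
print(f"power of c: {power_c:.2f}   power of ab (joint test): {power_ab:.2f}")
```

The total effect c equals the product ab, which is small when a and b are modest, while each path taken alone is much easier to detect.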
Conclusion
• Measurement of individual differences in
this area is difficult.
• It is important to move beyond the
measurement of a global score. Need to
model the process of judgment by judges
of targets.
• Obviously there are difficult analysis issues
that are not discussed in much detail in
this talk.
http://davidakenny.net/doc/Ghent14.ppt
Thank
You!