What’s New in the I/O Testing and
Assessment Literature That’s
Important for Practitioners?
Paul R. Sackett
New Developments in the
Assessment of Personality
Topic 1: A faking-resistant approach to
personality measurement
• Tailored Adaptive Personality Assessment System (TAPAS)
• Developed for Army Research Institute by Drasgow
Consulting Group
• Multidimensional Pairwise Preference Format combined
with applicable Item Response Theory model
• Items are created by pairing statements from different
dimensions that are similar in desirability and trait “location”
(a sketch of this pairing logic follows the example item below)
• Example item: “Which is more like you?”
• __1a) People come to me when they want fresh ideas.
• __1b) Most people would say that I’m a “good listener”.
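A minimal sketch of that pairing logic in Python. The statements, desirability ratings, trait locations, and tolerances below are all invented for illustration; the operational TAPAS assembly procedure, and the ideal-point IRT model used to score responses, are considerably more sophisticated.

```python
# Minimal sketch of multidimensional pairwise-preference item assembly.
# All statements, ratings, and tolerances are invented for illustration;
# operational TAPAS assembly (and its ideal-point IRT scoring) is far
# more sophisticated.
from itertools import combinations

# Each statement: (dimension, desirability rating, trait location, text)
statements = [
    ("Openness",          4.1,  0.8, "People come to me when they want fresh ideas."),
    ("Agreeableness",     4.0,  0.7, "Most people would say that I'm a 'good listener'."),
    ("Extraversion",      2.5, -0.9, "I prefer to let others lead the conversation."),
    ("Conscientiousness", 4.2,  0.9, "I double-check my work before submitting it."),
]

DESIRABILITY_TOL = 0.3  # paired statements must look about equally attractive
LOCATION_TOL = 0.3      # and sit at similar trait levels

def build_pairs(stmts):
    """Pair statements from different dimensions that are matched on
    desirability and trait location, so neither option is the obvious
    'fake-good' choice."""
    pairs = []
    for a, b in combinations(stmts, 2):
        same_dim = a[0] == b[0]
        matched = (abs(a[1] - b[1]) <= DESIRABILITY_TOL
                   and abs(a[2] - b[2]) <= LOCATION_TOL)
        if not same_dim and matched:
            pairs.append((a[3], b[3]))
    return pairs

for s1, s2 in build_pairs(statements):
    print("Which is more like you?\n  a)", s1, "\n  b)", s2)
```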
A faking-resistant approach to personality
measurement (continued)
• Extensive work shows it is faking-resistant
• A non-operational field study in the Army showed useful
prediction of attrition, disciplinary incidents, completion of basic
training, and adjustment to Army life, among other criteria
• Now in operational use on a trial basis
Drasgow, F., Stark, S., Chernyshenko, O. S., Nye, C. D., & Hulin, C. L. (2012).
Development of the Tailored Adaptive Personality Assessment System (TAPAS) to
support Army selection and classification decisions. Technical Report 1311. Army
Research Institute.
Topic 2: The Value of Contextualized
Personality Items
• A new meta-analysis documents the higher predictive power
obtained by “contextualizing” items (e.g., asking about
behavior at work, rather than behavior in general)
• Mean r with supervisory ratings for work context vs. general:
– Conscientiousness: .30 vs. .22
– Emotional Stability: .17 vs. .12
– Extraversion: .25 vs. .08
– Agreeableness: .24 vs. .10
– Openness: .19 vs. .02
Shaffer, J.A., & Postlethwaite, B. E. (2012). A matter of context: A meta-analytic
investigation of the relative validity of contextualized and noncontextualized
personality measures. Personnel Psychology, 65, 445-494.
Topic 3: Moving from the Big 5 to Narrower
Dimensions
• DeYoung, Quilty, and Peterson (2007) suggested the following:
– Neuroticism:
• Volatility - irritability, anger, and difficulty controlling emotional impulses
• Withdrawal - susceptibility to anxiety, worry, depression, and sadness
– Agreeableness:
• Compassion - empathetic emotional affiliation
• Politeness - consideration and respect for others’ needs and desires
– Conscientiousness:
• Industriousness - working hard and avoiding distraction
• Orderliness - organization and methodicalness
– Extraversion:
• Enthusiasm - positive emotion and sociability
• Assertiveness - drive and dominance
– Openness to Experience:
• Intellect - ingenuity, quickness, and intellectual engagement
• Openness - imagination, fantasy, and artistic and aesthetic interests
DeYoung, C. G., Quilty, L. C., & Peterson, J. B. (2007). Between facets and domains: 10 aspects of the
Big Five. Journal of Personality and Social Psychology, 93, 880-896.
Moving from the Big 5 to Narrower Dimensions
(continued)
• Dudley et al. (2006) show the value of this perspective
– Four conscientiousness facets: achievement, dependability,
order, and cautiousness
– Validity was driven largely by the achievement and/or
dependability facets, with relatively little contribution from
cautiousness and order
– Achievement receives the dominant weight in predicting task
performance, while dependability receives the dominant weight in
predicting counterproductive work behavior (the regression-weight
logic is sketched below)
Dudley, N. M., Orvis, K. A., Lebiecki, J. E., & Cortina, J. M. (2006). A meta-analytic investigation of
conscientiousness in the prediction of job performance: Examining the intercorrelations and the
incremental validity of narrow traits. Journal of Applied Psychology, 91, 40-57.
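"Dominant weight" here refers to standardized regression weights obtained when the facets jointly predict a criterion, computed from a correlation matrix. A minimal Python sketch of that computation, using invented correlations rather than Dudley et al.'s actual meta-analytic estimates:

```python
# Sketch of how "dominant weight" is established: standardized regression
# weights for facets jointly predicting a criterion, computed from a
# correlation matrix. All values below are illustrative, not Dudley
# et al.'s estimates.
import numpy as np

facets = ["achievement", "dependability", "order", "cautiousness"]

# Hypothetical facet intercorrelation matrix R
R = np.array([
    [1.00, 0.45, 0.40, 0.35],
    [0.45, 1.00, 0.40, 0.40],
    [0.40, 0.40, 1.00, 0.35],
    [0.35, 0.40, 0.35, 1.00],
])
# Hypothetical facet-criterion validities for task performance
r = np.array([0.25, 0.18, 0.10, 0.08])

beta = np.linalg.solve(R, r)  # standardized regression weights: R beta = r
for name, b in zip(facets, beta):
    print(f"{name:>13}: beta = {b:+.3f}")
```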
Topic 4: The Use of Faking Warnings
• Landers et al. (2011) administered a warning after 1/3 of the
items to managerial candidates exhibiting what they called
“blatant extreme responding”
• The rate of extreme responding was halved after the warning
(one plausible flagging rule is sketched after the reference below)
•
Landers, R. N., Sackett, P. R., & Tuzinski, K. A. (2011). Retesting after initial
failure, coaching rumors, and warnings against faking in online personality
measures for selection. Journal of Applied Psychology, 96(1), 202.
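A minimal sketch of how such a mid-test trigger might work. The endpoint-proportion rule and the cutoff below are assumptions for illustration; Landers et al.'s operational definition of "blatant extreme responding" may differ.

```python
# One plausible way to flag "blatant extreme responding" mid-test:
# the proportion of items answered at the desirable scale endpoint.
# The rule and cutoff are illustrative assumptions; Landers et al.'s
# (2011) operational trigger may differ.

def flag_extreme_responding(responses, scale_max=5, cutoff=0.90):
    """Return True if the share of endpoint responses meets the cutoff."""
    endpoint = sum(1 for r in responses if r == scale_max)
    return endpoint / len(responses) >= cutoff

# Example: check after the first third of a 99-item measure
first_third = [5] * 31 + [4, 5]
if flag_extreme_responding(first_third):
    print("Display warning against faking before continuing.")
```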
More on the Use of Faking Warnings
• Nathan Kuncel suggests three potentially relevant
goals when individuals take a personality test:
– be impressive
– be credible
– be true to oneself
More on the Use of Faking Warnings
• Jenson and Sackett (2013) suggested that
“priming” concern for being credible could
reduce faking.
• Test-takers who scheduled a follow-up interview
just before taking the personality test obtained
lower scores than those who did not
Jenson, C. E., & Sackett, P. R. (2013). Examining ability to fake and
test-taker goals in personality assessments. SIOP presentation.
New Developments in the
Assessment of Cognitive Ability
A cognitive test with reduced adverse impact
• In 2011, SIOP awarded its M. Scott Myers Award for applied
research to Yusko, Goldstein, Scherbaum, and Hanges for
the development of the Siena Reasoning Test
• This is a nonverbal reasoning test, using unfamiliar item
content, such as made-up words (if a GATH is larger than a
SHET…) and figures
• Concept is that adverse impact will be reduced by
eliminating content with which groups have differential
familiarity
Validity and subgroup d for Siena Test
• Black-White d commonly in the .3-.4 range
• Sizable number of validity studies, with validities in
the range commonly seen for cognitive tests.
• In one independent study, HumRRO researchers
included the Siena along with another cognitive test;
corrected validity was .45 for the other test (d = 1.0) and
.35 for the Siena (d = .38) (SIOP 2010: Paullin, Putka, and
Tsacoumis)
Why the reduced d?
• Somewhat of a puzzle. There is a history of using nonverbal reasoning tests
– Raven’s Progressive Matrices
– Large sample military studies in Project A
• But these do not show the reduced d that is seen with the
Siena Test
• Things to look into: does d vary with item difficulty, and how
does the Siena compare with other tests? (a sketch of the first
check follows the note below)
• (Note: nothing published to date that I am aware of. Some PowerPoint
decks from SIOP presentations can be found online: search for “Siena
Reasoning Test”)
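A minimal sketch of that first check: compute a per-item standardized Black-White difference and correlate it with item difficulty. The data are simulated, since no Siena item-level data are public.

```python
# Sketch for the "things to look into" question: does d vary with item
# difficulty? Computes a per-item standardized difference and correlates
# it with item difficulty (proportion correct). Data are simulated.
import numpy as np

rng = np.random.default_rng(42)
n_items, n_per_group = 40, 500

# Simulated 0/1 item responses for two groups; difficulty varies by item
p_white = rng.uniform(0.3, 0.9, n_items)
p_black = np.clip(p_white - rng.uniform(0.0, 0.15, n_items), 0.05, 0.95)
white = rng.binomial(1, p_white, (n_per_group, n_items))
black = rng.binomial(1, p_black, (n_per_group, n_items))

def item_d(a, b):
    """Standardized mean difference per item, using the pooled SD."""
    pooled = np.sqrt((a.var(0, ddof=1) + b.var(0, ddof=1)) / 2)
    return (a.mean(0) - b.mean(0)) / pooled

d = item_d(white, black)
difficulty = np.concatenate([white, black]).mean(0)  # proportion correct
print("corr(item difficulty, item d) =",
      np.round(np.corrcoef(difficulty, d)[0, 1], 2))
```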
New Developments in Situational
Judgment Testing
Sample SJT item
• You find yourself in an argument with several co-workers
about who should do a very disagreeable, but routine
task. Which of the following would be the most effective
way to resolve this situation?
• (a) Have your supervisor decide, because this would
avoid any personal bias.
• (b) Arrange for a rotating schedule so everyone shares
the chore.
• (c) Let the workers who show up earliest choose on a
first-come, first-served basis.
• (d) Randomly assign a person to do the task and don't
change it.
Key findings
• Extensive validity evidence
• Can measure different constructs (problem
solving, communication skills, integrity, etc.)
• Incremental validity over ability and personality
• Small subgroup differences, except for
cognitively-oriented SJTs
• Items can be presented in written form or by
video; recent move to animation rather than
recording live actors
Lievens, Sackett, and Buyse (2009):
Comparing response instructions
• Ongoing debate re “would do” vs. “should do”
instructions
• Lievens et al. randomly assigned Belgian
medical school applicants to “would do” or
“should do” instructions in an operational
interpersonal-skills SJT; did the same with a student sample
Lievens, Sackett, and Buyse (2009):
Comparing response instructions
• In operational setting, all gave “should do”
responses
– So: we’d like to know “would do”, but in effect,
can only get “should do”
Arthur et al. (2014): Comparing response
formats
• Compared three options (scoring under each format is sketched below):
– Rate the effectiveness of each response
– Rank the responses
– Choose the best and worst response
• 20-item integrity-oriented SJT
• Administered to over 30,000 retail/hospitality
job applicants
• On-line admin; each format used for one week
• “Rate each response” emerges as superior:
– Higher reliability
– Lower correlation with cognitive ability
– Smaller gender mean difference
– Higher correlation with conceptually relevant
personality dimensions (conscientiousness,
agreeableness, emotional stability)
• Follow-up study with student sample
– Higher retest reliability
– More favorable reactions
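A minimal sketch of scoring under the three formats for a single four-option item. The SME effectiveness key, the scoring rules, and the example responses are invented simplifications, not Arthur et al.'s actual procedures.

```python
# Sketch of scoring under the three response formats Arthur et al. (2014)
# compared, for one SJT item with four options. The SME key and scoring
# rules are illustrative simplifications.
import numpy as np
from scipy.stats import spearmanr

sme_effectiveness = np.array([3.2, 4.6, 2.1, 2.8])  # hypothetical key

def score_rate(ratings):
    """Rate-each-response: closeness of ratings to the SME key
    (negative mean absolute discrepancy, so higher = better)."""
    return -np.abs(np.asarray(ratings) - sme_effectiveness).mean()

def score_rank(ranks):
    """Rank-the-responses: rank-order agreement with the SME key."""
    key_ranks = (-sme_effectiveness).argsort().argsort() + 1  # 1 = best
    return spearmanr(ranks, key_ranks).correlation

def score_best_worst(best, worst):
    """Pick best and worst: +1 for each correct identification."""
    return int(best == sme_effectiveness.argmax()) + \
           int(worst == sme_effectiveness.argmin())

print(score_rate([3, 5, 2, 3]))          # rating format
print(score_rank([2, 1, 4, 3]))          # ranking format (1 = best)
print(score_best_worst(best=1, worst=2)) # best/worst format
```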
Krumm et al. (in press)
• Question: how “situational” is situational
judgment?
• Some suggest SJTs really just measure
general knowledge about appropriate social
behavior
• So Krumm et al. conducted a clever
experiment: they “decapitated” SJT items
– Removed the stem – just presented the
responses
• 559 airline pilots completed 10 items each from
– Airline pilot knowledge SJT
– Integrity SJT
– Teamwork SJT
• Overall, mean scores are 1 SD higher with the
stem
• But for more than half the items, there is no
difference with and without stem
• So the stem matters overall, but is irrelevant for many SJT items
• Whether it matters depends on the specificity of stem content,
as in the highly specific item below (a per-item comparison is
sketched after it)
• “You are flying an “angel flight” with a nurse and noncritical child
patient, to meet an ambulance at a downtown regional airport. You
filed visual flight rule: it is 11:00 p.m. on a clear night, when, at 60
nm out, you notice the ammeter indicating a battery discharge and
correctly deduce the alternator has failed. Your best guess is that
you have from 15 to 30 min of battery power remaining. You decide
to:
• (a) Declare an emergency, turn off all electrical systems, except for
1 NAVCOM and transponder, and continue to the regional airport as
planned.
• (b) Declare an emergency and divert to the Planter’s County Airport,
which is clearly visible at 2 o’clock, at 7 nm.
• (c) Declare an emergency, turn off all electrical systems, except for
1 NAVCOM, instrument panel lights, intercom, and transponder, and
divert to the Southside Business Airport, which is 40 nm straight
ahead.
• (d) Declare an emergency, turn off all electrical systems, except for
1 NAVCOM, instrument panel lights, intercom, and transponder, and
divert to Draper Air Force Base, which is at 10 o’clock, at 32 nm.”
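A minimal sketch of the per-item comparison behind the decapitation findings: the with-stem minus without-stem mean difference, in pooled-SD units, item by item. The data are simulated for illustration; Krumm et al.'s design and results differ in detail.

```python
# Sketch of the per-item "decapitation" analysis: compare mean scores
# with and without the stem, item by item, in pooled-SD units. Data are
# simulated; Krumm et al.'s actual analysis and effect sizes differ.
import numpy as np

rng = np.random.default_rng(7)
n, n_items = 280, 10  # two simulated groups, 10 items per SJT

with_stem = rng.normal(0.6, 1.0, (n, n_items))
no_stem = rng.normal(0.5, 1.0, (n, n_items))
with_stem[:, :3] += 1.0  # make a few items genuinely stem-dependent

pooled_sd = np.sqrt((with_stem.var(0, ddof=1) + no_stem.var(0, ddof=1)) / 2)
d_per_item = (with_stem.mean(0) - no_stem.mean(0)) / pooled_sd
print("per-item d:", np.round(d_per_item, 2))
print("items where the stem barely matters:",
      int((np.abs(d_per_item) < 0.2).sum()))
```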
• Arthur, W., Jr., Glaze, R. M., Jarrett, S. M., White, C. D., Schurig,
I., & Taylor, J. E. (2014). Comparative evaluation of three
situational judgment test response formats in terms of
construct-related validity, subgroup differences, and susceptibility
to response distortion. Journal of Applied Psychology, 99(3), 535-545.
• Krumm, S., Lievens, F., Huffmeier, J., Lipnevich, A., Bendels, H.,
& Hertel, G. (in press). How “situational” is judgment in
situational judgment tests? Journal of Applied Psychology.
• Lievens, F., Sackett, P. R., & Buyse, T. (2009). The effects of
response instructions on situational judgment test performance
and validity in a high-stakes context. Journal of Applied
Psychology, 94, 1095-1101.
New Developments in Integrity
Testing
Two meta-analyses with differing
findings
• Ones, Viswesvaran, and Schmidt (1993) is the “classic” analysis of
integrity test validity
– Found 662 studies, including many where only raw data were provided
(i.e., no write-up); many publishers shared information
• In 2012, Van Iddekinge et al. conducted an updated meta-analysis
– Applied strict inclusion rules as to what studies to include (e.g.,
reporting of study detail)
– 104 studies (including 132 samples) met the inclusion criteria
– 30 publishers were contacted; only 2 shared information
• Both based bottom-line conclusions on studies using a predictive design
and a non-self-report criterion
Predicting Counterproductive
Behavior
• Ones et al. – overt tests: k = 10, N = 5,598, mean validity = .39
• Ones et al. – personality-based tests: k = 62, N = 93,092, mean validity = .29
• Van Iddekinge et al.: k = 10, N = 5,056, mean validity = .11
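For readers unfamiliar with how such mean validities are built, a bare-bones sketch of the sample-size-weighted mean correlation at the core of both analyses (before the artifact corrections both teams also applied). The study-level values below are invented; only the k and total N match the Van Iddekinge et al. row above.

```python
# Bare-bones sample-size-weighted mean validity (Hunter-Schmidt style),
# the starting point for these meta-analytic means. Study-level values
# are invented for illustration; only k and total N match the table.
import numpy as np

def weighted_mean_r(rs, ns):
    """Sample-size-weighted mean correlation across studies."""
    rs, ns = np.asarray(rs), np.asarray(ns)
    return (rs * ns).sum() / ns.sum()

# e.g., 10 hypothetical predictive studies with non-self-report criteria
rs = [0.15, 0.08, 0.12, 0.05, 0.18, 0.09, 0.14, 0.07, 0.11, 0.10]
ns = [500, 320, 610, 450, 700, 380, 540, 420, 636, 500]  # total N = 5,056
print(round(weighted_mean_r(rs, ns), 3))
```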
Why the difference?
• Not clear. A number of factors do not seem to be the cause:
– Differences in the types of studies examined (e.g., both
excluded studies with polygraph results as the criterion)
– Differences in corrections (e.g., for unreliability)
• Several factors may contribute, though this is speculation:
– Some counterproductive behaviors may be more
predictable than others, but all are lumped together in
these analyses
• Given the reliance in both on studies not readily available for
public scrutiny, this won’t be resolved until further work is
done
Broader questions
• This raises broader issues about data
openness policies
– Publisher obligations?
– Researcher obligations?
– Journal publication standards?
• Ones, D. S., Viswesvaran, C., & Schmidt, F. L. (1993). Comprehensive
meta-analysis of integrity test validities: Findings and implications for
personnel selection and theories of job performance. Journal of Applied
Psychology, 78, 679-703.
• Van Iddekinge, C. H., Roth, P. L., Raymark, P. H., & Odle-Dusseau, H.
N. (2012). The criterion-related validity of integrity tests: An updated
meta-analysis. Journal of Applied Psychology, 97, 499-530.
New Developments in Using
Vocational Interest Measures
• Since Hunter and Hunter (1984), interest in
using interest measures for selection has
diminished greatly
• They report a meta-analytic estimate of
validity for predicting performance as .10
• BUT: how many studies were in this meta-analysis?
– 3!!!
• New meta-analysis by Van Iddekinge et al.
(2011)
• Lots of studies (80)
• Mean validity for a single interest dimension:
.11
• Mean validity for a single interest dimension
relevant to the job in question: .23
• Other studies suggest incremental validity
over ability and personality
• The “catch”: studies use data collected for
research purposes
• Concern that candidates can “fake” a job-relevant interest profile
• I expect interest to turn to developing faking-resistant interest measures
• Van Iddekinge, C. H., Roth, P. L., Putka, D. J., & Lanivich,
S. E. (2011). Are you interested? A meta-analysis of
relations between vocational interests and employee
performance and turnover. Journal of Applied
Psychology, 96(6), 1167.
• Nye, C. D., Su, R., Rounds, J., & Drasgow, F. (2012).
Vocational interests and performance: A quantitative
summary of over 60 years of research. Perspectives on
Psychological Science, 7(4), 384-403.
New Developments in Using
Social Media
Van Iddekinge et al (in press)
• Students about to graduate made their Facebook information available
• Recruiters rated each profile on 10 dimensions
• Supervisors rated job performance a year later
• Facebook ratings did not predict performance
• Ratings were higher for women than men
• Ratings were lower for Blacks and Hispanics than Whites
• Van Iddekinge, C. H., Lanivich, S. E., Roth, P. L., & Junco,
E. (in press). Social media for selection? Validity and
adverse impact potential of a Facebook-based
assessment. Journal of Management.
Distribution of Performance
Is performance normally
distributed?
• We’ve implicitly assumed this for years
– Data analysis strategies assume normality
– Evaluations of selection system utility assume
normality
• O’Boyle and Aguinis (2012) offer hundreds of
data sets, all consistently showing that a
“power law” distribution fits better
– This is a distribution with the largest number of
observations at the very bottom, with the number
of observations then dropping rapidly
The O’Boyle and Aguinis data
• They argue against looking at ratings data,
as ratings may be “forced” to fit a normal
distribution
• Thus they focus on objective data:
– Tallies of publications in journals
– Sports performance (e.g., golf tournaments won,
points scored in the NBA)
– Awards in arts and letters (e.g., number of
Academy Award nominations)
– Political elections (number of terms to which one
has been elected)
• (A sketch of the normal-vs-power-law contrast follows)
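A minimal simulation of what that contrast implies for "star" performers. The distributions and parameters below are invented for illustration; O'Boyle and Aguinis analyzed archival output counts and used formal distribution-fitting methods.

```python
# Sketch of the distributional claim: simulate a normal world and a
# power-law world of output counts and compare the share of total output
# produced by the top 10% of performers. Parameters are illustrative.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 10_000

normal_world = np.clip(rng.normal(50, 10, n), 0, None)
powerlaw_world = stats.pareto.rvs(b=1.8, size=n, random_state=rng)

def top_decile_share(x):
    """Share of total output produced by the top 10% of performers."""
    cutoff = np.quantile(x, 0.90)
    return x[x >= cutoff].sum() / x.sum()

print(f"normal world:    top 10% produce {top_decile_share(normal_world):.0%}")
print(f"power-law world: top 10% produce {top_decile_share(powerlaw_world):.0%}")
# Under a power law, most people cluster at low output and a few
# "stars" account for a large share of the total.
```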
An alternate view
• “Job performance is defined as the total
expected value of the discrete behavioral
episodes an individual carries out over a
standard period of time” (Motowidlo and Kell,
2013)
Aggregating individual behaviors
affects distribution
Including all performers affects
distribution
Equalizing opportunity to perform
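These three slide titles summarize counterpoints from the Beck, Beatty, and Sackett (2014) reply. The aggregation point lends itself to a quick simulation: even if single behavioral episodes are strongly skewed, summing many episodes per person (as the Motowidlo and Kell definition implies) pulls the person-level distribution toward normality. A minimal sketch with invented parameters:

```python
# Illustration of the aggregation point: per-episode output is skewed
# (exponential here), but a person's performance is the sum of many
# episodes, which pushes the person-level distribution toward normality.
# Simulated data, illustrative only.
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(1)
n_people = 5000

for n_episodes in (1, 10, 100):
    # skewed per-episode value, summed within person
    performance = rng.exponential(1.0, (n_people, n_episodes)).sum(axis=1)
    print(f"{n_episodes:>3} episodes: skewness = {skew(performance):.2f}")
```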
References
• O’Boyle, E., Jr., & Aguinis, H. (2012). The best
and the rest: Revisiting the norm of normality
of individual performance. Personnel
Psychology, 65(1), 79-119.
• Beck, J., Beatty, A. S., & Sackett, P. R.
(2014). On the distribution of performance: A
reply to O’Boyle and Aguinis. Personnel
Psychology, 67, 531-566.