What’s New in the I/O Testing and Assessment Literature That’s Important for Practitioners?
Paul R. Sackett

New Developments in the Assessment of Personality

Topic 1: A faking-resistant approach to personality measurement
• Tailored Adaptive Personality Assessment System (TAPAS)
• Developed for the Army Research Institute by Drasgow Consulting Group
• Multidimensional pairwise preference format combined with an applicable item response theory (IRT) model
• Items are created by pairing statements from different dimensions that are similar in desirability and trait “location”
• Example item: “Which is more like you?”
  – 1a) People come to me when they want fresh ideas.
  – 1b) Most people would say that I’m a “good listener.”

A faking-resistant approach to personality measurement (continued)
• Extensive work shows the format is faking-resistant
• A non-operational field study in the Army showed useful prediction of attrition, disciplinary incidents, completion of basic training, and adjustment to Army life, among other criteria
• Now in operational use on a trial basis
• Drasgow, F., Stark, S., Chernyshenko, O. S., Nye, C. D., & Hulin, C. L. (2012). Development of the Tailored Adaptive Personality Assessment System (TAPAS) to support Army selection and classification decisions (Technical Report 1311). Army Research Institute.
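To make the pairwise preference idea concrete, here is a minimal scoring sketch. TAPAS itself is proprietary and uses an ideal-point IRT model; the sketch below substitutes a simple 2PL endorsement model and invented parameter values, and follows the general composition rule of multidimensional pairwise preference (MUPP-style) models: the respondent prefers statement s over statement t when s would be endorsed and t would not, conditional on a definite preference.

```python
import numpy as np

def p_endorse(theta, a, b):
    """2PL probability that a respondent at trait level theta endorses a
    single statement with discrimination a and location b.
    (TAPAS uses an ideal-point model; 2PL is a simplification here.)"""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def p_prefer_s_over_t(theta_s, theta_t, item_s, item_t):
    """MUPP-style composition rule for a pairwise preference item in
    which each statement taps a different trait."""
    ps = p_endorse(theta_s, *item_s)  # would endorse statement s
    pt = p_endorse(theta_t, *item_t)  # would endorse statement t
    num = ps * (1.0 - pt)
    return num / (num + (1.0 - ps) * pt)

# Hypothetical example: statement s taps Openness ("fresh ideas"),
# statement t taps Agreeableness ("good listener"); the two statements
# are matched on desirability. All parameter values are invented.
item_s = (1.2, 0.0)  # (discrimination, location)
item_t = (1.1, 0.1)

# A respondent high on Openness but average on Agreeableness
print(p_prefer_s_over_t(theta_s=1.5, theta_t=0.0,
                        item_s=item_s, item_t=item_t))  # ~.87
```

Because the paired statements are matched on desirability, a respondent trying to fake has no obviously “better” option; the choice mainly reflects relative standing on the two traits, which is the faking-resistance rationale.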
Topic 2: The Value of Contextualized Personality Items
• A new meta-analysis documents the higher predictive power obtained by “contextualizing” items (e.g., asking about behavior at work rather than behavior in general)
• Mean r with supervisory ratings, work-context vs. general items:
  – Conscientiousness: .30 vs. .22
  – Emotional Stability: .17 vs. .12
  – Extraversion: .25 vs. .08
  – Agreeableness: .24 vs. .10
  – Openness: .19 vs. .02
• Shaffer, J. A., & Postlethwaite, B. E. (2012). A matter of context: A meta-analytic investigation of the relative validity of contextualized and noncontextualized personality measures. Personnel Psychology, 65, 445-494.

Topic 3: Moving from the Big 5 to Narrower Dimensions
• DeYoung, Quilty, and Peterson (2007) suggested the following:
  – Neuroticism:
    • Volatility: irritability, anger, and difficulty controlling emotional impulses
    • Withdrawal: susceptibility to anxiety, worry, depression, and sadness
  – Agreeableness:
    • Compassion: empathetic emotional affiliation
    • Politeness: consideration and respect for others’ needs and desires
  – Conscientiousness:
    • Industriousness: working hard and avoiding distraction
    • Orderliness: organization and methodicalness
  – Extraversion:
    • Enthusiasm: positive emotion and sociability
    • Assertiveness: drive and dominance
  – Openness to Experience:
    • Intellect: ingenuity, quickness, and intellectual engagement
    • Openness: imagination, fantasy, and artistic and aesthetic interests
• DeYoung, C. G., Quilty, L. C., & Peterson, J. B. (2007). Between facets and domains: 10 aspects of the Big Five. Journal of Personality and Social Psychology, 93, 880-896.

Moving from the Big 5 to Narrower Dimensions (continued)
• Dudley et al. (2006) show the value of this perspective
  – Four conscientiousness facets: achievement, dependability, order, and cautiousness
  – Validity was driven largely by the achievement and/or dependability facets, with relatively little contribution from cautiousness and order
  – Achievement receives the dominant weight in predicting task performance, while dependability receives the dominant weight in predicting counterproductive work behavior
• Dudley, N. M., Orvis, K. A., Lebiecki, J. E., & Cortina, J. M. (2006). A meta-analytic investigation of conscientiousness in the prediction of job performance: Examining the intercorrelations and the incremental validity of narrow traits. Journal of Applied Psychology, 91, 40-57.

Topic 4: The Use of Faking Warnings
• Landers et al. (2011) administered a warning after one-third of the items to managerial candidates exhibiting what they called “blatant extreme responding”
• The rate of extreme responding was halved after the warning
• Landers, R. N., Sackett, P. R., & Tuzinski, K. A. (2011). Retesting after initial failure, coaching rumors, and warnings against faking in online personality measures for selection. Journal of Applied Psychology, 96(1), 202.

More on the Use of Faking Warnings
• Nathan Kuncel suggests three potentially relevant goals when individuals take a personality test:
  – be impressive
  – be credible
  – be true to oneself
• Jenson and Sackett (2013) suggested that “priming” concern for being credible could reduce faking
• Test-takers who scheduled a follow-up interview just before taking the personality test obtained lower scores than those who did not
• Jenson, C. E., & Sackett, P. R. (2013). Examining ability to fake and test-taker goals in personality assessments. SIOP presentation.

New Developments in the Assessment of Cognitive Ability

A cognitive test with reduced adverse impact
• In 2011, SIOP awarded its M. Scott Myers Award for applied research to Yusko, Goldstein, Scherbaum, and Hanges for the development of the Siena Reasoning Test
• This is a nonverbal reasoning test using unfamiliar item content, such as made-up words (if a GATH is larger than a SHET…) and figures
• The concept is that adverse impact will be reduced by eliminating content with which groups have differential familiarity

Validity and subgroup d for the Siena Test
• Black-White d is commonly in the .3-.4 range (a sketch of how subgroup d is computed appears at the end of this section)
• Sizable number of validity studies, with validities in the range commonly seen for cognitive tests
• In one independent study, HumRRO researchers included the Siena along with another cognitive test; corrected validity was .45 for the other test (d = 1.0) and .35 for the Siena (d = .38) (SIOP 2010: Paullin, Putka, & Tsacoumis)

Why the reduced d?
• Somewhat of a puzzle. There is a history of using nonverbal reasoning tests:
  – Raven’s Progressive Matrices
  – Large-sample military studies in Project A
• But these do not show the reduced d that is seen with the Siena Test
• Things to look into: does d vary with item difficulty, and how does the Siena compare with other tests?
• (Note: Nothing published to date that I am aware of. Some PowerPoint decks from SIOP presentations can be found online: search for “Siena Reasoning Test.”)
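For readers less familiar with the d metric quoted above: it is the standardized mean difference between two subgroups, computed against the pooled standard deviation. A minimal sketch, using invented scores:

```python
import numpy as np

def cohens_d(group1, group2):
    """Standardized mean difference (Cohen's d) using the pooled
    standard deviation; the metric behind the 'Black-White d' figures
    quoted for the Siena Reasoning Test."""
    n1, n2 = len(group1), len(group2)
    m1, m2 = np.mean(group1), np.mean(group2)
    v1, v2 = np.var(group1, ddof=1), np.var(group2, ddof=1)
    sd_pooled = np.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
    return (m1 - m2) / sd_pooled

# Invented data: a true difference of .35 SD mirrors the reported range
rng = np.random.default_rng(0)
group_a = rng.normal(0.35, 1.0, 500)
group_b = rng.normal(0.00, 1.0, 500)
print(cohens_d(group_a, group_b))  # ~.3-.4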
New Developments in Situational Judgment Testing

Sample SJT item
• You find yourself in an argument with several co-workers about who should do a very disagreeable but routine task. Which of the following would be the most effective way to resolve this situation?
  – (a) Have your supervisor decide, because this would avoid any personal bias.
  – (b) Arrange for a rotating schedule so everyone shares the chore.
  – (c) Let the workers who show up earliest choose on a first-come, first-served basis.
  – (d) Randomly assign a person to do the task and don’t change it.

Key findings
• Extensive validity evidence
• Can measure different constructs (problem solving, communication skills, integrity, etc.)
• Incremental validity over ability and personality
• Small subgroup differences, except for cognitively oriented SJTs
• Items can be presented in written form or by video; recent move to animation rather than recording live actors

Lievens, Sackett, and Buyse (2009): comparing response instructions
• Ongoing debate re “would do” vs. “should do” instructions
• Lievens et al. randomly assigned Belgian medical school applicants to “would do” or “should do” instructions in an operational interpersonal-skills SJT; they did the same with a student sample
• In the operational setting, all gave “should do” responses
  – So: we’d like to know “would do,” but in effect can only get “should do”

Arthur et al. (2014): comparing response formats
• Compared 3 options (a scoring sketch for each appears at the end of this section):
  – Rate the effectiveness of each response
  – Rank the responses
  – Choose the best and worst response
• 20-item integrity-oriented SJT
• Administered to over 30,000 retail/hospitality job applicants
• Online administration; each format used for one week
• “Rate each response” emerges as superior:
  – Higher reliability
  – Lower correlation with cognitive ability
  – Smaller gender mean difference
  – Higher correlation with conceptually relevant personality dimensions (conscientiousness, agreeableness, emotional stability)
• Follow-up study with a student sample:
  – Higher retest reliability
  – More favorable reactions

Krumm et al. (in press)
• Question: how “situational” is situational judgment?
• Some suggest SJTs really just measure general knowledge about appropriate social behavior
• So Krumm et al. conducted a clever experiment: they “decapitated” SJT items
  – Removed the stem and just presented the responses
• 559 airline pilots completed 10 items each from:
  – an airline pilot knowledge SJT
  – an integrity SJT
  – a teamwork SJT
• Overall, mean scores are 1 SD higher with the stem
• But for more than half the items, there is no difference with and without the stem
• So the stem matters overall, but is irrelevant for lots of SJT items
• Whether it matters depends on the specificity of the stem content
• Example of a highly specific stem: “You are flying an ‘angel flight’ with a nurse and noncritical child patient, to meet an ambulance at a downtown regional airport. You filed visual flight rules; it is 11:00 p.m. on a clear night when, at 60 nm out, you notice the ammeter indicating a battery discharge and correctly deduce the alternator has failed. Your best guess is that you have from 15 to 30 min of battery power remaining. You decide to:
  – (a) Declare an emergency, turn off all electrical systems except for 1 NAVCOM and transponder, and continue to the regional airport as planned.
  – (b) Declare an emergency and divert to the Planter’s County Airport, which is clearly visible at 2 o’clock, at 7 nm.
  – (c) Declare an emergency, turn off all electrical systems except for 1 NAVCOM, instrument panel lights, intercom, and transponder, and divert to the Southside Business Airport, which is 40 nm straight ahead.
  – (d) Declare an emergency, turn off all electrical systems except for 1 NAVCOM, instrument panel lights, intercom, and transponder, and divert to Draper Air Force Base, which is at 10 o’clock, at 32 nm.”
• Arthur, W., Jr., Glaze, R. M., Jarrett, S. M., White, C. D., Schurig, I., & Taylor, J. E. (2014). Comparative evaluation of three situational judgment test response formats in terms of construct-related validity, subgroup differences, and susceptibility to response distortion. Journal of Applied Psychology, 99(3), 535-545.
• Krumm, S., Lievens, F., Hüffmeier, J., Lipnevich, A., Bendels, H., & Hertel, G. (in press). How “situational” is judgment in situational judgment tests? Journal of Applied Psychology.
• Lievens, F., Sackett, P. R., & Buyse, T. (2009). The effects of response instructions on situational judgment test performance and validity in a high-stakes context. Journal of Applied Psychology, 94, 1095-1101.
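Scoring rules for the three Arthur et al. response formats vary across test programs, and the paper does not dictate a single rule; the following sketch shows one common approach for each format, scored against a hypothetical expert key (all names and values invented for illustration):

```python
import numpy as np

# Hypothetical expert key: mean expert effectiveness ratings for the
# four response options of a single SJT item (1-7 scale).
expert_means = np.array([5.8, 6.4, 2.1, 3.0])

def score_rate_each(applicant_ratings):
    """'Rate each response' format: one common rule is the (negated)
    mean absolute distance from the expert means, so closer agreement
    with experts yields a higher score."""
    return -np.mean(np.abs(applicant_ratings - expert_means))

def score_rank(applicant_ranks):
    """'Rank the responses' format: correlate the applicant's ranking
    (1 = best) with the expert-implied ranking."""
    expert_ranks = (-expert_means).argsort().argsort() + 1
    return np.corrcoef(applicant_ranks, expert_ranks)[0, 1]

def score_best_worst(best, worst):
    """'Choose best and worst' format: +1 for matching the keyed best
    option, +1 for matching the keyed worst option."""
    return int(best == expert_means.argmax()) + int(worst == expert_means.argmin())

# Example applicant responses to the same item
print(score_rate_each(np.array([6.0, 6.0, 3.0, 3.0])))  # -0.375
print(score_rank(np.array([2, 1, 4, 3])))               # 1.0
print(score_best_worst(best=1, worst=2))                # 2
```

One design observation: “rate each response” yields the most granular data per item (a full profile of ratings rather than a rank order or a single pick), which is one plausible reason for the higher reliability Arthur et al. report.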
New Developments in Integrity Testing

Two meta-analyses with differing findings
• Ones, Viswesvaran, and Schmidt (1993) is the “classic” analysis of integrity test validity
  – Found 662 studies, including many where only raw data were provided (i.e., no write-up); many publishers shared information
• In 2012, Van Iddekinge et al. conducted an updated meta-analysis
  – Applied strict inclusion rules as to which studies to include (e.g., reporting of study detail)
  – 104 studies (including 132 samples) met the inclusion criteria
  – 30 publishers were contacted; only 2 shared information
• Both based bottom-line conclusions on studies using a predictive design and a non-self-report criterion

Predicting counterproductive behavior
                                            k        N    Mean validity
• Ones et al. – overt tests                10    5,598        .39
• Ones et al. – personality-based tests    62   93,092        .29
• Van Iddekinge et al.                     10    5,056        .11

Why the difference?
• Not clear. A number of factors do not seem to be the cause:
  – Differences in the types of studies examined (e.g., both excluded studies with polygraph results as criteria)
  – Differences in corrections (e.g., for unreliability; the classic correction is sketched at the end of this section)
• Several factors may contribute, though this is speculation:
  – Some counterproductive behaviors may be more predictable than others, but all are lumped together in these analyses
• Given the reliance in both on studies not readily available to public scrutiny, this won’t be resolved until further work is done

Broader questions
• This raises broader issues about data openness policies:
  – Publisher obligations?
  – Researcher obligations?
  – Journal publication standards?
• Ones, D. S., Viswesvaran, C., & Schmidt, F. L. (1993). Comprehensive meta-analysis of integrity test validities: Findings and implications for personnel selection and theories of job performance. Journal of Applied Psychology, 78, 679-703.
• Van Iddekinge, C. H., Roth, P. L., Raymark, P. H., & Odle-Dusseau, H. N. (2012). The criterion-related validity of integrity tests: An updated meta-analysis. Journal of Applied Psychology, 97, 499-530.
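One of the correction procedures at issue is the classic correction for criterion unreliability. Both meta-analyses apply corrections of this general form, though their exact procedures differ; a minimal sketch with invented numbers:

```python
def correct_for_criterion_unreliability(r_xy, r_yy):
    """Classic attenuation correction: estimate the validity that would
    be observed with a perfectly reliable criterion,
    rho = r_xy / sqrt(r_yy)."""
    return r_xy / (r_yy ** 0.5)

# Illustration with invented numbers: an observed validity of .25
# against a criterion with reliability .52 (a commonly used estimate
# for supervisory ratings) corrects to about .35. The reliability
# value assumed for the criterion can thus move the bottom line
# substantially, which is why the two analyses' choices matter.
print(correct_for_criterion_unreliability(0.25, 0.52))  # ~0.347
```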
New Developments in Using Vocational Interest Measures
• Since Hunter and Hunter (1984), interest in using interest measures for selection has diminished greatly
• They report a meta-analytic estimate of validity for predicting performance of .10
• BUT: how many studies were in this meta-analysis? 3!
• A new meta-analysis by Van Iddekinge et al. (2011) includes lots of studies (80)
• Mean validity for a single interest dimension: .11
• Mean validity for a single interest dimension relevant to the job in question: .23
• Other studies suggest incremental validity over ability and personality
• The “catch”: these studies use data collected for research purposes
• Concern that candidates can “fake” a job-relevant interest profile
• I expect attention to turn to developing faking-resistant interest measures
• Van Iddekinge, C. H., Roth, P. L., Putka, D. J., & Lanivich, S. E. (2011). Are you interested? A meta-analysis of relations between vocational interests and employee performance and turnover. Journal of Applied Psychology, 96(6), 1167.
• Nye, C. D., Su, R., Rounds, J., & Drasgow, F. (2012). Vocational interests and performance: A quantitative summary of over 60 years of research. Perspectives on Psychological Science, 7(4), 384-403.

New Developments in Using Social Media

Van Iddekinge et al. (in press)
• Students about to graduate made Facebook information available
• Recruiters rated each profile on 10 dimensions
• Supervisors rated job performance a year later
• Facebook ratings did not predict performance
• Ratings were higher for women than men
• Ratings were lower for Blacks and Hispanics than Whites
• Van Iddekinge, C. H., Lanivich, S. E., Roth, P. L., & Junco, E. (in press). Social media for selection? Validity and adverse impact potential of a Facebook-based assessment. Journal of Management.

Distribution of Performance

Is performance normally distributed?
• We’ve implicitly assumed this for years
  – Data analysis strategies assume normality
  – Evaluations of selection system utility assume normality
• O’Boyle and Aguinis (2012) offer hundreds of data sets, all consistently showing that a “power law” distribution fits better (a small simulation sketch follows the references below)
  – This is a distribution with the largest number of observations at the very bottom, with the number of observations then dropping rapidly

The O’Boyle and Aguinis data
• They argue against looking at ratings data, as ratings may be “forced” to fit a normal distribution
• Thus they focus on objective data:
  – Tallies of publications in journals
  – Sports performance (e.g., golf tournaments won, points scored in the NBA)
  – Awards in arts and letters (e.g., number of Academy Award nominations)
  – Political elections (number of terms to which one has been elected)

An alternate view
• “Job performance is defined as the total expected value of the discrete behavioral episodes an individual carries out over a standard period of time” (Motowidlo & Kell, 2013)
• Three considerations that affect the shape of the observed distribution (see Beck et al., 2014):
  – Aggregating individual behaviors affects the distribution
  – Including all performers affects the distribution
  – Equalizing opportunity to perform affects the distribution

References
• O’Boyle, E., Jr., & Aguinis, H. (2012). The best and the rest: Revisiting the norm of normality of individual performance. Personnel Psychology, 65(1), 79.
• Beck, J., Beatty, A. S., & Sackett, P. R. (2014). On the distribution of performance: A reply to O’Boyle and Aguinis. Personnel Psychology, 67, 531-566.
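To see what distinguishes a power-law from a normal distribution of performance in practice, here is a small, purely illustrative simulation (invented data; `top1_share` is a hypothetical helper, not a statistic from either paper):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000

# Invented productivity tallies (e.g., publication counts) under the
# two competing views of the performance distribution.
normal_perf = np.clip(rng.normal(50, 10, n), 0, None)  # bell-shaped
power_perf = rng.pareto(1.5, n) + 1                    # heavy-tailed

def top1_share(x):
    """Share of total output produced by the top 1% of performers --
    the 'star performer' signature O'Boyle and Aguinis emphasize."""
    cutoff = np.quantile(x, 0.99)
    return x[x >= cutoff].sum() / x.sum()

print(top1_share(normal_perf))  # ~0.015: barely above headcount share
print(top1_share(power_perf))   # ~0.2: a few stars dominate output
```

Under a normal distribution the top 1% of performers produce only slightly more than their 1% headcount share of total output; under a heavy-tailed distribution a small number of stars account for a large fraction of it, which is the pattern O’Boyle and Aguinis report and Beck et al. reinterpret.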