THE EFFECTS OF RESPONSE INSTRUCTIONS ON SITUATIONAL JUDGMENT TEST PERFORMANCE IN A HIGH-STAKES EMPLOYMENT CONTEXT

A Thesis Presented to the faculty of the Department of Psychology, California State University, Sacramento

Submitted in partial satisfaction of the requirements for the degree of MASTER OF ARTS in Psychology (Industrial/Organizational Psychology)

by Clinton Dean Kelly

FALL 2013

© 2013 Clinton Dean Kelly. ALL RIGHTS RESERVED.

Approved by:
__________________________________, Committee Chair
Gregory Hurtz, Ph.D.
__________________________________, Second Reader
Lawrence Meyers, Ph.D.
__________________________________, Third Reader
Howard Fortson, Ph.D.
____________________________ Date

Student: Clinton Dean Kelly
I certify that this student has met the requirements for format contained in the University format manual, and that this thesis is suitable for shelving in the Library and credit is to be awarded for the thesis.
__________________________, Graduate Coordinator
Jianjian Qin, Ph.D.
Department of Psychology
___________________ Date

Abstract of THE EFFECTS OF RESPONSE INSTRUCTIONS ON SITUATIONAL JUDGMENT TEST PERFORMANCE IN A HIGH-STAKES EMPLOYMENT CONTEXT
by Clinton Dean Kelly

The purpose of this study was to investigate the effects of response instructions on situational judgment test (SJT) performance. Participants (N = 2,224) completed one of two forms of a 15-item SJT as part of an entry-level firefighter selection test. Eight of the SJT items on Form A used knowledge instructions (what should you do) and seven used behavioral tendency instructions (what would you do). On Form B, the same items were used, but for each item the "should" versus "would" response instructions were reversed relative to Form A. No significant differences were found for mean scores by instruction type within or between subjects. Additionally, no significant differences were found in the SJT items' correlations with cognitive ability by instruction type. These results are in direct contrast to those found in previous studies involving low-stakes situations. Limitations and theoretical implications of the findings are also discussed.

_______________________, Committee Chair
Gregory Hurtz, Ph.D.
_______________________ Date

ACKNOWLEDGEMENTS

I would like to thank Gregory Hurtz, Ph.D. for serving as my chair, providing guidance along the way, and putting up with my procrastination. I would also like to thank Larry Meyers, Ph.D. and Howard Fortson, Ph.D. for their contributions to my education in the classroom and in the workplace. Jason Schaefer helped to get this all started and assisted with some preliminary presentations of the findings. My wife, daughter, and parents have been a source of support, encouragement, and love throughout this process. Lastly, Nala was always by my side with unconditional support.

TABLE OF CONTENTS

Acknowledgements ......................................................... vi
List of Tables ........................................................... ix
Chapter
1. OVERVIEW ............................................................... 1
   Brief History of SJTs .................................................. 2
   SJTs as a Measurement Method ........................................... 3
   Reliability ............................................................ 7
   Validity ............................................................... 8
2. STUDY BACKGROUND ...................................................... 11
   SJTs as Measures of Behavioral Tendency ............................... 11
   SJTs as Measures of Job Knowledge or Cognitive Ability ................ 11
   Instruction Types ..................................................... 11
   Faking ................................................................ 17
   Limitations of Prior Research ......................................... 19
   Present Study ......................................................... 20
   Hypotheses ............................................................ 21
3. METHODS ............................................................... 26
   Participants .......................................................... 26
   Development of the Test ............................................... 26
   Procedure ............................................................. 30
4. RESULTS ............................................................... 32
   Data Screening ........................................................ 32
   Descriptive Statistics ................................................ 33
   ANCOVA Analysis ....................................................... 41
   Logistic Regression Analyses .......................................... 43
5. DISCUSSION ............................................................ 48
   Discussion of Hypothesized Relationships .............................. 48
   Limitations ........................................................... 50
   Conclusions and Theoretical Implications .............................. 53
References ............................................................... 56

LIST OF TABLES

1. Table 1 Constructs Assessed Using SJTs from Christian, Edwards, & Bradley, 2010 ....... 5
2. Table 2 SJT Construct Correlations from McDaniel et al. 2007 .......................... 9
3. Table 3 Types of SJT Instructions .................................................... 12
4. Table 4 SJT Construct Correlations by Instruction Type when Test Content is Held Constant from McDaniel et al., 2007 ..... 15
5. Table 5 Entry Firefighter Test Sections .............................................. 27
6. Table 6 Mean Test Score by Gender and Ethnicity ...................................... 37
7. Table 7 Gender and Ethnicity by Fire Department ...................................... 38
8. Table 8 Participants and Mean Test Scores by Fire Department ......................... 39
9. Table 9 SJT Item Difficulty and Correlation with Cognitive Ability ................... 40
10. Table 10 ANCOVA Summary Table ....................................................... 43
11. Table 11 Logistic Regression Summary Table .......................................... 46

Chapter 1
OVERVIEW

Personnel selection is the process that organizations follow to make hiring decisions. Virtually all organizations use some sort of selection procedure as a part of their personnel selection process. A selection procedure is any procedure used singly or in combination to make a personnel decision (SIOP, 2003). The selection procedures organizations use can vary from the simplistic (e.g., pass a criminal background check in order to clean dishes at a restaurant) to the complex (e.g., apply to the CIA and take a personality test, a cognitive ability test, a writing sample, a polygraph, three interviews, and a physical ability test). Regardless of the simplicity or complexity of the selection procedure, the goal should be to hire the best person for the job. Many different types of selection procedures have been used throughout the years (Schmidt & Hunter, 1998); however, this study will focus specifically on the use of situational judgment tests (SJTs) in personnel selection.

SJTs have become more popular over the last few years in personnel selection (Weekley & Ployhart, 2006). Some reasons for this increase in popularity are that SJTs have the potential to reduce adverse impact in comparison to cognitive ability tests, present problem-solving skills in an applied setting, offer incremental validity over cognitive ability and personality, and receive positive applicant reactions (Landy, 2007). As the name implies, SJTs contain items that are designed to assess an applicant's judgment regarding a situation encountered in the workplace (Weekley & Ployhart, 2006). A respondent is presented with work-related situations and a list of possible courses of action. An example SJT item is provided below:

You have a project due to your supervisor by a specific date, but because of your workload, you realize that you will not be able to complete the project by the deadline. Which is the best response?
a. Work extra hours in order to complete the project by the deadline.
b. Complete what you can by the deadline.
c. Let your supervisor know that the deadline was unrealistic.
d. Ask for an extension of the deadline.

SJTs are versatile and are not limited to a single mode of administration (e.g., written, verbal, video-based, or computer-based) (Clevenger, Pereira, Wiechmann, Schmitt, & Schmidt-Harvey, 2001) nor a single scoring method (e.g., dichotomous, rating scale) (Bergman, Drasgow, Donovan, Henning, & Juraska, 2006). However, the focus of this paper will be on written multiple-choice SJTs that are dichotomously scored.

Brief History of SJTs

One of the first tests to contain SJTs with a list of response options like the example provided was the George Washington Social Intelligence Test created in 1926 (Moss, 1926). One of the seven subtests was titled Judgment in Social Situations and was designed to assess judgment and knowledge of human motives.
A few years later, during World War II, the army created a test designed to assess the common sense, experience, and general knowledge of soldiers (Northrop, 1989). The use of SJTs then spread to organizations, where tests were developed to measure supervisory potential. The Practical Judgment Test (Cardall, 1942) and the Supervisory Practices Test (Bruce & Learner, 1958) were both created for organizational use. SJTs continued to be used sporadically in organizations, but did not receive much attention until 1990, when Motowidlo, Dunnette, and Carter (1990) created an SJT to select entry-level managers. They referred to their test as a low-fidelity simulation and found that the correlations of the SJT scores with supervisory performance ratings ranged from .28 to .37 in the total sample. The authors concluded that "low-fidelity simulations can be used to harness the predictive potential of behavioral samples even when cost or other practical constraints make it impossible to develop and implement high-fidelity simulations" (Motowidlo et al., 1990, p. 647).

Since the early 1990s, SJT research has increased, and SJTs are now commonly used in personnel selection in both the United States and Europe (Salgado, Viswesvaran, & Ones, 2001). The rise in use of SJTs is likely due to their validity in predicting job performance (McDaniel, Hartman, Whetzel, & Grubb, 2007), their incremental validity over job knowledge tests (Lievens & Patterson, 2011) and personality tests (McDaniel et al., 2007), their reduced race-based adverse impact relative to cognitive measures (Whetzel, McDaniel, & Nguyen, 2008), and their face and content validity (Motowidlo, Hanson, & Crafts, 1997).

SJTs as a Measurement Method

It is important to recognize the distinction between methods and constructs when referring to SJTs or any test used in personnel selection. A construct is the concept a selection procedure is intended to measure, and the method is the measurement tool used to assess a construct or constructs (Arthur & Villado, 2008; SIOP, 2003). Some methods are designed to measure a single construct and other methods may measure more than one construct. For example, an integrity test is designed to measure the construct of integrity and a cognitive ability test is designed to measure the construct of general cognitive ability. An example of a measurement method that may assess more than one construct is an assessment center. An assessment center may be designed to assess the constructs of interpersonal skills, managing others, and job knowledge. SJTs are similar to assessment centers in that they are a measurement method and not a construct (McDaniel & Whetzel, 2005). SJTs can be used as a method to measure different constructs (i.e., they are construct heterogeneous), although the method does place some restrictions on what can be measured (Christian, Edwards, & Bradley, 2010; Schmitt & Chan, 2006). Two of the most common construct domains measured by SJTs are leadership and interpersonal skills (Christian et al., 2010). Table 1 contains a summary of the constructs assessed using SJTs from a meta-analysis conducted by Christian et al. (2010).
Table 1
Constructs Assessed Using SJTs from Christian, Edwards, & Bradley, 2010

Job knowledge and skills: Knowledge of the interrelatedness of units; Pilot judgment; Managing tasks; Team role knowledge
Interpersonal Skills: Ability to size up personalities; Customer contact effectiveness; Customer service interactions; Guest relations; Interactions; Interpersonal skills; Negotiations; Service situations; Social intelligence; Working effectively with others
Teamwork Skills: Teamwork; Teamwork KSAs
Leadership: Administrative judgment; Conflict resolution for managers; Directing the activities of others; Handling people; General management performance; Handling employee problems; Leadership/supervision; Managerial/supervisory skill or judgment; Managerial situations; Supervisor actions dealing with people; Supervisor job knowledge; Supervisor problems; Supervisor Profile Questionnaire; Managing others
Personality composites: Conscientiousness, Agreeableness, Neuroticism; Adaptability, ownership, self-initiative, teamwork, integrity, work ethic
Conscientiousness: Conscientiousness
Agreeableness: Agreeableness
Neuroticism: Neuroticism

Adapted from Christian, Edwards, & Bradley (2010).

It is important to remember that SJTs often do not clearly measure one specific construct and that Table 1 lists the constructs that Christian et al. (2010) felt were most saturated by the SJTs they included in their study. The potential for SJTs to measure more than one construct in a valid and cost-effective manner makes them attractive tools for individuals working in personnel selection.

Reliability

Reliability relates to the precision of a test; it is an index of the amount of measurement error in test scores. There are different methods for assessing reliability, but only three will be discussed here. Test-retest reliability is obtained by having the same person complete a test twice and calculating a Pearson correlation between the two test administrations. Parallel forms reliability is obtained by creating two versions of a test that assess the same construct or constructs and having some people complete each version of the test. A Pearson correlation between the two tests is then calculated to obtain the reliability. The third form is internal consistency reliability, which is based on the idea that test takers will respond similarly to items that assess the same construct. Coefficient alpha, or Cronbach's alpha, is the measure typically used when reporting internal consistency reliability.

There are some special considerations to keep in mind before estimating the reliability of SJTs because they tend to be construct heterogeneous at the scale and item level (McDaniel & Whetzel, 2005). Consider the SJT example presented previously. The situation stated that, due to workload, the employee has recognized that he or she will not be able to finish a project by the scheduled deadline. The respondent is then asked which of the listed response options is best. A respondent who chooses the option that states, "work extra hours in order to complete the project by the deadline," may choose this response due to a high level of intelligence and/or a high level of agreeableness. This particular item would then correlate with both cognitive ability and agreeableness (Whetzel & McDaniel, 2009).
If the SJT consists of additional items that also correlate with both intelligence and other personality factors, we could say that the SJT is construct heterogeneous. In other words, the SJT is not internally consistent. Unless there is evidence that an SJT is homogeneous (i.e., internally consistent), test-retest and parallel forms reliability are preferred and Cronbach's alpha should be avoided (Cronbach, 1951). However, in research or field situations, it can be difficult to obtain test-retest or parallel forms reliability, and many researchers continue to report internal consistency estimates of reliability without mentioning the tendency of such measures to underestimate reliability (Whetzel & McDaniel, 2009).

Validity

The existing research concerning the types of validity evidence for SJTs focuses on construct validity and predictive validity (i.e., prediction of job performance). Concerning construct validity, researchers have found moderate correlations of SJTs with cognitive ability and personality (McDaniel, Morgeson, Finnegan, Campion, & Braverman, 2001; McDaniel & Nguyen, 2001; Ployhart & Ehrhart, 2003). McDaniel et al. (2007) conducted a meta-analysis of 118 studies of SJTs that included data on construct validity. They found that on average SJTs are most highly correlated with cognitive ability and three of the Big Five personality traits (conscientiousness, agreeableness, and emotional stability). The mean observed correlations for these constructs ranged from .19 to .29 (see Table 2).

Table 2
SJT Construct Correlations from McDaniel et al. 2007

Construct                   N        κ     Mean r    ρ
Cognitive Ability           30,859   95    .29       .32
Agreeableness               25,473   51    .22       .25
Conscientiousness           31,277   53    .23       .27
Emotional Stability         19,325   49    .19       .22
Extroversion                11,351   25    .13       .14
Openness to experience       4,515   19    .11       .13

Adapted from McDaniel et al. (2007). Note. N is the number of subjects across all studies in the analysis; κ is the number of studies; Mean r is the observed mean correlation; ρ is the estimated mean population correlation, corrected for measurement error in the personality and cognitive ability measures.

McDaniel et al. (2007) also evaluated the validity evidence concerning prediction of job performance and incremental validity over cognitive ability and personality. They found an average correlation of .26 with job performance ratings based on 118 coefficients (N = 24,756), an incremental validity from .03 to .05 over cognitive ability, and an incremental validity from .06 to .07 over a measure of the Big Five. The authors did note that almost all of the data in their meta-analysis came from incumbent samples and not job applicants.

Overall, research supports the construct, predictive, and incremental validity evidence for SJTs. However, there have been conflicting data regarding the strength of the validity coefficients. Most of the differences in findings have been attributed to two different instruction types used for SJTs (McDaniel & Nguyen, 2001). The primary purpose of this study is to add to the existing knowledge of how response instructions moderate the construct validity of SJTs as it relates to cognitive ability and how item-level difficulty is affected by response instruction type.
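To make the internal consistency discussion above concrete, the following is a minimal sketch (mine, not taken from the thesis) of how coefficient alpha is computed for a dichotomously scored SJT. The data are simulated, and the 15-item length simply mirrors the SJT described later. When items share little common variance, as with construct-heterogeneous SJTs, the sum of the item variances approaches the variance of the total score and alpha is driven toward zero, even if each item is individually informative.

    # Minimal illustration (not the author's code): coefficient alpha for 0/1 item scores.
    import numpy as np

    def cronbach_alpha(items: np.ndarray) -> float:
        """items: respondents x items matrix of dichotomous (0/1) scores."""
        k = items.shape[1]                          # number of items
        item_vars = items.var(axis=0, ddof=1)       # variance of each item
        total_var = items.sum(axis=1).var(ddof=1)   # variance of total scores
        return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

    rng = np.random.default_rng(0)
    # Hypothetical 15-item SJT with easy items (p near .85) answered independently.
    sim = (rng.random((500, 15)) < 0.85).astype(int)
    print(round(cronbach_alpha(sim), 2))  # near zero because the items share no common variance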
Chapter 2
STUDY BACKGROUND

SJTs as Measures of Behavioral Tendency

Professionals in personnel selection often work on the assumption of behavioral consistency, which states that the best predictor of future behavior is past behavior (Wernimont & Campbell, 1968). This is one of the explanations offered for why SJTs predict job performance (Motowidlo et al., 1990). According to this explanation, SJTs serve as an indication of how job applicants will behave on the job if hired. If this is true, organizations can use SJTs that reflect common work situations to assess a candidate's behavioral tendencies.

SJTs as Measures of Job Knowledge or Cognitive Ability

The second explanation offered for why SJTs predict job performance is that they measure job knowledge or cognitive ability (McDaniel et al., 2001). Under this explanation, when candidates complete SJTs they cognitively evaluate each response option and pick the best option. The best option may not always be congruent with a candidate's behavioral tendency. If this explanation is true, organizations can use SJTs to assess a candidate's maximal performance (Cronbach, 1984).

Instruction Types

Research has shown that the amount of behavioral tendency or cognitive ability measured by an SJT may be moderated by the instruction type used (Ployhart & Ehrhart, 2003; McDaniel et al., 2007; Lievens, Sackett, & Buyse, 2009). Accordingly, SJT instructions are typically divided into two categories: behavioral tendency instructions and knowledge instructions (McDaniel & Nguyen, 2001). Table 3 provides a summary of the different instructions by type. Ployhart and Ehrhart (2003) note that the "most and least likely do" instructions are probably the most commonly used in practice.

Table 3
Types of SJT Instructions

Behavioral Tendency Instructions
1. Would do: Bruce and Learner (1958)
2. Most and least likely do: Pulakos and Schmitt (1996)
3. Rate and rank what you would most likely do: Jagmin (1985)
4. Have done: Motowidlo, Carter, Dunnette, Tippins, Werner, Burnett, and Vaughan (1992)

Knowledge Instructions
1. Should do: Hanson (1994)
2. Best response: Greenberg (1963)
3. Best and worst: Weekley and Jones (1999)
4. Best and second best: Richardson, Bellows, Henry, & Co. (1981)
5. Best, second, and third best: Cardall (1942)
6. Level of importance: Corts (1980)
7. Rate effectiveness of each response: Sacco, Scheu, Ryan, Schmitt, Schmidt, and Rogg (2000)

Adapted from Ployhart and Ehrhart (2003) and McDaniel et al. (2007).

The variety of instruction types used has led to some confusion about what SJTs measure and how effective they are in predicting job performance. The first study to attempt to clarify this confusion by modifying SJT response instructions while holding the SJT content constant was conducted by Ployhart and Ehrhart (2003). In this study, they developed a five-item paper-and-pencil SJT consisting of common study situations encountered by students, with four response options for each item. The researchers identified six different instruction types (three behavioral tendency and three knowledge). In Phase I of the study, 486 students were randomly administered the SJT using one of the six instruction types. In Phase II, 134 students completed the same five-item SJT six times, once for each instruction type. The results showed that the SJT items with knowledge instructions resulted in higher means, lower standard deviations, and more non-normal distributions than the behavioral tendency instructions.
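Comparisons of this kind are ordinary descriptive statistics computed separately for each instruction-type group. The sketch below is mine, not taken from Ployhart and Ehrhart (2003), and the data frame and column names (instruction, sjt_total) are hypothetical; it simply shows how group means, standard deviations, and skewness (as a rough indicator of non-normality) could be tabulated.

    # Minimal sketch (not from the cited study): score distributions by instruction type.
    import pandas as pd
    from scipy import stats

    def describe_by_instruction(df: pd.DataFrame, score: str = "sjt_total",
                                group: str = "instruction") -> pd.DataFrame:
        """Mean, SD, and skewness of total scores for each instruction-type group."""
        g = df.groupby(group)[score]
        return pd.DataFrame({
            "mean": g.mean(),
            "sd": g.std(ddof=1),
            "skew": g.apply(lambda s: stats.skew(s)),  # larger |skew| suggests non-normality
        })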
They also evaluated the intercorrelations among the instruction types within each category and the correlation between the two categories. The intercorrelation among behavioral tendency instructions (r = .76) was higher than the intercorrelation among knowledge instructions (r = .36). The correlation between the two instruction types was r = .22. These results support the idea that different instruction types elicit a different response from candidates when presented with the same SJT item. However, this study was conducted with a student sample, and it may be reasonable to suspect that results may differ with a non-student sample. The students in this study were in a low-stakes situation with little incentive to misrepresent their behavioral tendencies.

McDaniel et al. (2007) conducted a meta-analysis of paper-and-pencil SJTs whose participants were employees (four of the 118 studies included used applicants) rather than university students. They categorized 118 SJT studies as having used behavioral tendency or knowledge instructions. For the majority of the studies included in the meta-analysis the content varied; therefore, holding the content constant was not possible. However, eight of the 118 studies did hold the content constant so that the SJT was administered twice, once with behavioral tendency instructions and once with knowledge instructions. Table 4 contains a summary of the construct correlations by instruction type for the eight studies where content was held constant. The results showed that SJTs with knowledge instructions were more highly correlated with cognitive ability than SJTs with behavioral tendency instructions, and that SJTs with behavioral tendency instructions were more highly correlated with all five Big Five traits than SJTs with knowledge instructions. These results support the findings of Ployhart and Ehrhart (2003) that different instruction types elicit a different response from participants when presented with the same SJT item. However, the authors caution against drawing strong conclusions because only eight of the 118 studies held the SJT content constant. This meta-analysis took an important step forward in examining the moderating effect of instruction type by using job incumbents rather than students. Because the research was conducted primarily with job incumbents, McDaniel et al. (2007) concluded that the incentive to misrepresent behavioral tendencies may be greater for job applicants and could therefore lead to different results regarding the effects of response instructions. They recommended that research on SJT response instructions be conducted using applicant data.

Table 4
SJT Construct Correlations by Instruction Type when Test Content is Held Constant from McDaniel et al. 2007

Construct / Instruction Type            N      κ    Mean r   ρ
Cognitive Ability                       1,497  8    .20      .22
  Knowledge instructions                  737  4    .25      .28
  Behavioral tendency instructions        760  4    .15      .17
Agreeableness                           1,465  8    .15      .17
  Knowledge instructions                  763  4    .12      .14
  Behavioral tendency instructions        702  4    .17      .20
Conscientiousness                       1,465  8    .24      .27
  Knowledge instructions                  763  4    .19      .21
  Behavioral tendency instructions        702  4    .29      .33
Emotional Stability                     1,465  8    .06      .08
  Knowledge instructions                  763  4    .02      .02
  Behavioral tendency instructions        702  4    .11      .13
Extroversion                            1,465  8    .04      .04
  Knowledge instructions                  763  4    .02      .02
  Behavioral tendency instructions        702  4    .06      .07
Openness to experience                  1,465  8    .07      .08
  Knowledge instructions                  763  4    .05      .05
  Behavioral tendency instructions        702  4    .09      .10

Adapted from McDaniel et al. (2007).
Note. N is the number of subjects across all studies in the analysis; κ is the number of studies; Mean r is the observed mean correlation; ρ is the estimated mean population correlation, corrected for measurement error in the personality and cognitive ability measures.

Lievens et al. (2009) had the unique opportunity to study the effects of response instructions in a high-stakes setting. They were able to manipulate SJT response instructions while holding the content constant in a high-stakes medical school applicant setting. In this study, a paper-and-pencil SJT measuring interpersonal/communication skills concerning physician and patient interactions was used as part of an admission exam for medical studies in Belgium. The SJT consisted of 30 items with four response alternatives for each item and used dichotomous scoring. Applicants were given 40 minutes to complete the SJT. The researchers randomly assigned medical school applicants the same SJT with behavioral tendency or knowledge instructions. Their sample consisted of 1,086 students who completed the SJT with knowledge instructions and 1,098 students who completed the same SJT with behavioral tendency instructions. Consistent with past research, they found that the SJT with knowledge instructions had a higher correlation with cognitive ability than the SJT with behavioral tendency instructions (.19 vs. .11). However, the authors note that this difference in cognitive loading is not large and is smaller than the differences found by Ployhart and Ehrhart (2003) and McDaniel et al. (2007) in low-stakes situations. In contrast with Ployhart and Ehrhart (2003), Lievens et al. (2009) found that the difficulty level of the SJT did not vary by instruction type. The SJT with knowledge instructions had a mean score of 14.92 and the SJT with behavioral tendency instructions had a mean score of 14.52. The authors concluded that "the answer to our initial question of whether test takers would respond differently to knowledge versus behavioral tendency instructions for SJTs in a high-stakes context seems to be no" (p. 1099).

Faking

Upon reviewing the results of research on the moderating effects of response instructions for SJTs, there appear to be two different patterns of findings. In low-stakes situations (Ployhart & Ehrhart, 2003; McDaniel et al., 2007), people tend to respond differently to knowledge and behavioral tendency instructions. Knowledge instructions in low-stakes situations have resulted in higher mean scores and smaller standard deviations than behavioral tendency instructions (Ployhart & Ehrhart, 2003). In high-stakes situations, the knowledge and behavioral tendency instructions were equal in their level of difficulty and standard deviations (Lievens et al., 2009). In low-stakes situations, the correlation of knowledge instructions with cognitive ability is meaningfully higher than that of behavioral tendency instructions (Ployhart & Ehrhart, 2003; McDaniel et al., 2007). While a high-stakes study found a similar pattern, the difference in correlation with cognitive ability between knowledge and behavioral tendency instructions was "quite small; thus any effects on adverse impact would be very small" (Lievens et al., 2009, p. 1100).

A possible explanation for the disparity in findings between low-stakes and high-stakes situations is the potential for candidates to fake on the SJT. In a low-stakes research situation, participants have nothing to gain by faking.
Low-stakes research settings often consist of university students who receive extra credit based solely on their participation in the study and not on the score they receive on the SJT. In these types of situations, it is likely that candidates will respond truthfully to behavioral tendency instructions. There are situations where how a person would respond (behavioral tendency) might be different from how a person should respond (knowledge). If participants are able to recognize this distinction and are not motivated to pick the best response (e.g., a university student receiving extra credit for research participation), faking is not likely to be an issue. However, if participants are motivated by the situation to pick the best response (e.g., applying for a job) even though they are asked to pick the behavioral tendency response, the potential for faking exists. Lievens et al. (2009) state that behavioral tendency instructions in a high-stakes setting have the potential to create a moral dilemma in applicants when the behavioral tendency and knowledge answers are not compatible. The candidate must decide to respond honestly or pick the best response. McFarland and Ryan (2000) found that even when people are instructed to fake on a measure, there are individual differences in the amount of response distortion. In a high-stakes situation, this could result in some participants being rewarded for response distortion while others are punished for responding honestly.

It is important to understand what role, if any, faking plays in SJT response instructions. To evaluate this topic, Nguyen, Biderman, and McDaniel (2005) examined the role of response instructions on faking for SJTs and whether candidates are able to fake at all. They had 203 college students complete a 31-item SJT. The students were to select the "best action" and the "worst action" to take for each item as well as select the responses they would "most likely" and "least likely" do. Each participant completed the SJT two times. For one administration, they were instructed to answer honestly, and for the other administration they were instructed to respond as if they were applying for a job and to pick the responses that would best guarantee they would be hired. The results showed that the students were able to fake good on the SJT with behavioral tendency instructions (i.e., they improved their score), but were not able to fake good on the SJT with the knowledge instructions. The authors conclude that knowledge instructions may be immune to faking because candidates cannot fake knowledge. A candidate either knows the best answer or does not, whereas with behavioral tendency instructions, candidates appear to be able to distinguish between what they would do versus what they should do. The authors speculate that faked responses on SJTs with behavioral tendency instructions may approximate knowledge responses. If the stakes are high (e.g., obtaining employment vs. not obtaining employment), researchers should be aware of the potential for faking to occur if behavioral tendency instructions are used.

Limitations of Prior Research

The two primary limitations of prior research on the effects of response instructions on SJT performance have been the inability to keep SJT content constant and the lack of high-stakes settings. When the content varies, there is the possibility that knowledge SJTs and behavioral tendency SJTs differ in a particular characteristic that causes candidates to respond or not respond differently.
For example, it may be that SJTs with knowledge instructions are written with greater detail and specificity to help discriminate between the correct and incorrect responses. On the other hand, behavioral tendency instructions may be written with more ambiguity in order to encourage personality-related responses. The variation of content is a potential confound that limits any interpretations drawn from response differences by instruction type. When the stakes are high, there is also the possibility that candidates may respond differently to behavioral tendency instructions than they would in a low-stakes situation (i.e., faking).

I have mentioned three studies where SJT response instructions have been studied and content has been held constant. Ployhart and Ehrhart (2003) kept SJT content constant with college students in a low-stakes situation. McDaniel et al. (2007) were able to keep content constant in four of the 118 studies included in their meta-analysis using job incumbents in a low-stakes situation. Lievens et al. (2009) were able to keep content constant with medical school applicants in a high-stakes situation. The first two studies specifically called for research to be conducted in a high-stakes situation. Lievens et al. (2009) were able to do this; however, they call for research to investigate the generalizability of their finding to high-stakes employment testing. To my knowledge, there have been no studies that have been able to keep SJT content constant while varying response instructions in a high-stakes employment context.

Present Study

In this study, I had the opportunity to evaluate the effects of response instructions in a high-stakes employment context. Specifically, I was able to work with a test vendor who creates selection tests for entry-level firefighters and vary the response instructions on the SJT items within their test. This answers the call for research to be conducted in a high-stakes setting with real job applicants (Whetzel & McDaniel, 2009; Lievens et al., 2009; Ployhart & MacKenzie, 2010).

Accurate testing is very important from both an academic and a practical standpoint. Most of the research on SJTs has been conducted using college students in low-stakes situations. Academics will benefit from this study by being able to compare results previously found in low-stakes situations with a high-stakes situation. There may be important boundary conditions that result in differences between low-stakes and high-stakes testing situations. Understanding whether the stakes can change the construct being measured by the same SJT is an important theoretical contribution. From a practical standpoint, understanding what a test is measuring and how item presentation may affect measurement is of vital importance. The goal of any test is to come as close as possible to obtaining the true score of a candidate. If there are differences in how candidates respond based on the instruction type used, these differences may be due to different constructs being assessed, error, or faking. In a competitive job market, answering one SJT item incorrectly can be the difference between getting a job and remaining unemployed. Having data in a high-stakes situation on how candidates respond to different SJT instruction types will provide valuable information to help personnel selection practitioners improve their testing practices.

Hypotheses

The first hypothesis is aligned with the hypothesis and findings of Lievens et al. (2009).
They hypothesized and found no differences in the difficulty level of SJTs with different instruction types in a high-stakes situation. The participants in the current study are applying for a job as an entry-level firefighter, so the stakes are high in this situation. The participants are highly motivated to achieve maximum performance, which may lead to an increased likelihood of faking (i.e., providing knowledge answers even when asked to respond behaviorally), and I anticipate that they will treat behavioral tendency instructions as if they were knowledge instructions. This contradicts the finding in low-stakes situations that SJTs with knowledge instructions tend to have higher mean scores (Ployhart & Ehrhart, 2003). The first hypothesis is that there will be no difference in item difficulty between SJTs with behavioral tendency and knowledge instructions given on the same items to two different groups of people.

The second hypothesis concerns the correlation with cognitive ability of SJTs with different response instructions. Research in low-stakes settings has consistently found that SJTs with knowledge instructions have a higher correlation with cognitive ability than SJTs with behavioral tendency instructions (Ployhart & Ehrhart, 2003; McDaniel et al., 2007). The study conducted by Lievens et al. (2009) in a high-stakes situation found the same pattern, although the differences in correlations were not as pronounced as those found in low-stakes settings. One of their reasons for hypothesizing that the difference in correlation would still occur in a high-stakes setting is that there are individual differences in response distortion (McFarland & Ryan, 2000). The study conducted by McFarland and Ryan (2000) examined faking on a personality test, a biodata inventory, and an integrity test. Although these findings may generalize to SJTs, it is possible that the finding of McFarland and Ryan (2000) was due to something else. Lievens et al. (2009) hypothesized that test takers in a high-stakes condition will generally provide the same response regardless of instruction type. They suggest that a test taker in a high-stakes context will provide a knowledge response to behavioral tendency instructions if that test taker has some knowledge of the domain being tested. However, if the test taker does not have knowledge of the domain being tested, then the test taker is likely to provide a non-faked, behavioral tendency response.

The SJT administered by Lievens et al. (2009) was difficult. The 30-item SJT with knowledge instructions had a mean score of 14.92 and the same SJT with behavioral tendency instructions had a mean score of 14.52. This is just under 50% of the items being answered correctly. Applying the reasoning of Lievens et al. (2009), test takers completing the SJT with behavioral tendency instructions likely lacked some knowledge of the content domain being tested and had to resort to a non-faked behavioral tendency response for some items, which may have led to a slightly lower correlation with cognitive ability. The SJT in this study is less difficult, as indicated by data from the test vendor on a previous alternate form of the SJT. Prior to manipulating the response instructions, the test vendor in this study had been administering a version of this 15-item SJT and the mean score was 12.36. This is just over 80% of the items being answered correctly.
Because the version of the SJT used in this study is similar to the previous version, I expect the mean score to be similar. The minimum requirement for the test takers is graduation from high school or an equivalent (e.g., GED), and the items are aimed at that level. With this test, I anticipate that the test takers will have knowledge of the domain being tested and therefore will provide knowledge responses instead of non-faked behavioral tendency responses. The second hypothesis is that SJTs with behavioral tendency and knowledge instructions given on the same items to two different groups of people will both be moderately correlated with cognitive ability.

Both of the hypotheses predict the null hypothesis. The null is a hypothesis of zero effect; zero effect means that there are no differences between the variables of interest. Most experiments in psychology create a theoretically based hypothesis about a potential difference and statistically compare it to no difference (i.e., the null hypothesis). That is, they want to know if the statistical values obtained are different enough from zero (i.e., no difference, or the null hypothesis) that they are unlikely to be due to chance; in other words, statistically significant. Hypothesizing the null is typically not well received in the scientific community. One of the primary reasons given for not hypothesizing the null is that there are too many alternative explanations for a finding of no effect. However, Cortina and Folger (1998) have argued that hypothesizing the null may be appropriate in certain situations. One situation is when testing for boundary conditions. A boundary condition is something that limits the relationship between variables or limits the conditions under which an effect would otherwise exist. Knowing when an effect fails to occur can be just as valuable to theory as knowing when it occurs (Greenwald, 1993). Research has reliably shown that candidates respond differently to knowledge and behavioral instructions in low-stakes settings. Do these differences hold up in a high-stakes situation? Does a high-stakes situation serve as a boundary condition limiting the effects that are typically found in low-stakes settings? One of the hypotheses of Lievens et al. (2009) was a null hypothesis because they too were testing for boundary conditions. According to Cortina and Folger (1998), "investigators rarely think in terms of boundary conditions, and even in those rare instances in which they do so think, they are justifiably reluctant to test those boundary conditions for fear that the article, however professional, will be unpublishable" (p. 341). Given that the purpose of the current study is to determine boundary conditions for differences in SJT response instructions, it is appropriate to hypothesize the null.

Chapter 3
METHODS

Participants

This study was conducted with applicants for entry-level firefighting positions at various fire departments throughout the United States. Fire departments that decided to purchase this particular test from an established test vendor were included in the study. The individual fire departments were responsible for recruiting, screening, and inviting candidates to complete the test. All departments had the same minimum qualification requirement of high school graduation or completion of an equivalency test (e.g., GED). A total of six fire departments participated in the study.
Each fire department that participated in the study received only one form of the test so that all candidates saw the same items, in the same order, and with the same instruction types. Because this was a high-stakes situation, where hiring decisions were made based partly on the results of this test, keeping the content and instruction type identical for all candidates within each fire department avoided the potential issue of mean score differences based on different SJT instruction types across candidates. A fire department was provided with either Form A or Form B of the test, but not both. Table 8 provides the number of participants per form and per department.

Development of the Test

The entry-level firefighter test used in this study was created using a content validation approach based on the results of a national job analysis conducted by the test vendor. The test consisted of four different sections with a total of 100 multiple-choice items. Each item had four alternatives with only one alternative keyed as the correct response. There were no penalties for incorrect answers, and each correct answer was worth one point. The total possible score on the test was 100 points. Candidates were allowed a total of two hours to complete the test, and none of the individual sections were timed. Candidates could complete the test at their own pace and could answer the items in the order they chose (at both the item and the section levels).

The first 85 items measured cognitive ability and were divided into three different sections. The items were designed to assess knowledge learned in high school and did not require any previous knowledge of firefighting. The cognitive ability items (first 85 items) were identical across Form A and Form B of the test. The internal consistency coefficient of the cognitive ability section of the test was α = .89 for Form A and α = .88 for Form B. Due to test security issues, I can only provide descriptions of the items and cannot include actual items from the test (see Table 5 for a breakdown of the test sections and number of items per section).

Table 5
Entry Firefighter Test Sections

Test Section               No. Items
Reading Comprehension      30
Math                       30
Mechanical Aptitude        25
Interpersonal Relations    15

Note. The section titled Interpersonal Relations contained the SJT items.

The reading comprehension section consisted of six reading passages that were each approximately a page in length, followed by five items relating to the passage. The reading passages related to firefighting (e.g., types of foams used in firefighting, classifying fires), but did not require previous knowledge of firefighting to answer the items. Additionally, the material in the reading passages did not necessarily correspond to actual materials or techniques in firefighting (e.g., listing fake foams, using a made-up system of classifying fires) so as not to give individuals with previous knowledge of firefighting an unfair advantage. The math section consisted of word problems put into the context of firefighting. They required candidates to have knowledge of addition, subtraction, multiplication, and division. Here is an example of a math item that is similar to those on the actual test: "An engine can pump 850 gallons per minute when operating at 100% efficiency.
How many total gallons will an engine pump in seven minutes if operating at 70% capacity?" The mechanical aptitude section of the test consisted of items designed to assess a person's ability to figure out how objects work and move, alone and in relation to other objects. An example is an item that consists of an image of six interlocking circular gears of different sizes labeled A through F. The candidate might be asked, "If gear A moves in a clockwise direction, which gear will revolve the fastest in a counterclockwise direction?" Overall, the sections of the cognitive ability test match well with meta-analytic data showing that a combination of cognitive and mechanical ability measures yields high validity coefficients with job performance (Barrett, Polomsky, & McDaniel, 1999).

The last section of the test contained 15 SJT items. The SJT was designed to assess the ability of potential firefighters to interact effectively with the public and coworkers. Experienced firefighters and test development professionals generated the critical incidents and response options. Each SJT item contained four response options with only one option keyed as correct. The scenarios and response options in the SJT section were identical across the two forms, with the only change being the instruction type. As with the cognitive ability test, I can only provide descriptions of the items and cannot include actual items from the test. Examples of the topics the SJT items covered include a citizen expressing concern over department policy, a coworker behaving inappropriately, and receiving an assignment from a manager that contradicts policy. The average length of the SJT items was 30.73 words with a standard deviation of 12.85.

In Form A, the first eight items used knowledge instructions. At the top of the page, candidates were instructed to "read each situation and choose how you should respond." The word "should" was bolded to draw the candidate's attention. In addition to the instructions at the top of the page, each item ended with "In this situation, you should" and the answer options completed the sentence. After the first eight SJT items there was a page break before candidates saw the remaining seven SJT items. At the top of the page, candidates were instructed to "read each situation and choose how you would respond." The word "would" was bolded to draw the candidate's attention. In addition to the instructions at the top of the page, each item ended with "In this situation, you would" and the answer options completed the sentence. The only change for Form B was that the instruction types for the SJT items were reversed. The first eight SJT items, which had knowledge instructions (should) in Form A, now had behavioral tendency instructions (would) in Form B. The same reversal occurred for the last seven SJT items in Form B.

The internal consistency of the SJT for Form A was α = .42 and the internal consistency for Form B was α = .37. Because SJTs are measurement methods and generally measure more than a single construct, estimates of internal consistency usually underestimate reliability (Whetzel & McDaniel, 2009; Cronbach, 1951). Test-retest and parallel forms reliability measures are preferred (Whetzel & McDaniel, 2009); however, these measures were not available. Although test-retest and parallel forms reliability are preferred, it may be appropriate to report internal consistency when an effort was made to make the SJT content homogeneous (Ployhart & Ehrhart, 2003; Schmidt & Hunter, 1996). Lievens et al.
(2009) created a 30-item SJT designed to be a homogeneous measure of interpersonal/communication skills and reported an internal consistency of α = .55 for knowledge instructions and α = .56 for behavioral instructions. Even when SJTs are intended to be homogeneous, internal consistency tends to be low.

Procedure

The six fire departments in the study contacted the test vendor expressing an interest in using its paper-and-pencil entry-level firefighter test as a part of their selection process. Each department was able to review the test prior to purchase. After a department decided to rent the test, the vendor shipped Form A or Form B of the test to that fire department. The fire departments were responsible for proctoring and administering the test. The proctors were provided with instructions on how to administer the test and given a form for documenting any candidate questions or concerns regarding the test or items on the test. None of the proctor report forms contained information regarding confusion with the SJT section or the instruction type in the SJT section. The test vendor was responsible for scoring the test and providing each department with the individual candidate scores and summary statistics for the test.

Chapter 4
RESULTS

Data Screening

Initially, 1,457 participants completed Form A of the test and 868 completed Form B. Prior to analyzing the test data from the fire departments, it was determined that data for a respondent would be retained only if the respondent completed both the cognitive ability section of the test and the SJT section. After screening on this criterion, 38 participants were dropped from Form A and 34 participants were dropped from Form B. The data were then analyzed for univariate and multivariate outliers. Univariate outliers were examined at the two levels of the independent variable (i.e., Form A and Form B). A participant was considered an outlier if their z score on the cognitive section of the test or the SJT section of the test exceeded ±3.3 (Tabachnick & Fidell, 2007). This resulted in 18 participants being dropped from Form A and 10 participants being dropped from Form B. The data were then examined for multivariate outliers by inspecting each participant's Mahalanobis distance. Using an alpha of .001, this resulted in one participant being dropped from Form A. All of the participants dropped from the analyses were outliers toward the bottom of the distribution (i.e., they correctly responded to approximately 25% of the items on the cognitive and/or SJT section of the test). There are a few potential explanations as to why some participants obtained a score in the 25% range: they may have given up early in the test, they may have provided random answers to the items, or they may not have been motivated to perform maximally. After data screening was complete, the final sample consisted of 1,400 participants for Form A and 824 participants for Form B.

Descriptive Statistics

The Pearson correlation of SJT scores with cognitive ability was r(1398) = .27, p < .001, for Form A and r(822) = .32, p < .001, for Form B. Table 6 contains the mean test scores and number of participants by gender and ethnicity, and Table 7 contains the percentage of each gender and ethnicity by department.
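The screening steps described above can be summarized in code. The following is a minimal sketch (mine, not the author's analysis) using pandas and SciPy; the column names (form, cog_score, sjt_score) are hypothetical, and the Mahalanobis screen is shown on the pooled sample rather than within each form.

    # Minimal sketch (not the author's code) of the outlier screening described above.
    import numpy as np
    import pandas as pd
    from scipy import stats

    def screen_respondents(df: pd.DataFrame) -> pd.DataFrame:
        # Retain respondents who completed both the cognitive and SJT sections.
        keep = df.dropna(subset=["cog_score", "sjt_score"]).copy()
        # Univariate outliers: |z| > 3.3 on either section, computed within each form.
        z = keep.groupby("form")[["cog_score", "sjt_score"]].transform(
            lambda s: (s - s.mean()) / s.std(ddof=1))
        keep = keep[(z.abs() <= 3.3).all(axis=1)]
        # Multivariate outliers: Mahalanobis distance vs. chi-square cutoff, p = .001, df = 2.
        x = keep[["cog_score", "sjt_score"]].to_numpy()
        diff = x - x.mean(axis=0)
        inv_cov = np.linalg.inv(np.cov(x, rowvar=False))
        d_squared = np.einsum("ij,jk,ik->i", diff, inv_cov, diff)
        cutoff = stats.chi2.ppf(1 - 0.001, df=2)
        return keep[d_squared <= cutoff]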
Overall, the sample did not contain many minorities and was primarily male and Caucasian; however, the data match up well with the firefighter gender and ethnicity demographics reported by the Bureau of Labor Statistics (2012). For Form A, there were no significant differences by gender on the SJT score (F(2, 1397) = 1.42, p = .24, η² = .002) or the cognitive section score (F(2, 1397) = 2.33, p = .10, η² = .003). For Form B, there was no significant difference by gender on the SJT score (F(2, 821) = 0.99, p = .37, η² = .002), but there was a significant difference on the cognitive section score (F(2, 821) = 4.26, p = .01, η² = .010). However, this difference only accounted for 1.0% of the variance in the cognitive section score. Additionally, Tukey post hoc tests revealed that the difference in scores was between the unknown group and males, with the unknown group obtaining a higher cognitive score on average. Not knowing the gender of the unknown group limits any conclusions drawn from this result.

Ethnicity differences were then evaluated for each form. On Form A, there were significant differences by ethnicity on the SJT score (F(7, 1392) = 2.18, p = .03, η² = .011) and the cognitive section score (F(7, 1392) = 14.93, p < .001, η² = .070). On the SJT, Tukey post hoc tests indicated that the only difference in SJT scores was between the Caucasian and Hispanic participants. On average, Caucasian participants scored higher on the SJT in Form A than the Hispanic participants. The differences are small and only account for 1.1% of the variance in SJT scores. On the cognitive ability section, African Americans scored lower than all other ethnic groups except Native Americans. The Asian and Caucasian participants scored higher than Hispanics. The differences in scores are a little larger than for the SJT, with ethnicity accounting for 7.0% of the variance in cognitive ability scores.

For Form B, there was no significant difference by ethnicity on the SJT score (F(7, 816) = 1.58, p = .14, η² = .013). It should be noted that the assumption of homogeneity of variances was violated for SJT scores on Form B and there is a large discrepancy in ethnicity sample sizes (three groups have fewer than 20 participants and two groups have more than 250). The ratio of the largest to smallest group variance for the SJT on Form B was 8.10, well above the ratio of three that is commonly used to assess homogeneity of variance. The largest variance was for the African American group, which also had one of the smallest sample sizes (n = 16). Large variances associated with smaller sample sizes can increase the likelihood of a Type I error. While there was no score difference on Form B on the SJT, there was a significant difference by ethnicity on the cognitive ability section score (F(7, 816) = 4.89, p < .001, η² = .040). Due to the violation of homogeneity of variance, the Games-Howell post hoc method was used to evaluate mean score differences by ethnicity. The post hoc tests on Form B revealed a pattern of score differences on the cognitive ability section similar to Form A: African American participants scored lower than all other ethnicities except Native American and Filipino. The size of the effect is also similar to Form A, with ethnicity accounting for 4.0% of the variance in the cognitive ability score. Table 8 contains the mean test scores and number of participants in each test form by fire department.
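The gender and ethnicity comparisons above follow a common recipe: a one-way ANOVA, an eta-squared effect size, and a check of the largest-to-smallest group variance ratio. The sketch below is mine, not the author's code, and the column names are hypothetical; it shows how eta squared can be recovered from the F statistic and its degrees of freedom, which is why the reported percentages of variance track the F values.

    # Minimal sketch (not the author's code): one-way ANOVA with eta squared and the
    # variance-ratio check used to judge homogeneity of variance.
    import pandas as pd
    from scipy import stats

    def eta_squared(f_stat: float, df_between: int, df_within: int) -> float:
        # eta^2 = SS_between / SS_total, re-expressed in terms of F and the dfs.
        return (f_stat * df_between) / (f_stat * df_between + df_within)

    def anova_by_group(df: pd.DataFrame, dv: str, group: str) -> dict:
        groups = [g[dv].to_numpy() for _, g in df.groupby(group)]
        f_stat, p_value = stats.f_oneway(*groups)
        df_between, df_within = len(groups) - 1, len(df) - len(groups)
        variances = df.groupby(group)[dv].var(ddof=1)
        return {
            "F": f_stat,
            "p": p_value,
            "eta_squared": eta_squared(f_stat, df_between, df_within),
            "variance_ratio": variances.max() / variances.min(),  # values above ~3 flag heterogeneity
        }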
When the mean scores on Form A and Form B were compared, a one-way ANOVA demonstrated a significant difference on the cognitive ability section (F(1, 2222) = 14.21, p < .01, η² = .006) but not on the SJT (F(1, 2222) = 2.59, p = .11, η² = .001). While the difference on the cognitive ability section was significant, it accounted for only 0.6% of the variance in scores.

Tables 6, 7, and 8 reveal a particular scoring pattern. Table 6 indicates that there are ethnic differences on the cognitive ability section and that they are primarily driven by differences for African American participants. Table 7 shows that department three had the highest percentage of African American participants, and Table 8 shows that department three had the lowest mean cognitive ability score of all the departments. This study was not designed to assess subgroup differences on SJTs and cognitive ability, but it is important to note that such differences do exist. The differences found in this study are similar to those reported by Ployhart and Holtz (2008): across both test forms, the Caucasian-African American uncorrected d-value was 1.09 for cognitive ability and .29 for the SJT, whereas Ployhart and Holtz (2008) reported d-values of .99 for cognitive ability and .40 for SJTs.

Table 9 contains the average difficulty of the SJT items and their correlations with cognitive ability by instruction type. Across Form A and Form B, the SJT items had an average p-value of .85 and an average correlation with cognitive ability of .10. The p-values ranged from .65 to .96 and the correlations with cognitive ability ranged from -.06 to .25. The p-values were similar across instruction types, with p-value differences by instruction type ranging from .00 to .04. The SJT item correlations with cognitive ability were also similar across instruction types, with correlation differences by instruction type ranging from .00 to .07. On Form A, the first eight SJT items, with knowledge instructions, had a total score correlation of .18 with cognitive ability; the same items on Form B, with behavioral tendency instructions, had a total score correlation of .19. The last seven SJT items, with behavioral tendency instructions on Form A, had a total score correlation of .26 with cognitive ability; the same items on Form B, with knowledge instructions, had a total score correlation of .32.
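For reference, the item statistics summarized here and reported in Table 9 (an item's p-value, its standard deviation, and its correlation with the cognitive ability score) can be computed as in the following sketch; the arrays items and cognitive are hypothetical stand-ins for the scored response data, not the vendor's scoring code.

import numpy as np
from scipy import stats

def item_statistics(items, cognitive):
    """Per-item difficulty and correlation with cognitive ability.

    items     : 2-D array (candidates x items) of 0/1 item scores
    cognitive : 1-D array of cognitive ability section scores
    Returns (p_values, item_sds, correlations).
    """
    items = np.asarray(items, dtype=float)
    cognitive = np.asarray(cognitive, dtype=float)

    p_values = items.mean(axis=0)              # proportion answering correctly
    item_sds = items.std(axis=0, ddof=1)
    # The Pearson correlation between a 0/1 item score and a continuous
    # total score is the point-biserial correlation reported in Table 9.
    corrs = np.array([stats.pearsonr(items[:, j], cognitive)[0]
                      for j in range(items.shape[1])])
    return p_values, item_sds, corrs

# Hypothetical usage:
# p, sd, r = item_statistics(sjt_form_a, cognitive_form_a)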
Table 6

Mean Test Score by Gender and Ethnicity

                                            Cognitive Ability        SJT
                        N      Percent      Mean       SD        Mean     SD
Gender
  Male                1752      78.8%       61.78     10.69      12.67    1.74
  Female                78       3.5%       60.76     10.92      12.45    1.78
  Unknown              394      17.7%       63.93     10.31      12.69    1.72
Ethnicity
  Caucasian           1024      46.0%       63.21     10.18      12.77    1.71
  Hispanic             278      12.5%       59.16     11.15      12.51    1.79
  Asian                100       4.5%       63.97     11.15      12.68    1.65
  African American      74       3.3%       51.55     10.80      12.27    1.99
  Filipino              55       2.5%       59.55     10.63      12.47    1.71
  Native American       21       0.9%       59.29      9.69      12.24    2.05
  Other                 86       3.9%       61.88     10.20      12.44    1.82
  Unknown              586      26.3%       63.03     10.24      12.76    1.64
Total                 2224     100.0%       62.13     10.66      12.69    1.72

Table 7

Gender and Ethnicity by Fire Department

                                        Department
Percent               1          2          3          4          5          6
                  (n = 1400)  (n = 135)  (n = 16)   (n = 480)  (n = 97)   (n = 96)
Gender
  Male              80.14%     65.93%     93.75%     72.08%     94.85%     91.67%
  Female             4.43%      1.48%      0.00%      2.29%      0.00%      3.13%
  Unknown           15.43%     32.59%      6.25%     25.63%      5.15%      5.21%
Ethnicity
  Caucasian         44.50%     49.63%     75.00%     38.33%     60.82%     82.29%
  Hispanic          13.93%      2.96%      0.00%     12.92%     16.49%      1.04%
  Asian              5.14%      1.48%      0.00%      4.79%      2.06%      1.04%
  African American   4.14%      1.48%     18.75%      2.08%      1.03%      0.00%
  Filipino           3.00%      0.00%      0.00%      2.50%      1.03%      0.00%
  Native American    1.14%      0.00%      0.00%      1.04%      0.00%      0.00%
  Other              4.29%      1.48%      0.00%      3.13%      6.19%      3.13%
  Unknown           23.86%     42.96%      6.25%     35.21%     12.37%     12.50%

Table 8

Participants and Mean Test Scores by Fire Department

                                          Cognitive Ability        SJT
Form and Fire Department    N    Percent    Mean      SD       Mean     SD
Form A
  1                       1400    62.9%     61.48    10.76     12.64    1.76
  Total                   1400    62.9%     61.48    10.76     12.64    1.76
Form B
  2                        135     6.1%     63.46    10.26     12.99    1.63
  3                         16     0.7%     47.13     7.52      9.81    2.01
  4                        480    21.6%     64.95    10.20     12.86    1.51
  5                         97     4.4%     64.53     8.04     13.20    1.45
  6                         96     4.3%     55.71     8.54     12.02    1.86
  Total                    824    37.1%     63.23    10.40     12.76    1.66

Table 9

SJT Item Difficulty and Correlation with Cognitive Ability

                       Form A                      Form B
Item            p-value    SD       r       p-value    SD       r
86                .83      .38    .07**       .84      .36    .14**
87                .86      .35    .12**       .90      .30    .09**
88                .91      .28    .12**       .92      .28    .14**
89                .76      .43   -.01**       .78      .41   -.06**
90                .65      .48    .07**       .68      .47    .05**
91                .76      .43    .05**       .77      .42    .08**
92                .94      .24    .07**       .93      .25    .07**
93                .87      .33    .12**       .87      .34    .06**
Avg.              .82      .37    .08**       .84      .35    .07**
Total Score
Correlation                       .18**                       .19**
94                .93      .26    .08**       .91      .29    .07**
95                .76      .43    .10**       .78      .42    .17**
96                .96      .20    .12**       .96      .18    .09**
97                .93      .25    .12**       .93      .25    .10**
98                .74      .44    .12**       .72      .45    .12**
99                .92      .26    .13**       .94      .23    .18**
100               .81      .39    .19**       .81      .39    .25**
Avg.              .86      .32    .12**       .86      .32    .14**
Total Score
Correlation                       .26**                       .31**

Note. Items 86-93 used knowledge instructions in Form A and behavioral tendency instructions in Form B. Items 94-100 used behavioral tendency instructions in Form A and knowledge instructions in Form B. The total score correlation is the correlation of the cumulative SJT instruction type score with the cognitive ability score. *p ≤ .05; **p ≤ .01.

ANCOVA Analysis

Prior to the analyses, the variables were examined to determine whether they met the statistical assumptions underlying the use of ANCOVA. The dependent variables (SJT items) in this study are binary (0 for an incorrect response and 1 for a correct response) and therefore do not meet the requirement that the dependent variable be measured on a continuous scale. Nevertheless, the decision was made to analyze the data with ANCOVA and to follow it up with logistic regression, for which binary dependent variables are appropriate. Despite the use of binary dependent variables, additional assumptions were considered prior to the analyses.
As previously mentioned, univariate and multivariate outliers were removed from the data set. The sample sizes in each level of the independent variable were large enough to meet the assumption of normality of sampling distributions. The covariate (the cognitive ability section of the test) is a reliable measure, as indicated by an internal consistency of α = .89 for Form A and α = .88 for Form B. The assumptions of linearity and homogeneity of regression were not evaluated because the dependent variables were binary.

A mixed-model (within-subjects and between-subjects) ANCOVA was performed on SJT item performance where instruction type varied. Two types of instructions were evaluated: knowledge instructions (What should you do?) and behavioral tendency instructions (What would you do?). The dependent variable was item performance on each of the 15 SJT items. The covariate was cognitive ability level as indicated by each participant's total score on the first 85 items of the test. The ANCOVA summary table is presented in Table 10.

In the between-subjects portion of the ANCOVA model there was a statistically significant effect of the covariate, cognitive ability, on SJT item performance, F(1, 2220) = 195.03, p < .001, partial η² = .081. An inspection of the bivariate (zero-order) correlation showed that higher scores on the cognitive ability section were associated with higher scores on the SJT section (r(2222) = .29, p < .001). Once the covariate was controlled for, neither instruction type nor the interaction (instruction type by cognitive ability) was significant. In the within-subjects portion of the ANCOVA model, there was a significant main effect of SJT item on SJT item performance, F(14, 31080) = 13.35, p < .001, partial η² = .006. In other words, the SJT items varied in difficulty within participants. This was not surprising, as it was expected that the SJT items would vary in their level of difficulty. Lastly, there was a significant interaction effect of item and cognitive ability on SJT item performance, F(14, 31080) = 7.20, p < .001, partial η² = .003.

Table 10

ANCOVA Summary Table

Source                                Type III SS      df       MS         F      Sig.   Partial η²
Within-Subjects Effects
  Item                                    21.90         14     1.56      13.35    .000      .006
  Item X Cognitive                        11.80         14     0.84       7.20    .000      .003
  Item X Instruction                       1.42         14     0.10       0.87    .597      .000
  Item X Instruction X Cognitive           1.41         14     0.10       0.86    .605      .000
  Error                                 3640.84      31080     0.12
Between-Subjects Effects
  Cognitive                               35.21          1    35.21     195.03    .000      .081
  Instruction                              0.09          1     0.09       0.52    .472      .000
  Instruction X Cognitive                  0.12          1     0.12       0.67    .413      .000
  Error                                  400.83       2220     0.18

Logistic Regression Analyses

Logistic regression makes fewer assumptions than ANCOVA and does not require that dependent variables be continuous. The data from this study met all the assumptions of logistic regression. The dependent variable was dichotomously coded, with "0" used for an incorrect response to the SJT item and "1" for a correct response. Logistic regression analyses were run 15 times, once for each SJT item. Running many logistic regression analyses inflates the probability of making a Type I error; therefore, a sequentially modified Bonferroni correction was applied with an alpha of .05. Specifically, the step-up Bonferroni correction described by Hochberg (1988) was used. In this method, the largest obtained p-value is compared to an alpha of .05. If the largest of the 15 p-values obtained is equal to or less than .05, then all p-values are considered significant.
If it is larger than .05, then the next largest p-value is compared to an alpha of .05 divided by two (.025). If this p-value is equal to or less than .025, it is considered significant, along with all of the remaining smaller p-values. This process continues (i.e., the alpha of .05 is divided by three for the third largest p-value, by four for the fourth largest, and so forth) until a p-value meets the criterion for significance or until all p-values have failed to reach significance.

A total of four different logistic regression models were evaluated to determine their relationship with correctly answering an SJT item. Table 11 contains the results of the logistic regression analyses.

The first model contained instruction type (Form A or Form B) as a predictor. In this model, instruction type did not provide a statistically significant improvement in explaining item performance over the constant-only model for any of the 15 SJT items. The difficulty of the SJT items did not differ by instruction type (knowledge or behavioral tendency).

The second model contained the cognitive ability section score as a predictor. In this model, 14 of the 15 SJT items showed a statistically significant improvement over the constant-only model. Increases in cognitive ability were associated with an increased probability of answering the SJT item correctly. For the 14 significant SJT items, the mean Nagelkerke pseudo R² was .026 and the mean Exp(B) was 1.032; the Nagelkerke pseudo R² values ranged from .012 to .070 and the Exp(B) values ranged from 1.025 to 1.055. On average, for every one-point increase in cognitive ability, participants were 1.032 times more likely to answer the SJT items correctly, with cognitive ability accounting for 2.6% of the variance in SJT item performance.

The third model contained the cognitive ability section score and instruction type as predictors. Model two established that cognitive ability had an effect on SJT performance; model three evaluates whether there are differences in SJT performance by instruction type when controlling for cognitive ability. In this model, after controlling for cognitive ability, instruction type did not provide a statistically significant improvement over the cognitive-ability-only model for any of the 15 SJT items. When the cognitive ability of participants was controlled, there were no differences in SJT difficulty by instruction type. For each of the 15 SJT items in model three, there was minimal or no increase in the Nagelkerke pseudo R² when compared to model two. In fact, the largest change in Nagelkerke pseudo R² was 0.005 (0.5% additional variance) over model two, for item 94.

The fourth model contained the cognitive ability section score, instruction type, and the interaction of instruction type and cognitive ability as predictors. This model was designed to test whether the association with cognitive ability differs across instruction types. In this model, the interaction did not provide a statistically significant improvement for any of the 15 SJT items over the main-effects-only model composed of cognitive ability and instruction type.
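The step-up procedure described above reduces to a few lines of code. The following is a generic sketch of Hochberg's (1988) correction as it is described here (15 p-values, an alpha of .05), not the exact script used in these analyses.

def hochberg_step_up(p_values, alpha=0.05):
    """Hochberg (1988) step-up Bonferroni correction.

    The largest p-value is compared to alpha, the next largest to
    alpha / 2, the next to alpha / 3, and so on. As soon as a p-value
    meets its threshold, it and all smaller p-values are declared
    significant. Returns booleans in the original order of p_values.
    """
    m = len(p_values)
    # Indices sorted from the largest p-value to the smallest
    order = sorted(range(m), key=lambda i: p_values[i], reverse=True)
    significant = [False] * m
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= alpha / rank:
            # This p-value and every smaller one is significant
            for j in order[rank - 1:]:
                significant[j] = True
            break
    return significant

# Example with the 15 Model 1 p-values reported in Table 11
# (none reaches significance under the correction):
# flags = hochberg_step_up([.368, .014, .707, .136, .152, .606, .439,
#                           .597, .060, .411, .334, .872, .296, .084, .329])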
Table 11

Logistic Regression Summary Table

        Model 1:                Model 2:                Model 3:                Model 4: Cognitive +
        Instruction             Cognitive               Cognitive +             Instruction +
                                                        Instruction             Cognitive X Instruction
Item   Exp(B)   Sig.    R²     Exp(B)   Sig.    R²     Exp(B)   Sig.    R²     Exp(B)   Sig.    R²
86     1.114    .368   .001    1.025    .000*  .017    1.066    .596   .017    1.020    .076   .019
87     1.409    .014   .005    1.032    .000*  .023    1.338    .038   .027    0.997    .844   .027
88     1.061    .707   .000    1.041    .000*  .033    0.985    .925   .033    1.008    .588   .034
89     1.169    .136   .002    0.998    .740   .000    1.174    .128   .002    1.001    .903   .002
90     1.143    .152   .001    1.013    .002*  .006    1.119    .234   .007    0.996    .614   .007
91     1.055    .606   .000    1.014    .003*  .006    1.030    .779   .006    1.006    .510   .006
92     0.870    .439   .001    1.026    .001*  .012    0.828    .296   .014    0.997    .841   .014
93     0.933    .597   .000    1.026    .000*  .016    0.889    .369   .017    0.984    .183   .019
94     0.741    .060   .004    1.025    .001*  .012    0.705    .029   .017    0.992    .587   .018
95     1.090    .411   .000    1.028    .000*  .023    1.039    .720   .023    1.017    .093   .025
96     1.249    .334   .001    1.051    .000*  .039    1.140    .572   .040    0.993    .782   .040
97     0.972    .872   .000    1.041    .000*  .030    0.900    .557   .030    0.994    .722   .031
98     0.902    .296   .001    1.025    .000*  .021    0.860    .129   .022    1.001    .916   .022
99     1.371    .084   .005    1.055    .000*  .053    1.249    .229   .055    1.029    .100   .058
100    1.118    .329   .001    1.052    .000*  .070    1.022    .855   .070    1.018    .114   .072

Note. R² = Nagelkerke R Square. Model 3 reports the Exp(B) and significance for instruction type. Model 4 reports the Exp(B) and significance for the interaction. *Results are significant after applying the step-up Bonferroni correction.

Chapter 5

DISCUSSION

The majority of SJT studies analyzing the effects of response instructions have been conducted with participants in low-stakes settings. When research has been conducted in high-stakes settings, SJT content has generally varied by instruction type. This study answered the specific call for research on SJT instruction type in a high-stakes employment setting (Whetzel & McDaniel, 2009; Lievens et al., 2009; Ployhart & MacKenzie, 2010). Research on instruction type in a high-stakes setting is an important step toward understanding the effect response instructions may have on the construct being measured. A greater understanding of the construct can lead to more precision in employee selection, which serves the ultimate goal of improved organizational performance.

Discussion of Hypothesized Relationships

The first hypothesis was supported: there was no difference in item difficulty between SJTs with behavioral tendency and knowledge instructions given on the same items to two different groups of people. This held true both when controlling and when not controlling for the cognitive ability level of the participants (see Models 1 and 3 in Table 11). The largest p-value difference between instruction types was .04 (see item 87 in Table 9). This result is contrary to the findings of studies conducted in low-stakes settings, which have generally found that SJTs with knowledge instructions have higher mean scores than SJTs with behavioral tendency instructions (Ployhart & Ehrhart, 2003). However, this finding is consistent with the study conducted in a high-stakes situation by Lievens et al. (2009). Based on the findings of the current study and those of Lievens et al. (2009), it appears that a high-stakes situation eliminates possible mean score differences by SJT instruction type.
The second hypothesis was also supported: SJTs with behavioral tendency and knowledge instructions given on the same items to two different groups of people were equally correlated with cognitive ability. Cognitive ability was the only significant predictor of SJT item performance in the logistic regression analyses, and the lack of an interaction between cognitive ability and instruction type suggests that the relationship does not differ in strength across the instruction types (see Models 2 and 4 in Table 11). The total score correlations between the SJT items with knowledge instructions and cognitive ability were .18 and .32, and the total score correlations between the SJT items with behavioral tendency instructions and cognitive ability were .19 and .26 (see Table 9). This result is contrary to the findings of studies conducted in low-stakes settings, which have generally found that SJTs with knowledge instructions correlate more highly with cognitive ability than SJTs with behavioral tendency instructions (McDaniel et al., 2007). It is also contrary to the high-stakes findings of Lievens et al. (2009), who found a correlation of .19 between knowledge instructions and cognitive ability and .11 between behavioral tendency instructions and cognitive ability. However, the correlations reported by Lievens et al. (2009) are smaller than those found in low-stakes situations. The difference in correlations between the current study and Lievens et al. (2009) may be explained by the varying levels of difficulty of the SJTs. The SJT used in this study had an average between-form percentage score of 84.57%, whereas the average percentage score in Lievens et al. (2009) was 49.07%. The easier SJT items in this study may have allowed candidates to rely on their cognitive knowledge rather than providing behavioral responses, as might happen when candidates are unsure of the correct cognitive response.

Support for the second hypothesis was also found with regard to the strength of the correlations of the SJT items with cognitive ability. In their meta-analysis, McDaniel et al. (2007) found a mean correlation of .25 between SJTs with knowledge instructions and cognitive ability and .15 between SJTs with behavioral tendency instructions and cognitive ability (see Table 4). The mean correlations with cognitive ability in this study were .25 for the knowledge instructions and .23 for the behavioral tendency instructions. Both instruction types were moderately correlated with cognitive ability.

Limitations

A number of important limitations restrict the generalizability of this study's findings. The SJT forms were not randomly assigned in this study. Six different fire departments participated, and each fire department was provided with only one form of the test. The decision to provide each fire department with a single form was made deliberately prior to the study because of the litigious nature of selection testing in public safety. The test vendor felt that the potential for litigation would increase if there were any observed differences by form within a fire department. Providing the same form to all candidates within a department helped to mitigate this risk because all participants within a department saw the same instruction type for each SJT item. A consequence of this administration constraint is that the data for Form A were collected from one larger fire department and the data for Form B were collected from five different fire departments.
Within the six departments, there was also no direct control over the exam administrations or the examinee groups. The fire departments were provided with standardized proctor instructions regarding test administration practices, security, and timing procedures. However, it is unknown whether these instructions were consistently adhered to across departments and test administrations. It is also unknown whether all of the examinees were typical of those commonly seeking work within each department. All of the departments had the same minimum qualifications for application; however, some departments may attract candidates who are overqualified while others may attract candidates who are just minimally qualified. As noted previously, the ethnic diversity of the candidates varied by department. Specifically, department three had a higher percentage of African American participants than the other departments. The lack of random test form assignment within departments limits the generalizability of the findings.

The length of the SJT was not ideal in this study. The test vendor had an existing test with 15 SJT items; an SJT with at least 30 items would have been preferable, both to increase reliability and to obtain more data for the analyses. In addition, the SJT was relatively easy (the average percentage score across the two forms was 84.57%), which reduced its ability to discriminate between high and low performers.

The current study also lacked a criterion measure (i.e., job performance) and a personality measure; I was only able to correlate scores with cognitive ability. The fire departments involved in the study did not elect to use a measure of personality as part of the selection process and did not want to be involved in collecting and/or providing job performance data. Knowing how the SJT scores relate to job performance and personality would have created a more holistic picture of the effects of response instructions. Previous studies have found that SJT correlations with personality vary by instruction type (McDaniel et al., 2007), so it would have been valuable to collect these data in the current study, given that it was a high-stakes employment situation.

It is possible that the participants in the study did not notice when the SJT instruction type changed. When the instruction type changed, there was a page break before the participants saw the remaining SJT items, and the new instructions appeared at the top of the page. Additionally, the end of each SJT item contained the behavioral tendency or knowledge instruction. Even with the page break and the instructions at the end of each item, the overall presentation of the items is very similar, and participants may not have noticed the change. This approach to the instructions was chosen deliberately: the goal was to clearly separate the two instruction types without specifically telling the candidates that they were expected to respond differently to the new instruction type. It may be that participants did not notice the within-subjects instruction change; however, if participants had responded differently by instruction type, this would have been evident in the between-subjects comparison on the first eight items, prior to the instruction change. No between-subjects differences were found either, which indicates that the instruction types were treated similarly. Future research could examine how different methods for changing instruction types within an SJT may affect how participants respond.
Lastly, both of the hypotheses in this study predicted null effects. One of the primary reasons for avoiding null hypotheses is that there are too many alternative explanations for a finding of no effect. The intent of this study was to assess boundary conditions related to SJT instruction types, but no causality can be assumed from the results. All of the limitations described here are potential explanations for why the null was not rejected in this study.

Conclusions and Theoretical Implications

One of the strengths of this study was that it was conducted using an employee selection exam taken by participants who were seeking employment as firefighters. The motivation to perform maximally was very high, and this motivation is difficult to achieve when university students are used as participants. There was no need to simulate an employment selection setting because the setting was real. Another strength was the size of the sample. Although diversity in the sample was lacking, the sample size for each form of the test was sufficient to provide stable data for the analyses. Specific geographic data were not provided, but the departments included in the study came from different states within the United States.

Overall, the findings of this study indicate that participants in a high-stakes situation respond consistently (i.e., they treat different instruction types the same) to SJTs with knowledge or behavioral tendency instructions. Given this finding, it may be beneficial for developers of high-stakes personnel selection SJTs to avoid behavioral tendency instructions (Lievens et al., 2009). Participants who receive behavioral tendency instructions may find themselves in a dilemma as to how they should respond: should they provide an honest behavioral response or a knowledge response that they know is best? Utilizing knowledge instructions avoids situations in which an honest job applicant receives a lower score not because of a lack of knowledge, but because the applicant, unlike the majority of other applicants, adhered to the specific response instructions. In employee selection testing, a top-down approach is often used in which higher scorers are selected to move on in the selection process or are hired on the basis of their higher scores. The written test score may also be added to other selection tests (e.g., interview, writing sample) to create a cumulative score, and a difference of one or a few points on the SJT may be the deciding factor in a competitive job market.

Developers of SJTs should consider whether it is more important that participants know what they should do or that they choose the answer that best represents what they would do. It can be argued that participants who know what they should do (i.e., the knowledge response) have the capacity to be trained on the proper response for a particular situation, whereas a participant whose behavioral tendency response is incorrect may not know what should be done and would therefore be harder to train and more likely to respond incorrectly on the job. It can also be argued that participants whose behavioral tendency response is in alignment with the knowledge response are optimal, because their two responses are congruent and therefore maximize the likelihood of a correct response in a real situation.
However, as previously noted, in a high-stakes situation it can be difficult to determine whether a participant's response is truly a behavioral tendency response or a faking-good response.

Future research should evaluate the effects of non-dichotomous scoring techniques across response instructions. For example, the response instruction of "what would you most likely do and what would you least likely do" is probably the most commonly used instruction type (Ployhart & Ehrhart, 2003). Research should investigate whether the correlations with cognitive ability and personality differ when evaluating the "least likely" option, and whether the options differ in their level of difficulty. Likert rating scales are also sometimes used with SJTs. When SJT content is held constant, research should evaluate whether the construct measured changes by scoring technique: if a Likert rating scale is used but the instruction type varies, does the construct being measured change? Lastly, research should evaluate what cognitive processes are involved when participants complete an SJT. Do participants consciously discriminate between their personal knowledge and behavioral tendency responses? When participants are unsure of the knowledge response to an SJT, do they resort to their behavioral tendency response?

SJTs have proven to be a valuable tool for high-stakes employment testing. However, there is still much we do not know about this measurement method. Additional research will help to define the boundary conditions of this method, which constructs it is most aptly suited to measure, and how to modify response instructions and scoring techniques to target specific constructs.

REFERENCES

Arthur, W., & Villado, A. (2008). The importance of distinguishing between constructs and methods when comparing predictors in personnel selection research and practice. Journal of Applied Psychology, 93, 435-442.

Barrett, G. V., Polomsky, M. D., & McDaniel, M. A. (1999). Selection tests for firefighters: A comprehensive review and meta-analysis. Journal of Business and Psychology, 13, 507-513.

Bergman, M. E., Drasgow, F., Donovan, M. A., Henning, J. B., & Juraska, S. (2006). Scoring situational judgment tests: Once you get the data, your troubles begin. International Journal of Selection and Assessment, 14, 223-235.

Bruce, M. M., & Learner, D. B. (1958). A supervisory practices test. Personnel Psychology, 11, 207-216.

Bureau of Labor Statistics (2012, August). Labor force characteristics by race and ethnicity, 2011 (Report 1036). Retrieved from http://www.bls.gov/cps/cpsrace2011.pdf

Cardall, A. J. (1942). Preliminary manual for the Test of Practical Judgment. Chicago: Science Research Associates.

Christian, M. S., Edwards, B. D., & Bradley, J. C. (2010). Situational judgment tests: Constructs assessed and a meta-analysis of their criterion-related validities. Personnel Psychology, 63, 83-117.

Clevenger, J., Pereira, G. M., Wiechmann, D., Schmitt, N., & Schmidt-Harvey, V. (2001). Incremental validity of situational judgment tests. Journal of Applied Psychology, 86, 410-417.

Cortina, J. M., & Folger, R. G. (1998). When is it acceptable to accept a null hypothesis: No way, Jose? Organizational Research Methods, 1, 334-350.

Corts, D. B. (1980). Development of a procedure for examining trades and labor applicants for promotion to first-line supervisor. Washington, DC: U.S. Office of Personnel Management, Personnel Research and Development Center, Research Branch.

Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16, 297-334.
Cronbach, L. J. (1984). Essentials of psychological testing (4th ed.). New York: Harper & Row.

Greenberg, S. H. (1963). Supervisory judgment test manual (Technical Series No. 35). Washington, DC: Personnel Measurement Research and Development Center, U.S. Civil Service Commission, Bureau of Programs and Standards, Standards Division.

Greenwald, A. G. (1993). Consequences of prejudice against the null hypothesis. In G. Keren & C. Lewis (Eds.), A handbook for data analysis in the behavioral sciences (pp. 419-460). Hillsdale, NJ: Lawrence Erlbaum.

Hanson, M. A. (1994). Development and construct validation of a situational judgment test of supervisory effectiveness for first-line supervisors in the U.S. Army. Unpublished doctoral dissertation, University of Minnesota.

Hochberg, Y. (1988). A sharper Bonferroni procedure for multiple tests of significance. Biometrika, 75, 800-802.

Jagmin, N. (1985). Individual differences in perceptual/cognitive constructions of job-relevant situations as a predictor of assessment center success. College Park, MD: University of Maryland.

Landy, F. J. (2007). The validation of personnel decisions in the twenty-first century: Back to the future. In S. M. McPhail (Ed.), Alternate validation strategies: Developing and leveraging existing validity evidence (pp. 409-426). San Francisco: Jossey-Bass.

Lievens, F., & Patterson, F. (2011). The validity and incremental validity of knowledge tests, low-fidelity simulations, and high-fidelity simulations for predicting job performance in advanced-level high-stakes selection. Journal of Applied Psychology, 96, 927-940.

Lievens, F., Sackett, P. R., & Buyse, T. (2009). The effects of response instructions on situational judgment test performance and validity in a high-stakes context. Journal of Applied Psychology, 94, 1095-1101.

McDaniel, M. A., Hartman, N. S., Whetzel, D. L., & Grubb, W. L., III (2007). Situational judgment tests, response instructions, and validity: A meta-analysis. Personnel Psychology, 60, 63-91.

McDaniel, M. A., Morgeson, F. P., Finnegan, E. B., Campion, M. A., & Braverman, E. P. (2001). Predicting job performance using situational judgment tests: A clarification of the literature. Journal of Applied Psychology, 86, 730-740.

McDaniel, M. A., & Nguyen, N. T. (2001). Situational judgment tests: A review of practice and constructs assessed. International Journal of Selection and Assessment, 9, 103-113.

McDaniel, M. A., & Whetzel, D. L. (2005). Situational judgment test research: Informing the debate on practical intelligence theory. Intelligence, 33, 515-525.

McFarland, L. A., & Ryan, A. M. (2000). Variance in faking across non-cognitive measures. Journal of Applied Psychology, 85, 812-821.

Moss, F. A. (1926). Do you know how to get along with people? Why some people get ahead in the world while others do not. Scientific American, 135, 26-27.

Motowidlo, S. J., Carter, G. W., Dunnette, M. D., Tippins, N., Werner, S., Burnett, J. R., & Vaughan, M. J. (1992). Studies of the structured behavioral interview. Journal of Applied Psychology, 77, 571-587.

Motowidlo, S. J., Dunnette, M. D., & Carter, G. W. (1990). An alternative selection procedure: The low-fidelity simulation. Journal of Applied Psychology, 75, 640-647.

Motowidlo, S. J., Hanson, M. A., & Crafts, J. L. (1997). Low-fidelity simulations. In D. L. Whetzel & G. R. Wheaton (Eds.), Applied measurement methods in industrial psychology (pp. 241-260). Palo Alto, CA: Davies-Black.

Northrop, L. C. (1989). The psychometric history of selected ability constructs. Washington, DC: U.S. Office of Personnel Management.
Nguyen, N. T., Biderman, M. D., & McDaniel, M. A. (2005). Effects of response instructions on faking a situational judgment test. International Journal of Selection and Assessment, 13, 250-260.

Ployhart, R. E., & Ehrhart, M. G. (2003). Be careful what you ask for: Effects of response instructions on the construct validity and reliability of situational judgment tests. International Journal of Selection and Assessment, 11, 1-16.

Ployhart, R. E., & Holtz, B. C. (2008). The diversity-validity dilemma: Strategies for reducing racioethnic and sex subgroup differences and adverse impact in selection. Personnel Psychology, 61, 153-172.

Ployhart, R. E., & MacKenzie, W. I. (2010). Situational judgment tests: A critical review and agenda for the future. In S. Zedeck (Ed.), APA handbook of industrial and organizational psychology (pp. 237-252). Washington, DC: American Psychological Association.

Pulakos, E. D., & Schmitt, N. (1996). An evaluation of two strategies for reducing adverse impact and their effects on criterion-related validity. Human Performance, 9, 241-258.

Richardson, Bellows, Henry, & Co. (1981). Technical reports: Supervisory Profile Record. Richardson, Bellows, Henry, & Co.

Sacco, J. M., Scheu, C. R., Ryan, A. M., Schmitt, N., Schmidt, D. B., & Rogg, K. L. (2000). Reading level and verbal test scores as predictors of subgroup differences and validities of situational judgment tests. Paper presented at the 15th annual conference of the Society for Industrial and Organizational Psychology, New Orleans, LA.

Salgado, J. F., Viswesvaran, C., & Ones, D. S. (2001). Predictors used for personnel selection: An overview of constructs, methods, and techniques. In N. R. Anderson, D. S. Ones, H. K. Sinangil, & C. Viswesvaran (Eds.), Handbook of industrial, work, & organizational psychology, Vol. 1 (pp. 165-199). London: Sage.

Schmidt, F. L., & Hunter, J. E. (1996). Measurement error in psychological research: Lessons from 26 research scenarios. Psychological Methods, 1, 199-223.

Schmidt, F. L., & Hunter, J. E. (1998). The validity and utility of selection methods in personnel psychology: Practical and theoretical implications of 85 years of research findings. Psychological Bulletin, 124, 262-274.

Schmitt, N., & Chan, D. (2006). Situational judgment tests: Method or construct? In J. A. Weekley & R. E. Ployhart (Eds.), Situational judgment tests (pp. 135-156). Mahwah, NJ: Lawrence Erlbaum Associates.

SIOP (2003). Principles for the validation and use of personnel selection procedures. Bowling Green, OH: Society for Industrial and Organizational Psychology.

Tabachnick, B. G., & Fidell, L. S. (2007). Experimental design using ANOVA. Belmont, CA: Duxbury.

Weekley, J. A., & Jones, C. (1999). Further studies of situational tests. Personnel Psychology, 52, 679-700.

Weekley, J. A., & Ployhart, R. E. (2006). An introduction to situational judgment testing. In J. A. Weekley & R. E. Ployhart (Eds.), Situational judgment tests. Mahwah, NJ: Lawrence Erlbaum Associates.

Wernimont, P., & Campbell, J. P. (1968). Signs, samples, and criteria. Journal of Applied Psychology, 52, 372-376.

Whetzel, D. L., & McDaniel, M. A. (2009). Situational judgment tests: An overview of current research. Human Resource Management Review, 19, 188-202.

Whetzel, D. L., McDaniel, M. A., & Nguyen, N. T. (2008). Subgroup differences in situational judgment test performance: A meta-analysis. Human Performance, 21, 291-309.