
THE EFFECTS OF RESPONSE INSTRUCTIONS ON SITUATIONAL JUDGMENT
TEST PERFORMANCE IN A HIGH-STAKES EMPLOYMENT CONTEXT
A Thesis
Presented to the faculty of the Department of Psychology
California State University, Sacramento
Submitted in partial satisfaction of
the requirements for the degree of
MASTER OF ARTS
in
Psychology
(Industrial / Organizational Psychology)
by
Clinton Dean Kelly
FALL
2013
© 2013
Clinton Dean Kelly
ALL RIGHTS RESERVED
THE EFFECTS OF RESPONSE INSTRUCTIONS ON SITUATIONAL JUDGMENT
TEST PERFORMANCE IN A HIGH-STAKES EMPLOYMENT CONTEXT
A Thesis
by
Clinton Dean Kelly
Approved by:
__________________________________, Committee Chair
Gregory Hurtz, Ph.D.
__________________________________, Second Reader
Lawrence Meyers, Ph.D.
__________________________________, Third Reader
Howard Fortson, Ph.D.
____________________________
Date
Student: Clinton Dean Kelly
I certify that this student has met the requirements for format contained in the University
format manual, and that this thesis is suitable for shelving in the Library and credit is to
be awarded for the thesis.
__________________________, Graduate Coordinator
Jianjian Qin, Ph.D.
Department of Psychology
___________________
Date
Abstract
of
THE EFFECTS OF RESPONSE INSTRUCTIONS ON SITUATIONAL JUDGMENT
TEST PERFORMANCE IN A HIGH-STAKES EMPLOYMENT CONTEXT
by
Clinton Dean Kelly
The purpose of this study was to investigate the effects of response instructions on
situational judgment test (SJT) performance. Participants (N = 2,224) completed one of
two forms of a 15-item SJT as part of an entry-level firefighter selection test. Eight
of the SJT items on Form A used knowledge instructions (what should you do) and seven
used behavioral tendency instructions (what would you do). On Form B, the same items
were used, but for each item the “should” versus “would” response instructions were
reversed relative to Form A. No significant differences were found for mean scores by
instruction type within or between subjects. Additionally, no significant differences were
found in the SJT items’ correlations with cognitive ability by instruction type. These
results are in direct contrast to those found in previous studies involving low-stakes
situations. Limitations and theoretical implications of the findings were also discussed.
_______________________, Committee Chair
Gregory Hurtz, Ph.D.
_______________________
Date
ACKNOWLEDGEMENTS
I would like to thank Gregory Hurtz, Ph.D. for serving as my chair, providing guidance
along the way, and putting up with my procrastination. I would also like to thank Larry
Meyers, Ph.D. and Howard Fortson, Ph.D. for their contributions to my education in the
classroom and in the workplace. Jason Schaefer helped to get this all started and assisted
with some preliminary presentations of the findings. My wife, daughter, and parents
have been a source of support, encouragement, and love throughout this process. Lastly,
Nala was always by my side with unconditional support.
TABLE OF CONTENTS
Page
Acknowledgements ...................................................................................................... vi
List of Tables ............................................................................................................... ix
Chapter
1. OVERVIEW ............................................................................................................. 1
Brief History of SJTs ......................................................................................... 2
SJTs as a Measurement Method ....................................................................... 3
Reliability ......................................................................................................... 7
Validity ............................................................................................................. 8
2. STUDY BACKGROUND .................................................................................... 11
SJTs as Measures of Behavioral Tendency .................................................... 11
SJTs as Measures of Job Knowledge or Cognitive Ability ............................ 11
Instruction Types ........................................................................................... 11
Faking ............................................................................................................ 17
Limitations of Prior Research ........................................................................ 19
Present Study .................................................................................................. 20
Hypotheses ..................................................................................................... 21
3. METHODS ............................................................................................................ 26
Participants .................................................................................................... 26
Development of the Test ................................................................................ 26
Procedure ........................................................................................................ 30
4. RESULTS .............................................................................................................. 32
Data Screening ............................................................................................... 32
Descriptive Statistics ...................................................................................... 33
ANCOVA Analysis ........................................................................................ 41
Logistic Regression Analyses ........................................................................ 43
5. DISCUSSION ....................................................................................................... 48
Discussion of Hypothesized Relationships .................................................... 48
Limitations...................................................................................................... 50
Conclusions and Theoretical Implications ..................................................... 53
References ................................................................................................................... 56
LIST OF TABLES
Tables                                                                                   Page
1. Constructs Assessed Using SJTs from Christian, Edwards, & Bradley, 2010 .................. 5
2. SJT Construct Correlations from McDaniel et al. 2007 ..................................... 9
3. Types of SJT Instructions ................................................................ 12
4. SJT Construct Correlations by Instruction Type when Test Content is Held Constant
   from McDaniel et al., 2007 ............................................................... 15
5. Entry Firefighter Test Sections .......................................................... 27
6. Mean Test Score by Gender and Ethnicity .................................................. 37
7. Gender and Ethnicity by Fire Department .................................................. 38
8. Participants and Mean Test Scores by Fire Department ..................................... 39
9. SJT Item Difficulty and Correlation with Cognitive Ability ............................... 40
10. ANCOVA Summary Table .................................................................... 43
11. Logistic Regression Summary Table ....................................................... 46
Chapter 1
OVERVIEW
Personnel selection is the process that organizations follow to make hiring
decisions. Virtually all organizations use some sort of selection procedure as a part of
their personnel selection process. A selection procedure is any procedure used singly or
in combination to make a personnel decision (SIOP, 2003). The selection procedures
organizations use can vary from the simplistic (e.g., pass a criminal background check in
order to clean dishes at a restaurant) to the complex (apply with the CIA and take a
personality test, a cognitive ability test, a writing sample, a polygraph, three interviews,
and a physical ability test). Regardless of the simplicity or complexity of the selection
procedure, the goal should be to hire the best person for the job. Many different types of
selection procedures have been used throughout the years (Schmidt & Hunter, 1998);
however, this study will focus specifically on the use of situational judgment tests (SJTs)
in personnel selection.
SJTs have become more popular over the last few years in personnel selection
(Weekley & Ployhart, 2006). Some reasons for this increase in popularity are that SJTs
have the potential to reduce adverse impact in comparison to cognitive ability tests,
present problem solving skills in an applied setting, offer incremental validity over
cognitive ability and personality, and receive positive applicant reactions (Landy, 2007).
As the name implies, SJTs contain items that are designed to assess an applicant’s
judgment regarding a situation encountered in the work place (Weekley & Ployhart,
2006). A respondent is presented with work-related situations and a list of possible
courses of action. An example SJT item is provided below:
You have a project due to your supervisor by a specific date, but because of your
workload, you realize that you will not be able to complete the project by the deadline.
Which is the best response?
a. Work extra hours in order to complete the project by the deadline.
b. Complete what you can by the deadline.
c. Let your supervisor know that the deadline was unrealistic.
d. Ask for an extension of the deadline.
SJTs are versatile and are not limited to a single mode of administration (e.g.,
written, verbal, video-based, or computer-based) (Clevenger, Pereira, Wiechmann,
Schmitt, & Schmidt-Harvey, 2001) nor a single scoring method (e.g., dichotomous, rating
scale) (Bergman, Drasgow, Donovan, Henning, & Juraska, 2006). However, the focus of
this paper will be on written multiple-choice SJTs that are dichotomously scored.
Brief History of SJTs
One of the first tests to contain SJTs with a list of response options like the
example provided was the George Washington Social Intelligence Test created in 1926
(Moss, 1926). One of the seven subtests was titled Judgment in Social Situations and was
designed to assess knowledge of judgment and human motives. A few years later during
World War II, the army created a test designed to assess the common sense, experience,
and general knowledge of soldiers (Northrop, 1989). The use of SJTs then spread to
organizations where tests were developed to measure supervisory potential. The Practical
Judgment Test (Cardall, 1942) and Supervisory Practices Test (Bruce & Learner, 1958)
were both created for organizational use.
SJTs continued to be used sporadically in organizations, but did not receive much
attention until 1990 when Motowidlo, Dunnette, and Carter (1990) created an SJT to
select entry-level managers. They referred to their test as a low-fidelity simulation and
found that the correlations of the SJT scores with supervisory performance ratings ranged
from .28 to .37 in the total sample. The authors concluded that, “low-fidelity simulations
can be used to harness the predictive potential of behavioral samples even when cost or
other practical constraints make it impossible to develop and implement high-fidelity
simulations” (Motowidlo et al., 1990, p. 647). Since the early 1990s, SJT research has
increased and they are commonly used in personnel selection in both the United States
and Europe (Salgado, Viswesvaran, & Ones, 2001). The rise in use of SJTs is likely due
to their validity in predicting job performance (McDaniel, Hartman, Whetzel, & Grubb,
2007), their incremental validity over job knowledge tests (Lievens & Patterson, 2011)
and personality tests (McDaniel et al., 2007), their reduced race-based adverse impact
relative to cognitive measures (Whetzel, McDaniel, & Nguyen, 2008), and their face and
content validity (Motowidlo, Hanson, & Crafts, 1997).
SJTs as a Measurement Method
It is important to recognize the distinction between methods and constructs when
referring to SJTs or any test used in personnel selection. A construct is the concept a
selection procedure is intended to measure and the method is the measurement tool used
to assess a construct or constructs (Arthur & Villado, 2008; SIOP, 2003). Some methods
are designed to measure a single construct and other methods may measure more than
one construct. For example, an integrity test is designed to measure the construct of
integrity and a cognitive ability test is designed to measure the construct of general
cognitive ability. An example of a measurement method that may assess more than one
construct is an assessment center. An assessment center may be designed to assess the
constructs of interpersonal skills, managing others, and job knowledge. SJTs are similar
to assessment centers in that they are a measurement method and not a construct
(McDaniel & Whetzel, 2005). SJTs can be used as a method to measure different
constructs (i.e., construct heterogeneous), although the method does place some
restrictions on what can be measured (Christian, Edwards, & Bradley, 2010; Schmitt &
Chan, 2006). Two of the most common construct domains measured by SJTs include
leadership and interpersonal skills (Christian et al., 2010). Table 1 contains a summary of
the constructs assessed using SJTs from a meta-analysis conducted by Christian et al.
(2010).
Table 1
Constructs Assessed Using SJTs from Christian, Edwards, & Bradley, 2010
Construct Domain and Constructs

Job knowledge and skills:
    Knowledge of the interrelatedness of units
    Pilot judgment
    Managing tasks
    Team role knowledge
Interpersonal Skills:
    Ability to size up personalities
    Customer contact effectiveness
    Customer service interactions
    Guest relations
    Interactions
    Interpersonal skills
    Negotiations
    Service situations
    Social intelligence
    Working effectively with others
Teamwork Skills:
    Teamwork
    Teamwork KSAs
Leadership:
    Administrative judgment
    Conflict resolution for managers
    Directing the activities of others
    Handling people
    General management performance
    Handling employee problems
    Leadership/supervision
    Managerial/supervisory skill or judgment
    Managerial situations
    Supervisor actions dealing with people
    Supervisor job knowledge
    Supervisor problems
    Supervisor Profile Questionnaire
    Managing others
Personality Composites:
    Conscientiousness, Agreeableness, Neuroticism
    Adaptability, ownership, self-initiative, teamwork, integrity, work ethic
Conscientiousness:
    Conscientiousness
Agreeableness:
    Agreeableness
Neuroticism:
    Neuroticism

Adapted from Christian, Edwards, & Bradley (2010).
It is important to remember that SJTs often do not clearly measure one specific
construct and that Table 1 lists the constructs that Christian et al. (2010) felt were most
saturated by the SJTs they included in their study. The potential for SJTs to measure
more than one construct in a valid and cost-effective manner makes them attractive tools
for individuals working in personnel selection.
Reliability
Reliability relates to the precision of a test. It is a measurement of the amount of
error in a test. There are different methods for assessing reliability, but only three will be
discussed here. Test-retest reliability is obtained by having the same person complete a
test twice and calculating a Pearson correlation between the two test administrations.
Parallel forms reliability is obtained by creating two versions of a test that assess the
same construct or constructs and having the same people complete both versions. A
Pearson correlation between the two sets of scores is then calculated to obtain the reliability. The
third form is internal consistency reliability, which is based on the idea that test takers
will respond similarly to items that assess the same construct. Coefficient alpha or
Cronbach’s alpha is the measure of internal consistency that is typically used when
reporting internal consistency reliability.
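
The computations behind the test-retest and internal consistency estimates are straightforward. As a rough illustration only (the function and variable names below are my own and are not part of any scoring system described in this thesis), the following Python sketch computes a test-retest Pearson correlation and Cronbach's alpha for a dichotomously scored person-by-item matrix:

import numpy as np

def test_retest_reliability(scores_time1, scores_time2):
    # Pearson correlation between two administrations of the same test.
    return np.corrcoef(scores_time1, scores_time2)[0, 1]

def cronbach_alpha(item_matrix):
    # alpha = (k / (k - 1)) * (1 - sum of item variances / variance of total scores)
    items = np.asarray(item_matrix, dtype=float)
    k = items.shape[1]                          # number of items
    item_vars = items.var(axis=0, ddof=1)       # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)   # variance of examinees' total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical responses: 5 examinees by 4 dichotomously scored items
responses = [[1, 1, 0, 1],
             [1, 1, 1, 1],
             [0, 1, 0, 0],
             [1, 0, 1, 1],
             [0, 0, 0, 1]]
print(round(cronbach_alpha(responses), 2))
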
There are some special considerations to keep in mind before estimating the
reliability of SJTs because they tend to be construct heterogeneous at the scale and item
level (McDaniel & Whetzel, 2005). Consider the SJT example presented previously. The
situation stated that due to workload the employee has recognized that he or she will not
be able to finish a project by the scheduled deadline. The respondent is then asked which
of the listed response options is best. A respondent who chooses the option that states,
“work extra hours in order to complete the project by the deadline,” may choose this
response due to a high level of intelligence and/or due to a high level of agreeableness.
This particular item would then correlate with both cognitive ability and agreeableness
(Whetzel & McDaniel, 2009). If the SJT consists of additional items that also correlate
with both intelligence and other personality factors, we could say that the SJT is construct
heterogeneous. In other words, the SJT is not internally consistent.
Unless there is evidence that an SJT is homogeneous (i.e., internally consistent),
test-retest and parallel forms reliability are preferred and Cronbach's alpha should be
avoided (Cronbach, 1951). However, in research or field situations, it can be difficult to
obtain test-retest or parallel forms reliability, and many researchers continue to report internal
consistency estimates of reliability without mentioning the tendency of such measures to
underestimate reliability (Whetzel & McDaniel, 2009).
Validity
The existing research concerning the types of validity evidence for SJTs focuses on
construct validity and predictive validity (i.e., prediction of job performance). Concerning
construct validity, researchers have found moderate correlations of SJTs with cognitive
ability and personality (McDaniel, Morgeson, Finnegan, Campion, & Braverman, 2001;
McDaniel & Nguyen, 2001; Ployhart & Ehrhart, 2003). McDaniel et al. (2007) conducted
a meta-analysis of 118 studies of SJTs that included data on construct validity. They
found that on average SJTs are most highly correlated with cognitive ability and three of
the Big Five personality traits (conscientiousness, agreeableness, and emotional stability).
The mean r values ranged from .19 to .29 (see Table 2).
Table 2
SJT Construct Correlations from McDaniel et al. 2007
Construct                       N        κ     Mean r    ρ
Cognitive Ability               30,859   95    .29       .32
Agreeableness                   25,473   51    .22       .25
Conscientiousness               31,277   53    .23       .27
Emotional Stability             19,325   49    .19       .22
Extroversion                    11,351   25    .13       .14
Openness to experience           4,515   19    .11       .13
Adapted from McDaniel et al. (2007).
Note. N is the number of subjects across all studies in the analysis; κ is the number of
studies; Mean r is the observed mean; ρ is the estimated mean population correlation,
corrected for measurement error in the personality and cognitive ability measures.
McDaniel et al. (2007) also evaluated the validity evidence concerning prediction
of job performance and incremental validity over cognitive ability and personality. They
found an average correlation of .26 with job performance ratings based on 118
coefficients (N = 24,756), an incremental validity from .03 to .05 over cognitive ability,
and an incremental validity from .06 to .07 over a measure of the Big 5. The authors did
note that almost all of the data in their meta-analysis came from incumbent samples and
not job applicants.
Overall, research supports the construct, predictive and incremental validity
evidence for SJTs. However, there have been conflicting data regarding the strength of
the validity coefficients. Most of the differences in findings have been attributed to two
different instruction types used for SJTs (McDaniel & Nguyen, 2001). The primary
purpose of this study is to add to the existing knowledge of how response instructions
moderate the construct validity of SJTs as it relates to cognitive ability and how item-level
difficulty is affected by response instruction type.
Chapter 2
STUDY BACKGROUND
SJTs as Measures of Behavioral Tendency
Professionals in personnel selection often work on the assumption of behavioral
consistency, which states that the best predictor of future behavior is past behavior
(Wernimont & Campbell, 1968). This is one of the explanations offered for why SJTs
predict job performance (Motowidlo et al., 1990). According to this explanation, SJTs
serve as an indication of how job applicants will behave on the job if hired. If this is true,
organizations can use SJTs that reflect common work situations to assess a candidate’s
behavioral tendencies.
SJTs as Measures of Job Knowledge or Cognitive Ability
The second explanation offered for why SJTs predict job performance is that they
measure job knowledge or cognitive ability (McDaniel et al., 2001). Using this
explanation, when candidates complete SJTs they are cognitively evaluating each
response option and picking the best option. The best option may not always be
congruent with a candidate’s behavioral tendency. If this explanation is true,
organizations can use SJTs to assess a candidate’s maximal performance (Cronbach,
1984).
Instruction Types
Research has shown that the amount of behavioral tendency or cognitive ability
measured by an SJT may be moderated by the instruction type used (Ployhart & Ehrhart,
2003; McDaniel et al., 2007; Lievens, Sackett, & Buyse, 2009). Accordingly, SJT
instructions are typically divided into two categories: behavioral tendency instructions
and knowledge instructions (McDaniel & Nguyen, 2001). Table 3 provides a summary of
the different instructions by type. Ployhart and Ehrhart (2003) note that “most and
least likely do” instructions are probably the most commonly used in practice.
Table 3
Types of SJT Instructions
Instruction Type
Study
Behavioral Tendency Instructions
1. Would do
Bruce and Learner (1958)
2. Most and least likely do
Pulakos and Schmitt (1996)
3. Rate and rank what you would most
likely do
Jagmin (1985)
4. Have done
Motowidlo, Carter, Dunnette, Tippins,
Werner, Burnett, and Vaughan (1992)
Knowledge Instructions
1. Should do
Hanson (1994)
2. Best response
Greenberg (1963)
3. Best and worst
Weekley and Jones (1999)
4. Best and second best
Richardson, Bellows, Henry, & Co.
(1981)
5. Best, second, and third best
Cardall (1942)
6. Level of importance
Corts (1980)
7. Rate effectiveness of each response
Sacco, Scheu, Ryan, Schmitt,
Schmidt, and Rogg (2000)
Adapted from Ployhart and Ehrhart (2003) and McDaniel et al. (2007).
The variety of instruction types used has led to some confusion about what SJTs
measure and how effective they are in predicting job performance. The first study to
attempt to clarify this confusion by modifying SJT response instructions while holding
the SJT content constant was conducted by Ployhart and Ehrhart (2003).
In this study, they developed a five-item paper-and-pencil SJT consisting of
common study situations encountered by students with four response options for each
item. The researchers identified six different instruction types (three behavioral tendency
and three knowledge). In Phase I of the study, 486 students were randomly administered
the SJT using one of the six instruction types. In Phase II, 134 students completed the
same five-item SJT six times, once for each instruction type. The results showed that the
SJT items with knowledge instructions resulted in higher means, lower standard
deviations, and more non-normal distributions than the behavioral tendency instructions.
They also evaluated the intercorrelations among the instruction types within each category
and the correlation between the two categories. The intercorrelation among behavioral tendency instructions (r
= .76) was higher than the intercorrelation among knowledge instructions (r = .36). The
correlation between the two instruction types was r = .22. These results support the idea
that different instruction types elicit a different response from candidates when presented
with the same SJT item. However, this study was conducted with a student sample and it
may be reasonable to suspect that results may differ with a non-student sample. The
students in this study were in a low-stakes situation with little incentive to misrepresent
their behavioral tendencies.
McDaniel et al. (2007) conducted a meta-analysis of paper-and-pencil SJTs
whose participants were employees (four of the 118 studies included used applicants)
rather than university students. They categorized 118 SJT studies as having used
behavioral tendency or knowledge instructions. For the majority of the studies included
in the meta-analysis the content varied; therefore, holding the content constant was not
possible. However, eight of the 118 studies did hold the content constant so that the SJT
was administered twice, once with behavioral tendency instructions and once with
cognitive instructions. Table 4 contains a summary of the construct correlations by
instruction type for the eight studies where content was held constant. The results showed
that SJTs with knowledge instructions were more highly correlated with cognitive ability
than SJTs with behavioral tendency instructions and that behavioral tendency instructions were
more highly correlated with all five factors of the Big Five than knowledge instructions.
The results of this study support the results of Ployhart and Ehrhart (2003) that different
instruction types elicit a different response from participants when presented with the
same SJT item. However, the authors caution against drawing strong conclusions because
only eight of the 118 studies held the SJT content constant. This meta-analysis made
an important step forward in understanding the moderating effect of instruction type by using job
incumbents rather than students. Because this research was conducted primarily with job
incumbents, McDaniel et al. (2007) conclude that the incentive to misrepresent
behavioral tendencies may be greater for job applicants and may therefore lead to potentially
different results regarding the effects of response instructions. They recommended that
research on SJT response instructions be conducted using applicant data.
Table 4
SJT Construct Correlations by Instruction Type when Test Content is Held Constant from
McDaniel et al. 2007
Construct / Instruction Type              N        κ    Mean r    ρ
Cognitive Ability                         1,497    8    .20       .22
  Knowledge instructions                    737    4    .25       .28
  Behavioral tendency instructions          760    4    .15       .17
Agreeableness                             1,465    8    .15       .17
  Knowledge instructions                    763    4    .12       .14
  Behavioral tendency instructions          702    4    .17       .20
Conscientiousness                         1,465    8    .24       .27
  Knowledge instructions                    763    4    .19       .21
  Behavioral tendency instructions          702    4    .29       .33
Emotional Stability                       1,465    8    .06       .08
  Knowledge instructions                    763    4    .02       .02
  Behavioral tendency instructions          702    4    .11       .13
Extroversion                              1,465    8    .04       .04
  Knowledge instructions                    763    4    .02       .02
  Behavioral tendency instructions          702    4    .06       .07
Openness to experience                    1,465    8    .07       .08
  Knowledge instructions                    763    4    .05       .05
  Behavioral tendency instructions          702    4    .09       .10
Adapted from McDaniel et al. (2007).
Note. N is the number of subjects across all studies in the analysis; κ is the number of
studies; Mean r is the observed mean; ρ is the estimated mean population correlation,
corrected for measurement error in the personality and cognitive ability measures.
Lievens et al. (2009) had the unique opportunity to study the effects of response
instructions in a high stakes setting. They were able to manipulate SJT response
instructions while holding the content constant in a high-stakes medical school applicant
setting. In this study, a paper-and-pencil SJT measuring interpersonal/communication
skills concerning physician and patient interactions was used as part of an admission
exam for medical studies in Belgium. The SJT consisted of 30 items with four response
alternatives for each item and used dichotomous scoring. Applicants were given 40
minutes to complete the SJT. They randomly assigned medical school applicants the
same SJT with behavioral tendency or knowledge instructions. Their sample consisted of
1,086 students who completed the SJT with knowledge instructions and 1,098 students
who completed the same SJT with behavioral tendency instructions. Similar with past
research, they found that the SJT with knowledge instructions had a higher correlation
with cognitive ability than the SJT with behavioral tendency instructions (.19 vs. .11).
However, the authors note that this difference in cognitive loading is not large and is
smaller than the differences found by Ployhart and Ehrhart (2003) and McDaniel et al.
(2007) in low-stakes situations.
In contrast with Ployhart and Ehrhart (2003), Lievens et al. (2009) found that the
difficulty level of the SJT did not vary by instruction type. The SJT with cognitive
instructions had a mean score of 14.92 and the SJT with behavioral tendency instructions
had a mean score of 14.52. The authors concluded that “the answer to our initial question
of whether test takers would respond differently to knowledge versus behavioral
tendency instructions for SJTs in a high-stakes context seems to be no” (p. 1099).
Faking
Upon reviewing the results of research on the moderating effects of response
instructions for SJTs, there appear to be two different findings. In low-stakes situations
(Ployhart & Ehrhart, 2003; McDaniel et al., 2007), people tend to respond differently to
knowledge and behavioral tendency instructions. Knowledge instructions in low-stakes
situations have resulted in higher mean scores and smaller standard deviations than
behavioral tendency instructions (Ployhart & Ehrhart, 2003). In high-stakes situations,
the knowledge and behavioral tendency instructions were equal in their level of difficulty
and standard deviations (Lievens et al., 2009). In low-stakes situations, the correlation of
knowledge instructions with cognitive ability is meaningfully higher than behavioral
tendency instructions (Ployhart & Ehrhart, 2003; McDaniel et al., 2007). While a high-stakes
study has found a similar pattern, the difference in correlation with cognitive
ability between knowledge and behavioral tendency instructions is “quite small; thus any
effects on adverse impact would be very small” (Lievens et al., 2009, p. 1100).
A possible explanation for the disparity in findings between low-stakes and high-stakes
situations is the potential for candidates to fake on the SJT. In a low-stakes
research situation, participants have nothing to gain by faking. Low-stakes research
settings often consist of university students who receive extra credit based solely on their
participation in the study and not on the score they receive on the SJT. In these types of
situations, it is likely that candidates will respond truthfully to behavioral tendency
instructions. There are situations where how a person would respond (behavioral
tendency) might be different from how a person should respond (knowledge). If
participants are able to recognize this distinction and are not motivated to pick the best
response (e.g., a university student receiving extra credit for research participation),
faking is not likely to be an issue.
However, if participants are motivated by the situation to pick the best response
(e.g., applying for a job) even though they are asked to pick the behavioral tendency
response, the potential for faking exists. Lievens et al. (2009) state that behavioral
tendency instructions in a high-stakes setting have the potential to create a moral
dilemma in applicants when the behavioral tendency and knowledge answer are not
compatible. The candidate must decide to respond honestly or pick the best response.
McFarland and Ryan (2000) found that even when people are instructed to fake on a
measure, there are individual differences in the amount of response distortion. In a high-stakes
situation, this could result in some participants being rewarded for response
distortion while others are punished for responding honestly. It is important to understand
what role, if any, faking plays in SJT response instructions.
To evaluate this important topic, Nguyen, Biderman, and McDaniel (2005)
examined the role of response instructions in faking on SJTs and whether candidates are able to
fake at all. They had 203 college students complete a 31-item SJT. The students were to
select the “best action” and the “worst action” to take in each item as well as select the
response they would “most likely” do and “least likely” do. Each participant completed the
SJT two times. For one administration, they were instructed to answer honestly and for
the other administration they were instructed to respond as if they were applying for a job
and to pick the responses that would best guarantee they would be hired. The results
showed that the students were able to fake good on the SJT with behavioral tendency
instructions (i.e., they improved their score), but were not able to fake good on the SJT
with the knowledge instructions.
The authors conclude that knowledge instructions may be immune to faking
because candidates cannot fake knowledge. A candidate knows the best answer or does
not, whereas with behavioral tendency instructions, candidates appear to be able to
distinguish between what they would do versus what they should do. The authors
speculate that faked responses on SJTs with behavioral tendency instructions may approximate
knowledge responses. If the stakes are high (e.g., obtaining employment vs. not obtaining
employment) researchers should be aware of the potential for faking to occur if
behavioral tendency instructions are used.
Limitations of Prior Research
The two primary limitations of prior research on the effects of response
instructions on SJT performance have been the inability to keep SJT content constant and
the lack of high-stakes settings. When the content varies, there is the possibility that
knowledge SJTs and behavioral tendency SJTs differ in some characteristic
that causes candidates to respond differently. For example, it may be that
SJTs with knowledge instructions are written with greater detail and specificity to help
discriminate between the correct and incorrect responses. On the other hand, behavioral
tendency instructions may be written with more ambiguity in order to encourage
personality-related responses. The variation of content is a potential confound that limits
any interpretations drawn from response differences by instruction type.
When the stakes are high, there is also the possibility that candidates may respond
differently to behavioral tendency instructions than when in a low-stakes situation (i.e.,
faking). I have mentioned three studies where SJT response instructions have been
studied and content has been held constant. Ployhart and Ehrhart (2003) kept SJT content
constant with college students in a low-stakes situation. McDaniel et al. (2007) were able
to keep content constant in four of the 118 studies included in their meta-analysis using
job incumbents in a low-stakes situation. Lievens et al. (2009) were able to keep content
constant with medical school applicants in a high-stakes situation. The first two studies
specifically called for research to be conducted in a high-stakes situation. Lievens et al.
(2009) were able to do this; however, they called for research to investigate the
generalizability of their finding to high-stakes employment testing. To my knowledge,
there have been no studies that have been able to keep SJT content constant while
varying response instructions in a high-stakes employment context.
Present Study
In this study, I had the opportunity to evaluate the effects of response
instructions in a high-stakes employment context. Specifically, I was able to work with a
test vendor who creates selection tests for entry-level firefighters and vary the response
instructions on the SJT items within their test. This answers the call for research to be
conducted in a high-stakes setting with real job applicants (Whetzel & McDaniel, 2009;
Lievens et al., 2009; Ployhart & MacKenzie, 2010).
Accurate testing is very important from both an academic and a practical
standpoint. Most of the research on SJTs has been conducted using college students in
low-stakes situations. Academics will benefit from this study by being able to compare
results previously found in low-stakes situations with a high-stakes situation. There may
be important boundary conditions that result in differences between low and high-stakes
testing situations. Understanding whether the stakes can change the construct being measured
by the same SJT is an important theoretical contribution.
From a practical standpoint, understanding what a test is measuring and how item
presentation may affect measurement is of vital importance. The goal of any test is to
come as close as possible to obtaining the true score of a candidate. If there are
differences in how candidates respond based on the instruction type used, these
differences may be due to different constructs being assessed, error, or faking. In a
competitive job market, answering one SJT item incorrectly can be the difference between
getting a job and remaining unemployed. Having data in a high-stakes situation on how
candidates respond to different SJT instruction types will provide valuable information to
help personnel selection practitioners improve their testing practices.
Hypotheses
The first hypothesis is aligned with the hypothesis and findings of Lievens et al.
(2009). They hypothesized and found no differences in the difficulty level of SJTs with
different instruction types in a high-stakes situation. The participants in the current study
are applying for a job as an entry-level firefighter, so the stakes are high in this situation.
The participants are highly motivated to achieve maximum performance, which may lead
to an increased likelihood of faking (i.e., providing knowledge answers even when asked to
respond behaviorally), and I anticipate that they will treat behavioral tendency instructions
as if they were knowledge instructions. This contradicts the finding in low-stakes
situations where SJTs with knowledge instructions tend to have higher mean scores
(Ployhart & Ehrhart, 2003). The first hypothesis is that there will be no difference in item
difficulty between SJTs with behavioral tendency and knowledge instructions given on
the same items to two different groups of people.
The second hypothesis concerns the correlation with cognitive ability of SJTs
with different response instructions. Research in low-stakes settings has consistently
found that SJTs with knowledge instructions have a higher correlation with cognitive
ability than SJTs with behavioral tendency instructions (Ployhart & Ehrhart, 2003;
McDaniel et al., 2007). The study conducted by Lievens et al. (2009) in a high-stakes
situation found the same pattern, although the differences in correlation were not as
pronounced as those found in low-stakes settings. One of their reasons for hypothesizing
that the difference in correlation will still occur in a high-stakes setting is that there are
individual differences in response distortion (McFarland & Ryan, 2000). The study
conducted by McFarland and Ryan (2000) examined faking on a personality test, biodata
inventory, and an integrity test. Although these findings may generalize to SJTs, it is
possible that the finding of McFarland and Ryan (2000) was due to something else.
Lievens et al. (2009) hypothesized that test takers in a high-stakes condition will
generally provide the same response regardless of instruction type. They suggest that a
test taker in a high-stakes context will provide a knowledge response to behavioral
tendency instructions if that test taker has some knowledge of the domain being tested.
However, if the test taker does not have knowledge of the domain being tested then the
test taker is likely to provide a non-faked, behavioral tendency response. The SJT
administered by Lievens et al. (2009) was difficult. The 30 item SJT with knowledge
instructions had a mean score of 14.52 and the same SJT with behavioral tendency
instructions had a mean score of 14.92. This is just under 50% of the items being
answered correctly. Applying the reasoning of Lievens et al. (2009), test takers
completing the SJT with behavioral tendency instructions likely lacked some knowledge
of the content domain being tested and had to resort to a non-faked behavioral tendency
response for some items, which may have led to a slightly lower correlation with
cognitive ability.
The SJT in this study is less difficult, as indicated by data from the test
vendor on a previous alternate form of the SJT. Prior to manipulating the response
instructions, the test vendor in this study had been administering a version of this 15-item
SJT and the mean score was 12.36. This is just over 80% of the items being answered
correctly. Because the version of the SJT used in this study is similar to the previous
version, I expect the mean score to be similar. The minimum requirement for the test
takers is graduation from high school or an equivalent (e.g., GED) and the items are
aimed at that level. With this test, I anticipate that the test takers will have knowledge of
the domain being tested and therefore will provide knowledge responses instead of non-faked
behavioral tendency responses. The second hypothesis is that SJTs with behavioral
tendency and knowledge instructions given on the same items to two different groups of
people will both be moderately correlated with cognitive ability.
Both of these hypotheses amount to hypothesizing the null. The null is a hypothesis
of zero effect, meaning that there are no differences between the variables of
interest. Most experiments in psychology create a theoretically based hypothesis about a
potential difference and statistically compare it to no difference (i.e., the null hypothesis).
That is, they want to know if the statistical values obtained are different enough from
zero (i.e., no difference or the null hypothesis) that they are likely not due to chance or in
other words, statistically significant.
Hypothesizing the null is typically not well received in the scientific community.
One of the primary reasons given for not hypothesizing the null is that there are too many
alternative explanations to explain a finding of no effect. However, Cortina and Folger
(1998) have argued that hypothesizing the null may be appropriate in certain situations.
One situation is when testing for boundary conditions. A boundary condition is
something that limits the relationship between variables or limits the conditions under
which an effect would otherwise exist. Knowing when an effect fails to occur can be just
as valuable to theory as knowing when it occurs (Greenwald, 1993).
Research has reliably shown that candidates respond differently to knowledge and
behavioral instructions in low-stakes settings. Do these differences hold up in a high-stakes
situation? Does a high-stakes situation serve as a boundary condition limiting the
effects that are typically found in low-stakes settings? One of the hypotheses of Lievens et al.
(2009) was a null hypothesis because they too were testing for boundary conditions.
According to Cortina and Folger (1998), “investigators rarely think in terms of boundary
conditions, and even in those rare instances in which they do so think, they are justifiably
reluctant to test those boundary conditions for fear that the article, however professional,
will be unpublishable” (p. 341). Given that the purpose of the current study is to determine
boundary conditions for differences in SJT response instructions, it is appropriate to
hypothesize the null.
Chapter 3
METHODS
Participants
This study was conducted with applicants for entry-level firefighting positions at
various fire departments throughout the United States. Fire departments that decided to
purchase this particular test from an established test vendor were included in the study.
The individual fire departments were responsible for recruiting, screening, and inviting
candidates to complete the test. All departments had the same minimum qualification
requirement of high school graduation or completion of an equivalency test (e.g., GED).
A total of six fire departments participated in the study. Each fire department that
participated in the study received only one form of the test so that all candidates saw the
same items, in the same order, and with the same instruction types. Because this was a
high-stakes situation, where hiring decisions were made based partly on the results of this
test, keeping the content and instruction type identical for all candidates within each fire
department avoided the potential issue of mean score differences based on different SJT
instruction types across candidates. A fire department was provided with either a Form A
or Form B version of the test, but not both. Table 8 provides the number of participants
per form and per department.
Development of the Test
The entry-level firefighter test used in this study was created using a content
validation approach based on the results of a national job analysis conducted by the test
vendor. The test consisted of four different sections with a total of 100 multiple-choice
items. Each item had four alternatives with only one alternative keyed as the correct
response. There were no penalties for incorrect answers and each correct answer was
worth one point. The total possible score on the test was 100 points. Candidates were
allowed a total of two hours to complete the test and none of the individual sections were
timed. Candidates could complete the test at their own pace and could answer the items in
the order they chose (at both the item and the section levels).
The first 85 items measured cognitive ability and were divided into three different
sections. The items were designed to assess knowledge learned in high school and did not
require any previous knowledge of firefighting. The cognitive ability items (first 85
items) were identical across Form A and Form B of the test. The internal consistency
coefficient of the cognitive ability section of the test was α = .89 for Form A and α = .88
for Form B. Due to test security issues, I can only provide descriptions of the items and
cannot include actual items from the test (see Table 5 for a breakdown of the test sections
and number of items per section).
Table 5
Entry Firefighter Test Sections
Test Section                  No. Items
Reading Comprehension             30
Math                              30
Mechanical Aptitude               25
Interpersonal Relations           15
Note. The section titled Interpersonal Relations contained the SJT items.
The reading comprehension section consisted of six reading passages that were
each approximately a page in length followed by five items relating to the passage. The
reading passages related to firefighting (e.g., types of foams used in firefighting,
classifying fires), but did not require previous knowledge of firefighting to answer the
items. Additionally, the material in the reading passage did not necessarily correspond to
actual materials or techniques in firefighting (e.g., listing fake foams, using a made-up
system of classifying fires) so as not to give individuals with previous knowledge of
firefighting an unfair advantage.
The math section consisted of word problems put into the context of firefighting.
They required candidates to have knowledge of addition, subtraction, multiplication, and
division. Here is an example of a math item that is similar to those on the actual test. “An
engine can pump 850 gallons per minute when operating at 100% efficiency. How many
total gallons will an engine pump in seven minutes if operating at 70% capacity?”
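As a worked illustration of the intended arithmetic (my own computation, not material from the test): 850 gallons per minute × 0.70 = 595 gallons per minute, and 595 gallons per minute × 7 minutes = 4,165 gallons.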
The mechanical aptitude section of the test consisted of items that are designed to
assess a person’s ability to figure out how objects work and move, alone and in relation to
other objects. An example is an item that consists of an image of six interlocking circular
gears of different sizes labeled A through F. The candidate might be asked “If gear A moves
in a clockwise direction, which gear will revolve the fastest in a counterclockwise
direction?”
Overall, the sections of the cognitive ability test match well with meta-analytic
data showing that a combination of cognitive and mechanical ability measures yields
high validity coefficients with job performance (Barrett, Polomsky, & McDaniel, 1999).
The last section of the test contained 15 SJT items. The SJT was designed to
assess the ability of potential firefighters to interact effectively with the public and
coworkers. Experienced firefighters and test development professionals generated the
critical incidents and response options. Each SJT item contained four response options
with only one option keyed as correct. The scenarios and response options in the SJT
section were identical across the two forms with the only change being the instruction
type. As with the cognitive ability test, I can only provide descriptions of the items and
cannot include actual items from the test. Examples of the topics the SJT items covered
include a citizen expressing concern over department policy, a coworker behaving
inappropriately, and receiving an assignment from a manager that contradicts policy. The
average length of the SJT items was 30.73 words with a standard deviation of 12.85.
In Form A, the first eight items used knowledge instructions. At the top of the
page, candidates were instructed to “read each situation and choose how you should
respond.” The word “should” was bolded to draw the candidate’s attention. In addition to
the instructions at the top of the page, each item ended with “In this situation, you
should” and the answer options completed the sentence.
After the first eight SJT items there was a page break before candidates saw the
remaining seven SJT items. At the top of the page, candidates were instructed to “read
each situation and choose how you would respond.” The word “would” was bolded to
draw the candidate’s attention. In addition to the instructions at the top of the page, each
item ended with “In this situation, you would” and the answer options completed the
sentence.
The only change for Form B was that the instruction types for the SJT items were
reversed. The first eight SJT items that had cognitive instructions (should) in Form A
now had behavioral instructions (would). The same reversal occurred for the last seven
SJT items in Form B.
The internal consistency of the SJT for Form A was α = .42 and the internal
consistency for Form B was α = .37. Because SJTs are measurement methods and
generally measure more than a single construct, estimates of internal consistency usually
underestimate reliability (Whetzel & McDaniel, 2009; Cronbach, 1951). Test-retest and
parallel forms reliability measures are preferred (Whetzel & McDaniel, 2009); however,
these measures were not available. Although test-retest and parallel forms reliability are
preferred, it may be appropriate to report internal consistency when an effort was made to
make the SJT content homogeneous (Ployhart & Ehrhart, 2003; Schmidt & Hunter,
1996). Lievens et al. (2009) created a 30-item SJT designed to be a homogeneous measure
of interpersonal/communication skills and reported an internal consistency of α = .55 for
knowledge instructions and α = .56 for behavioral instructions. Even when SJTs are
intended to be homogeneous, internal consistency tends to be low.
Procedure
The six fire departments in the study contacted the test vendor expressing an
interest in using their paper-and-pencil entry-level firefighter test as a part of their
selection process. Each department was able to review the test prior to purchase. After
making the decision to rent the test, the vendor then shipped Form A or Form B of the
test to the fire department. The fire departments were responsible for proctoring and
administering the test. The proctors were provided with instructions on how to administer
the test and given a form for documenting any candidate questions or concerns regarding
the test or items on the test. None of the proctor report forms contained information
regarding confusion with the SJT section or the instruction type in the SJT section. The
test vendor was responsible for scoring the test and providing the department with the
individual candidate scores and summary statistics for the test.
Chapter 4
RESULTS
Data Screening
Initially, 1,457 participants completed Form A of the test and 868 completed Form
B. Prior to analyzing the test data from the fire departments, it was determined that data
for a respondent would be retained if both the cognitive ability section of the test and the
SJT section were completed by the respondent. After screening on the previously
mentioned criteria, 38 participants were dropped from Form A and 34 participants were
dropped from Form B.
The data were then analyzed for univariate and multivariate outliers. Univariate
outliers were examined at the two levels of the independent variable (i.e., Form A and
Form B). A participant was considered an outlier if their z score on the cognitive section
of the test or SJT section of the test was ±3.3 (Tabachnik & Fidell, 2007). This resulted in
18 participants being dropped from Form A and 10 participants being dropped from Form
B. The data were then examined for multivariate outliers looking at each participant’s
Mahalanobis distance. Using an alpha of .001, this resulted in one participant being
dropped from Form A. All of the participants dropped from the analyses were outliers
towards the bottom of the distribution (i.e., they correctly responded to approximately
25% of the items on the cognitive and/or SJT section of the test). There are a few
potential explanations as to why some participants obtained a score in the 25% range.
The participants may have given up early in the test, the participants may have provided
random answers to the items, or the participants may not have been motivated to perform
maximally.
After data screening was complete, the final sample consisted of 1,400 participants
for Form A and 824 participants for Form B.
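
As a minimal sketch of the screening logic described above (the variable names and the made-up score matrix are assumptions for illustration, not the applicant data), the univariate and multivariate outlier checks can be expressed in Python as follows:

import numpy as np
from scipy import stats

def univariate_outliers(scores, cutoff=3.3):
    # Flag cases whose z score falls beyond +/- cutoff (Tabachnick & Fidell, 2007).
    z = (scores - scores.mean()) / scores.std(ddof=1)
    return np.abs(z) > cutoff

def mahalanobis_outliers(data, alpha=0.001):
    # Flag multivariate outliers by comparing squared Mahalanobis distances
    # to a chi-square critical value with df = number of variables.
    data = np.asarray(data, dtype=float)
    diff = data - data.mean(axis=0)
    inv_cov = np.linalg.inv(np.cov(data, rowvar=False))
    d2 = np.einsum('ij,jk,ik->i', diff, inv_cov, diff)
    return d2 > stats.chi2.ppf(1 - alpha, df=data.shape[1])

# Hypothetical scores: column 0 = cognitive ability total, column 1 = SJT total
rng = np.random.default_rng(1)
scores = np.column_stack([rng.normal(70, 10, 100), rng.normal(12, 2, 100)])
keep = ~(univariate_outliers(scores[:, 0])
         | univariate_outliers(scores[:, 1])
         | mahalanobis_outliers(scores))
screened = scores[keep]
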
Descriptive Statistics
The Pearson correlation of SJT scores on Form A with cognitive ability was
r(1398) = .27, p < .001 and was r(822) = .32, p < .001 for Form B. Table 6 contains the
mean test scores and number of participants by gender and ethnicity and Table 7 contains
the percentage of each gender and ethnicity by department. Overall, the sample did not
contain many minorities and was primarily male and Caucasian; however, the data match
up well with the gender and ethnicity firefighter demographics reported by the Bureau of
Labor Statistics (2012).
For Form A, there were no significant differences by gender on the SJT score
(F(2, 1397) = 1.42, p = .24, η2 = .002) or the cognitive section score (F(2, 1397) = 2.33,
p = .10, η2 = .003). For Form B, there was no significant difference by gender on
the SJT score (F(2, 821) = 0.99, p = .37, η2 = .002), but there was a significant difference
on the cognitive section score (F(2, 821) = 4.26, p = .01, η2 = .010). However, this
difference only accounted for 1.0% of the variance in the cognitive section score.
Additionally, Tukey post hoc tests revealed that the difference in scores was between the
unknown group and males, with the unknown group obtaining a higher cognitive score on
average. Not knowing the gender of the unknown group limits any conclusions drawn
from this result.
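
The F statistics and η2 values reported in this section come from one-way ANOVAs, with η2 computed as the ratio of between-group to total sums of squares. A brief Python sketch of that computation, using made-up group scores rather than the applicant data, is:

import numpy as np
from scipy import stats

def one_way_anova_eta_squared(*groups):
    # Return F, p, and eta squared (SS_between / SS_total) for a one-way ANOVA.
    f_stat, p_value = stats.f_oneway(*groups)
    all_scores = np.concatenate(groups)
    grand_mean = all_scores.mean()
    ss_between = sum(len(g) * (np.mean(g) - grand_mean) ** 2 for g in groups)
    ss_total = ((all_scores - grand_mean) ** 2).sum()
    return f_stat, p_value, ss_between / ss_total

# Hypothetical SJT scores for three gender categories (male, female, unknown)
rng = np.random.default_rng(0)
male = rng.normal(12.5, 1.5, 800)
female = rng.normal(12.6, 1.5, 150)
unknown = rng.normal(12.7, 1.5, 50)
print(one_way_anova_eta_squared(male, female, unknown))
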
Ethnicity differences were then evaluated for each form. On Form A, there were
significant differences by ethnicity on the SJT score (F(7, 1392) = 2.18, p = .03, η2 =
.011) and the cognitive section score (F(7, 1392) = 14.93, p < .001, η2 = .070). On the
SJT, Tukey post hoc tests indicated that the only difference in SJT scores was between
the Caucasian and Hispanic participants. On average, Caucasian participants scored
higher on the SJT in Form A than the Hispanic participants. The differences are small and
only account for 1.1% of the variance in SJT scores. On the cognitive ability section,
African Americans scored lower than all other ethnic groups except for Native American.
The Asian and Caucasian participants scored higher than Hispanics. The differences in
scores are a little larger than for the SJT and accounted for 7.7% of the variance in
cognitive ability scores.
For Form B, there was no significant difference by ethnicity on the SJT score
(F(7, 816) = 1.58, p = .14, η² = .013). It should be noted that the assumption of
homogeneity of variances was violated for SJT scores on Form B and that there was a large
discrepancy in ethnicity sample sizes (three groups had fewer than 20 participants and two
groups had more than 250). The ratio of the largest to smallest group variance for the
SJT on Form B was 8.10, well above the ratio of three that is commonly used to assess
homogeneity of variance. The largest variance was for the African American group,
which also had one of the smallest sample sizes (n = 16). Large variances associated with
smaller sample sizes can increase the likelihood of a Type I error. While there was no
score difference on Form B on the SJT, there was a significant difference by ethnicity on
the cognitive ability section score (F(7, 816) = 4.89, p < .001, η² = .040). Due to the
violation of homogeneity of variance, the Games-Howell post hoc method was used to
evaluate mean score differences by ethnicity. The post hoc tests on Form B revealed a
pattern of score differences on the cognitive ability section similar to that found for Form A.
African American participants scored lower than all other ethnicities except Native American and
Filipino. The size of the effect was also similar to Form A, with ethnicity accounting for
4.0% of the variance in the cognitive ability score.
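The homogeneity of variance check referred to above (the ratio of the largest to the smallest
group variance) is straightforward to reproduce. The sketch below assumes hypothetical columns
named sjt and ethnicity; dedicated implementations of the Games-Howell procedure are available
in third-party libraries and are not reproduced here.

    def variance_ratio(df, dv="sjt", factor="ethnicity"):
        """Ratio of the largest to the smallest group variance, with group sizes,
        as a rough check on the homogeneity of variance assumption."""
        summary = df.groupby(factor)[dv].agg(["var", "count"])
        ratio = summary["var"].max() / summary["var"].min()
        return ratio, summary.sort_values("var", ascending=False)

    # A ratio well above ~3, paired with very unequal group sizes (e.g., n = 16 vs. n > 250),
    # is the pattern flagged in the text for SJT scores on Form B.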
Table 8 contains the mean test scores and number of participants in each test form
by fire department. When mean scores were compared between Form A and Form B,
a one-way ANOVA demonstrated a significant difference on the cognitive
ability section (F(1, 2222) = 14.21, p < .01, η² = .006), but not on the SJT (F(1, 2222) =
2.59, p = .11, η² = .001). While the difference on the cognitive ability section between
forms was significant, it accounted for only 0.6% of the variance in scores.
Tables 6, 7, and 8 reveal a particular scoring pattern. Table 6 indicates that there
are ethnic differences on the cognitive ability section and that they are primarily driven
by the lower mean scores of the African American participants. Table 7 shows that department
three has the highest percentage of African American participants, and Table 8 shows that
department three has the lowest mean score on cognitive ability of all the departments. This
study was not designed to assess subgroup differences on SJTs and cognitive ability, but it
is important to note that such differences do exist. The differences found in this study are
similar to those reported by Ployhart and Holtz (2008). Across both test forms, the
Caucasian-African American uncorrected d-value for cognitive ability was 1.09 and for the
SJT was .29. Ployhart and Holtz (2008) reported d-values of .99 for cognitive ability and .40 for SJTs.
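For reference, the uncorrected d-values reported above are standardized mean differences
computed with a pooled standard deviation. A minimal sketch, assuming two arrays containing
the scores of the groups being compared:

    import numpy as np

    def cohens_d(group1, group2):
        """Uncorrected standardized mean difference (Cohen's d) using the pooled SD."""
        g1, g2 = np.asarray(group1, float), np.asarray(group2, float)
        n1, n2 = len(g1), len(g2)
        pooled_var = ((n1 - 1) * g1.var(ddof=1) + (n2 - 1) * g2.var(ddof=1)) / (n1 + n2 - 2)
        return (g1.mean() - g2.mean()) / np.sqrt(pooled_var)

    # Applied to the Caucasian and African American cognitive ability scores across forms,
    # this calculation corresponds to the 1.09 value reported in the text.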
Table 9 contains the average difficulty of the SJT items and their correlations with
cognitive ability by instruction type. Across Form A and Form B, the SJT items had an
average p-value of .85 and an average correlation with cognitive ability of .10. The p-values
ranged from .65 to .96 and the correlations with cognitive ability ranged from -.06 to .25.
The p-values were similar across instruction type, with p-value differences by instruction
type ranging from .00 to .04. The SJT item correlations with cognitive ability were
similar across instruction type, with correlation differences by instruction type ranging
from .00 to .07. On Form A, the first eight SJT items with knowledge instructions had a
total score correlation of .18 with cognitive ability. The same SJT items on Form B with
behavioral tendency instructions had a total score correlation of .19 with cognitive ability.
The last seven SJT items with behavioral tendency instructions on Form A had a total
score correlation of .26 with cognitive ability. The same SJT items on Form B with
knowledge instructions had a total score correlation of .32 with cognitive ability.
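The entries in Table 9 are classical item statistics: each p-value is the proportion of candidates
answering the item correctly, and each correlation is the point-biserial correlation between the
dichotomous item score and the cognitive ability total. A minimal sketch, assuming a DataFrame
with 0/1 item columns and a cognitive total score under hypothetical names:

    import pandas as pd

    def item_statistics(df, item_cols, cognitive_col="cognitive"):
        """Item difficulty (p-value) and point-biserial correlation with cognitive ability."""
        rows = []
        for col in item_cols:
            rows.append({
                "item": col,
                "p_value": df[col].mean(),            # proportion correct
                "sd": df[col].std(ddof=1),
                "r_cognitive": df[col].corr(df[cognitive_col]),
            })
        return pd.DataFrame(rows)

    # Example call for one form:
    # stats_a = item_statistics(form_a, [f"item_{i}" for i in range(86, 101)])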
Table 6
Mean Test Score by Gender and Ethnicity

                                                 Cognitive Ability         SJT
                           N       Percent       Mean       SD            Mean      SD
Gender
  Male                     1752    78.8%         61.78      10.69         12.67     1.74
  Female                   78      3.5%          60.76      10.92         12.45     1.78
  Unknown                  394     17.7%         63.93      10.31         12.69     1.72
Ethnicity
  Caucasian                1024    46.0%         63.21      10.18         12.77     1.71
  Hispanic                 278     12.5%         59.16      11.15         12.51     1.79
  Asian                    100     4.5%          63.97      11.15         12.68     1.65
  African American         74      3.3%          51.55      10.80         12.27     1.99
  Filipino                 55      2.5%          59.55      10.63         12.47     1.71
  Native American          21      0.9%          59.29      9.69          12.24     2.05
  Other                    86      3.9%          61.88      10.20         12.44     1.82
  Unknown                  586     26.3%         63.03      10.24         12.76     1.64
Total                      2224    100.0%        62.13      10.66         12.69     1.72
Table 7
Gender and Ethnicity by Fire Department

                              Department Percent
                        1           2           3           4           5           6
                        (n = 1400)  (n = 135)   (n = 16)    (n = 480)   (n = 97)    (n = 96)
Gender
  Male                  80.14%      65.93%      93.75%      72.08%      94.85%      91.67%
  Female                4.43%       1.48%       0.00%       2.29%       0.00%       3.13%
  Unknown               15.43%      32.59%      6.25%       25.63%      5.15%       5.21%
Ethnicity
  Caucasian             44.50%      49.63%      75.00%      38.33%      60.82%      82.29%
  Hispanic              13.93%      2.96%       0.00%       12.92%      16.49%      1.04%
  Asian                 5.14%       1.48%       0.00%       4.79%       2.06%       1.04%
  African American      4.14%       1.48%       18.75%      2.08%       1.03%       0.00%
  Filipino              3.00%       0.00%       0.00%       2.50%       1.03%       0.00%
  Native American       1.14%       0.00%       0.00%       1.04%       0.00%       0.00%
  Other                 4.29%       1.48%       0.00%       3.13%       6.19%       3.13%
  Unknown               23.86%      42.96%      6.25%       35.21%      12.37%      12.50%
Table 8
Participants and Mean Test Scores by Fire Department

                                                     Cognitive Ability        SJT
Form and Fire Department     N       Percent         Mean       SD            Mean      SD
Form A
  1                          1400    62.9%           61.48      10.76         12.64     1.76
  Total                      1400    62.9%           61.48      10.76         12.64     1.76
Form B
  2                          135     6.1%            63.46      10.26         12.99     1.63
  3                          16      0.7%            47.13      7.52          9.81      2.01
  4                          480     21.6%           64.95      10.20         12.86     1.51
  5                          97      4.4%            64.53      8.04          13.20     1.45
  6                          96      4.3%            55.71      8.54          12.02     1.86
  Total                      824     37.1%           63.23      10.40         12.76     1.66
Table 9
SJT Item Difficulty and Correlation with Cognitive Ability

                              Form A                          Form B
Item                          p-value    SD       r           p-value    SD       r
86                            .83        .38      .07**       .84        .36      .14**
87                            .86        .35      .12**       .90        .30      .09**
88                            .91        .28      .12**       .92        .28      .14**
89                            .76        .43      -.01**      .78        .41      -.06**
90                            .65        .48      .07**       .68        .47      .05**
91                            .76        .43      .05**       .77        .42      .08**
92                            .94        .24      .07**       .93        .25      .07**
93                            .87        .33      .12**       .87        .34      .06**
Avg.                          .82        .37      .08**       .84        .35      .07**
Total Score Correlation                           .18**                           .19**
94                            .93        .26      .08**       .91        .29      .07**
95                            .76        .43      .10**       .78        .42      .17**
96                            .96        .20      .12**       .96        .18      .09**
97                            .93        .25      .12**       .93        .25      .10**
98                            .74        .44      .12**       .72        .45      .12**
99                            .92        .26      .13**       .94        .23      .18**
100                           .81        .39      .19**       .81        .39      .25**
Avg.                          .86        .32      .12**       .86        .32      .14**
Total Score Correlation                           .26**                           .31**

Note. Items 86-93 used knowledge instructions in Form A and behavioral tendency
instructions in Form B. Items 94-100 used behavioral tendency instructions in Form A
and knowledge instructions in Form B. The total score correlation is the correlation of the
cumulative SJT instruction type score with the cognitive ability score.
*p ≤ .05; **p ≤ .01.
ANCOVA Analysis
Prior to the analyses, the variables were examined to determine if they met the
statistical assumptions underlying the use of ANCOVA. The dependent variables (SJT
items) in this study are binary (0 for an incorrect response and 1 for a correct response)
and do not meet the criteria for ANCOVA. This analysis requires that the dependent
variable be measured on a continuous level. However, the decision was made to analyze
the data with ANCOVA and follow it up with logistic regression where it is appropriate
to use binary data as the dependent variable. Despite the use of binary dependent
variables, additional assumptions were considered prior to the analyses. As previously
mentioned, univariate and multivariate outliers were removed from the data set. The
sample sizes in each level of the independent variable were large enough to meet the
assumption of normality of sampling distributions. The covariate (cognitive ability
section of the test) is a reliable measure as indicated by an internal consistency of α = .89
for Form A and α = .88 for Form B. The assumptions of linearity and homogeneity of
regression were not evaluated due to the dependent variables being binary.
A mixed model (within-subjects and between-subjects) ANCOVA was performed
on SJT item performance where instruction type varied. Two types of instructions were
evaluated: knowledge instructions (What should you do?) and behavioral instructions
(What would you do?). The dependent variable was item performance on each of the 15
SJT items. The covariate was cognitive ability level as indicated by each participant’s
total score on the first 85 items of the test. The ANCOVA summary table is presented in
Table 10.
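The mixed-model ANCOVA was presumably run in a dedicated statistical package; reproducing it
elsewhere mainly requires arranging the data so that each participant contributes one row per
SJT item, with the covariate and the between-subjects instruction factor repeated across those
rows. A minimal sketch of that reshaping, using hypothetical column names:

    import pandas as pd

    def to_long_format(df, item_cols, id_col="participant_id",
                       covariate="cognitive", between="instruction_form"):
        """Reshape wide item data (one column per SJT item) into the long format
        required for a mixed within-between ANCOVA: one row per participant per item."""
        long_df = df.melt(
            id_vars=[id_col, covariate, between],
            value_vars=item_cols,
            var_name="item",            # within-subjects factor (15 levels)
            value_name="correct",       # binary item score (0/1)
        )
        return long_df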
In the between-subjects portion of the ANCOVA model there was a statistically
significant effect of the covariate cognitive ability on SJT item performance, F(1, 2220) =
195.03, p < .001, partial η² = .081. An inspection of the bivariate (zero-order) correlation
showed that higher scores on the cognitive ability section were associated with higher
scores on the SJT section (r(2222) = .29, p < .001). Once the covariate was controlled for,
neither instruction type nor the interaction (instruction type by cognitive ability) was
significant.
In the within-subjects portion of the ANCOVA model, there was a significant
main effect of SJT item on SJT item performance, F(14, 31080) = 13.35, p < .001,
partial η² = .006. In other words, the SJT items varied in difficulty within participants.
This was not surprising, as it was expected that the SJT items would vary in their level of
difficulty. Lastly, there was a significant interaction effect of item and cognitive ability
on SJT item performance, F(14, 31080) = 7.20, p < .001, partial η² = .003.
Table 10
ANCOVA Summary Table

Source                             Type III SS    df       MS       F         Sig.     Partial η²
Within-Subjects Effects
  Item                             21.90          14       1.56     13.35     .000     .006
  Item X Cognitive                 11.80          14       0.84     7.20      .000     .003
  Item X Instruction               1.42           14       0.10     0.87      .597     .000
  Item X Instruction X Cognitive   1.41           14       0.10     0.86      .605     .000
  Error                            3640.84        31080    0.12
Between-Subjects Effects
  Cognitive                        35.21          1        35.21    195.03    .000     .081
  Instruction                      0.09           1        0.09     0.52      .472     .000
  Instruction X Cognitive          0.12           1        0.12     0.67      .413     .000
  Error                            400.83         2220     0.18
Logistic Regression Analysis
Logistic regression makes fewer assumptions than ANCOVA and does not
require that dependent variables be continuous. The data from this study met all the
assumptions of logistic regression. The dependent variable was dichotomously coded,
with “0” used for an incorrect response to the SJT item and “1” for a correct response.
Logistic regression analyses were run 15 times, once for each SJT item. Running many
logistic regression analyses inflates the probability of making a Type I error; therefore, a
sequentially modified Bonferroni correction was applied with an alpha of .05. In this
study, the step-up Bonferroni correction described by Hochberg (1988) was utilized. In
this method, the largest obtained p-value is compared to an alpha of .05. If the largest of
the 15 p-values obtained is equal to or less than .05, then all p-values are considered
significant. If it is larger than .05, then the next largest p-value is compared to an alpha of
.05 divided by two (.025). If this p-value is equal to or less than .025, it is considered
significant, along with all the remaining smaller p-values. This process continues (i.e., the
alpha of .05 is divided by three for the third largest, four for the fourth largest, and so
forth) until a p-value meets the criterion for significance or until all p-values have failed to
reach significance.
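A minimal sketch of the step-up procedure just described, applied to a vector of p-values (the
example values are the Model 1 p-values for items 86-100 from Table 11):

    import numpy as np

    def hochberg_step_up(p_values, alpha=.05):
        """Hochberg (1988) step-up Bonferroni correction.
        Returns a boolean array: True where the corresponding test is significant."""
        p = np.asarray(p_values, dtype=float)
        order = np.argsort(p)[::-1]               # indices from largest to smallest p
        significant = np.zeros(len(p), dtype=bool)
        for rank, idx in enumerate(order):        # largest p vs alpha/1, next vs alpha/2, ...
            if p[idx] <= alpha / (rank + 1):
                significant[order[rank:]] = True  # this p and all smaller ones are significant
                break
        return significant

    # Model 1 (instruction-only) p-values for items 86-100 from Table 11:
    p_vals = [.368, .014, .707, .136, .152, .606, .439, .597,
              .060, .411, .334, .872, .296, .084, .329]
    print(hochberg_step_up(p_vals))   # none reach significance under the correction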
A total of four different logistic regression models were evaluated to determine
their relationship with correctly answering an SJT item. Table 11 contains the results of
the logistic regression.
The first model contained instruction type (Form A or Form B) as a predictor. In
this model, instruction type did not provide a statistically significant improvement in
explaining item performance over the constant-only model for any of the 15 SJT items.
The difficulty of the SJT items did not differ by instruction type (knowledge or
behavioral).
The second model contained the cognitive ability section score as a predictor. In
this model, 14 of the 15 SJT items provided a statistically significant improvement over
the constant-only model. Increases in cognitive ability were associated with an increased
probability of answering the SJT item correctly. For the 14 significant SJT items the
mean Nagelkerke pseudo R² was .026 and the mean Exp(B) was 1.032. The 14 significant
SJT items had a Nagelkerke pseudo R² range of .012 to .070 and an Exp(B) range of
1.025 to 1.055. On average, for every one-point increase in cognitive ability, the odds of
answering an SJT item correctly were multiplied by 1.032, with cognitive ability
accounting for 2.6% of the variance in SJT item performance.
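Because Exp(B) is an odds ratio per one-point change in the covariate, its effect compounds
multiplicatively over larger score differences. A quick illustration using the mean Exp(B) of
1.032 reported above:

    # Odds multiplier implied by an Exp(B) of 1.032 over larger cognitive ability differences.
    exp_b = 1.032
    for points in (1, 5, 10):
        print(f"{points:>2}-point increase -> odds of a correct response multiplied by {exp_b ** points:.2f}")
    # 1 -> 1.03, 5 -> 1.17, 10 -> 1.37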
The third model contained the cognitive ability section score and instruction type
as predictors. In model two, cognitive ability had an effect on SJT performance. Model
three evaluates whether there are differences in SJT performance by instruction type
when controlling for cognitive ability. In this model, after controlling for cognitive
ability, instruction type did not provide a statistically significant improvement over the
constant-only model for any of the 15 SJT items. When the cognitive ability of
participants was controlled, there were no differences in SJT difficulty by instruction
type. For each of the 15 SJT items in model three, there was minimal or no increase in the
Nagelkerke pseudo R² when compared to model two. In fact, the largest change in
Nagelkerke pseudo R² was 0.005 (0.5% additional variance) over model two for item 94.
The fourth model contained the cognitive ability section score, instruction type,
and the interaction of instruction type and cognitive ability as predictors. This model was
designed to test whether the association with cognitive ability differs across instruction
type. In this model, the interaction did not provide a statistically significant improvement
for any of the 15 SJT items over the main-effects-only model composed of cognitive
ability and instruction type.
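The four nested models can be reproduced with any logistic regression routine. The following is
a minimal sketch using statsmodels for a single SJT item; the column names (cognitive, form_b,
and the 0/1 item columns) are placeholders, and the Nagelkerke pseudo R² is computed from the
fitted and intercept-only log-likelihoods rather than taken from any built-in output.

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm

    def nagelkerke_r2(fit, null_llf, n):
        """Nagelkerke pseudo R-squared from the fitted and null (intercept-only) log-likelihoods."""
        cox_snell = 1 - np.exp((2.0 / n) * (null_llf - fit.llf))
        return cox_snell / (1 - np.exp((2.0 / n) * null_llf))

    def fit_item_models(df, item_col):
        """Fit the four nested logistic models described above for one SJT item."""
        df = df.assign(cog_x_form=df["cognitive"] * df["form_b"])
        y = df[item_col]                      # 0/1 item score
        n = len(df)
        null_llf = sm.Logit(y, np.ones(n)).fit(disp=0).llf

        specs = {
            "Model 1 (instruction)":   ["form_b"],
            "Model 2 (cognitive)":     ["cognitive"],
            "Model 3 (cog + instr)":   ["cognitive", "form_b"],
            "Model 4 (+ interaction)": ["cognitive", "form_b", "cog_x_form"],
        }
        results = {}
        for name, cols in specs.items():
            fit = sm.Logit(y, sm.add_constant(df[cols])).fit(disp=0)
            results[name] = {
                "Exp(B)": np.exp(fit.params.drop("const")).round(3).to_dict(),
                "Nagelkerke R2": round(nagelkerke_r2(fit, null_llf, n), 3),
            }
        return results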
Table 11
Logistic Regression Summary Table

      Model 1: Instruction       Model 2: Cognitive        Model 3: Cognitive +       Model 4: Cognitive + Instruction
                                                           Instruction                + Cognitive X Instruction
Item  Exp(B)   Sig.   Nag. R²    Exp(B)   Sig.    Nag. R²  Exp(B)   Sig.   Nag. R²    Exp(B)   Sig.   Nag. R²
86    1.114    .368   .001       1.025    .000*   .017     1.066    .596   .017       1.020    .076   .019
87    1.409    .014   .005       1.032    .000*   .023     1.338    .038   .027       0.997    .844   .027
88    1.061    .707   .000       1.041    .000*   .033     0.985    .925   .033       1.008    .588   .034
89    1.169    .136   .002       0.998    .740    .000     1.174    .128   .002       1.001    .903   .002
90    1.143    .152   .001       1.013    .002*   .006     1.119    .234   .007       0.996    .614   .007
91    1.055    .606   .000       1.014    .003*   .006     1.030    .779   .006       1.006    .510   .006
92    0.870    .439   .001       1.026    .001*   .012     0.828    .296   .014       0.997    .841   .014
93    0.933    .597   .000       1.026    .000*   .016     0.889    .369   .017       0.984    .183   .019
94    0.741    .060   .004       1.025    .001*   .012     0.705    .029   .017       0.992    .587   .018
95    1.090    .411   .000       1.028    .000*   .023     1.039    .720   .023       1.017    .093   .025
96    1.249    .334   .001       1.051    .000*   .039     1.140    .572   .040       0.993    .782   .040
97    0.972    .872   .000       1.041    .000*   .030     0.900    .557   .030       0.994    .722   .031
98    0.902    .296   .001       1.025    .000*   .021     0.860    .129   .022       1.001    .916   .022
99    1.371    .084   .005       1.055    .000*   .053     1.249    .229   .055       1.029    .100   .058
100   1.118    .329   .001       1.052    .000*   .070     1.022    .855   .070       1.018    .114   .072

Note. Nag. R² = Nagelkerke R square. Model 3 reports the Exp(B) and significance for instruction type. Model 4 reports the
Exp(B) and significance for the interaction.
*Results are significant after applying the step-up Bonferroni correction.
Chapter 5
DISCUSSION
The majority of SJT studies analyzing the effects of response instructions have
been conducted with participants in low-stakes settings. When research has been
conducted in high-stakes settings, SJT content has generally varied by instruction type.
This study answered the specific call for research on SJT instruction type in a high-stakes
employment setting (Whetzel & McDaniel, 2009; Lievens et al., 2009; Ployhart &
MacKenzie, 2010). Research on instruction type in a high-stakes setting is an important
step towards understanding the effect response instructions may have on the construct
being measured. A greater understanding of the construct can lead to more precision in
employee selection, which leads to the ultimate goal of improved organizational
performance.
Discussion of Hypothesized Relationships
The first hypothesis was supported as there was no difference in item difficulty
between SJTs with behavioral tendency and knowledge instructions given on the same
items to two different groups of people. This held true both when controlling and not
controlling for the cognitive ability level of the participants (see Models 1 and 3 in Table
11). The largest p-value difference between instruction types was .04 (see item 87 in Table
9). This result is contrary to the findings of studies conducted in low-stakes settings, which
have generally found that SJTs with knowledge instructions have higher mean
scores than SJTs with behavioral tendency instructions (Ployhart & Ehrhart, 2003).
However, this finding is consistent with the study conducted in a high-stakes situation by
Lievens et al. (2009). Based on the findings of the current study and those of Lievens et
al. (2009), it appears that a high-stakes situation eliminates possible mean score
differences by SJT instruction type.
The second hypothesis was also supported as SJTs with behavioral tendency and
knowledge instructions given on the same items to two different groups of people were
both equally correlated with cognitive ability. Cognitive ability was the only predictor of
SJT item performance in the logistic regression analyses and the lack of interaction
between cognitive ability and instruction type suggests that the relationship does not
differ in strength across the instruction types (see Models 2 and 4 in Table 11). The
total score correlations between the SJT items with knowledge instructions and
cognitive ability were .18 and .32. The total score correlations between the SJT
items with behavioral tendency instructions and cognitive ability were .19 and .26 (see
Table 9). This result is contrary to the findings of studies conducted in low-stakes settings,
which have generally found that SJTs with knowledge instructions have higher
correlations with cognitive ability than SJTs with behavioral tendency instructions
(McDaniel et al., 2007). It is also contrary to the high-stakes findings of Lievens et al.
(2009), who found a correlation of .19 between knowledge instructions and
cognitive ability and .11 between behavioral tendency instructions and cognitive ability.
However, the correlations of Lievens et al. (2009) are smaller than those found in
low-stakes situations. The difference in correlations between the current study and Lievens et
al. (2009) may be explained by the varying levels of difficulty of the SJTs. The SJT used
in this study had an average between-form percentage score of 84.57%, and the average
percentage score in Lievens et al. (2009) was 49.07%. The easier SJT items in this study
may have allowed candidates to rely on their cognitive knowledge rather than providing
behavioral responses, as might happen when candidates are unsure of the correct
cognitive response.
Support for the second hypothesis was also found with regard to the strength of the
correlations of the SJT items with cognitive ability. In the meta-analysis conducted by
McDaniel et al. (2007), they found a mean correlation of .25 between SJTs with
knowledge instructions and cognitive ability and .15 between SJTs with behavioral
tendency instructions and cognitive ability (see Table 4). The mean correlations with
cognitive ability in this study were .25 for the knowledge instructions and .23 for the
behavioral tendency instructions. Both instruction types were moderately correlated with
cognitive ability.
Limitations
A number of important limitations moderate the generalizability of this study’s
findings. The SJTs were not randomly assigned in this study. Six different fire
departments participated in the study and each fire department was provided with only
one form of the test. The decision to provide each fire department with the same form of
the test was made consciously prior to the study due to the litigious nature of selection
testing in public safety. The test vendor felt that the potential for litigation would increase
if there were any observed differences by form within a fire department. Providing the
same form to all candidates within a department helped to mitigate this risk because all
participants saw the same instruction type for each SJT item. A consequence of this
administration constraint is that the data for Form A were collected from one large fire
department and the data for Form B were collected from five different fire departments.
Within the six departments, there was also no direct control over the exam
administrations or examinee groups. The fire departments were provided with
standardized proctor’s instructions regarding test administration practices, security, and
timing procedures. However, it is unknown if these instructions were consistently
adhered to across departments and test administrations. It is also unknown whether all of
the examinees were typical of those commonly seeking work within each department. All
of the departments had the same minimum qualifications for application; however, some
departments may attract candidates who are overqualified while others may attract
candidates who are just minimally qualified. As was noted previously, the ethnic
diversity of the candidates varied by department. Specifically, department three had a
higher percentage of African American participants than the other departments. The lack
of random test form assignment within departments limits the generalizability of the
findings.
The length of the SJT was not ideal in this study. The test vendor had an existing
test with 15 SJT items. It would have been better to have an SJT with at least 30 items to
increase the reliability and to obtain more data for use in the analyses. In addition, the low
difficulty of the SJT reduced its ability to discriminate between high and low performers
(the average percentage score between the two forms was 84.57%).
The current study lacked a criterion measure (i.e., job performance) and a
personality measure. I was only able to correlate scores with cognitive ability in the
current study. The fire departments involved in the study did not elect to use a measure of
personality as part of the selection process and did not want to be involved in collecting
and/or providing job performance data. Knowing how the SJT scores relate to job
performance and personality would have created a more holistic picture of the effects of
response instructions. Previous studies have found that SJT correlations with personality
vary by instruction type (McDaniel et al., 2007). It would have been valuable to collect
these data in the current study since it was a high-stakes employment situation.
It is possible that the participants in the study did not notice when the SJT
instruction type changed. When the instruction type changed, there was a page break
before the participants saw the remaining SJT items. The new instructions were at the top
of the page. Additionally, the end of each SJT item contained the behavioral tendency or
knowledge instruction. Even with the page break and instructions at the end of each item,
the overall presentation of the items is very similar and participants may not have noticed
the change. This approach to the instructions was chosen deliberately. The goal was to
clearly separate the two instruction types without explicitly telling the candidates that
they were expected to respond differently to the new instruction type. It may be that
participants did not notice the within-subjects instruction change; however, if participants
had responded differently by instruction type, this would have been evident in the
between-subjects comparison on the first eight items prior to the instruction change. No
differences were found between subjects either, which indicates that the instruction types
were treated similarly. Future research could examine how different methods for
changing instruction types within an SJT may affect how participants respond.
Lastly, both of the hypotheses in this study predicted the null. One of the
primary reasons for not hypothesizing the null is that there are too many alternative
explanations for a finding of no effect. The intent of this study was to assess
boundary conditions as they relate to SJT instruction types, but no causality can be
assumed from the results. All of the limitations described here are potential explanations
for why the null was not rejected in this study.
Conclusions and Theoretical Implications
One of the strengths of this study was that it was conducted using an employee
selection exam taken by participants who were seeking employment as a firefighter. The
motivation to perform maximally was very high. This same motivation for maximal
performance is difficult to achieve when university students are used as participants.
There was no need to simulate an employment selection setting because the setting was
real. Another strength was the size of the sample. Although diversity in the sample was
lacking, the sample size for each form of the test was sufficient to provide stable data for
use in the analyses. Specific geographic data were not provided, but the departments
included in the study came from different states within the United States.
Overall, the findings of this study indicate that participants in a high-stakes
situation respond consistently (i.e., they treat different instruction types the same) to SJTs
with knowledge or behavioral tendency instructions. Given this finding, it may be
beneficial for high-stakes personnel selection SJT test developers to avoid behavioral
tendency instructions (Lievens et al., 2009). Participants who receive behavioral tendency
instructions may find themselves in a dilemma as to how they should respond. Should
they provide an honest behavioral response or a knowledge response that they know is
best? Utilizing knowledge instructions avoids the potential for situations where the
honest job applicant receives a lower score that is not due to a lack of knowledge, but
rather to adhering to the specific response instructions unlike the majority of the other job
applicants. In employee selection testing, a top down approach is often used where higher
scorers are selected to move on in the selection process or are hired based on obtaining a
higher score. The written test score may also be added to other selection tests (e.g.,
interview, writing sample) to create a cumulative score. A difference of one or a few
points on the SJT may be the deciding factor in a competitive job market.
Developers of SJTs should consider whether it is more important that participants know
what they should do or choose the answer that best represents what they would do. It can
be argued that participants who know what they should do (i.e., knowledge response)
have the capacity to be trained on the proper response for a particular situation, whereas a
participant whose behavioral tendency response is incorrect may not know what should
be done and therefore would be harder to train and more likely to respond incorrectly on
the job. It can also be argued that participants whose behavioral tendency response is in
alignment with the knowledge response are optimal because their two responses are
congruent and therefore maximize the likelihood of a correct response in a
real situation. However, as previously noted, in a high-stakes situation it can be difficult
to determine if a participant’s response is truly a behavioral tendency response or a faking
good response.
Future research should evaluate the effects of non-dichotomous scoring
techniques by response instructions. For example, the response instruction of “what
would you most likely do and what would you least likely do” is probably the most
commonly used instruction type (Ployhart & Ehrhart, 2003). Research should investigate
if the correlations with cognitive ability and personality differ when evaluating the “least
likely” option or if they differ in their level of difficulty. Likert rating scales are
sometimes used with SJTs. When SJT content is held constant, research should evaluate
if the construct measured changes by scoring technique. If a Likert rating scale is used
but the instruction type varies, does the construct being measured change? Lastly,
research should evaluate what cognitive processes are involved when participants
complete an SJT. Do participants consciously discriminate between their personal
knowledge and behavioral tendency responses? When participants are unsure of the
knowledge response to an SJT, do they resort to their behavioral tendency response?
SJTs have proven to be a valuable tool for high-stakes employment testing.
However, there is still much we do not know about this measurement method. Additional
research will help to define the boundary conditions of this method, which constructs it is
most aptly suited to measure, and how to modify response instructions and scoring
techniques to target specific constructs.
REFERENCES
Arthur, W., & Villado, A. (2008). The importance of distinguishing between constructs and methods when comparing predictors in personnel selection research and practice. Journal of Applied Psychology, 93, 435-442.
Barrett, G. V., Polomsky, M. D., & McDaniel, M. A. (1999). Selection tests for firefighters: A comprehensive review and meta-analysis. Journal of Business and Psychology, 13, 507-513.
Bergman, M. E., Drasgow, F., Donovan, M. A., Henning, J. B., & Juraska, S. (2006). Scoring situational judgment tests: Once you get the data, your troubles begin. International Journal of Selection and Assessment, 14, 223-235.
Bruce, M. M., & Learner, D. B. (1958). A supervisory practices test. Personnel Psychology, 11, 207-216.
Bureau of Labor Statistics (2012, August). Labor force characteristics by race and ethnicity, 2011 (Report 1036). Retrieved from http://www.bls.gov/cps/cpsrace2011.pdf
Cardall, A. J. (1942). Preliminary manual for the Test of Practical Judgment. Chicago: Science Research Associates.
Christian, M. S., Edwards, B. D., & Bradley, J. C. (2010). Situational judgment tests: Constructs assessed and a meta-analysis of their criterion-related validities. Personnel Psychology, 63, 83-117.
Clevenger, J., Pereira, G. M., Wiechmann, D., Schmitt, N., & Schmidt-Harvey, V. (2001). Incremental validity of situational judgment tests. Journal of Applied Psychology, 86, 410-417.
Cortina, J. M., & Folger, R. G. (1998). When is it acceptable to accept a null hypothesis: No way, Jose? Organizational Research Methods, 1, 334-350.
Corts, D. B. (1980). Development of a procedure for examining trades and labor applicants for promotion to first-line supervisor. Washington, DC: U.S. Office of Personnel Management, Personnel Research and Development Center, Research Branch.
Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16, 297-334.
Cronbach, L. J. (1984). Essentials of psychological testing (4th ed.). New York: Harper & Row.
Greenberg, S. H. (1963). Supervisory judgment test manual (Technical Series No. 35). Washington, DC: Personnel Measurement Research and Development Center, U.S. Civil Service Commission, Bureau of Programs and Standards, Standards Division.
Greenwald, A. G. (1993). Consequences of prejudice against the null hypothesis. In G. Kern & C. Lewis (Eds.), A handbook for data analysis in the behavioral sciences (pp. 419-460). Hillsdale, NJ: Lawrence Erlbaum.
Hanson, M. A. (1994). Development and construct validation of a situational judgment test of supervisory effectiveness for first-line supervisors in the U.S. Army. Unpublished doctoral dissertation, University of Minnesota.
Hochberg, Y. (1988). A sharper Bonferroni procedure for multiple tests of significance. Biometrika, 75, 800-802.
Jagmin, N. (1985). Individual differences in perceptual/cognitive constructions of job-relevant situations as a predictor of assessment center success. College Park, MD: University of Maryland.
Landy, F. J. (2007). The validation of personnel decisions in the twenty-first century: Back to the future. In S. M. McPhail (Ed.), Alternate validation strategies: Developing and leveraging existing validity evidence (pp. 409-426). San Francisco: Jossey-Bass.
Lievens, F., & Patterson, F. (2011). The validity and incremental validity of knowledge tests, low-fidelity simulations, and high-fidelity simulations for predicting job performance in advanced-level high-stakes selection. Journal of Applied Psychology, 96, 927-940.
Lievens, F., Sackett, P. R., & Buyse, T. (2009). The effects of response instructions on situational judgment test performance and validity in a high-stakes context. Journal of Applied Psychology, 94, 1095-1101.
McDaniel, M. A., Hartman, N. S., Whetzel, D. L., & Grubb III, W. L. (2007). Situational judgment tests, response instructions, and validity: A meta-analysis. Personnel Psychology, 60, 63-91.
McDaniel, M. A., Morgeson, F. P., Finnegan, E. B., Campion, M. A., & Braverman, E. P. (2001). Predicting job performance using situational judgment tests: A clarification of the literature. Journal of Applied Psychology, 86, 730-740.
McDaniel, M. A., & Nguyen, N. T. (2001). Situational judgment tests: A review of practice and constructs assessed. International Journal of Selection and Assessment, 9, 103-113.
McDaniel, M. A., & Whetzel, D. L. (2005). Situational judgment test research: Informing the debate on practical intelligence theory. Intelligence, 33, 515-525.
McFarland, L. A., & Ryan, A. M. (2000). Variance in faking across non-cognitive measures. Journal of Applied Psychology, 85, 812-821.
Moss, F. A. (1926). Do you know how to get along with people? Why some people get ahead in the world while others do not. Scientific American, 135, 26-27.
Motowidlo, S. J., Carter, G. W., Dunnette, M. D., Tippins, N., Werner, S., Burnett, J. R., & Vaughan, M. J. (1992). Studies of the structured behavioral interview. Journal of Applied Psychology, 77, 571-587.
Motowidlo, S. J., Dunnette, M. D., & Carter, G. W. (1990). An alternative selection procedure: The low-fidelity simulation. Journal of Applied Psychology, 75, 640-647.
Motowidlo, S. J., Hanson, M. A., & Crafts, J. L. (1997). Low-fidelity simulations. In D. L. Whetzel & G. R. Wheaton (Eds.), Applied measurement methods in industrial psychology (pp. 241-260). Palo Alto: Davies-Black.
Nguyen, N. T., Biderman, M. D., & McDaniel, M. A. (2005). Effects of response instructions on faking a situational judgment test. International Journal of Selection and Assessment, 13, 250-260.
Northrop, L. C. (1989). The psychometric history of selected ability constructs. Washington, DC: U.S. Office of Personnel Management.
Ployhart, R. E., & Ehrhart, M. G. (2003). Be careful what you ask for: Effects of response instructions on the construct validity and reliability of situational judgment tests. International Journal of Selection and Assessment, 11, 1-16.
Ployhart, R. E., & Holtz, B. C. (2008). The diversity-validity dilemma: Strategies for reducing racioethnic and sex subgroup differences and adverse impact in selection. Personnel Psychology, 61, 153-172.
Ployhart, R. E., & MacKenzie, W. I. (2010). Situational judgment tests: A critical review and agenda for the future. In S. Zedeck (Ed.), APA handbook of industrial and organizational psychology (pp. 237-252). Washington, DC: American Psychological Association.
Pulakos, E. D., & Schmitt, N. (1996). An evaluation of two strategies for reducing adverse impact and their effects on criterion-related validity. Human Performance, 9, 241-258.
Richardson, Bellows, Henry, & Co. (1981). Technical reports: Supervisory Profile Record. Richardson, Bellows, Henry, & Co.
Sacco, J. M., Scheu, C. R., Ryan, A. M., Schmitt, N., Schmidt, D. B., & Rogg, K. L. (2000). Reading level and verbal test scores as predictors of subgroup differences and validities of situational judgment tests. Paper presented at the 15th annual conference of the Society for Industrial and Organizational Psychology, New Orleans, LA.
Salgado, J. F., Viswesvaran, C., & Ones, D. S. (2001). Predictors used for personnel selection: An overview of constructs, methods, and techniques. In N. R. Anderson, D. S. Ones, H. K. Sinangil, & C. Viswesvaran (Eds.), Handbook of industrial, work, & organizational psychology, Vol. 1 (pp. 165-199). London: Sage.
Schmidt, F. L., & Hunter, J. E. (1996). Measurement error in psychological research: Lessons from 26 research scenarios. Psychological Methods, 1, 199-223.
Schmidt, F. L., & Hunter, J. E. (1998). The validity and utility of selection methods in personnel psychology: Practical and theoretical implications of 85 years of research findings. Psychological Bulletin, 124, 262-274.
Schmitt, N., & Chan, D. (2006). Situational judgment tests: Method or construct? In J. A. Weekley & R. E. Ployhart (Eds.), Situational judgment tests (pp. 135-156). Mahwah, NJ: Lawrence Erlbaum Associates.
SIOP (2003). Principles for the validation and use of personnel selection procedures. Bowling Green, OH: Society for Industrial and Organizational Psychology.
Tabachnick, B. G., & Fidell, L. S. (2007). Experimental designs using ANOVA. Belmont, CA: Duxbury.
Weekley, J. A., & Jones, C. (1999). Further studies of situational tests. Personnel Psychology, 52, 679-700.
Weekley, J. A., & Ployhart, R. E. (2006). An introduction to situational judgment testing. In J. A. Weekley & R. E. Ployhart (Eds.), Situational judgment tests. Mahwah, NJ: Lawrence Erlbaum Associates.
Wernimont, P., & Campbell, J. P. (1968). Signs, samples, and criteria. Journal of Applied Psychology, 52, 372-376.
Whetzel, D. L., & McDaniel, M. A. (2009). Situational judgment tests: An overview of current research. Human Resource Management Review, 19, 188-202.
Whetzel, D. L., McDaniel, M. A., & Nguyen, N. T. (2008). Subgroup differences in situational judgment test performance: A meta-analysis. Human Performance, 21, 291-309.