THE EFFECT OF CRITERION
RELIABILITY ON MEANS AND
INTERACTIONS IN META-ANALYSIS
LAWRENCE R. JAMES
PSYCHOLOGY AND MANAGEMENT
GEORGIA INSTITUTE OF TECHNOLOGY
META-ANALYSIS
• Correlations involving the same or very
similar predictors and criteria are
retrieved from prior studies.
• This set of validities constitutes a
distribution that can be summarized
statistically using standard descriptors
such as the mean and the variance.
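In code, this summary step amounts to nothing more than the following minimal sketch (the validities are invented for illustration):

```python
# Retrieved validities from prior studies (invented for illustration).
validities = [0.21, 0.35, 0.18, 0.30, 0.27]

mean_r = sum(validities) / len(validities)
var_r = sum((r - mean_r) ** 2 for r in validities) / len(validities)
print(mean_r, var_r)
```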
VALIDITY GENERALIZATION
• Archival information pertaining to statistical
artifacts that might affect each validity is obtained
(e.g., sampling error, reliability of criterion and
predictor, range restriction).
• Distributional summary statistics are corrected for
artifacts to provide estimates of the mean true
(population) validity and the variance among the
mean true (population) validities.
WHY VALIDITY
GENERALIZATION?
• Validity generalization is founded on the
possibility that true validities from different
populations may be equal, and yet the sample
validities may vary because of the operation of
statistical artifacts (Hunter, Schmidt, & Jackson,
1982). (This is a question of interaction.)
• There is also the strong likelihood that true
validities are underestimated by sample validities
due to unreliability and range restriction.
RESULTS OF VALIDITY
GENERALIZATION
Meta-analyses based on validity
generalization (VG) procedures continue to
be impressive
ILLUSTRATIVE RESULTS
• General intellectual ability is said to have an average corrected
validity of .53 in predicting job performance (Hunter &
Hunter, 1984).
• Structured interviews can attain corrected validities in the .47
to .60 range against job performance (Huffcutt & Arthur, 1994).
• Perceptual speed has an average corrected validity of .47
against clerical performance (Schmidt, 1992).
• Integrity tests have average corrected validities of .40 against
job performance (applicant samples) and .47 against
counterproductive behaviors (all samples) (Ones, Viswesvaran,
& Schmidt, 1993).
INFERENCES
• Many VG studies suggest that a single intellectual,
cognitive, or personality trait can account for
upwards of 16% to 36% of the variance in some
aspect of job performance.
• The days when 16% of the variance (or a validity
of .40) was the maximum expected for a trait
(Ghiselli & Brown, 1955) are gone, as are the days
when validities in the .20s and .30s were
commonplace in the reports of "well-done"
validity studies.
QUESTION
What precipitated this boost in validities and
accountable variance?
BETTER SCIENCE?
• Improved measurement instruments?
• More sophisticated sampling
techniques?
• Superior research designs?
Well, not really. We still rely on
• the same measurement procedures
• the same small samples
• the same bivariate correlation designs
Then what gave rise to this
bountiful enhancement in
validities?
ENHANCEMENT IN
VALIDITIES
• The boosts in validities come from
correcting the observed validities, which
have stayed pretty much the same, for
attenuation due to unreliability in the
criterion (and sometimes the predictor) and
direct range restriction in the predictor.
WHAT CHANGED?
• Change was not due to improvements in
science.
• What changed was our historical
cautiousness in applying correction
equations to validity coefficients.
A CULTURE OF
CORRECTIONS
The genesis of this “culture of corrections”
can be traced to desires to estimate
relationships devoid of statistical artifacts.
A FORERUNNER: LATENT
VARIABLES
For example, latent variable procedures
such as LISREL frame the opportunity to
employ estimates of perfectly reliable
variables in studies of covariation as a
major advance in science.
ANOTHER FORM OF LATENT
VARIABLE
No less dedicated to the pursuit of truth and
scientific principle is VG (Schmidt, 1992),
the objective being to estimate correlations
among true scores (i.e., latent variables)
unencumbered by statistical artifacts (e.g.,
unreliability).
RECEPTIVENESS TO
CORRECTIONS
• It is the idea that corrected coefficients give
greater insight into scientific truths that
engendered the current culture of
corrections.
• Investigators are prone to compute
corrected coefficients, and editors,
reviewers, and readers tend to be receptive
to them.
OUR GOALS
• It is not our intent to stand between scientists and the
seeking of truth via corrected coefficients.
• We do feel that it is reasonable, however, to inquire
about the statistical values that are being used to make
the corrections.
• We are specifically interested in corrections for
attenuation due to unreliability in criteria assessed via
ratings of job performance.
• We study the effects these corrections have
on estimates of the mean true validity and of
the variance among true validities from
separate populations.
INTERRATER RELIABILITY
FOR RATINGS
Viswesvaran, Ones, & Schmidt (1996) concluded
that
• job performance is typically assessed by ratings,
• the reliability of ratings should be estimated via an
interrater reliability analysis, and
• the mean interrater reliability for job performance
ratings over studies is approximately .52.
WHERE AND WHEN TO USE
.52
• If a given study in a VG analysis fails to report
criterion reliability, and the criterion is based on
ratings, then the best estimate of the missing
interrater reliability is .52.
• If one is using one of the myriad VG equations
to estimate means and variances of true
correlations, and interrater reliability for ratings is
missing from many studies (as is often the case),
then .52 is the value to insert into the estimating
equations for mean observed criterion reliability.
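To make the imputation step concrete, here is a minimal Python sketch; the study records and field names are hypothetical:

```python
# Impute the mean interrater reliability (.52) wherever a ratings-based
# study did not report criterion reliability. The record structure and
# field names here are hypothetical.
MEAN_INTERRATER_RELIABILITY = 0.52

studies = [
    {"r_obs": 0.30, "criterion": "ratings", "r_yy": 0.60},
    {"r_obs": 0.25, "criterion": "ratings", "r_yy": None},  # unreported
]

for study in studies:
    if study["criterion"] == "ratings" and study["r_yy"] is None:
        study["r_yy"] = MEAN_INTERRATER_RELIABILITY
```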
CONSEQUENCE OF USING
.52
It is instructive to illustrate the product of using .52
as an estimate of interrater reliability. Using the
standard correction for attenuation:
• an observed validity of .25 becomes .35 (i.e., .25/√.52),
• .30 becomes .42,
• .35 becomes .49,
• .40 becomes .55.
MAGNITUDE OF INCREASE
So, simply by correcting for attenuation
based on an interrater reliability of .52, we
obtain an 89% increase (i.e., (.55² − .40²)/.40²)
in what is regarded as the maximum expected
variance accounted for by a single predictor
(i.e., from .16 to .30).
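The arithmetic above is easy to verify; this minimal Python sketch applies the standard correction for attenuation, r_c = r/√r_yy, and reproduces the corrected values and the 89% figure (which uses the rounded corrected value of .55):

```python
import math

def correct_for_attenuation(r_obs: float, r_yy: float) -> float:
    """Disattenuate an observed validity for criterion unreliability."""
    return r_obs / math.sqrt(r_yy)

for r in (0.25, 0.30, 0.35, 0.40):
    print(f"{r:.2f} -> {correct_for_attenuation(r, 0.52):.2f}")
# 0.25 -> 0.35, 0.30 -> 0.42, 0.35 -> 0.49, 0.40 -> 0.55

# Increase in maximum expected variance accounted for, using the
# rounded corrected value (.55) as on the slide:
increase = (0.55**2 - 0.40**2) / 0.40**2
print(f"{increase:.0%}")  # 89%
```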
AN ADVANCE IN SCIENCE?
To what extent is this 89% increase in
maximum expected variance
accounted for reflective of science?
COMPARISONS TO OTHER
VARIABLES
• Where else in personnel research do we
accept, and use, measurement procedures
that produce variables with reliabilities of
.52?
• Is it not true that almost every conceivable
variable except performance ratings would
be cast out of personnel research if its
reliability were .52?
NUNNALLY & BERNSTEIN,
1994
“A reliability of .80 may not be nearly high
enough in making decisions about
individuals….If important decisions are
being made with respect to specific test
scores, a reliability of .90 is the bare
minimum, and a reliability of .95 should be
considered the desirable standard.” (p. 265)
DESIRABLE STANDARD FOR
PERFORMANCE RATINGS
If we desire a .95 reliability for the test
scores that are used to hire people for jobs,
it seems reasonable to expect the same
standard of reliability for the ratings that are
used to determine whether people keep their
jobs.
PRACTICAL
CONSIDERATIONS
• Many reliabilities for scores used to make
decisions about individuals are not in the
.90s.
• Many, however, are in the .80s.
• With the exception of performance ratings,
almost none are in the .50s.
QUESTIONS
• Why are performance ratings allowed to
survive in spite of what most would agree is
questionable measurement?
• How do we allow observed validities to be
corrected for unreliability in what appear to
be flawed variables, and then act as if these
corrected validities actually convey some
sort of credible scientific information?
QUESTIONS (continued)
• Does anyone really believe that it makes
sense to talk about a "perfectly reliable
criterion" when the observed criterion
begins with an interrater reliability of .52?
• How exactly does a variable in which
almost one-half of the observed variance is
some form of bias or error become perfectly
reliable?
WHERE IS THE NEW
TECHNOLOGY?
• It would seem that researchers would have instituted
the necessary improvements, given that problems with
performance ratings were documented as early as 50
years ago in Guilford’s (1954) classic text in
psychometrics.
• Have not hundreds of articles been written on the
biases and errors that affect performance ratings,
especially after the classic articles on problems with
performance ratings written by Feldman and Landy &
Farr?
• We know what the problems are. Why have we not
fixed them?
IS THE PROBLEM
INTRACTABLE?
• Maybe it is not possible to build ratings that
can achieve high interrater reliabilities.
• If we admit that this is true, then should we
also not admit that we cannot justify
inserting .52 in corrections for attenuation
because we know that “theoretically
perfectly reliable” is not going to be even
remotely approximated?
Is .52 an accurate estimate of
interrater reliability?
• This issue is currently being debated elsewhere
(LeBreton, Kaiser, Burgess, Atchley, & James,
2001; Murphy & DeShon, 2000a, 2000b;
Schmidt, Viswesvaran, & Ones, 2000).
• If this estimate is later shown to be inaccurate
or ill-founded, then a different debate ensues.
• However, for now, let us assume that the .52
estimate is legitimate and accurate.
THE ISSUE
We may then deal with the issue of concern
here, which is basing substantive scientific
judgments on corrections which employ a
below threshold reliability for a criterion to
produce an enhanced, sometimes much
enhanced, estimate of corrected validity.
Is 40 years of research wrong, and is job satisfaction
really correlated with job performance?
• Judge, Thoresen, Bono, and Patton (2001) used .52 as an
estimate of criterion reliability to repudiate 40 years of
research findings and previous meta-analyses that
concluded that job satisfaction has a low correlation with
overall job performance.
• A mean observed correlation of .18 was corrected to a
mean (estimated) true correlation of .30. Correction for
unreliability in the criterion accounted for approximately
60% of this increase.
• The use of .52 in the correction for attenuation was
justified by arguing that this approach was “consistent with
all contemporary (post-1990) meta-analytic studies
involving job performance.” (p. 384)
A COMPARISON
• Had criterion reliability been .85 instead of .52, the
corrected correlation would have been approximately .23
(job satisfaction reliability was set at .74). Had the
reliabilities for both variables been .85, the corrected
correlation would have been approximately .21.
• Neither of these correlations suggests a substantial linear,
additive relationship between job satisfaction and job
performance.
• Are we going to change this conclusion based on
corrections engendered by not being able to measure job
satisfaction particularly well and performance hardly at
all?
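These comparison figures can be approximated with the standard double correction for unreliability in both measures; the sketch below is mine, not Judge et al.'s full meta-analytic machinery, which also handles sampling error and other artifacts:

```python
import math

def double_correct(r_obs: float, r_xx: float, r_yy: float) -> float:
    """Correct an observed correlation for unreliability in both measures."""
    return r_obs / math.sqrt(r_xx * r_yy)

# Mean observed r = .18; job satisfaction reliability = .74.
print(double_correct(0.18, 0.74, 0.52))  # ~.29, near the reported .30
print(double_correct(0.18, 0.74, 0.85))  # ~.23
print(double_correct(0.18, 0.85, 0.85))  # ~.21
```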
STATISTICAL DYSFUNCTIONS OF
CORRECTING FOR LOW RELIABILITIES
• At this juncture, I hope that you realize that we
have a problem. We cannot base our science on
large corrections engendered by poor
measurement.
• If you have yet to be convinced, then allow me to
proceed to demonstrate some unanticipated
dysfunctions of inserting low reliabilities into
correction equations.
• Statistics are based on a working paper by James,
LeBreton, and Ladd.
A SINGLE VG ANALYSIS
A meta-analysis is conducted on the correlations
between scores on a structured interview and
ratings of overall job performance.
• The mean observed correlation is .35.
• Mean criterion reliability is set at .52.
• Mean predictor reliability is set at .80.
• The ratio between the restricted and unrestricted
standard deviations on the predictor is set at .71
(a common value).
Result of a Single VG Analysis
The estimate of mean true validity is .67
(Raju, Burke, Normand, & Langlois, 1991,
Equation 2).
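I do not reproduce Raju et al.'s Equation 2 itself here; the sketch below applies the textbook sequence (disattenuation for predictor and criterion unreliability, then Thorndike's Case II correction for direct range restriction), which yields the same .67 under the stated values:

```python
import math

def estimate_true_validity(r_obs: float, r_xx: float,
                           r_yy: float, u: float) -> float:
    """Disattenuate for unreliability, then correct for direct range
    restriction (Thorndike Case II), where u is the ratio of restricted
    to unrestricted predictor standard deviations."""
    r = r_obs / math.sqrt(r_xx * r_yy)   # correct for unreliability
    U = 1.0 / u                          # unrestricted-to-restricted ratio
    return (U * r) / math.sqrt(1.0 + (U**2 - 1.0) * r**2)

print(estimate_true_validity(0.35, 0.80, 0.52, 0.71))  # ~0.67
```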
ADDITIONAL PREDICTORS
Three additional predictors are chosen to
contribute unique variance to prediction:
• intelligence test
• integrity test
• biographical questionnaire
PSYCHOMETRICS OF
SEPARATE PREDICTORS
• Each additional predictor has an observed validity
of .35 against job performance, correlates .20 with
each of the other predictors, and has a reliability of
.80.
• The ratio between the restricted and unrestricted
standard deviations is again set at .71.
RESULTS OF THREE
ADDITIONAL VG ANALYSES
The estimate of mean true validity in each
additional VG analysis is .67
MULTIPLE CORRELATION
ANALYSIS
• Our four separate VG analyses each furnish
an impressive increase in validity from .35
to .67.
• Now let’s compute a multiple correlation by
inserting the results of each separate VG
analysis into a multiple correlation analysis.
RESULTS
The squared multiple correlation (R²) is 1.03.
We account for more than 100% of the
variance in the job performance ratings.
COMPARATIVE RESULTS-1
A multiple correlation analysis based on the
observed or uncorrected data produces an
R² of .31.
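Both of these R² values can be recovered from the usual formula R² = v'Σ⁻¹v. The sketch below assumes the predictor intercorrelations in the corrected analysis were also disattenuated (.20/.80 = .25), an assumption on my part that is consistent with the reported 1.03:

```python
import numpy as np

def squared_multiple_correlation(validity: float,
                                 intercorrelation: float,
                                 k: int = 4) -> float:
    """R^2 = v' Sigma^{-1} v for k equally valid, equicorrelated predictors."""
    v = np.full(k, validity)
    sigma = np.full((k, k), intercorrelation)
    np.fill_diagonal(sigma, 1.0)
    return float(v @ np.linalg.solve(sigma, v))

print(squared_multiple_correlation(0.67, 0.25))  # ~1.03 (corrected data)
print(squared_multiple_correlation(0.35, 0.20))  # ~0.31 (observed data)
```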
COMPARATIVE RESULTS-2
If all corrections remained the same except
that the performance ratings were given a
reliability of .80 rather than .52, then
• the mean estimated true validity for each of the
four variables would have been .54.
• the R² would have been .67.
COMPARATIVE RESULTS-3
If all corrections remained the same except
that the performance ratings were given a
reliability of .70, which is often considered
the lower bound for reliability (Nunnally &
Bernstein, 1994), then
• the mean estimated true validity for each of the
four variables would have been .57.
• the R² would have been .77.
IMPROPER TERRITORY
• With reasonable values for criterion
reliability set by accepted standards in
psychometrics, corrected coefficients
provide R²s in proper ranges.
• When accepted standards are suspended, the
R² may wander off into improper territory.
SLIPPERY SLOPE
• We typically do not see r²s greater than 1.0 in bivariate
studies.
• Investigators have thus failed to realize that once one
begins to suspend judgment about acceptable
thresholds for criterion reliability and to allow a value
as low as .52 into correction equations, one is on a
slippery slope.
• The multiple correlation analysis picked up on the
slippery slope by producing an improper R2. It
follows that the bivariate corrections that engendered
this improper value have a tenuous foundation.
VARIANCES
• Heretofore we have focused on the mean of
a distribution of validities and the estimate
of the mean true validity.
• It is also possible to focus on the variance of
a distribution of validities and the estimate
of the variance among true validities.
ESTIMATED VARIANCE
AMONG TRUE VALIDITIES
• Each sample validity is corrected for artifacts.
This provides an estimate of the true validity for
the population from which that sample was drawn.
• The variance among the estimated true validities is
calculated.
• This variance is adjusted for sampling error (Raju
et al., 1991).
• If artifact data are not available for each sample,
estimating equations are available.
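A simplified sketch of this procedure follows. The per-study sampling-error term uses the common approximation Var(e) = (1 − r̄²)²/(N − 1), scaled by the attenuation factor, rather than the exact expression developed by Raju et al. (1991), and the study data are invented:

```python
import math

# Hypothetical study records: (observed validity, criterion reliability, N).
studies = [(0.15, 0.52, 100), (0.45, 0.52, 150), (0.50, 0.52, 200)]

def true_validity_variance(studies, r_bar):
    """Variance of the corrected validities minus the average
    sampling-error variance. Negative estimates are truncated at zero."""
    corrected = [r / math.sqrt(ryy) for r, ryy, n in studies]
    mean_c = sum(corrected) / len(corrected)
    var_c = sum((c - mean_c) ** 2 for c in corrected) / len(corrected)
    # Approximate sampling-error variance of each corrected validity:
    # Var(e_i) = [(1 - r_bar^2)^2 / (n_i - 1)] / r_yy_i
    var_e = sum(((1 - r_bar**2) ** 2 / (n - 1)) / ryy
                for _, ryy, n in studies) / len(studies)
    return max(var_c - var_e, 0.0)

r_bar = sum(r for r, _, _ in studies) / len(studies)  # mean observed validity
print(true_validity_variance(studies, r_bar))
```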
OBSERVED AND
CORRECTED VALIDITIES
• DATA WITH RELIABILITY OF .52
[Table: observed validities r, criterion reliabilities r_yy, attenuation factors r_yy^1/2, and corrected validities r × r_yy^-1/2, with the mean and variance of each; the numeric entries did not survive extraction.]
OBSERVED AND
CORRECTED VALIDITIES
• DATA WITH RELIABILITY OF .75
[Table: observed validities r, criterion reliabilities r_yy, attenuation factors r_yy^1/2, and corrected validities r × r_yy^-1/2, with the mean and variance of each; the numeric entries did not survive extraction.]
ESTIMATES OF TRUE
VARIANCE
[Chart: estimated true variance (0.000 to 0.025) plotted against total sample size (N = 98, 250, 500, 750, 1000) for r_yy = .52 versus r_yy = .75.]
KEY IMPLICATIONS
• Lower criterion reliabilities result in higher
estimates of true variance.
• This means that the mean true validity is more
likely to be subject to moderation.
• In other words, use of below-threshold criterion
reliabilities to enhance validity makes
interpretation of that enhanced validity
ambiguous.
CONCLUSIONS
• It is time to call a moratorium on the use of
low mean interrater reliabilities to enhance
estimates of mean true validities in VG
analyses.
• It is time to have a serious debate on how to
estimate the reliability of ratings.