PSY 5130 – Lecture 2
Validity
Validity is one of the most overused words in statistics and research methods.
We’ve already encountered statistical conclusion validity, internal validity, construct validity I, and external
validity.
Now we’ll introduce criterion-related validity, concurrent validity, predictive validity, construct validity II,
convergent validity, discriminant validity, and content validity. Whew!
A general conceptualization of the “validities” we’ll consider here . . .
All but content validity are concerned with the extent to which scores on a test correlate
with positions of people on some dimension of interest to us.
Specific types of Validity
I. Criterion-Related Validity

[Diagram: Test → Job]

The “dimension of interest” is performance on some task or job, e.g., job performance or GPA.
So Criterion-related Validity refers to the extent to which pre-employment or pre-admission test scores correlate
with performance on some measurable criterion.
This is the type of validity that is most important for I-O selection specialists.
But it is also applicable to schools deciding among applicants for admission, for example.
When someone uses the term “validity coefficient,” he or she is most likely referring to criterion-related validity.
It’s the actual Pearson r between test scores and the criterion measure.
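For concreteness, here is a minimal sketch in Python. The scores are hypothetical, not data from the lecture; the point is only that the validity coefficient is the ordinary Pearson r between the two columns.

```python
# A validity coefficient is just the Pearson r between test scores
# and criterion scores. The scores below are hypothetical.
from scipy.stats import pearsonr

test_scores = [23, 31, 28, 35, 19, 27, 33, 25]               # pre-employment test
job_performance = [3.1, 4.2, 3.8, 4.5, 2.9, 3.5, 4.4, 3.0]   # supervisor ratings

r, p = pearsonr(test_scores, job_performance)
print(f"criterion-related validity: r = {r:.2f}, p = {p:.3f}")
```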
Two specific types of criterion-related validity are often used in I/O psychology when choosing pre-employment tests to predict performance on the job.
A. Concurrent Validity
The correlation of test scores with job performance of current employees.
The test scores and the criterion scores are obtained at the same time, e.g., from current employees of an organization. This is the type most often computed.
B. Predictive Validity.
The correlation of test scores with later job performance scores of persons just hired.
The test scores are obtained prior to hiring. Criterion scores are obtained after those who took the pretest have been hired. This type is computed only occasionally.
Validation Study.
A study carried out by an organization in order to assess the validity of a test.
Typical Criterion-related Validities. How good a job do we do?
From Schmidt, F. L. (2012). Validity and Utility of Selection Methods. Keynote presentation at River Cities
Industrial-Organizational Psychology Conference, Chattanooga, TN, October.
Unless otherwise noted, all operational validity estimates are of the specific type of test as the only predictor and corrected for
measurement error (i.e., unreliability) in the criterion measure and indirect range restriction (IRR) on the predictor measure
to estimate operational validity for applicant populations.
This means that the correlations below are somewhat larger than those you would obtain from computing Pearson r without the
corrections.
Selection method                        2012    1998
GMA tests                                .65     .51
Integrity tests                          .46     .41
Employment interviews (structured)       .58     .51
Employment interviews (unstructured)     .60     .38
Conscientiousness                        .22     .31
Person-job fit measures                  .18      –
Reference checks                         .26     .26
Biographical data measures               .35     .35
Job experience                           .13     .18
SJT (knowledge)                          .26      –
Assessment centers                       .37     .37
Peer ratings                             .49     .49
T & E point method                       .11     .11
Years of education                       .10     .10
Interests                                .10     .10
Emotional Intelligence (mixed)           .24      –
Emotional Intelligence (ability)         .24      –
GPA                                      .34      –
Person-organization fit measures         .13      –
Work sample tests                        .33     .54
Emotional Stability                      .12      –
SJT (behavioral tendency)                .26      –
Job tryout procedure                      –      .44
Behavioral consistency method             –      .45
Job knowledge                             –      .48

Really!!??
Factors affecting validity coefficients in selection of employees or students. Why aren’t correlations = 1?
1. Problems with the selection test.
A. Test is deficient - doesn't measure characteristics that predict some parts of the job
Test may predict one thing. Job may require something else.
Example: The job is Manager, with requirements of cognitive ability, conscientiousness, and interpersonal skills. The test is a cognitive ability test (the Wonderlic in the diagram).

[Diagram omitted.] As it should, the test measures cognitive ability, which predicts part of what the job involves. But some of the variation in job scores will be due to individual differences in conscientiousness and interpersonal skills, and these differences won’t be reflected in the test scores. So the r between test and job will be smaller than it should be, due to deficiency of the test.
B. Test is contaminated - affected by factors other than the factors important for the job

Example: The job is small parts assembly, with manual dexterity as the requirement. The test is a computerized manual dexterity test, which taps both manual dexterity and computer skills.

[Diagram omitted.] Some of the variation in test scores will be due to individual differences in computer skills. But these differences won’t be reflected in job scores. So the r between test and job will be smaller than it should be, due to contamination of the test.
2. Reliability of the Test and Reliability of the Criterion

This was covered in the section on the reliability ceiling: the observed validity can be no larger than sqrt(rXX’)·sqrt(rYY’) times the true correlation.

[Diagram omitted.] Observed validity is affected by the true correlation (true rTJ, between test ability and job ability) AND by errors of measurement in both the test score and the job score, so the observed r is small.
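A quick simulation sketch of the attenuation; the true correlation of .60 and the reliabilities of .80 are arbitrary choices for illustration.

```python
# Measurement error in both the test and the criterion attenuates the
# observed r: expected observed r = true r * sqrt(rXX') * sqrt(rYY').
import numpy as np

rng = np.random.default_rng(2)
n, true_r = 100_000, 0.60

ability = rng.standard_normal(n)                    # true standing on the trait
true_job = true_r * ability + np.sqrt(1 - true_r**2) * rng.standard_normal(n)

test_score = ability + 0.5 * rng.standard_normal(n)   # reliability = 1/1.25 = .80
job_score = true_job + 0.5 * rng.standard_normal(n)   # reliability = .80

observed = np.corrcoef(test_score, job_score)[0, 1]
print(f"true r = {true_r:.2f}, observed r = {observed:.2f}")  # about .60 * .80 = .48
```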
3. Range differences between the validation sample and the population in which the test will be used.
If the range (difference between largest and smallest) within the sample used for validation does not equal the
range of scores in the population for which the test will be used, the validity coefficient obtained in the
validation study will be different from that which would be obtained in use of the test.
A. Validation sample range is restricted relative to the population for which the test will be used- the typical
case.
e.g., the test is validated using current employees. It will then be used for the applicant pool, consisting of persons from all walks of life, some of whom would not have been capable enough to be hired.
[Scatterplot omitted: job scores plotted against test scores.] The population, the applicant pool, consists of persons who would not have been capable enough to have been hired. They would score low on the test and low on the job, making the overall correlation high. The validation sample, current employees, is a select group, and the correlation between test and job is small within that group.
The result is the correlation coefficient computed from the validation group will be smaller than that which
would have been computed had the whole applicant pool been included in the validation study.
Why do we care about differences in range? When choosing tests, comparing different advertised validities requires that the testing conditions be comparable. A bad predictor validated on a heterogeneous sample may have a larger r than a good predictor validated on a homogeneous sample.
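A small simulation sketch makes the point; the population correlation of .50 and the top-30% cutoff are arbitrary choices, not values from the lecture.

```python
# Restriction of range: the r computed within a selected (hired) subgroup
# is smaller than the r in the full applicant pool.
import numpy as np

rng = np.random.default_rng(0)
n, true_r = 10_000, 0.50

test = rng.standard_normal(n)
job = true_r * test + np.sqrt(1 - true_r**2) * rng.standard_normal(n)

r_pool = np.corrcoef(test, job)[0, 1]                  # whole applicant pool
hired = test > np.quantile(test, 0.70)                 # top 30% on the test
r_hired = np.corrcoef(test[hired], job[hired])[0, 1]   # restricted validation sample

print(f"r in applicant pool:      {r_pool:.2f}")   # about .50
print(f"r in restricted subgroup: {r_hired:.2f}")  # noticeably smaller
```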
B. Validation sample range is larger than that of the population for which the test will be used - less often
encountered.
A test of mechanical ability is validated on a sample from the general college population, including liberal
arts majors.
But the test is used for an applicant pool consisting of only persons who believe they have the capability to
perform the job which requires considerable mechanical skill. So the range of scores in the applicant pool
will be restricted relative to the range in the validation sample.
[Scatterplot omitted: job scores plotted against test scores.] The population of applicants is a select group. The validation sample consists of a wide range of abilities, most of which are below the level appropriate for the job. The correlation between test and job is small within the select applicant group.
Bottom Line: I feel that criterion-related validity is the most important characteristic of a pre-employment test.
The Issue of Mindless Empiricism as a criticism of the focus on Criterion-related Validity. Note that the
issue of criterion-related validity of a measure has nothing to do with that measure’s intrinsic relationship to the
criterion. A test may be a good predictor of job performance even though the content of the test bears no
relationship to the content of the job. This means that it does not have to make sense that a given test is a
good predictor of the criterion. The bottom line in the evaluation of a predictor is the correlation coefficient.
If it is sufficiently large, then that’s good. If it’s not large enough, that’s not good. Whether there is any
theoretical or obvious reason for a high correlation is not the issue here.
Thus, a focus on criterion-related validity alone is a very empirical approach to the study of relationships of tests to criteria, with the primary emphasis on the correlation and little thought given to the theory of the relationship.
Such a focus gets psychologists in trouble with those to whom they’re trying to explain the results. Consider
the Miller Analogies Test (MAT), for example. Example item: “Lead is to a pencil as bullet is to a) lead, b) gun, c) killing, d) national health care policy.” How do you explain to the parent of a student denied admission that the student’s failure to correctly identify enough analogies on the Miller Analogies Test prevents the student from being admitted to a graduate nursing program? There is a significant, positive correlation between MAT scores and performance in nursing programs, but the reason, if known, is very difficult to explain.
Do companies conduct validation studies?
Alas, many do not – because they lack expertise, because they don’t see the value, because of small sample
sizes, or because of difficulty in getting the criterion scores, to name four reasons.
Correcting validity coefficients for reliability differences and range effects
Why correct?
Skipped in 2016
1. If I’m evaluating a new predictor, I want to compare it with others on a “level” playing field. That includes
removing the effects of unreliability and of range differences between the different samples.
So the corrections here permit comparisons with correlations computed in different circumstances.
Corrections are in the spirit of removing confounds whenever we can. Examples are standard scores and
standardized tests. Both remove specific characteristics of the test from the report of performance.
2. In meta-analyses, the correlations that are aggregated must be “equivalent.”
When comparing different selection tests validated on different samples, we need equivalence.
Standard corrections
1. Correcting for unreliability of the measures (This is based on the reliability ceiling formula.)
rtX,tY(1) = rXY / [sqrt(rXX’) · sqrt(rYY’)]
The corrected r is labeled (1) because there is a 2nd correction, shown below, that is typically made.
Suppose rXY = .6, but assume rXX’ = .9 and rYY’ = .8.
Then rtX,tY would be .6/sqrt(.9)sqrt(.8) = .6 / (.95)(.89) = .6 / .85 = .71. This is 18% larger than the observed r.
Caution: The reliabilities and the observed r have to be good estimates of the population values, otherwise
correction can result in absurd estimates of the true correlation.
In selection situations, we correct for unreliability in the criterion measure only.
The reasoning is as follows: We correct because we want to assess the “true” amount of a construct. In
selection situations, the “true” job performance is available – it’s what we’ll observe over the years an
employee is with the firm. So we correct for unreliability of a single measure of job performance.
But we don’t correct for unreliability in the test because in selection situations, the observed test is the
only thing we have. We might be interested in the true scores on the dimension represented by the test, but in
selection, we can’t use the true scores, we can only use the observed scores.
So, in selection situations, the correction for unreliability is
rX,tY(1) = rXY / sqrt(rYY’)
Note that the corrected correlation is labeled rX,tY, not rtX,tY to indicate that it is corrected only for unreliability
of the criterion, Y.
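A sketch of both versions of correction 1 in Python (the function names are mine), reproducing the worked example above in which .6 becomes .71:

```python
# Correction for unreliability (disattenuation), based on the reliability
# ceiling formula. In selection situations, only the criterion is corrected.
import math

def corrected_both(r_xy, r_xx, r_yy):
    """rtX,tY(1): corrected for unreliability in both measures."""
    return r_xy / (math.sqrt(r_xx) * math.sqrt(r_yy))

def corrected_criterion_only(r_xy, r_yy):
    """rX,tY(1): corrected for unreliability in the criterion only."""
    return r_xy / math.sqrt(r_yy)

print(round(corrected_both(.6, .9, .8), 2))        # 0.71, as in the example
print(round(corrected_criterion_only(.6, .8), 2))  # 0.67
```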
2. Correcting for Range Differences on the criterion variable.
This is applicable in some selection situations.
Skipped in 2016
After correcting for unreliability of X and Y, a 2nd correction, for range differences, is made.
rtX,tY(2) = [rtX,tY(1) · (SUse / SVal)] / sqrt(1 - rtX,tY(1)^2 + rtX,tY(1)^2 · (SUse^2 / SVal^2))

If SVal is less than SUse, the typical restriction-of-range situation, this r(2) will be larger than r(1).
In this formula,
rtX,tY(1) is the correlation corrected for unreliability from the previous page.
SUse is the standard deviation of Y in the population in which the test will be used.
SVal is the standard deviation of Y in the validation sample.
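A sketch of correction 2 in Python; the function name and the SUse/SVal values are mine, chosen only to show the direction of the correction.

```python
# Range correction applied after the reliability correction.
import math

def corrected_for_range(r1, s_use, s_val):
    """rtX,tY(2): correct r(1) for the SD difference between the
    population of use (s_use) and the validation sample (s_val)."""
    u = s_use / s_val
    return (r1 * u) / math.sqrt(1 - r1**2 + r1**2 * u**2)

r1 = 0.71   # reliability-corrected r from the earlier example
# Typical restriction of range: validation-sample SD smaller than SD in use.
print(round(corrected_for_range(r1, s_use=1.0, s_val=0.7), 2))  # 0.82, larger than r(1)
```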
3. Other corrections.
There is a growing literature on corrections in meta-analyses and in selection situations.
References . . .
Stauffer, J. M., & Mendoza, J. L. (2001). The proper sequence for correcting correlation coefficients for range
restriction and unreliability. Psychometrika, 66(1), 63-68.
Sackett, P. R., & Yang, H. (2000). Correction for range restriction: An expanded typology. Journal of Applied
Psychology, 85(1), 112-118.
Hunter, J. E., & Schmidt, F. L. (2004). Methods of Meta-Analysis: Correcting Error and Bias in Research
Findings. 2nd Ed. Thousand Oaks, CA: Sage.
The bottom line of this is that if you’re involved in selection, you should be familiar with the language used
when discussing validity of selection tests. That language will include the concepts of reliability, correction,
and range restriction discussed here.
II. Construct Validity II (recall that Construct Validity I was presented in the fall.)
Definition: the extent to which observable test scores represent positions of people on underlying, not
directly observable, dimensions.
Why are we interested in this?
1. In Selection.
For many jobs, particularly high-level jobs, it is felt that performance is determined by certain attributes which are not directly observable, such as leadership ability, motivating potential, initiative, extraversion, conscientiousness, etc. Look at managerial job descriptions for examples of such constructs.
So it is felt that if we can measure these traits, then we can identify persons who will be good managers,
leaders, sales persons, etc.
Note the distinction between this approach and the criterion-related validity approach. With the former, we
don’t really care WHY someone performs well; all we wish to do is to identify a test that correlates with
performance, regardless of the causes of that performance. It’s a very empirical approach.
Here, we definitely have a theory of why persons perform well, e.g., people who have high leadership ability
will be better managers. So we seek tests that measure those underlying not directly observable attributes
that we believe contribute to good performance. It’s a very theoretical approach.
2. In theory construction.
Our theories are made up of relationships between theoretical constructs - attributes of people and how those
attributes are related to behavior or other attributes.
Construct validity concerns how those not-directly-observable attributes are measured.
e.g. In I-O, there are theories of the relationship of job satisfaction and work motivation to performance, to
turnover, and to other organizational factors.
These theories are examples of what I-O psychology is about. Other theories define other areas of psychology.
All such theories are, ultimately, collections of statements concerning relationships among constructs.
Whether a measure is the best measure of a construct important for our theory is what construct validity is
about. Is our measure of job satisfaction appropriate? If it is, we can proceed to test our theory relating job
satisfaction to other organizational constructs. If it’s not, then there is no point in using that measure to test our
theory.
Assessing Construct Validity
How do we know a test measures an unobservable dimension?
This is kind of like pulling oneself up by one’s own bootstraps.
Solution: Ultimately construct validity is based on a subjective consensus of opinion of persons
knowledgeable about the construct under consideration.
Construct validity of the first measure of a construct is purely subjective.
We begin with a measure of the construct that knowledgeable people agree measures the construct.
Construct validity of the 2nd and subsequent measures of a construct use the first measure.
We correlate subsequent measures of the construct with the existing measure (or measures).
Each such evaluation adds to our knowledge of what the construct is.
Specific criteria for Construct Validity of the subsequent measures
Generally speaking, a new measure of a construct has high construct validity if
a. the new measure has high convergent validity, i.e., it correlates strongly with other
purported measures of the construct, and
b. the new measure has high discriminant validity, i.e., it correlates negligibly (near zero) with measures of other constructs which are unrelated to (correlate zero with) the construct under consideration. Discriminant validity refers to lack of correlation.
Convergent validity: The correlation of a test with other purported measures of the same construct.
Two ways of assessing Convergent validity of a test
1. Correlation approach: Correlate scores on the test with other measures of same construct.
2. Group Differences approach: Find groups known to differ on the construct.
Determine if they are significantly different on the test.
Example: assessing the convergent validity of a new measure of extraversion.
Suppose sales potential is determined to a considerable extent by extraversion.
Measure successful sales people and clerks using the new measure.
If two groups that should be different are different – successful sales people vs. clerks - that is an
indication of high convergent validity.
If two groups that should be different are not different, that’s an indication of low convergent
validity.
Discriminant validity: The absence of correlation of a test with measures of other, theoretically unrelated constructs.
Two ways of assessing Discriminant validity of a test.
1. Correlation approach: Correlate the test with measures of other, unrelated constructs.
Near-zero correlations mean good discriminant validity. For example, a conscientiousness measure’s correlation with extraversion should be near zero, since they’re different constructs.
2. Group Differences approach: Determine that groups known to differ on other constructs are not
significantly different on the test.
Suppose you’ve developed a new measure of conscientiousness. To assess its discriminant validity
Find a group high on extraversion (sales people) and a group low on extraversion (clerks).
Give them all the conscientiousness test and compare the mean difference between the two groups
on the test.
If the conscientiousness test has good discriminant validity, there will not be a significant
difference between the 2 groups, since sales people and clerks are probably about equal
in conscientiousness.
If it does not have discriminant validity, the two groups will differ significantly.
So, establishing construct validity involves correlating a test with other measures of the same construct and of
different constructs.
Note that high power is required when demonstrating discriminant validity. If there is discriminant validity, there will be no relationship in true scores (approach 1 above) and no difference in true means (approach 2 above). You must be able to argue that the absence of a relationship or difference was not due to low power.
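A sketch of the correlation approach with simulated data; the sample size, the loadings, and the measure names are assumptions for illustration only.

```python
# Convergent and discriminant validity checks for a new extraversion measure.
import numpy as np

rng = np.random.default_rng(1)
n = 300

extraversion = rng.standard_normal(n)                    # true construct
new_test = extraversion + 0.5 * rng.standard_normal(n)   # new measure
old_test = extraversion + 0.5 * rng.standard_normal(n)   # established measure
consc_test = rng.standard_normal(n)                      # unrelated construct

r_convergent = np.corrcoef(new_test, old_test)[0, 1]
r_discriminant = np.corrcoef(new_test, consc_test)[0, 1]

print(f"convergent validity r:   {r_convergent:.2f}")    # should be large
print(f"discriminant validity r: {r_discriminant:.2f}")  # should be near zero
```

Note that with n = 300 power is high enough that a near-zero discriminant r is informative, rather than merely the product of an underpowered study.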
“The Good Test” (Sundays at 10 PM on The Learning Channel)
High reliability: Large positive reliability coefficient - .8 or better to guarantee acceptance
Good Validity
Good Convergent Validity: Large correlations in the expected direction with other measures of the same construct.
Good Discriminant Validity: Small correlations with measures of other independent constructs.
Examples of assessing Construct validity from our research
Convergent Validity of Bifactor model measures of the Big 5 vs. Scale score measures.
Typically, psychological constructs are assessed using summated scores.
(See the next lecture for more than you ever wanted to know about summated scores.)
We have been investigating the possibility that responses to personality items are affected by an “affective bias”: a tendency to express the affective state of the respondent in his or her response to the content of an item.
We believe that measures of the Big Five dimensions with this “affective bias” removed will be better estimates
of the dimensions – “purifying” them, if you will.
At the same time, there is considerable evidence of the usefulness of the Big Five summated scale scores.
For that reason, our “purified” measures should still exhibit convergent validity with the summated scale scores.
Evidence
NEO-FFI-3 questionnaire. N=736.
[Figure omitted: distributions of the scale scores and of the “purified” scores.]

Convergent validity correlations of scale scores with “purified” scores:

Extraversion        .867
Agreeableness       .915
Conscientiousness   .881
Stability           .981
Openness            .909
So the “Purified” measures correlate strongly with the scale score measures, as they should.
But the correlations are not perfect, meaning that perhaps the “contamination” that is present in the scale scores
is not in the “purified” scores.
Do the HEXACO-PI-R measures of the Big Five exhibit convergent validity with the NEO-FFI-3 measures?
This is a simple, straightforward test of convergent validity.
The NEO-FFI questionnaire has been used for many years.
The HEXACO questionnaire has been more recently promoted.
The HEXACO is said to measure the Big Five plus one more measure – Honesty/Humility.
What is the convergent validity of corresponding scale scores from the two questionnaires?
F2014 Neo-FFI plus HEXACO DualResponders 141227. N= 409
[Figure omitted: NEO-FFI-3 and HEXACO-PI-R scale score distributions.]

Convergent validity correlations of NEO-FFI-3 scale scores with HEXACO-PI-R scale scores:

Extraversion        .816
Agreeableness       .518
Conscientiousness   .766
Stability           .445
Openness            .727
Wow!! If you’re measuring Agreeableness or Stability, you have to decide which questionnaire to use. The
scales from the two questionnaires exhibit only fair convergent validity.
It appears that those two scales from the NEO-FFI-3 measure something different from the HEXACO scales of
the same name.
Remember, though – this is just one sample of N = 409. We should hold off on a definitive decision until the meta-analysis is available.
Convergent and Discriminant Validity of Response Inconsistency
We’ve been studying a measure of response inconsistency, defined by the standard deviation of responses to
items within the same scale.
An overall measure for a questionnaire is the average of the standard deviations of responses to items across all the scales of that questionnaire.
We compute an Inconsistency measure from the NEO-FFI-3 and an Inconsistency measure from the HEXACO-PI-R administered to the same respondents.
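A sketch of the computation; the data layout (one row per respondent, with a dict mapping each scale to the column indices of its items) is my assumption, not the lecture’s file format.

```python
# Response inconsistency: for each respondent, the SD of responses to the
# items within a scale, averaged across the scales of the questionnaire.
import numpy as np

def inconsistency(responses, scales):
    """Mean within-scale SD of item responses, one value per respondent."""
    per_scale_sd = [responses[:, cols].std(axis=1) for cols in scales.values()]
    return np.mean(per_scale_sd, axis=0)

# Hypothetical responses: 4 respondents x 6 items, two 3-item scales.
resp = np.array([[4, 4, 4, 2, 2, 2],    # perfectly consistent responder
                 [5, 1, 3, 4, 2, 5],    # inconsistent responder
                 [3, 3, 4, 3, 3, 3],
                 [1, 5, 1, 5, 1, 5]])
scales = {"E": [0, 1, 2], "A": [3, 4, 5]}
print(inconsistency(resp, scales))      # 0.0 for the consistent responder
```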
Here’s a scatterplot illustrating the convergent validity of inconsistency measured in the two questionnaires. [Scatterplot omitted.] Pearson r is .665, p < .001. So the two measures exhibit pretty good convergent validity.
Is Inconsistency separate from the Big 5 dimensions? Here are discriminant validity correlations of
inconsistency with scale scores from the two questionnaires.
Correlations of NEO-FFI-3 inconsistency (sdnmean) with the ten scale scores (N = 653 for all):

Scale                                   Pearson r   Sig. (2-tailed)
neoe  Extraversion                         .096          .014
neoa  Agreeableness                       -.004          .928
neoc  Conscientiousness                    .168          .000
neos  Stability                           -.072          .067
neoo  Openness                             .059          .135
hx    Extraversion                         .000          .991
ha    Agreeableness                       -.193          .000
hc    Conscientiousness                    .074          .060
hs    Stability (rev. of emotionality)    -.047          .235
ho    Openness                            -.046          .239

Correlations of HEXACO-PI-R inconsistency (sdhmean) with the ten scale scores (N = 653 for all):

Scale                                   Pearson r   Sig. (2-tailed)
neoe  Extraversion                         .116          .003
neoa  Agreeableness                        .138          .000
neoc  Conscientiousness                    .142          .000
neos  Stability                           -.044          .262
neoo  Openness                             .158          .000
hx    Extraversion                         .092          .019
ha    Agreeableness                       -.069          .080
hc    Conscientiousness                    .147          .000
hs    Stability (rev. of emotionality)    -.093          .018
ho    Openness                             .028          .478
Although some of the correlations are significantly different from zero, none is larger than .2 in absolute value, so generally, I feel that it’s reasonable to conclude that inconsistency has high discriminant validity. It’s not measuring what the Big Five or HEXACO scales are measuring.
III. Content Validity
The extent to which test content represents the content of the dimension of interest.
Example of a test with high content validity
A test of general arithmetic ability that contains items representing all the common arithmetic
operations – addition, subtraction, multiplication, and division.
Example of a test with low content validity
A test of general arithmetic ability that contains only measurement of reaction time and spatial ability.
Note that the issue of whether or not a test contains the content of the dimension of interest has no direct
bearing on whether or not scores on the test are correlated with position on that dimension.
Of course, the assumption is that tests with high content validity will show high correlations with the
dimensions represented by the tests.
Why bother with Content Validity in view of the previous emphasis on correlation – with criteria or constructs?
1. Time and Money. In many selection situations, it is easier to demonstrate content validity than criterion-related validity. A validation study designed to assess criterion-related validity requires at least 200 participants. In a small or medium-sized company, it may be impossible to gather such data in a reasonable period of time.
(The VALDAT data on the validity of the formula score as a predictor of performance in our programs have been gathered over a period of 12 years, increasing at the rate of about 20 cases per year. We didn’t have 200 until 10 years into the project.)
2. Politics. It is easier to make lay persons understand the results of a content valid test than one that
has high criterion-related validity but is not content valid. This includes the courts.
Assessing Content Validity: The Content Validity Ratio

1. Convene a panel of subject matter experts (SMEs).
2. Have each judge rate each item on the test as a) Essential, b) Useful, or c) Not necessary.
3. Label the total number of judges as N.
4. For each item, count the number of judges rating the item as Essential. Label this count NE. Then compute the item’s Content Validity Ratio as

   CVR = (NE - N/2) / (N/2)

5. Compute the mean of the individual item CVRs as the test CVR.

Note that the CVR can range from +1, representing the highest possible content validity, to -1, representing the lowest possible content validity.
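A sketch of the computation (the ratio is often attributed to Lawshe), with a hypothetical panel of 10 SMEs:

```python
# Content Validity Ratio: CVR = (NE - N/2) / (N/2) per item,
# then the mean over items for the test as a whole.
def cvr(n_essential, n_judges):
    return (n_essential - n_judges / 2) / (n_judges / 2)

N = 10                             # hypothetical panel of 10 SMEs
essential_counts = [9, 7, 10, 4]   # "Essential" counts for each of 4 items

item_cvrs = [cvr(ne, N) for ne in essential_counts]
test_cvr = sum(item_cvrs) / len(item_cvrs)

print(item_cvrs)            # [0.8, 0.4, 1.0, -0.2]
print(round(test_cvr, 2))   # 0.5
```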
Tables of “statistically significant” CVRs have been prepared.
Aiken, L.R. (1980). Content validity and reliability of single items or questionnaires. Educational and
Psychological Measurement, 40, 955–959.
Penfield, R., & Giacobbi, P. (2004). Applying a score confidence interval to Aiken’s item content-relevance index. Measurement in Physical Education and Exercise Science, 8(4), 213-225.
Schmidt, F. L. (2012). Cognitive tests used in selection can have content validity as well as criterion validity:
A broader research review and implications for practice. International Journal of Selection and
Assessment, 20, 1-13.