
PSY 5130 – Lecture 2: Validity

Validity is one of the most overused words in statistics and research methods. We've already encountered statistical conclusion validity, internal validity, construct validity I, and external validity. Now we'll introduce criterion-related validity, concurrent validity, predictive validity, construct validity II, convergent validity, discriminant validity, and content validity. Whew!

A general conceptualization of the "validities" we'll consider here . . . All but content validity are concerned with the extent to which scores on a test correlate with positions of people on some dimension of interest to us.

Specific types of Validity

I. Criterion-Related Validity

Here the "dimension of interest" is performance on some task or job, e.g., job performance or GPA. So criterion-related validity refers to the extent to which pre-employment or pre-admission test scores correlate with performance on some measurable criterion. This is the type of validity that is most important for I-O selection specialists, but it is also applicable to schools deciding among applicants for admission, for example. When someone uses the term "validity coefficient," he/she is most likely referring to criterion-related validity. It's the actual Pearson r between test scores and the criterion measure.

Two specific types of criterion-related validity are often used in I-O psychology when choosing pre-employment tests to predict performance on the job.

A. Concurrent Validity. The correlation of test scores with job performance of current employees. The test scores and the criterion scores are obtained at the same time, e.g., from current employees of an organization. Most often computed.

B. Predictive Validity. The correlation of test scores with later job performance scores of persons just hired. The test scores are obtained prior to hiring; criterion scores are obtained after those who took the pretest have been hired. Computed only occasionally.

Validation Study.
A study carried out by an organization in order to assess the validity of a test.

P513 Lecture 2: Validity - 1    Printed on 2/9/2016

Typical Criterion-related Validities. How good a job do we do?

From Schmidt, F. L. (2012). Validity and Utility of Selection Methods. Keynote presentation at River Cities Industrial-Organizational Psychology Conference, Chattanooga, TN, October. Unless otherwise noted, all operational validity estimates are of the specific type of test as the only predictor and corrected for measurement error (i.e., unreliability) in the criterion measure and indirect range restriction (IRR) on the predictor measure to estimate operational validity for applicant populations. This means that the correlations below are somewhat larger than those you would obtain from computing Pearson r without the corrections.

                                        2012   1998
GMA tests                               .65    .51
Integrity tests                         .46    .41
Employment interviews (structured)      .58    .51
Employment interviews (unstructured)    .60    .38
Conscientiousness                       .22    .31
Reference checks                        .26    .26
Biographical data measures              .35    .35
Job experience                          .13    .18
Person-job fit measures                 .18
SJT (knowledge)                         .26
Assessment centers                      .37    .37
Peer ratings                            .49    .49
T & E point method                      .11    .11
Years of education                      .10    .10
Interests                               .10    .10
Emotional Intelligence (mixed)          .24
Emotional Intelligence (ability)        .24
GPA                                     .34
Person-organization fit measures        .13
Work sample tests                       .33    .54
Emotional Stability                     .12
SJT (behavioral tendency)               .26
Job tryout procedure                    .44
Behavioral consistency method           .45
Job knowledge (Really!!??)              .48

Factors affecting validity coefficients in selection of employees or students. Why aren't correlations = 1?

1. Problems with the selection test.

A. Test is deficient - doesn't measure characteristics that predict some parts of the job. The test may predict one thing; the job may require something else.
Example. Job: Manager. Requirements: Cognitive Ability, Conscientiousness, Interpersonal Skills. Test: Cognitive Ability Test (e.g., the Wonderlic).

As it should, the test measures cognitive ability, which predicts part of what the job involves. But some of the variation in job scores will be due to individual differences in Conscientiousness and Interpersonal Skills, and those differences won't be reflected in the test scores. So the r between Test and Job will be smaller than it should be, due to deficiency of the test.

[Diagram: Test (Wonderlic) reflects CA plus error; Job reflects CA, Conscientiousness, and Interpersonal Skill plus error; the result is a small r.]

B. Test is contaminated - affected by factors other than the factors important for the job.

Example. Job: Small parts assembly. Requirement: Manual Dexterity. Test: Computerized Manual Dexterity Test, which reflects manual dexterity and computer skills.

Some of the variation in Test scores will be due to individual differences in Computer Skills. But those differences won't be reflected in Job scores. So the r between Test and Job will be smaller than it should be, due to contamination of the test.

[Diagram: Test reflects Manual Dexterity and Computer Skills plus error; Job reflects Manual Dexterity plus error; the result is a small r.]

2. Reliability of the Test and Reliability of the Criterion.

This was covered in the section on reliability (the reliability ceiling). Observed validity is affected by the true correlation, rTJ, between the ability measured by the test and the job, AND by errors of measurement in both test scores and job scores, making the observed r small.

3. Range differences between the validation sample and the population in which the test will be used.

If the range (difference between largest and smallest) of scores within the sample used for validation does not equal the range of scores in the population for which the test will be used, the validity coefficient obtained in the validation study will be different from that which would be obtained in use of the test.

A. Validation sample range is restricted relative to the population for which the test will be used - the typical case. e.g., the test is validated using current employees.
It will then be used for the applicant pool, consisting of persons from all walks of life, some of whom would not have been capable enough to be hired.

Population: The applicant pool includes persons who would not have been capable enough to have been hired. They would score low on the test and low on the job, making the overall correlation high.

Validation sample: Current employees are a select group. The correlation between test and job is small within that group.

The result is that the correlation coefficient computed from the validation group will be smaller than that which would have been computed had the whole applicant pool been included in the validation study.

Why do we care about differences in range? When choosing tests, comparing different advertised validities requires that the testing conditions be comparable. A bad predictor validated on a heterogeneous sample may have a larger r than a good predictor validated on a homogeneous sample.

B. Validation sample range is larger than that of the population for which the test will be used - less often encountered.

A test of mechanical ability is validated on a sample from the general college population, including liberal arts majors. But the test is used for an applicant pool consisting only of persons who believe they have the capability to perform the job, which requires considerable mechanical skill. So the range of scores in the applicant pool will be restricted relative to the range in the validation sample.

Population: Applicants are a select group; the correlation between test and job is small within that group.

Validation sample: Consists of a wide range of abilities, most of which are below the level appropriate for the job.

Bottom Line: I feel that criterion-related validity is the most important characteristic of a pre-employment test.

The Issue of Mindless Empiricism as a criticism of the focus on Criterion-related Validity.
Note that the issue of criterion-related validity of a measure has nothing to do with that measure's intrinsic relationship to the criterion. A test may be a good predictor of job performance even though the content of the test bears no relationship to the content of the job. This means that it does not have to make sense that a given test is a good predictor of the criterion. The bottom line in the evaluation of a predictor is the correlation coefficient. If it is sufficiently large, that's good. If it's not large enough, that's not good. Whether there is any theoretical or obvious reason for a high correlation is not the issue here.

Thus, a focus on criterion-related validity only is a very empirical approach to the study of relationships of tests to criteria, with the primary emphasis on the correlation and little thought given to the theory of the relationship. Such a focus gets psychologists in trouble with those to whom they're trying to explain the results. Consider the Miller Analogies Test (MAT), for example. Example item: "Lead is to a pencil as bullet is to a) lead, b) gun, c) killing, d) national health care policy." How do you explain to the parent of a student denied admission that the student's failure to correctly identify enough analogies on the MAT prevents the student from being admitted to a graduate nursing program? There is a significant, positive correlation between MAT scores and performance in nursing programs, but the reason, if known, is very difficult to explain.

Do companies conduct validation studies? Alas, many do not - because they lack expertise, because they don't see the value, because of small sample sizes, or because of difficulty in getting the criterion scores, to name four reasons.

Correcting validity coefficients for reliability differences and range effects (Skipped in 2016)

Why correct?

1.
If I'm evaluating a new predictor, I want to compare it with others on a "level" playing field. That includes removing the effects of unreliability and of range differences between the different samples. So the corrections here permit comparisons with correlations computed in different circumstances. Corrections are in the spirit of removing confounds whenever we can. Examples are standard scores and standardized tests; both remove specific characteristics of the test from the report of performance.

2. In meta-analyses, the correlations that are aggregated must be "equivalent." When comparing different selection tests validated on different samples, we need equivalence.

Standard corrections

1. Correcting for unreliability of the measures. (This is based on the reliability ceiling formula.)

    rtX,tY(1) = rXY / [sqrt(rXX') * sqrt(rYY')]

The corrected r is labeled (1) because there is a 2nd correction, shown below, that is typically made.

Suppose rXY = .6, rXX' = .9, and rYY' = .8. Then rtX,tY(1) = .6 / [sqrt(.9) * sqrt(.8)] = .6 / [(.95)(.89)] = .6 / .85 = .71. This is 18% larger than the observed r.

Caution: The reliabilities and the observed r have to be good estimates of the population values; otherwise correction can result in absurd estimates of the true correlation.

In selection situations, we correct for unreliability in the criterion measure only. The reasoning is as follows: We correct because we want to assess the "true" amount of a construct. In selection situations, the "true" job performance is available - it's what we'll observe over the years an employee is with the firm. So we correct for unreliability of a single measure of job performance. But we don't correct for unreliability in the test because in selection situations, the observed test is the only thing we have.
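As a quick check of the worked example above, the double correction can be computed directly. A minimal sketch (the function name is mine):

```python
import math

def correct_for_unreliability(r_xy, r_xx, r_yy):
    """Correct an observed validity coefficient for unreliability in
    both the predictor (X) and the criterion (Y):
    r_corrected = r_xy / (sqrt(r_xx') * sqrt(r_yy'))."""
    return r_xy / (math.sqrt(r_xx) * math.sqrt(r_yy))

# Worked example from the notes: observed r = .6, rXX' = .9, rYY' = .8
r_corrected = correct_for_unreliability(0.6, 0.9, 0.8)
print(round(r_corrected, 2))  # 0.71, about 18% larger than the observed .6
```

With perfectly reliable measures (both reliabilities = 1), the correction leaves the observed r unchanged, which is a handy sanity check.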
We might be interested in the true scores on the dimension represented by the test, but in selection, we can't use the true scores; we can only use the observed scores. So, in selection situations, the correction for unreliability is

    rX,tY(1) = rXY / sqrt(rYY')

Note that the corrected correlation is labeled rX,tY, not rtX,tY, to indicate that it is corrected only for unreliability of the criterion, Y.

2. Correcting for Range Differences on the criterion variable. This is applicable in some selection situations. (Skipped in 2016)

After correcting for unreliability, a 2nd correction, for range differences, is made. Letting r(1) stand for rtX,tY(1), the correlation corrected for unreliability from the previous page:

                          r(1) * (SUse / SVal)
    rtX,tY(2) = ----------------------------------------------
                sqrt(1 - r(1)^2 + r(1)^2 * (SUse^2 / SVal^2))

SUse is the standard deviation of Y in the population in which the test will be used. SVal is the standard deviation of Y in the validation sample. If SVal is less than SUse - the typical restriction-of-range situation - this r(2) will be larger than r(1).

3. Other corrections. There is a growing literature on corrections in meta-analyses and in selection situations. References . . .

Stauffer, J. M., & Mendoza, J. L. (2001). The proper sequence for correcting correlation coefficients for range restriction and unreliability. Psychometrika, 66(1), 63-68.

Sackett, P. R., & Yang, H. (2000). Correction for range restriction: An expanded typology. Journal of Applied Psychology, 85(1), 112-118.

Hunter, J. E., & Schmidt, F. L. (2004). Methods of Meta-Analysis: Correcting Error and Bias in Research Findings (2nd ed.). Thousand Oaks, CA: Sage.

The bottom line of this is that if you're involved in selection, you should be familiar with the language used when discussing validity of selection tests.
That language will include the concepts of reliability, correction, and range restriction discussed here.

II. Construct Validity II (Recall that Construct Validity I was presented in the fall.)

Definition: the extent to which observable test scores represent positions of people on underlying, not directly observable, dimensions.

Why are we interested in this?

1. In Selection. For many jobs, particularly high-level jobs, it is felt that performance is determined by certain attributes which are not directly observable, such as leadership ability, motivating potential, initiative, extraversion, conscientiousness, etc. Look at managerial job descriptions for examples of such constructs. So it is felt that if we can measure these traits, then we can identify persons who will be good managers, leaders, sales persons, etc.

Note the distinction between this approach and the criterion-related validity approach. With the criterion-related approach, we don't really care WHY someone performs well; all we wish to do is identify a test that correlates with performance, regardless of the causes of that performance. It's a very empirical approach. Here, we definitely have a theory of why persons perform well, e.g., people who have high leadership ability will be better managers. So we seek tests that measure those underlying, not directly observable attributes that we believe contribute to good performance. It's a very theoretical approach.

2. In theory construction. Our theories are made up of relationships between theoretical constructs - attributes of people and how those attributes are related to behavior or other attributes. Construct validity concerns how those not-directly-observable attributes are measured. e.g., in I-O, there are theories of the relationship of job satisfaction and work motivation to performance, to turnover, and to other organizational factors. These theories are examples of what I-O psychology is about.
Other theories define other areas of psychology. All such theories are, ultimately, collections of statements concerning relationships among constructs. Whether a measure is the best measure of a construct important for our theory is what construct validity is about. Is our measure of job satisfaction appropriate? If it is, we can proceed to test our theory relating job satisfaction to other organizational constructs. If it's not, then there is no point in using that measure to test our theory.

Assessing Construct Validity

How do we know a test measures an unobservable dimension? This is kind of like pulling oneself up by one's own bootstraps. Solution: Ultimately, construct validity is based on a subjective consensus of opinion of persons knowledgeable about the construct under consideration. Construct validity of the first measure of a construct is purely subjective: we begin with a measure of the construct that knowledgeable people agree measures the construct. Construct validity of the 2nd and subsequent measures of a construct uses the first measure: we correlate subsequent measures of the construct with the existing measure (or measures). Each such evaluation adds to our knowledge of what the construct is.

Specific criteria for construct validity of the subsequent measures. Generally speaking, a new measure of a construct has high construct validity if

a. the new measure has high convergent validity, i.e., it correlates strongly with other purported measures of the construct, and

b. the new measure has high discriminant validity, i.e., it correlates negligibly (near zero) with measures of other constructs that are unrelated to (correlate zero with) the construct under consideration. Discriminant validity refers to lack of correlation.

Convergent validity: The correlation of a test with other purported measures of the same construct.

Two ways of assessing convergent validity of a test:

1.
Correlation approach: Correlate scores on the test with other measures of the same construct.

2. Group Differences approach: Find groups known to differ on the construct. Determine whether they are significantly different on the test.

Example: assessing the convergent validity of a new measure of extraversion. Suppose sales potential is determined to a considerable extent by extraversion. Measure successful sales people and clerks using the new measure. If two groups that should be different - successful sales people vs. clerks - are different, that is an indication of high convergent validity. If two groups that should be different are not different, that's an indication of low convergent validity.

Discriminant validity: The absence of correlation of a test with measures of other, theoretically unrelated constructs.

Two ways of assessing discriminant validity of a test:

1. Correlation approach: Correlate the test with measures of other, unrelated constructs. Near-zero correlations mean good discriminant validity. For example, a conscientiousness test's correlation with extraversion should be zero, since they're different constructs.

2. Group Differences approach: Determine that groups known to differ on other constructs are not significantly different on the test. Suppose you've developed a new measure of conscientiousness. To assess its discriminant validity, find a group high on extraversion (sales people) and a group low on extraversion (clerks). Give them all the conscientiousness test and compare the mean difference between the two groups on the test. If the conscientiousness test has good discriminant validity, there will not be a significant difference between the two groups, since sales people and clerks are probably about equal in conscientiousness. If it does not have discriminant validity, the two groups will differ significantly.
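The group-differences check for discriminant validity just described can be sketched with a two-sample t statistic. A minimal illustration; the conscientiousness scores below are made up for the example:

```python
import math
from statistics import mean, variance

def welch_t(group1, group2):
    """Welch's two-sample t statistic (does not assume equal variances)."""
    m1, m2 = mean(group1), mean(group2)
    se = math.sqrt(variance(group1) / len(group1) + variance(group2) / len(group2))
    return (m1 - m2) / se

# Hypothetical conscientiousness scores on the new test (1-5 scale)
sales_people = [4.1, 3.8, 4.0, 3.9, 4.2]  # high-extraversion group
clerks       = [4.0, 3.9, 4.1, 3.8, 4.0]  # low-extraversion group

t = welch_t(sales_people, clerks)
# Good discriminant validity: groups that differ on extraversion should
# NOT differ much on conscientiousness, so |t| should be small.
print(round(t, 2))  # 0.46 -- far from significant
```

With a real sample you would also compute the degrees of freedom and a p value, but the point here is only that a near-zero group difference on the unrelated construct is what discriminant validity predicts.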
So, establishing construct validity involves correlating a test with other measures of the same construct and of different constructs. Note that high power is required when demonstrating discriminant validity. If there is discriminant validity, there will not be a relationship in true scores (1 above) or there will not be a difference in true means (2 above). You must be able to argue that the absence of a relationship or difference was not due to low power.

"The Good Test" (Sundays at 10 PM on The Learning Channel)

High reliability: Large positive reliability coefficient - .8 or better to guarantee acceptance.

Good validity:

Good convergent validity: Large correlations in the expected direction with other measures of the same construct.

Good discriminant validity: Small correlations with measures of other, independent constructs.

Examples of assessing construct validity from our research

Convergent validity of bifactor-model measures of the Big 5 vs. scale-score measures. Typically, psychological constructs are assessed using summated scores. (See the next lecture for more than you ever wanted to know about summated scores.) We have been investigating the possibility that responses to personality items are affected by an "affective bias" - a tendency to express the affective state of the respondent in his or her response to the content of an item. We believe that measures of the Big Five dimensions with this "affective bias" removed will be better estimates of the dimensions - "purifying" them, if you will. At the same time, there is considerable evidence of the usefulness of the Big Five summated scale scores. For that reason, our "purified" measures should still exhibit convergent validity with the summated scale scores.

Evidence: NEO-FFI-3 questionnaire, N = 736. Convergent validity correlations of scale scores with "purified" scores:
Extraversion       .867
Agreeableness      .915
Conscientiousness  .881
Stability          .981
Openness           .909

So the "purified" measures correlate strongly with the scale score measures, as they should. But the correlations are not perfect, meaning that perhaps the "contamination" that is present in the scale scores is not in the "purified" scores.

Do the HEXACO-PI-R measures of the Big Five exhibit convergent validity with the NEO-FFI-3 measures? This is a simple, straightforward test of convergent validity. The NEO-FFI questionnaire has been used for many years; the HEXACO questionnaire has been more recently promoted. The HEXACO is said to measure the Big Five plus one more dimension - Honesty/Humility. What is the convergent validity of corresponding scale scores from the two questionnaires? (Data: F2014 Neo-FFI plus HEXACO DualResponders 141227; N = 409.)

Convergent validity correlations of NEO-FFI-3 scale scores with HEXACO-PI-R scale scores:

Extraversion       .816
Agreeableness      .518
Conscientiousness  .766
Stability          .445
Openness           .727

Wow!! If you're measuring Agreeableness or Stability, you have to decide which questionnaire to use. The scales from the two questionnaires exhibit only fair convergent validity. It appears that those two scales from the NEO-FFI-3 measure something different from the HEXACO scales of the same name. Remember, though - this is just one sample of N = 409. We should hold off on a definitive decision until a meta-analysis is available.

Convergent and Discriminant Validity of Response Inconsistency

We've been studying a measure of response inconsistency, defined by the standard deviation of responses to items within the same scale. An overall measure for a questionnaire is the average of the standard deviations of responses to items within all the scales within that questionnaire.
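The inconsistency measure just defined can be sketched as follows. A minimal illustration: the scale names and responses are made up, and I assume the sample standard deviation is the one intended:

```python
from statistics import mean, stdev

def inconsistency(responses_by_scale):
    """Average, across scales, of the standard deviation of one
    respondent's answers to the items within each scale."""
    return mean(stdev(items) for items in responses_by_scale.values())

# Hypothetical respondent: 4 items per scale, 1-5 Likert responses
respondent = {
    "Extraversion":  [4, 4, 4, 4],  # perfectly consistent within the scale
    "Agreeableness": [2, 4, 2, 4],  # inconsistent within the scale
}
print(inconsistency(respondent))  # about 0.577 = mean of SDs 0 and 1.155
```

A perfectly consistent responder (the same answer to every item within each scale) scores exactly 0; larger values mean more within-scale inconsistency.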
We compute an Inconsistency measure from the NEO-FFI-3 and an Inconsistency measure from the HEXACO-PI-R administered to the same respondents. A scatterplot of the two measures illustrates their convergent validity: Pearson r = .665, p < .001. So the two measures exhibit pretty good convergent validity.

Is Inconsistency separate from the Big 5 dimensions? Here are the discriminant validity correlations of inconsistency with scale scores from the two questionnaires (N = 653; 2-tailed p values).

NEO-FFI-3 Inconsistency (sdnmean):

  neoe Extraversion                     r =  .096   p = .014
  neoa Agreeableness                    r = -.004   p = .928
  neoc Conscientiousness                r =  .168   p = .000
  neos Stability                        r = -.072   p = .067
  neoo Openness                         r =  .059   p = .135
  hx Extraversion                       r =  .000   p = .991
  ha Agreeableness                      r = -.193   p = .000
  hc Conscientiousness                  r =  .074   p = .060
  hs Stability (rev. of Emotionality)   r = -.047   p = .235
  ho Openness                           r = -.046   p = .239

HEXACO-PI-R Inconsistency (sdhmean):

  neoe Extraversion                     r =  .116   p = .003
  neoa Agreeableness                    r =  .138   p = .000
  neoc Conscientiousness                r =  .142   p = .000
  neos Stability                        r = -.044   p = .262
  neoo Openness                         r =  .158   p = .000
  hx Extraversion                       r =  .092   p = .019
  ha Agreeableness                      r = -.069   p = .080
  hc Conscientiousness                  r =  .147   p = .000
  hs Stability (rev. of Emotionality)   r = -.093   p = .018
  ho Openness                           r =  .028   p = .478

Although some of the correlations are significantly different from zero, none is larger than .2 in absolute value, so generally, I feel it's reasonable to conclude that inconsistency has high discriminant validity. It's not measuring what the Big Five or HEXACO scales are measuring.

III. Content Validity

The extent to which test content represents the content of the dimension of interest.

Example of a test with high content validity: A test of general arithmetic ability that contains items representing all the common arithmetic operations - addition, subtraction, multiplication, and division.

Example of a test with low content validity: A test of general arithmetic ability that contains only measurements of reaction time and spatial ability.
Note that the issue of whether or not a test contains the content of the dimension of interest has no direct bearing on whether or not scores on the test are correlated with position on that dimension. Of course, the assumption is that tests with high content validity will show high correlations with the dimensions represented by the tests.

Why bother with content validity, in view of the previous emphasis on correlation - with criteria or constructs?

1. Time and money. In many selection situations, it is easier to demonstrate content validity than criterion-related validity. A validation study designed to assess criterion-related validity requires at least 200 participants. In a small or medium-sized company, it may be impossible to gather such data in a reasonable period of time. (The VALDAT data on the validity of the formula score as a predictor of performance in our programs have been gathered over a period of 12 years, increasing at a rate of about 20 per year. We didn't have 200 until 10 years into the project.)

2. Politics. It is easier to make lay persons understand the results of a content-valid test than one that has high criterion-related validity but is not content valid. This includes the courts.

Assessing Content Validity: The Content Validity Ratio (CVR)

1. Convene a panel of subject matter experts (SMEs).
2. Have each judge rate each item on the test as a) Essential, b) Useful, or c) Not necessary.
3. Label the total number of judges N. For each item, count the number of judges rating the item as Essential; label this count NE.
4. For each item, compute the Content Validity Ratio:

    CVR = (NE - N/2) / (N/2)

5. Compute the mean of the individual item CVRs as the test CVR.

Note that the CVR can range from +1, representing the highest possible content validity, to -1, representing the lowest possible content validity. Tables of "statistically significant" CVRs have been prepared.

Aiken, L.R. (1980).
Content validity and reliability of single items or questionnaires. Educational and Psychological Measurement, 40, 955-959.

Penfield, R., & Giacobbi, P. (2004). Applying a score confidence interval to Aiken's item content-relevance index. Measurement in Physical Education and Exercise Science, 8(4), 213-225.

Schmidt, F. L. (2012). Cognitive tests used in selection can have content validity as well as criterion validity: A broader research review and implications for practice. International Journal of Selection and Assessment, 20, 1-13.