1 Introduction to Validity Validity is related to how well a scale fulfills the function for which it is being used or how “correct” are the inferences that are made based on performance on the scale. It has been described as “scientific inquiry into test (scale) score meaning” (Messick, 1989). A scale is valid if it measures what it is supposed to be measuring. The inferences and conclusions that are reached on the basis of scores on the measure are being validated NOT the scale itself. Construct Validity All other types of validity can be considered under the umbrella of construct validity (Messick, 1989) This type of validity is directly concerned with the theoretical relationships of scores from a measure or scale to other variables. Used when no criterion or universe of content is accepted as entirely adequate (is this ever not true?) It is the extent to which a scale or measure "behaves" the way that it should with respect to established measures of other constructs and can be considered to be the degree to which the measure assesses the theoretical construct or trait that it was designed to measure. This type of validation requires multiple types of evidence, including both logical and empirical evidence Logical evidence can be obtained by examining the items for face validity and clear precise language. If either of these are lacking, then the construct validity of the measure is reduced. Note that the author of your text discounts this as a form of validity. Empirical evidence can be obtained by making and testing predictions about how scores on the measure should behave in various situations These predictions could be based on differences in demographic information, performance criteria, or measures of other constructs for which the relationship between the constructs has already been validated Procedures for Assessing Construct Validity 1. Correlational studies between the test and other related measures 2. Differentiation between groups, such as finding the difference between the average scores of schizophrenic and non-schizophrenic persons on a measure designed to evaluate mental health 3. Factor analysis 2 4. Multi-trait Multi-method Matrix Must measure the construct using two (or more) ways (i.e observation, selfreport, spouse report) and then identify other constructs that can be measured in the same ways AND at different related constructs that can be measured in the same ways. Reliability correlations (using coefficient alpha) should be high and represent item homogeneity within a construct Convergent validity correlations, obtained between measures of the same construct measured using different formats, should be high Discriminant validity correlations, obtained between measures of different constructs measured using either the same (heterotrait-monomethod) or different (heterotrait-heteromethod) formats should be low Content Validity Useful in situations when a measure is to be used to draw inferences about how a person would perform on a larger universe of items, similar to those on the measure itself. This is easiest to evaluate when the domain is well defined (i.e. two-digit addition). When measuring constructs dealing with beliefs, attitudes or dispositions it is more difficult. In theory, a scale has content validity when its items are randomly chosen from the universe of appropriate items. Steps that we have taken to describe the subscales of our scale have helped to ensure content validity. There are three steps to assessing the content validity of a measure: 1. Describe the content domain of the instrument, in detail. For achievement tests these are usually the instructional objectives. 2. Determine what area within the content domain is measured by each item 3. Compare the structure of the measure to that of the content domain and evaluate whether items adequately represents the domain This type of validity is based on individual subjective judgement, rather than empirical evidence. Quite often “experts” are used. Practical Considerations 1. Should objectives be weighted equally or given different weighting based on importance? 2. How should the task be structured when experts are asked to evaluate whether an item is representative of the performance domain or instructional objectives? Five-point scale or dichotomous scale? 3. What aspects of the item should be examined? Subject matter? Cognitive process? Level of complexity? Item format? Response format? 3 4. How should results be summarized? Percentage of items matched to objectives? Percentage of items matched to objectives with high importance rating? Percentage of objectives NOT assessed by any items on the test? Issues 1. Even if all items fit the domain, the domain may not adequately represent the construct 2. Should ethnic, racial, or gender differences be considered? Consider “story” problems in mathematics 3. Should item and/or test performance data be considered? Criterion-Related Validity A criterion is a measure that could be used to determine the accuracy of a decision. This type of validity is useful in situations when scores from a scale are to be used to make decisions about how a respondent would perform on some external behavior, or criterion measure, of practical importance This type of validity is more of a practical issue than a scientific one because it is only concerned with predicting a process, as opposed to understanding it. This type of validity does not imply causal relationships, even when time is an element. Predictive validity relates to how well scores predict success on future performance on the criterion measure. There are two steps in this type of a study: 1. Obtain scores from a group of respondents, but do not use the scores, in any way, for decisions making 2. At some later date, obtain performance measures for the same group of people and correlate these measures with scores to obtain a predictive validity coefficient. This approach is somewhat impractical, since it requires that the sample used in the study be similar to the population. Therefore, selection decisions must be based on a random basis. This could have negative consequences for both the individual and the decision maker. Concurrent validity relates to the current relationship between test scores and the criterion measure. This biases the results because only those who have already been selected are used in the study. Hence, the sample used in the study may be quite different from the population. However, these types of studies are easier and more practical, and research has shown that validity coefficients from these types of studies are similar to those found in predictive validity studies, although they tend to seriously underestimate the population validity due to restriction of range. 4 Practical Problems 1. Criterion measures that are readily available and easily measured are often not sufficiently complete or important 2. Criterion measures that are substantial and important are often difficult to define and measure and typically are better suited to observational assessment 3. Large samples (200 or more) are needed to reflect validity levels that reflect the population accurately at least 90% of the time 4. If those who influence scores on criterion measure are aware of scores on the predictor measure the results may be biased or contaminated 5. Restriction of range, due to selection or ceiling/floor effects, affect correlation coefficients, however this can be corrected for statistically. 6. Correlation coefficients do not reveal how many cases have been correctly classified. 5 Example: Multitrait – Multimethod Matrix H Teacher A I H Tests A I H Observer A I Teacher Ratings Honesty Aggressiveness Intelligence (H) (A) (I) 0.89 0.51 0.89 0.38 0.37 0.76 0.57 0.22 0.09 0.22 0.57 0.10 0.11 0.11 0.46 0.56 0.22 0.11 0.23 0.58 0.12 0.11 0.11 0.45 Observers’ Ratings Tests H A I H A I 0.93 0.68 0.59 0.67 0.43 0.34 0.94 0.58 0.42 0.66 0.32 0.84 0.33 0.34 0.58 0.94 0.67 0.58 0.92 0.60 0.85 Reliability correlations are represented on the diagonal should be high and represent item homogeneity within a construct Convergent validity correlations are underlined. These are the correlations between measures of the same construct measured using different formats. They should be high. Discriminant validity correlations are italicized. These are the correlations obtained between measures of different constructs measured using either the same (heterotrait-monomethod) or different (heterotrait-heteromethod) formats. These should be low