PSYCHOMETRIC PROPERTIES OF A PSYCHOLOGICAL TEST

CORE CONCEPT 1: Validity

The Concept of Validity
➢ Validity can be defined as the agreement between a test score or measure and the quality it is believed to measure.
➢ As applied to a test, validity is a judgment or estimate of how well a test measures what it purports to measure in a particular context.
➢ It is the answer to the question, "Does the test measure what it is supposed to measure?"
➢ It is a judgment based on evidence about the appropriateness of inferences drawn from test scores.
➢ No test or measurement technique is "universally valid" for all time, for all uses, with all types of test-taker populations. Rather, tests may be shown to be valid within what we would characterize as reasonable boundaries of a contemplated usage (for example, a test may be shown to be valid for a particular use with a particular population of test takers at a particular time).
➢ The validity of a test may diminish as the culture or the times change, so the validity of a test must be proven again from time to time.
➢ Validation is the process of gathering and evaluating evidence about validity.
➢ Both the test developer and the test user may play a role in the validation of a test for a specific purpose.
- Test developer: responsible for supplying validity evidence in the test manual.
- Test user: may conduct their own validation studies with their own groups of test takers. Local validation studies are absolutely necessary when the test user plans to alter in some way the format, instructions, language, or content of the test. (For example, a local validation study would be necessary if the test user sought to transform a nationally standardized test into Braille for administration to blind and visually impaired test takers.)
➢ Validity is the evidence for inferences (logical results or deductions) made about a test score. There are three types of evidence: (1) construct related, (2) criterion related, and (3) content related.
➢ Validity is a unitary concept that represents all of the evidence supporting the intended interpretation of a measure; in other words, all three types of validity evidence contribute to a unified picture of a test's validity.
➢ Approaches to assessing validity:
1. Scrutinizing the test's content
2. Relating scores obtained on the test to other test scores or other measures
3. Executing a comprehensive analysis of:
a. how scores on the test relate to other test scores and measures
b. how scores on the test can be understood within some theoretical framework for understanding the construct that the test was designed to measure

CORE CONCEPT 2: Aspects of Validity

To establish the validity of a test, we need to gather several types of evidence:

Face Validity
➢ A judgment concerning how relevant the test items appear to be.
➢ The mere appearance that a measure has validity. Stated another way, if a test definitely appears to measure what it purports to measure "on the face of it," then it could be said to be high in face validity.
➢ For example, a scale to measure anxiety might include items such as "My stomach gets upset when I think about taking tests" and "My heart starts pounding fast whenever I think about all of the things I need to get done."
➢ Face validity is really not validity at all because it does not offer evidence to support conclusions drawn from test scores.
➢ This is not to suggest that face validity is unimportant. In many settings, it is crucial to have a test that "looks like" it is valid. These appearances can help motivate test takers because they can see that the test is relevant.
Content Validity
➢ How many times have you studied for an examination and known almost everything, only to find that the professor has come up with some strange items that do not represent the content of the course? If this has happened, you may have encountered a test with poor content-related evidence for validity.
➢ Content validity considers the adequacy of representation of the conceptual domain the test is designed to cover.
➢ It is the only type of evidence besides face validity that is logical rather than statistical.
Examples:
1. With respect to educational achievement tests, it is customary to consider a test a content-valid measure when the proportion of material covered by the test approximates the proportion of material covered in the course.
2. A cumulative final exam in introductory statistics would be considered content-valid if the proportion and type of introductory statistics problems on the test approximate the proportion and type of introductory statistics problems presented in the course.
3. For an employment test to be content-valid, its content must be a representative sample of the job-related skills required for employment.

THE QUANTIFICATION OF CONTENT VALIDITY
➢ Test developers must consider the wording of the items and the appropriateness of the reading level.
➢ Determination of content validity evidence is often made by expert judgment (multiple judges rate each item in terms of its match or relevance to the content).
➢ Lawshe developed a formula termed the content validity ratio (CVR):

CVR = (n_e - N/2) / (N/2)

where n_e is the number of panelists rating the item "essential" and N is the total number of panelists.
➢ Statistical methods such as factor analysis have also been used to determine whether items fit into conceptual domains.
➢ Two new concepts relevant to content validity evidence were emphasized in the latest version of the standards for educational and psychological tests (Worrell & Roberson, 2016):
1. Construct underrepresentation describes the failure to capture important components of a construct. (For example, if a test of mathematical knowledge included algebra but not geometry, the validity of the test would be threatened by construct underrepresentation.)
2. Construct-irrelevant variance occurs when scores are influenced by factors irrelevant to the construct. (For example, a test of intelligence might be influenced by reading comprehension, test anxiety, or illness.)
➢ Often, test scores reflect many factors besides what the test supposedly measures. For example, many students do poorly on tests because of anxiety or reading problems. A slow reader may get a low score on an examination because he or she did not have adequate time to read through all of the questions.
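The CVR is simple to compute. Below is a minimal sketch in Python; the function name and the panel of ten hypothetical judges are illustrative and not part of Lawshe's original presentation.

```python
# Minimal sketch of Lawshe's content validity ratio for a single item.
# Assumes each expert judge rates the item as "essential", "useful",
# or "not necessary"; only "essential" ratings enter the numerator.

def content_validity_ratio(ratings):
    """CVR = (n_e - N/2) / (N/2), where n_e is the number of 'essential'
    ratings and N is the total number of judges."""
    n = len(ratings)
    n_essential = sum(1 for r in ratings if r == "essential")
    return (n_essential - n / 2) / (n / 2)

# Hypothetical panel of ten judges, eight of whom rate the item essential:
ratings = ["essential"] * 8 + ["useful", "not necessary"]
print(content_validity_ratio(ratings))  # 0.6
```

CVR ranges from -1 (no judge rates the item essential) to +1 (every judge does); any value above 0 means more than half of the panel rated the item essential.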
Criterion Validity
➢ When we want to know how well someone will do on a job, which students we should select for our graduate program, or who is most likely to get a serious disease, we often depend on psychological testing to forecast behavior and inclinations.
➢ Criterion-related validity is a judgment of how adequately a test score can be used to infer an individual's most probable standing on some measure of interest, with the measure of interest being the criterion.
➢ Criterion-related validity tells us just how well a test corresponds with a particular criterion.
➢ A criterion is the standard against which the test is compared.
➢ For example, a test might be used to predict which engaged couples will have successful marriages and which ones will get divorced. Marital success is the criterion, but it cannot be known at the time the couples take the premarital test.
➢ The reason for gathering criterion validity evidence is that the test or measure is to serve as a "stand-in" for the measure we are really interested in. In the marital example, the premarital test serves as a stand-in for estimating future marital happiness.
➢ Two types of validity evidence are subsumed under the heading criterion-related validity: concurrent and predictive validity.
1. Concurrent validity is an index of the degree to which a test score is related to some criterion measure obtained at the same time (concurrently).
2. Predictive validity is an index of the degree to which a test score predicts some criterion measure obtained at a future time.
Before we discuss each of these types of validity evidence in detail, it seems appropriate to raise (and answer) an important question.

What Is a Criterion?
➢ It is the standard against which a test or a test score is evaluated.
➢ For example, if a test purports to measure the trait of athleticism, we might employ "membership in a health club" or any generally accepted measure of physical fitness as a criterion in evaluating whether the athleticism test truly measures athleticism.
➢ Operationally, a criterion can be almost anything: pilot performance in flying a Boeing 767, grade on an examination in Advanced Hairweaving, number of days spent in psychiatric hospitalization; the list is endless.
➢ It can be a test score, a specific behavior or group of behaviors, an amount of time, a rating, a psychiatric diagnosis, a training cost, an index of absenteeism, an index of alcohol intoxication, and so on.
➢ Characteristics of a criterion:
1. Relevant: it is pertinent or applicable to the matter at hand.
2. Valid: an adequate criterion measure must also be valid for the purpose for which it is being used. (If one test, X, such as the Beck Depression Inventory, is being used as the criterion to validate a second test, Y, such as the Hamilton Depression Inventory, then evidence should exist that test X is valid.)
3. Uncontaminated: free of extraneous variables. Criterion contamination is a situation in which a response measure (the criterion) is influenced by factors that are not related to the concept being measured.

Concurrent Validity
➢ If test scores are obtained at about the same time that the criterion measures are obtained, measures of the relationship between the test scores and the criterion provide evidence of concurrent validity.
➢ Applies when the test and the criterion can be measured at the same time.
➢ Comes from assessments of the simultaneous relationship between the test and the criterion, such as between a learning disability test and school performance.
➢ Here the test and the criterion measures are taken at the same time because the test is designed to explain why the person is now having difficulty in school.
Examples:
- Test: Scores on the Beck Depression Inventory / Criterion measure: Clinician's ratings of depression for the same group of clients
- Test: Alcoholism tendency scores / Criterion measure: Significant others' ratings of the amount of alcohol the test takers use

Predictive Validity
➢ The extent to which a score on a scale or test predicts scores on some criterion measure.
➢ The forecasting function of tests is actually a type or form of criterion validity evidence known as predictive validity evidence.
➢ Measures of the relationship between the test scores and a criterion measure obtained at a future time provide an indication of the predictive validity of the test; that is, how accurately scores on the test predict some criterion measure.
Examples:
- Predictor variable: College admission test scores / Criterion: Freshman grade point average
- Predictor variable: Work Productivity Scale / Criterion: Supervisor's rating
- Predictor variable: IQ test / Criterion: General weighted average
➢ Measures of the relationship between college admission tests and freshman grade point averages, for example, provide evidence of the predictive validity of the admission tests.
➢ Measures of the relationship between work productivity results and supervisors' ratings provide evidence of the predictive validity of the work productivity scale.
➢ Measures of the relationship between IQ test results and general weighted averages provide evidence of the predictive validity of the IQ tests.
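To make the predictive examples concrete, here is a minimal sketch with hypothetical data: admission test scores for six applicants and their later freshman grade point averages. The correlation it computes is the "validity coefficient" discussed in the next section.

```python
# Minimal sketch: quantifying predictive validity as a correlation.
# The admission scores and GPAs below are hypothetical, for illustration only.

def pearson_r(x, y):
    """Pearson product-moment correlation between two score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

admission_scores = [520, 610, 480, 700, 560, 650]  # predictor, measured now
freshman_gpa     = [2.8, 3.2, 2.5, 3.8, 3.0, 3.3]  # criterion, measured later
print(pearson_r(admission_scores, freshman_gpa))   # high positive r
```

For concurrent validity the arithmetic is identical; the only difference is that the criterion would be measured at about the same time as the test rather than at a future time.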
Validity Coefficient
➢ The relationship between a test and a criterion is usually expressed as a correlation called a validity coefficient.
➢ This coefficient tells the extent to which the test is valid for making statements about the criterion.
➢ Typically, the Pearson correlation coefficient is used to determine the validity between the two measures.
➢ Depending on variables such as the type of data, the sample size, and the shape of the distribution, other correlation coefficients, such as the Spearman rho rank-order correlation, can be employed.

Evaluating Validity Coefficients
In its booklet Standards for Educational and Psychological Testing, the joint committee of the AERA, the APA, and the NCME (2014) lists several issues of concern when interpreting validity coefficients. Here are some of its recommendations (a short simulation illustrating recommendation 4 follows the list).
1. Look for Changes in the Cause of Relationships
- The logic of criterion validation presumes that the causes of the relationship between the test and the criterion will still exist when the test is in use. Though this presumption holds for the most part, there may be circumstances under which the relationship changes. For example, a test might be used and shown to be valid for selecting supervisors in an industry; however, the validity study may have been done at a time when all the employees were men, making the test valid for selecting supervisors of male employees. If the company hires female employees, then the test may no longer be valid for selecting supervisors because it may not consider the abilities necessary to supervise a sexually mixed group of employees.
2. What Does the Criterion Mean?
- Criterion-related validity studies mean nothing at all unless the criterion is valid and reliable. The criterion should relate specifically to the use of the test.
3. Review the Subject Population in the Validity Study
- Another reason to be cautious of validity coefficients is that the validity study might have been done on a population that does not represent the group to which inferences will be made. For example, some researchers have debated whether validity coefficients for intelligence and personnel tests that are based primarily on white samples are accurate when used to test African American students.
4. Be Sure the Sample Size Was Adequate
- Sometimes a proper validity study cannot be done because there are too few people to study. A common practice is to do a small validity study with the people available. Unfortunately, such a study can be quite misleading. The smaller the sample, the more likely that chance variation in the data will affect the correlation. The larger the sample size in the initial study, the better the likelihood that the relationship will cross-validate.
5. Never Confuse the Criterion with the Predictor
6. Review Evidence for Validity Generalization
- Criterion-related validity evidence obtained in one situation may not generalize to other similar situations.
7. Consider Differential Prediction
- Predictive relationships may not be the same for all demographic groups. The validity for men could differ in some circumstances from the validity for women. Or the validity of the test may be questionable because it is used for a group whose native language is not English, even though the test was validated only for those who spoke English.
- Under these circumstances, separate validity studies for different groups may be necessary.
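Here is the promised sketch for recommendation 4. It assumes a hypothetical population in which the true validity coefficient is .40 and draws repeated small and large validation samples; none of the numbers come from a real validity study.

```python
import numpy as np

# Hypothetical demonstration: with a true population validity of r = .40,
# validity coefficients from small samples vary far more by chance alone
# than coefficients from large samples.
rng = np.random.default_rng(42)

def observed_validity(n):
    # Draw n (test score, criterion) pairs from a population with r = .40.
    cov = [[1.0, 0.4], [0.4, 1.0]]
    test, criterion = rng.multivariate_normal([0.0, 0.0], cov, size=n).T
    return np.corrcoef(test, criterion)[0, 1]

for n in (15, 500):
    rs = [observed_validity(n) for _ in range(1000)]
    print(f"n = {n:3d}: observed r ranged from {min(rs):+.2f} to {max(rs):+.2f}")
```

With only 15 cases, the observed coefficient can wander far from .40 in either direction purely by chance; with 500 cases it stays close to the true value.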
Construct Validity
➢ Before 1950, most social scientists considered only criterion and content evidence for validity. By the mid-1950s, investigators concluded that no clear criteria existed for most of the social and psychological characteristics they wanted to measure. Developing a measure of intelligence, for example, was difficult because no one could say for certain what intelligence was. There was no criterion for intelligence because it is a hypothetical construct.
➢ A construct is defined as something built by mental synthesis. As a construct, intelligence does not exist as a separate thing we can touch or feel, so it cannot be used as an objective criterion.
➢ Contemporary psychologists often want to measure intelligence, love, curiosity, or mental health. None of these constructs are clearly defined, and there is no established criterion against which psychologists can compare the accuracy of their tests. These are the truly challenging problems in measurement.
➢ A construct is an informed, scientific idea developed or hypothesized to describe or explain behavior. Intelligence is a construct that may be invoked to describe why a student performs well in school. Anxiety is a construct that may be invoked to describe why a psychiatric patient paces the floor. Other examples of constructs are job satisfaction, personality, bigotry, clerical aptitude, depression, motivation, self-esteem, emotional adjustment, potential dangerousness, executive potential, creativity, and mechanical comprehension, to name but a few.
➢ Construct validity evidence is established through a series of activities in which a researcher simultaneously defines some construct and develops the instrumentation to measure it.
➢ This process is required when "no criterion or universe of content is accepted as entirely adequate to define the quality to be measured" (Cronbach & Meehl, 1955, p. 282; Sackett, 2003).
➢ There are two ways to establish evidence of construct validity (a short sketch illustrating both follows the discriminant evidence section below):
1. Convergent Evidence
- When a measure correlates well with other tests believed to measure the same construct, convergent evidence for validity is obtained.
- This sort of evidence shows that measures of the same construct converge, or narrow in, on the same thing.
Examples:
a. A new anxiety test correlates with an old anxiety test: we might expect high positive correlations between the new test and older, more established measures of test anxiety.
b. A new anxiety test correlates with a general anxiety test: we might expect more moderate correlations.
c. Correlating assessment scores on a life satisfaction scale with scores obtained on a personal well-being scale: high positive correlations would indicate high convergent validity.
d. Correlating assessment scores on the Beck Depression Inventory with scores obtained on a suicide inventory: high positive correlations would indicate high convergent validity.
e. Roach et al. (1981) provided convergent evidence of the construct validity of the Marital Satisfaction Scale by computing a validity coefficient between scores on it and scores on the Marital Adjustment Test (Locke & Wallace, 1959). The validity coefficient of .79 provided additional evidence of their instrument's construct validity.
2. Discriminant Evidence
- Scientists often confront other scientists with difficult questions such as, "Why should we believe your theory if we already have a theory that seems to say the same thing?" An eager scientist may answer by arguing that his or her theory is distinctive and better. In testing, psychologists face a similar challenge. Why should they create a new test if one is already available to do the job? Thus, one type of evidence needed in test validation is proof that the test measures something unique. This demonstration of uniqueness is called discriminant evidence, or what some call divergent validation.
- A test should have low correlations with measures of unrelated constructs; this is evidence for what the test does not measure.
Examples:
a. Correlating assessment scores on a life satisfaction scale with scores obtained on the Beck Depression Inventory: little or no correlation would indicate high discriminant validity.
b. Correlating assessment scores on an aggression scale with scores obtained on an agreeableness scale: little or no correlation would indicate high discriminant validity.

Evidence of Construct Validity
➢ The various techniques of construct validation may provide evidence, for example, that:
■ The test is homogeneous, measuring a single construct.
■ Test scores increase or decrease as a function of age, the passage of time, or an experimental manipulation, as theoretically predicted.
■ Test scores obtained after some event or the mere passage of time (that is, posttest scores) differ from pretest scores, as theoretically predicted.
■ Test scores obtained by people from distinct groups vary as predicted by the theory.
■ Test scores correlate with scores on other tests in accordance with what would be predicted from a theory that covers the manifestation of the construct in question.
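Here is the promised sketch. The three score lists are hypothetical; the point is only the pattern: a new anxiety test should correlate strongly with an established anxiety test (convergent evidence) and negligibly with an unrelated construct such as vocabulary (discriminant evidence).

```python
import numpy as np

# Hypothetical scores for eight test takers on three measures.
new_anxiety = np.array([12, 18,  9, 22, 15,  7, 20, 14])
old_anxiety = np.array([14, 17, 10, 21, 13,  8, 22, 15])  # same construct
vocabulary  = np.array([25, 24, 18, 26, 12, 22, 15, 28])  # unrelated construct

# Convergent evidence: high positive correlation with the same construct.
print(np.corrcoef(new_anxiety, old_anxiety)[0, 1])  # close to +1

# Discriminant evidence: near-zero correlation with an unrelated construct.
print(np.corrcoef(new_anxiety, vocabulary)[0, 1])   # close to 0
```

Together, the two printed coefficients form the kind of correlation pattern that supports an interpretation of the new test as measuring anxiety, and anxiety specifically.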
CORE CONCEPT 5: Test Bias vs. Test Fairness
➢ In the eyes of many laypeople, questions concerning the validity of a test are intimately tied to questions concerning the fair use of tests and the issues of bias and fairness. Let us hasten to point out that validity, fairness in test use, and test bias are three separate issues. It is possible, for example, for a valid test to be used fairly or unfairly.
➢ Test bias, as applied to psychological and educational tests, may conjure up many meanings having to do with prejudice and preferential treatment (Brown et al., 1999).
➢ For psychometricians, bias is a factor inherent in a test that systematically prevents accurate, impartial measurement.
➢ Rating error is a judgment resulting from the intentional or unintentional misuse of a rating scale. A rating is a numerical or verbal judgment (or both) that places a person or an attribute along a continuum identified by a scale of numerical or word descriptors known as a rating scale.
Types of Rating Error:
1. Leniency error (also known as a generosity error): a type of rating mistake in which the ratings are consistently overly positive, particularly regarding the performance or ability of the participants. It is caused by the rater's tendency to be too positive or too tolerant of shortcomings and thus to give undeservedly high evaluations.
2. Severity error: a type of rating error in which the ratings are consistently overly negative, particularly with regard to the performance or ability of the participants. It is caused by the rater's tendency to be too strict or negative and thus to give undeservedly low scores.
3. Central tendency error: the rater, for whatever reason, exhibits a general and systematic reluctance to give ratings at either the positive or the negative extreme. Consequently, all of this rater's ratings tend to cluster in the middle of the rating continuum.
4. Halo effect: when one trait of a person or thing is used to make an overall judgment of that person or thing.

Test Fairness
➢ Some tests have been labeled "unfair" because they discriminate among groups of people.
➢ It is unfair to administer to a particular population a standardized test that did not include members of that population in the standardization sample.

Relationship between Reliability and Validity
➢ There are two types of validity in research:
- Internal validity: the instruments or procedures used in the research study measured what they were supposed to measure.
- External validity: whether the results can be generalized beyond the immediate study.
➢ There are two types of reliability we consider when assessing a test:
- Internal reliability: the extent to which a measure is consistent within itself.
- External reliability: the extent to which a measure varies from one use to another.
➢ Validity and reliability are interrelated aspects of research. If a test is valid, then its data must also be reliable; yet if a test is reliable, that does not mean that it is valid.
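That last point, that reliability does not guarantee validity, can be shown with a small simulation. Everything below is hypothetical: the "test" consistently measures some other trait, so its two halves agree with each other (reliable) while the total score tells us nothing about the criterion we actually care about (not valid).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

other_trait = rng.normal(size=n)   # what the test actually measures
criterion   = rng.normal(size=n)   # the construct we wanted to measure

# Both halves of the test reflect other_trait, plus a little noise.
half_a = other_trait + rng.normal(scale=0.3, size=n)
half_b = other_trait + rng.normal(scale=0.3, size=n)
total  = half_a + half_b

print(np.corrcoef(half_a, half_b)[0, 1])    # high: internally reliable
print(np.corrcoef(total, criterion)[0, 1])  # near zero: not valid
```

The reverse cannot happen: a test that correlates well with its criterion must be measuring something consistently, which is why validity presupposes reliability.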