
VALIDITY

PSYCHOMETRIC PROPERTY OF A PSYCHOLOGICAL TEST
CORE CONCEPT 1: Validity
The Concept of Validity
➢ Validity can be defined as the agreement between a test score or measure and the quality it is believed to measure.
➢ As applied to a test, validity is a judgment or estimate of how well a test measures what it purports to measure in a particular context.
➢ It can be defined as the answer to the question, “Does the test measure what it is supposed to measure?”
➢ It is a judgment based on evidence about the appropriateness of inferences drawn from test scores.
➢ No test or measurement technique is “universally valid” for all time, for all uses, with all
types of test-taker populations. Rather, tests may be shown to be valid within what we
would characterize as reasonable boundaries of a contemplated usage (for example, the
test has been shown to be valid for a particular use with a particular population of test
takers at a particular time).
➢ The validity of a test may diminish as the culture or the times change; consequently, the
validity of a test must be proven again from time to time.
➢ Validation is the process of gathering and evaluating evidence about validity.
➢ The test developer and the test user may play a role in the validation of a test for a
specific purpose.
Test developer – responsible for supplying validity evidence in the test manual.
Test user – may conduct their own validation studies with their own groups of test takers.
Local validation studies are absolutely necessary when the test user plans to alter in
some way the format, instructions, language, or content of the test. (For example, a local
validation study would be necessary if the test user sought to transform a nationally
standardized test into Braille for administration to blind and visually impaired test
takers).
➢ Validity is the evidence for inferences (logical result or deduction) made about a test
score.
Three types of evidence:
(1) construct related
(2) criterion related
(3) content related
Validity is a unitary concept that represents all of the evidence that supports the intended
interpretation of a measure; in other words, all three types of validity evidence
contribute to a unified picture of a test’s validity.
➢ Approaches to assessing validity
1. scrutinizing the test’s content
2. relating scores obtained on the test to other test scores or other measures
3. executing a comprehensive analysis of:
a. how scores on the test relate to other test scores and measures
b. how scores on the test can be understood within some theoretical framework for
understanding the construct that the test was designed to measure
CORE CONCEPT 2: Aspects of Validity
To establish the validity of a test, we need to gather several types of evidence:
Face validity
➢ Is a judgment concerning how relevant the test items appear to be.
➢ Is the mere appearance that a measure has validity.
➢ Stated another way, if a test definitely appears to measure what it purports to measure “on
the face of it,” then it could be said to be high in face validity.
➢ For example, a scale to measure anxiety might include items such as “My stomach gets
upset when I think about taking tests” and “My heart starts pounding fast whenever I
think about all of the things I need to get done.”
➢ Face validity is really not validity at all because it does not offer evidence to support
conclusions drawn from test scores.
➢ We are not suggesting that face validity is unimportant. In many settings, it is crucial to
have a test that “looks like” it is valid. These appearances can help motivate test takers
because they can see that the test is relevant.
Content Validity
➢ How many times have you studied for an examination and known almost everything only
to find that the professor has come up with some strange items that do not represent the
content of the course? If this has happened, you may have encountered a test with poor
content-related evidence for validity.
➢ Content Validity considers the adequacy of representation of the conceptual domain the
test is designed to cover.
➢ It is the only type of evidence besides face validity that is logical rather than statistical.
Examples:
1. With respect to educational achievement tests, it is customary to consider a test a
content-valid measure when the proportion of material covered by the test
approximates the proportion of material covered in the course.
2. A cumulative final exam in introductory statistics would be considered
content-valid if the proportion and type of introductory statistics problems on the
test approximate the proportion and type of introductory statistics problems
presented in the course.
3. For an employment test to be content-valid, its content must be a representative
sample of the job-related skills required for employment.
THE QUANTIFICATION OF CONTENT VALIDITY
➢ Test developers must consider the wording of the items and the appropriateness of the
reading level
➢ Determination of content validity evidence is often made by expert judgment. (Multiple
judges rate each item in terms of its match or relevance to the content)
Lawshe (1975) developed a formula termed the content validity ratio (CVR):
CVR = (ne − N/2) / (N/2), where ne is the number of panelists rating an item “essential”
and N is the total number of panelists.
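As a rough illustration, the CVR can be computed directly from panel judgments. The following is a minimal Python sketch; the panel of ten experts and all of their ratings are hypothetical.

```python
# Minimal sketch of Lawshe's content validity ratio (CVR).
# CVR = (n_e - N/2) / (N/2), where n_e is the number of panelists
# rating an item "essential" and N is the total number of panelists.
# The ratings below are hypothetical.

def content_validity_ratio(n_essential: int, n_panelists: int) -> float:
    """Return Lawshe's CVR for a single item."""
    half = n_panelists / 2
    return (n_essential - half) / half

# Hypothetical panel of 10 experts rating three items as "essential",
# "useful but not essential", or "not necessary".
ratings = {
    "item_1": ["essential"] * 9 + ["useful"],
    "item_2": ["essential"] * 5 + ["not necessary"] * 5,
    "item_3": ["essential"] * 2 + ["useful"] * 8,
}

for item, votes in ratings.items():
    n_e = votes.count("essential")
    cvr = content_validity_ratio(n_e, len(votes))
    print(f"{item}: CVR = {cvr:+.2f}")
# item_1: CVR = +0.80  (strong agreement that the item is essential)
# item_2: CVR = +0.00  (exactly half the panel; no evidence either way)
# item_3: CVR = -0.60  (most judges consider the item nonessential)
```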
➢ Statistical methods such as factor analysis have also been used to determine whether
items fit into conceptual domains
➢ Two new concepts that are relevant to content validity evidence were emphasized in the
latest version of the Standards for Educational and Psychological Testing (Worrell &
Roberson, 2016):
1. Construct underrepresentation describes the failure to capture important
components of a construct. (For example, if a test of mathematical knowledge
included algebra but not geometry, the validity of the test would be threatened by
construct underrepresentation.)
2. Construct-irrelevant variance occurs when scores are influenced by factors
irrelevant to the construct. (For example, a test of intelligence might be influenced
by reading comprehension, test anxiety, or illness.)
Often, test scores reflect many factors besides what the test supposedly measures.
For example, many students do poorly on tests because of anxiety or reading
problems. A slow reader may get a low score on an examination because he or
she did not have adequate time to read through all of the questions.
Criterion Validity
➢ When we want to know how well someone will do on a job, which students we should
select for our graduate program, or who is most likely to get a serious disease, we often
depend on psychological testing to forecast behavior and inclinations.
➢ Criterion-Related Validity is a judgment of how adequately a test score can be used to
infer an individual’s most probable standing on some measure of interest—the measure
of interest being the criterion.
➢ Criterion-Related Validity tells us just how well a test corresponds with a particular
criterion.
➢ A criterion is the standard against which the test is compared.
➢ For example, a test might be used to predict which engaged couples will have successful
marriages and which ones will get divorced. Marital success is the criterion, but it cannot
be known at the time the couples take the premarital test. The reason for gathering
criterion validity evidence is that the test or measure is to serve as a “stand-in” for the
measure we are really interested in. In the marital example, the premarital test serves as a
stand-in for estimating future marital happiness.
➢ Two types of validity evidence are subsumed under the heading criterion-related validity:
concurrent and predictive validity
1. Concurrent validity is an index of the degree to which a test score is related to
some criterion measure obtained at the same time (concurrently).
2. Predictive validity is an index of the degree to which a test score predicts some
criterion measure.
Before we discuss each of these types of validity evidence in detail, it seems
appropriate to raise (and answer) an important question.
What Is a Criterion? It is the standard against which a test or a test score is evaluated.
➢ So, for example, if a test purports to measure the trait of athleticism, we might expect to
employ “membership in a health club” or any generally accepted measure of physical
fitness as a criterion in evaluating whether the athleticism test truly measures athleticism.
➢ Operationally, a criterion can be most anything: pilot performance in flying a Boeing
767, grade on an examination in Advanced Hairweaving, number of days spent in
psychiatric hospitalization; the list is endless.
➢ It can be a test score, a specific behavior or group of behaviors, an amount of time, a
rating, a psychiatric diagnosis, a training cost, an index of absenteeism, an index of
alcohol intoxication, and so on.
➢ Characteristics of a criterion:
1. Relevant - it is pertinent or applicable to the matter at hand
2. Valid – an adequate criterion measure must also be valid for the purpose for
which it is being used. (If one test (X – Beck Depression Inventory) is being
used as the criterion to validate a second test (Y – Hamilton Depression
Inventory), then evidence should exist that test X is valid.)
3. Uncontaminated – free of extraneous variables.
Criterion contamination is a situation in which a response measure (the criterion)
is influenced by factors that are not related to the concept being measured.
Concurrent Validity
➢ If test scores are obtained at about the same time that the criterion measures are
obtained, measures of the relationship between the test scores and the criterion
provide evidence of concurrent validity
➢ Applies when the test and the criterion can be measured at the same time.
➢ Comes from assessments of the simultaneous relationship between the test and the
criterion—such as between a learning disability test and school performance.
➢ Here the measures and criterion measures are taken at the same time because the
test is designed to explain why the person is now having difficulty in school.
Examples:
Test – Criterion Measure
Scores on the Beck Depression Inventory – Clinician’s ratings of depression for the same group of clients
Alcoholism Tendency scores – Significant others’ ratings of the amount of alcohol they use
Predictive Validity
➢ Is the extent to which a score on a scale or test predicts scores on some criterion measure.
➢ The forecasting function of tests is actually a type or form of criterion validity evidence
known as predictive validity evidence.
➢ Measures of the relationship between the test scores and a criterion measure obtained at a
future time provide an indication of the predictive validity of the test; that is, how
accurately scores on the test predict some criterion measure.
Examples:
Predictor Variable – Criterion
College admission test scores/results – Freshman grade point average
Work Productivity Scale – Supervisor’s rating
IQ tests – General weighted average
➢ Measures of the relationship between college admissions tests and freshman grade
point averages, for example, provide evidence of the predictive validity of the
admissions tests.
➢ Measures of the relationship between work productivity results and supervisor’s
rating, for example, provide evidence of the predictive validity of the work
productivity scale.
➢ Measures of the relationship between IQ test results and GWA, for example,
provide evidence of the predictive validity of the IQ tests.
Validity Coefficient
➢ Expresses the relationship between a test and a criterion, usually as a correlation.
➢ This coefficient tells the extent to which the test is valid for making statements about the
criterion.
➢ Typically, the Pearson correlation coefficient is used to determine the validity between
the two measures.
➢ Depending on variables such as the type of data, the sample size, and the shape of the
distribution, other correlation coefficients, such as the Spearman rho rank-order
correlation, can be employed.
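As an illustration of how a validity coefficient might be computed in practice, here is a minimal Python sketch using scipy; the admission scores and grade point averages are invented for the example.

```python
# Sketch: a validity coefficient as the Pearson correlation between
# test scores and a criterion measure. All data are hypothetical.
import numpy as np
from scipy.stats import pearsonr, spearmanr

admission_scores = np.array([510, 620, 480, 700, 650, 580, 540, 690])
freshman_gpa     = np.array([2.4, 3.1, 2.2, 3.7, 3.3, 2.9, 2.6, 3.5])

r, p = pearsonr(admission_scores, freshman_gpa)
print(f"validity coefficient (Pearson r) = {r:.2f}, p = {p:.4f}")

# With ordinal data or a markedly non-normal distribution, Spearman's
# rho rank-order correlation is one alternative:
rho, p_rho = spearmanr(admission_scores, freshman_gpa)
print(f"Spearman rho = {rho:.2f}")
```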
Evaluating Validity Coefficients
In its booklet Standards for Educational and Psychological Testing, the joint committee
of the AERA, the APA, and the NCME (2014) lists several issues of concern when
interpreting validity coefficients. Here are some of its recommendations.
1. Look for Changes in the Cause of Relationships
- The logic of criterion validation presumes that the causes of the
relationship between the test and the criterion will still exist when the test
is in use. Though this presumption is true for the most part, there may be
circumstances under which the relationship changes. For example, a test
might be used and shown to be valid for selecting supervisors in the
industry; however, the validity study may have been done at a time when
all the employees were men, making the test valid for selecting
supervisors for male employees. If the company hires female employees,
then the test may no longer be valid for selecting supervisors because it
may not consider the abilities necessary to supervise a sexually mixed
group of employees.
2. What Does the Criterion Mean?
- Criterion-related validity studies mean nothing at all unless the criterion
is valid and reliable. The criterion should relate specifically to the use of
the test.
3. Review the Subject Population in the Validity Study
- Another reason to be cautious of validity coefficients is that the validity
study might have been done on a population that does not represent the
group to which inferences will be made.
- For example, some researchers have debated whether validity
coefficients for intelligence and personnel tests that are based primarily on
white samples are accurate when used to test African American students.
4. Be Sure the Sample Size Was Adequate
- Sometimes a proper validity study cannot be done because there are too
few people to study. A common practice is to do a small validity study
with the people available. Unfortunately, such a study can be quite
misleading.
- The smaller the sample, the more likely chance variation in the data will
affect the correlation.
- The larger the sample size in the initial study, the better the likelihood
that the relationship will cross-validate.
5. Never Confuse the Criterion with the Predictor
- The predictor (the test) and the criterion must remain distinct measures;
if knowledge of test scores influences the criterion, the resulting validity
coefficient will be artificially inflated.
6. Review Evidence for Validity Generalization
- Criterion-related validity evidence obtained in one situation may not be
generalized to other similar situations.
7. Consider Differential Prediction
- Predictive relationships may not be the same for all demographic groups.
The validity for men could differ in some circumstances from the validity
for women. Or the validity of the test may be questionable because it is
used for a group whose native language is not English, even though the
test was validated for those who spoke only English.
- Under these circumstances, separate validity studies for different groups
may be necessary.
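To make differential prediction concrete, the sketch below computes the test–criterion correlation separately for each group; all scores and the group labels are hypothetical.

```python
# Sketch: checking for differential prediction by computing the
# test-criterion correlation separately for each group.
# All scores and group labels are hypothetical.
import numpy as np
from scipy.stats import pearsonr

test_scores = np.array([52, 61, 45, 70, 66, 58, 49, 73, 55, 64])
criterion   = np.array([2.1, 2.9, 1.8, 3.5, 3.1, 2.6, 2.3, 3.4, 2.4, 3.0])
group       = np.array(["A", "A", "A", "A", "A", "B", "B", "B", "B", "B"])

for g in np.unique(group):
    mask = group == g
    r, p = pearsonr(test_scores[mask], criterion[mask])
    print(f"group {g}: r = {r:.2f} (n = {mask.sum()}, p = {p:.3f})")

# Markedly different r values across groups would suggest that
# separate validity studies (or separate norms) may be needed.
```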
Construct Validity
Before 1950, most social scientists considered only criterion and content evidence for
validity. By the mid-1950s, investigators concluded that no clear criteria existed for most
of the social and psychological characteristics they wanted to measure. Developing a
measure of intelligence, for example, was difficult because no one could say for certain
what intelligence was. There was no criterion for intelligence because it is a hypothetical
construct. A construct is defined as something built by mental synthesis. As a construct,
intelligence does not exist as a separate thing we can touch or feel, so it cannot be used as
an objective criterion.
Contemporary psychologists often want to measure intelligence, love, curiosity, or
mental health. None of these constructs are clearly defined, and there is no established
criterion against which psychologists can compare the accuracy of their tests. These are the
truly challenging problems in measurement.
➢ A construct is an informed, scientific idea developed or hypothesized to describe
or explain behavior.
- Intelligence is a construct that may be invoked to describe why a student performs
well in school. Anxiety is a construct that may be invoked to describe why a
psychiatric patient paces the floor. Other examples of constructs are job
satisfaction, personality, bigotry, clerical aptitude, depression, motivation,
self-esteem, emotional adjustment, potential dangerousness, executive potential,
creativity, and mechanical comprehension, to name but a few.
➢ Construct validity evidence is established through a series of activities in which a
researcher simultaneously defines some construct and develops the instrumentation
to measure it.
➢ This process is required when “no criterion or universe of content is accepted as
entirely adequate to define the quality to be measured” (Cronbach & Meehl, 1955,
p. 282; Sackett, 2003)
➢ Two ways to establish evidence of construct validity:
1. Convergent Evidence – When a measure correlates well with other tests
believed to measure the same construct, convergent evidence for validity is
obtained.
- This sort of evidence shows that measures of the same construct converge, or
narrow in, on the same thing.
Examples:
a. New Anxiety Test correlates with Old Anxiety Test - we might expect
high positive correlations between this new test and older, more
established measures of test anxiety.
b. New Anxiety Test correlates with General Anxiety test - we might also
expect more moderate correlations
c. Correlating assessment scores on a Life Satisfaction Scale with scores
obtained on a Personal Well-Being Scale; high positive correlations would
indicate high convergent validity.
d. Correlating assessment scores on the Beck Depression Inventory with scores
obtained on a Suicide Inventory; high positive correlations would indicate
high convergent validity.
e. Roach et al. (1981) provided convergent evidence of the construct
validity of the Marital Satisfaction Scale by computing a validity
coefficient between scores on it and scores on the Marital Adjustment
Test (Locke & Wallace, 1959). The validity coefficient of .79 provided
additional evidence of their instrument’s construct validity.
2. Discriminant Evidence
➢ Scientists often confront other scientists with difficult questions such as,
“Why should we believe your theory if we already have a theory that
seems to say the same thing?” An eager scientist may answer this
question by arguing that his or her theory is distinctive and better. In
testing, psychologists face a similar challenge. Why should they create
a new test if there is already one available to do the job? Thus, one type
of evidence a person needs in test validation is proof that the test
measures something unique. This demonstration of uniqueness is
called discriminant evidence, or what some call divergent validation.
➢ A test should have low correlations with measures of unrelated
constructs, or evidence for what the test does not measure.
Examples:
a. Correlating assessment scores on a Life Satisfaction Scale with scores
obtained on the Beck Depression Inventory; no correlation would indicate
high discriminant validity.
b. Correlating assessment scores on an Aggression Scale with scores
obtained on an Agreeableness Scale; no correlation would indicate high
discriminant validity.
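A minimal sketch of how convergent and discriminant evidence might be examined together, using simulated data: a new anxiety scale should correlate strongly with an established anxiety measure and only weakly with an unrelated construct such as agreeableness. All scale names and scores here are hypothetical.

```python
# Sketch: convergent vs. discriminant evidence via correlations.
# All scale scores are simulated/hypothetical.
import numpy as np

rng = np.random.default_rng(0)
n = 100
anxiety_true = rng.normal(size=n)  # latent anxiety level

new_anxiety_scale = anxiety_true + rng.normal(scale=0.4, size=n)
old_anxiety_scale = anxiety_true + rng.normal(scale=0.5, size=n)
agreeableness     = rng.normal(size=n)  # unrelated construct

def corr(x, y):
    return np.corrcoef(x, y)[0, 1]

print(f"new vs. old anxiety (convergent):           r = {corr(new_anxiety_scale, old_anxiety_scale):.2f}")
print(f"new anxiety vs. agreeableness (discriminant): r = {corr(new_anxiety_scale, agreeableness):.2f}")

# A high first correlation and a near-zero second one is the pattern
# expected if the new scale measures anxiety and not something unrelated.
```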
Evidence of Construct Validity
➢ The various techniques of construct validation may provide evidence, for example, that
■ The test is homogeneous, measuring a single construct (one common check is sketched after this list).
■ Test scores increase or decrease as a function of age, the passage of time, or an
experimental manipulation as theoretically predicted.
■ Test scores obtained after some event or the mere passage of time (that is, posttest scores)
differ from pretest scores as theoretically predicted.
■ Test scores obtained by people from distinct groups vary as predicted by the theory.
■ Test scores correlate with scores on other tests in accordance with what would be
predicted from a theory that covers the manifestation of the construct in question.
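One common way to examine the first point, homogeneity, is an internal-consistency index such as Cronbach’s alpha. A minimal sketch, with hypothetical item responses:

```python
# Sketch: Cronbach's alpha as one index of test homogeneity.
# alpha = (k / (k - 1)) * (1 - sum(item variances) / variance of total score)
# Item responses below are hypothetical (rows = respondents, cols = items).
import numpy as np

items = np.array([
    [4, 5, 4, 4],
    [2, 2, 3, 2],
    [5, 4, 5, 5],
    [3, 3, 3, 4],
    [1, 2, 1, 2],
    [4, 4, 5, 4],
])

k = items.shape[1]                          # number of items
item_vars = items.var(axis=0, ddof=1)       # variance of each item
total_var = items.sum(axis=1).var(ddof=1)   # variance of total scores
alpha = (k / (k - 1)) * (1 - item_vars.sum() / total_var)
print(f"Cronbach's alpha = {alpha:.2f}")
# Alpha is high here because the hypothetical items co-vary strongly,
# consistent with the test measuring a single construct.
```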
CORE CONCEPT 5: Test Bias vs. Test Fairness
In the eyes of many laypeople, questions concerning the validity of a test are intimately
tied to questions concerning the fair use of tests and the issues of bias and fairness. Let us
hasten to point out that validity, fairness in test use, and test bias are three separate issues.
It is possible, for example, for a valid test to be used fairly or unfairly.
➢ Test Bias - as applied to psychological and educational tests may conjure up many
meanings having to do with prejudice and preferential treatment (Brown et al., 1999).
➢ For psychometricians, bias is a factor inherent in a test that systematically
prevents accurate, impartial measurement.
➢ Rating error – a judgment resulting from the intentional or unintentional misuse of a
rating scale.
A rating is a numerical or verbal judgment (or both) that places a person or an attribute
along a continuum identified by a scale of numerical or word descriptors known as a rating
scale.
Types of Rating Error:
1. Leniency error (also known as a generosity error ) - a type of rating mistake in which
the ratings are consistently overly positive, particularly regarding the performance or
ability of the participants. It is caused by the rater's tendency to be too positive or tolerant
of shortcomings and to give undeservedly high evaluations.
2. Severity error – a type of rating error in which the ratings are consistently overly
negative, particularly with regard to the performance or ability of the participants. It is
caused by the rater's tendency to be too strict or negative and thus to give undeservedly
low scores.
3. Central tendency error – the rater, for whatever reason, exhibits a general and systematic
reluctance to give ratings at either the positive or the negative extreme. Consequently,
all of this rater’s ratings would tend to cluster in the middle of the rating continuum.
4. Halo effect – occurs when one trait of a person or thing is used to make an overall judgment
of that person or thing.
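Rating errors of this kind can be screened for by comparing each rater’s mean and spread of ratings against the scale midpoint and the other raters. A minimal sketch, with hypothetical ratings:

```python
# Sketch: screening raters for leniency, severity, and central
# tendency errors. Ratings (1-5 scale) are hypothetical; each entry
# is one rater scoring the same five ratees.
import numpy as np

ratings = {
    "rater_A": [5, 5, 4, 5, 5],  # consistently high -> possible leniency error
    "rater_B": [1, 2, 1, 2, 1],  # consistently low  -> possible severity error
    "rater_C": [3, 3, 3, 3, 3],  # clustered mid-scale -> possible central tendency error
    "rater_D": [2, 4, 3, 5, 1],  # uses the full range of the scale
}

scale_midpoint = 3.0
for rater, r in ratings.items():
    r = np.asarray(r, dtype=float)
    print(f"{rater}: mean = {r.mean():.1f} "
          f"(scale midpoint {scale_midpoint}), SD = {r.std(ddof=0):.2f}")

# A mean far above the midpoint with little spread suggests leniency,
# far below suggests severity, and a mid-scale mean with near-zero
# spread suggests a central tendency error.
```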
Test Fairness
➢ Some tests have been labeled “unfair” because they discriminate among groups
of people.
➢ It is unfair to administer to a particular population a standardized test that did
not include members of that population in the standardization sample.
Relationship between Reliability and Validity
There are two types of validity:
• Internal validity – the instruments or procedures used in the research study measured what
they were supposed to measure.
• External validity – whether the results can be generalized beyond the immediate study.
There are two types of reliability we use when assessing reliability in a test:
• Internal reliability – the extent to which a measure is consistent within itself.
• External reliability – the extent to which a measure varies from one use to another.
Validity and reliability are interrelated aspects of measurement. If a test is valid, it must
also be reliable; however, a reliable test is not necessarily valid.