Chapter 7
Evaluating What a Test Really Measures

Validity
 The APA Standards for Educational and Psychological Testing (1985) recognized three ways of deciding whether a test is sufficiently valid to be useful.
Validity:
 Does the test measure what it claims to measure?
 The appropriateness with which inferences can be made on the basis of test results.
Validity
 There is no single type of validity appropriate for all testing purposes.
 Validity is not a matter of all or nothing, but a matter of degree.
Types of Validity
 Content
 Criterion-Related (concurrent or predictive)
 Construct
 Face
Content Validity
 Whether the items (questions) on a test are representative of the domain (material) that should be covered by the test.
 Most appropriate for tests like achievement tests (i.e., tests of concrete attributes).
Content Validity
Guiding Questions:
1. Are the test questions appropriate, and does the test measure the domain of interest?
2. Does the test contain enough information to cover appropriately what it is supposed to measure?
3. What is the level of mastery at which the content is being assessed?
***NOTE – Content validity does not involve statistical analysis.
Obtaining Content Validity
Two ways:
 Define the testing universe and administer the test.
 Have experts rate “how essential” each question is (1 - essential, 2 - useful but not essential, 3 - not necessary). Questions are considered valid if more than ½ of the experts indicate the question is “essential” (see the sketch below).
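A minimal sketch of the “more than half rate it essential” rule, assuming a hypothetical panel of five experts and made-up ratings:

```python
# Keep a question only if more than half of the experts rate it
# "1 - essential". Ratings below are hypothetical.
ratings = {
    "Q1": [1, 1, 1, 2, 1],   # each list: one rating per expert (1/2/3)
    "Q2": [2, 3, 1, 2, 2],
    "Q3": [1, 1, 2, 1, 3],
}

for question, expert_ratings in ratings.items():
    n_essential = expert_ratings.count(1)          # experts who said "essential"
    valid = n_essential > len(expert_ratings) / 2  # more than half the panel
    print(f"{question}: {n_essential}/{len(expert_ratings)} essential -> "
          f"{'retain' if valid else 'drop'}")
```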
Defining the Testing Universe
 What is the body of knowledge or behaviors that the test represents?
 What are the intended outcomes (skills, knowledge)?
Developing a Test Plan
Step 1
 Define testing universe
– Locate theoretical or empirical research on the attribute
– Interview experts
Step 2
 Develop test specifications
– Identify content areas (topics to be covered in test)
– Identify instructional objectives (what one should be able to do with these topics)
Step 3
 Establish a test format
Step 4
 Construct test questions
Attributes
 Concrete attributes: can be described in terms of specific behaviors; e.g., ability to play piano, do math problems.
 Abstract attributes: more difficult to describe in terms of behaviors because people might disagree on what the behaviors represent; e.g., intelligence, creativity, personality.
Chapter 8
Using Tests to Make Decisions: Criterion-Related Validity
What is a criterion?
 The standard by which your measure is being judged or evaluated.
 The measure of performance that is correlated with test scores.
 An evaluative standard that can be used to measure a person’s performance, attitude, or motivation.
Two Ways to Demonstrate Criterion-Related Validity
1. Predictive Method
2. Concurrent Method
Criterion-Related Validity
 Predictive validity – correlating test scores with future behavior on the criterion…after examinees have had a chance to exhibit the predicted behavior; e.g., success on the job.
 Concurrent validity – correlating test scores with an independent measure, currently available, of the same trait that the test is designed to measure. Or being able to distinguish between groups known to be different; i.e., significantly different mean scores on the test.
Examples of Concurrent Validity
E.g. 1: Teachers’ ratings of reading ability validated by correlating with reading test scores.
E.g. 2: Validate an index of self-reported delinquency by comparing responses to official police records on the respondents.
 In both predictive and concurrent validity, we validate by comparing scores with a criterion (the standard by which your measure is being judged or evaluated).
 Most appropriate for tests that claim to predict outcomes.
 Evidence of criterion-related validity depends on empirical or quantitative methods of data analysis.
Example of How To Determine Predictive Validity
 Give test to applicants for a position.
 For all those hired, compare their test scores to supervisors’ ratings after 6 months on the job.
 The supervisors’ ratings are the criterion.
 If employees’ test scores correlate with supervisors’ ratings, then the predictive validity of the test is supported (see the sketch below).
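A small illustration of this procedure in Python, assuming hypothetical test scores and six-month supervisor ratings:

```python
# Hypothetical predictive-validity check: correlate hiring-test scores with
# supervisors' ratings collected six months later (the criterion).
from scipy.stats import pearsonr

test_scores = [72, 85, 90, 65, 78, 88, 70, 95, 60, 82]              # at hiring
ratings     = [3.1, 4.0, 4.5, 2.8, 3.5, 4.2, 3.0, 4.8, 2.5, 3.9]    # 6 months later

r, p = pearsonr(test_scores, ratings)
print(f"validity coefficient r = {r:.2f}, p = {p:.3f}")
```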
Problems with using predictive validity
 Restricted range of scores on either predictor or criterion measure will cause an artificially lower correlation (simulated below).
 Attrition of criterion scores; i.e., some folks drop out before you can measure them on the criterion measure (e.g., 6 months later).
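A simulation sketch of the restriction-of-range problem, with made-up data: the correlation computed only on the hired (high-scoring) applicants comes out noticeably smaller than the full-range correlation.

```python
# Demonstration of restriction of range: when only high scorers are hired,
# the test-criterion correlation computed on that subgroup shrinks.
import numpy as np

rng = np.random.default_rng(0)
n, true_r = 5000, 0.6
x = rng.standard_normal(n)                                         # predictor (test)
y = true_r * x + np.sqrt(1 - true_r**2) * rng.standard_normal(n)   # criterion

full_r = np.corrcoef(x, y)[0, 1]
hired = x > np.quantile(x, 0.75)                    # only the top 25% get hired
restricted_r = np.corrcoef(x[hired], y[hired])[0, 1]
print(f"full-range r = {full_r:.2f}, restricted r = {restricted_r:.2f}")
```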
Selecting a Criterion
 Objective criteria: observable and measurable; e.g., sales figures, number of accidents, etc.
 Subjective criteria: based on a person’s judgment; e.g., employee job ratings. Example…
CRITERION MEASUREMENTS MUST THEMSELVES BE VALID!
 Criteria must be representative of the events that they are supposed to measure.
– i.e., sales ability – not just $ amount, but also # of sales calls made, size of target population, etc.
 Criterion contamination – when the criterion measures more dimensions than those measured by the test.
BOTH PREDICTOR AND CRITERION MEASURES MUST BE RELIABLE FIRST!
 E.g., inter-rater reliability obtained by supervisors rating the same employees independently (see the sketch below).
 Reliability estimates of predictors can be obtained by one of the 4 methods covered in Chapter 6.
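One simple way to sketch that inter-rater check, assuming hypothetical ratings from two supervisors; a Pearson correlation is used here purely for illustration (other indices, such as the intraclass correlation, are also common):

```python
# Correlate two supervisors' independent ratings of the same employees
# as a quick inter-rater reliability estimate. Numbers are illustrative.
from scipy.stats import pearsonr

rater_a = [4, 3, 5, 2, 4, 3, 5, 1]
rater_b = [4, 3, 4, 2, 5, 3, 5, 2]
r, _ = pearsonr(rater_a, rater_b)
print(f"inter-rater reliability estimate r = {r:.2f}")
```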
Calculating & Estimating Validity Coefficients
 Validity Coefficient – predictive and concurrent validity are also represented by correlation coefficients. Represents the amount or strength of criterion-related validity that can be attributed to the test.
Two Methods for Evaluating Validity Coefficients
1. Test of significance: a process of determining the probability that the study would have yielded the calculated validity coefficient by chance.
– Requires that you take into account the size of the group (N) from whom we obtained our data.
– When researchers or test developers report a validity coefficient, they should also report its level of significance.
 The coefficient must be demonstrated to be greater than zero.
 p < .05. Look up the critical value in a table (or compute it, as sketched below).
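A sketch of the significance test for a validity coefficient, using the standard t approximation with N − 2 degrees of freedom; the values of r and N are illustrative:

```python
# Is r reliably greater than zero given the sample size N?
from scipy import stats

r, n = 0.45, 30
t = r * ((n - 2) ** 0.5) / ((1 - r**2) ** 0.5)   # t statistic with df = N - 2
p = 2 * stats.t.sf(abs(t), df=n - 2)             # two-tailed p-value
print(f"t = {t:.2f}, p = {p:.4f}")               # p < .05 -> significant
```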
Two Methods for Evaluating Validity Coefficients
2. Coefficient of determination: the amount of variance shared by the two variables being correlated, such as test and criterion, obtained by squaring the validity coefficient.
 r² tells us how much covariation exists between predictor and criterion; e.g., if r = .7, then 49% of the variance is common to both.
 i.e., if the correlation (r) is .30, then the coefficient of determination (r²) is .09. (This means that the test and criterion have 9% of their variation in common.)
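The arithmetic from the slide, worked in a few lines:

```python
# Square the validity coefficient to get the proportion of shared variance.
for r in (0.70, 0.30):
    print(f"r = {r:.2f} -> r² = {r**2:.2f} ({r**2:.0%} shared variance)")
```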
Using Validity Information To Make Predictions
 Linear regression: predicting Y from X.
 Set a “pass” or acceptance score on Y.
 Determine what minimum X score (“cutting score”) will produce that Y score or better (“success” on the job).
 Examples…
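A sketch of finding a cutting score via linear regression, assuming hypothetical test and criterion data and an arbitrary pass mark of 3.5 on Y:

```python
# Fit criterion = a + b*test, then solve for the minimum test score (X)
# that predicts a passing criterion score (Y).
import numpy as np

test = np.array([55, 60, 65, 70, 75, 80, 85, 90])                # predictor X
criterion = np.array([2.4, 2.9, 3.0, 3.4, 3.6, 4.1, 4.2, 4.6])   # criterion Y

b, a = np.polyfit(test, criterion, 1)   # slope and intercept of Y-on-X line
y_pass = 3.5                            # minimum acceptable criterion score
cutting_score = (y_pass - a) / b        # invert the line: X giving Y = y_pass
print(f"predicted Y = {a:.2f} + {b:.3f}*X; cutting score ≈ {cutting_score:.1f}")
```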
Outcomes of Prediction
Hits:
a) True positives - predicted to succeed and did.
b) True negatives - predicted to fail and did.
Misses:
a) False positives - predicted to succeed and didn’t.
b) False negatives - predicted to fail and would have succeeded.
WE WANT TO MAXIMIZE TRUE HITS AND MINIMIZE MISSES!
Predictive validity correlation determines accuracy of prediction.
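A small tally of these four outcomes, assuming hypothetical applicant scores, job outcomes, and a cutting score of 70:

```python
# Compare predicted success (test score at or above the cutting score)
# with actual success on the criterion. All values are hypothetical.
tests    = [80, 62, 75, 58, 90, 66, 71, 55]   # applicant test scores
outcomes = [1,  0,  1,  0,  1,  1,  0,  0]    # 1 = succeeded on the job
cut = 70                                      # cutting score on the test

tp = sum(t >= cut and o == 1 for t, o in zip(tests, outcomes))  # true positives
tn = sum(t <  cut and o == 0 for t, o in zip(tests, outcomes))  # true negatives
fp = sum(t >= cut and o == 0 for t, o in zip(tests, outcomes))  # false positives
fn = sum(t <  cut and o == 1 for t, o in zip(tests, outcomes))  # false negatives
print(f"hits = {tp + tn}, misses = {fp + fn} (TP={tp}, TN={tn}, FP={fp}, FN={fn})")
```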
Chapter 9
Construct Validity
Construct Validity
 The extent to which the test measures a theoretical construct.
 Most appropriate when a test measures an abstract construct (e.g., marital satisfaction).
What is a construct?
 An attribute that exists in theory, but is not directly observable or measurable. (Remember there are 2 kinds: concrete and abstract.)
 We can observe & measure the behaviors that show evidence of these constructs.
 Definitions of constructs can vary from person to person.
– e.g., self-efficacy
 When some trait, attribute, or quality is not operationally defined, you must use indirect measures of the construct, e.g., a scale which references behaviors that we consider evidence of the construct.
 But how can we validate that scale?
Construct Validity
 Evidence of construct validity of a scale may be provided by comparing high vs. low scoring people on behavior implied by the construct (sketched below); e.g., do high scorers on the Attitudes Toward Church Going Scale actually attend church more often than low scorers?
 Or by comparing groups known to differ on the construct; e.g., comparing pro-life members with pro-choice members on an Attitudes Toward Abortion scale.
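A sketch of the church-attendance comparison using an independent-samples t test; the attendance figures are invented for illustration:

```python
# Known-groups evidence: do high and low scorers on the scale actually
# differ on behavior implied by the construct?
from scipy.stats import ttest_ind

high_scorers_attendance = [4, 5, 3, 4, 5, 4]   # church visits/month, high scorers
low_scorers_attendance  = [1, 0, 2, 1, 0, 1]   # church visits/month, low scorers

t, p = ttest_ind(high_scorers_attendance, low_scorers_attendance)
print(f"t = {t:.2f}, p = {p:.4f}")   # a significant difference supports validity
```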
Construct Validity (cont’d)
 Factor analysis also gives you a look at the unidimensionality of the construct being measured; i.e., homogeneity of items (see the sketch below).
 As does the split-half reliability coefficient.
 ONLY ONE CONSTRUCT CAN BE MEASURED BY ONE SCALE!
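One rough way to sketch a unidimensionality check: with simulated items driven by a single factor, the first eigenvalue of the inter-item correlation matrix dominates the rest. A full factor analysis would go further; this is only an illustration with made-up data.

```python
# Five simulated items loading on one common factor; if one construct
# drives all items, the first eigenvalue should dominate.
import numpy as np

rng = np.random.default_rng(1)
factor = rng.standard_normal(300)                       # latent construct
items = np.column_stack([0.7 * factor + 0.7 * rng.standard_normal(300)
                         for _ in range(5)])            # five noisy items

eigvals = np.linalg.eigvalsh(np.corrcoef(items, rowvar=False))[::-1]
print("eigenvalues:", np.round(eigvals, 2))             # expect one large, rest small
```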
Convergent Validity
 Evidence that the scores on a test correlate strongly with scores on other tests that measure the same construct.
– e.g., we would expect two measures of general self-efficacy to yield strong, positive, and statistically significant correlations.
Discriminant Validity
 When the test scores are not correlated with unrelated constructs.
Multitrait-Multimethod Method
 Searching for convergence across different measures of the same thing and for divergence between measures of different things (simulated below).
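A simulated multitrait-multimethod sketch: two traits, each “measured” twice, where the convergent (same-trait, different-method) correlations clearly exceed the discriminant (different-trait) correlations. All data are made up.

```python
# Simulate two traits each measured by two methods, then inspect the
# 4x4 correlation matrix for the convergent/discriminant pattern.
import numpy as np

rng = np.random.default_rng(2)
n = 500
trait_a, trait_b = rng.standard_normal(n), rng.standard_normal(n)

measures = np.column_stack([
    trait_a + 0.5 * rng.standard_normal(n),   # trait A, method 1
    trait_a + 0.5 * rng.standard_normal(n),   # trait A, method 2
    trait_b + 0.5 * rng.standard_normal(n),   # trait B, method 1
    trait_b + 0.5 * rng.standard_normal(n),   # trait B, method 2
])
print(np.round(np.corrcoef(measures, rowvar=False), 2))
```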
Face Validity
 The items look like they reflect whatever is being measured.
 The extent to which the test taker perceives that the test measures what it is supposed to measure.
 The attractiveness and appropriateness of the test as perceived by the test takers.
 Influences how test takers approach the test.
 Uses experts to evaluate.
Which type of validity would be most suitable for the following?
a) mathematics test
b) intelligence test
c) vocational interest inventory
d) music aptitude test
Discuss the value of predictive validity to each of the following:
a) personnel manager
b) teacher or principal
c) college admissions officer
d) prison warden
e) psychiatrist
f) guidance counselor
g) veterinary dermatologist
h) professor in medical school