VALIDITY & RELIABILITY

VALIDITY,
RELIABILITY &
PRACTICALITY
Prof. Rosynella Cardozo
Prof. Jonathan Magdalena
QUALITIES OF MEASUREMENT
DEVICES
 Validity
Does it measure what it is supposed to measure?
 Reliability
How representative is the measurement?
 Practicality
Is it easy to construct, administer, score and
interpret?
 Backwash
What is the impact of the test on the teaching/learning process?
VALIDITY
The term validity refers to whether or not a test
measures what it intends to measure.
On a test with high validity the items will be closely linked
to the test’s intended focus. For many certification and
licensure tests this means that the items will be highly
related to a specific job or occupation. If a test has poor
validity then it does not measure the job-related content
and competencies it ought to.
There are several ways to estimate the validity of a test,
including content validity, construct validity, criterion-related validity (concurrent & predictive) and face validity.
VALIDITY
 “Content”: related to objectives and their sampling.
 “Construct”: referring to the theory underlying the target.
 “Criterion”: related to concrete criteria in the real world. It can be concurrent or predictive.
 “Concurrent”: correlating highly with another measure already validated.
 “Predictive”: capable of anticipating some later measure.
 “Face”: related to the test’s overall appearance.
1. CONTENT VALIDITY
Content validity refers to the connections
between the test items and the subject-related
tasks. The test should evaluate only the content
related to the field of study in a manner
sufficiently representative, relevant, and
comprehensible.
2. CONSTRUCT VALIDITY
It implies using the construct correctly
(concepts, ideas, notions). Construct validity
seeks agreement between a theoretical concept
and a specific measuring device or procedure.
For example, a test of intelligence nowadays
must include measures of multiple intelligences,
rather than just logical-mathematical and
linguistic ability measures.
3. CRITERION-RELATED
VALIDITY
Also referred to as instrumental validity, it
states that the criteria should be clearly
defined by the teacher in advance. It has to
take into account other teachers´ criteria to
be standardized and it also needs to
demonstrate the accuracy of a measure or
procedure compared to another measure or
procedure which has already been demonstrated to be valid.
4. CONCURRENT VALIDITY
Concurrent validity is a statistical method using
correlation, rather than a logical method.
Examinees who are known to be either masters or non-masters on the content measured by the test are
identified before the test is administered. Once the
tests have been scored, the relationship between the
examinees’ status as either masters or non-masters and
their performance (i.e., pass or fail) is estimated based
on the test. This type of validity provides evidence that
the test is classifying examinees correctly. The stronger
the correlation is, the greater the concurrent validity of
the test is.
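
To make the correlation step concrete, here is a minimal sketch in plain Python (all data hypothetical) that relates known master/non-master status to pass/fail outcomes with the phi coefficient, one common choice for correlating two binary variables.

import math

# Hypothetical data: 1 = master / pass, 0 = non-master / fail.
status  = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]   # known before the test
outcome = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]   # pass/fail on the test

# Build the 2x2 contingency table.
a = sum(s == 1 and o == 1 for s, o in zip(status, outcome))  # master & pass
b = sum(s == 1 and o == 0 for s, o in zip(status, outcome))  # master & fail
c = sum(s == 0 and o == 1 for s, o in zip(status, outcome))  # non-master & pass
d = sum(s == 0 and o == 0 for s, o in zip(status, outcome))  # non-master & fail

# Phi coefficient: (ad - bc) / sqrt((a+b)(c+d)(a+c)(b+d)).
phi = (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
print(f"Concurrent validity (phi): {phi:.2f}")  # closer to 1 = better classification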
5. PREDICTIVE VALIDITY
This is another statistical approach to validity that
estimates the relationship of test scores to an
examinee's future performance as a master or non-master. Predictive validity considers the question,
"How well does the test predict examinees' future
status as masters or non-masters?" For this type of
validity, the correlation that is computed is based on
the test results and the examinee’s later performance. This type of validity is especially useful
for test purposes such as selection or admissions.
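
As a rough illustration of the computation described above, the sketch below (hypothetical scores and outcomes) uses a Pearson correlation between admission-test scores and an invented later performance measure, here a first-year GPA.

from scipy.stats import pearsonr

# Hypothetical data: test scores and the same examinees' later GPA.
test_scores = [52, 61, 70, 78, 85, 90, 93]
later_gpa   = [2.1, 2.4, 2.9, 3.0, 3.4, 3.5, 3.8]

r, _ = pearsonr(test_scores, later_gpa)
print(f"Predictive validity (r): {r:.2f}")  # higher r = better prediction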
6. FACE VALIDITY
Like content validity, face validity is determined by a
review of the items and not through the use of
statistical analyses. Unlike content validity, face
validity is not investigated through formal procedures.
Instead, anyone who looks over the test, including
examinees, may develop an informal opinion as to
whether or not the test is measuring what it is
supposed to measure. While it is clearly of some value
to have the test appear to be valid, face validity alone
is insufficient for establishing that the test is
measuring what it claims to measure.
QUALITIES OF MEASUREMENT
DEVICES
 Validity
Does it measure what it is supposed to measure?
 Reliability
How representative is the measurement?
 Practicality
Is it easy to construct, administer, score and
interpret?
 Backwash
What is the impact of the test on the teaching/learning process?
RELIABILITY
Reliability is the extent to which an experiment,
test, or any measuring procedure shows the same
result on repeated trials. Without the agreement
of independent observers able to replicate
research procedures, or the ability to use research
tools and procedures that produce consistent
measurements, researchers would be unable to
satisfactorily draw conclusions, formulate theories, or make claims about the generalizability of their research. For researchers, five key types of reliability are:
RELIABILITY
 “Equivalency”: related to the co-occurrence of
two items
 “Stability”: related to time consistency
 “Internal”: related to the instruments
 “Inter-rater”: related to agreement between different examiners
 “Intra-rater”: related to a single examiner’s consistency over time
1. EQUIVALENCY RELIABILITY
Equivalency reliability is the extent to which two items measure
identical concepts at an identical level of difficulty. Equivalency
reliability is determined by relating two sets of test scores to
one another to highlight the degree of relationship or association.
For example, a researcher studying university English students
happened to notice that when some students were studying for
finals, they got sick. Intrigued by this, the researcher attempted
to observe how often, or to what degree, these two behaviors co-occurred throughout the academic year. The researcher used the
results of the observations to assess the correlation between
“studying throughout the academic year” and “getting sick”. The
researcher concluded there was poor equivalency reliability
between the two actions. In other words, studying was not a
reliable predictor of getting sick.
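
In practice, the step of relating two sets of test scores to one another is usually a simple correlation. A minimal sketch, assuming two hypothetical parallel forms of the same test given to the same seven examinees:

from scipy.stats import pearsonr

# Hypothetical scores on two forms meant to measure the same concept
# at the same level of difficulty.
form_a = [55, 62, 70, 74, 81, 88, 95]
form_b = [58, 60, 72, 71, 84, 86, 93]

r, _ = pearsonr(form_a, form_b)
print(f"Equivalency reliability (r): {r:.2f}")  # near 1 = forms agree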
2. STABILITY RELIABILITY
Stability reliability (sometimes called test-retest reliability) is the agreement of measuring
instruments over time. To determine stability, a
measure or test is repeated on the same subjects
at a future date. Results are compared and
correlated with the initial test to give a measure
of stability. This method of evaluating reliability is
appropriate only if the phenomenon that the test
measures is known to be stable over the interval
between assessments. The possibility of practice
effects should also be taken into account.
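
A minimal test-retest sketch, assuming hypothetical scores from the same subjects on two occasions:

from scipy.stats import pearsonr

# Hypothetical data: the same test repeated on the same subjects later.
first_admin  = [48, 55, 63, 70, 77, 84, 91]
second_admin = [50, 54, 66, 69, 80, 83, 92]

r, _ = pearsonr(first_admin, second_admin)
print(f"Stability (test-retest) reliability (r): {r:.2f}")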
3. INTERNAL CONSISTENCY
Internal consistency is the extent to which tests or
procedures assess the same characteristic, skill or
quality. It is a measure of the precision between the
measuring instruments used in a study. This type of
reliability often helps researchers interpret data
and predict the value of scores and the limits of the
relationship among variables. For example, analyzing
the internal reliability of the items on a vocabulary
quiz will reveal the extent to which the quiz focuses
on the examinee’s knowledge of words.
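
One widely used index of internal consistency (not named in the text above) is Cronbach's alpha. A minimal sketch with a hypothetical five-item quiz, where rows are examinees and columns are items scored 0/1:

import numpy as np

# Hypothetical item scores: 5 examinees x 5 quiz items.
scores = np.array([
    [1, 1, 1, 0, 1],
    [1, 0, 1, 1, 1],
    [0, 0, 1, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1],
])

k = scores.shape[1]                             # number of items
item_vars = scores.var(axis=0, ddof=1).sum()    # sum of per-item variances
total_var = scores.sum(axis=1).var(ddof=1)      # variance of total scores
alpha = (k / (k - 1)) * (1 - item_vars / total_var)
print(f"Cronbach's alpha: {alpha:.2f}")  # near 1 = items hang together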
4. INTER-RATER RELIABILITY
Inter-rater reliability is the extent to which two or more
individuals (coders or raters) agree. Inter-rater reliability
assesses the consistency of how a measuring system is
implemented. For example, suppose two or more teachers use a rating scale to rate students’ oral responses in an interview (1 being most negative, 5 being most positive). If one rater gives a "1" to a student response while another gives a "5," the inter-rater reliability is clearly low. Inter-rater reliability is dependent upon the ability of two or
more individuals to be consistent. Training, education and
monitoring skills can enhance inter-rater reliability.
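
Agreement on an ordered 1-5 scale like the one above is often summarized with a weighted Cohen's kappa (an index not named in the text, but a standard choice). A minimal sketch with hypothetical ratings:

from sklearn.metrics import cohen_kappa_score

# Hypothetical 1-5 ratings of the same ten oral responses by two teachers.
rater_1 = [5, 4, 3, 5, 2, 1, 4, 3, 5, 2]
rater_2 = [5, 4, 4, 5, 2, 2, 4, 3, 4, 2]

# Quadratic weights penalize a 1-vs-5 disagreement far more than a
# 4-vs-5 near-miss, which suits an ordered rating scale.
kappa = cohen_kappa_score(rater_1, rater_2, weights="quadratic")
print(f"Inter-rater reliability (weighted kappa): {kappa:.2f}")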
5. INTRA-RATER RELIABILITY
Intra-rater reliability is a type of reliability
assessment in which the same assessment is completed
by the same rater on two or more occasions. These
different ratings are then compared, generally by
means of correlation. Since the same individual completes both assessments, the rater's subsequent ratings may be contaminated by memory of the earlier ratings.
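
The comparison "by means of correlation" can be sketched the same way as the other correlation-based indices (hypothetical data: one rater scoring the same eight essays twice, two weeks apart):

from scipy.stats import pearsonr

# Hypothetical data: one rater's 1-5 scores on two occasions.
first_pass  = [4, 3, 5, 2, 4, 3, 5, 1]
second_pass = [4, 3, 4, 2, 5, 3, 5, 2]

r, _ = pearsonr(first_pass, second_pass)
print(f"Intra-rater reliability (r): {r:.2f}")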
SOURCES OF ERROR
 Examinee (is a human being)
 Examiner (is a human being)
 Examination (is designed by and for
human beings)
RELATIONSHIP BETWEEN
VALIDITY & RELIABILITY
Validity and reliability are closely
related.
A test cannot be considered valid unless
the measurements resulting from it
are reliable.
However, results from a test can be
reliable without necessarily being valid.
QUALITIES OF MEASUREMENT
DEVICES
 Validity
Does it measure what it is supposed to measure?
 Reliability
How representative is the measurement?
 Practicality
Is it easy to construct, administer, score and
interpret?
 Backwash
What is the impact of the test on the teaching/learning process?
PRACTICALITY
It refers to the economy of time, effort and
money in testing. In other words, a test should
be…
 Easy to design
 Easy to administer
 Easy to mark
 Easy to interpret (the results)
QUALITIES OF MEASUREMENT
DEVICES
 Validity
Does it measure what it is supposed to measure?
 Reliability
How representative is the measurement?
 Practicality
Is it easy to construct, administer, score and
interpret?
 Backwash
What is the impact of the test on the teaching/learning process?
BACKWASH EFFECT
Backwash effect (also known as washback) is
the influence of testing on teaching and
learning. It is also the potential impact that the
form and content of a test may have on
learners’ conception of what is being assessed
(language proficiency) and what it involves.
Therefore, test designers, deliverers and raters
have a particular responsibility, considering that
the testing process may have a substantial
impact, either positive or negative.
LEVELS OF BACKWASH
It is believed that backwash is a subset of a test’s
impact on society, educational systems and
individuals. Thus, test impact operates at two levels:
 The micro level (the effect of the test on individual
students and teachers)
 The macro level (the impact of the test on society
and the educational system)
Bachman and Palmer (1996)
THANKS