Developing the Instrument

PhD Research Seminar Series:
Reliability and Validity in
Tests and Measures
Dr. K. A. Korb
University of Jos
Theory of Reliability
Split-Half Reliability
Test-Retest Reliability
Alternate Forms Reliability
Inter-Rater Reliability
Construct Validity
Criterion Validity
Content Validity
Face Validity
Test developer: The person who created the test
Test user: A person administering the test
Test taker: Person taking the test
Reliability: Consistency of results
Reliability Theory
Actual score on test = True score + Error
True Score: Hypothetical actual score on test
The reliability coefficient indicates the ratio
between the true score variance on the
test and the total variance
In other words, as the error in testing decreases,
the reliability increases
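The relationship between error and reliability can be written out as a short derivation. This is a sketch under the standard classical test theory assumption, which the slide does not state, that true scores and errors are uncorrelated. The symbols X (actual score), T (true score), E (error) and r_XX' (reliability coefficient) are notation introduced here for illustration.

```latex
\begin{align}
X &= T + E \\
\sigma_X^2 &= \sigma_T^2 + \sigma_E^2
  \quad \text{(assuming $T$ and $E$ are uncorrelated)} \\
r_{XX'} &= \frac{\sigma_T^2}{\sigma_X^2}
         = 1 - \frac{\sigma_E^2}{\sigma_X^2}
\end{align}
```

As the error variance shrinks toward zero, the coefficient approaches 1; as error variance grows relative to total variance, the coefficient approaches 0.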
Reliability: Sources of Error
Error in Test Construction
  Error in Item Sampling: Results from items that measure more than one
  construct in the same test
    For example: A test that has items assessing both reading and math ability
    will have lower reliability than a test that assesses just reading
Error in Test Administration
  Test environment: Room temperature, amount of light, noise, etc.
  Test-taker variables: Illness, amount of sleep, test anxiety, etc.
  Examiner-related variables: Absence of examiner, examiner’s
  demeanor, etc.
Error in Test Scoring
  Scorer: With subjectively marked assessments, different scorers may
  give different scores to the same responses
Error due to Test Construction
Measured by Split-Half Reliability: Determines how
consistently your measure assesses the construct of interest.
A low split-half reliability indicates poor test construction.
If your measure assesses multiple constructs, split-half reliability
will be considerably lower.
  Separate the constructs that you are measuring into different sections
  of the questionnaire and calculate the reliability separately for each.
If you get a low reliability coefficient, then your measure is probably
measuring more constructs than it is designed to measure.
  Revise your measure to focus more directly on the construct of
  interest.
When validating a measure, you will most likely calculate the
split-half reliability of your instrument.
Error due to Test Construction
Calculating Split-Half Reliability
If you have dichotomous items (e.g., right-wrong answers)
as you would with multiple choice exams, calculate the
If you have a Likert scale, essays, or other types of items,
use the Spearman-Brown formula.
For a step-by-step example of calculating the
Split-Half Reliability, see the associated
presentation entitled Calculating Reliability of
Quantitative Measures.
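As a rough illustration of the Spearman-Brown approach for Likert-type items, the sketch below splits each participant's items into two halves, correlates the half totals, and steps the result up to full-test length. The odd/even split, the function names, and the response data are assumptions made for this example; they are not taken from the associated presentation.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman_brown(r_half):
    """Step a half-test correlation up to an estimate for the full test."""
    return (2 * r_half) / (1 + r_half)


def split_half_reliability(responses):
    """responses: one list of item scores per participant.

    Assumption: items are split into odd- and even-numbered halves.
    """
    odd_totals = [sum(person[0::2]) for person in responses]
    even_totals = [sum(person[1::2]) for person in responses]
    return spearman_brown(pearson_r(odd_totals, even_totals))


# Hypothetical responses: 5 participants, 6 Likert items scored 1-5.
responses = [
    [4, 5, 4, 4, 5, 4],
    [2, 1, 2, 3, 2, 2],
    [3, 3, 4, 3, 3, 4],
    [5, 4, 5, 5, 4, 5],
    [1, 2, 1, 2, 2, 1],
]

reliability = split_half_reliability(responses)
print(f"Split-half reliability (Spearman-Brown): {reliability:.2f}")
```

If the measure has separate sections for different constructs, the same function could be run once per section.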
Error due to Test Administration
Test-Retest Reliability: Determines how
much error in a test score is due to
problems with test administration.
To calculate:
Administer the same test to the same participants
on two different occasions, perhaps a week or two apart.
Correlate the test scores of the two
administrations of the same test using Pearson’s
Product Moment Correlation.
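The correlation step can be sketched in a few lines of Python. The scores below are hypothetical and the function name is chosen for this example only; any statistics package that computes Pearson's correlation would serve equally well.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical total scores for the same six participants.
# Each position in the two lists must refer to the same participant.
first_administration = [12, 15, 9, 20, 17, 11]
second_administration = [13, 14, 10, 19, 18, 10]

test_retest_r = pearson_r(first_administration, second_administration)
print(f"Test-retest reliability: {test_retest_r:.2f}")
```

The same calculation applies wherever two sets of scores from the same participants are correlated, for example scores on two forms of a measure or marks from two raters.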
Error due to Test Construction with Two Forms
of the Same Measure
Parallel Forms Reliability: Determines the
similarity of two different versions of the
same measure.
To calculate
Administer the two tests to the same participants
within a short period of time.
Correlate the test scores of the two tests using
Pearson’s Product Moment Correlation.
Error due to Test Scoring
Inter-Rater Reliability: Determines how
closely two different raters mark the
To calculate
Give the exact same test results from one test
administration to two different raters.
Correlate the two markings from the different
raters using Pearson’s Product Moment Correlation.
Validity: Measuring what is supposed
to be measured
Three types of validity:
Construct validity: Measure the appropriate
psychological construct
Criterion validity: Predict appropriate outcomes
Content validity: Adequate sample of content
Each type of validity should be established
for all psychological tests.
Construct Validity
Definition: Appropriateness of inferences drawn
from test scores regarding an individual’s status of
the psychological construct of interest
For example, a test is developed to measure Reading
Ability. Once the test is administered to students, does
their score on the test accurately reflect their true reading
ability?
Two considerations:
Construct underrepresentation
Construct irrelevant variance
Construct Validity
Construct underrepresentation: A test does not measure
all of the important aspects of the construct.
For example, a test of academic self efficacy (perceived
effectiveness in academics) might measure self efficacy only in
math and science, thus ignoring other important academic
Construct-irrelevant variance: Test scores are affected by
other unrelated processes
For example, a test of statistical knowledge that requires
complex calculations is likely influenced by construct-irrelevant
variance. In addition to measuring statistical knowledge, the test
is also measuring calculation ability.
Sources of Construct Validity
Homogeneity: The test measures a single construct
  Evidence: High internal consistency, calculated by split-half reliability
Convergence: Test is related to other measures of the same
construct and related constructs
  Evidence: High correlations with other measures (same as criterion validity)
Theory: The test behaves according to theoretical propositions
about the construct
  Evidence by changes in test scores according to age: Scores on the
  measure should change by age as predicted by theory.
    For example, intelligence scores of one person should increase as that
    person gets older because theories of intelligence predict increases with age.
  Evidence from treatments: Scores on the measure change as predicted
  by theory from a treatment between pretest and posttest.
    For example, scores on a test of Knowledge of Nigerian Government
    should significantly increase after a course on the Nigerian Government.
Criterion Validity
Definition: Correlation between the measure and a criterion
Criterion: Other accepted measures of the construct or
measures of other constructs similar in nature.
A criterion can consist of any standard with which your test
should be related
Behavior (e.g., misbehavior in class, teacher’s interactions with
students, days absent from school)
Other test scores (e.g., standardized test scores)
Ratings (e.g., teachers’ ratings of helpfulness)
Psychiatric diagnosis (e.g., depression, schizophrenia)
Criterion Validity
Three types:
Convergent validity: High correlations with
measures of similar constructs taken at the same
time.
Divergent validity: Low correlations with
measures of different constructs taken at the
same time.
Predictive validity: High correlation with a
criterion in the future
Criterion Validity
Example: You developed an essay test of science reasoning
to admit students into the science programme at the
university.
Convergent Validity: Your test should have high correlations
with other science tests, particularly well-established science
tests.
Divergent Validity: Your test should have low correlations with
measures of writing ability because your test should only
measure science reasoning, not writing ability.
Predictive Validity: Your test should have high correlations with
future grades in science courses because the purpose of the
test is to determine who will do well in the science programme
at the university.
Criterion Validity Example
Criterion Validity Evidence for New Science Reasoning Test:
Correlations between the New Science Reasoning Test and Other Measures
WAEC Science Scores, School Science Marks: High correlations with
other measures of science ability indicate good criterion validity.
WAEC Writing Scores, WAEC Reading Scores: Low correlations with
measures unrelated to science ability indicate good criterion validity.
Future marks in university science: High correlation with future
measures of science ability indicates good criterion validity.
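A table of this kind can be produced by correlating the new test with each criterion in turn. The sketch below uses invented scores for six students purely to show the mechanics; the numbers are hypothetical and are not the values from the original table.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical scores for the same six students on every measure.
new_science_reasoning = [54, 62, 70, 48, 80, 64]
criteria = {
    "WAEC Science Scores": [57, 60, 72, 50, 84, 61],                 # convergent
    "WAEC Writing Scores": [62, 52, 61, 63, 64, 58],                 # divergent
    "Future marks in university science": [50, 58, 68, 45, 75, 64],  # predictive
}

# Correlate the new test with each criterion measure.
correlations = {
    name: pearson_r(new_science_reasoning, scores)
    for name, scores in criteria.items()
}

for name, r in correlations.items():
    print(f"{name}: r = {r:.2f}")
```

With these invented scores, the science and future-marks correlations come out high and the writing correlation comes out near zero, which is the pattern described above for good criterion validity.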
Content Validity
Definition: Sampling the
entire domain of the
construct it was designed to
measure
For example:
The first chart represents the
amount of time in class spent
on each maths topic
The second chart represents
the number of test questions on
each maths topic
This test does NOT
demonstrate content validity
because the proportion of test
questions does not match the
proportion of coverage in class.
[Charts: Class Coverage; Test Coverage]
Content Validity
[Charts: Class Coverage; Test Coverage]
For academic tests, a test is
considered content valid when
the proportion of material
covered by a test
approximates the proportion of
material covered in a class.
This maths test demonstrates
good content validity because
the proportion of test
questions on each topic
matches the proportion of time
spent in class on each topic.
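The comparison of proportions can be sketched as follows. The topic names, hours, and question counts are hypothetical, and the function names are chosen for this example. The code only reports the largest gap between class coverage and test coverage; deciding how large a gap is acceptable remains a judgment call.

```python
def coverage_proportions(amounts):
    """Convert raw amounts per topic (hours or question counts) to proportions."""
    total = sum(amounts.values())
    return {topic: amount / total for topic, amount in amounts.items()}


def largest_coverage_gap(class_time, test_questions):
    """Return the topic with the biggest mismatch and the size of that mismatch.

    Only topics taught in class are compared; a topic missing from the test
    is treated as having zero test coverage.
    """
    class_share = coverage_proportions(class_time)
    test_share = coverage_proportions(test_questions)
    gaps = {
        topic: abs(share - test_share.get(topic, 0.0))
        for topic, share in class_share.items()
    }
    worst_topic = max(gaps, key=gaps.get)
    return worst_topic, gaps[worst_topic]


# Hypothetical maths class: hours spent on each topic.
class_time = {"addition": 10, "subtraction": 10, "multiplication": 10, "division": 10}

# Hypothetical test that over-samples one topic.
unbalanced_test = {"addition": 14, "subtraction": 2, "multiplication": 2, "division": 2}

# Hypothetical test whose questions mirror the class coverage.
balanced_test = {"addition": 5, "subtraction": 5, "multiplication": 5, "division": 5}

print(largest_coverage_gap(class_time, unbalanced_test))
print(largest_coverage_gap(class_time, balanced_test))
```

In this example the unbalanced test gives addition 70% of the questions although it took 25% of class time, while the balanced test shows no gap on any topic.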
Content Validity
Content validity tends to be an important
consideration ONLY for achievement tests
To assess:
Gather a panel of judges
Give the judges a table of specifications of the amount of
content covered in the domain
Give the judges the measure
Judges draw a conclusion as to whether the proportion of
content covered on the test matches the proportion of
content in the domain.
Face Validity
Face validity addresses whether the test appears to measure
what it purports to measure.
To assess: Ask test users and test takers to evaluate whether
the test appears to measure the construct of interest.
Face validity is rarely of interest to test developers and test
The only instance where face validity is of interest is to instill
confidence in test takers that the test is worthwhile.
Face validity is NOT a consideration for educational
Face validity CANNOT be used to determine the actual
interpretive validity of a test.
Concluding Advice
The best way to determine that the measures
you use are both reliable and valid is to use a
measure that another researcher has developed
and validated
This will assist you in three ways:
You can confidently report that you have accurately
measured the variables you are studying.
By using a measure that has been used before, your study
is intimately tied to previous research that has been
conducted in your field, an important consideration in
determining the importance of your study.
It saves you time and energy in developing your measure
Finding Pre-Existing Measures
Information on how to find pre-existing measures:
Online directory of pre-existing measures
Type the construct you want to measure in the empty box and click the
Search button.
Find the test that is most relevant for your purposes.
When you click on the measure name in blue, if it has a journal article listed
in the Availability category, the measure will be published in that journal
Some tests can also be ordered from the ETS Tests collection for about
N3000 and then downloaded to your computer.
You can also try googling the name of the test to determine if somebody
else has published the measure on the internet.
Websites for Pre-existing Measures
Personality Variables: International Personality
Item Pool
Motivation Constructs: Self Determination Theory
Motivation Constructs: Students’ goal orientations: