Developing the Instrument

PhD Research Seminar Series:
Reliability and Validity in
Tests and Measures
Dr. K. A. Korb
University of Jos
Outline

Reliability
- Theory of Reliability
- Split-Half Reliability
- Test-Retest Reliability
- Alternate Forms Reliability
- Inter-Rater Reliability

Validity
- Construct Validity
- Criterion Validity
- Content Validity
- Face Validity
Overview

- Test developer: The person who created the test
- Test user: The person who administers the test
- Test taker: The person who takes the test
Reliability: Consistency of results

[Figure: target diagrams contrasting reliable and unreliable patterns of scores]
Reliability Theory

- Actual score on test = True score + Error
- True score: The hypothetical actual score on the test
- The reliability coefficient indicates the ratio between the true score variance on the test and the total variance
- In other words, as the error in testing decreases, the reliability increases
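In standard classical test theory notation (a conventional formulation, not taken from these slides), the relationship can be written:

    X = T + E
    r_{xx} = \frac{\sigma_T^2}{\sigma_X^2} = \frac{\sigma_T^2}{\sigma_T^2 + \sigma_E^2}

As the error variance \sigma_E^2 shrinks toward zero, r_{xx} approaches 1, which is why reducing testing error raises reliability.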
Reliability: Sources of Error

- Error in Test Construction
  - Error in Item Sampling: Results from items that measure more than one construct in the same test
    - For example: A test that has items assessing both reading and math ability will have lower reliability than a test that assesses just reading
- Error in Test Administration
  - Test environment: Room temperature, amount of light, noise, etc.
  - Test-taker variables: Illness, amount of sleep, test anxiety, etc.
  - Examiner-related variables: Absence of examiner, examiner's demeanor, etc.
- Error in Test Scoring
  - Scorer: With subjectively marked assessments, different scorers may give different scores to the same responses
Reliability: Error due to Test Construction

- Measured by Split-Half Reliability: Determines how consistently your measure assesses the construct of interest.
  - A low split-half reliability indicates poor test construction.
- If your measure assesses multiple constructs, split-half reliability will be considerably lower.
  - Separate the constructs that you are measuring into different sections of the questionnaire and calculate the reliability separately for each construct.
- If you get a low reliability coefficient, then your measure is probably measuring more constructs than it is designed to measure.
  - Revise your measure to focus more directly on the construct of interest.
- When validating a measure, you will most likely calculate the split-half reliability of your instrument.
Reliability: Error due to Test Construction

Calculating Split-Half Reliability
- If you have dichotomous items (e.g., right-wrong answers), as you would with multiple choice exams, calculate the KR-20 (sketched in code below).
- If you have a Likert scale, essays, or other types of items, use the Spearman-Brown formula (also sketched below).
- For a step-by-step example of calculating split-half reliability, see the associated presentation entitled Calculating Reliability of Quantitative Measures.
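As a rough illustration of both formulas, here is a minimal Python sketch with made-up data; the function names and the use of NumPy are my own, not from the presentation:

    import numpy as np

    def split_half_spearman_brown(scores):
        # scores: 2-D array, rows = test takers, columns = items.
        # Split items into odd- and even-numbered halves, correlate
        # the half totals, then apply the Spearman-Brown correction
        # to estimate reliability at the full test length.
        odd = scores[:, 0::2].sum(axis=1)
        even = scores[:, 1::2].sum(axis=1)
        r_half = np.corrcoef(odd, even)[0, 1]
        return 2 * r_half / (1 + r_half)

    def kr20(scores):
        # Kuder-Richardson 20 for dichotomous (0/1) items.
        k = scores.shape[1]
        p = scores.mean(axis=0)            # proportion passing each item
        q = 1 - p
        total_var = scores.sum(axis=1).var(ddof=1)
        return (k / (k - 1)) * (1 - (p * q).sum() / total_var)

    # Hypothetical marks: 6 students on 8 right/wrong items
    marks = np.array([[1, 1, 0, 1, 1, 0, 1, 1],
                      [1, 0, 1, 1, 0, 1, 0, 1],
                      [0, 1, 1, 0, 1, 1, 1, 0],
                      [1, 1, 1, 1, 1, 0, 1, 1],
                      [0, 0, 1, 0, 1, 1, 0, 0],
                      [1, 1, 0, 1, 0, 1, 1, 1]])
    print(kr20(marks))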
Reliability: Error due to Test Administration

- Test-Retest Reliability: Determines how much error in a test score is due to problems with test administration.
- To calculate:
  - Administer the same test to the same participants on two different occasions, perhaps a week or two apart.
  - Correlate the test scores of the two administrations of the same test using Pearson's Product Moment Correlation, as in the sketch below.
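The correlation step might look like this in Python (a sketch with invented scores; NumPy's corrcoef computes Pearson's r):

    import numpy as np

    # Invented scores for the same five participants, tested twice
    # about two weeks apart.
    first_administration = np.array([72, 85, 64, 90, 78])
    second_administration = np.array([70, 88, 61, 93, 75])

    # Pearson's r between the two administrations is the
    # test-retest reliability coefficient.
    r = np.corrcoef(first_administration, second_administration)[0, 1]
    print(f"Test-retest reliability: {r:.2f}")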
Reliability: Error due to Test Construction with Two Forms of the Same Measure

- Parallel Forms Reliability: Determines the similarity of two different versions of the same measure.
- To calculate:
  - Administer the two tests to the same participants within a short period of time.
  - Correlate the test scores of the two tests using Pearson's Product Moment Correlation.
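Computationally this mirrors the test-retest sketch above: replace the two administrations with scores on Form A and Form B and take Pearson's r between them.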
Reliability: Error due to Test Scoring

- Inter-Rater Reliability: Determines how closely two different raters mark the assessment.
- To calculate:
  - Give the exact same test results from one test administration to two different raters.
  - Correlate the two markings from the different raters using Pearson's Product Moment Correlation.
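Again the arithmetic is the same Pearson's r as in the test-retest sketch, now taken between the two raters' marks on the same set of scripts (e.g., np.corrcoef(rater_a_marks, rater_b_marks)[0, 1], with hypothetical arrays of marks).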
Validity: Measuring what is supposed to be measured

[Figure: target diagrams contrasting valid and invalid measures]
Validity

- Three types of validity:
  - Construct validity: Measures the appropriate psychological construct
  - Criterion validity: Predicts appropriate outcomes
  - Content validity: Adequately samples the content domain
- Each type of validity should be established for all psychological tests.
Construct Validity

- Definition: Appropriateness of inferences drawn from test scores regarding an individual's status on the psychological construct of interest
  - For example, a test is developed to measure Reading Ability. Once the test is administered to students, does their score on the test accurately reflect their true reading ability?
- Two considerations:
  - Construct underrepresentation
  - Construct-irrelevant variance
Construct Validity

- Construct underrepresentation: A test does not measure all of the important aspects of the construct.
  - For example, a test of academic self-efficacy (perceived effectiveness in academics) might measure self-efficacy only in math and science, thus ignoring other important academic subjects.
- Construct-irrelevant variance: Test scores are affected by other, unrelated processes.
  - For example, a test of statistical knowledge that requires complex calculations is likely influenced by construct-irrelevant variance. In addition to measuring statistical knowledge, the test is also measuring calculation ability.
Sources of Construct Validity Evidence

- Homogeneity: The test measures a single construct
  - Evidence: High internal consistency, calculated by split-half reliability
- Convergence: The test is related to other measures of the same construct and related constructs
  - Evidence: High correlations with other measures (the same evidence as Criterion Validity)
- Theory: The test behaves according to theoretical propositions about the construct
  - Evidence from changes in test scores with age: Scores on the measure should change with age as predicted by theory.
    - For example, intelligence scores of one person should increase as that person gets older because theories of intelligence dictate increases with age.
  - Evidence from treatments: Scores on the measure change as predicted by theory from a treatment between pretest and posttest.
    - For example, scores on a test of Knowledge of Nigerian Government should significantly increase after a course on the Nigerian Government.
Criterion Validity

- Definition: Correlation between the measure and a criterion.
  - Criterion: Other accepted measures of the construct, or measures of other constructs similar in nature.
  - A criterion can consist of any standard with which your test should be related.
- Examples:
  - Behavior (e.g., misbehavior in class, teacher's interactions with students, days absent from school)
  - Other test scores (e.g., standardized test scores)
  - Ratings (e.g., teachers' ratings of helpfulness)
  - Psychiatric diagnosis (e.g., depression, schizophrenia)
Criterion Validity

- Three types:
  - Convergent validity: High correlations with measures of similar constructs taken at the same time.
  - Divergent validity: Low correlations with measures of different constructs taken at the same time.
  - Predictive validity: High correlation with a criterion measured in the future.
Criterion Validity

- Example: You developed an essay test of science reasoning to admit students into the science programme at the university.
  - Convergent validity: Your test should have high correlations with other science tests, particularly well-established science tests.
  - Divergent validity: Your test should have low correlations with measures of writing ability because your test should measure only science reasoning, not writing ability.
  - Predictive validity: Your test should have high correlations with future grades in science courses because the purpose of the test is to determine who will do well in the science programme at the university.
Criterion Validity Example

Criterion Validity Evidence for New Science Reasoning Test:
Correlations between Science Reasoning and Other Measures

    Measure                                       Correlation with New Science Reasoning Test
    WAEC Science Scores                           .83
    School Science Marks                          .75
    WAEC Writing Scores                           .34
    WAEC Reading Scores                           .24
    Future marks in university science courses    .65

- High correlations with other measures of science ability indicate good criterion validity.
- Low correlations with measures unrelated to science ability indicate good criterion validity.
- A high correlation with future measures of science ability indicates good criterion validity.
Content Validity

- Definition: Sampling the entire domain of the construct the test was designed to measure
- For example:
  - The first chart represents the amount of time in class spent on each maths topic
  - The second chart represents the proportion of test questions on each maths topic
  - This test does NOT demonstrate content validity because the proportion of test questions does not match the proportion of coverage in class.

[Pie charts: mismatched Class Coverage and Test Coverage across Addition, Subtraction, Multiplication, and Division]
Content Validity

[Pie charts: matching Class Coverage and Test Coverage across Addition, Subtraction, Multiplication, and Division]

- For academic tests, a test is considered content valid when the proportion of material covered by the test approximates the proportion of material covered in the class.
- This maths test demonstrates good content validity because the proportion of test questions on each topic matches the proportion of time spent in class on each topic (a toy numeric check follows below).
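A toy illustration of that proportion check in Python (the topic weights are invented, not from the slides):

    # Hypothetical share of class time and of test questions per topic
    class_coverage = {"Addition": 0.40, "Subtraction": 0.30,
                      "Multiplication": 0.20, "Division": 0.10}
    test_coverage = {"Addition": 0.38, "Subtraction": 0.31,
                     "Multiplication": 0.21, "Division": 0.10}

    # The closer the two distributions are topic by topic, the
    # stronger the content validity argument.
    for topic, share in class_coverage.items():
        gap = abs(share - test_coverage[topic])
        print(f"{topic}: class {share:.0%}, test {test_coverage[topic]:.0%}, "
              f"gap {gap:.0%}")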
Content Validity

- Content validity tends to be an important consideration ONLY for achievement tests
- To assess:
  1. Gather a panel of judges
  2. Give the judges a table of specifications of the amount of content covered in the domain
  3. Give the judges the measure
  4. The judges draw a conclusion as to whether the proportion of content covered on the test matches the proportion of content in the domain.
Face Validity

- Face validity addresses whether the test appears to measure what it purports to measure.
- To assess: Ask test users and test takers to evaluate whether the test appears to measure the construct of interest.
- Face validity is rarely of interest to test developers and test users.
  - The only instance where face validity is of interest is to instill confidence in test takers that the test is worthwhile.
  - Face validity is NOT a consideration for educational researchers.
  - Face validity CANNOT be used to determine the actual interpretive validity of a test.
Concluding Advice

- The best way to ensure that the measures you use are both reliable and valid is to use a measure that another researcher has already developed and validated.
- This will assist you in three ways:
  1. You can confidently report that you have accurately measured the variables you are studying.
  2. By using a measure that has been used before, your study is intimately tied to previous research that has been conducted in your field, an important consideration in determining the importance of your study.
  3. It saves you time and energy in developing your measure.
Finding Pre-Existing Measures

- Information on how to find pre-existing measures:
  - http://www.apa.org/science/faq-findtests.html#printeddirec
- Online directory of pre-existing measures:
  - http://www.ets.org/testcoll/
  - Type the construct you want to measure in the empty box and click the Search button.
  - Find the test that is most relevant for your purposes.
  - When you click on the measure name in blue, if it has a journal article listed in the Availability category, the measure is published in that journal article.
  - Some tests can also be ordered from the ETS Tests collection for about N3000 and then downloaded to your computer.
  - You can also try googling the name of the test to determine if somebody else has published the measure on the internet.
Websites for Pre-existing Measures

- Personality variables: International Personality Item Pool
  - http://ipip.ori.org/ipip/
- Motivation constructs: Self-Determination Theory
  - http://www.psych.rochester.edu/SDT/
- Motivation constructs: Students' goal orientations
  - http://www.umich.edu/~pals/