Module 5
CHARACTERISTICS OF A GOOD TEST
ED 103 Assessment of Student Learning
CSED; MTh 10:30-12:00
Group 5
Leader: Jay Mark Biboso
Secretary: Julie Ann Leoberas
Members: Ella Jane Tagala
Katrina Alona D. Bustria
Loisa Mae Paje
Judelyn Mae Francisco
Jachinlee Laspiñas
Angie Mae Padrillan
Gebelyn Chavez
Alyssa May Nanat
Module 5 Characteristics of a Good Test
Lesson 1: Validity of the Test
a. Important Things to Remember about Validity
b. Types of Validity
c. Factors Affecting the Validity of a Test Item
d. Ways to Reduce the Validity of the Test Items
Lesson 2: Reliability of the Test
a. Factors Affecting the Reliability of Test
b. Four Methods of Establishing Reliability
Lesson 3: Objectivity of the Test
LESSON 1
VALIDITY OF THE TEST
Learning Objectives: Upon successful completion of this lesson, students
will be able to:
• Understand the definition of validity.
• Know some of the important things to remember about validity.
• Identify the different types of validity.
• Explain the importance of test validity.
• Enumerate the factors affecting the validity of test items.
• Identify the ways in which the validity of a test can be reduced.
1.1 Important Things to Remember about Validity
Validity refers to the appropriateness of score-based inferences, or decisions made on the
basis of students' test results; it is the extent to which a test measures what it is supposed to
measure. Validity is "the degree to which evidence and theory support the interpretations of test
scores entailed by the proposed uses of a test" (AERA, APA, & NCME, 1999, p. 9).
1. Validity refers to the decisions we make, and not to the test itself or to the
measurement.
2. Like reliability, validity is not an all-or-nothing concept; it is never totally absent or
absolutely perfect.
3. A validity estimate, called a validity coefficient, refers to a specific type of validity. It
ranges between 0 and 1.
4. Validity can never be finally determined; it is specific to each administration of the test.
1.2 Types of Validity
1. Content Validity determines the extent to which the assessment is representative of
the domain of interest.
2. Construct Validity defines how well a test or experiment measures up to its claims. A
test designed to measure depression must measure only that particular construct, not
closely related constructs such as anxiety or stress.
• Convergent Validity – tests whether constructs that are expected to be related
are, in fact, related.
• Discriminant Validity – occurs when constructs that are expected not to relate do
not, such that it is possible to discriminate between these constructs.
3. Internal Validity is a measure of whether a researcher's experimental design
closely follows the principle of cause and effect.
4. Conclusion Validity occurs when you can conclude that there is a relationship of some
kind between the two variables being examined.
5. External Validity occurs when the causal relationship discovered can be generalized to
other people, times and context.
6. Criterion Validity determines the relationship between an assessment and another
measure of the same trait.
• Concurrent Validity – measures the test against a benchmark test; a high
correlation indicates that the test has strong criterion validity.
• Predictive Validity – is a measure of how well a test predicts ability.
7. Face Validity occurs where something appears to be valid. This, of course, depends
very much on the judgment of the observer.
8. Instructional Validity determines to what extent the domain of content in the test is
taught in class.
1.3 Factors Affecting the Validity of Test Items
Factors affecting the validity of test items:
• The test item itself.
• Unclear test directions.
• Personal factors influencing how students respond to the test.
• Arrangement of the test.
• Length of the test.
• Untaught items.
1.4 Ways to Reduce the Validity of the Test
Validity refers to the accuracy of an assessment; it is the most important consideration in
test evaluation. The term validity refers to whether or not the test measures what it claims to
measure.
The following factors reduce validity:
1. Poorly constructed test items
• Validity will also be affected by how closely the selection of a correct answer
on a test reflects mastery of the material contained in the standards.
• Any additional information that is irrelevant to the question can distract or
confuse the student, thus providing an alternative explanation for why the item
was missed. Keep it simple.
Example:

Change:
The purchase of the Louisiana Territory, completed in 1803 and considered one of
Thomas Jefferson's greatest accomplishments as president, primarily grew out of our
need for
a. the port of New Orleans*
b. helping Haitians against Napoleon
c. the friendship of Great Britain
d. control over the Indians

To:
The purchase of the Louisiana Territory primarily grew out of our need for
a. the port of New Orleans*
b. helping Haitians against Napoleon
c. the friendship of Great Britain
d. control over the Indians
2. Unclear directions
3. Ambiguous items
4. Reading vocabulary too difficult
5. Complicated syntax
• Keep the grammar consistent between stem and alternatives.
Example:

Change:
What is the dietary substance that is often associated with heart disease when found in
high levels in the blood?
a. glucose
b. cholesterol*
c. beta carotene
d. Proteins

To:
What is the dietary substance that is often associated with heart disease when found in
high levels in the blood?
a. glucose
b. cholesterol*
c. beta carotene
d. protein
6. Inadequate time limit
7. Inappropriate level of difficulty
• If the test is too long or appears too difficult, the examinees will be tempted
to guess, and this will increase error directly.
8. Unintended clues
• Avoid providing clues for one item in the wording of another item on the test.
9. Improper arrangement of items
LESSON 2
RELIABILITY OF THE TEST
Learning Objectives: After this lesson, students should be able to:
• Know the importance of reliability.
• Identify the different factors that affect the reliability of a test.
• Know the four methods of establishing reliability.
• Understand the importance of the four methods of establishing
reliability.
• Gain more knowledge of how the methods of establishing reliability are applied.
• Apply the different methods of establishing reliability.
2.1 Reliability
Reliability is a measure of how much you can trust the results of a test. A test can have
high reliability, but at the expense of validity. In other words, reliability refers to the
ability to measure something consistently; that is, to obtain consistent scores every time
something is measured.
2.2 Factors Affecting the Reliability of a Test:
1. Test length
Generally, the longer a test is, the more reliable it is.
2. Speed
When a test is a speed test, reliability can be problematic. It is inappropriate to estimate
reliability using internal consistency, test-retest, or alternate form methods. This is because not
every student is able to complete all of the items in a speed test. In contrast, a power test is a
test in which every student is able to complete all the items.
3. Group homogeneity
In general, the more heterogeneous the group of students who take the test, the more
reliable the measure will be.
4. Item difficulty
When there is little variability among test scores, the reliability will be low. Thus,
reliability will be low if a test is so easy that every student gets most or all of the items correct or
so difficult that every student gets most or all of the items wrong.
5. Objectivity
Objectively scored tests, rather than subjectively scored tests, show a higher reliability.
6. Test-retest interval
The shorter the time interval between two administrations of a test, the less likely that
changes will occur and the higher the reliability will be.
7. Variation with the testing situation
Errors in the testing situation (e.g., students misunderstanding or misreading test
directions, noise level, distractions, and sickness) can cause test scores to vary.
2.3 Four Methods of Establishing Reliability
Researchers use four methods to check the reliability of a test: the test-retest method,
equivalent forms (parallel forms), internal consistency, and inter-scorer (inter-rater) reliability.
Not all of these methods are used for all tests. Each method provides research evidence that the
responses are consistent under certain circumstances.
1. Test-Retest Reliability
Test-retest reliability is established by correlating scores obtained, on two separate
occasions, from the same group of people on the same test. The correlation coefficient obtained
is referred to as the coefficient of stability. With test-retest reliability, one attempts to determine
whether consistent scores are being obtained from the same group of people over time; hence,
one wishes to learn whether scores are stable over time.
For example:
One administers a test, say Test A, to students on August 10, then re-administers the
same test (Test A) to the same students at a later date, say August 25. Scores from the
same person are correlated to determine the degree of association between the two sets. Table 1
shows an example.
Table 1: Example of Test-Retest Scores for Reliability (Test Form A)

Person      August 10 Scores    August 25 Scores
Lailanie         85                  83
Katrina          75                  77
Criana           63                  60
Ella             59                  57
Julie            91                  89
Jaymark          35                  40
Judelyn          55                  60
Usher            95                  99
Nathan           86                  83
Loisa            83                  77
The correlation between the sets of scores in Table 1 is r = .98, which indicates a strong
association between the scores. Note that people who scored high on the first administration
also scored high on the second, and those who scored low on the first administration scored low
on the second. There is a strong relationship between these two sets of scores, indicating high reliability.
The problem with test-retest reliability is that it is only appropriate for instruments for
which individuals are not likely to remember their answers from administration to
administration. Remembering answers will likely inflate, artificially, the reliability estimate. In
general, test-retest reliability is not a very useful method for establishing reliability.
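As a check on the figure reported above, the correlation for Table 1 can be computed directly. The sketch below (plain Python, no statistics package assumed) implements the Pearson r formula on the Table 1 scores:

```python
# Test-retest reliability: correlate the August 10 and August 25 scores
# from Table 1. A Pearson r near 1.0 indicates stable scores over time.
from math import sqrt

aug10 = [85, 75, 63, 59, 91, 35, 55, 95, 86, 83]  # first administration
aug25 = [83, 77, 60, 57, 89, 40, 60, 99, 83, 77]  # second administration

def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    dx = [xi - mean_x for xi in x]
    dy = [yi - mean_y for yi in y]
    return sum(a * b for a, b in zip(dx, dy)) / sqrt(
        sum(a * a for a in dx) * sum(b * b for b in dy))

r = pearson_r(aug10, aug25)
print(round(r, 2))  # 0.98, the coefficient of stability reported in the text
```

Any correlation routine would do here; the formula is written out only to make the computation transparent.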
2. Inter-Scorer/Inter-Rater Reliability
Inter-scorer/inter-rater reliability is used to determine the consistency with which one
rater assigns scores (intra-judge), or the consistency with which two or more raters assign scores
(inter-judge).
Intra-judge reliability refers to a single judge assigning scores. Remember that consistency
requires multiple scores (at least two) in order to establish reliability, so for intra-judge reliability
to be established, a single judge or rater must score something more than once.
For example:
If asked to judge an art exhibit, to establish reliability, a judge must rate the same exhibit
more than once to learn if the judge is reliable in assigning scores. If the judge rates it high once
and low the second time, obviously rater reliability is lacking.
For inter-judge reliability, one is concerned with showing that multiple raters have
consistency in their scoring of something.
For example:
Consider the multiple judges used at the Olympics. For the high dive competition, often
about seven judges are used. If the seven judges provide scores like 5.4, 5.2, 5.1, 5.5, 5.3, 5.4,
and 5.6, then there is some consistency there. If, however, scores are something like 5.4, 4.3,
5.1, 4.9, 5.3, 5.4, and 5.6, then it is clear the judges are not using the same criteria for
determining scores, so they lack consistency and, hence, reliability.
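One informal way to quantify this agreement is to compare the spread of the two sets of judges' scores. The sketch below uses the standard deviation as a rough index; the score lists are the hypothetical ones from the example above:

```python
# Spread of judges' scores as a rough consistency check: a small standard
# deviation means the raters largely agree; a large one signals low
# inter-rater reliability. Score sets are taken from the example above.
from statistics import stdev

consistent = [5.4, 5.2, 5.1, 5.5, 5.3, 5.4, 5.6]    # judges largely agree
inconsistent = [5.4, 4.3, 5.1, 4.9, 5.3, 5.4, 5.6]  # judges disagree

print(round(stdev(consistent), 2))    # small spread
print(round(stdev(inconsistent), 2))  # noticeably larger spread
```

Formal inter-rater statistics (e.g., Cohen's kappa or an intraclass correlation) exist for this purpose; the standard deviation is used here only as a simple illustration.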
3. Equivalent-forms
Equivalent-forms reliability is established in a manner similar to test-retest. Scores are
obtained from the same group of people, but the scores are taken from different forms of a test.
The different forms of the test (or instrument) are designed to measure the same thing, the same
construct. The forms should be as similar as possible, but use different questions or wording. It
is not enough to simply rearrange the item order; rather, new and different items are required
between the two forms. Examples of parallel (equivalent) forms include the SAT, the GRE, and the
Miller Analogies Test (MAT), among others. If you take one of these standardized tests more than
once, it is unlikely that you will take the same form.
To establish equivalent-forms reliability, one administers two forms of an instrument to
the same group of people and correlates the scores. The higher the correlation
coefficient, the higher the equivalent-forms reliability. Table 2 below illustrates this.
Table 2: Example of Equivalent-Forms Reliability

Person      Instrument Form A Scores    Instrument Form B Scores
Lailanie             85                          83
Katrina              75                          77
Ella                 63                          60
Criana               59                          57
Julie                91                          89
Jaymark              35                          40
Judelyn              55                          60
Usher                95                          99
Nathan               86                          83
Loisa                83                          77
The scores are the same as those given in Table 1; the only difference is that these scores come
from two different instruments. The correlation between the sets of scores in Table 2 is r = .98,
which indicates a strong association between the scores. Note that people who scored high on
Form A also scored high on Form B, and those who scored low on Form A scored low on
Form B. There is a strong relationship between these two sets of scores, indicating high reliability.
Equivalent-forms reliability is not a practical method for establishing reliability. One
reason is the difficulty of developing forms of an instrument that are truly parallel
(equivalent). A second problem is the impracticality of asking study participants to complete
two forms of an instrument. In most cases a researcher wishes to use instruments that are as
short and to the point as possible, so asking participants to complete more than one instrument
is often not reasonable.
4. Internal Consistency
Internal consistency is essentially the degree to which similar responses are provided for
items designed to measure the same construct (variable). This is the preferred method of
establishing reliability for most measuring instruments. Internal consistency reliability
represents the consistency with which items on an instrument provide similar scores.
There are four sub-types of internal consistency reliability.
A. Split-Half Reliability
In split-half reliability, a test is given and divided into halves that are scored
separately; the score on one half of the test is then compared with the score on the remaining
half to assess reliability (Kaplan & Saccuzzo, 2001). Split-half reliability is a useful
measure when it is impractical or undesirable to assess reliability with two tests or two
test administrations (because of limited time or money) (Cohen & Swerdlik, 2001).
How to use the Split-Half Method:
1. Divide the test into halves. The most commonly used way to do this is to assign
odd-numbered items to one half of the test and even-numbered items to the other; this is
called odd-even reliability.
2. Find the correlation between the scores on the two halves using the Pearson r
formula.
3. Adjust or re-evaluate the correlation using the Spearman-Brown formula, which raises
the reliability estimate. The longer a test, the more reliable it is, so it is
necessary to apply the Spearman-Brown formula to a test that has been shortened, as we
do in split-half reliability (Kaplan & Saccuzzo, 2001).
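The steps above can be sketched in Python. The 0/1 item responses below are hypothetical, invented only to illustrate the odd-even split and the Spearman-Brown correction:

```python
# Split-half (odd-even) reliability with the Spearman-Brown correction.
# The item responses below are made-up data for illustration only.
from math import sqrt

# Each row: one student's scores on an 8-item test (1 = correct, 0 = wrong).
responses = [
    [1, 1, 1, 1, 1, 1, 0, 1],
    [1, 0, 1, 1, 0, 1, 1, 0],
    [0, 0, 1, 0, 0, 1, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1, 0, 0, 1],
    [1, 1, 0, 1, 1, 1, 1, 1],
]

# Step 1: split each student's test into odd- and even-numbered items.
odd_scores = [sum(r[0::2]) for r in responses]   # items 1, 3, 5, 7
even_scores = [sum(r[1::2]) for r in responses]  # items 2, 4, 6, 8

# Step 2: Pearson correlation between the two half-test scores.
def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    dx = [a - mx for a in x]
    dy = [b - my for b in y]
    return sum(a * b for a, b in zip(dx, dy)) / sqrt(
        sum(a * a for a in dx) * sum(b * b for b in dy))

r_half = pearson_r(odd_scores, even_scores)

# Step 3: Spearman-Brown correction estimates the reliability of the
# full-length test from the half-test correlation.
r_full = (2 * r_half) / (1 + r_half)
print(round(r_half, 2), round(r_full, 2))
```

Note that the corrected coefficient is always at least as large as the half-test correlation, reflecting the point that longer tests tend to be more reliable.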
B. Cronbach Alpha/Coefficient Alpha
The Cronbach Alpha/Coefficient Alpha formula is a general formula for estimating
the reliability of a test consisting of items on which different scoring weights may be
assigned to different responses.
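For items scored on the same scale, coefficient alpha can be computed from the item variances and the variance of the total scores. The sketch below uses hypothetical data and the common variance-based form of the formula:

```python
# Cronbach's alpha: (k / (k - 1)) * (1 - sum of item variances / variance
# of total scores). Item data below is made up, for illustration only.
from statistics import variance  # sample variance (n - 1 denominator)

def cronbach_alpha(items):
    """items: one list of scores per item, all covering the same students."""
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]  # each student's total
    return (k / (k - 1)) * (
        1 - sum(variance(i) for i in items) / variance(totals))

# Two items that score students identically give perfect internal consistency.
identical_items = [[1, 2, 3, 4], [1, 2, 3, 4]]
print(cronbach_alpha(identical_items))  # 1.0
```

Less consistent items pull alpha down; in practice, values around .70 or higher are commonly treated as acceptable for classroom instruments.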
C. Average Inter-Item Correlation
The average inter-item correlation uses all of the items on an instrument that are
designed to measure the same construct.
D. Average Item-Total Correlation
This approach also uses the inter-item correlations; in addition, the computed total
score for the items is used as another variable in the analysis.
LESSON 3
OBJECTIVITY OF THE TEST
Learning Objectives: At the end of this lesson, students are expected to:
• Define objectivity of the test.
• Define and differentiate the different levels of measurement.
3.1 Objectivity
Objectivity represents the agreement of two or more raters or test administrators
concerning the score of a student. If two raters who assess the same student on the same
test cannot agree on a score, the test lacks objectivity and the score of neither judge is valid;
thus, lack of objectivity reduces test validity in the same way that lack of reliability does
(LET Reviewer 2010 ed., B. Concepcion, et al.).
It is the degree to which personal bias is eliminated in the scoring of the answers
(Reviewer for the Licensure Examination for Teachers (LET), 2011 ed.).
Measures of student instructional outcomes are rarely as precise as those of physical
characteristics such as height and weight. Student outcomes are more difficult to define,
and the units of measurement are usually not physical units. The measures we take
on students vary in quality, which prompts the need for different scales of measurement. The terms
that describe the levels of measurement in these scales are nominal, ordinal, interval, and ratio.
3.2 Levels of Measurement
1. Nominal Measurement is the least sophisticated. It merely classifies objects or events by
assigning numbers to them; the numerical values just "name" the attribute uniquely, and no
ordering of the cases is implied. These numbers are arbitrary and imply no quantification,
but the categories must be mutually exclusive and exhaustive. For example, one could
nominally designate baseball positions by assigning the pitcher the numeral 1, the catcher 2,
the first baseman 3, the second baseman 4, and so on. These assignments are arbitrary; no
arithmetic on them is meaningful. For example, 1 plus 2 does not equal 3, because a pitcher
plus a catcher does not equal a first baseman.
2. Ordinal Measurement classifies, but it also assigns rank order. Here, the distances between
attributes do not have any meaning. There is a rough quantitative sense to the
measurement, but the differences between scores are not necessarily equal. The scores are
thus in order, but not fixed. An example is ranking individuals in a class according to their
test scores. Students' scores could be ordered from 1st, 2nd, 3rd, and so forth down to the lowest
scores. Such a scale gives more information than nominal measurement, but still has
limitations. The units of ordinal measurement are most likely unequal.
3. Interval Measurement contains the nominal and ordinal properties and is also
characterized by equal units between score points. In this measurement the distance
between attributes has meaning: for example, when we measure temperature in Fahrenheit,
the distance from 30 to 40 is the same as the distance from 70 to 80. Because the interval
between values is interpretable, it makes sense to compute an average of an interval
variable, whereas it does not make sense to do so for ordinal scales. Note, however, that in
interval measurement ratios do not make any sense: 80 degrees is not twice as hot as 40
degrees (although the attribute value is twice as large). The advantage of equal units of
measurement is straightforward: sums and differences now make sense, both numerically
and logically.
4. Ratio Measurement is the most sophisticated type of measurement. The zero point is not
arbitrary; a score of zero indicates the absence of what is being measured. This means that
you can construct a meaningful fraction (or ratio) with a ratio variable. Weight is a ratio
variable. In applied social research most "count" variables are ratio. For example, if a
person's wealth equaled zero, he or she would have no wealth at all.
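The Fahrenheit example can be demonstrated numerically. The sketch below shows that the ratio of two temperatures changes when the scale changes (interval), while the ratio of two weights does not (ratio):

```python
# Interval scales have arbitrary zero points, so ratios are not meaningful:
# the same two temperatures give different ratios on different scales.
f1, f2 = 80.0, 40.0
c1, c2 = (f1 - 32) * 5 / 9, (f2 - 32) * 5 / 9  # same temperatures in Celsius

print(f1 / f2)  # 2.0 -> "twice as hot" on the Fahrenheit scale...
print(c1 / c2)  # about 6.0 -> ...but not on the Celsius scale

# Ratio scales have a true zero, so ratios survive a change of units.
kg1, kg2 = 80.0, 40.0
lb1, lb2 = kg1 * 2.2046, kg2 * 2.2046  # same weights in pounds
print(kg1 / kg2 == lb1 / lb2)  # True: the ratio is 2.0 either way
```

This is why statements like "twice as heavy" are meaningful for ratio variables but "twice as hot" is not for interval variables.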
BIBLIOGRAPHY
Books:
Buendicho, F. (2010). Assessment of Learning 1 (pp. 56-57). Manila: Rex Bookstore Inc.
Cohen, L.G., & Spenciner, L.J. (2007). Assessment of Children & Youth with Special Needs (p. 43).
Concepcion, B., et al. (2010). Licensure Examination for Teachers (LET) Reviewer, 2010 ed.
Concepcion, Benjamin; Esmane, Manuelito; Espiritu, Rogelio; Gabuyo, Yonardo; Gonzales,
Emmanuel; Padilla, Edward John; Pasigue, Ronnie; Quintao, Myrna; & Yambao, Roel. Licensure
Examination for Teachers. G/F Unit M, Paseo del Colegio Building, S.H. Loyola St., Sampaloc,
Manila: published and exclusively distributed by MEF Review Center.
Duka, Cecilio D. (2011). Reviewer for the Licensure Examination for Teachers (LET), 2011 ed.
Esmane, Manuelito, et al. (2010). Licensure Examination for Teachers Reviewer: 1,000 Test
Questions with Rationalization, 2010 ed. Cavite: Tristan Printing Co.
Jackson, Sherri L. (2011). Research Methods and Statistics: A Critical Thinking Approach
(4th ed., pp. 69-71). Wadsworth Publishing.
Kaplan, R.M., & Saccuzzo, D.P. (2001). Psychological Testing: Principles, Applications and Issues
(5th ed.). Belmont, CA: Wadsworth.
Key, James P. (1997). Research Design in Occupational Education, Module R10: Validity and
Reliability. Oklahoma State University.
MET Review Center. Licensure Examination for Teachers (p. 425). G/F Unit, Paseo del Colegio
Building, S.H. Loyola St., Sampaloc, Manila.
Ornstein, Allan C. Strategies for Effective Teaching. New York: Harper Collins.
Regarit, A.R., Elicay, R.S.P., & Laguete, C.C. (2010). Assessment of Student Learning 1
(Cognitive Learning) (p. 18). Quezon City: C & E Publishing Inc.
Online:
http://www.hmi.missouri.edu/course_materials/Executive_HSM/semesters/s99materials/Hsm450/boren/measurement.htm
http://www.bwgriffin.com/gsu/courses/edur7130/content/reliability.htm
http://www.socialresearchmethods.net/kb/reltypes.php
http://web.sau.edu/WaterStreetMaryA/NEW%20intro%20to%20tests%20&%20measures%20website_files/reliability.htm
http://books.google.com.ph/books?id=YXHuw_aIIgYC&pg=PA69&lpg=PA69&dq=four+methods+of+establishing+reliability&source=bl&ots=kSQOPTtb0r&sig=KifUiE55XnGRhC0gbB5HlqWaKc&hl=fil&sa=X&ei=AVK4UM3rMub-iAensYCoBQ&ved=0CFkQ6AEwBQ#v=onepage&q=four%20methods%20of%20establishing%20reliability&f=false
http://www.socialresearchmethods.net/kb/measlevl.php
http://www.polsci.wvu.edu/duval/ps601/Notes/Levels_of_Measure.html
http://infinity.cos.edu/faculty/woodbury/stats/tutorial/ Data Levels.html
http://www.med.ottawa.ca/sim/data/Measurement.html
http://explorable.com/types-of-validity.html
http://www.proftesting.com/test_topics/pdfs/test_quality_validity.pdf
http://changingminds.org/explanations/research/design/types_validity.htm
http://jfmueller.faculty.noctrl.edu/toolbox/tests/gooditems.htm
http://oct.sfsu.edu/assessment/evaluating/htmls/improve_rel_val.html
http://www.netplaces.com/controlling-anxiety/psycological-testing/when-is-a-test-invalid.htm
http://pedritaranoajordan.blogspot.com/2011/01/validity-vs-reliability.html
http://www.appstate.edu/~bacharachvr/chapter8-validity.pdf
http://www.nagb.org/content/nagb/assets/documents/naep/cizek-introduction-validity.pdf
http://www.spanglefish.com/adamsonuniversity/documents/Psychological%20Measurement/Chapter_3_Psychometrics_Reliatility_Validity.pdf
www.education.com
http://books.google.com.ph/books?id=i-