Module 5
CHARACTERISTICS OF A GOOD TEST

ED 103 Assessment of Student Learning
CSED; MTh 10:30-12:00

Group 5
Leader: Jay Mark Biboso
Secretary: Julie Ann Leoberas
Members: Ella Jane Tagala, Katrina Alona D. Bustria, Loisa Mae Paje, Judelyn Mae Francisco, Jachinlee Laspiñas, Angie Mae Padrillan, Gebelyn Chavez, Alyssa May Nanat

Module 5: Characteristics of a Good Test
Lesson 1: Validity of the Test
a. Important Things to Remember about Validity
b. Types of Validity
c. Factors Affecting the Validity of a Test Item
d. Ways to Reduce the Validity of the Test
Lesson 2: Reliability of the Test
a. Factors Affecting the Reliability of a Test
b. Four Methods of Establishing Reliability
Lesson 3: Objectivity of the Test

LESSON 1
VALIDITY OF THE TEST

Learning Objectives:
Upon successful completion of this lesson, students will be able to:
Understand the definition of validity.
Know some of the important things to remember about validity.
Identify the different types of validity.
Know the importance of the validity of a test and how it is judged.
Enumerate the factors affecting the validity of a test item.
Give the ways in which the validity of a test can be reduced.

1.1 Important Things to Remember about Validity

Validity refers to the appropriateness of score-based inferences, or decisions made on the basis of students' test results; it is the extent to which a test measures what it is supposed to measure. Validity is "the degree to which evidence and theory support the interpretations of test scores entailed by the proposed uses of a test" (AERA, APA, & NCME, 1999, p. 9).

1. Validity refers to the decisions we make based on a test, not to the test itself or to the measurement.
2. Like reliability, validity is not an all-or-nothing concept; it is never totally absent or absolutely perfect.
3. A validity estimate, called a validity coefficient, refers to a specific type of validity. It ranges from 0 to 1.
4. Validity can never be finally determined; it is specific to each administration of the test.

1.2 Types of Validity

1. Content Validity determines the extent to which the assessment is representative of the domain of interest.
2. Construct Validity defines how well a test or experiment measures up to its claims. A test designed to measure depression must measure only that particular construct, not closely related constructs such as anxiety or stress.
   Convergent Validity tests that constructs expected to be related are, in fact, related.
   Discriminant Validity occurs when constructs that are expected not to be related are, in fact, unrelated, such that it is possible to discriminate between these constructs.
3. Internal Validity is the measure which ensures that a researcher's experimental design closely follows the principle of cause and effect.
4. Conclusion Validity occurs when you can conclude that there is a relationship of some kind between the two variables being examined.
5. External Validity occurs when the causal relationship discovered can be generalized to other people, times, and contexts.
6. Criterion Validity determines the relationship between an assessment and another measure of the same trait.
   Concurrent Validity measures the test against a benchmark test; a high correlation indicates that the test has strong criterion validity.
   Predictive Validity is a measure of how well a test predicts ability.
7. Face Validity occurs when something appears to be valid. This, of course, depends very much on the judgment of the observer.
8. Instructional Validity determines to what extent the domain of content in the test is taught in class.

1.3 Factors Affecting the Validity of the Test Item

Factors affecting the validity of a test item include:
The test item itself.
Unclear test directions.
Personal factors influencing how students respond to the test.
Arrangement of the test.
Length of the test.
Untaught items.

1.4 Ways to Reduce the Validity of the Test

Validity refers to the accuracy of an assessment; it is the most important consideration in test evaluation. The term validity refers to whether or not the test measures what it claims to measure. The following flaws reduce validity:

1. Poorly constructed test items. Validity will also be affected by how closely the selection of a correct answer on a test reflects mastery of the material contained in the standards. Any additional information that is irrelevant to the question can distract or confuse the student, thus providing an alternative explanation for why the item was missed. Keep it simple.

Example:

Change:
The purchase of the Louisiana Territory, completed in 1803 and considered one of Thomas Jefferson's greatest accomplishments as president, primarily grew out of our need for
a. the port of New Orleans*
b. helping Haitians against Napoleon
c. the friendship of Great Britain
d. control over the Indians

To:
The purchase of the Louisiana Territory primarily grew out of our need for
a. the port of New Orleans*
b. helping Haitians against Napoleon
c. the friendship of Great Britain
d. control over the Indians

2. Unclear directions.
3. Ambiguous items.
4. Reading vocabulary that is too difficult.
5. Complicated syntax. Keep the grammar consistent between the stem and the alternatives.

Example:

Change:
What is the dietary substance that is often associated with heart disease and is often found in high levels in the blood?
a. glucose
b. cholesterol*
c. beta carotene
d. Proteins

To:
What is the dietary substance that is often associated with heart disease and is often found in high levels in the blood?
a. glucose
b. cholesterol*
c. beta carotene
d. protein

6. Inadequate time limit.
7. Inappropriate level of difficulty. If the test is too long or appears too difficult, the examinees will be tempted to guess, and this will increase error directly.
8. Unintended clues. Avoid providing clues for one item in the wording of another item on the test.
9. Improper arrangement of items.

LESSON 2
RELIABILITY OF THE TEST

Learning Objectives:
After the report, students should be able to:
Know the importance of reliability.
Identify the different factors that affect the reliability of a test.
Know the four methods of establishing reliability.
Understand the importance of the four methods of establishing reliability.
Have more knowledge of the application of the methods of reliability.
Apply the different methods of establishing reliability.
2.1 Reliability

Reliability is a measure of how much you can trust the results of a test. Tests often have high reliability, but at the expense of validity. In other words, reliability refers to the ability to measure something consistently, that is, to obtain consistent scores every time something is measured.

2.2 Factors Affecting the Reliability of a Test

1. Test length. Generally, the longer a test is, the more reliable it is.
2. Speed. When a test is a speed test, reliability can be problematic. It is inappropriate to estimate reliability using internal consistency, test-retest, or alternate-form methods, because not every student is able to complete all of the items in a speed test. In contrast, a power test is a test in which every student is able to complete all the items.
3. Group homogeneity. In general, the more heterogeneous the group of students who take the test, the more reliable the measure will be.
4. Item difficulty. When there is little variability among test scores, reliability will be low. Thus, reliability will be low if a test is so easy that every student gets most or all of the items correct, or so difficult that every student gets most or all of the items wrong.
5. Objectivity. Objectively scored tests, rather than subjectively scored tests, show higher reliability.
6. Test-retest interval. The shorter the time interval between two administrations of a test, the less likely that changes will occur and the higher the reliability will be.
7. Variation within the testing situation. Errors in the testing situation (e.g., students misunderstanding or misreading test directions, noise level, distractions, and sickness) can cause test scores to vary.

2.3 Four Methods of Establishing Reliability

Researchers use four methods to check the reliability of a test: the test-retest method, equivalent forms (parallel forms), internal consistency, and inter-scorer (inter-rater) reliability. Not all of these methods are used for all tests. Each method provides research evidence that the responses are consistent under certain circumstances.

1. Test-Retest Reliability

Test-retest reliability is established by correlating scores obtained, on two separate occasions, from the same group of people on the same test. The correlation coefficient obtained is referred to as the coefficient of stability. With test-retest reliability, one attempts to determine whether consistent scores are being obtained from the same group of people over time; hence, one wishes to learn whether scores are stable over time.

For example: One administers a test, say Test A, to students on August 10, then re-administers the same test (Test A) to the same students at a later date, say on August 25. Scores from the same person are correlated to determine the degree of association between the two sets. Table 1 shows an example.

Table 1: Example of Test-Retest Scores for Reliability, Test Form A

Person      August 10 Administration Scores    August 25 Administration Scores
Lailanie    85                                 83
Katrina     75                                 77
Criana      63                                 60
Ella        59                                 57
Julie       91                                 89
Jaymark     35                                 40
Judelyn     55                                 60
Usher       95                                 99
Nathan      86                                 83
Loisa       83                                 77

The correlation between the sets of scores in Table 1 is r = .98, which indicates a strong association between the scores.
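To make the computation concrete, the short Python sketch below reproduces the r = .98 coefficient of stability from the Table 1 data. The sketch is ours, not part of the original module; the helper name pearson_r is our own (Python 3.10+ users could call statistics.correlation instead).

```python
# Estimating test-retest reliability for the Table 1 data.
# The scores come straight from the module; pearson_r is our own helper.
from math import sqrt

aug_10 = [85, 75, 63, 59, 91, 35, 55, 95, 86, 83]  # first administration
aug_25 = [83, 77, 60, 57, 89, 40, 60, 99, 83, 77]  # second administration

def pearson_r(x, y):
    """Pearson product-moment correlation between two lists of scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

print(round(pearson_r(aug_10, aug_25), 2))  # 0.98, the coefficient of stability
```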
Note that people who scored high on the first administration also scored high on the second, and those who scored low on the first administration scored low on the second. There is a strong relationship between these two sets of scores, hence high reliability.

The problem with test-retest reliability is that it is only appropriate for instruments on which individuals are not likely to remember their answers from administration to administration. Remembering answers will likely inflate the reliability estimate artificially. In general, test-retest reliability is not a very useful method for establishing reliability.

2. Inter-Scorer/Inter-Rater Reliability

Inter-scorer/inter-rater reliability is used to determine the consistency with which one rater assigns scores (intra-judge), or the consistency with which two or more raters assign scores (inter-judge).

Intra-judge reliability refers to a single judge assigning scores. Remember that consistency requires multiple scores (at least two) in order to establish reliability, so for intra-judge reliability to be established, a single judge or rater must score something more than once.

For example: If asked to judge an art exhibit, a judge must, to establish reliability, rate the same exhibit more than once to learn whether he or she is reliable in assigning scores. If the judge rates it high once and low the second time, rater reliability is obviously lacking.

For inter-judge reliability, one is concerned with showing that multiple raters are consistent in their scoring of something.

For example: Consider the multiple judges used at the Olympics. For the high-dive competition, about seven judges are often used. If the seven judges provide scores like 5.4, 5.2, 5.1, 5.5, 5.3, 5.4, and 5.6, then there is some consistency there. If, however, the scores are something like 5.4, 4.3, 5.1, 4.9, 5.3, 5.4, and 5.6, then it is clear the judges are not using the same criteria for determining scores, so they lack consistency and, therefore, reliability.
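As an informal numerical check of the two judging scenarios above, one can compare the spread of the ratings quoted in the module. This sketch is ours; a formal treatment of inter-rater reliability would use a statistic such as the intraclass correlation or Cohen's kappa, which this module does not cover.

```python
# Comparing the spread of the two sets of Olympic judges' scores quoted
# above. A small standard deviation suggests the judges share criteria;
# a large one suggests they do not. This is only an informal check, not a
# formal inter-rater reliability coefficient.
from statistics import stdev

consistent = [5.4, 5.2, 5.1, 5.5, 5.3, 5.4, 5.6]
inconsistent = [5.4, 4.3, 5.1, 4.9, 5.3, 5.4, 5.6]

print(round(stdev(consistent), 2))    # 0.17 -- tight agreement
print(round(stdev(inconsistent), 2))  # 0.44 -- judges disagree
```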
3. Equivalent-Forms Reliability

Equivalent-forms reliability is established in a manner similar to test-retest. Scores are obtained from the same group of people, but the scores are taken from different forms of a test. The different forms of the test (or instrument) are designed to measure the same thing, the same construct. The forms should be as similar as possible but use different questions or wording. It is not enough to simply rearrange the item order; rather, new and different items are required between the two forms. Examples of tests with parallel (equivalent) forms include the SAT, the GRE, and the Miller Analogies Test (MAT), among others. If you take one of these standardized tests more than once, it is unlikely that you will take the same form.

To establish equivalent-forms reliability, one administers two forms of an instrument to the same group of people, takes the scores, and correlates them. The higher the correlation coefficient, the higher the equivalent-forms reliability. Table 2 below illustrates this.

Table 2: Example of Equivalent-Forms Reliability

Person      Instrument Form A Scores    Instrument Form B Scores
Lailanie    85                          83
Katrina     75                          77
Ella        63                          60
Criana      59                          57
Julie       91                          89
Jaymark     35                          40
Judelyn     55                          60
Usher       95                          99
Nathan      86                          83
Loisa       83                          77

The scores are the same as given in Table 1; the only difference is that these scores come from two different instruments. The correlation between the sets of scores in Table 2 is r = .98, which indicates a strong association between the scores. Note that people who scored high on Form A also scored high on Form B, and those who scored low on Form A scored low on Form B. There is a strong relationship between these two sets of scores, hence high reliability.

Equivalent-forms reliability is not a practical method for establishing reliability. One reason is the difficulty of developing forms of an instrument that are truly parallel (equivalent). A second problem is the impracticality of asking study participants to complete two forms of an instrument. In most cases a researcher wishes to use instruments that are as short and to the point as possible, so asking someone to complete more than one instrument is often unreasonable.

4. Internal Consistency

Internal consistency is essentially the degree to which similar responses are provided for items designed to measure the same construct (variable). This is the preferred method of establishing reliability for most measuring instruments. Internal consistency reliability represents the consistency with which items on an instrument provide similar scores. There are four sub-types of internal consistency reliability.

A. Split-Half Reliability

In split-half reliability, a test is given and divided into halves that are scored separately; the scores on one half of the test are then compared with the scores on the remaining half to assess reliability (Kaplan & Saccuzzo, 2001). Split-half reliability is a useful measure when it is impractical or undesirable to assess reliability with two tests or to have two test administrations (because of limited time or money) (Cohen & Swerdlik, 2001).

How to use the split-half method (a code sketch of this procedure follows the list of sub-types below):
- Divide the test into halves. The most commonly used way to do this is to assign odd-numbered items to one half of the test and even-numbered items to the other; this is called odd-even reliability.
- Find the correlation of scores between the two halves by using the Pearson r formula.
- Adjust or re-evaluate the correlation using the Spearman-Brown formula, which raises the estimated reliability to reflect the full test length. The longer a test, the more reliable it is, so it is necessary to apply the Spearman-Brown formula to a test that has been shortened, as we do in split-half reliability (Kaplan & Saccuzzo, 2001).

B. Cronbach Alpha/Coefficient Alpha

The Cronbach alpha/coefficient alpha formula is a general formula for estimating the reliability of a test consisting of items on which different scoring weights may be assigned to different responses.

C. Average Inter-Item Correlation

The average inter-item correlation uses all of the items on an instrument that are designed to measure the same construct.

D. Average Item-Total Correlation

This approach also uses the inter-item correlations; in addition, the computed total score for the items is used as another variable in the analysis.
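The split-half procedure described under sub-type A can be sketched in a few lines. The 5-student, 6-item response matrix below is invented purely for illustration; the correction step uses the standard two-half form of the Spearman-Brown formula, r_full = 2 * r_half / (1 + r_half).

```python
# A minimal sketch of odd-even split-half reliability with the
# Spearman-Brown correction. The response matrix is hypothetical
# (rows = students, 1 = correct, 0 = incorrect).
from statistics import correlation  # Pearson r; requires Python 3.10+

responses = [
    [1, 1, 1, 0, 1, 1],
    [1, 0, 1, 1, 0, 1],
    [0, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 1, 1],
    [0, 1, 0, 0, 1, 0],
]

# Step 1: odd-even split -- items 1, 3, 5 vs. items 2, 4, 6.
odd_half = [sum(row[0::2]) for row in responses]
even_half = [sum(row[1::2]) for row in responses]

# Step 2: Pearson r between the two half-test scores.
r_half = correlation(odd_half, even_half)

# Step 3: Spearman-Brown correction for the full-length test:
# r_full = 2 * r_half / (1 + r_half)
r_full = 2 * r_half / (1 + r_half)

print(round(r_half, 2), round(r_full, 2))  # 0.68 and 0.81 for this toy data
```

Note how the correction raises the half-test estimate (0.68) toward the reliability expected of the full-length test (0.81).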
LESSON 3
OBJECTIVITY OF THE TEST

Learning Objectives:
At the end of this lesson, students are expected to:
Define the objectivity of a test.
Define and differentiate the different levels of measurement.

3.1 Objectivity

Objectivity represents the agreement of two or more raters or test administrators concerning the score of a student. If two raters who assess the same student on the same test cannot agree on a score, the test lacks objectivity, and the score given by neither judge is valid; thus, lack of objectivity reduces test validity in the same way that lack of reliability influences validity (LET Reviewer 2010 ed., B. Concepcion, et al.). It is the degree to which personal bias is eliminated in the scoring of the answers (Reviewer for the Licensure Examination for Teachers (LET), 2011 ed.).

Measures of student instructional outcomes are rarely as precise as those of physical characteristics such as height and weight. Student outcomes are more difficult to define, and the units of measurement are usually not physical units. The measures we take of students vary in quality, which prompts the need for different scales of measurement. The terms that describe the levels of measurement in these scales are nominal, ordinal, interval, and ratio.

3.2 Levels of Measurement

1. Nominal measurement is the least sophisticated. It merely classifies objects or events by assigning numbers to them; the numerical values just "name" the attribute uniquely, and no ordering of the cases is implied. These numbers are arbitrary and imply no quantification, but the categories must be mutually exclusive and exhaustive. For example, one could nominally designate baseball positions by assigning the pitcher the numeral 1, the catcher 2, the first baseman 3, the second baseman 4, and so on. These assignments are arbitrary; no arithmetic on them is meaningful. For example, 1 plus 2 does not equal 3, because a pitcher plus a catcher does not equal a first baseman.

2. Ordinal measurement classifies, but it also assigns rank order. Here, the distances between attributes do not have any meaning. There is a rough quantitative sense to the measurement, but the differences between scores are not necessarily equal; the scores are in order, but the intervals are not fixed. An example is ranking individuals in a class according to their test scores: students' scores could be ordered 1st, 2nd, 3rd, and so forth down to the lowest score. Such a scale gives more information than nominal measurement but still has limitations; the units of ordinal measurement are most likely unequal.

3. Interval measurement contains the nominal and ordinal properties and is also characterized by equal units between score points. In this measurement the distance between attributes does have meaning. For example, when we measure temperature in Fahrenheit, the distance from 30 to 40 degrees is the same as the distance from 70 to 80 degrees. Because the interval between values is interpretable, it makes sense to compute an average of an interval variable, whereas it does not make sense to do so for ordinal scales. Note, however, that in interval measurement ratios do not make any sense: 80 degrees is not twice as hot as 40 degrees (although the attribute value is twice as large). The advantage of equal units of measurement is straightforward: sums and differences now make sense, both numerically and logically.

4. Ratio measurement is the most sophisticated type of measurement. The zero point is not arbitrary; a score of zero indicates the absence of what is being measured. This means that you can construct a meaningful fraction (or ratio) with a ratio variable. Weight is a ratio variable, and in applied social research most "count" variables are ratio. For example, if a person's wealth equaled zero, he or she would have no wealth at all.
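To make the interval-versus-ratio distinction concrete, the small sketch below (ours, using the module's Fahrenheit example) shows that the apparent 2:1 ratio between 80 and 40 degrees Fahrenheit disappears under a change of units, while the ratio between two weights survives one.

```python
# Why ratios are meaningless on an interval scale (the module's
# Fahrenheit example) but meaningful on a ratio scale (weight).

def fahrenheit_to_celsius(f):
    return (f - 32) * 5 / 9

# Interval scale: the 2:1 ratio disappears under a change of units.
print(80 / 40)                                                # 2.0
print(fahrenheit_to_celsius(80) / fahrenheit_to_celsius(40))  # 6.0

# Ratio scale: 80 kg is twice 40 kg in pounds as well.
KG_TO_LB = 2.20462
print((80 * KG_TO_LB) / (40 * KG_TO_LB))                      # 2.0
```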
BIBLIOGRAPHY

Books:

Buendicho, F. (2010). Assessment of Learning 1. Manila, Philippines: Rex Bookstore Inc., pp. 56-57.

Concepcion, B., et al. (2010). Licensure Examination for Teachers (LET) Reviewer, 2010 ed.

Concepcion, Benjamin; Esmane, Manuelito; Espiritu, Rogelio; Gabuyo, Yonardo; Gonzales, Emmanuel; Padilla, Edward John; Pasigue, Ronnie; Quintao, Myrna; & Yambao, Roel. Licensure Examination for Teachers. G/F Unit M, Paseo del Colegio Building, S.H. Loyola St., Sampaloc, Manila: published and exclusively distributed by the MET Review Center.

Cohen, L.G., & Spenciner, L.J. (2007). Excerpt from Assessment of Children & Youth with Special Needs, 2007 edition, p. 43.

Duka, Cecilio D. (2011). Reviewer for the Licensure Examination for Teachers (LET), 2011 ed.

Esmane, Manuelito, et al. (2010). Licensure Examination for Teachers Reviewer: 1,000 Test Questions with Rationalization, 2010 edition. Cavite, Philippines: Tristan Printing Co.

Jackson, Sherri L. (2011). Research Methods and Statistics: A Critical Thinking Approach (4th ed.), pp. 69-71. Wadsworth Publishing.

Kaplan, R.M., & Saccuzzo, D.P. (2001). Psychological Testing: Principles, Applications and Issues (5th ed.). Belmont, CA: Wadsworth.

Key, James P. (1997). Research Design in Occupational Education, Module R10: Validity and Reliability. Oklahoma State University.

MET Review Center. Licensure Examination for Teachers, p. 425. G/F Unit, Paseo del Colegio Building, S.H. Loyola St., Sampaloc, Manila.

Ornstein, Allan C. Strategies for Effective Teaching. New York: HarperCollins.

Regarit, A.R., Elicay, R.S.P., & Laguete, C.C. (2010). Assessment of Student Learning 1 (Cognitive Learning), p. 18. Quezon City: C & E Publishing Inc.
Online:

http://www.hmi.missouri.edu/course_materials/Executive_HSM/semesters/s99materials/Hsm450/boren/measurement.htm
http://www.bwgriffin.com/gsu/courses/edur7130/content/reliability.htm
http://www.socialresearchmethods.net/kb/reltypes.php
http://web.sau.edu/WaterStreetMaryA/NEW%20intro%20to%20tests%20&%20measures%20website_files/reliability.htm
http://books.google.com.ph/books?id=YXHuw_aIIgYC&pg=PA69&lpg=PA69&dq=four+methods+of+establishing+reliability&source=bl&ots=kSQOPTtb0r&sig=KifUiE55XnGRhC0gbB5HlqWaKc&hl=fil&sa=X&ei=AVK4UM3rMub-iAensYCoBQ&ved=0CFkQ6AEwBQ#v=onepage&q=four%20methods%20of%20establishing%20reliability&f=false
http://www.socialresearchmethods.net/kb/measlevl.php
http://www.polsci.wvu.edu/duval/ps601/Notes/Levels_of_Measure.html
http://infinity.cos.edu/faculty/woodbury/stats/tutorial/Data_Levels.html
http://www.med.ottawa.ca/sim/data/Measurement.html
http://explorable.com/types-of-validity.html
http://www.proftesting.com/test_topics/pdfs/test_quality_validity.pdf
http://changingminds.org/explanations/research/design/types_validity.htm
http://jfmueller.faculty.noctrl.edu/toolbox/tests/gooditems.htm
http://oct.sfsu.edu/assessment/evaluating/htmls/improve_rel_val.html
http://www.netplaces.com/controlling-anxiety/psycological-testing/when-is-a-test-invalid.htm
http://pedritaranoajordan.blogspot.com/2011/01/validity-vs-reliability.html
http://www.appstate.edu/~bacharachvr/chapter8-validity.pdf
http://www.nagb.org/content/nagb/assets/documents/naep/cizek-introduction-validity.pdf
http://www.spanglefish.com/adamsonuniversity/documents/Psychological%20Measurement/Chapter_3_Psychometrics_Reliatility_Validity.pdf
www.education.com
http://books.google.com.ph/books?id=i-