Planning Classroom Tests
Designing Test and Item Specifications to Help Ensure Validity and Reliability
You learned last week that measurement is the process of assigning a number to express the
degree to which a particular characteristic is present. Most of us know the common examples of
height, in which the scale used to measure it is in feet and inches, and weight, in which the scale
used to measure it is in pounds. The task here is to measure learning effectively and accurately,
and skills attainment is usually measured as a score on an achievement (test) instrument. The
process of evaluation, or a procedure used to determine quality, is then used to determine if the
test does what it is intended to do. Test evaluation involves comparing the measure of
performance to some established criteria in order to establish its worth in skills attainment.
Test evaluation of the instrument itself involves asking the following:
1. If the test were to be given repeatedly, would it consistently yield the same score for each
individual (reliability)?
2. Does the test measure what it is supposed to measure (validity)?
3. Is the cost of the test reasonable, and is it practical for the purpose intended (utility)?
Criterion-Referenced Tests
Most classroom teachers are concerned with specific content and skills that are subsumed under
state standards and design and develop tests that are specifically linked to these standards in
order to determine learning. The content being measured, in which the student is to acquire
skills, is often referred to as a domain, i.e., there is an existing goal framework of subordinate
skills with hierarchical and non-hierarchical relationships that is to be learned in order to
accomplish a particular goal, or overall learning task. Tests that measure such specific content
and skills are referred to as Criterion-Referenced Tests (CRT) in which the items are drawn from
a delineated domain of tasks. CRTs are also referred to as objective-referenced tests in that
specific measurable behavioral objectives are developed before test construction and guide item
development. To guide item development, and to link the behavioral objectives to the specific
types of behavior that the learner is to demonstrate, a Table of Specifications is developed.
Table of Specifications
A table of specifications is a two-way matrix that lists the objectives to be learned along with
the level of learning required, and also indicates the item format and the number of items to be
used on the test. Such a table ensures content validity, or the appropriateness of inferences
that are to be made from the test scores for particular subject matter. As stated earlier,
behavioral objectives are developed by using the behavior and content directly from the goal
framework with the addition of conditions of performance. Questions include the following (a
minimal sketch of such a table as a data structure follows the list):
1. Do the behavioral objectives represent the goal framework?
A direct representation ensures that there is content validity, i.e., the objectives represent the
main content area and are balanced across the different levels of learning.
2. How will test results be used?
There are several uses for test results, including remediation (entry-level skills), evaluation of
the instruction (practice tests), enrichment (when it has been determined that much of the
learning has already occurred), and ensuring that there were lessons for all of the objectives
(content validity).
3. Which test item format should be used to measure achievement of the objective?
It is necessary to ensure that there is a direct match between the behavior that is stated in the
objective and the type of response that is required of the student.
4. How many items should be included in the table of specifications and the test so as to
measure performance adequately (number of items for minimal mastery)?
Simple skills that are measured at low levels of learning, such as knowledge-level tasks,
usually require only one item because the learner either knows the correct answer or does
not. More complex levels of learning, such as application-level tasks and higher, need
several items in order to allow for discrimination between those learners who do indeed
know the content and those learners who do not. For example, let's say you have a safety
objective with 4 items measuring the objective. In this case you would probably want every
item passed since the objective is so important. For other complex objectives, minimal
mastery may be 3 out of 4, or 3 or 4 out of 5 (at least a simple majority).
5. When and how will the test be given?
Various types of tests serve various purposes. For example, pretests help the teacher plan
lessons, determine what remediation is needed, and decide which objectives to emphasize or
de-emphasize, depending on what skills the learners were already determined to possess.
Practice tests, or post-instruction tests, help the learners to practice the skills they are
learning, help the teacher evaluate the quality of each lesson, and determine if additional
review or enrichment activities are needed. Post-tests, which are given after all instruction
and review, should be announced in advance to allow for adequate learner preparation.
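
To make the structure of a table of specifications concrete, here is a minimal sketch of one as a
data structure in Python. The objectives, Bloom levels, item formats, item counts, and mastery
criteria shown are hypothetical examples, not drawn from any particular unit.

```python
# A minimal sketch of a table of specifications as a data structure.
# All objectives, levels, formats, and mastery counts are hypothetical.

from dataclasses import dataclass

@dataclass
class SpecRow:
    objective: str     # behavioral objective being measured
    bloom_level: str   # level of learning (Bloom's Taxonomy)
    item_format: str   # item type matched to the stated behavior
    n_items: int       # number of items used to measure the objective
    min_mastery: int   # items that must be passed for minimal mastery

table = [
    SpecRow("Define a noun", "Knowledge", "completion", 1, 1),
    SpecRow("Classify words as person/place/thing", "Comprehension", "matching", 6, 5),
    SpecRow("Use nouns correctly in sentences", "Application", "short answer", 4, 3),
]

total_items = sum(row.n_items for row in table)
print(f"Test length: {total_items} items")
for row in table:
    print(f"{row.objective}: {row.n_items} {row.item_format} item(s), "
          f"mastery = {row.min_mastery}/{row.n_items}")
```

Note how the single knowledge-level objective gets one item while the application-level objective
gets several, with mastery set at 3 out of 4, following the reasoning in question 4 above.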
Reliability
Reliability refers to the consistency of the test scores that would be obtained if the learner
undertook repeated measures. Similar scores over time indicate a reliable measure, which means
that one has more closely obtained the learner's "true score." Many factors can affect a test
score, including memory loss, fatigue, attention, and motivation, all of which introduce random
error. Random errors cause the learner to potentially miss items on content they truly do know,
thus impairing the ability to obtain the "true score." (One common way to estimate reliability
from a single administration is sketched after the list below.) To increase reliability one must:
1. Obtain a representative sample of objectives from the goal framework, select an appropriate
number of items per objective, place the same number of items on different forms, and
balance the test in terms of difficulty.
2. Select items that adequately represent the skills that are being measured, and make sure the
items cover all possible categories. For example, when covering the learning goal "nouns,"
make sure person, place, and thing are all tested and not just person.
3. Format the items so as to reduce guessing, which will reduce the reliability of the scores
obtained.
4. Make sure the students have ample time to answer all of the items, as timed tests create
anxiety, which will also reduce the reliability of the scores obtained.
5. Try to maintain a positive learner attitude when testing. Testing should never be used as a
punishment technique (no "pop" quizzes), as a substitute for a lesson plan (when one is not
prepared for the day), etc.
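
One widely used way to estimate reliability from a single test administration is the
Kuder-Richardson Formula 20 (KR-20), which applies to items scored right/wrong. Below is a
minimal sketch; the response matrix is made-up illustration data, not real test results.

```python
# A sketch of the Kuder-Richardson Formula 20 (KR-20), a common
# internal-consistency reliability estimate for tests scored 0/1.

def kr20(responses):
    """responses: list of learners, each a list of 0/1 item scores."""
    n_learners = len(responses)
    k = len(responses[0])                    # number of items
    totals = [sum(r) for r in responses]     # total score per learner
    mean = sum(totals) / n_learners
    # Population variance of total scores, used here for simplicity.
    variance = sum((t - mean) ** 2 for t in totals) / n_learners
    # p = proportion passing each item, q = 1 - p.
    pq_sum = 0.0
    for i in range(k):
        p = sum(r[i] for r in responses) / n_learners
        pq_sum += p * (1 - p)
    return (k / (k - 1)) * (1 - pq_sum / variance)

scores = [
    [1, 1, 0, 1, 1],
    [1, 0, 0, 1, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0],
]
print(f"KR-20 estimate: {kr20(scores):.2f}")   # KR-20 estimate: 0.81
```

Values closer to 1.0 indicate more consistent scores; the practices in the list above (adequate
item counts, balanced difficulty, reduced guessing) all work to push this estimate upward.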
Types of Subordinate Skills
It was mentioned earlier that various types of tests serve various purposes. Skills that learners
should already possess before an instructional unit begins are called prerequisite skills and the
tests that measure them are called readiness tests. Readiness tests gather information to let the
teacher know if remediation is necessary, if the goal to be taught is too complex, or if the
learners consist of masters or non-masters. Skills that focus on the current instruction are called
enabling skills and are measured using pretests, which are used for lesson planning, let the
teacher know what to emphasize during the instruction, and enable grouping of the students by
mastery level. Practice tests are usually given on a subset of skills immediately after a segment
of instruction and are usually linked to in-class or homework assignments. Data from practice
tests are used to determine instructional effectiveness, mastery, or new presentations; regrouping
of the learners according to mastery; for diagnosing misconceptions during the instruction; and
for the development of review materials. Practice tests also lessen test anxiety that often occurs
during post-tests. Post-tests are used at the actual end of instruction to evaluate student progress
and assign grades. Examples of post-tests are quizzes, unit tests, and comprehensive exams.
When developing tables of test specifications for individual units, it is important to group the
objectives and classify them by their level of learning (Bloom) and whether they are prerequisite
or enabling skills. Then for each objective, it should be decided what item format is most
appropriate in measuring the stated behavior. Next, it should be decided what number of items is
needed to adequately measure each objective in order to establish minimal learning levels. Also,
one should review the evaluation decision that is going to be made regarding learning progress
(pretest, practice, post) and, from the table, select a subset or the entire set of skills and
objectives for the needed test. For example, using the nutritional analysis framework below, a
practice test could be developed for just carbohydrates, proteins, and fats, and a comprehensive
posttest could be developed for just the highest-level skills. (A sketch of this kind of subset
selection follows the framework.) A comprehensive posttest should balance all of the goals that
were taught, should be balanced across all of the levels of learning (with time for the learner
to consider and answer each item), include an adequate number of items in order to determine
minimal competency, and represent all of the skills in the framework (content validity).
[Goal framework diagram: the overall goal "Foods," with subordinate skill branches for
carbohydrates, proteins, and fats; in the original, the branches were color-coded by test subset.]
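
As a concrete illustration of selecting subsets from a unit's specification table, here is a
minimal sketch. The rows are hypothetical objectives loosely modeled on the Foods framework
above, not an actual unit plan.

```python
# A sketch of selecting subsets of a specification table for different
# tests. Each row: (objective, content_area, level, skill_type).
# All entries are hypothetical examples.

spec = [
    ("Identify common carbohydrates", "carbohydrates", "Knowledge", "enabling"),
    ("Identify common proteins", "proteins", "Knowledge", "enabling"),
    ("Identify common fats", "fats", "Knowledge", "enabling"),
    ("Analyze a meal for nutritional balance", "all", "Analysis", "enabling"),
]

# Practice test: just the skills from the current segment of instruction.
practice = [row for row in spec
            if row[1] in ("carbohydrates", "proteins", "fats")]

# Comprehensive posttest subset: the highest-level skills.
posttest = [row for row in spec if row[2] == "Analysis"]

print("Practice test objectives:", [row[0] for row in practice])
print("Posttest objectives:", [row[0] for row in posttest])
```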
Criteria for Constructing Objective Test Items
There are five criteria for constructing and evaluating objective test items:
1. The items are congruent with the objective.
Every objective must have items that demonstrate the types of behavior stated in that
objective, the items must have content that covers all of the areas taught, and one must be
sure any words used are in the proper context. The conditions must also be met; that is, the
learners must have the appropriate resources to do the activity (equipment such as
microscopes).
2. There is congruence of the item to the learners’ characteristics.
One must use relevant items for the learners, and one must use a suitable format and
vocabulary for the items. Try to stay away from sensitive areas such as religion and
politics, and use them only if specified by the instructional goal. Two factors will affect
reliability in this case: bias, which may be cultural, racial, or sexual, so try to avoid these
types; and familiarity with the response format, so get the learner familiar with the item
type, optical scanning sheets, web-based forms, and procedures before the test.
3. The items are clearly written.
Write the items clearly to avoid ambiguity and unnecessary complexity. Leave out
extraneous information, and make sure each item has only one correct answer.
4. There is accuracy of the measures.
Factors related to accuracy include item novelty (do not reuse items, as they introduce a
memory effect that takes about a month to overcome); susceptibility to guessing (such as
clues in the directions; give several choices, not one or two); and susceptibility to
cheating (use alternate forms and a different item sequence on each form, as sketched
after this list).
5. There is freedom from bias.
Items should be free of bias due to stereotyping, whether cultural, racial, or sexual, and one
should be aware of culture-dependent contexts.
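
To illustrate criterion 4's suggestion of alternate forms with a different item sequence on each
form, here is a minimal sketch; the item texts are placeholders.

```python
# A sketch of generating alternate test forms by shuffling item order,
# one way to reduce susceptibility to cheating. Items are placeholders.

import random

items = [f"Item {n}" for n in range(1, 11)]   # ten placeholder items

def make_form(items, seed):
    """Return a reproducibly shuffled copy of the item list."""
    rng = random.Random(seed)   # a seed makes each form reproducible
    form = items[:]             # copy so the master list is untouched
    rng.shuffle(form)
    return form

form_a = make_form(items, seed=1)
form_b = make_form(items, seed=2)
print("Form A order:", form_a)
print("Form B order:", form_b)
```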
Formative Evaluation of Items
For posttests, one should evaluate a draft copy before giving it to the learners: look first for
item-objective congruence, then student characteristics, then item clarity, and, last, accuracy of
the measures. Place the item and objective together and break them down into their respective
components. It is worthwhile to pull a small representative sample of students to go over the
test with you, so as to clarify the directions and items and to have them identify any clues to
the correct answers. Also, have colleagues judge the items independently, and use the same people
who checked item-objective congruence. Accuracy of measures is based on a personal decision;
only the teacher knows the seating chart, whether the items are novel, and whether guessing is
likely.
Objective Items
Items are considered objective when they can be scored based on the learners' selection of a
response. Written-response items include sentence completion (fill-in-the-blank) and short-answer
items, where the learner has to supply the answer from memory. Selected-response items
include alternative response (true-false, fact-opinion), matching exercises, keyed items (choices
are listed at the top and are used repeatedly for a cluster of questions), and multiple choice.
One may ask, why are these item types used? First, objective items allow for evaluation according
to both Gagné's Types of Learning and Bloom's Taxonomy; such items are also scored quickly, with
a wide range of content being measured (a larger sampling of the content domain); next, objective
tests are easier to administer, score, and analyze; the items can easily be adapted for machine
scoring (sketched below); and the scores are more reliable due to fewer scoring errors.
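
The machine-scoring advantage mentioned above amounts to comparing each response against an
answer key. A minimal sketch follows; the key and the learner's responses are hypothetical.

```python
# A sketch of the kind of machine scoring that objective items lend
# themselves to: each response is compared against a keyed answer.

key = {"Q1": "B", "Q2": "D", "Q3": "A", "Q4": "C"}

def score(responses, key):
    """Count responses that match the keyed answer."""
    return sum(1 for q, ans in responses.items() if key.get(q) == ans)

learner = {"Q1": "B", "Q2": "A", "Q3": "A", "Q4": "C"}
print(f"Score: {score(learner, key)}/{len(key)}")   # Score: 3/4
```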
Written-Response Test Items
Written-response test items require learners to recall information from memory or to apply a
skill before they write the answer. Positive features of these items are that guessing is
minimized, since no alternatives are given; original responses give insight into learner
misconceptions; and they are versatile in that they use a variety of formats. Negative features
are that they must be clearly written in order to produce the proper response, and written
responses cannot be machine scored, thus reducing their efficiency.
One form of written-response item is the completion item, where learners are required to write a
key word or words in a blank near the end of the statement. In order for these items to work
well, the possibility of several correct answers should be eliminated, as well as any clues to
the correct answer. The teacher should use paraphrasing to avoid rote memorization of the
content, and reduce scoring time by using short answer sheets. Short-answer items require the
learner to complete a statement by inserting a word, phrase, or sentence. Such items are common
on mathematics tests, and require the learner to associate a given stimulus with a response. It
is important that the teacher provide a blank per word, specify the units for the answer (i.e.,
grams, pounds, feet, etc.), and ensure that the directions for a cluster of items are appropriate
for all of the items. (A sketch of scoring such items against a set of acceptable answers
follows.)
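
As a sketch of how such scoring might be handled, the example below checks a learner's short
answer against a set of acceptable answers, assuming (hypothetically) an item that specified
grams as the unit. The item name and answer list are illustrations only.

```python
# A sketch of scoring short-answer items against acceptable answers,
# assuming the units were specified in the item as suggested above.

acceptable = {
    # hypothetical item: "What is the mass of the sample, in grams?"
    "mass of sample": {"12 grams", "12 g", "12"},
}

def score_short_answer(item, response):
    """Normalize case and whitespace, then check against acceptable answers."""
    return response.strip().lower() in acceptable[item]

print(score_short_answer("mass of sample", " 12 g "))   # True
print(score_short_answer("mass of sample", "12 kg"))    # False
```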
Selected Response Items
Selected-response items require learners to choose an answer from a set of given choices.
Incorrect alternatives are known as distracters or foils, but they should not be tricky or
contain unknown information. Positive features include that learners must choose from given
alternatives, that responses take less time, allowing learners to answer more items, and that
the items lend themselves to machine scoring. One negative feature is an increased risk of
guessing.
One form of selected-response item is the alternative-response item, in which the learner has a
choice of two responses and each item is a statement that the learner must judge. The most
commonly known are true-false items, though correct-incorrect and fact-opinion items are also
used. Positive features include that they test recall, can require classification, allow for the
judging of analysis or synthesis material, can assess learners' ability to evaluate, and allow
the learner to apply principles to judge the accuracy of statements of causality or correlation.
Negative features include that guessing becomes a 50-50 probability (quantified in the sketch
after this paragraph), and that learners who are given no word to respond to tend to write
"false" as their answer. Such items require that the statement be worded positively with no clues
to possible answers. Matching items provide the learners with two columns of information to be
matched. The left side contains premises while the right side contains responses. The learner
must have clear directions on how many times a response can be used, the content in the set must
be kept homogeneous, and the teacher must construct the set so as to avoid the process of
elimination and any one-to-one correspondence. Keyed items combine matching and alternative
response, in that the same responses (usually at the top of the question set) are used for
different questions. It is important to note that when developing and administering tests for
non-readers, the teacher should read a general statement of the skill to be performed to the
learner, the skill should be demonstrated on the response form, and the non-reader should be
asked questions for clarification.
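
The guessing risk noted above can be quantified. The expected number of items answered correctly
by blind guessing is n/k for k-choice items, and the standard correction-for-guessing
(formula-scoring) adjustment is R - W/(k - 1), where R is the number right and W the number
wrong. A minimal sketch with illustrative numbers:

```python
# A sketch quantifying guessing on selected-response items.

def expected_chance_score(n_items, k):
    """Expected number right from blind guessing on k-choice items."""
    return n_items / k

def corrected_score(right, wrong, k):
    """Formula score removing the expected gain from guessing."""
    return right - wrong / (k - 1)

# On a 20-item true-false test (k = 2), blind guessing yields 10 right
# on average, and 15 right / 5 wrong corrects down to 10.
print(expected_chance_score(20, 2))                # 10.0
print(corrected_score(right=15, wrong=5, k=2))     # 10.0
```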
The most popular item type, multiple-choice items with corresponding item banks, measures
Gagné's verbal information and intellectual skills and all levels of learning of Bloom's
Taxonomy. Positive features include that such items can focus on a particular aspect of a
problem; offer multiple responses, which force the learner to choose from alternatives; can have
a best answer, which measures the ability to make fine discriminations; can have several correct
answers; can measure a wide range of content; are reliably scored; are easily adapted for
machine scoring or computer administration; and are easily compiled into item banks (a sketch
of drawing a test from such a bank follows). Limitations include that they cannot be used to
measure attitudes or motor skills, that there is an increased chance of guessing, and that it is
hard to write distracters that reflect actual misconceptions. A multiple-choice item includes a
stem, which may be a complete sentence or question or an incomplete statement, or the stem may
be embedded in the directions. Incorrect responses are called distracters or foils.
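
As a sketch of how an item bank might be organized and drawn from, consider the example below.
The bank entries and the draw function are hypothetical illustrations, not a prescribed format.

```python
# A sketch of compiling multiple-choice items into an item bank and
# drawing a test by level of learning. All entries are hypothetical.

import random

bank = [
    {"stem": "Which food is highest in protein?", "objective": "proteins",
     "level": "Knowledge", "choices": ["bread", "chicken", "apple", "butter"],
     "answer": "chicken"},
    {"stem": "Which meal is nutritionally balanced?", "objective": "all",
     "level": "Analysis", "choices": ["A", "B", "C", "D"],
     "answer": "B"},
]

def draw(bank, level, n, seed=0):
    """Sample n items at a given level of learning from the bank."""
    pool = [item for item in bank if item["level"] == level]
    return random.Random(seed).sample(pool, min(n, len(pool)))

for item in draw(bank, level="Knowledge", n=1):
    print(item["stem"])
```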
© by Carolyn Pearson 2005. All rights reserved.