Table of Contents

Module 1 Overview
Module 1 Part 1
  Measure, Test, Assess, or Evaluate?
  Measure
  Test (or Instrument)
  Assess
  Evaluate
Module 1 Part 2: High Stakes Testing
  Additional Resources
  Alignment
  Ethics
  Practice
Module 1 Part 3: Purposes and Types of Tests
  Overview of classifications
  Common types of classifications
Module 2 Overview
Module 3 Overview
Module 4 Overview
Module 5 Overview
Module 5 Part 1: Central Tendency
  Summarizing Data (Please read Chapters 12 and 13 in the Textbook)
  Central Tendency
Module 5 Part 2: Variability
  Summarizing Data: Variability (Please read Chapter 13 in the Textbook)
  Variability
Module 6 Overview
Module 6 Part 1 Correlation
  Interpreting the Correlation Coefficient (r) and Coefficient of Determination (r²)
  Scatterplots
Module 6 Part 2 Validity
  Content Validity
  Criterion-related Validity: Concurrent and Predictive
  Construct Validity
Module 7 Overview
Module 7 Reliability and Accuracy
  Reliability
  Stability, Equivalence, and Stability and Equivalence Methods
  Internal Consistency Reliability Estimates
  Inter-rater reliability
  Standard Error of Measurement
  Factors That Influence Reliability Interpretation
Module 8 Overview
Module 8 Part 1: Standardized Testing
  Basic Characteristics of Standardized Tests
Module 8 Part 2: Standardized Test Score Interpretation
Module 9 Overview
Module 9 Assessment Issues for Language Enriched Pupils & Exceptional Student Education Settings
  General Principles
    Systematically Planned and Constructed
    Designed to Fit Characteristics of Content
    Designed to Fit Characteristics of Learners
    Designed for Maximum Feasibility
    Professional in Format and Appearance
  Part 2
    More Resources Related to Accommodations for Students in ESE and ESOL/LEP Programs
    Considerations When Identifying and Implementing Accommodations for Students in ESE Programs
    Assistive Technology
    Specific Exceptionalities
    Empirical and Research-oriented Studies Related to Assessment Accommodations
    Considerations When Identifying and Implementing Accommodations for Students in ESOL Programs

EDF6432 - Measurement and Evaluation in Education
Dr. Haiyan Bai

Module 1 Overview

The concepts in this module are important whether you are using measurement skills as a teacher in a classroom or in another professional role such as school leader, counselor, instructional designer, or researcher. As you begin, consider all the ways that proficiency in measurement and evaluation is vital to your effective professional performance. Consider how these measurement skills can assist you in performing your role, consistent with your professional philosophy, and with high quality information at your fingertips to make effective decisions. These measurement skills are also important for interpreting and conducting research (teacher or school leader action research, school leaders' or private, non-profit evaluation research, scholarly research).

Module 1 corresponds to Chapters 1, 2, & 4 in our textbook. We will begin with general but critical concepts related to measurement and evaluation, continue with high-stakes testing, and go on to purposes and specific types of tests. Content in this module relates to the text but includes content not found in the textbook as well.

One of the most important attributes of high quality assessment is the validity of results (the extent that inferences that we make from the results are appropriate).
One of the most important steps to ensuring validity is identifying what it is you want to assess (who and what), for what purpose (why), and under what conditions (how). In this module, we will learn skills that will help you enhance the validity of results of tests you use, create, or evaluate for research purposes. The table below contains the objectives, readings, learning activities, and assignments for Module 1.

Module 1 focuses on the following objectives:

Objectives
  Chapter 1
    Compare and contrast testing and assessment.
    Explain why testing and assessment skills are vital to today's classroom teacher.
    Identify the implications of current trends in educational measurement for today's classroom teacher.
  Chapter 2
    Describe the broad impact high-stakes testing has on students, teachers, administrators, schools, and the community.
    Explain the relationship between academic standards, performance standards, and alignment in standards-based reform.
    Identify AERA's 12 conditions that high-stakes testing programs should meet.
  Chapter 4
    Associate various types of decisions with the types of tests that provide data for these decisions.
    Determine whether or not a given test is appropriate for a given purpose.
    Describe the various types of tests available, and identify situations in which they would be appropriate.
    Discriminate among the various types of tests and their appropriate uses.

Readings
  Chapters 1, 2, & 4 in text
  Content and articles specified in module
  (explore) Florida Department of Education Accountability, Research, and Measurement (ARM) found at http://www.fldoe.org/arm/
  Professional standards in your field (see list under Materials tool)
  (selected student performance standards from) Florida Sunshine State Standards found at http://www.fldoe.org/bii/curriculum/sss/

Learning Activities
  Several non-posted practice tasks
  Posting to group (professional standard related to measurement and evaluation; student performance standard that has been classified by level along with a derived learning target)

Assignments
  Begin Final Project Part A (Instructions will be available under Assignments tool: Final Project)

Module 1 Part 1

Measure, Test, Assess, or Evaluate?

We often hear measurement-related terms substituted for each other when they actually have distinct meanings. While the terms are often used interchangeably by the public, media, and even colleagues, we can communicate more effectively if we use the terms correctly on a consistent basis. Let's first define the terms and then practice classifying real life examples. We will still hear the terms used frequently in different ways by others. At least we will know how they are appropriately defined by accomplished educators and measurement scholars. We may be more likely to use the terms correctly in our communications with others, and this may help others to use the terms correctly, as well.

Measure

Measurement is the process of quantifying or describing the degree to which an attribute is present using numbers. In our work as educators, we often need to describe the extent that a person possesses a certain characteristic. We often must express how much learning has taken place or the strength or presence of an attitude.
Examples include descriptions of:
o how many problems students are able to solve
o the extent that a student can explain or apply a concept
o the level of motivation to learn that a student or a class demonstrates
o our own level of performance with the Florida Teacher Accomplished Practices

In this effort, we attempt to assign a number to help us express the amount of an attribute that is present. Among other things, this enables us to:
o communicate with others more precisely using quantified information
o monitor changes in amounts of attributes over time
o make more precise and reasonable plans

Test (or Instrument)

A test is what is used to obtain the measure. In education, tests are categorized in many different ways according to their purpose (more on this in a later module). For example, in educational settings, tests are often categorized as either objective or alternative.
o Examples of objective style tests include multiple choice, matching, and short answer.
o Examples of alternative style tests include product and performance tests (with rubrics), portfolios, and behavior rating scales.

In addition to the instrument itself, a test could be considered a set of procedures used to get a measure. Consider these examples of tests that involve a set of procedures designed to get a measure.
o 50 yard dash: mark out 50 yards of track, locate student at starting point, signal start, use stop watch to record time from start to finish
o reading comprehension: student reads designated passage, responds to questions designed to indicate extent of understanding, teacher documents level of understanding
o expressive communication: a student is placed in a specific, structured social context; an observer determines the number of appropriate verbal interactions exhibited in a one hour block of time

If we want high quality measures, or good information on which to base our educational decisions, we must use high quality tests. The following characteristics are most important for determining the quality of an instrument of any type.
o validity of results
o reliability of results
o utility of the test in the specific measurement context

Assess

Assessment is a process of gathering information, both measures and verbal descriptions, about an attribute such as students' progress toward instructional goals, the operations of an educational program, or a teacher's development as an effective professional. The information is usually needed to make an educational decision. Formal assessment is a systematic process and each step should be scrutinized for quality. The process often includes the following steps:
o identify decision to be made
o identify information needed to make the decision
o collect the information (could be both formally collected and informally collected data)
o judge the quality of the information gathered
o integrate the information
o make the decision

The quality of educational decisions depends on the quality of the assessment process. Even when informal assessments are conducted, one should consider the quality of the information that was used.

Evaluate

Evaluation is the process of making a judgment according to a set of criteria. Evaluation is conducted at many levels. In the broad scope, we take the results of our assessment and use them to make judgments.
For example, we might use the results of assessments of students' progress, assessments of students' motivation, and assessments of teachers' attitudes in order to make a judgment about the quality or worth of a particular academic program. We might use the assessment results to judge the program as effective, ineffective, or even harmful. On a more narrow scope, we might use standards of reliability to judge the results of one of our teacher-made tests as either poor or good quality. To have confidence that our evaluation is reasonable, we again are depending on high quality measures and assessments. The more impact that the evaluative decision has, the greater our responsibility to ensure that the assessments used were of high quality.

Examine the sample scenarios and then try your hand at categorizing an activity as testing, assessment, or evaluation.

Testing: A teacher creates a set of math problems designed to tell how well students can recognize the relationships between fractions, decimals, and percents; express a given quantity in a variety of ways; determine if numbers expressed in different ways are equal; and convert numbers expressed in one way to their equivalent in another form. The students complete the math problems and the teacher determines how many each student got right.

Assessment: A teacher is making a decision about whether the student is making progress in controlling the appropriateness of their expressive communication behavior. The teacher gathers the results of formal behavior observations, the perceptions of other teachers, and the perceptions of the individual student from their journal as well as from a personal interview. The results are used to determine how much control is being demonstrated by the student. The teacher will compare the current results to the student's previous results as well as to expectations set forth in a developmental communication scale.

Evaluation: A team of teachers is charged with determining whether a given software program containing a variety of maps (local, world, social, geographical, etc.) that the district is considering for purchase will meet the needs of both the social science and physical science curriculum. The team decides to base the judgment about the software program on cost, utility (easy to use, matches content in both science and social studies, etc.), and quality of maps (whether they are current, complete, easy to read, etc.).

Now try your hand at classifying the following scenarios. Make a note of your choice and then scroll down to compare your classification with that of the author.

Scenario A (Test, Assessment, or Evaluation?): A teacher has collected several pieces of information in order to decide which selections the student orchestra will play for the spring program. The teacher surveyed the participants (students, parents, school personnel) to determine their musical preferences; reviewed the difficulty levels of available sheet music; calculated the amount of practice time; and made a chart of the students' performance levels. After analyzing the available information, the teacher selected three appropriate pieces.

Scenario B (Test, Assessment, or Evaluation?): A panel of three teachers has been given the task of determining the fluency level of graduates of a Spanish language training program. The teachers listen to students read assigned passages, engage in a brief conversation with each student, and then listen to a three minute oral presentation by each student.
After these activities the panel determines whether students speak at an appropriate pace, are able to make themselves understood, and are able to use a range of appropriate vocabulary. Considering this information, the panel categorizes the fluency level of each student as beginning, intermediate, or proficient.

Scenario C (Test, Assessment, or Evaluation?): A teacher has given students a set of instructions to write a paragraph on a specified topic. The students will earn points for various aspects of correct paragraph construction. They will also earn points for using correct spelling and punctuation. The teacher will use a scoring rubric to determine the number of points students earn showing how well they can correctly construct a paragraph.

Scenario D (Test, Assessment, or Evaluation?): Students complete a set of 40 multiple choice questions. The number correct out of 40 indicates the extent they have mastered a list of 10 instructional objectives.

Now compare your choices with those below.

A. Assessment: A teacher has collected several pieces of information in order to decide which selections the student orchestra will play for the spring program. The teacher surveyed the participants (students, parents, school personnel) to determine their musical preferences; reviewed the difficulty levels of available sheet music; calculated the amount of practice time; and made a chart of the students' performance levels. After analyzing the available information, the teacher selected three appropriate pieces.

B. Evaluation: A panel of three teachers has been given the task of determining the fluency level of graduates of a Spanish language training program. The teachers listen to students read assigned passages, engage in a brief conversation with each student, and then listen to a three minute oral presentation by each student. After these activities the panel determines whether students speak at an appropriate pace, are able to make themselves understood, and are able to use a range of appropriate vocabulary. Considering this information, the panel categorizes the fluency level of each student as beginning, intermediate, or proficient.

C. Test: A teacher has given students a set of instructions to write a paragraph on a specified topic. The students will earn points for various aspects of correct paragraph construction. They will also earn points for using correct spelling and punctuation. The teacher will use a scoring rubric to determine the number of points students earn showing how well they can correctly construct a paragraph.

D. Test: Students complete a set of 40 multiple choice questions. The number correct out of 40 indicates the extent they have mastered a list of 10 instructional objectives.

If you were off target for most, you may wish to discuss your thoughts with classmates on the Help from Classmates discussion board.

Module 1 Part 2: High Stakes Testing

Additional Resources

High-stakes tests: tests for which there are significant contingencies associated with the results. High stakes testing is a topic that is earning considerable attention from many educational participants and stakeholders as well as the media. While we read in the paper or hear in the news about events and the perceptions of people without extensive training in assessment, what do those with considerable training and research background in assessment think? In this part of the module you will explore selected resources related to high stakes testing that are based on research and best practice.
As you continue to develop your educational philosophy and practice, you are encouraged to add the information and skills from these resources to your assessment and evaluation "tool kit." The following are sites on the World Wide Web that will supplement your reading from Chapter 2 in our textbook.

AERA Research Point: The contribution made by teachers in a value added assessment.
American Evaluation Association Position Statement on High-Stakes Testing in Pre-K - 12 Education.

Additional Readings (locate at least one of the articles and then check out your professional organization website for info on high-stakes testing):

Volume 42, Issue 1 of the journal Theory Into Practice focused on high stakes testing. These citations are for two of the articles in that issue.

Goertz, M., & Duffy, M. (2003). Mapping the landscape of high-stakes testing and accountability programs. Theory Into Practice, 42(1). Retrieved online from Academic Search Premier.

Hombo, C. M. (2003). NAEP and No Child Left Behind: Technical challenges and practical solutions. Theory Into Practice, 42(1). Retrieved online from Academic Search Premier.

Amrein, A. L., & Berliner, D. C. (2002, March 28). High-stakes testing, uncertainty, and student learning. Education Policy Analysis Archives, 10(18). Retrieved September 16, 2002, from http://epaa.asu.edu/epaa/v10n18/

Locate the website for your professional organization. Determine whether there is a position statement related to high-stakes testing. Example: National Council of Teachers of Mathematics Position Statement on High-Stakes Testing.

Alignment

An important consideration when it comes to understanding and interpreting the results of high-stakes tests in education is the extent to which the curriculum, instruction, and assessment related to the tests are in alignment. The No Child Left Behind (NCLB) Act of 2001 requires that states test in reading and math at least in grades 3 through 8 and one time in high school; and starting in the 2007-2008 year, tests in science are to be administered at least once in grades 3 through 5, 6 through 9, and 10 through 12. The law states that the tests are to be aligned with the challenging state-mandated curriculum standards. These tests are considered high-stakes because of the significant contingencies associated with them. It is important to monitor the extent that curriculum, instruction, and formative assessment are aligned with the standards tested by these high-stakes, standards-based tests.

Ethics

Test developers and users must adhere to ethical standards for the creation and administration of tests (high-stakes or not). These ethical standards can be found in various sources including the Codes listed below and in various professional organizations' resources. It is wise to review these periodically to monitor the extent that assessment activities in your context are consistent with ethical standards. Another source of ethical practice is the Standards for Educational and Psychological Testing (1999), published by the American Educational Research Association and prepared by the Joint Committee on Standards for Educational and Psychological Testing of the American Educational Research Association, American Psychological Association, and the National Council on Measurement in Education.
You are encouraged to visit the websites of these organizations periodically. Many of the principles for the use of standardized tests are true for the high-stakes category of testing. We will study these further in later chapters and modules. Explore the Codes listed in the links below and then try to locate others that pertain to the ethics of assessment for professionals in your specific field.

Code of Fair Testing Practices: Copyright 2004 by the Joint Committee on Testing Practices. Reprinted with permission. Code of Fair Testing Practices in Education. (Copyright 2004). Washington, DC: Joint Committee on Testing Practices. (Mailing Address: Joint Committee on Testing Practices, Science Directorate, American Psychological Association (APA), 750 First Street, NE, Washington, DC 20002-4242; http://www.apa.org/science/jctpweb.html.)

Code of Professional Responsibilities in Educational Measurement: Copyright 1995 National Council on Measurement in Education. Any portion of this Code may be reproduced and disseminated for educational purposes.

Practice

Try your hand at classifying the following scenarios as high-stakes or not high-stakes testing.

Scenario 1 (High-stakes or not?): A student gives a speech as a performance assessment. The score obtained by using a rubric to evaluate the student's performance will contribute 15% of the student's unit grade. The score is considered a summative measure of the student's skill in delivering a speech.

Scenario 2 (High-stakes or not?): Results of a performance test will be used to determine whether or not a student is allowed to participate in the school orchestra. The student will either be considered musically talented and invited to join the orchestra or not musically talented and prohibited from joining the orchestra based on the obtained performance test score.

Scenario 3 (High-stakes or not?): A student will be allowed to graduate from high school only if he or she earns a score equal to or beyond a certain cut-off point.

Compare your choices with the feedback below.

Scenario 1 - Not high-stakes (the grade for a unit on speech will not have consequences at the level of significance to be considered high-stakes): A student gives a speech as a performance assessment. The score obtained by using a rubric to evaluate the student's performance will contribute 15% of the student's unit grade. The score is considered a summative measure of the student's skill in delivering a speech.

Scenario 2 - High-stakes (joining or not being allowed to join the orchestra is more significant given the potential impact on a student's life and the categorization: talented or not talented): Results of a performance test will be used to determine whether or not a student is allowed to participate in the school orchestra. The student will either be considered musically talented and invited to join the orchestra or not musically talented and prohibited from joining the orchestra based on the obtained performance test score.

Scenario 3 - High-stakes (graduating or failing to graduate from high school will have a significant impact on a student's life): A student will be allowed to graduate from high school only if he or she earns a score equal to or beyond a certain cut-off point.

Module 1 Part 3: Purposes and Types of Tests

In this section we will review the purposes and types of tests. An important aspect of the validity of an instrument is that it measures what it was designed to measure.
We must choose a test that is appropriate for getting the information we need. Also, when designing our own tests, there are design considerations specific to the various types of tests. Therefore, we must be very clear on the purpose of the test and what it is supposed to measure before we build it or select it from the many choices available.

Overview of classifications

One reason for considering the many classifications for tests is that it is important to pick the right type of test for the task at hand. Consider the aspects of what you are trying to measure and then make sure the instrument you select is a good match for those aspects. Another reason for being aware of the various classifications is to communicate more effectively. You will be more likely to find the instrument you are looking for if you know how it has been classified - you will be more likely to look in the right places. You will be more effective in communicating about the results if you know from which type of test the results were obtained.

A single instrument can fit into more than one of the categories. It is helpful to think of the primary aspects that are to be emphasized and then categorize the instrument along those lines. Use the primary aspects first, secondary aspects next, and then omit the classifications that aren't relevant to the purpose of the test. An instrument can be criterion-referenced, objective, power, and teacher-made all at the same time. It couldn't be classified as both objective and subjective at the same time, though. An instrument could be subjective and summative but could not be summative and formative at the same time. (There are exceptions, though; subsections within an instrument could be classified differently; Part A of an instrument might be power and Part B might be speed, for example.)

Common types of classifications

Review the classification types and definitions in Chapter 4. Criterion-referenced and norm-referenced are described in the next module. A single test may be used in different ways at different times. For example, an objective style test may be used as a formative evaluation of students' performance in one context while that same test may be used as a summative evaluation in another context.

o Think of the product test in which students must write a short paragraph. A teacher might use this as a practice test while students are still in the formative stage of developing their skills for one group; while in another class, that same paragraph test could be used as an end-of-unit posttest to determine how well students can create a paragraph after instruction has been completed. The scores in the formative assessment would not count toward a student's grade while the scores of the posttest would be considered a summative description of the students' end performance and would count toward their term grade.

Common types of instruments in education and an example of each are listed below. For each one, try to think of an example of an instrument you have encountered in your past experience that also fits that category.

Objectively scored: A multiple choice test that measures students' understanding of key concepts within a unit on geography.

Subjectively scored: An essay test in which the student describes key elements of democracy and then predicts what would happen if each country on a specific continent adopted a democratic form of government.
Individually administered: The Wechsler Intelligence Scale for Children (WISC).

Group administered: The Cognitive Abilities Test, which measures reasoning abilities.

Verbal: Almost any test that requires that students read the test instructions and then read questions and write their responses; our midterm exam is an example of a verbal test. Even if the instructions and questions were read to the students and they answered orally, it would be considered a verbal test because taking the test is so dependent on students' verbal abilities.

Non-verbal: The Naglieri Nonverbal Ability Test measures students' non-verbal reasoning and problem solving skills.

Speed: A test of 'keyboarding' skills that measures how many words students can type per minute. (This doesn't just mean that there is a limit to how much time a student can take. It means how fast they can perform the skill.)

Power: A standardized achievement test (e.g., Stanford Achievement Test) is designed to measure the amount of content the student has mastered. The breadth and depth of a student's knowledge are emphasized over how fast one can perform.

Academic: The Bracken Basic Concept Scale measures skills typically needed to succeed in school. Also, any test that covers subjects typically taught in schools would be considered an academic test when contrasted with the next category: affective tests.

Affective: The Keirsey Temperament Sorter II is a personality questionnaire that is available online.

Teacher-made: Our first product exam, which will ask you to create an objective style test, is an example of a teacher-made test (not made by a commercial test publishing company or educational materials company).

Standardized: The Florida Comprehensive Assessment Test is a standardized test. It is to be administered and scored in the same manner everywhere that it is used (same administration procedures, same instructions to teachers and students, and same scoring procedures).

How did you do? Were you able to think of (or locate online) another example for each of the categories? If not, you may want to use a search engine and see what you come up with, or use the Buros Mental Measurements Yearbook website (search Buros for test reviews) to find examples.

Tasks for Module 2

Week 3 & 4 tasks:
1. Learn Module 2 and Chapters 5, 6, and 7.
2. Objective Exam 1 is available on Monday. It is open 1/27-2/7 and due at 11:59pm on 2/7. You can take it anytime during this period, and you can take it up to three times; the best score will be recorded and counted toward your final grade. Please don't miss it, because it is open long enough for you to select a good day to take it. The answer key will be available after I receive all the submissions.
3. By the end of this week, you should start Part A of the Final Project, but you don't need to submit Part A yet. It should be submitted together with Part B by the end of the semester. You should follow the schedule so that you are not left behind. I will check online to see your progress and provide my comments when necessary.

If you have any questions, please feel free to let me know; I am ready to help you have a successful semester. Thank you all for a good start!

Haiyan Bai

Module 2 Overview

The concepts in this module are important whether you are using measurement skills as a teacher in a classroom or in another professional role such as school leader, counselor, instructional designer, or researcher.
As you begin, consider all the ways that proficiency in measurement and evaluation is vital to your effective professional performance. Consider how these measurement skills can assist you in performing your role, consistent with your professional philosophy, and with high quality information at your fingertips to make effective decisions. These measurement skills are also important for interpreting and conducting research (teacher or school leader action research, school leaders' or private, non-profit evaluation research, scholarly research).

Module 2 corresponds to Chapters 5, 6 & 7 in our textbook. We will cover critical concepts related to norm- and criterion-referenced tests, test types and purposes, learning outcomes, and constructing objective-style items. Content in this module relates to the text but includes content not found in the textbook as well. One of the most important attributes of high quality assessment is the validity of results (the extent that inferences that we make from the results are appropriate). One of the most important steps to ensuring validity is identifying what it is you want to assess (who and what), for what purpose (why), and under what conditions (how). In this module, we will learn skills that will help you enhance the validity of results of tests you use, create, or evaluate for research purposes. The table below contains the objectives, readings, learning activities, and assignments for Module 2.

Module 2 focuses on the following objectives:

Objectives
  Chapter 5
    Discriminate between norm- and criterion-referenced tests.
    Describe why content validity is important for classroom achievement tests.
    Discriminate between specific instructional or behavioral objectives, and general or expressive objectives.
  Chapter 6
    Describe the components of a well-written instructional objective.
    Discriminate between observable and unobservable learning outcomes.
    Write objectives at different levels of the taxonomy.
    Construct a test blueprint for a given unit of instruction, according to the guidelines provided.
  Chapter 7
    Identify the type of item format appropriate for different objectives.
    Describe ways to minimize the effects of guessing on true-false items.
    Write fault-free objective test items that match their instructional objectives.

Readings
  Chapters 5, 6, & 7 in text
  Content and articles specified in module
  (explore) Florida Department of Education Accountability, Research, and Measurement (ARM) found at http://www.fldoe.org/arm/
  Professional standards in your field (see list under Materials tool)
  (selected student performance standards from) Florida Sunshine State Standards found at http://www.floridastandards.org/index.aspx

Learning Activities
  Several non-posted practice tasks
  Postings to working group and specific discussion topics (assessment procedure validity compromise; student performance standard that has been classified by level; planning to create instructional objective and test item)

Assignments
  Continue Final Project Part A (to be found under Assignments tool)

Module 2 Purposes and Types of Tests - Criterion Referenced

We continue to examine ways in which the many tests used in education can be classified according to purpose and use. Two categories that are highly relevant to educators are criterion-referenced and norm-referenced tests. This section and the next will discuss these two important categories.
It is not just the test itself (the paper with print or the set of procedures used in a performance test), it is the perspective from which scores are interpreted that helps us categorize the instrument as criterion-referenced or norm-referenced.

Criterion referenced interpretation

The majority of the instruments that are designed by teachers for use in their classrooms are criterion-referenced. A set of test items is developed to measure a carefully analyzed framework of instructional outcomes. The student's score on the test is interpreted to represent the extent the student can successfully perform the set of outcomes. The student's performance is compared to the original set of outcomes. Scores interpreted from this perspective are often reported as a list of objectives mastered, the number of objectives mastered, and/or a percentage loosely representing the proportion of the original set of outcomes that has been mastered.

Imagine that a teacher is planning an instructional unit and would like to know whether the students have acquired the prerequisite skills needed to be successful with the new skills included in the upcoming unit. Let's say the unit is based on: Social Studies, Grades 6-8, Sunshine State Standard 3: The student understands Western and Eastern civilization since the Renaissance. (SS.A.3.3); #5 Understands the differences between institutions of Eastern and Western civilizations (e.g., differences in governments, social traditions and customs, economic systems and religious institutions).

If the teacher had criterion-referenced information available, he could look up exactly which prerequisite skills the class had acquired and which were missing. Can the students locate the Western and Eastern hemispheres? Can the students define and give examples of government, religion, economy, etc.? What would not be as helpful in this context is to know the proportion of the norm group that performed lower than a particular student, or the stanine scores earned by students in the class, or the rank standing of each student in the class compared to the norm group. For the specific information needed by this teacher, comparing performance against a domain of skills would be more helpful than comparing the performance of students to a norm group.

Compare the two types of information in the following tables. Both sets of information are useful, but each is useful for different purposes in different circumstances. Notice that in the Criterion Referenced table, the information shows that Annabelle has acquired approximately 86% of the terms in the domain and can locate the hemispheres. The teacher might interpret that the group is strong in relation to definition of terms but only about half of them know where the hemispheres are located. This information could help the teacher in planning the upcoming unit. The information in Table B, while helpful in another context, would not help the teacher in determining the specific content that must be incorporated into the learning activities for the students to be successful in the new unit.

You may also imagine this interpretation from the perspective of a school leader who is identifying which goals and objectives related to the school plan are areas of strength and which areas need more attention. Effective, data-driven decision-making involves the comparison of gathered evidence related to school effectiveness indicators against the targeted goals (yes/no; 60% attained; etc.).
This constitutes interpretation of data from a criterion-referenced perspective. When trying to know which actions to take in the coming years, it is not as helpful to know how you compare against other schools but rather how you compare against your own targeted goals.

A. Criterion Referenced Perspective

Student     Defines Terms: Govt, Econ., ... (Percentage Score)    Locates Hemispheres
Annabelle   86                                                    Y
Fred        92                                                    N
Randal      88                                                    Y
Ysela       94                                                    N

B. Norm Referenced Perspective

Student     Social Studies Percentile Rank    Rdg. Comp. Percentile Rank
Annabelle   72                                88
Fred        89                                78
Randal      82                                75
Ysela       90                                70

Module 2 Purposes and Types of Tests - Norm Referenced

We have learned that there are many ways of classifying tests based on their format, purpose, or types of scores. Next we'll look at the norm referenced perspective of score interpretation. This type of interpretation is often required when trying to interpret standardized test score reports. It can be helpful when describing a group's prior achievement (e.g., "above average, homogeneous prior achievement within a subject area" or "heterogeneous, including below average, average, and above average achievement across subject areas"). In contrast to the criterion referenced perspective, this perspective is not as helpful for detailed planning of instructional units.

Norm referenced interpretation

Recall that a norm referenced interpretation of scores means we are comparing a student's performance against that of a norm group. With norm referenced interpretation, we will be describing a student's performance as below average, average, or above average compared to a reference group, called the norm group. To accurately interpret the students' performances, we must have some information about the reference group. In our classroom context, we would want to know if the characteristics of students in the norm group are similar to the characteristics of the students in our class. If they are, we would have more confidence in making the comparison. If they are not similar, we would be less confident in comparing the performance of the students in our class to the performance of the norm group. We will study this concept further in later chapters on standardized testing.

Do you remember the criterion and norm referenced tables in the previous module section? Let's examine a circumstance where the norm referenced information could be useful. Imagine a teacher was ready to begin planning prior to the start of a school year. The teacher would like to select strategies, materials, and activities that will be the best match for the characteristics of the students. He would like to gather as much information as possible about the relevant characteristics of the students. One relevant characteristic would be the group's prior achievement in both the targeted subject area and areas that are related to the target subject. For example, the teacher knows that social studies skill acquisition is related to reading comprehension. It would probably be impossible to test each student in the social studies class on their reading comprehension. It would also be too tedious and time consuming to review the criterion-referenced score reports on every student for each reading skill. Examining the class standardized test score report in the areas of reading comprehension and social studies may be a more realistic possibility. The specific skill information in Table A was useful for the unit-to-unit detailed planning.
The information in the norm referenced table would be useful for the teacher's general information need in this specific context.

A. Criterion Referenced Perspective

Student     Defines Terms: Govt, Econ., ... (Percentage Score)    Locates Hemispheres
Annabelle   86                                                    Y
Fred        92                                                    N
Randal      88                                                    Y
Ysela       94                                                    N

B. Norm Referenced Perspective

Student     Social Studies Percentile Rank    Rdg. Comp. Percentile Rank
Annabelle   72                                88
Fred        89                                78
Randal      82                                75
Ysela       90                                70

One can see from the information in the norm referenced table that the students are somewhat heterogeneous (average to above average) in their prior achievement in social studies while they are homogeneous (mostly average) in their reading comprehension when compared to the norm group. This is a useful piece of information when selecting instructional strategies for the group (e.g., the teacher may choose to reinforce text readings with activities that incorporate other learning styles). As we have seen, the two score interpretations serve different purposes and each can be useful in a certain context.

Module 2 - Part 3: Domain of Skills

Goal Frameworks

A domain of skills is a collection of goals and objectives related to a specific topic or group of topics within a subject. In our field of education, we hear many different uses of the term "domain". When the collection of goals and objectives is represented in a carefully analyzed format with the content and learning levels clearly delineated according to a learning taxonomy, you could call it a framework of subordinate skills. In this type of domain, learning outcomes have been identified based on careful analysis of the goals. The subordinate skills have been derived based on a breakdown of subskills and the relationships among the subskills within the goal. Test scores that come from a test based on this type of domain would be a meaningful representation of the students' performance in relation to the domain. This type of score interpretation is called criterion-referenced. We are interpreting students' performance in relation to a criterion (the domain of skills). This type of interpretation is sometimes also called domain-referenced (i.e., scores are linked to a domain).

Sometimes learning objectives are gathered more informally. They are collected because they generally cover the same topic or a loose collection of related topics. Some of our colleagues refer to this as a domain of skills, also. This type of domain is clearly different from the framework of subordinate skills described above. Results from a test based on this type of "domain" would be interpreted differently than results from a test based on a careful analysis of goals and subskills within a subject. Results from this type of test are also called domain referenced. Do you see the difference between the two? When we are using the terms and hear them used by another, it is important to understand to which type of domain the person refers.

When instructional objectives have been written from the collection of subordinate skills within the domain and a test is constructed from those objectives, we may hear those results referred to as objectives-referenced scores. Again, we will want to determine whether this objectives-referenced test was developed from a carefully analyzed goal framework of subordinate skills or from a loose collection of skills that are related to a similar topic. Scores from the two types of objectives-referenced tests would be interpreted differently.
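To make the contrast between the two score interpretations concrete, here is a minimal sketch in Python. The data, function names, and norm-group scores are all hypothetical (loosely echoing the Annabelle/Fred/Randal/Ysela tables above) and are included only to illustrate the two calculations: a criterion-referenced summary asks what fraction of the domain's objectives a student has mastered, while a norm-referenced summary asks where a student's raw score falls relative to a norm group.

```python
# A minimal, illustrative sketch; all data below are hypothetical.

def criterion_referenced(objectives_mastered: int, total_objectives: int) -> str:
    """Interpret a score against the domain: what fraction of the outcomes was mastered?"""
    pct = 100 * objectives_mastered / total_objectives
    return f"{objectives_mastered}/{total_objectives} objectives mastered ({pct:.0f}%)"

def percentile_rank(raw_score: float, norm_group: list) -> float:
    """Interpret a score against a norm group: percent of the group scoring below it."""
    below = sum(1 for s in norm_group if s < raw_score)
    return 100 * below / len(norm_group)

# Hypothetical raw scores earned by a norm group on the same 40-item test.
norm_group_scores = [18, 21, 23, 25, 26, 28, 29, 31, 33, 36]

# Hypothetical class results: (objectives mastered out of 10, raw score out of 40).
class_results = {"Annabelle": (9, 34), "Fred": (9, 30), "Randal": (8, 27), "Ysela": (9, 35)}

for student, (mastered, raw) in class_results.items():
    print(f"{student:9s}  criterion-referenced: {criterion_referenced(mastered, 10):32s}"
          f"  norm-referenced: {percentile_rank(raw, norm_group_scores):.0f}th percentile")
```

The point is not the particular arithmetic (real percentile ranks come from a publisher's norm tables, not a ten-person list), but that the first function answers "how much of the domain can this student do?" while the second answers "how does this student compare to a reference group?"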
When it comes to selecting and designing high quality classroom assessments, the most important quality criterion is the validity of the test results. Validity means the extent that inferences made from the results of a test are appropriate. Validity of results allows us to have confidence that the information we get from the administration of an instrument to a particular set of test takers, in a particular context, under specified conditions is the best information possible. The results are in fact giving us information about the attribute we are trying to measure and are not clouded by other factors we were not trying to measure. There are various ways of examining validity and we will study this concept further in later chapters and modules. In this module section we are examining one type in more detail: content validity. We are examining the extent the items on a test represent the domain of skills on which instruction was based.

Using the design principles from the previous module section, examine the following scenarios to see if you can spot compromises to the validity of the test results. Compromises to validity include instances where inappropriate inferences are being made from the results of the test. Make a note of each possible compromise you recognize.

Example 1, Part 1: A science teacher would like to increase her understanding of the students' reading comprehension level so she can select appropriate instructional materials. She is especially interested in knowing how much she can use the Internet as a tool to convey instructional content to the class. She has found a reading comprehension test that students can take online. The class has very limited access to the computer lab in order to take the test, but she is able to schedule the class of 28 students from 10:45 to 11:15 AM on Monday morning. There are 25 computers in the lab and the teacher did not notice when the three most shy students were sitting at tables without computers. Once she did notice, she felt guilty for not noticing sooner but was not otherwise concerned because these students never give her any trouble. She concludes that they must be good readers or they would ask more questions. The teacher also noticed some students whose reading comprehension she was unsure of. Instead of taking the test, they were playing video games. Eventually these students randomly bubble in answers and submit their tests. Five of the students in that same class speak English as a second language but there are no dictionaries in the computer lab. The teacher advises the students to quickly find a foreign language dictionary online and use it, if needed, as they are taking the test. The students are surprised because they did not know these types of dictionaries were available online.

Example 1, Part 2: At 10:55, the students' typical lunch hour, the class begins taking the exam, which is designed to be taken individually and requires 30 minutes for administration. The directions found in the administrator's manual of the test state that the teacher is to read instructions aloud to the class before students are allowed to begin. The teacher is afraid the class won't be able to finish before 11:15, so she tells the students to start reading the comprehension passages while she reads the directions aloud to them; she will then go around individually while the class is working to ask if they have any questions.

Example 1, Part 3: Once the students have started, the teacher has a chance to read the passages herself.
She gets a chuckle out of some of the vocabulary found on the test. There are some unusual words like "lift" instead of elevator, "dustbin" for garbage can, and "lorry" for truck. She hopes the students have watched enough movies with British settings so they will not be bothered by the differences. The teacher also believes the students might not notice the different vocabulary because the text on the screen is rather fuzzy because of the aging computer monitors. The screen fades in and out from time to time if the computers are on for longer than 20 minutes at a time. The bell rings at 11:15 and the new class of students begins to arrive. She tells her class to hit the "finish" button no matter where they are and then submit their results for scoring because the class must give up the lab. She looks forward to getting the test results and finding out just how strong the reading comprehension skills are within this class.

Example 2: A statewide comprehensive assessment test covering learning outcomes from the state curriculum standards is administered each year to every student in the state. At one school, the faculty researched an alternative curriculum, which they voted to adopt this past year and which was approved by the school board. They believe the alternative curriculum is more current and has more comprehensive and relevant goals than the state-adopted curriculum. They waited with great excitement to see the results of the comprehensive assessment test because they believed the curriculum they had chosen was better than the state-adopted curriculum. Their hopes were dashed when they found out their students earned lower scores than the other students in their district. They had been so confident that their students had learned a great deal after their experiences with the new curriculum. The faculty remains perplexed and disillusioned. They are not sure how to proceed with this frustrating information.

Example 3: The state department of education is examining the performance of various schools throughout the state. One particular school has generated a great deal of excitement because of high levels of dedication by the faculty, attendance and motivation of the student body, and participation by families of students. The school has received recognition from a national corporation for how hard everyone has been working and their great attitudes toward learning. Students are feeling high levels of satisfaction both because of the amount they have learned since the program started (they have made almost two years' worth of gain in a single school year) and because of the recognition by popular celebrities. After the assessment results are published, everyone is shocked because, using the state's grading system, the school received a grade of "D".

You have probably found many compromises to validity in these examples. Select one or two that resonate with you and post a brief comment on the discussion board under the Validity discussion topic. Which compromises resonate most with your classmates?

Module 2 - Part 4: Writing Instructional Objectives

Teachers learn about instructional objectives in several different courses or workshops. As you may have noticed, objectives are sometimes defined and written differently in different contexts.
In one context you may have learned that the objective should always be written with the phrase "Students will be able to...", while in another context you learned that you don't need to write it like that because you already know that the objective is what the student does and not what the teacher will do. In one context you may have learned that objectives are written at a fairly general level while in another context you learned they are to be written with great specificity, almost like a test item. As a student of pedagogy, you will synthesize all of this advice for your own educational context and use it to design high quality instruction, to select or create instructional materials and techniques using the appropriate curriculum for your students, and to select or create high quality assessments. In this section we will learn to write instructional objectives in a format that will help you design the tests that will yield the most valid and reliable results. Whatever the format, nearly all agree that well-written instructional objectives are instrumental in producing higher quality lesson plans and assessments. It is hoped that the instructional objective format that you learn here will contribute to your skills in effective instructional design and assessment. Defining and Recognizing Instructional Objectives An instructional objective is a statement of a learning outcome that contains conditions, content, behavior, and criteria. The term learning outcome is an important part of the definition of an instructional objective. A learning outcome is what the student will be able to do (in terms of the skill) following instruction. It is the knowledge, skill, or ability that students take with them and perform even outside of the context of the lesson. The learning outcome is not the specific task on the test that elicits the skill (e.g., match the definition with the term); it is not the practice exercise they do during the lesson (e.g., locate the vocabulary words in the word puzzle); it is the skill they will be able to perform under any context that presents itself in the future (e.g., define the terms associated with ...). Conditions are statements of any materials, equipment, stimulus material, or context that must be provided for the student to perform the skill specified by the behavior and content in the objective. Conditions are very important to the instructional objective because they create the opportunity for the student to demonstrate the skill and influence the level of difficulty at which it will be performed. The conditions will influence the validity and reliability of the resulting test item. Validity is influenced because the expected task to be elicited by the item is clearer when conditions have been specified. Reliability is influenced in that it is more likely that the item will be clear and uniform across its presentation on tests because the way in which the student is to demonstrate the skill has been clearly specified. Students will be performing under similar conditions each time the item is presented. Content is the topic or subject matter or issue within the skill contained in the objective. Behavior is the action part of the statement. It must be presented in observable, measurable terms (i.e., there must be a product or performance that you can see, hear, touch, ...; you can see the student's product resulting from "...solve quadratic equations..." but you can't see the student's comprehension in "...comprehends quadratic equations...").
You could measure whether or not the student solved but you could not measure whether or not the student comprehended until the skill was operationalized using observable, measurable terms. Instructional objectives often must include the criteria or mastery level at which the student should perform the skill. Some instructional designers believe that the criteria level is only necessary when there are degrees of correctness expected or when degrees of correctness differ across the developmental levels of the students. For example, consider the objective: From memory, state the capital of Florida. The mastery level of the learning outcome is implied in that the student is expected to state the capital of Florida correctly every time the opportunity presents itself, not every 2 out of 3 times or 8 out of 10 times. Now consider this learning outcome: Given a basketball with regulation size hoop and distance from the foul line, make a free throw. The student could not reasonably be expected to make the free throw every time. Criteria would state the mastery level appropriate for the characteristics of the students (e.g., third graders versus professional basketball players). Instructional designers who believe criteria are not needed when it is implied that the student should be able to do it right every time believe that the goal mastery level is set at the test level and not at the instructional objective level (e.g., 75% correct equals satisfactory performance on the test or 80% mastery earns a "B" on the test).
Identify the Parts of Instructional Objectives: Examine the instructional objectives in the table below. Identify the part indicated in the left-hand column. The feedback follows. 1. Behavior Supplied with a term related to musical notation and a set of definitions that include the correct and incorrect definitions, select the definition for the specified musical notation term. 2. Behavior Given a piano and sheet music for an etude, play the etude with fewer than 5 errors. 3. Content Given an example of a specified ecosystem containing producers, consumers, and decomposers, identify the specified role as either a producer, consumer, or decomposer. 4. Content Given a liquid and the necessary heating and cooling devices, demonstrate the change in matter from each state to each of the other states. 5. Conditions Given pairs of mixed numbers, explains the effect of multiplication, division, and the inverse relationship of multiplication on the mixed numbers. 6. Conditions Given sets of data on two variables with at least ten observations, displays the data using histograms, bar graphs, circle graphs, and line graphs. 7. Criteria Given names of historical leaders who have influenced western civilization since the Renaissance, identifies at least one of the factors for which the leader was influential. 8. Criteria Given an annual salary and lifestyle example with specified necessary expenses, create a monthly budget correct to within $10. Compare your choices with the feedback below; for each objective, the part named in the left-hand column is identified. 1. Behavior: the action "select" (select the definition for the specified musical notation term). 2. Behavior: the action "play" (play the etude). 3. Content: the ecosystem roles (producer, consumer, or decomposer). 4. Content: the states of matter and the changes between them. 5. Conditions: "Given pairs of mixed numbers." 6. Conditions: "Given sets of data on two variables with at least ten observations." 7. Criteria: "identifies at least one of the factors." 8. Criteria: "correct to within $10." How did you do? Are you ready to create instructional objectives that contain each of these parts? You are asked to do this in the Final Project. If you have already started the Final Project, it may be a good idea to review the objectives that you have written to make sure they contain all the necessary pieces.
Classifying Instructional Objectives According to Learning Level
Another important aspect of instructional objectives is the learning level they represent. From a measurement perspective, this is especially important. To ensure the congruence between the learning level expected in the instructional objective and the level contained in the instruction and tests, we must first be aware of the learning level expected in the objective. Is it a higher level or a lower one? If it is higher, we want to make sure we are providing students with instruction and practice opportunities at that level. If it is a lower level of learning, we must make sure we are not demanding more in the test item than was expected in the objective and instructional activities. This is especially important for the validity of our test results. After reading about the types and levels of learning described in your textbook, try to classify the following objectives accordingly. In the first table, you are asked to identify the type of learning as either affective, cognitive, or psychomotor. The next table asks you to classify the objective according to the levels of learning within the cognitive type. Type? 1. Supplied with a term related to musical notation and a set of definitions that include the correct and incorrect definitions, select the definition for the specified musical notation term. Type? 2. Given a regulation size soccer ball in a game situation, pass the soccer ball to an offensive player on the player's team. Type? 3. Given sets of data on two variables with at least ten observations, displays the data using histograms, bar graphs, circle graphs, and line graphs. Type? 4. Given a variety of paint brushes, demonstrates the technique and control that would be necessary to obtain a variety of visual effects. Type? 5. Within a game situation, demonstrates consideration for others. Type? 6. Given the opportunity to interact with people of differing physical abilities during team sports, chooses to show respect for people of like and different physical ability. Compare your choices with the feedback in the table below. You may wish to discuss discrepancies or confusion within your group's discussion area. Cognitive 1.
Supplied with a term related to musical notation and a set of definitions that include the correct and incorrect definitions, select the definition for the specified musical term. Psychomotor 2. Given a regulation size soccer ball in a game situation, pass the soccer ball to an offensive player on the player's team. Cognitive 3. Given sets of data on two variables with at least ten observations, displays the data using histograms, bar graphs, circle graphs, and line graphs. Psychomotor 4. Given a variety of paint brushes, demonstrates the technique and control that would be necessary to obtain a variety of visual effects. Affective 5. Within a game situation, demonstrates consideration for others. Affective 6. Given the opportunity to interact with people of differing physical abilities during team sports, chooses to show respect for people of like and different physical ability. Now try to classify objectives according to their level within the cognitive component of Bloom's taxonomy (knowledge, comprehension, application, analysis, synthesis, and evaluation). 1. Given a problem within a specified context that contains relevant and irrelevant information, Level? decides what information is appropriate and then collects, displays, and interprets data to answer relevant questions regarding the problem. Level? 2. Given written examples of elements, molecules, and compounds, recognizes them as such. Level? 3. Given two different strategies and problems of length in units of feet, correctly estimates length to within one foot. Level? 4. Given a short story, critique the author's use of the elements of plot (setting, events, problems, conflicts, and resolutions). Level? 5. From memory, recalls ways in which conflict can be resolved. Level? 6. Given a specified topic, writes a speech and uses a variety of techniques to convey meaning to an audience (movement, placement, gestures, silence, facial expression). Compare your choices with the feedback in the table below. You may wish to discuss discrepancies or confusion within your group's discussion area. analysis 1. Given a problem within a specified context that contains relevant and irrelevant information, decides what information is appropriate and then collects, displays, and interprets data to answer relevant questions regarding the problem. comprehension 2. Given written examples of elements, molecules, and compounds, recognizes them as such. application 3. Given two different strategies and problems of length in units of feet, correctly estimates length to within one foot. evaluation 4. Given a short story, critique the author's use of the elements of plot (setting, events, problems, conflicts, and resolutions). knowledge 5. From memory, recalls ways in which conflict can be resolved. synthesis 6. Given a specified topic, writes a speech and uses a variety of techniques to convey meaning to an audience (movement, placement, gestures, silence, facial expression). Review a set of student performance standards in an area of interest to you (e.g., Sunshine State Standards). While these are more broad than the instructional objectives we have been practicing with, they will still be helpful to practice classifying skills according to learning levels. Select two (one that is lower and one that is higher) and classify them according to type and level of learning. Post them to your group's discussion area and review the standards and classifications your group members have posted. Discuss any differences you may observe. 
While you are selecting the standards, examine those you have selected to determine whether or not they are written using observable and measurable terms for the behavior. Comment on this in your posting, also.
Module 2 Part 5 Test Blueprints
In the previous modules we have been learning about basic test design considerations as well as the specific foundations of test items: instructional objectives. We will now learn about an especially important tool when it comes to test design: the test blueprint. A test blueprint is also referred to as a Table of Test Specifications. It guides us in the development of our test and will help ensure that the test is designed to yield the most valid and reliable results. A test blueprint is a plan for the test that designates:
- what content will be included on the test
- which learning levels will be included on the test
- the number of items on the total test
- how many items and what proportion of the test fall under each content area and learning level
- the format of the test items
- an estimate of time for the overall test and for each item
Before we examine how a blueprint will do this, review the two most important criteria related to the quality of a test: validity and reliability.
Validity
Validity is defined as the appropriateness of inferences made from test results. Previously, you may have heard the definition of validity as "whether or not the test measures what it is supposed to measure." This definition is true but somewhat limited: it targets content validity but does not give the full picture when it comes to the quality of test results. In addition to measuring what it is supposed to measure, we must ensure that the interpretations or judgments we make from the test results are appropriate. There are test design considerations that we must address to help ensure that the results we get will be valid (Carey, 2001):
- Consider how well the subordinate skills selected for the test represent the overall goal framework (all aspects of its content and learning levels; is there a good match between subskills and test items?).
- Consider the information expected from the test and plan accordingly (i.e., will it be formative or summative? Is it to determine whether students have mastered the prerequisite skills? Whether they have mastered the skills in the current unit? Will the test be used to evaluate the effectiveness of your instruction?).
- Identify the best format for the tasks or skills included on the test (written response, selected response, essay, product, portfolio, etc.).
- Determine the appropriate number of items needed to effectively measure each skill. Is one item sufficient (such as with recall level skills) or do students need multiple opportunities to demonstrate the skill (such as skills that require classification, problem-solving, or physical movement)?
- Determine how and when the test will be administered (how much time will students need to prepare? When do you need the information for your planning or feedback purposes?).
Considering these factors as you are designing the test will usually ensure that more appropriate inferences will be made from the results, i.e., more valid results.
Reliability
Reliability is the next most important criterion when it comes to test design. Reliability is the consistency with which scores are obtained with the same measure under the same conditions. We think of reliability as the consistency or stability of the scores.
As Nitko (2004) further explains, reliability is the extent to which students' assessment results are the same when: they complete the same tasks on two or more different occasions; two or more raters mark their performance on the same task; or they complete two or more different but equivalent tasks on the same or different occasions. Knowing how much consistency there is in the set of scores is useful for determining how much confidence you can have in the test scores. Later in the semester we will study ways to estimate the reliability of the test results. As with validity, there are important steps we can take in the design of our instruments (Carey, 2001) to help ensure the consistency of results. These include:
- select a representative sample of content and skill levels from the goal framework or set of instructional objectives
- make sure there are enough items to adequately capture the skill that is to be demonstrated (e.g., if the skill is to 'make plurals,' give students the opportunity to make more than one word or type of word plural - s, es, ies)
- select the test item formats that will best reduce the possibility of guessing
- make sure you have selected only the number of items students can reasonably finish in the time allotted
- help students develop and maintain a positive attitude toward testing (announce tests in advance; tell them the purpose of the test; avoid using tests as punishment); i.e., consider what your students will need to be motivated to do well
Considering these factors as you are designing and constructing the test will usually ensure more consistency in performance, i.e., more reliable scores and more confidence in the results. We will study various ways to estimate validity in a later part of the course.
Creating the Test Blueprint or Table of Test Specifications
Now we can take all of this advice about general test design, validity, and reliability and use it to make the best possible test. To do this we need a good plan - that's where the blueprint comes in. Good tests take careful planning and a blueprint is a way to plan tests using all of the design considerations we have learned. To create a table of test specifications, start with a matrix. It almost looks like the goal framework. There is an example Table of Test Specifications format under the Materials tool. You may want to download or print that now as you practice constructing a blueprint. You may also use it to create your test blueprint in your Final Project if you like. The table of test specifications helps you record the content that you have selected to be on the test you are planning. The row headings on the left indicate what content you are selecting to appear on the test. The content is classified according to learning level (these classifications could come from the goal framework if you have one to work with) and the column headings are then used to indicate the learning levels that will be present on the test. These steps are important as you are trying to make sure the content and learning levels on the test match the content and learning levels from the instructional objectives that were used to plan the lessons or learning activities. This is especially important when considering the content validity of the test. Identify an instructional goal that is relevant for your discipline (subject matter/content and age/grade level). Create a rough draft of the major content and learning levels that would likely be needed for a table of specifications for that goal.
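To make the structure of such a draft concrete, here is a small illustrative sketch in Python (offered purely as an illustration, not as part of the course materials; the content rows, learning levels, item counts, formats, and time values are hypothetical examples). Each cell of the blueprint pairs a content row with a learning level and records the planned number of items; the total, the proportions, and a rough time estimate then follow from those entries.

# Illustrative sketch of a table of test specifications (test blueprint).
# All content rows, learning levels, counts, and times are hypothetical examples.

# Each cell: (content row, learning level) -> planned number of items
blueprint = {
    ("Parts of an instructional objective", "knowledge"): 6,
    ("Parts of an instructional objective", "comprehension"): 4,
    ("Discriminating examples from non-examples", "application"): 6,
    ("Writing a complete instructional objective", "synthesis"): 2,
}

# Planned item format for each learning level, and a rough time allowance per item
item_format = {
    "knowledge": "multiple choice",
    "comprehension": "multiple choice",
    "application": "matching",
    "synthesis": "short written response",
}
minutes_per_item = {"multiple choice": 1, "matching": 1, "short written response": 8}

total_items = sum(blueprint.values())
total_minutes = 0
print(f"Total items planned: {total_items}")
for (content, level), n_items in blueprint.items():
    fmt = item_format[level]
    minutes = n_items * minutes_per_item[fmt]
    total_minutes += minutes
    share = n_items / total_items
    print(f"{content} / {level}: {n_items} items ({share:.0%} of the test), "
          f"format = {fmt}, about {minutes} min")
print(f"Estimated total testing time: {total_minutes} minutes")

Whether you sketch this on paper, in a spreadsheet, or in a short script, the point is the same: checking the totals against the time available and checking the proportions against the goal framework helps catch a blueprint that over- or under-samples a content area or learning level before any items are written.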
If you are not yet able to identify an instructional goal from your discipline, you may wish to use the framework in the next section (Table 3.1) as the foundation for this draft table of specifications. You may wish to discuss your draft with your group members and give them some feedback on their drafts as well. This is one of the skills found on our Final Project so you may wish to get some practice-with-feedback within your group at this time.
Creating Items from Subordinate Skills and Instructional Objectives
Now we will look at the actual selection of objectives to place on the test and how they are used to develop effective test items. Well-planned and well-written items will tend to contribute to valid and reliable test results. As we practice constructing an actual blueprint, let's trace the path of test items from their subordinate skills to their development into test questions. Remember the goal framework from Module 2? It is copied as Table 3.1 below. It contains the results of the analysis of the instructional goal: Write instructional objectives. Take a moment to review the framework, noticing the content (left-hand row headings) and the learning levels (column headings) that make up the skill. The intersection of the column and row creates the particular subordinate skill, and the contents of the cell at that intersection are what the students will need to know or do. Let's look at Row 2, which is concerned with the "behavior" part of an instructional objective. Now look at Column C, which asks students to state the qualities of the content in question. So the intersection of Column C and Row 2 means that students are asked to state the qualities of the behavior aspect of an instructional objective: C.2 = State the quality characteristics of behavior in an instructional objective. If you asked the student, "What are the qualities of 'behavior' in an instructional objective?" they should then be able to tell you the contents of cell C.2: "It is clear and the behavior is observable and measurable; also, the behavior should be appropriate for my students." While we are using a framework that is on the topic of instructional objectives, try to think of what a framework in your area of study might look like (one on states of matter, history of western civilization, problems of mass and volume, engaging in a social conversation in an alternative language, etc.). The subordinate skills that make up the goal framework will then become instructional objectives. You turn the subordinate skills into instructional objectives by adding the conditions and criteria. Make sure the objective contains all the right pieces: behavior, content, conditions, and criteria.
Table 3.1 Instructional Goal Framework for the Goal: Write instructional objectives.
Learning levels (column headings): Knowledge, Comprehension, Application, and Evaluation, divided into six sub-columns: A. State/Recall Physical Characteristics; B. State/Recall Functional Characteristics; C. State/Recall Quality Characteristics; D. Discriminate Examples and Non-examples; E. Create an example; F. Evaluate given examples.
Content (row headings): 1. Instructional Objective; 2. Behavior; 3. Content; 4. Conditions; 5. Criteria.
1. Instructional Objective. A: A statement of learning outcome that contains 3 - 4 parts (below). B: Serves as the foundation of instructional planning and assessment. C: Clear, appropriate scope. Matches the subordinate skill in content and learning level. D: Discriminate between instructional objectives and instructional activities; instructional objectives and goals. F: (Use criteria from Columns A - C to evaluate given examples.)
2. Behavior. A: The action or "verb" part of the objective. B: Specifies the action part of the skill the student is to perform. Helps guide construction of test items or tasks. C: Clear, observable, measurable. Appropriate for learners. D: Discriminate between behavior and the other parts (content, conditions, criteria). F: (Use criteria from Columns A - C to evaluate given examples.)
3. Content. A: The part of the objective that states the subject matter or topic of learning. B: Identifies the topic or subject matter the student is learning. Serves as the basis of lesson planning, material selection, and test items or tasks. C: Clear, relevant, observable, measurable. Appropriate for learners. D: Discriminate between content and the other parts (behavior, conditions, criteria). F: (Use criteria from Columns A - C to evaluate given examples.)
4. Conditions. A: The part of the objective that specifies equipment and materials. Usually at the beginning of the objective and starts with "Given ...". B: Specifies equipment or materials the learner needs to perform the skill. Assists in setting the level of difficulty. Helps ensure the test item will match the instruction. C: Clear, relevant, practical. Appropriate for learners and context. D: Discriminate between conditions and the other parts (behavior, content, criteria). F: (Use criteria from Columns A - C to evaluate given examples.)
5. Criteria. A: The part of the objective that indicates the mastery level. B: Indicates the level of mastery at which the skill is to be performed (correct 85% of the time; correct to within 3 feet of the target, etc.). C: Clear, observable, measurable. Appropriate for learners and content. D: Discriminate between criteria and the other parts (content, conditions, behavior). F: (Use criteria from Columns A - C to evaluate given examples.)
Now let's review the steps we take in deciding on the type of test question. After considering the goals and subordinate skills within a specific unit, a teacher will then determine the best way to measure students' skills. As you examine the behaviors from the first three columns under the learning levels (Columns A - C: "state/recall") of Table 3.1, try to imagine the best way to measure students' mastery of these lower-order skills. What do we mean by best? In most cases, we mean "what type of instrument or what set of tasks will provide the most valid and reliable results with maximum authenticity, feasibility, and efficiency." (Remember the design characteristics from Module 2?) Now examine the last Column E ("create"). Would you select the same types of items to measure the subskills in this column as you chose for Columns A - C? Probably not. Skills from Columns A - C could be efficiently measured with selected response type items like multiple choice or matching, while Column E would be measured best with a written response item such as a short answer format or even as part of a product exam (such as a portfolio that contains lesson plans and examples of instruments to measure student outcomes following the lessons). Both types of formats can yield valid and reliable results but one type is more authentic than the other and so may be more desirable in this context. It would not be efficient to measure the definitions such as are found in Columns A - C with short answer items (because of the length of time to grade, subjectivity of scoring, etc.). Likewise, the best way to measure the skill "create an instructional objective" would not be with a multiple choice or true false item.
In the next module section, we will study more about creating various objective-style items. Now we can practice tracing the development of items using the subordinate skills found in the framework above. Examine the table below to trace the path of some objective-style items. You will be asked to create an objective and item of your own as well. The table lists the subordinate skill from Table 3.1, an instructional objective for that subordinate skill, and an example of a test item or task to measure the subordinate skill.
Subordinate skill 1.B: State or recall the function of an instructional objective.
  Objective 1.B.1: Given the term "instructional objective" and a list of purposes of related instructional terms, recall the purpose of an instructional objective.
  Item for skill 1.B.1: For what reasons do we use instructional objectives? a. Foundation of lesson plans b. Basis for test items c. Keep accurate attendance d. Both a and b are correct.
  Objective 1.B.2: Given the term "instructional objective", state the purpose.
  Item for skill 1.B.2: What is the purpose for writing an instructional objective? ________________________________
Subordinate skill 1.E: Create an instructional objective.
  Objective 1.E.1: Given a subordinate skill, write a complete instructional objective for the skill.
  Item for skill 1.E.1: Use the following subordinate skill to create an instructional objective: Identify the migratory patterns of birds in the western hemisphere. ________________________________
  Objective 1.E.2: Given a goal framework and instructions for the task of creating an authentic lesson plan and test, create an instructional objective and test item from a skill in the framework.
  Task for skill 1.E.2: (This task would be one part of a larger portfolio assessment that would include the instructions for creating objectives, designing lessons, and creating tests to measure student learning outcomes following the lessons.) e.g., ...For each of the skills in the framework you have selected, write an appropriate instructional objective to serve as the foundation of your lesson plan and test. Review the objective you created to see if it contains the necessary parts...
Subordinate skill 4.D: (Now it's your turn. Create an instructional objective for this subskill: 4.D.) (Yes, now it's time to try your hand at writing a test item for that instructional objective. Don't worry, we will learn more about this in another module section.)
Discuss and post the instructional objective and test item (if you can create some items at this point) for your Final Project under your group's Discussion forum. Review and offer constructive criticism on the postings of your classmates to improve the objectives and test items (you can actually use the criteria found in Table 3.1, Columns A - C to remind yourself of the characteristics of good instructional objectives).
Module 2 - Part 6 Test Design General Principles
While you were reviewing the professional standards related to assessment from your professional organization, did you notice that it was important to be able to develop or select a variety of high quality assessment instruments? For example, review the Assessment standard in the Florida Accomplished Practices. (Remember, links to other professional standards related to assessment are found on Page 1.) These principles will guide you as you are creating or selecting instruments to use in your specific teaching/learning context. In this module, we will consider principles of good test design. The two most important characteristics of a high quality instrument are validity and reliability. They are followed by utility (feasibility, cost, etc.).
The principles described here will help you create instruments that will provide valid and reliable results for your decision-making needs. You can also apply many of these design principles to the state and district-wide tests that you are required to administer and interpret. As professionals, we must evaluate these instruments using design criteria that are based on research and best practice rather than popular opinion. Tests that are well-designed and appropriately administered will yield more useful results for our decision-making purposes. Results from those less well-designed and poorly administered should be interpreted with great caution, if at all. Most authors and researchers in the field of measurement agree on a similar set of principles that guide effective test design and construction. When these principles are employed, the instruments will tend to provide more valid and reliable results than if the principles are not employed. Remember, we are trying to set the students up for success. Nothing, except lack of proficiency, should get in the way of their successful performance. The principles that will help you ensure validity, reliability, and utility of results include:
- Systematically planned
- Good match for content
- Good match for learners
- Feasibility
- Professional in format and appearance
Systematically Planned and Constructed
Teachers must consider their resources (time, materials, cost, professional skill, etc.) along with the entire instructional curriculum and then set up an assessment system at the beginning of the year that will support their instructional system. The two systems go hand in hand and are widely considered two parts of a single Instruction/Assessment system. However you wish to describe it, the assessment system requires a variety of good testing materials and procedures. Effective teachers use good assessment techniques to gather and analyze the information needed for the many decisions they must make. A good assessment system considers the types and number of decisions a teacher must make throughout the term and includes reasonable methods for gathering the information needed to make the most informed decisions. Some methods will be formal, such as objective-style or alternative tests (essay, product, or performance tests, portfolios), and some will be informal, such as anecdotal observations. The plan should be tailored to your specific context and resources. It is a challenge to find the balance between a plan that is both comprehensive and systematic as well as feasible. Using a spreadsheet program or a commercially published instructional manager program may facilitate your effort. Once you have a plan in place, you can then select or create the best instrument for the type of decision to be made, for the content or attribute to be measured, and for the students or participants from whom you will gather information. Specific procedures for constructing an instrument (Test Blueprint and Item Design) will be covered in another section of this module.
Table 2.1 Example of Generic Assessment Plan for a School Term
Columns: What Decisions/Information is Needed? | When to Gather Data?* | How Will Information be Collected?**
1. Quality of student progress on instructional objectives (the plan would list the actual objectives: Objective # keyed to SSS, Individualized Educational Plan short or long term objective #, other objectives specific to the context). When: 1.1 Following individual lessons (include dates); 1.2 Following units, etc. How: Various methods: quizzes with both selected and written response items; portfolio entries; unit exams.
2. Quality of data gathering procedures. Target the instruments that are new, most critical, or have not been subject to evaluation in the past (list the specific instruments that were targeted, e.g. Quiz 3, Portfolio entry #6, etc.). When: 2.1 Following each targeted objective-style exam (review item analysis data); 2.2 Upon completion of grading each targeted alternative-style test. How: Quantitative and logical analysis procedures (keyed to the targeted instruments): use Excel or another spreadsheet program or management tool to calculate difficulty and discrimination indexes, to conduct distractor analysis, and to calculate reliability estimates; use logical analysis by reviewing group performance on the alternative exams and portfolio entries; review blueprints for validity.
3. Quality of instructional techniques and materials. When: 3.1 Following individual lessons (include dates); 3.2 Following units, etc.; 3.3 Following the implementation of a set of materials or techniques. How: Various methods: quizzes with both selected and written response items; portfolio entries; unit exams; student satisfaction questionnaires, interviews, observations; notes from discussions with colleagues.
4. Student attitudes: motivation, satisfaction. When: 4.1 Toward beginning, middle, and end of term (list actual dates). How: Variety of methods: interviews with a sample of students; anonymous questionnaires (see attitudinal attributes listed on report cards); interviews with a sample of parents; informal observation of behavior.
5. Professional self-evaluation: effectiveness of instruction, ethics, satisfaction with job. When: 5.1 Toward beginning, middle, and end of term (list actual dates). How: Variety of methods: peer evaluation (list dates and peers involved); Accomplished Practice indicators; supervisor evaluation (list dates and methods, e.g. interview, observation, etc.); other indicators specific to my context (school, district, professional organization); quantitative and logical analysis of student performance data (specify the specific data sets to be used); student feedback (specify data sets).
6. Others as needed for your professional context. When: Others as needed... How: Others as needed...
*The actual plan would include estimated dates of the data collection events. **The actual plan could include the specific planned instrument or data gathering technique keyed to the decision in Column 1 and the date in Column 2 of the table.
Try conducting a quick search using your favorite search tool (Google, Yahoo, etc.) to find available course management tools that might be useful in your particular teaching/learning context. (I used "instructional management software" for search terms and came up with many hits.) As you can see, educational materials are part of a multi-billion dollar industry with very effective marketing strategies. As a professional, what criteria would you use to select from among the many possibilities? Compare the tools you find in your search with those your district may already be using. Then come back and continue studying the important test design considerations. Will you change your mind after learning more?
Designed to Fit Characteristics of Content
As mentioned previously, it is important to pick the right kind of instrument (or items) for the objectives you are trying to measure. This generally takes some analysis of the instructional content.
You can use one of the educational taxonomies in Taxonomy of Educational Objectives (Chapter 5). This framework serves as the basis of your instruction and assessment for the lesson or unit. It includes, at minimum, the content and learning levels contained within a goal but could be as large in scope as a unit or even a year's worth of instruction. Frameworks may come with your commercially published teaching materials, may be available from your professional organization's resources, or are available online from other sources. If goal frameworks of subordinate skills do not exist for the particular subject or goals you are teaching, it may be worth your time to develop a rudimentary set on your own. They are essential for planning effective instruction and assessment. Table 2.2 is an example framework for the instructional objective: Given a subordinate skill, write an instructional objective. Notice how the content is clearly specified (row headings) and the learning levels at which students are to perform have been identified (column headings). Imagine that you have been asked to teach a workshop for paraprofessionals at your school. The topic is: How to Write Learning Objectives for Planning Lessons and Assessments. As you prepare to design the instructional activities and then the measures that will help determine whether your workshop participants learned anything from the lessons, wouldn't it be useful to have a resource like this? It would help to ensure your lessons were complete (covered all the subskills) and then would help you go on to create a posttest that really matched the instruction. Review Table 2.2 and then locate frameworks of skills in your own subject area and grade level. These may be provided by your school district or you may need to search online. Compare the various frameworks you find for clarity, comprehensiveness, level of content expertise, etc.
Table 2.2 Instructional Goal Framework for the Goal: Write instructional objectives.
Learning levels (column headings): Knowledge, Comprehension, Application, and Evaluation, divided into six sub-columns: A. State/Recall Physical Characteristics; B. State/Recall Functional Characteristics; C. State/Recall Quality Characteristics; D. Discriminate Examples and Non-examples; E. Create an example; F. Evaluate given examples.
Content (row headings): 1. Instructional Objective; 2. Behavior; 3. Content; 4. Conditions; 5. Criteria.
1. Instructional Objective. A: A statement of learning outcome that contains 3 - 4 parts (below). B: Serves as the foundation of instructional planning and assessment. C: Clear, appropriate scope. Matches the subordinate skill in content and learning level. D: Discriminate between instructional objectives and instructional activities; instructional objectives and goals. F: (Use criteria from Columns A - C to evaluate given examples.)
2. Behavior. A: The action or "verb" part of the objective. B: Specifies the action part of the skill the student is to perform. Helps guide construction of test items or tasks. C: Clear, observable, measurable. Appropriate for learners. D: Discriminate between behavior and the other parts (content, conditions, criteria). F: (Use criteria from Columns A - C to evaluate given examples.)
3. Content. A: The part of the objective that states the subject matter or topic of learning. B: Identifies the topic or subject matter the student is learning. Serves as the basis of lesson planning, material selection, and test items or tasks. C: Clear, relevant, observable, measurable. Appropriate for learners. D: Discriminate between content and the other parts (behavior, conditions, criteria). F: (Use criteria from Columns A - C to evaluate given examples.)
4. Conditions. A: The part of the objective that specifies equipment and materials. Usually at the beginning of the objective and starts with "Given ...". B: Specifies equipment or materials the learner needs to perform the skill. Assists in setting the level of difficulty. Helps ensure the test item will match the instruction. C: Clear, relevant, practical. Appropriate for learners and context. D: Discriminate between conditions and the other parts (behavior, content, criteria). F: (Use criteria from Columns A - C to evaluate given examples.)
5. Criteria. A: The part of the objective that indicates the mastery level. B: Indicates the level of mastery at which the skill is to be performed (correct 85% of the time; correct to within 3 feet of the target, etc.). C: Clear, observable, measurable. Appropriate for learners and content. D: Discriminate between criteria and the other parts (content, conditions, behavior). F: (Use criteria from Columns A - C to evaluate given examples.)
After considering the goals and subordinate skills, a teacher will then determine the best way to measure students' progress in relation to those skills. As you examine the behaviors from the first three columns under the learning levels (Columns A - C: "state/recall") of Table 2.2, try to imagine the best way to measure students' mastery of these lower-order skills. What do we mean by best? In most cases, we mean "what type of instrument or what set of tasks will provide the most valid and reliable results with maximum authenticity, feasibility, and efficiency." Now examine the last Column E ("create"). Would you select the same types of items to measure the subskills in this column as you chose for Columns A - C? Probably not. Skills from Columns A - C could be efficiently measured with selected response type items like multiple choice or matching, while Column E would be measured best with a written response item such as a short answer format or even as part of a product exam (such as a portfolio that contains lesson plans and examples of instruments to measure student outcomes following the lessons). Both types of formats can yield valid and reliable results but one type is more authentic than the other and so may be more desirable in this context. It would not be efficient to measure the definitions such as are found in Columns A - C with short answer items (because of the length of time to grade, subjectivity of scoring, etc.). Likewise, the best way to measure the skill "create an instructional objective" would not be with a multiple choice or true false item. In the next module section, we will look at an example of this process.
Designed to Fit Characteristics of Learners
It is very important that instruments are designed so that they are a good match for the characteristics of the learners. What characteristics of learners (in addition to their proficiency with the skill) will have an impact on students' performances? Did you think of factors such as reading and vocabulary level? These are some of the many characteristics to keep in mind during test design and construction. Consider developmental characteristics specific to the age group (attention span, interests, physical dexterity, ...). Keep in mind whether or not students have physical, cognitive, or social-emotional challenges (visual impairment, cerebral palsy, developmental delay or severe mental retardation, a specific learning disability, behavior disorders, ...).
It is especially important to note whether they are receiving Exceptional Student Education (ESE) services and have specific testing accommodations identified on their Individualized Educational Plan (IEP). Consider whether the learners speak English as a second language. Be aware of the cultural backgrounds of students in the group. Their background may influence the way they approach an exam, the way they interpret questions, and/or the way they respond to the tasks or questions. Consider students' experiential backgrounds. For example, if they have not been to a snowy climate then it may not be a good idea to include a snow scenario as background in an item (unless this is a class on weather patterns, of course). Be aware of students' prior experience with any equipment needed to perform the skill (if students practiced with a plastic ball and bat then it would not be good to suddenly produce a real baseball and bat for the actual performance test; similarly, if they practiced writing paragraphs with paper and pencil then you would not provide a PC on which to take the exam unless you knew they have word processing experience).
Designed for Maximum Feasibility
Besides the obvious feasibility issues related to cost, availability of materials, and access to special equipment or contexts (electron microscopes, different sized rooms and tape measures for calculating area, etc.), time is an important factor when it comes to the feasibility of testing procedures. Time to select or design the instrument, time to create it, and time to score students' work all relate to feasibility. Is there time to take the entire class of students with severe developmental delay to the grocery store to measure whether they can shop with a list, make purchases using both coins and paper money, make sensible purchases, and stay within their budget? The most authentic determination of their skills would be to observe them during an actual performance. The logistics of setting up these testing procedures would likely be prohibitive. In this case, a teacher would try to determine the next most authentic context (e.g., a classroom simulation of a grocery store). In this instance, we would lose some authenticity but may gain in reliability. After all, how much attention could you give to the individual student you were trying to observe when you must protect the safety of the other 19 students in the class at the same time?
Professional in Format and Appearance
Here we must consider other important details that may contribute to or inhibit students' successful demonstration of skills during test administration. As effective educators we want our materials to appear professional in quality. Mistakes and lack of clarity can inhibit students' performance. They are distracting to students and may even compromise students' positive attitudes toward test taking. The following are design considerations to keep in mind to achieve a professional-caliber result. It's a good idea to review your testing materials with fresh eyes and even to hunt down some help with this (colleagues, other students, a willing family member - but not Fido):
- absence of bias (culture, race, gender, religion, politics, etc.)
- spelling
- grammar
- legibility
- clarity
Applying these design principles will help ensure your instruments will provide the most valid and reliable results possible when the tests you create or select are administered to students.
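Once a test has been administered, a quick check of how the individual items behaved complements these design principles. The assessment plan in Table 2.1 mentioned using a spreadsheet to calculate difficulty and discrimination indexes; the short Python sketch below illustrates the same arithmetic (the score data are made up, and a spreadsheet works just as well). An item's difficulty index is simply the proportion of students who answered it correctly, and a simple discrimination index is the difference between the proportion correct in the top-scoring half of the class and the proportion correct in the bottom-scoring half.

# Illustrative item analysis for a scored objective test (hypothetical data).
# Rows are students, columns are items; 1 = correct, 0 = incorrect.
scores = [
    [1, 1, 1, 0, 1],
    [1, 1, 0, 0, 1],
    [1, 0, 1, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0],
    [1, 1, 1, 0, 1],
    [0, 1, 0, 0, 0],
    [1, 1, 1, 1, 1],
]

n_students = len(scores)
n_items = len(scores[0])
totals = [sum(row) for row in scores]

# Rank students by total score and compare the top half with the bottom half.
ranked = sorted(range(n_students), key=lambda s: totals[s], reverse=True)
half = n_students // 2
upper, lower = ranked[:half], ranked[-half:]

for item in range(n_items):
    # Difficulty index: proportion of all students who answered the item correctly.
    p = sum(scores[s][item] for s in range(n_students)) / n_students
    # Discrimination index: proportion correct in the upper group minus the lower group.
    p_upper = sum(scores[s][item] for s in upper) / half
    p_lower = sum(scores[s][item] for s in lower) / half
    print(f"Item {item + 1}: difficulty = {p:.2f}, discrimination = {p_upper - p_lower:.2f}")

Items with difficulty values near 0 or 1, or with low or negative discrimination, are worth a second look; often the flaw is in the item or its directions rather than in the students.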
As you gain experience, you will probably start to apply the principles more automatically, but even seasoned professionals benefit from reviewing them occasionally. The time we invest as we create the instruments is worth the payoff in getting the best possible information available to make the important decisions we must make on our jobs every day.
Module 2 Part 7: Creating Objective-style Items
Objective-style items are very widely used and have many advantages when it comes to classroom assessment. They are versatile and can measure any level of learning, although there are some types of learning (psychomotor and attitudes) that they are not able to measure. (An attitudinal survey that uses a Likert-type scale is not considered a multiple-choice test.) Objective-style items include multiple-choice, matching, completion, and alternative-response items like true-false questions. Make sure you have studied the guidelines for writing test items and the advantages and disadvantages of the various item types.
Creating Objective-style Test Items
While reading your text, you have studied the rules or guidelines for creating each of the objective-style items. There are many sets of guidelines available in measurement textbooks and online. You have probably discovered that it takes some practice to write high quality objective items once you have learned that there are guidelines to be followed to ensure validity and reliability. In the table of contents for this module, you will find another set of guidelines by Popham (2003). Now try using the guidelines from our text and the Popham handout to evaluate existing items. Locate a couple of existing tests that contain objective-style items. Try to pick from a couple of different sources: one that you may have created in the past, one that is included in a commercially published text, or one that you find online, for example. Evaluate the quality of the items on the tests using the criteria from the textbook (Chapter 7) and the Popham handout ("Guidelines for Item Construction" under the Materials Tool). You do not need to post or submit this; it is just for your own practice.
Test Instructions and Format
Now we will learn some recommendations for the format of the instructions you include on a test. To the extent possible, you should include these elements in the instructions in this order:
1. Begin the instructions by stating the skill the students are to perform.
2. Direct their attention to any stimulus material they need to respond to the item.
3. Tell them how they are to respond, or how they are to record their answer.
4. Finally, give them any additional information they may need (number of points, amount of time, whether to show their work or not, etc.).
Students who get their instructions in this order will likely do better than students who take a test with instructions that are not in this sequence. Take a look at an example of this sequence applied to a set of matching items. This set comes from the textbook (even though our text author did not use this sequence in the instructions). Recall events associated with United States presidents. Column A describes events associated with United States presidents. Column B contains names of presidents. Find the name that matches the event and write the letter that matches that president in the space beside the event. Each name may only be used once. Column A: Events Column B: Names of United States presidents _____ 1. A president not elected to office. a. Abraham Lincoln _____ 2.
Delivered the Emancipation Proclamation. b. Richard Nixon _____ 3. Only president to resign from office. c. Gerald Ford _____ 4. Only president elected for more than two terms. d. George Washington _____ 5. Our first president e. Franklin Roosevelt f. Theodore Roosevelt g. Thomas Jefferson h. Woodrow Wilson Practice Exercise Now try evaluating and creating a variety of objective-style items on your own. Download the objective item writing practice . It is a practice exercise to evaluate faulty items and then create items on your own. Module 2 - Practice Exercises Norm and Criterion-Referenced Interpretations and Content Validity - click on the Next arrow in the left corner to take the self test Example Tables of Test Specifications - Download this Word document for practice exercise Objective Item Writing Practice - Download this Word document for practice exercise Jump to Navigation Frame Jump to Content Frame Module 3 Overview The concepts in this module are important whether you are using measurement skills as a teacher in a classroom or in another professional role such as school leader, counselor, instructional designer, or researcher. As you begin, consider all the ways that proficiency in measurement and evaluation is vital to your effective professional performance. Consider how these measurement skills can assist you in performing your role, consistent with your professional philosophy, and with high quality information at your fingertips to make effective decisions. These measurement skills are also important for interpreting and conducting research (teacher or school leader action research, school leaders' or private, non-profit evaluation research, scholarly research). Module 3 corresponds to Chapters 8, 9, & 10 in our textbook. We will learn the use of alternative assessments, writing and scoring performance-based tasks, portfolio assessment, and writing and scoring essay items. Content in this module relates to the text but includes content not found in the textbook as well. One of the most important attributes of high quality assessment is the validity of results (the extent that inferences that we make from the results are appropriate). One of the most important steps to ensuring validity is identifying what it is you want to assess (who and what), for what purpose (why), and under what conditions (how). In this module, we will learn skills that will help you enhance the validity of results of alternative tests you use, create, or evaluate for research purposes. The table below contains the objectives, readings, learning activities, and assignments for Module 3. Module 3 focuses on the following objectives: Chapter 8 Identify the types of learning outcomes for which essays are best suited. Objectives Identify situations in which use of essay items is appropriate. Construct a complete extended response essay item, including a detailed scoring scheme that considers content, organization and process criteria. Distinguish between the assessment of knowledge organization and concepts. Chapter 9 Develop a scoring rubric. Develop a primary trait scoring scheme. Identify the primary constraints that must be decided on when developing a performance measure. Compare and contrast student portfolios with other performance assessment measures. Chapter 10 Identify the cognitive skills that will be assessed by student portfolios. Identify the pitfalls that can undermine the validity of portfolio assessment. Prepare instructions to students for how work gets turned in and returned. 
Construct the criteria to use in judging the extent to which the purposes for portfolios are achieved. Complete a portfolio development checklist to ensure the quality of the portfolio.
Readings: Chapters 8, 9, & 10 in the text; content and articles specified within the module; professional standards in your field (see list under the Materials tool); selected student performance standards from the Florida Sunshine State Standards found at http://www.floridastandards.org/index.aspx
Learning Activities: Several non-posted practice tasks; posting to your working group (example objectives, restricted/extended response essay questions with scoring procedures, and critique; example performance assessment with critique)
Assignments: Continue your work on the Final Project (please read the instructions of the Final Project carefully)
Part 1: Essay, Product, & Performance Assessment
Basic Characteristics of Essay Tests
Use the following framework as a study guide for your review and practice with creating essay tests. Content within the cells provides a review of the content from Chapter 8 of the Kubiszyn & Borich text. (Are you gaining a better understanding of goal frameworks? Are you getting closer to being able to develop one on your own?)
Learning levels (column headings): A. State or Recall Physical Characteristics; B. State or Recall Functional Characteristics; C. State or Recall Quality Characteristics; D. Discriminate Examples from Non-Examples; E. Evaluate (flaws to detect); F. Create Examples.
Content (row heading): 1. Essay Test.
  A (physical characteristics): A test with questions to which students supply responses. Test questions, a scoring rubric, and a planned set of procedures for administering and scoring the test. Two types: restricted response (1 page or less) and extended response (>1 and <20 pages).
  B (functional characteristics): Measures complex cognitive skills or processes and communication skills. Able to sample less content. Requires an original response from the student. Relatively easy to construct; requires a longer time to score. Reduces guessing. Requires the student to organize, integrate, and synthesize knowledge. Requires students to use information to solve problems. No single correct answer; subject to bluffing. Less reliable than objective tests. Requires that the scorer is knowledgeable, possibly expert.
  C (quality characteristics): Enables consistency in scoring when the question is clear, provides appropriate structure to students' responses, and contains appropriate guidance and organizational information (specifies response length, # of points or amount of time, and other scoring criteria that will be used).
  D (discriminate examples from non-examples): Essays from objective items (for both objectives and items); essays from active performance; essays from non-written products.
  E (flaws to detect): Mismatch of format with objective; inappropriately measures lower-level skill; unclear; fails to specify length or the criteria on which the response will be graded.
  F (create examples): Restricted and extended types at various levels of complexity for various age groups, content areas, and student types.
Now that you have reviewed the features of essay tests, examine the following benchmarks from various instructional standards (e.g., Sunshine State Standards). In the table that follows, you will find pairs of standards within a subject area. Within the pairs, determine which would be better measured with objective-style items and which would be more appropriately measured with essay items. Suggested feedback can be found in a table immediately following the examples. Learning Outcome / Type of test?
(objective or essay) Example 1 (Pre K - 2; Technology Standards) Objective: Essay: Prior to completion of Grade 2 students will communicate about technology using developmentally appropriate and accurate terminology. #___ #___ Use technology resources (e.g., puzzles, logical thinking programs, writing tools, digital cameras, drawing tools) for problem solving, communication, and illustration of thoughts, ideas, and stories. International Society for Technology in Education (2005). Standards for students. Retrieved February 10, 2005 from http://cnets.iste.org/currstands/ Example 2 (Grades 9 - 12; Social Studies Standards) Objective: Essay: Understands how government taxes, policies, and programs affect individuals, groups, businesses, and regions. #___ #___ Understands basic terms and indicators associated with levels of economic performance and the state of the economy. Florida Department of Education (2005). Sunshine state standards: Social studies grades 9 - 12. Economics standard 2. Retrieved February 10, 2005 from http://www.firn.edu/doe/curric/prek12/index.html Example 3 (Grades 3 - 5; Language Arts) Objective: Essay: Identifies the author's purpose in a simple text. #___ #___ Reads and organizes information for a variety of purposes, including making a report, conducting interviews, taking a test, and performing authentic work. Florida Department of Education (2005). Sunshine state standards: Language arts grades 3 - 5. Reading standard 2. Retrieved February 10, 2005 from http://www.firn.edu/doe/curric/prek12/index.htm Example 4 (Grades 6 -8; Science) Objective: Essay: Knows that the structural basis of most organisms is the cell and most organisms are single cells, while some, including humans, are multicellular. #___ Knows that behavior is a response to the environment and influences growth, #___ development, maintenance, and reproduction. Florida Department of Education (2005). Sunshine state standards: Science grades 6 - 8. Processes of life standard 1. Retrieved February 10, 2005 from http://www.firn.edu/doe/curric/prek12/index.html Example 5 (Grades 9 - 12; The Arts: Music) Objective: Essay: Understands the musical elements and expressive techniques (e.g., tension and release, tempo, dynamics, and harmonic and melodic movement) that generate aesthetic responses. #___ #___ Analyzes music events within a composition using appropriate music principles and technical vocabulary. Florida Department of Education (2005). Sunshine state standards: The arts: Music grades 9 - 12. standard 2. Retrieved February 10, 2005 from http://www.firn.edu/doe/curric/prek12/index.html Example 6 (Grades 4 - 8; ESL) Objective: Essay: Identify and associate written symbols with words (e.g., written numerals with spoken numbers, the compass rose with directional words). #___ #___ Take a position and support it orally or in writing. Teachers of English to Speakers of Other Languages (2005). ESL Standards for Pre-K12 Students, Online Edition. Retrieved February 10, 2005, from http://www.tesol.org/s_tesol/seccss.asp?CID=113&DID=1583 Compare your responses with those in the feedback table below. Type of test? Learning Outcome (objective or essay) Example 1 (Pre K - 2; Technology Standards) Objective: Essay: Prior to completion of Grade 2 students will communicate about technology using developmentally appropriate and accurate terminology. 
#_1_ #_2_ Use technology resources (e.g., puzzles, logical thinking programs, writing tools, digital cameras, drawing tools) for problem solving, communication, and illustration of thoughts, ideas, and stories. International Society for Technology in Education (2005). Standards for students. Retrieved February 10, 2005 from http://cnets.iste.org/currstands/ Example 2 (Grades 9 - 12; Social Studies Standards) Objective: Essay: Understands how government taxes, policies, and programs affect individuals, groups, businesses, and regions. #_2_ #_1_ Understands basic terms and indicators associated with levels of economic performance and the state of the economy. Florida Department of Education (2005). Sunshine state standards: Social studies grades 9 - 12. Economics standard 2. Retrieved February 10, 2005 from http://www.firn.edu/doe/curric/prek12/index.html Example 3 (Grades 3 - 5; Language Arts) Objective: Essay: Identifies the author's purpose in a simple text. #_1_ #_2_ Reads and organizes information for a variety of purposes, including making a report, conducting interviews, taking a test, and performing authentic work. Florida Department of Education (2005). Sunshine state standards: Language arts grades 3 - 5. Reading standard 2. Retrieved February 10, 2005 from http://www.firn.edu/doe/curric/prek12/index.htm Example 4 (Grades 6 -8; Science) Objective: Essay: Knows that the structural basis of most organisms is the cell and most organisms are single cells, while some, including humans, are multicellular. #_1_ #_2_ Knows that behavior is a response to the environment and influences growth, development, maintenance, and reproduction. Florida Department of Education (2005). Sunshine state standards: Science grades 6 - 8. Processes of life standard 1. Retrieved February 10, 2005 from http://www.firn.edu/doe/curric/prek12/index.html Example 5 (Grades 9 - 12; The Arts: Music) Objective: Essay: Understands the musical elements and expressive techniques (e.g., tension and release, tempo, dynamics, and harmonic and melodic movement) that generate aesthetic responses. #_1_ #_2_ Analyzes music events within a composition using appropriate music principles and technical vocabulary. Florida Department of Education (2005). Sunshine state standards: The arts: Music grades 9 - 12. standard 2. Retrieved February 10, 2005 from http://www.firn.edu/doe/curric/prek12/index.html Example 6 (Grades 4 - 8; ESL) Objective: Essay: Identify and associate written symbols with words (e.g., written numerals with spoken numbers, the compass rose with directional words). #_1_ #_2_ Take a position and support it orally or in writing. Teachers of English to Speakers of Other Languages (2005). ESL Standards for Pre-K12 Students, Online Edition. Retrieved February 10, 2005, from http://www.tesol.org/s_tesol/seccss.asp?CID=113&DID=1583 How did you do? You may want to go to the Students Helping Students Discussion topic for some peer feedback if you disagreed with many of the choices. You might also want to check with your group members in case they need a little help. Remember, just because these objectives contrasted lower and higher order skills for discriminating objective and essay type items, it doesn't mean you can't write higher-order objective style items. 
This is a common misconception based on poor practices from the past (objective style items were written primarily at the knowledge and comprehension level; skilled objective style item writers like you can write multiple written- and selected-response items at higher learning levels). The Nitko & Brookhart (2007) textbook has better coverage than most textbooks on how to write objective-style items at higher cognitive levels. You may want to consult that text for your item construction in the future.

As a test designer, it is important to think about the levels of learning that are intended by the behaviors in the Sunshine State Standards. When the behaviors are ambiguous, the learning activities and test items are less likely to be congruent with the standards. This can become a challenge to the validity of test results if care is not taken to clarify before planning learning activities and, in turn, the best instruments to measure students' performance. In Part 2 of this module we will contrast the two types of essay items (restricted and extended response). We are continuing to learn how to select the best measurement procedure for the targeted skills and the needs of our learners. This is critical to the validity of test results.

Part 2: Extended and Restricted Response Formats

Now that you are familiar with the features of essay items in general, examine the differences between the restricted response and extended response varieties.

Distinguish Between Restricted and Extended Response Essay Items

Use the goal framework that follows to review the characteristics of restricted and extended response essay items. Consider educational contexts with which you are familiar. Think about when the restricted response format is more appropriate and when the extended response is more appropriate in those contexts. Try to imagine some examples of the different formats as you are working through the contexts.

Content: 1. Restricted Response Essay

A. State or Recall Physical Characteristics: Test with questions to which students supply responses. Test questions, scoring rubric, planned set of procedures for administering and scoring the test. Restricted response (1 page or less).

B. State or Recall Functional Characteristics: Often used in conjunction with objective style items. Due to time constraints, better to use with small classes, in smaller numbers, or when fewer objectives need to be covered. Causes students to recall and organize information, then to draw conclusions and present them within imposed constraints of time and length. Often used to assess knowledge, comprehension, and application level skills. Good to use when test security is an issue. Good to use when information must be supplied rather than recognized or selected. Used more frequently than extended response items.

C. State or Recall Quality Characteristics: Can cover somewhat more content than extended response. Can be scored with more reliability than extended response.

D. Discriminate Examples from Non-Examples: Restricted from the objective variety. Restricted from the extended variety.

E. Evaluate (Flaws to detect): Mismatch of format with level of objective. Mismatch with content. Unclear. Fails to specify length or the criteria on which the response will be graded.

F. Create Examples: Restricted variety for various age groups, content areas, and student types.
Content: 1. Extended Response Essay

A. State or Recall Physical Characteristics: Test with questions to which students supply responses. Test questions, scoring rubric, planned set of procedures for administering and scoring the test. Student determines length and complexity of response. Longer responses than restricted response items (often more than one page and usually less than 20 pages).

B. State or Recall Functional Characteristics: Often used to assess analysis, synthesis, and evaluation level skills. Causes students to use higher order cognitive skills. The student must assemble and critically analyze information and use it to solve new problems; students synthesize concepts and principles and then predict or evaluate outcomes. Can be used to evaluate students' communication skills.

C. State or Recall Quality Characteristics: Takes relatively more time to develop and score. Relatively more difficult to score. Requires more time and resources of students.

D. Discriminate Examples from Non-Examples: Extended from the restricted variety. Extended from active performance. Extended from the objective variety.

E. Evaluate (Flaws to detect): Mismatch of format with level of objective (measures lower-level skill). Mismatch with content. Unclear. Fails to specify length or the criteria on which the response will be graded.

F. Create Examples: Extended variety at various levels of complexity, for various age groups, content areas, and student types.

Recall that an essay test really consists of two parts. The first part is the set of questions and instructions for students and the second part is the scoring procedures (checklist or rubric and the steps or instructions that will be followed by the rater). We will practice with the questions part first. Examine the essay questions below. The objectives from which they were written are included so that you can determine the congruence between the skills specified in the objective (behavior, content, conditions, and criteria) and the actual item. Notice how some objectives are more appropriately measured with the shorter restricted response items while others are more appropriately measured with the more complex extended response items.

Restricted Response Example 1

Objective: Given the terms for selected music elements and expressive techniques used by composers (e.g., tension and release, tempo, dynamics, and harmonic and melodic movement), explain the aesthetic responses a composer would expect they would generate.

Restricted Response Item Set: Explain the aesthetic responses that a composer would be able to generate with each of the following music elements. For each of the terms, briefly explain the aesthetic response in the space provided. Each element is worth 2 points. Keep answers brief. It is not necessary to use complete sentences but correct spelling is required. Complete all five elements. 1. tension and release 2. tempo 3. dynamics 4. harmonic movement 5. melodic movement

Extended Response Example 1

Objective (for teacher's planning and test development purposes): Given a musical piece from a widely known composer of the Romantic Era, analyze the piece for the five music elements and the way they were employed by the composer.

Item (to be administered as an essay test to the students): Listen to Hungarian Rhapsody #2 Lento a Capriccio in C# by Franz Liszt. As you listen to the piece, identify an example of how each of the five music elements were used and explain the music principle associated with each. Then explain how Liszt used each of the elements in this piece to evoke an aesthetic response by the listener.
You will have one hour to listen to the piece and write your explanation. You may listen to the entire piece or parts of the piece as many times as needed. It is about 10 minutes in length so be mindful of the time. The test is worth 25 points (for each of the five elements: 1 point for correctly identifying the element, 1 point for including the correct music principle behind the element, and 3 points for analyzing the composer's use of the element within the piece). Make sure you identify clearly where each of the elements within the piece is located. Present your response in essay format with paragraphs and complete sentences. Spelling and grammar will not be graded on this test.

Create Restricted and Extended Response Essay Items

To ensure the validity and reliability of the essay test results, it is helpful to design the items using guidelines developed from research and best practice. Review the suggestions for creating essay items in the list below. You will then be asked to practice creating items on your own.

Suggestions for writing essay questions:
Clearly identify learning outcomes (content and learning levels) to be measured by the test.
Create items that demonstrate content validity (items match objectives in content and learning level).
Create items that are a good match for the characteristics of the target students.
Create questions that clearly delineate the task students are to perform.
Explain tasks, time, scope, and point values: orally, in the instructions included with the overall test, or within the individual test items.
Indicate whether spelling, grammar, and organization will count toward the overall score.
Indicate the scope of the response and to what extent supporting data is required.
Create questions that elicit higher-order rather than knowledge and comprehension level responses.
Demonstrate ethics in content and process of test administration and scoring.
Allow reasonable time to complete the test; indicate to students the amount of time allowed.
Use when objective items would be inadequate or inappropriate.
Refrain from offering optional items.

Suggestions for developing scoring procedures:
Specify criteria for poor, acceptable, and excellent responses ahead of time.
Determine the format needed (checklist or rating scale) based on content and learning levels.
Identify the elements that will be scored (content, organization, process of solving the problem or drawing the conclusion).
Strive for consistency; avoid drift from content; avoid drift of rigor (strict/lax).
Avoid biases.
Keep work anonymous while grading when possible.
Score all students in the group on each question before moving on to the next question.
Avoid influence of prior questions' scores within a student's paper.
Reevaluate the scores (or at least part of them) before returning the papers.

Now practice creating restricted and extended response essay questions on your own. Use a word-processing program to list an objective and the essay question that would be derived from it. Hang onto the question for now. In the next module section, you will be asked to create a scoring rubric to post in your group's discussion area. It's not necessary to use this same table format, just make sure it is apparent which items and objectives belong together. You may wish to review the list of outcomes for which restricted response items are recommended in Chapter 8 of the text to help you get started.
Instructional objectives Essay items 1a.(Locate an instructional objective from a set with which you are familiar, e.g., Sunshine State Standards, that would be more appropriately measured with a restricted response essay item.) 1b.(Now create a restricted response essay question following the guidelines from your text and the table above.) 2a.(Locate an instructional objective from a set with which you are familiar, e.g., Sunshine State Standards, that would be more appropriately measured with an extended response essay item.) 2b.(Now create an extended response essay question following the guidelines from your text and the table above.) Part 3: Scoring Essays Essay Item Scoring Procedures Now that you are familiar with the features of restricted and extended response essay questions, it is time to create scoring procedures for evaluating students' work. Keep in mind that many of the recommendations for creating scoring procedures for essay items will apply to scoring other types of product and performance measures as well (e.g., observation rating scales, rubrics for portfolio exhibits, product exam rubrics). Depending on the complexity of the response, you may use either a checklist or rating scale format to evaluate the student's work. There are certain features in the design of checklists and rating scales that enhance the reliability of scoring. A checklist is used when the elements of the response are more easily observed and are either present or absent (all there or not there at all). If the level of quality and not just the presence or absence of the element is to be rated, a rating scale format would be preferred. If "degrees of correctness" are to be rated, i.e. the element is present and rated with quality categories like "needs improvement," "adequate," and "very good," then a rating scale format would be more desirable than a checklist. Kubiszyn & Borich present another set of features that many use when scoring essays. These include content, organization, and process. An example of this process can be found in Table 7.2 in the text. Please review the example essay test found there. Others use checklists and rating scales such as are found in Figures 8.5 - 8.8 on pp. 173 - 175 of Chapter 8 in the textbook. Next we will examine suggested scoring procedures for the example items presented in Part 2 of this module. Notice the overall format of the scoring guide, the format of the components being rated, and the quality categories. Each choice the designer makes concerning the format of the scoring guide will have an impact on the validity and reliability of instrument results. Restricted Response Item Scoring Procedures Example Scoring Procedure for the Restricted Response Essay Item Recall the Objective that was presented earlier: Given the terms for selected music elements and expressive techniques used by composers (e.g., tension and release, tempo, dynamics, and harmonic and melodic movement), explain the aesthetic responses a composer would expect they would generate. Recall the Test Question that was presented earlier: Explain the aesthetic responses that a composer would be able to generate with each of the following music elements. For each of the terms, briefly explain the aesthetic response in the space provided. Each element is worth 2 points. Keep answers brief. It is not necessary to use complete sentences but correct spelling is required. Complete all five elements. 1. tension and release 2. tempo 3. dynamics 4. harmonic movement 5. 
melodic movement

Now examine the scoring rubric that might be developed for this question.

Checklist for the Aesthetic Elements of Music Short Essay Test
Name: _________________________________   Date: _____
Total Score: ______________
Components:
1. Tension and release: Not present or incomplete (0) / Somewhat complete (1) / Complete (2)   _____
2. Tempo: Not present or incomplete (0) / Somewhat complete (1) / Complete (2)   _____
3. Dynamics: Not present or incomplete (0) / Somewhat complete (1) / Complete (2)   _____
4. Harmonic movement: Not present or incomplete (0) / Somewhat complete (1) / Complete (2)   _____
5. Melodic movement: Not present or incomplete (0) / Somewhat complete (1) / Complete (2)   _____
Comments:

Extended Response Item Scoring Procedures

Now examine a suggested scoring procedure for the extended response essay item example.

Example Scoring Procedure for the Extended Response Essay Item

Recall the Objective that was presented earlier: Given a musical piece from a widely known composer of the Romantic Era, analyze the piece for the five music elements and the way they were employed by the composer.

Recall the Test Question that was presented earlier: Listen to Hungarian Rhapsody #2 Lento a Capriccio in C# by Franz Liszt. As you listen to the piece, identify an example of how each of the five music elements were used and explain the music principle associated with each. Then explain how Liszt used each of the elements in this piece to evoke an aesthetic response by the listener. You will have one hour to listen to the piece and write your explanation. You may listen to the entire piece or parts of the piece as many times as needed. It is about 10 minutes in length so be mindful of the time. The test is worth 25 points (1 point for correctly identifying the element; 1 point for including the correct music principle behind the element; 3 points for analyzing the composer's use of the element within the piece). Make sure you identify clearly where each of the elements within the piece is located.

Now examine the scoring rubric that might be developed for this question.

Rating Scale for the Aesthetic Elements of Music Extended Essay Test
Name: _________________________________   Date: _____
Total Score: ______________
Components:
1. Tension and release: Present (1) ___   Principle (1) ___   Explanation: Ineffective (1) / Somewhat effective (2) / Effective (3)
2. Tempo: Present (1) ___   Principle (1) ___   Explanation: Ineffective (1) / Somewhat effective (2) / Effective (3)
3. Dynamics: Present (1) ___   Principle (1) ___   Explanation: Ineffective (1) / Somewhat effective (2) / Effective (3)
4. Harmonic movement: Present (1) ___   Principle (1) ___   Explanation: Ineffective (1) / Somewhat effective (2) / Effective (3)
5. Melodic movement: Present (1) ___   Principle (1) ___   Explanation: Ineffective (1) / Somewhat effective (2) / Effective (3)
Subtotals: ____/10   ____/15
Comments:

A. Read the first article in the list that follows. Note the others for your resources and future reference. In addition to being useful resources, they may provide some good examples as you begin the Project Part B.
1. Tierney, Robin, & Simon, Marielle (2004). What's still wrong with rubrics: Focusing on the consistency of performance criteria across scale levels. Practical Assessment, Research & Evaluation, 9(2). Retrieved May 22, 2007 from http://PAREonline.net/getvn.asp?v=9&n=2.
2. Moskal, Barbara M. (2000). Scoring rubrics: What, when and how? Practical Assessment, Research & Evaluation, 7(3).
Retrieved May 22, 2007 from http://PAREonline.net/getvn.asp?v=7&n=3.
3. Mertler, Craig A. (2001). Designing scoring rubrics for your classroom. Practical Assessment, Research & Evaluation, 7(25). Retrieved May 22, 2007 from http://PAREonline.net/getvn.asp?v=7&n=25.

Part 4: Performance Assessment

Module 3 is related to Chapters 8, 9, and 10 in the Kubiszyn & Borich (2009) textbook. Our Final Project Part B uses the information and skills in these chapters and this module as the basis for constructing a performance assessment. You may want to read the Final Project Part B instructions after reading Chapters 8, 9, and 10 in your text. Also, there are performance assessment resources relevant to these skills under the Materials tool.

Performance-Based Assessment Background

You may find it helpful to build a table similar to the one below to organize important concepts related to designing performance-based assessment.

Describe the four steps involved in construction of a performance assessment.* Step 1: Step 2: Step 3: Step 4:
Describe the three components of a good performance assessment.** Component 1: Component 2: Component 3:

*Do you recognize this as content for a cell under the column State/Recall the Physical Characteristics if we were to create a framework of subordinate skills related to performance-based assessment?
**Do you recognize this as content for a cell under the column State/Recall the Quality Characteristics if we were to create a framework of subordinate skills related to performance-based assessment?

Measuring the Five Learner Accomplishments

Note the five types of learner accomplishments that can/should be measured with performance-based assessments. Locate a set of instructional standards (e.g., Sunshine State Standards) of interest to you. For each of the types of learner accomplishments in the table below, locate a standard that would call for that type of learner accomplishment. Select examples that you would be able to use as the basis for developing a performance measure (instructions and scoring scheme). You will later be asked to try your hand at actually developing a performance assessment. An example of each type of standard has been included to help you get started.

Learner Accomplishments / Standard Representing the Type of Accomplishment

Products
Example: Designs and performs real-world statistical experiments that involve more than one variable, then analyzes results and reports findings. Florida Department of Education (2008). Sunshine State Standards Mathematics Grades 9 - 12: Data Analysis and Probability Standard 2. Retrieved September 20, 2008 online at http://www.paecfame.org/math_standards/Math_Standards_HighSchool.pdf
Your example:

Complex cognitive processes
Example: Uses a variety of maps, geographic technologies including geographic information systems (GIS) and satellite-produced imagery, and other advanced graphic representations to depict geographic problems. Florida Department of Education (2008). Sunshine State Standards Social Studies Grades 9 - 12: People Places and Environments Standard 1. Retrieved September 20, 2008 online at http://www.floridaconservation.org/panther/teachers/plans/lesson17.pdf
Your example:

Observable performance
Example: Students practice responsible use of technology systems, information, and software. International Society for Technology in Education (2005). Standards for students.
Retrieved September 20, 2008 from http://www.doe.virginia.gov/VDOE/Superintendent/Sols/compteck12.doc
Your example:

Habits of mind
Example: Identifies specific personal listening preferences regarding fiction, drama, literary nonfiction, and informational presentations. Florida Department of Education (2008). Sunshine State Standards Language Arts Grades 3 - 5: The student uses listening strategies effectively, Standard 2. Retrieved September 20, 2008 online at http://sage.pinellas.k12.fl.us/htm/SSS/SSS_GLE_Lang_3-5.htm
Your example:

Social skills
Example: Recognizes the benefits that accompany cooperation and sharing. Florida Department of Education (2008). Sunshine State Standards Health Education & Physical Education PreK-2: Advocate and Promote Physically Active Lifestyles, Standard 2. Retrieved September 20, 2008 online at http://www.ed.uiuc.edu/ylp/9495/PE-Benchmarks.html
Your example:

(You may want to consider one of your examples for your Project Part B learning standard.)

The following are some URLs for sites that contain examples of commercially published and teacher-made rubrics and checklists. (Found via Internet search tool using "teacher created rubrics".)
http://www.uwstout.edu/soe/profdev/rubrics.shtml#powerpoint
http://school.discovery.com/schrockguide/assess.html#rubrics
http://www.rubrician.com/science.htm
http://k6educators.about.com/gi/dynamic/offsite.htm?site=http%3A%2F%2Fwww.4teachers.org%2Fprojectbased%2Fchecklist.shtml

Select two of the standards (or benchmarks contained within them) you identified for the table above and create a performance assessment for each. Consider trying one cognitive and one affective topic to differentiate the advantages/challenges of each. Before you begin, specify the subject area, grade level, and a brief description of the context (e.g., 4th grade science, self-contained classroom of 25 students, heterogeneous group, five students participating in ESOL program, three students participating in ESE program, two-week instructional unit). Exchange one of your assessments with a member of your group and offer constructive criticism using the criteria we have learned in Chapter 8 of the text.

Part 5: Portfolio Assessment

Portfolio Advantages and Disadvantages

As you have probably understood from your reading, portfolios have both strengths and weaknesses when it comes to classroom assessment. Portfolio assessment is another important tool that we must have available in order to design a comprehensive and high quality assessment system in our educational context (classroom, program, district, etc.). Read the following research articles and relate the findings to your personal experiences with alternative assessment.

Title: Impact of a content selection framework on portfolio assessment at the classroom level. By: Simon, Marielle, & Forgette-Giroux, Renee. Assessment in Education: Principles, Policy & Practice, 0969594X, Mar 2000, Vol. 7, Issue 1. Database: Academic Search Premier
Title: Developing a valid and reliable portfolio assessment in the primary grades: Building on practical experience. By: Shapley, Kelly S., & Bush, M. Joan. Applied Measurement in Education, 0895-7347, April 1, 1999, Vol. 12, Issue 2. Database: Academic Search Premier

Complete the following steps for development of a portfolio assessment in a learning context familiar to you. Use instructional objectives that would be relevant for your professional context.
If you have been planning to conduct a portfolio assessment in your own learning context, this would be a good opportunity to begin the developmental work to ensure the validity and reliability of the instrument results. If you are already using portfolios, this would be a good opportunity to make any necessary revisions. This is for your practice and does not need to be submitted to the discussion area or drop box. However, this will help you with your Project Part B. Identify the purposes that you would want a portfolio in your grade or content area to achieve: Step 1 1.___________________________________________________________________________ ____________________________________________ 2.___________________________________________________________________________ ____________________________________________ 3.___________________________________________________________________________ ____________________________________________ 4.___________________________________________________________________________ ____________________________________________ Identify the cognitive learning outcomes (e.g. metacognitive skills), several important behaviors (e.g. self-reflection, planning), and significant dispositions (e.g. flexibility, persistence) that will be reflected in your learners' portfolios: Outcomes:______________________________________________________________________ ________________________________________ _______________________________________________________________________________ _______________________________________ Step 2 Behaviors:_______________________________________________________________________ ______________________________________ _______________________________________________________________________________ ______________________________________ Dispositions:_____________________________________________________________________ _____________________________________ _______________________________________________________________________________ _____________________________________ Now identify in what general curricular area you will plan your portfolio (science, geography, reading, math) and describe how you will make decisions about which content areas and how many samples within each area to include. Be sure to include several categories of content from which learners will choose representative samples: Step 3 Curricular area:___________________________________________________________________________ _______________________________ _______________________________________________________________________________ ________________________________________ Content areas:__________________________________________________________________________ __________________________________ _______________________________________________________________________________ ________________________________________ Number of samples:________________________________________________________________________ _______________________________ _______________________________________________________________________________ ________________________________________ Prepare a rubric for one of the content areas identified in Step 3. 
Also, indicate the type of scale for rating the portfolio as a whole: Step 4 Rubric:_________________________________________________________________________ _______________________________________ Scale:__________________________________________________________________________ _______________________________________ Now you are ready to choose a procedure to aggregate all portfolio ratings and to assign a grade to the completed portfolio. Decide how you will weight (1) drafts when computing a content area rating, (2) content area ratings when they are averaged, and (3) your rating of the whole portfolio with the average rating of the content areas: 1. Drafts:__________________________________________________________________________ _____________________________________ _______________________________________________________________________________ _______________________________________ Step 5 2. Content areas:__________________________________________________________________________ _______________________________ _______________________________________________________________________________ _______________________________________ 3. Whole portfolio:________________________________________________________________________ _______________________________ _______________________________________________________________________________ _______________________________________ Step Finally, describe how you will handle the following logistical issues: 6 Timelines:_______________________________________________________________________ ______________________________________ _______________________________________________________________________________ ______________________________________ How products are turned in and returned:_______________________________________________________________________ ______________ _______________________________________________________________________________ ______________________________________ Where final products are kept:___________________________________________________________________________ __________________ _______________________________________________________________________________ ______________________________________ Who has access to the portfolio:________________________________________________________________________ ____________________ _______________________________________________________________________________ _______________________________________ Here are other resources that may be useful in your quest to design or select high quality alternative assessments. Due to copyright constraints, they are not links but addresses only. These are just for your reference and to offer models and perspectives from a variety of interests and contexts. Select at least one to explore in depth and make yourself aware of the others for possible future use. ERIC/OSEP Special Project News Brief. Making alternative portfolio assessment a success. http://ericec.org/osep/newsbriefs/news17.html Browse through Dr. Helen Barrett's favorite links on alternative assessments and portfolios. Beware of information overload, there are a large number of resources here; but be sure to note some related to use of technology for portfolio/alternative assessment development. http://electronicportfolios.com/portfolios/bookmarks.html Northwest Educational Resource Laboratory Assessment Scoring Guides. A variety of scoring guide resources, note those for Spanish writing, young children, and group assessment. 
http://www.nwrel.org/assessment/scoring.php
Intercultural Development Research Association Newsletter. Portfolios in Secondary ESL Classroom Assessment: Bringing It All Together (1993). While it appears dated, this article raises important issues that remain current. http://www.idra.org/Newslttr/1993/Nov/Adela.htm
CRESST Performance Assessment Models: Assessing Content Area Explanations. Comprehensive resource; while dated 1992, the models (see p. 14) are current and helpful. http://cresst96.cse.ucla.edu/CRESST/Sample/Perm.pdf (skip the first cold link, test preparation samples, and check out the list by subject and grade level for useful examples).
Tucker et al. (2003). The efficacy of portfolios for teacher evaluation and professional development: Do they make a difference? Educational Administration Quarterly, 39, 572-602. Abstract and full article available at http://eaq.sagepub.com/cgi/content/abstract/39/5/572

Module 4 Overview

The concepts in this module are important whether you are using measurement skills as a teacher in a classroom or in another professional role such as school leader, counselor, instructional designer, or researcher. Consider how these measurement skills (evaluating the technical quality of instruments and designing fair and appropriate marking systems) can assist you in performing your role, consistent with your professional philosophy, and with high quality information at your fingertips in order to make effective decisions. These measurement skills, describing and evaluating technical characteristics of tests and creating marking systems, are also important for interpreting and conducting research (teacher or school leader action research, school leaders' or private, non-profit evaluation research, scholarly research).

Module 4 corresponds to Chapters 11 & 12 in our textbook. We will learn ways to administer a test, evaluate test quality using item analysis, and create a marking system to grade student performance fairly. Content in this module relates to the text but includes content not found in the textbook as well. The most important attributes of high quality assessment include the validity and reliability of results (the extent to which the inferences we make from the results are appropriate and the extent to which results are consistent and accurate). In order to accurately evaluate student or program performance and make appropriate summative inferences, we must have high quality data. Conducting item and test analysis will help to ensure you are making decisions with the best possible information. The table below contains the objectives, readings, learning activities, and assignments for Module 4.

Module 4 focuses on the following objectives:

Chapter 11
Discriminate between quantitative and qualitative item analysis.
Identify multiple-choice options in need of modification, given quantitative item analysis data.
Identify acceptable ranges for item difficulty levels and discrimination indices.
Compute quantitative item analysis data for criterion-referenced tests using the modified norm-referenced procedures described in the text.
Interpret these data to assess the appropriateness of criterion-referenced test items.

Chapter 12
Describe the problem of mixing factors other than achievement into marks.
Define and discriminate among the five marking systems presented in the text.
Define and discriminate among the different symbol systems presented in the text.
Describe the procedure suggested (i.e., equate before you weight) in the text to minimize the likelihood that such distortions will affect final marks.
Describe the front-end and back-end equating procedures used to combine performance measures and traditional measures into a single mark.

Readings: Chapters 11 & 12 in the text; content and articles specified in the module; Student Evaluation Standards (2003) - see link within module; professional standards in your field (see list under Materials tool), as needed; selected student performance standards from the Florida Sunshine State Standards, found at http://www.floridastandards.org/index.aspx (as needed)

Learning Activities: Several non-posted practice tasks within the module; also, sets of practice exercises found under Table of Contents; posting to the class discussion topic on Grading Plan and Item Analysis as needed to compare practice feedback.

Assignments: Start Final Project Part B

Module 4 Part 1: Evaluating Test Quality

Module 4 Part 1 corresponds to Chapter 11 in the Kubiszyn and Borich (2007) textbook. Please read the chapter and review the Power Points (click on the link) before beginning this section. You may want to have a calculator and some scratch paper handy. These tasks are for your practice; they do not need to be uploaded or posted. It would be a good idea to compare your responses with your group members if you have any confusion.

In this section of the module, we are practicing item analysis skills. These are the skills test designers use to evaluate the quality of each individual item on the test. It is especially useful when examining the quality of newly created items to ensure they are functioning as planned. You have learned good test design and construction guidelines including item-writing recommendations, creating good test directions, and appropriate test administration procedures. For this practice, we will assume students have experienced high quality instruction. It is now time to implement procedures that enable us to evaluate the quality of the items on the test. As we are learning this material, keep in mind that ultimately, from a criterion-referenced perspective, a teacher would like every student to answer every item on the test correctly. In other words, we want all students to learn all of the skills that are represented by the sample of items on the test. But does this always (ever?) happen? Not usually. So then we examine the test item data to determine whether the obtained results were reasonable under the circumstances. In making a judgment about the functioning of the items in the specific context, you not only are able to evaluate test quality, you will also gain information about the performance of the students and the quality of instruction.

Difficulty Analysis

The difficulty index is an indication of how difficult a specific test item was for the group. It represents the proportion of students answering an item correctly. The difficulty index ranges from .00 (everyone got it wrong) to 1.00 (everyone got it right). The formula for p is the number of students answering the item correctly (R) divided by the number of students answering the item (n), or p = R / n. Specific criteria are used to interpret the difficulty index, depending on the test context (e.g., norm-referenced vs. criterion-referenced).
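For example (with hypothetical numbers, not taken from the text): if 17 of the 20 students in a class answer an item correctly, the item's difficulty index is p = 17/20 = .85.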
The process for interpreting the difficulty index for each item involves identifying the range of difficulty levels that would be reasonable for this group of students with this instructional objective. You then compare the obtained index with what would be reasonable to decide whether the item is functioning well or seems problematic. If the item difficulty level seems either too easy or too difficult for the students under the acknowledged conditions (achievement characteristics of the group, complexity of the material), investigate to determine the source of the problem. Try to determine whether the item was miskeyed, ambiguous, subject to guessing, or exhibited some other problem.

Use the criteria in Table 4.1 to interpret the p values you obtain from the data in Table 4.2. These values are found in Carey (2003) and are appropriate for the interpretation of difficulty indices from a criterion-referenced test that was administered shortly after instruction.

Table 4.1 Standards for interpreting the Difficulty Index for items from a criterion-referenced, objective style test following instruction.
Difficulty Index Range / Description of Item Difficulty Level*
.90 - 1.00: Very easy for the group.
.70 - .89: Moderately easy for the group.
.50 - .69: Fairly difficult for the group.
Less than .50: Very difficult for the group.
*Remember that these must be interpreted with awareness of the context of the test, the characteristics of the students, and the objectives. Items that are close to 1.00 (very easy for the group) may be fine, or it may mean there is a problem with the item.

Items with p values less than .50 are most often described as problematic from a criterion-referenced perspective. When fewer than 50% of the class answers an item correctly in a criterion-referenced situation following adequate to excellent instruction, there is a problem with the item. These standards are not applied in the same way from a norm-referenced perspective.

Examine the item-by-student data in Table 4.2. Compute the difficulty index for the six items. Imagine the items were developed from objectives that were classified according to the Learning Levels indicated in the table and that there were three objectives covered by the test (items 1 and 2 came from one objective, items 3 and 4 came from the second objective, and items 5 and 6 came from the third objective). Imagine also that it was a homogeneous group of students who were struggling with even the most basic unit objectives. Do the item difficulty indices seem reasonable under the circumstances?

Table 4.2 Item-by-student data for a practice test with six items.
Learning Levels:     K**  K    C    C    A    A
Student / Item:      1    2    3    4    5    6    Total
CR*                  4    2    1    3    2    4    (6)
Jasmin               4    2    1    3    2    4    (6)
Alberto              4    2    1    3    2    4    (6)
Chad                 4    2    1    3    2    4    (6)
Renee                4    2    1    1    2    4    (5)
Gustaf               4    2    1    2    2    1    (4)
Devone               4    2    1    1    2    3    (4)
Maricela             4    2    1    4    2    3    (4)
Garrett              4    2    1    2    2    3    (4)
Brenda               4    2    1    3    4    3    (4)
Qing                 4    2    3    3    3    4    (4)
Kadar                4    2    1    1    2    3    (4)
Anna                 4    3    1    4    2    4    (4)
Joe                  4    2    1    3    1    1    (4)
Nadia                4    2    4    2    2    4    (4)
Andrew               4    2    3    1    2    4    (4)
Katrina              4    2    2    2    1    4    (3)
Igor                 2    3    4    4    2    4    (2)
Elena                1    3    2    2    2    4    (2)
Gordon               3    1    3    1    3    4    (1)
Burt                 3    3    4    4    1    1    (0)
upper p
lower p
difficulty total
Note: *CR = Correct Response; numbers within cells indicate the answer selected by the student; i.e., Jasmin selected response choice 4 for item #1 and response choice 2 for item #5. **K = knowledge level; C = comprehension level; A = application level of learning.
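If you would like to check your hand calculations, the short Python sketch below is one way to compute and label difficulty indices. It is an illustration only, not part of the course materials; the answer key and the three student rows shown are copied from Table 4.2 (Jasmin, Renee, and Burt), and the remaining students would be added the same way.

key = [4, 2, 1, 3, 2, 4]            # correct response for items 1-6 (the CR row in Table 4.2)
responses = {                        # response choice selected by each student for items 1-6
    "Jasmin": [4, 2, 1, 3, 2, 4],
    "Renee":  [4, 2, 1, 1, 2, 4],
    "Burt":   [3, 3, 4, 4, 1, 1],
    # ... add the remaining students from Table 4.2
}

def difficulty(item):
    """Difficulty index p = R / n for one item (0-based index)."""
    correct = sum(1 for answers in responses.values() if answers[item] == key[item])
    return correct / len(responses)

def label(p):
    """Descriptions from Table 4.1 (criterion-referenced test given shortly after instruction)."""
    if p >= .90: return "very easy for the group"
    if p >= .70: return "moderately easy for the group"
    if p >= .50: return "fairly difficult for the group"
    return "very difficult for the group"

for item in range(len(key)):
    p = difficulty(item)
    print(f"Item {item + 1}: p = {p:.2f} ({label(p)})")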
Table 4.2 Item Analysis Feedback
Index:               Item 1   Item 2   Item 3   Item 4   Item 5   Item 6
upper p              1.00     1.00     .90      .50      .80      .50
lower p              .60      .50      .30      .10      .60      .70
difficulty total     .80      .75      .60      .30      .70      .60
Interpretation:      Moderately easy / Moderately easy / Fairly difficult / Very difficult / Moderately easy / Fairly difficult

The difficulty indices for items 1 and 2 seem reasonable. The questions covered lower-order skills (knowledge) and 75 - 80% of the group answered them correctly. The difficulty indices for items 3 and 4 may indicate a problem. Both items cover the same comprehension-level objective. There should not be such a big difference between the two p values if those items cover the same objective. Also, they cover lower order skills. While the p value for item 3 (p = .60) seems low but possibly appropriate for the group, the difficulty index for item 4 seems very low, even for this group. The difficulty indices for items 5 and 6, while showing a fair amount of difficulty, seem reasonable under these circumstances.

We would expect that the difficulty indices follow the pattern of complexity of the objectives, i.e., items 1 & 2 are knowledge level items with somewhat higher p values; items 3 and 4 are somewhat more complex at the comprehension level with slightly lower p values; and items 5 & 6 are even more complex at the application level with even lower p values. Did the difficulty indices follow this expected pattern? Items 1, 2, 5, and 6 seemed to follow the expected pattern, but items 3 & 4 seemed both more difficult than they should have been and too far apart in value to be appropriate. They must be investigated for possible revision if they are to be used on future tests.

In summary, items 1, 2, 5, and 6 seem reasonable for a practice test with this group of students and this level of complexity. Items 3 and 4 need to be investigated for possible problems. While p = .60 for question 3 does not seem too unreasonable for this group on a practice test, it does not seem reasonable when compared to their performance on items 5 and 6. After all, how could they get more difficult (application level) skills correct while getting less difficult (comprehension) skills wrong? It's a good idea to investigate further. We will now look at how the items are able to discriminate students who knew the material fairly well from students who did not know the material well overall.

Item Discrimination

The discrimination index (d) is another tool to help evaluate the quality of a test item. It lets us know if the item is doing the job it was intended to do; d lets us know if the item is capable of telling us whether or not students knew the material. Item discrimination is based on the following assumption. Students who performed well overall on the test are most likely the students who answer correctly on an item-to-item basis, and students who performed poorly on the overall test are likely to be the ones who miss any given item. (A logical assumption, right? If you "buy" this assumption, you are well on your way to understanding item discrimination.)

Item discrimination ranges from -1.00 to 1.00 and is calculated with the formula: d = [(number of students in the upper group who answered correctly) minus (number of students in the lower group who answered correctly)] divided by (the number of students in either group), or d = (Ru - Rl) / n of either group. Another formula is: (p value for upper group) minus (p value for lower group), or d = pu - pl. The discrimination index is a function of the value of the difficulty index.
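As an illustration only (not from the text), the short Python sketch below applies the formula just described. It assumes the class has already been split into an upper and a lower group on the basis of total test score, and it uses the item 1 and item 6 counts from Table 4.2, where each group contains ten students.

def discrimination(upper_correct, lower_correct, group_size):
    """d = p_upper - p_lower, computed from the number answering correctly in each group."""
    return (upper_correct - lower_correct) / group_size

# Item 1: 10 of 10 upper-group students and 6 of 10 lower-group students answered correctly.
print(discrimination(10, 6, 10))    # 0.4

# Item 6: 5 of 10 upper-group students and 7 of 10 lower-group students answered correctly,
# so d is negative: the lower group actually outperformed the upper group on this item.
print(discrimination(5, 7, 10))     # -0.2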
Items with difficulty values closer to .50 have more potential to discriminate between people who knew the material and people who did not know the material. Items at the more extreme ends of the difficulty scale do not have the same potential to tell the difference between who got it right and who got it wrong. In other words, if everyone gets the question right (p = 1.00), the item does not have a chance to discriminate whether the people who answered correctly came from the upper rather than the lower performers overall on the test (because everyone performed the same - they all got it right). The same goes for the lowest p values (p values between .00 and .10). If everyone got it wrong, the item can't discriminate whether those who knew the material overall answered the item correctly and those who did not know the material overall answered the item wrong (again, because everyone performed the same - they all got it wrong).

Standards for interpretation will differ somewhat from author to author. Discrimination values of .30 or greater would be considered strong, values of .20 or greater would be considered good, values between .00 and .20 would be considered weak but adequate, and values below .00 would be considered poor.

Examine the item-by-student data in Table 4.2 (repeated here from the earlier section). Calculate and interpret the item discrimination values for items 1 - 6. While it is important to be able to calculate the difficulty and discrimination values and to understand the concepts, that is only half the job. The other half is to be able to take the information you get from interpreting the indices and put it to work to improve the items and eventually make a better test. How are these items performing?

Table 4.2 Item-by-student data for a practice test with six items.
Learning Levels:     K**  K    C    C    A    A
Student / Item:      1    2    3    4    5    6    Total
CR*                  4    2    1    3    2    4    (6)
Jasmin               4    2    1    3    2    4    (6)
Alberto              4    2    1    3    2    4    (6)
Chad                 4    2    1    3    2    4    (6)
Renee                4    2    1    1    2    4    (5)
Gustaf               4    2    1    2    2    1    (4)
Devone               4    2    1    1    2    3    (4)
Maricela             4    2    1    4    2    3    (4)
Garrett              4    2    1    2    2    3    (4)
Brenda               4    2    1    3    4    3    (4)
Qing                 4    2    3    3    3    4    (4)
Kadar                4    2    1    1    2    3    (4)
Anna                 4    3    1    4    2    4    (4)
Joe                  4    2    1    3    1    1    (4)
Nadia                4    2    4    2    2    4    (4)
Andrew               4    2    3    1    2    4    (4)
Katrina              4    2    2    2    1    4    (3)
Igor                 2    3    4    4    2    4    (2)
Elena                1    3    2    2    2    4    (2)
Gordon               3    1    3    1    3    4    (1)
Burt                 3    3    4    4    1    1    (0)
upper p
lower p
difficulty total
discrim. index

How did you do? Compare your results to the values and interpretations in the feedback table below.

Table 4.2 Item Analysis Feedback
Index:                  Item 1   Item 2   Item 3   Item 4   Item 5   Item 6
upper p                 1.00     1.00     .90      .50      .80      .50
lower p                 .60      .50      .30      .10      .60      .70
discrimination index    .40      .50      .60      .40      .20      -.20
Interpretation:         good     good     good     good     adequate poor

Items 5 and 6 are possibly in need of revision according to the discrimination indices. Notice for item 6 how more people in the group of students that scored lower overall on the test (lower group) answered the item correctly than in the group of students who tended to be more knowledgeable on the test (upper group). This is not consistent with the assumption on which discrimination is based. In fact, it is illogical and indicates the item likely needs revision.

Item Distractor Analysis

Distractor analysis is our final tool for investigating the quality of items.
After creating test items and trying them out on a test with students, it is important to investigate the quality of the items to make sure they are doing their jobs: a) telling us who knew the material and who did not know the material, and b) giving us information on what aspects of the skills the students have not learned. Distractor analysis is a procedure for examining patterns in students' response choices to detect faulty test items (especially faulty response sets). Distractor analysis allows us to examine the distribution of students' choices across the response set to determine whether the distractors were functioning as intended. If they are not functioning well, we will be able to detect this and revise them for future use. If distractors are functioning appropriately, and if students were selecting wrong answers, we can determine which wrong answers they were selecting. This tells us what misconceptions students may have in relation to the skill. This is pretty important information from an instructional perspective. We learn "who knows what" - and if they don't know, what part of the skill they are having trouble with (especially if item stems and responses are based on learning objectives and are well written in terms of their potential for diagnosing students' misconceptions). We are using qualitative analysis along with the p and d values (quantitative analysis) to evaluate the quality of our test items.

Review the pattern of responses in Table 4.2. Are there any items in which the distractors are not functioning? In other words, are there any incorrect choices that were not selected by any of the students? Also, are there any patterns in the responses that show signs of ambiguity or guessing? Did any of these items appear to be miskeyed?

Table 4.2 Item Analysis Feedback for Distractor Analysis Items (Qualitative Judgments)
1  2
Distractor not functioning: no student selected response choice 4. Check to see if it is too obviously incorrect because of a clue, because of implausibility, or if the teacher has "taught to the test" or inadvertently used the item/response in a previous example.
3  4  5  6
Indicates a pattern of guessing; too few answered correctly and the rest of the students' choices were "all over the place." Notice that this is true even for the students in the upper group who tended to know the material.
Distractor not functioning: no student selected response choice 4. Check to see if it is too obviously incorrect because of a clue, because of implausibility, or if the teacher has "taught to the test" or inadvertently used the item/response in a previous example.
Distractor not functioning: no student selected response choice 2. Check to see if it is too obviously incorrect because of a clue, because of implausibility, or if the teacher has "taught to the test" or inadvertently used the item/response in a previous example.
There is a pattern of ambiguity in that students tended to select answer choice 3 as frequently as 4. Check to see whether they are poorly written (equally correct). Otherwise this is good diagnostic information, i.e., students aren't skilled enough to distinguish the correctness of 4 over choice 3.

Please do the practice exercise: Item Analysis Exercise. The link is also available on the last page of this module.
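If you would like a quick way to tally response choices when doing exercises like this, the sketch below (Python, illustrative only and not part of the course materials) counts how often each choice was selected for a single item. The responses listed are the twenty answers to item 2 in Table 4.2, in the order the students appear in the table.

from collections import Counter

key = 2                                          # correct answer for item 2 in Table 4.2
item2_choices = [2, 2, 2, 2, 2, 2, 2, 2, 2, 2,   # Jasmin through Qing (upper group)
                 2, 3, 2, 2, 2, 2, 3, 3, 1, 3]   # Kadar through Burt (lower group)

counts = Counter(item2_choices)
for option in (1, 2, 3, 4):
    note = " <- keyed answer" if option == key else ""
    print(f"choice {option}: selected {counts.get(option, 0)} times{note}")
# Choice 4 was never selected, so that distractor is not functioning,
# which matches the qualitative feedback for item 2 above.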
Module 4 Part 2: Grading and Reporting Achievement

This part of the module corresponds to Chapter 11 in your text. In addition to information found in textbooks, a resource related to this topic that would be extremely useful for teachers, administrators, and other school personnel is: The Joint Committee on Standards for Educational Evaluation. (2003). The student evaluation standards: How to improve evaluations of students. Arlen R. Gullickson, Chair. Thousand Oaks, CA: Corwin Press. Here is a link to the Joint Committee site: Student Evaluation Standards. Follow the links until you reach the specific standards within each of the classifications (propriety, etc.). Make sure you carefully review the specific Accuracy standards (A1 - A11; e.g., A1 Validity Orientation: Student evaluations should be developed and implemented so that interpretations made about the performance of a student are valid and not open to misinterpretation.) Review the other Evaluation Standards (Program Evaluation, Personnel Evaluation) as well. Reflect on ways that your professional activities are consistent or may be inconsistent with the standards.

Marking Systems - Purpose and Features

The Student Evaluation Standards described in the resource cited above were developed, reviewed, and agreed upon by the members of the Joint Committee on Standards for Educational Evaluation (Joint Committee, 2003, p. 2). Sixteen major education organizations are represented on the committee. The standards are organized around four necessary attributes, listed below. They are elaborated more fully in the book and are recommended as a practical and philosophical guide for professionals, students, parents, and others involved in educational evaluation of students at the classroom level. The four attributes are listed here for your convenience and review.

Propriety Standards: The propriety standards help ensure that student evaluations will be conducted legally, ethically, and with due regard for the well-being of the students being evaluated and other people affected by the evaluation results. There are seven propriety standards listed in the resource.

Utility Standards: The utility standards ensure that student evaluations are useful. Useful student evaluations are informative, timely, and influential. There are seven utility standards.

Feasibility Standards: The feasibility standards help ensure that student evaluations can be implemented as planned. Feasible evaluations are practical, diplomatic, and adequately supported. There are three feasibility standards.

Accuracy Standards: The accuracy standards help ensure that student evaluations will produce sound information about a student's learning and performance. Sound information leads to valid interpretations, justifiable conclusions, and appropriate follow-up. There are 11 accuracy standards.

Recall the purpose of grading as stated by the authors of our text: to provide feedback about academic achievement. In addition to this purpose, other authors point to other functions of grading and reporting systems. Note that some are intended and some are not (please see Table 4.2.1). When grades are used for purposes other than those intended, we must consider carefully whether grades are really valid for those uses.
Table 4.2.1 Reported Functions of Grading and Reporting Systems

Linn & Miller (2005): (1) instructional uses (improvement of student learning and development); (2) reporting to parents/guardians (helping parents understand the objectives of the school and how well their child is achieving the intended outcomes); (3) administrative and guidance uses (determining promotion and graduation, awarding honors, determining athletic eligibility, reporting to other schools and prospective employers).

Oosterhoff (2005): (1) motivate students (generally undesirable, consequences unknown); (2) discipline students (grades should not be used for this purpose).

Nitko (2005, p. 332): (1) reaffirm what is already known about classroom achievement; (2) documentation; (3) obtain extrinsic rewards, punishment; (4) obtain social attention or teacher attention; (5) request new educational placement; (6) judge a teacher's competence or fairness; (7) indicate school problems for a student; (8) support vocational or career guidance explorations; (9) limit or exclude a student's participation in extracurricular activities; (10) promotion or retention; (11) granting graduation/diploma; (12) determining whether a student has the necessary prerequisites for a higher level course; (13) selecting for postsecondary education; (14) deciding whether an individual has basic skills needed for a particular job.

Locate and read one of the following articles (available via a UCF online library search, selecting the ERIC database). Synthesize this information with the other material you are learning in this module.

Lambating, J. & Allen, J. D. (2002). How the multiple functions of grades influence their validity and value as measures of academic achievement. Paper presented at the annual meeting of the American Educational Research Association (New Orleans, LA, April 1 - 5, 2002).

Guskey, T. R. (2002). Perspectives on grading and reporting: Differences among teachers, students, and parents. Paper presented at the annual meeting of the American Educational Research Association (New Orleans, LA, April 1 - 5, 2002).

McMillan, J. H., Myran, S., & Workman, D. (2002). Elementary teachers' classroom assessment and grading practices. [Electronic version]. Educational Researcher, 95(4), 203-214.

Validity is one of the most important characteristics to consider when designing a grading or marking system. There are procedures to follow that will help to ensure the validity of grades. A teacher or other person responsible for evaluating performance must select an appropriate set of indicators to represent the expected instructional outcomes. The set must accurately and fairly represent the person's achievement of the expected goals. A teacher will ask himself or herself: What tasks, tests, projects, portfolio exhibits, etc. will I include as components that will contribute to the students' composite term grades? How will I combine these components to fairly and accurately depict students' achievement? A good answer to these questions will help ensure that the inferences resulting from the interpretation of grades will be appropriate. A poor answer compromises the validity of the grades.

Marking Systems - Creating Composite Scores and Assigning Grades

Grades must first be defined to let students, school personnel, and families know what they mean. In typical school systems, grades represent students' achievement related to the goals and objectives covered during the marking term.
When this is true, grades are assigned from a criterion-referenced perspective and the various letter grades are defined to represent students' performance in relation to the skills taught. In other words, an A represents that the student has mastered all goals at a high level of skill, and a B represents that the student has mastered all or most of the goals at least at a minimal level. A grade of C means that the student has mastered the majority of the goals but is having difficulty, and a grade of D means the student is having difficulty with a majority of the skills. A grade of F means that the student has made little progress, if any, toward the skills during the term (Carey, 2003, p. 425). When these definitions are used uniformly within and across school systems, we have a better basis for interpreting the meaning of grades from one context to another.

In some programs, a norm-referenced grading system is used, in which grades are defined in terms of students' achievement of the goals and objectives in relation to the peer group. From this perspective, grades are defined to represent the extent to which a student's performance on goals and objectives is below average, average, or above average when compared to the peer group. As noted in the previous articles, classroom teachers unknowingly or deliberately combine variables from each of the two perspectives when assigning grades. When variables other than those intended to be measured and/or reported are included in the grade, we say that the achievement grade is confounded (mixed with other variables that should not be included, such as ability, attendance, attitude, etc.). When grades are not well defined and their meaning is not communicated effectively to stakeholders, confusion, frustration, and resentment will often be the result.

Please do the practice exercise: Grading Practice. (The link can be found on the next page.) It is included to offer practice related to calculating term composites and assigning grades. If you get stuck, you may want to discuss the procedures within your working group. You may also want to discuss the rationale other members used to assign their percentages, if they were very different among your group members.

Locate and read the following article describing a study that examined procedures for adapting a grading system for middle school students in an exceptional education program. Think about the many factors that must be considered when designing a grading system: using best pedagogical practices, keeping in mind the needs of exceptional students and students in the general population, and implementing good measurement practice. Think about the principles related to an effective grading system. To what extent were they implemented in the experimental grading procedures? Using what you know about the importance of implementing accommodations for exceptional students and about designing instruction and assessment to support learning and enhance student motivation, critique the grading system described in the article. If you were invited to help revise the Personalized Grading Plan procedures for use in your educational context, what changes or additions would you include?

Munk, D. D. & Bursuck, W. D. (2001). Preliminary findings on personalized grading plans for middle school students with learning disabilities. Exceptional Children, 67(2), 211-234. This article is available using a UCF Library online search (selecting the ERIC database).
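Before you try the Grading Practice exercise, it may help to see the arithmetic of a weighted term composite in one place. The sketch below is only a supplementary illustration: the component names, weights, and letter-grade cutoffs are hypothetical and are not values recommended by the text or this course.

```python
# Supplementary sketch of a weighted term composite and letter-grade assignment.
# Component names, weights, and cutoffs are hypothetical illustrations.

weights = {"unit tests": 0.40, "projects": 0.30, "quizzes": 0.20, "homework": 0.10}
cutoffs = [(90, "A"), (80, "B"), (70, "C"), (60, "D"), (0, "F")]   # percent cutoffs

def composite(percent_scores):
    """Weighted average of component percentages (each component on a 0-100 scale)."""
    return sum(weights[name] * percent_scores[name] for name in weights)

def letter(comp):
    """Return the first grade whose cutoff the composite meets or exceeds."""
    return next(grade for cutoff, grade in cutoffs if comp >= cutoff)

student = {"unit tests": 84, "projects": 91, "quizzes": 78, "homework": 95}
comp = composite(student)
print(f"Composite = {comp:.1f}%, grade = {letter(comp)}")   # Composite = 86.0%, grade = B
```

Whatever weights you settle on, the key validity question from this section still applies: do the components and their weights accurately and fairly represent achievement of the intended goals?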
Module 4 Exercises

Item Analysis Exercise (click on the link)
Grading Practice (click on the link)

Module 5 Overview

The central tendency and variability concepts in this module are important whether you are gathering, summarizing, and interpreting data as a teacher in a classroom or in another professional role such as school leader, counselor, instructional designer, or researcher. As you begin, consider all the ways that proficiency in using data to help make decisions is vital to your effective professional performance. These measurement skills are also important for interpreting and conducting research (teacher or school leader action research, school leaders' or private non-profit evaluation research, scholarly research). Module 5 corresponds to Chapters 13 & 14 in our textbook. We will learn how to summarize test scores and group performance and convert raw scores to standard scores. Content in this module relates to the text but includes content not found in the textbook as well. Once assessments are used to gather information about an attribute (individual student or group performance), it is important to summarize and make sense of the results. Concepts in this module will assist you in this process. The table below contains the objectives, readings, learning activities, and assignments for Module 5.

Module 5 focuses on the following objectives:

Chapter 13 Objectives
Compare and contrast histograms, frequency polygons, and smoothed curves.
Discriminate among positively skewed, symmetrical, and negatively skewed distributions.
Determine the mean, median, and mode, given a set of data.
Locate correctly the relative positions of the measures of central tendency in various distributions represented by smooth curves.
Identify the measure of central tendency that best represents the data in various distributions.
Draw conclusions about data based on the measures of central tendency and/or smooth curves based on the data.

Chapter 14 Objectives
Compare and contrast the range, semi-interquartile range, and standard deviation.
Determine quartiles and percentiles for a given set of data.
Discriminate between raw and converted scores.
Use z-score conversions to facilitate comparisons of scores from different distributions.
Determine equivalent raw scores, z-scores, T-scores, and percentile ranks.
Use the measures of central tendency, variability, converted scores, and the properties of the normal curve to make decisions about measurement data, both for individual students and for groups.

Readings: Chapters 13 & 14 in text
Learning Activities: Content specified in module; several non-posted practice tasks
Assignments: Continue Final Project Part B (to be found under the Assignments tool)

Module 5 Part 1: Central Tendency

This module part corresponds to Chapters 12 and 13 in the Kubiszyn and Borich (2007) textbook. Please read the chapters before beginning this section. You may want to have a calculator and some scratch paper handy. These tasks are for your practice; they do not need to be uploaded or posted. Don't forget that you can discuss and compare your responses with your group members if you have any confusion. In this section of the module we will focus on summarizing data as well as calculating and interpreting measures of central tendency. Most of you will have seen these skills before and just need some review and practice with application of the skills to a new context.
Summarizing Data (Please read Chapters 12 and 13 in the Textbook)

There are a number of ways to summarize test performance data, and the choice often depends on the context. Factors within the context may include the number of observations, the purpose of the data summary (e.g., making instructional decisions, interpreting results obtained in a research study, program evaluation, report dissemination), and the available technology. We will concentrate mostly on educational contexts such as individual student performance, class performance, and school or district-wide performance data.

A first step in summarizing a data set is to create a simple or grouped frequency distribution. Once you have done that, it may be useful to create and then interpret a graph such as a histogram, frequency polygon, or smoothed curve. To further interpret the data, it is usually useful to calculate and interpret measures of central tendency: mean, median, and/or mode. Central tendency tells us how well the group performed. The choice of central tendency measure will often depend on the nature of the data and the questions being asked of it. Along with the measure of central tendency, it is useful to calculate and interpret measures of variability: range, variance, and standard deviation. Measures of variability tell us how dispersed or clustered the scores were about the mean. Variability is covered in Part 2 of this module.

In academic settings, it is helpful to use the following procedure to describe a group's performance (or, in the ESE or counseling context, a group of scores obtained from one individual over time).

o Set up expectations about the group's performance while keeping in mind these factors: characteristics of the group; complexity of the material; quality of the test; quality of the instruction (assume the quality of the test and instruction is good unless you have information to the contrary).
o Calculate measures of central tendency and variability.
o Compare the obtained values to the values that were expected. Evaluate the obtained results (e.g., "the group generally performed well but was somewhat more heterogeneous than expected").
o Seek reasonable explanations for any discrepancies between the expected central tendency and variability and what was obtained.
o Document and use the information for making decisions about the quality of the performance, the quality of instruction and materials, and the quality of your own performance.

Central Tendency

There are three commonly used measures of central tendency: mean, median, and mode. The mean represents the average of the scores in the group and is calculated using the following formula:

mean (X̄) = ΣX / n

where Σ means "to sum," X represents a score (thus ΣX means to sum the scores), and n = the number of scores.

The median represents the midpoint in the set of scores. Half of the scores will fall below this point and half will be above the point that is the median. Some authors in the field (Carey, 2001) provide a formula for the median. It allows a more precise estimate of the "midpoint of values." The last measure of central tendency is the mode. The mode is the score in the set that appears most frequently.

At the end of this part of the module you will be asked to calculate the mean and find the median for Set 1 in Table 5.2A. But first you will be asked to set up reasonable expectations for the group's performance, as noted in the interpretation procedures listed toward the beginning of the module.
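If you like to check hand work with software, the short sketch below shows these first summarizing steps in Python. It is a supplementary illustration only, and it uses a hypothetical set of scores rather than the Table 5.2A sets so that the practice exercise is left for you.

```python
# Supplementary sketch: simple frequency distribution plus mean, median, and mode
# for a hypothetical set of scores (not the Table 5.2A data).
from collections import Counter
from statistics import mean, median, mode

scores = [17, 15, 15, 14, 13, 13, 13, 12, 10, 8]     # hypothetical scores

freq = Counter(scores)                                # simple frequency distribution
for score in sorted(freq, reverse=True):
    print(f"{score:>3}: {'X' * freq[score]}")          # crude text bars, one X per occurrence

print("mean   =", mean(scores))     # sum of the scores divided by n
print("median =", median(scores))   # midpoint of the ordered scores
print("mode   =", mode(scores))     # most frequently occurring score
```

Reading the text bars from top to bottom gives a rough sense of the shape of the distribution, which is the same judgment you will make from a histogram or frequency polygon.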
Continue reading about variability and then proceed with the calculations and interpretations. Use the sets of scores in Table 5.2A to practice summarizing a given data set. You will later use these scores to calculate and interpret measures of central tendency and variability (please refer to Chapters 12 and 13 in the Textbook).

Table 5.2A Summarize a Data Set

Scores:
Set 1: 20, 20, 19, 19, 19, 19, 18, 18, 14, 14
Set 2: 30, 26, 22, 21, 20, 20, 20, 18, 17, 16
Set 3: 50, 48, 45, 40, 40, 40, 33, 30, 29, 25

Data Summary Procedures:
Create simple frequency distributions for each of the sets.
Create a grouped frequency distribution for Set 2.
Create a histogram for Set 1.
Create frequency polygons for Sets 1 and 2.
Describe the distributions of scores as related to symmetry.

Examine the polygons you created from the data in Table 5.2A. Think about the nature of the performance represented by the distributions of scores. What can you tell about the group's performance after studying the distributions? When interpreting the distributions, pay attention to the location of the distribution along the raw score scale. Notice whether the polygon is situated toward the lower end of the scale or toward the higher end. This gives you an idea of how well the group performed. Next, notice whether the polygon indicates the scores were clustered together about the mean, fairly spread out around the mean, or widely dispersed about the mean. This gives you a sense of the variability of the performance (studied further in Part 2 of the module). Also, notice whether the polygon is shaped like a bell, indicates skewness, or contains more than one mode (bimodal, multi-modal). This gives you further insight into the nature of the performance. It shows you where there are clusters within the group - clusters at the high end are more desirable when interpreting achievement-related scores, and clusters toward the low end present a challenge. You must then decide whether or how to provide remedial instruction for those students who are not succeeding with the skills or extension experiences for those students who need to reach even higher. So, pictures really are worth a thousand words when it comes to summarizing a group's performance.

Practice calculating and interpreting group performance using the data in Set 1 from Table 5.2A. Use the information in the following scenario to set your expectations for the group's performance (is it reasonable to expect them to do very well? to do moderately well? or to have difficulty with most of the skills?). Then calculate the actual (obtained) measures of central tendency. Use that information along with the frequency polygon from the beginning of this page to interpret the results. You may wish to use the table format after the scenario to guide your work. We will use this same data set in Part 2 of this module to practice with variability concepts (if you would like to wait and do them both together, that's ok, too).

Imagine this scenario: A teacher is about to administer a posttest to a group of students following a unit of instruction. This teacher has implemented some new teaching techniques and would like to know if the students have succeeded in mastering the skills. In the past, the group has been quite heterogeneous in their performance and has struggled to achieve even moderate mastery of the skills. The sub-skills that will be measured are classified as fairly difficult. There are 20 possible points on the exam. What mean might be expected from this group under these circumstances?
What distribution shape will likely result?

Index                  Expectation (prediction based on the scenario)    Obtained Result (calculated using the actual Set 1 data)
Mean                   __________                                        __________
Range                  (continues in Part 2 of module)
Standard Deviation     (continues in Part 2 of module)
Distribution Shape     __________                                        __________

Now compare your obtained values to what you reasonably expected and write your description of the group's performance. Wondering how you did? You may wish to post your predictions and obtained results under your group's discussion topic to talk it over and compare results.

Module 5 Part 2: Variability

This module part corresponds to Chapters 12 and 13 in the Kubiszyn and Borich (2007) textbook and focuses on variability in data. Please read the chapters before beginning this section. You may want to have a calculator and some scratch paper handy. These tasks are for your practice; they do not need to be uploaded or posted. Don't forget that you can discuss and compare your responses with your group members if you have any confusion. In this section of the module we will focus on summarizing data as well as calculating and interpreting measures of variability. Again, this may be a review or a chance for you to apply the skills in a new context.

Summarizing Data: Variability (Please read Chapter 13 in the Textbook)

There are a number of ways to summarize test performance data, and the choice often depends on the context. Factors within the context may include the number of observations, the purpose of the data summary (e.g., making instructional decisions, interpreting results obtained in a research study, program evaluation, report dissemination), and the available technology. We will concentrate mostly on educational contexts such as individual student performance, class performance, and school or district-wide performance data. Along with the measure of central tendency (see Part 1 of this module), it is useful to calculate and interpret measures of variability: range, variance, and standard deviation. While the range reflects the distance between the high and low scores, other measures of variability tell us how dispersed or clustered the scores were about the mean.

In academic settings, it is helpful to use the following procedure to describe a group's performance (or, in the case of ESE and counseling contexts, possibly a group of scores obtained from one individual over time).

o Set up expectations about the group's performance while keeping in mind these factors: characteristics of the group; complexity of the material; quality of the test; quality of the instruction (assume the quality of the test and instruction is good unless you have information to the contrary).
o Calculate measures of central tendency and variability.
o Compare the obtained values to the values that were expected or reasonable within the context. Evaluate the obtained results (e.g., "the group generally performed well but was somewhat more heterogeneous than expected").
o Seek reasonable explanations for any discrepancies between the expected central tendency and variability and what was obtained.
o Document and use the information for making decisions about the quality of the performance, the quality of instruction and materials, and the quality of your own performance.

Review the sets of scores in Table 5.2A and your practice summarizing data and calculating central tendency (from Part 1). You will now use these scores to calculate and interpret measures of variability.
Table 5.2A Summarize a Data Set

Scores:
Set 1: 20, 20, 19, 19, 19, 19, 18, 18, 14, 14
Set 2: 30, 26, 22, 21, 20, 20, 20, 18, 17, 16
Set 3: 50, 48, 45, 40, 40, 40, 33, 30, 29, 25

Data Summary Procedures:
Create simple frequency distributions for each of the sets.
Create a grouped frequency distribution for Set 2.
Create a histogram for Set 1.
Create frequency polygons for Sets 1 and 2.
Describe the distributions of scores as related to symmetry.

Review the polygons for the amount of variability in the data set. Notice whether the polygon indicates the scores were clustered together about the mean, fairly spread out around the mean, or widely dispersed about the mean. This gives you a sense of the variability of the performance. Also, notice whether the polygon is shaped like a bell, indicates skewness, or contains more than one mode (bimodal, multi-modal). This gives you further insight into the nature of the performance. It shows you where there are clusters within the group - clusters at the high end are more desirable when interpreting achievement-related scores, and clusters toward the low end present a challenge. You must then decide whether or how to provide remedial instruction for those students who are not succeeding with the skills or extension experiences for those students who need to reach even higher. So, pictures really are worth a thousand words when it comes to summarizing a group's performance.

Variability

Measures of variability include the range, variance, and standard deviation.

The range is calculated by subtracting the lowest earned score from the highest earned score. It tells you how far the students' scores spanned along the raw score scale:

R = X(highest) - X(lowest)

where R = range, X(highest) represents the highest earned score, and X(lowest) represents the lowest earned score.

To interpret the range, you may wish to use the following rule of thumb (Carey, 2001): a range that is equal to 1/4 or less of the total points possible is considered to be a homogeneous performance; a range that is equal to about 1/3 of the total number of points is considered to be somewhat heterogeneous; and a range that is 1/2 or more of the total possible points is considered to be a very heterogeneous performance. (Example: the total possible points on a test is 60; the highest earned score was 58 and the lowest earned score was 28; 58 - 28 = 30; 30 divided by 60 equals .50, or about 1/2 of the total possible points; this means the group's performance was very heterogeneous.)

The standard deviation is a number that represents, on average, how far the scores in the set were away from the mean. As a measure of variability it tells us how clustered together or how dispersed the scores were about the mean. Interpretation takes some practice, but a larger number represents more variability and a smaller number represents less variability. You can further interpret the group's performance by comparing the standard deviation to the range. If the standard deviation is about 1/4 or less of the range, the scores are more clustered together within the range; a standard deviation that is around 1/3 of the range indicates scores are somewhat dispersed within the range; and a standard deviation that is about 1/2 or more of the range represents scores that are quite dispersed throughout the range. The standard deviation is calculated using the following formula:

SD = √[ Σ(X - X̄)² / n ]

where SD = standard deviation, Σ means "to sum" (thus Σ(X - X̄)² means the sum of the squared deviations), X represents a score, X̄ = the mean of the scores, and n = the number of scores in the group.

The variance is another measure of variability that tells us how dispersed the scores were about the mean. The variance is calculated much like the standard deviation. It is an important measure of variability when it comes to conducting and interpreting statistical analyses. Notice how the calculation of the variance is the same as for the standard deviation until you take the square root. In other words, the variance is the standard deviation squared:

variance (SD²) = Σ(X - X̄)² / n
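To see the variability measures and the rules of thumb in one place, here is a brief supplementary sketch. The scores and the number of possible points are hypothetical (not the Table 5.2A sets), so the Set 1 practice below is still left for you.

```python
# Supplementary sketch: range, variance, and standard deviation for a hypothetical
# 25-point quiz, with the rule-of-thumb ratios described in this section.
from math import sqrt

scores = [23, 22, 20, 19, 19, 18, 16, 15, 12, 11]     # hypothetical scores
possible = 25                                          # hypothetical total points

n = len(scores)
mean = sum(scores) / n
rng = max(scores) - min(scores)                        # R = X(highest) - X(lowest)
variance = sum((x - mean) ** 2 for x in scores) / n    # sum of squared deviations / n
sd = sqrt(variance)

print(f"mean = {mean:.2f}, range = {rng}, variance = {variance:.2f}, SD = {sd:.2f}")
print(f"range / possible points = {rng / possible:.2f}")   # ~1/4 homogeneous ... ~1/2 very heterogeneous
print(f"SD / range              = {sd / rng:.2f}")          # ~1/4 clustered ... ~1/2 quite dispersed
```

For this hypothetical quiz the range is about half of the possible points (a very heterogeneous performance), while the standard deviation is about a third of the range (scores somewhat dispersed within that range), which is exactly the kind of two-step interpretation described above.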
Finish practicing calculating and interpreting group performance using the data in Set 1 from Table 5.2A. Use the information in the following scenario to set your expectations for the group's performance. Then calculate the actual (obtained) measures of variability. Use that information along with the frequency polygon from the beginning of this page to interpret the results. You may wish to use the table format after the scenario to guide your work.

Imagine this scenario: A teacher is about to administer a posttest to a group of students following a unit of instruction. This teacher has implemented some new teaching techniques and would like to know if the students have succeeded in mastering the skills. In the past, the group has been quite heterogeneous in their performance and has struggled to achieve even moderate mastery of the skills. The sub-skills that will be measured are classified as fairly difficult. There are 20 possible points on the exam. What mean, range, and standard deviation might be expected from this group under these circumstances? What distribution shape will likely result?

Index                  Expectation (prediction based on the scenario)    Obtained Result (calculated using the actual Set 1 data)
Mean                   __________                                        __________
Range                  __________                                        __________
Standard Deviation     __________                                        __________
Distribution Shape     __________                                        __________

Now compare your obtained values to what you reasonably expected and write your description of the group's performance. Wondering how you did? You may wish to post your predictions and obtained results under your group's discussion topic to talk it over and compare results.

There is more practice in the handout available under the Table of Contents for this module. These exercises will give you some practice with another data set. If you would like even more practice with illustrations of these concepts, there is a site you may find helpful (more practice for mean and standard deviation calculation). The URL is: http://www.easycalculation.com/statistics/learn-standard-deviation.php. It is a copyrighted product of HIOX.

Extend your learning beyond the ordinary by trying to complete the calculations and graphs for Table 5.2A using a spreadsheet program like Microsoft EXCEL. There is a site published by Dr. Del Siegle at the University of Connecticut illustrating this procedure at the following link (practice using EXCEL to calculate mean and standard deviation). The URL is: http://www.gifted.uconn.edu/siegle/research/Normal/stdexcel.htm.

Module 6 Overview

The concepts in this module are important whether you are using skills related to correlation and validity as a teacher in a classroom or in another professional role such as school leader, counselor, instructional designer, or researcher.
As you begin, consider how many times you have talked or read about correlation without knowing some of the finer points of interpretation. Consider how these skills can assist you in reading and understanding research and evidence of the validity of test scores. Module 6 corresponds to Chapters 15 & 16 in our textbook. We will learn to determine relationships with correlation and to examine test validity. Content in this module relates to the text but includes content not found in the textbook as well. One of the most important attributes of high quality assessment is the validity of results (the extent to which the inferences that we make from the results are appropriate). In this module, we will learn skills that will help you understand and evaluate the validity of results of tests you use, create, or evaluate for research purposes. The table below contains the objectives, readings, learning activities, and assignments for Module 6.

Module 6 focuses on the following objectives:

Chapter 15 Objectives
Interpret correlation coefficients as to strength and direction.
Describe why the presence of even a very strong correlation does not imply causality.
Compare and contrast the correlation coefficient and the coefficient of determination.
Describe a curvilinear relationship.
Explain why correlation coefficients computed from a truncated range of data will be weaker than if computed from the entire range of data.

Chapter 16 Objectives
Identify types of evidence that indicate whether a test may be valid or invalid for various purposes.
Compare and contrast content validity, concurrent validity, and predictive validity evidence.
Describe procedures used to establish the content validity evidence of a test.
Identify the type of validity evidence most important for achievement tests.
Explain how group heterogeneity affects the size of a validity coefficient.
Identify the most appropriate type of validity evidence when given different purposes for testing.

Readings: Chapters 15 & 16 in text
Learning Activities: Content and articles specified in module; several non-posted practice tasks (within module); practice exercises found under the module Table of Contents
Assignments: Continue Final Project Part B

Module 6 Part 1 Correlation

Module 6 corresponds to Chapters 14 and 15 in the Kubiszyn and Borich textbook. There are practice activities to help you calculate and interpret correlation. You will be asked to locate a research article that reports a correlation coefficient. The activities found in this part of the module are not submitted to the Assignments tool for a grade. Earlier editions of many measurement textbooks and the more advanced resources go into more detail about correlation and validity if you are interested in extending your knowledge even further. Our library has volumes by Sax (1989) and Hopkins (1990) as well as others that would be very useful when you are ready to take a look.

Interpreting the Correlation Coefficient (r) and the Coefficient of Determination (r²)

A correlation coefficient is a number that ranges from -1.00 to +1.00 and represents an association between measures of variables. There are two dimensions represented by a given correlation coefficient: degree (i.e., strength of association) and direction. In a positive correlation, high scores on one variable go with high scores on the other, and low scores on one variable go with low scores on the other; in a negative correlation, high scores on one variable go with low scores on the other.
There are many different types of correlation coefficients, and the one calculated and reported usually depends on the type of data (nominal, ordinal, interval, or ratio; or continuous versus dichotomous). The Pearson Product-Moment Correlation Coefficient is one of the most commonly reported in published research. Do not interpret the correlation coefficient as an indication of cause and effect. Also, do not interpret it as if it represents a percentage of association. The coefficient of determination is obtained by squaring the correlation coefficient and multiplying by 100. It is a way of describing the hypothetical percentage of the factors associated with the two variables being correlated (Sax, 1989). Coefficient of determination = correlation coefficient squared, or r².

Two computational formulas for the correlation coefficient are provided below. Also, recall the rank difference correlation coefficient from your textbook. You will not be required to compute with the formulas given below on our objective exam unless advised otherwise. They are here for your information and so that you can see what the spreadsheet or statistical analysis program (like SPSS, for example) is doing for you. It is possible that you will calculate and interpret a rank difference correlation coefficient on the exam.

A raw-score computational formula for r (the Pearson Product-Moment Correlation Coefficient) is:

r = [nΣXY - (ΣX)(ΣY)] / √{[nΣX² - (ΣX)²][nΣY² - (ΣY)²]}

Another way to calculate this same type of correlation is to use the z-score formula:

r = Σ(z_X · z_Y) / n

Locate criteria for interpreting a correlation coefficient and a research article in a field of interest to you that calculated the correlation between two variables. Using the criteria, practice interpreting the strength and direction of the coefficient reported in the article. In your own words, in about 2 - 3 sentences, summarize the relationship between the variables examined in that article. If you have any questions, please post them to the discussion board under the topics "students help students" or "questions for instructor". Here are three reputable websites that offer three different standards for interpretation. This is not to frustrate or confuse you; it is to illustrate how many factors contribute to the interpretation of this widely used index and why you will see so many different interpretations (relative to precision) in the research articles you read.

Correlation interpretation from US gov HHS
Another example of guidelines
Another example of r guidelines (note the diagram illustrating interpretation of r squared)

Example: The Effects of Confidence and Perception of Test-taking, by Smith, Lisa F., North American Journal of Psychology, 1527-7143, April 1, 2002, Vol. 4, Issue 1. Database: Academic Search Premier.

These authors found a statistically significant correlation of .46 between students' confidence and their performance. This is a small, positive correlation and indicates that about 21% of the factors related to students' confidence are related to factors contributing to their performance. Further, the authors found a very low, positive correlation between students' perceptions of their test-taking skills and their performance (r = .14). This low correlation indicates that only about 2% of the factors contributing to students' perceptions of their test-taking skills are related to factors contributing to their test performance.
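For those who want to see the z-score formula in action, here is a small supplementary sketch using two hypothetical sets of paired scores. It only illustrates the arithmetic that a spreadsheet or SPSS performs; it is not a required procedure for the course.

```python
# Supplementary sketch: Pearson r via the z-score formula, plus the coefficient of
# determination (r squared). The paired scores below are hypothetical.
from math import sqrt

x = [10, 12, 15, 17, 20, 22, 25, 27]     # hypothetical scores on measure X
y = [32, 30, 40, 38, 45, 44, 50, 55]     # hypothetical scores on measure Y

def z_scores(values):
    """Convert raw scores to z-scores: (score - mean) / SD."""
    n = len(values)
    mean = sum(values) / n
    sd = sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / sd for v in values]

zx, zy = z_scores(x), z_scores(y)
r = sum(a * b for a, b in zip(zx, zy)) / len(x)      # r = sum(zx * zy) / n
print(f"r = {r:.2f}")
print(f"coefficient of determination = {r ** 2:.2f} "
      f"(about {r ** 2 * 100:.0f}% of the factors overlap)")
```

Because these hypothetical X and Y values rise together, the sketch produces a strong positive r; squaring it shows why even a sizable r corresponds to a smaller percentage of shared factors.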
Scatterplots

Scatterplots are another way to represent distributions of scores. Interpreting a scatterplot enables you to use a graph to determine the strength and direction of the correlation between scores. Interpretation of the scatterplot will also allow you to determine what type of relationship exists between the variables (linear or curvilinear).

Matching Exercise for Practice

Examine and interpret the scatterplot examples. Column A contains a list of relationship interpretations. Column B contains the scatterplot examples, numbered 1 - 6 (the images appear in the online module and are not reproduced in this printable view). Match the number of each scatterplot from Column B with the appropriate interpretation in Column A. Check your work with the feedback that follows. Once you have practiced interpreting the strength and direction of the relationship, try to imagine two variables that could be related as represented in a given distribution. For example, Plot #1 could represent Variable A as student enrollment while Variable B represents recruitment effort by university administration. As recruitment effort goes up, student enrollment goes up (strong positive correlation).

Column A: Types of relationships
A. No correlation
B. Small, positive correlation
C. Strong, positive correlation
D. Small, negative correlation
E. Strong, negative correlation
F. Curvilinear correlation

Column B: Examples of scatterplots (Plots 1 - 6)

Check your choices with the correct matches below.
Plot 1: C. Strong, positive correlation
Plot 2: B. Small, positive correlation
Plot 3: A. No correlation
Plot 4: E. Strong, negative correlation
Plot 5: D. Small, negative correlation
Plot 6: F. Curvilinear correlation

Module 6 Part 2 Validity

Module 6 corresponds to Chapters 14 and 15 in the Kubiszyn and Borich textbook. Please read those chapters before working on this module. In this part of the module you will review the types of validity evidence introduced in the text. You will be asked to use abstracts of articles reporting studies that have been conducted to examine the validity of results of specific instruments (find them using the link in the Table of Contents within this module). You may want to download and skim them before going through this material.

Content Validity

Content validity is the extent to which the items or tasks on a test are congruent with whatever the test was designed to measure. It is the most important type of validity evidence when measuring achievement. In typical classroom assessment contexts, it is important that tests are congruent with the instructional goals and objectives they were designed to measure as well as with the instruction that the students experienced.

o This means that test items or tasks should be a good match with the behavior (learning levels) and content of the objectives.
o There should be representation of the domain: good coverage of the content and learning levels that should be present, and the test should not include content and learning levels that were not part of the instructional objectives and learning activities.

Content validity is largely determined through careful logical analysis of the test items and instructional objectives. Other types of tests besides achievement tests must also demonstrate good content validity. The test blueprint, or Table of Test Specifications, helps to ensure strong content validity of test results.

Criterion-related Validity: Concurrent and Predictive

Concurrent validity is demonstrated when the scores on one test show a strong positive correlation with scores on another test that measures the same or nearly the same thing.
Test results are said to have predictive validity when the scores on a test show a strong positive correlation with a measure of the criterion of interest that is taken after a specified amount of time has passed. When this strong positive correlation is obtained, the scores are then thought to predict the future performance. Both types of criterion-related validity (concurrent and predictive) depend on the reliability of the measures involved. Both the test being created and the criterion measure (whether concurrent or future) must yield reliable results.

Construct Validity

Construct validity is demonstrated when the results of a test are shown to be consistent with outcomes that were predicted or expected based on the theory and research surrounding the attribute that is being defined and measured. Developing construct validity often entails:

o demonstrating that the construct is important, relevant, or necessary to the field
o distinguishing the construct from other similar attributes being measured
o demonstrating that the construct can be operationalized (i.e., made measurable; that which was abstract can be made observable)
o showing that the test results converge with other, similar tests to suggest or reinforce that there is such a thing as the phenomenon being measured
o demonstrating that the test results do not correlate with tests that do not measure the same or similar attributes

Empirical procedures (often involving correlation) are used to establish construct validity. Both content and criterion-related validity evidence may be employed. Construct validity is more often associated with creating measures of psychological attributes or more complex abstract phenomena than with what is typically measured in classroom settings.

(No need to post on these activities.) Look up an article that describes the process of gathering validity evidence on an instrument of interest to you. Identify and evaluate the types of validity evidence that have been obtained. Consider how you might add to that body of evidence based on the particular uses of the instrument in your professional context.

Module 7 Overview

The concepts in this module are important whether you are using measurement skills as a teacher in a classroom or in another professional role such as school leader, counselor, instructional designer, or researcher. As you begin, consider all the ways that proficiency in using results of various instruments is vital to your effective professional performance. Skills in interpreting and estimating the reliability of test scores are important for evaluating and conducting research (teacher or school leader action research, school leaders' or private non-profit evaluation research, scholarly research). Module 7 corresponds to Chapters 17 & 18 in our textbook. Content in this module relates to the text but includes content not found in the textbook as well. One of the most important attributes of high quality assessment is the validity of results (the extent to which the inferences that we make from the results are appropriate). To support validity, the content and learning level represented in each objective should carry through to the items or tasks built to measure it; describing how the skills were originally classified (such as by the arrangement of the item difficulty levels) and checking the consistency of the test items with the course content is usually called item mapping. The next most important attribute is reliability.
In this module, we will learn skills that will help you enhance the reliability of results of tests you use, create, or evaluate for research purposes. The table below contains the objectives, readings, learning activities, and assignments for Module 7.

Module 7 focuses on the following objectives:

Chapter 17 Objectives
Describe procedures used to estimate test-retest score reliability and alternate-forms score reliability.
Describe procedures used to estimate split-half and Kuder-Richardson estimates of internal consistency.
Describe how the Spearman-Brown Prophecy Formula is used, and its effect on the reliability coefficient.
Select the best test for a given purpose when provided with score reliability information for different tests.
Select the one best test for a given purpose when given reliability and validity information for several tests.
Identify the most relevant type of score reliability estimate when given different purposes for testing.
Explain the factors that affect the obtained value of the reliability coefficient (length of test, heterogeneity of group, content, etc.).

Chapter 18 Objectives
Explain how error can operate to both increase and decrease test scores.
Define and discriminate among obtained, true, and error scores.
Discriminate between the standard deviation and the standard error of measurement.
Construct 68, 95, and 99% confidence bands around obtained scores when given the standard error.
Identify the four sources of error in testing, and give examples of each.
Describe the extent to which the various estimates of reliability are differentially affected by the sources of error.

Readings: Chapters 17 & 18 in text
Learning Activities: Content and articles specified in module; several non-posted practice tasks
Assignments: Continue Project Part B (found under the Assignments tool)

Module 7 Reliability and Accuracy

This module corresponds to Chapters 16 and 17 of the Kubiszyn and Borich textbook. Please read the chapters before completing this module. It may be a good idea to skim the material in the paragraphs below, then read the materials recommended under the "DO THIS" icon at the end of this page. Then go back and re-read the sections on this page to synthesize across the resources.

Reliability

Reliability is the consistency with which an instrument is able to obtain results. After validity, it is the most important characteristic of instrument results. If instrument results are not reliable, they cannot be valid. There are a variety of ways to estimate the reliability of test scores. The choice of reliability estimate will depend on a number of factors, and sometimes it may be necessary to use multiple procedures to fully estimate the reliability of scores. Sometimes it is necessary to use a method that would be less than ideal but is still the best available. The various methods of estimating the reliability of test scores include: stability, equivalence, stability and equivalence, internal consistency, and inter-rater consistency. Different methods will result in different values for the reliability coefficient. Different resources may suggest different rules of thumb for interpreting reliability, but most would agree that the higher the stakes, the higher the reliability that should be expected of the scores. Standardized multiple-choice tests typically have reliability coefficients in the range of .85 - .95. Paper and pencil tests may range between .65 and .80. Portfolio assessments may range between .40 - .60. Consult reputable resources (e.g.,
the Standards for Educational and Psychological Testing) when interpreting reliability coefficients. There are a number of factors that affect their interpretation.

Stability, Equivalence, and Stability and Equivalence Methods

Stability is measured by estimating the correlation of test scores obtained from the same individuals over a period of time (test-retest reliability).

o Test-retest estimation involves administering the same instrument two times to the same group of students after a pre-determined period of time. Evidence of the stability of the test scores is provided if the students who score high the first time around are the same students who score high the second time around (and the remaining students keep the same relative ranks from the first to the second administration as well).
o The amount of time that elapses is an important consideration. The longer the interval between administrations, the lower the correlation between the sets of scores. When there is a long interval between administrations, the correlation between sets of scores is diminished not only by the lack of stability of the scores but also by other factors that could interfere. In addition to the lack of stability of scores, changes in the student could occur which make it less likely that the same rank will be held from one test administration to the next.
o Scores that might be obtained at one point in time but may be relevant or used at a later point in time must demonstrate stability (standardized tests, for example, that are used for admissions to academic programs). Stability is less important for scores that will be used relatively shortly after they are obtained and are not likely to be relevant at a point in the future (classroom unit tests, for example).
o As an example of the criteria for evaluating a test-retest reliability coefficient, measures of stability commonly reported for standardized tests of aptitude and achievement, for administrations within the same year, are about .80. This is important when using standardized assessment scores from students' permanent records. Consider the date of the assessment and whether there is stability evidence available to indicate that the scores are still relevant after the amount of time that has elapsed (Linn & Miller, 2005).

Equivalence is evaluated by creating two or more forms (parallel forms) of the same instrument and administering them to the same individuals at about the same time.

o Parallel-forms reliability is an indication of the short-range constancy of performance as well as the adequacy of the sampling of the domain. Recall that to ensure reliability, it is important to get a representative sample of items from the many possible items in the domain. If scores on equivalent forms of the test are highly correlated, it is an indication that the test is an appropriate sample of the domain; if multiple equivalent samples of the domain are correlated, reflecting similarity in the results, then this indicates that the (equivalent) samples are a good representation of the domain.
o This method of estimation is often used with standardized tests, as there are often multiple (parallel) forms needed. Along with the content validity evidence, evidence of the equivalence of the forms must be provided. This is true for any type of test that would offer parallel-forms reliability estimates.

The following research article abstract is an example of a study that examined the equivalence of test scores across testing methods.
Abstract: This study explores the equivalence of web-based administration with no local supervision and traditional paper-and-pencil supervised versions of OPQ32i (the ipsative format version of the Occupational Personality Questionnaire). Samples of data were collected from a range of client projects and matched in terms of industry sector, assessment purpose (selection or development) and candidate category (graduate or managerial/professional). The analysis indicates that lack of local supervision in high-stakes situations has little if any impact on scale scores. At worst, some scales appear to show shifts of less than a quarter of an SD, with most scales showing little change, if any. Analyses in terms of the Big Five show differences of less than .2 of an SD. Scale reliabilities and scale covariances appear to be unaffected by the differences between the supervised and unsupervised administration conditions.

Bartram, D. & Brown, A. (2004). Online testing: mode of administration and the stability of OPQ 32i scores. International Journal of Selection and Assessment, 12(3), 278-284.

The following research article abstract is an example of a study that examined the stability and equivalence of test scores from a modified version of a widely used test in early childhood.

Abstract: Examined the psychometric properties of a set of preliteracy measures modified from the Dynamic Indicators of Basic Early Literacy Skills (DIBELS) with a sample of 75 kindergarten students. The modified battery (called DIBELS--M) includes measures of Letter Naming Fluency, Sound Naming Fluency, Initial Phoneme Ability, and Phonemic Segmentation Ability. These measures were assessed through repeated administrations in 2-wk intervals at the end of the kindergarten year. Results indicate interrater reliability estimates and coefficients of stability and equivalence for 3 of the measures ranged from .80 to the mid .90s, with about one-half of the coefficients above .90. Correlations between DIBELS--M scores and criterion measures of phonological awareness, standardized achievement measures, and teacher ratings of achievement yielded concurrent validity coefficients ranging from .60 to .70. Hierarchical regression analysis showed that the 4 DIBELS--M measures accounted for 73% of the variance in scores on the Skills Cluster of the Woodcock-Johnson Psychoeducational Battery--Revised. The contributions of the study, including psychometric analysis of the DIBELS--M with a new sample and formation of composite scores, are discussed in relation to the extant literature.

Elliott, J., Lee, S., & Tollefson, N. (2001). A reliability and validity study of the Dynamic Indicators of Basic Early Literacy Skills-Modified. School Psychology Review, 30(1), 33-49.

Stability and equivalence is obtained by administering two forms with a relatively long delay in between.

o The combined procedure is even more rigorous than the stability or equivalence procedures alone and is often estimated for standardized test results. While it is a rigorous test of the reliability of scores, it is not as commonly reported in the literature.
o The procedure is in effect a measure of both the constancy of the scores and the representativeness of the domain.

Internal Consistency Reliability Estimates

Internal consistency methods involve only one test administration and involve procedures for estimating the correlations of items within the test.
Common internal consistency estimates include split-half reliability, the Kuder-Richardson formulas (KR20 and KR21), and Cronbach's alpha.

Split-half reliability requires the administration of the test to only one group. The test is then split into halves that are equivalent. The correlation between the scores on the two halves is then estimated.

o Split-half is similar to the equivalent-forms method of estimation in that it provides an indication of whether or not the sample of items on the test is a dependable sampling of the domain.
o A problem with using split-half reliability estimation for classroom achievement tests is how to split the items such that content and difficulty are equivalent across the halves.

The Kuder-Richardson methods estimate an average reliability found by taking all possible splits of the test. The methods assume that the items on a single test measure the same attribute and that the test is a power test and not a speed test.

o The estimation involves a comparison of the sum of the item variances to the overall test variance. (Variance, like the range and standard deviation, is a measure of variability. It is equal to the standard deviation squared.)
o Because the KR methods rely on item variance (as can be observed in the KR20 and KR21 formulas), it is important to remember that they may not be the best reflection of reliability for criterion-referenced tests, in which it is possible and appropriate that every student answers an item or many items correctly. When this happens, item variance is reduced and the resulting reliability estimate will be lower. Therefore, while it may not be the best method for estimating reliability of results for a criterion-referenced teacher-made test, it is still the most practical method available and must be interpreted in light of these factors.
o The Kuder-Richardson internal consistency reliability methods are not appropriate for speeded tests - the estimates tend to be inflated with speeded tests. Teacher-made criterion-referenced tests are not affected as much by this caution as a standardized test in which the time allowance is more of a factor. Kuder-Richardson reliability estimates must be interpreted with caution unless it has been determined that respondents generally had adequate time.
o Another limitation is that the KR reliability estimation methods do not reflect the stability of scores over time.

Cronbach's alpha (coefficient alpha) is a variation of the Kuder-Richardson methods that is used when the responses are not scored on a dichotomous scale (i.e., two possible answer judgments, "correct" or "incorrect") but rather come from a scale with multiple response choices where the answer can receive more than one point.

o This method is used with scales such as Likert-type scales (e.g., 5 - strongly agree, 4 - agree, 3 - neutral, 2 - disagree, 1 - strongly disagree), which are often used on survey questionnaires or instruments measuring psychological traits or attitudes.
o The same cautions that apply to the KR methods apply to coefficient alpha (it should not be used with speeded tests and does not indicate stability over time). The same advantage (a single test administration) applies to coefficient alpha as well.
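To make the comparison of "the sum of the item variances to the overall test variance" concrete, here is a small supplementary sketch. The item-score matrix is hypothetical; the function computes coefficient alpha, which with 0/1 item scoring is equivalent to KR20.

```python
# Supplementary sketch: coefficient alpha (equivalent to KR20 for 0/1 items)
# computed from a hypothetical matrix of item scores (rows = examinees, columns = items).

def variance(values):
    """Population variance: mean squared deviation from the mean."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)

def coefficient_alpha(matrix):
    k = len(matrix[0])                                    # number of items
    item_vars = [variance([row[i] for row in matrix]) for i in range(k)]
    total_var = variance([sum(row) for row in matrix])    # variance of total scores
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

scores = [          # hypothetical 0/1 item scores for 6 examinees x 5 items
    [1, 1, 1, 1, 1],
    [1, 1, 1, 1, 0],
    [1, 1, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [1, 0, 0, 1, 0],
    [0, 0, 0, 0, 0],
]
print(f"KR-20 / coefficient alpha = {coefficient_alpha(scores):.2f}")
```

Notice in the sketch how reduced item variance (for example, an item nearly everyone answers correctly) pulls the estimate down, which is exactly the caution raised above for criterion-referenced tests.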
The following research article abstract is an example of a study that examined the Cronbach's alpha internal consistency reliability of scores from a teacher-report measure of reading engagement.

Abstract
This study examined psychometric properties of the Kindergarten Reading Engagement Scale (KRES), a brief teacher-report measure of classroom reading engagement. Participants were 27 students with identified reading deficits from a predominantly low-income, African-American community. Data were collected in kindergarten (Time 1) and first grade (Time 2). The KRES demonstrated strong internal consistency (Cronbach's alpha = .96) and modest test-retest reliability (r = .66). KRES ratings were significantly correlated with scores from the Word Reading subtest of the Wechsler Individual Achievement Test-Second Edition and the Sound Matching subtest of the Comprehensive Test of Phonological Processing, measured at Time 1 and Time 2. Strategies for refining the scale and implications for applying the KRES in school-based program evaluations are discussed.

Clarke, A.T., Power, T.J., Blom-Hoffman, J., Dwyer, J.F., Kelleher, C.R., & Novak, M. (2003). Kindergarten reading engagement: An investigation of teacher ratings. Journal of Applied School Psychology, 20(1), 131-144.

Inter-rater reliability
Interrater reliability indicates the consistency of scores that require judgments. On a performance rating, for example, it would be important to know whether the ratings that were obtained would be consistent if more than one rater evaluated the work or if the work was evaluated on more than one occasion. Consistency is defined as the similarity of the rank order of ratings by two different judges.
o The reliability estimate may be a correlation estimated using the two sets of scores (i.e., the scores from judge one and the scores from judge two).
o Another interrater reliability estimation method involves computing the percentage of agreement between the two scorers.
o The method selected depends on the purpose of the scores or ratings. If the scores needed are rankings, the correlation coefficient would be selected; if the actual score is needed (such as in a pass/fail decision), the percentage agreement method would be selected. (A small computational sketch contrasting the two methods follows the example abstract below.)

The following research article abstract is an example of a study that examined the rater agreement method of estimating the reliability of scores from a structured interview process used to determine special health care needs of children.

Abstract
The purpose of this study was to determine if two teams of raters could reliably assign codes and performance qualifiers from the Activities and Participation component of the International Classification of Functioning, Disability, and Health (ICF) to children with special health care needs based on the results of a developmentally structured interview. Method: Children (N = 40), ages 11 months to 12 years 10 months, with a range of health conditions, were evaluated using a structured interview consisting of open-ended questions and scored using developmental guidelines. For each child, two raters made a binary decision indicating whether codes represented an area of need or no need for that child. Raters assigned a performance qualifier, based on the ICF guidelines, to each code designated as an area of need. Cohen's kappa statistic was used as the measure of inter-rater reliability. Results: Team I reached good to excellent agreement on 39/39 codes and Team II on 38/39 codes. Team I reached good to excellent agreement on 5/5 qualifiers and Team II on 10/14 qualifiers. Conclusions: A developmentally structured interview was an effective clinical tool for assigning ICF codes to children with special health care needs. The interview resulted in higher rates of agreement than did results from standardized functional assessments. Guidelines for assigning performance qualifiers must be modified for use with children.

Kronk, R., Ogonowski, J., Rice, C., & Feldman, H. (2005). Reliability in assigning ICF codes to children with special health care needs using a developmentally structured interview. Disability & Rehabilitation, 27(17), 977-983.
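Here is a minimal sketch, with invented ratings rather than data from the study above, contrasting the two inter-rater approaches described earlier: a correlation between two judges' scores (consistency of rank order) versus the percentage of exact agreement (useful for decisions such as pass/fail).

from math import sqrt

judge_one = [4, 3, 5, 2, 4, 1, 3, 5]   # hypothetical ratings from judge one
judge_two = [4, 2, 5, 2, 3, 1, 3, 4]   # hypothetical ratings from judge two

def pearson_r(x, y):
    # Pearson correlation between the two judges' ratings.
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x) / n)
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y) / n)
    return cov / (sd_x * sd_y)

def percent_agreement(x, y):
    # Percentage of ratings on which the two judges match exactly.
    matches = sum(1 for a, b in zip(x, y) if a == b)
    return 100 * matches / len(x)

print("Correlation between judges:", round(pearson_r(judge_one, judge_two), 2))
print("Percentage agreement:", round(percent_agreement(judge_one, judge_two), 1))

The correlation answers the question "do the judges rank the work the same way?", while percentage agreement answers "do the judges assign the same score?"; two judges can rank students identically yet agree exactly on very few individual scores.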
Standard Error of Measurement
Our common sense tells us that no measurement procedure is perfect. We must acknowledge that every score contains a certain amount of error. If we use scores to make decisions, we have the responsibility of making ourselves familiar with an estimate of the amount of error they contain. The higher the stakes, the more responsibility we have to learn about the error and interpret the score accordingly.

Test scores with low reliability contain large amounts of error, and test scores with high reliability contain smaller amounts of error. Low reliability would mean that there would be large variations in assessment results if students were to be retested over and over again. If scores reflected high reliability, we would have more confidence that if students were tested over and over again, there would be little variation in the ranking of their scores.

The standard error of measurement (SEM) provides an estimate of the amount of error in a score. It represents the amount that a score would be expected to vary if the test were administered over and over again. Scores should be interpreted within the context of the error they contain. This involves reporting a band (range) within which the observed score falls. Upon retesting, the score might fall anywhere within the band. Standardized tests report the standard error estimates in the technical manual and often provide the bands on the score reports. The band indicates how much a student's score may vary upon retesting. Bands can be reported at different levels of confidence or probability (based on the standard normal curve). Bands may be reported at the 68% confidence level, 95% confidence level, or 99% confidence level.

The standard error of measurement is estimated using the following formula:
SEM = SD x sqrt(1 - r), where SD is the standard deviation of the test scores and r is the reliability coefficient.

Depending on the degree of confidence needed, a band is obtained by adding and subtracting the SEM until the desired level of confidence is reached. A band at the 68% confidence level is obtained by adding and subtracting one standard error of measurement to and from the score. Consider this example: X = 54 and SEM = 2 (recall that X is the symbol for a score).
o The 68% band is represented by the range of scores from 52 - 56 (54 ± 1 SEM: subtract the SEM of 2 from 54 and then add the SEM of 2 to 54); upon retesting, 68% of the time the student's score is likely to fall between 52 and 56.
o The 95% band is represented by the range of scores from 50 - 58 (54 ± 2 SEM, or 54 ± 4); upon retesting, 95% of the time the student's score is likely to fall between 50 and 58.
o The 99% band is represented by the range of scores from 48 - 60 (54 ± 3 SEM, or 54 ± 6); upon retesting, 99% of the time the student's score is likely to fall between 48 and 60.

More advanced measurement textbooks will further explain the meaning of the standard error of measurement, how it is derived, and how it is best interpreted. You are encouraged to study beyond what the time constraints of this course allow, especially if you use test scores to make the types of decisions that have lasting effects on the lives of other people. A brief computational sketch of the SEM and its bands appears below.
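The following is a minimal sketch of the SEM formula and the 68%, 95%, and 99% bands worked out above for X = 54 and SEM = 2; the standard deviation and reliability values are invented so that the SEM comes out to 2.

from math import sqrt

def standard_error_of_measurement(sd, reliability):
    # SEM = SD x sqrt(1 - reliability coefficient)
    return sd * sqrt(1 - reliability)

def score_band(observed_score, sem, number_of_sems):
    # Add and subtract the chosen number of SEMs around the observed score.
    return observed_score - number_of_sems * sem, observed_score + number_of_sems * sem

sem = standard_error_of_measurement(sd=10, reliability=0.96)   # equals 2.0
for number_of_sems, confidence in [(1, "68%"), (2, "95%"), (3, "99%")]:
    low, high = score_band(54, sem, number_of_sems)
    print(confidence, "band for a score of 54:", round(low), "-", round(high))

The printed bands are 52 - 56, 50 - 58, and 48 - 60, matching the worked example above.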
The following research article abstract is an example of a study that reported test-retest reliability coefficients and the standard error of measurement values calculated from those coefficients.

Abstract
Test-retest reliability of the Test of Variables of Attention (T.O.V.A.) was investigated in two studies using two different time intervals: 90 min and 1 week (±2 days). To investigate the 90-min reliability, 31 school-age children (M = 10 years, SD = 2.66) were administered the T.O.V.A. and then readministered the test 90 min afterward. Significant reliability coefficients were obtained across omission (.70), commission (.78), response time (.84), and response time variability (.87). For the second study, a different sample of 33 school-age children (M = 10.01 years, SD = 2.59) were administered the test and then readministered the test 1 week later. Significant reliability coefficients were obtained for omission (.86), commission (.74), response time (.79), and response time variability (.87). Standard error of measurement statistics were calculated using the obtained coefficients. Commission scores were significantly higher on second trials for each retest interval.

TABLE 2. Scores for the 1-Week Interval (N = 33)
T.O.V.A. Score              First M    First SD   Second M   Second SD    r     SEM
Omission                     90.39      21.85      91.42      21.86      .86    5.61
Commission                   92.39      19.95     105.88*     15.37      .74    7.65
Response time                94.63      15.55      90.85      21.05      .79    6.87
Response time variability    97.70      18.32      98.64      20.94      .87    5.41
*p < .01.

Leark, R.A., Wallace, D.R., & Fitzgerald, R. (2004). Test-retest reliability and standard error of measurement for the Test of Variables of Attention (T.O.V.A.) with healthy school-age children. Assessment, 11(4), 285-289.

Factors That Influence Reliability Interpretation
Several factors affect the calculation and interpretation of reliability estimates and must be kept in mind when interpreting test results.
o The number of items or tasks on the test affects the reliability estimate. Tests with more items tend to have higher reliability estimates, and shorter tests have lower reliability coefficients. (The sketch at the end of this part illustrates this relationship.)
o The variability of the scores also influences the obtained reliability estimate. Less variability results in lower reliability, while higher variability tends to result in higher estimates of reliability.
o The level of objectivity influences the reliability. More objectivity results in higher reliability, while lower objectivity is related to lower reliability.
o The difficulty of test items affects the reliability. Extreme levels, in which all students answer incorrectly or all students answer correctly, result in lower reliability.
It is important to consider the possible sources of error in the scores when selecting and interpreting correlation coefficients.
o Read the National Council on Measurement in Education (NCME) Instructional Modules on reliability:
NCME Instructional Module on Reliability of Scores from Teacher-Made Tests, found at http://www.ncme.org/pubs/items/ITEMS_Mod_3.pdf
NCME Instructional Module on Understanding Reliability, found at http://www.ncme.org/pubs/items/15.pdf

Now do the practice exercises on the worksheet found in the table of contents of this module. You do not need to submit these practice exercises, but they are very helpful for demonstrating your understanding of the reliability of test results. You may want to discuss with your group members any difficulties you are having with the content of the modules or the calculation and interpretation practices on the worksheets.
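The relationship between test length and reliability noted above is usually quantified with the Spearman-Brown prophecy formula. The formula is not derived in this module, so the following is only a minimal illustrative sketch; the reliability values used are invented.

def spearman_brown(reliability, length_factor):
    # Projected reliability when the number of comparable items is multiplied
    # by length_factor (2 = doubled, 0.5 = halved).
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)

# A 20-item test with a reliability of .60 that is doubled to 40 comparable items:
print(round(spearman_brown(0.60, 2), 2))   # about .75
# The same formula is commonly used to project a split-half correlation (here .70)
# up to the reliability of the full-length test:
print(round(spearman_brown(0.70, 2), 2))   # about .82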
Examine an article that reports empirical research on a topic of interest to you. It may be one that you have located for one of your other courses (or a new one for this purpose, if you wish). Examine the "instrument" or "measurement" section of the article to locate the reliability estimates reported for the instruments used to collect data for that study. Evaluate the estimates of reliability using the criteria you have just learned.

Practice Exercises (please click the link to download the PDF file for the exercises).

Printable View of: Module 8: Standardized Test Score Interpretation
File: edf6432 Module 8 Overview
EDF 6432 - Measurement and Evaluation in Education
Dr. Haiyan Bai

Module 8 Overview
The concepts in this module are important whether you are interpreting standardized test scores as a teacher in a classroom or in another professional role such as school leader, counselor, or researcher. Consider all the attention that standardized test scores receive in the media and research. These measurement skills can assist you in performing your role, consistent with your professional philosophy, and with high-quality information at your fingertips to make effective decisions. The skills are also important for interpreting and conducting research (teacher or school leader action research, school leaders' or private, non-profit evaluation research, scholarly research).

Module 8 corresponds to Chapters 19 & 20 in our textbook. We will study standardized tests and their derived scores. Content in this module relates to the text but includes content not found in the textbook as well. We have studied the important attributes of high-quality assessment: validity and reliability of results (the extent to which the inferences we make from the results are appropriate, and the consistency with which we obtain results). In this module, you will find information on the validity and reliability of standardized test results and will interpret specific types of derived scores. The table below contains the objectives, readings, learning activities, and assignments for Module 8.

Module 8 focuses on the following objectives:
Chapter 19 Objectives
Compare and contrast standardized and teacher-made tests.
Indicate the sources of error controlled or minimized by standardized tests.
Develop a local norms table given a set of test scores.
Describe how student-related factors can affect standardized test scores.
Convert grade equivalents and percentile ranks to standard scores (using a supplied conversion table) to facilitate determining aptitude-achievement discrepancies.
Compare and contrast grade equivalent scores, age equivalent scores, percentile ranks, and standard scores.
Chapter 20 Objectives
Discriminate among standardized achievement test batteries, single-subject achievement tests, and diagnostic achievement tests.
Discriminate between aptitude tests and achievement tests.
Compare and contrast diagnostic tests and survey batteries.
Explain why there is no universally accepted definition of personality.
Compare and contrast objective and projective personality assessment techniques and identify the major advantages and disadvantages of each approach.
Readings: Chapters 19 & 20 in text; content and articles specified in the module; Florida Department of Education website related to FCAT reports
Learning Activities: Several non-posted practice tasks; posting to the class Standardized Test Score discussion topic
Assignments: Revise the Final Project

Module 8 Part 1: Standardized Testing
This module accompanies Chapters 18-19 in the Kubiszyn & Borich (2007) textbook. Please read Chapters 18 and 19. You must also use the principles of validity and reliability, as well as central tendency and variability, to fully understand the content related to this module.

Basic Characteristics of Standardized Tests
Standardized tests are usually commercially published after a long and expensive development process. They are called standardized because they are to be administered and scored according to specified procedures, in the same way every time they are used. As we have learned from "The Standards," the author, publisher, and users of test results are responsible for identifying the evidence for the validity of the test results.

After reading about the different types of tests in Chapters 18-19 of the textbook, locate an example of each of the following types of standardized tests. You may want to visit the Buros Mental Measurements Yearbook site again to look for your examples. Recall/review the definition of each of the test types and then look specifically for evidence of the validity of the test results. (Many of the large commercial test publishers have websites where you can find some of this information; other information can be found by conducting an online library or ERIC search.) This activity is to extend your application of the concepts related to the qualities of instruments that we have learned throughout the semester. (It is not necessary to post these examples as an assignment.)
Norm-referenced academic achievement test (locate an example of this type of standardized test)
Criterion-referenced academic achievement test (locate an example of this type of standardized test)
Scholastic aptitude test (locate an example of this type of standardized test)

Identify an example of a standardized test that has been used in a research study of interest to you. (You may use one of the instruments from the examples above if you wish.) Now locate the author's or publisher's description of the test's purpose and any information about the validity of the test that is available from the author/publisher. Compare the way the test was administered and the results interpreted in the research study to the author's stated purpose and use for the test. Post a brief summary (one short paragraph) to the Standardized Tests Discussion Topic of what you found when you compared the intended use to the actual use in the research study. Briefly comment on your reaction to what you have found. Read a couple of your classmates' postings to compare their findings with yours.

You may wish to use the following outline to guide your work or for evaluating standardized tests in general. It contains some of the most critical elements that should be considered when evaluating the quality of a standardized test.
I. Reference Data
a. Title
b. Author(s)
c. Publisher
d. Type of test
e. Description of test and subtests
II. Practical Considerations
a. Cost
b. Time limits
c. Alternate forms
d. Appropriate grade levels
e. Availability of manual
f. Copyright date of manual and test booklets
g. Purpose of test
h. Required administrator qualifications
III. Reliability
a. Reliability for each recommended use
b. Type(s) of reliability reported
IV. Validity
a. Validity evidence for each recommended use
b. Types of validity evidence
V. Scales and Norms
a. Types of norms provided
b. Difficulty levels of items
c. Population used for norm group
d. Methods used to select norm group
e. Year and time of year standardization data were collected

Module 8 Part 2: Standardized Test Score Interpretation
This part of the module is associated with Chapters 18 and 19 in the Kubiszyn and Borich textbook. Please read those chapters before working on these activities.

Practice interpreting a variety of standardized test scores. Create a diagram of the standard normal curve. Using the means and standard deviations of the scales listed below, plot the scores that correspond to the familiar marker points. You may want to use Figure 18.3 in the textbook to help you get started. Then practice interpreting as required in the questions that follow. Remember the Students Helping Students discussion topic if you would like to compare answers with classmates or get some pointers.

Scales:
z (Mean 0; SD 1)
T (Mean 50; SD 10)
IQ (Mean 100; SD 15 for the Wechsler or 16 for the Stanford-Binet)
Normal Curve Equivalent (Mean 50; SD 21.06)

Questions for interpretation practice (a brief conversion sketch at the end of this part may help you check your work):
How would you describe the performance of a student who earned a z score of 2.5 on a norm-referenced standardized test?
How would you describe the performance of a student who earned a T score of 28 on a norm-referenced standardized test?
How would you describe the performance of a student who earned a percentile rank of 28 on a norm-referenced standardized test?
What Normal Curve Equivalent score is equal to a percentile rank of 16?

Recall the process of interpreting percentile bands presented in Chapter 17. Practice building 68% percentile bands around the following obtained percentile rank scores. Assume there was an SEM of 4 on this particular subtest and interpret the performances as indicated. Plot each band on a 1-99 number line (1_____10_____20_____30_____40_____50_____60_____70_____80_____90____99); for example, Angelina's band is marked in the 78 - 86 region and Chris's in the 88 - 96 region. Bands and interpretations for Angelina and Chris have been done to help you get started.

Student    Percentile Rank    68% Band Boundaries    Interpretation
Angelina   82                 78 - 86                Average compared to the norm group
Chris      92                 88 - 96                Above average compared to the norm group; Chris performed better than Angelina
Dewan      88                 ___                    ___
Cheerie    24                 ___                    ___
Lan        55                 ___                    ___
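The following is a minimal sketch, not from the textbook, of converting among the scales listed above (z, T, Wechsler IQ, NCE) and of finding the approximate percentile rank for a z score from the standard normal curve. The means and SDs are the ones given in the module; the z values are just examples (a T score of 28 corresponds to z = -2.2).

from math import erf, sqrt

SCALES = {"T": (50, 10), "Wechsler IQ": (100, 15), "NCE": (50, 21.06)}

def z_to_scale(z, mean, sd):
    # A standard score is the scale mean plus z standard deviations.
    return mean + z * sd

def z_to_percentile(z):
    # Area under the standard normal curve below z, expressed as a percentile rank.
    return 100 * 0.5 * (1 + erf(z / sqrt(2)))

for z in (2.5, -2.2):
    print("z =", z, "-> percentile rank of about", round(z_to_percentile(z), 1))
    for name, (mean, sd) in SCALES.items():
        print("   ", name, "=", round(z_to_scale(z, mean, sd), 1))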
Visit the Florida Department of Education website and locate information on the Florida Comprehensive Assessment Test (FCAT). Locate the Assessment and Accountability Briefing Book (especially pp. 21 - 24) and the FCAT Reading and Math Technical Report (use "FCAT reliability" to search within the FLDOE site under the Shortcuts keyword search). Read about the meaning of the various scores that are reported. Choose an area of interest to you (e.g., Math or Reading) to examine more closely. Locate the validity, reliability, and standard errors reported in the technical report.

Next, use the tables found at the link Scale Scores at Achievement Levels (URL is http://fcat.fldoe.org/pdf/fcAchievementLevels.pdf) to find the range of scale scores associated with each of the FCAT levels. For example, what range of Reading Scale Scores is associated with Grade 3, Level 3? (answer: 284 - 381)

Also, visit the Publications (Educator) page for the Florida Comprehensive Assessment Test at http://fcat.fldoe.org/fcatpub2.asp for important information to use when interpreting FCAT results. You may want to save the URL for future reference. For practice, locate the following information within the documents found on that website.
Examine the FCAT Mathematics 2007 Grade 9 content focus. How many points are possible from the Data Analysis and Probability content area? ________ (example answer: 8)
Examine the FCAT Reading 2007 Grade 10 content focus. How many points are possible from the Conclusions and Inferences (Cluster 1) content area? ________
Examine the FCAT Summary of Tests and Design. What percentage of points are of moderate complexity on the FCAT Mathematics 6th-7th grade test (Table 9)? ________
Examine the FCAT Technical Report for 2003. Locate Table 69. What is the Cronbach's alpha reliability coefficient for the Grade 3 Reading total battery? ________

Consider how much more of this technical information you are able to understand and use because of the hard work you have done acquiring the skills in this course. Consider what you know now compared to what you knew prior to studying the content of this course. Good job!

Module 9 Overview
The concepts and resources in this module are important for assessing students receiving exceptional student education services and students with limited English proficiency, whether you are a teacher in a classroom or someone in another professional role such as school leader, counselor, instructional designer, or researcher. Consider the importance of validity and reliability principles to the assessment process and how important they are when selecting, designing, and administering instruments for students with special needs. Principles for ensuring validity (appropriateness of inferences made from test results) and reliability (consistency) are no less critical when making accommodations with existing instruments. These concepts and skills are also important for interpreting and conducting research (teacher or school leader action research, school leaders' or private, non-profit evaluation research, scholarly research) that requires measures of students with special needs.

Module 9 corresponds to Chapters 3 & 21 in our textbook. We will learn about the individualized educational plan (IEP) and the assessment of children with special needs, and we will examine the related policy and practice. Content in this module relates to the text but includes content not found in the textbook as well. One of the most important attributes of high-quality assessment is the validity of results (the extent to which the inferences we make from the results are appropriate). One of the most important steps to ensuring validity is identifying what it is you want to assess (who and what), for what purpose (why), and under what conditions (how). In this module, we will learn skills that will help you enhance the validity of the results of tests you use, create, or evaluate for research purposes. The table below contains the objectives, readings, learning activities, and assignments for Module 9.
Module 9 focuses on the following objectives:
Chapter 3 Objectives
Identify the types of assessment data the classroom teacher may be called upon to provide as part of the child identification process.
Identify the types of assessment instruments the classroom teacher may employ during the individual assessment process.
Describe what response to intervention (RTI) is.
Explain the classroom teacher's role in RTI development and its implementation.
State the purpose of the RTI.
Understand how the requirements of the NCLB, IDEIA, and the shift to formative assessment (CBM and RTI) have altered regular classroom testing.
Chapter 21
Summary of the course.

Readings: Chapters 3 & 21 in text; content and articles specified in the module (note there are many sites to explore and keep handy for future reference); chapter on the FLDOE website, Accommodations: Assisting Students with Disabilities (link is within the module); brochure from the FLDOE, Planning FCAT Accommodations for Students with Disabilities (link is within the module)
Learning Activities: Review the resources within the module (take advantage of the guidelines, research, tutorials, etc. available on the Internet that are listed within the module)
Assignments: Revise the final version of the Final Project for the final submission due.

Module 9: Assessment Issues for Language Enriched Pupils & Exceptional Student Education Settings

General Principles
The general principles we have learned that guide the selection, design, and construction of assessments for the general population apply just as much, or more so, for Language Enriched Pupils (LEP) and in Exceptional Student Education (ESE) settings. Recall these principles from Module 2 Part 6:
Systematically planned
Good match for content
Good match for learners
Feasibility
Professional in format and appearance

Imagine students who speak languages other than English in their home. As an educator, you will no doubt at some time be responsible for the learning of students who are in the process of learning to speak, read, write, and understand the English language along with the subject-matter content (math, science, social studies, music, etc.). Now think about students who must learn differently than students in the general population. For example, think of a student with mild mental retardation, or with cerebral palsy, or with a reading disability. This module will help you more effectively plan assessments for students with these or similar challenges.

Systematically Planned and Constructed
Teachers must consider their resources (time, materials, cost, professional skill, etc.) along with the entire instructional curriculum and then set up an assessment system at the beginning of the year that will support their planned instructional system. At times it is necessary to make adaptations to the assessments within the system to accommodate the needs of special learners. While an effective instructional/assessment system is tailored to your specific context and resources, it must also be adapted to meet the needs of a variety of learners. It can be a challenge to find the balance between a plan that is individualized for specific learners and one that is feasible. Knowledge and creativity will help in the effort required to meet that challenge; administrative support, assistance from trained ESE and LEP professionals, and patience are also needed. Educators must make themselves aware of available resources to help meet the challenge of appropriately adapting assessments.
Look for resources that will help you become familiar with the specific needs of the learners. These resources may include school-based personnel with special training in ESOL and ESE; Internet websites with information and tips; volunteers from the community with knowledge of various languages, cultures, or challenges to learning; or textbooks about these topics. Of course, the learners and their families would be good sources of information about the learners' strengths and needs as well. Once you have made yourself familiar with the learning modalities of the special learners, it is time to apply your creativity and measurement "best practices" to make needed adaptations to the instruments or administration procedures.

Explore the sites listed in Table 9.1 and find others that will help you become familiar with available resources related to instruction and assessment of language enriched students or students with exceptional learning needs. Note that you must evaluate the suggestions you find here through a "Best Practices in Measurement" filter and then accept, revise, or reject each suggestion based on its consistency with measurement best practices. You may want to save the addresses of the sites that will be helpful in your professional context. Several of these sites will also be helpful as you work on our Product Exam Part B.

Table 9.1 Example Resources: Learning and Assessment for Special Learners

1. English Speakers of Other Languages
1.1 NWREL publication Spring 2006, Volume 11, #3, Everyone's Child (URL is http://www.nwrel.org/nwedu/11-03/child/). This site is from the Northwest Regional Educational Laboratory publication of Spring 2006, Volume 11, #3. Note the tips for general education teachers at the end of the article Everyone's Child.
1.2 NCREL Critical Issue: Mastering the Mosaic (URL is http://www.ncrel.org/sdrs/areas/issues/content/cntareas/math/ma700.htm). Mastering the Mosaic - Framing Critical Impact Factors to Aid Limited English Proficient Students in Math and Science, a resource from the North Central Regional Educational Laboratory, contains extensive research-based information. Note the overviews of ELL programs and the sections on Instruction and Assessment. This resource is useful to many education professionals such as teachers, administrators, and ELL personnel.
1.3 The Office of Academic Achievement through Language Acquisition (AALA) website is the Florida Department of Education site with important resources for teachers and administrators (URL is http://www.fldoe.org/aala/). Among the many resources found there, be sure to note the Documents and Publications link. In the long list of resources, especially take note of Accommodations for Limited English Proficient Students in the Administration of Statewide Assessments; Technical Assistance Paper: Modifications to the Consent Decree...; Inclusion as an Instructional Model for LEP Students; and Clustering II: Technical Assistance Note....
1.4 The University of South Florida St. Petersburg ESOL Infusion Site contains extensive ESOL resources for a wide variety of education professionals (URL is http://fcit.usf.edu/esol/resources/resources_articles.html). This may be a useful site to bookmark, as there are so many resources.
1.5 CCSSO Ensuring Accuracy in Testing for English Language Learners (URL is http://www.ccsso.org/Resources/Programs/English_Language_Learners_%28ELL%29.html). This is a publication found on the website of the Council of Chief State School Officers.
It is useful to all but essential to administrators. You will have to follow the links to Publications and then search using the keyword Assessment under the category Limited English Proficient Students. This is a valuable free, downloadable PDF file with excellent guidelines on accommodations in test design. Note especially chapters 4, 5, and 9.
1.6 Organizing and Assessing in the Content Area Class (URL is http://www.everythingesl.net/inservices/judith2.php) provides some useful suggestions for instruction and assessment. Helpful suggestions (but use your measurement filter here).
1.7 CRESST Reports from the National Center for Research on Evaluation, Standards, and Student Testing (URL is http://www.cse.ucla.edu/products/reports.asp) contains a variety of research-based reports. Explore those related to LEP concerns in your discipline.

2. Students with Exceptional Needs
2.1 At the Florida Department of Education Bureau of Exceptional Education and Student Services site, locate the Publications Index (URL is http://www.fldoe.org/ese/). The site contains many useful resources. Locate Technical Paper 312783 TAP FY2007-4, Accommodations for Students with Disabilities Taking the Florida Comprehensive Assessment Test (FCAT). You may want to download or bookmark this for future reference.
2.2 Adaptations and Accommodations for Students with Disabilities (URL is http://www.nichcy.org/pubs/bibliog/bib15txt.htm) contains a bibliography of many useful resources. You may want to find the article: Fuchs & Fuchs (1998, Winter). General educators' instructional adaptations for students with learning disabilities. Learning Disability Quarterly, 21(1), 23-33.
2.3 On the Access Disabled Assistance Program for Tech Students site, there is a page on Teaching Students with Disabilities (URL is http://www.catalog.gatech.edu/general/services/assist.php). There are sections on the characteristics of students with various types of disabilities and lists of academic accommodations grouped by category of disability.
2.4 The Assistive Technology Educational Network (ATEN) (URL is http://www.aten.scps.k12.fl.us/resources.html) provides many resources to explore. Note the Assistive Technology Links section and keep this site (and Network) in mind for future reference.
2.5 The U.S. Office of Special Education Programs Ideas That Work site contains many useful resources under Info and Reports (URL is http://www.seels.net/info_reports/children_we_serve.htm). One of them is The Children We Serve: Demographic Characteristics of Elementary and Middle School Students with Disabilities and Their Households. Explore the reports that may be valuable in your professional role.

Designed to Fit Characteristics of Content
As mentioned previously, it is important to pick the right kind of instrument (or items) for the objectives you are trying to measure. This generally takes some analysis of the instructional content; recall the work you have done classifying types and levels of learning (knowledge, comprehension, application, etc.; psychomotor skills; affective skills). At the same time, keep in mind the characteristics of the special learners as you are designing your instruction and then selecting or creating the assessment tools that will measure students' learning. You will want to select the best format for the content as well as for the variety of learners in your educational context. While we are making adaptations, it is important to maintain the integrity of the content while adapting the format to meet the learners' needs.
If the content changes, the meaning of achievement scores will change, and this has an impact on the validity of the scores. On the other hand, failing to make appropriate adaptations when they have been specified in an Individualized Educational Plan or by an LEP classification, so that the learner does not have a chance to demonstrate mastery of the content, also changes the meaning of achievement scores and has an impact on validity.

Designed to Fit Characteristics of Learners
It is very important that instruments are designed so that they are a good match for the characteristics of the learners. In an earlier module, you were asked to think about the specific characteristics of the learners in your educational context. As educators, we would think of some of the following factors as we are designing instructional activities and assessments.
Consider developmental characteristics specific to the age group (attention span, interests, physical dexterity, ...).
Keep in mind whether or not students have physical, cognitive, or social-emotional challenges (visual impairment, cerebral palsy, developmental delay or severe mental retardation, a specific learning disability, behavior disorders, ...). It is especially important to note whether they are receiving Exceptional Student Education (ESE) services and have specific testing accommodations identified on their Individualized Educational Plan (IEP).
Consider whether the learners speak English as a second language.
Be aware of the cultural backgrounds of students in the group. Their background may influence the way they approach an exam, the way they interpret questions, and/or the way they respond to the tasks or questions.
Consider students' experiential backgrounds. For example, if they have not been to a snowy climate, then it may not be a good idea to include a snow scenario as background in an item (unless this is a class on weather patterns, of course).
Be aware of students' prior experience with any equipment needed to perform the skill (if students practiced with a plastic ball and bat, then it would not be good to suddenly produce a real baseball and bat for the actual performance test; similarly, if they practiced writing paragraphs with paper and pencil, then you would not provide a PC on which to take the exam unless you knew they have had experience taking exams on computers).

When students demonstrate developmental disabilities, we must think beyond these general factors and come up with very concrete ideas about how students can acquire skills and then demonstrate what they know and are able to do. The same is true for students whose primary language is other than English. We must give some thought to the best way for them to demonstrate what they have learned related to the targeted instructional objectives. We must come up with ways these students can demonstrate what they can do without letting the lack of English or their disability stand in the way. This requires both general knowledge of the characteristics of students with that particular condition and specific knowledge of the student's capabilities.

Designed for Maximum Feasibility
We have realized that time is an important factor when it comes to the feasibility of testing procedures. Time to select or design the instrument, time to create it, and time to score students' work all relate to feasibility.
In addition to time, feasibility issues include factors such as support from classroom or program assistants, specialized equipment and materials that will facilitate the student's performance, alternative space, scheduling, or possibly permission factors. It takes creativity and support for educators to make adaptations to assessments for students with special learning challenges. The law, ethics, and our own professional motivation to support the learning of all students require us to find the time and seek the support we need to make this happen.

Professional in Format and Appearance
Here we must consider other important details that may contribute to or inhibit students' successful demonstration of skills during test administration. As effective educators, we want our materials to appear professional in quality. We are reminded that when we make adaptations for special learners, we must continue to consider the following format and administration concerns:
absence of bias (culture, race, gender, religion, politics, etc.)
spelling
grammar
legibility
clarity

When you select, or design and adapt, an instrument for special learners, it is especially important to seek feedback on the effectiveness of the adaptations. We know from experience that the resulting product may not achieve the desired outcome. Critique from colleagues and from members of groups of special learners helps to refine the instructions and the test items or tasks to create the best assessment possible. Applying these design principles will help ensure your instruments will provide the most valid and reliable results when the tests you create or select are administered to special needs students. As you gain experience, you will probably start to apply the principles more automatically, but even seasoned professionals benefit from reviewing them occasionally. The time we invest as we create the instruments is worth the payoff in getting the best possible information available to make the important decisions we must make on our jobs every day.

Read Chapter 3, Assignments and Assessments, starting on page 28 of the following document: Accommodations: Assisting Students with Disabilities, available on the Florida Department of Education website (the URL is http://www.cpt.fsu.edu/ese/pdf/acom_edu.pdf). The chapter has very important information you need to know. Just for practice (i.e., no posting required), review the recommendations found in the chapter and compare them against the guidelines you have learned for ensuring the validity and reliability of test results. Identify a specific conflict and then come up with an idea for revising the recommendation to make it more consistent with validity and reliability guidelines.

Read (and download or make a copy of) the brochure found on the Department of Education Exceptional Student Education site, Planning FCAT Accommodations for Students with Disabilities. The URL is http://www.fldoe.org/ese/fcat/fcat-tea.pdf. Consider how these suggested accommodations might be useful for teacher-made assessments of students in your professional context.

Think about a student, friend, or family member you have known with a disability. Briefly summarize that person's challenge (protecting confidentiality, of course) and then identify a recommendation from the Accommodations chapter that would be helpful for them as they are taking a test. Really consider the characteristics of the person (strengths and challenges) and think about the suggested accommodations.
Are they "generic - one size fits all" or would they really be useful to the person without compromising the validity of the test results? Part 2 More Resources Related to Accommodations for Students in ESE and ESOL/LEP Programs Take some time to read or review any of the resources you found within the course site that were relevant to your interests or professional role (especially the chapter on validity). There are two required articles related to accommodations for students in ESE and many other optional resources related to assessment accommodations for students in ESE and/or ESOL programs in this part of the module. Considerations When Identifying and Implementing Accommodations for Students in ESE Programs Much of the research in this area has been related to accommodations in formal or large-scale standardized assessment contexts. Less research has been conducted on accommodations in less formal classroom or other educational contexts. We as (teacher, administrator, researcher, etc.) test developers and users are responsible for adapting the large-scale assessment recommendations as appropriate for our target students and for evaluating them using the most important criteria for evaluating an assessment practice - validity. Read the following two articles related to ESE accommodations from the Council for Exceptional Children journal Teaching Exceptional Children. They are important resources that you will likely find useful in your future work. Then select and read at least one of the other articles that is of most interest to your professional role. Council for Exceptional Children (2005). Supplemental section: Guiding principles for appropriate adaptations and accommodations. TEACHING Exceptional Children (Sept/Oct), 53-54. Zirkel, P.A. (2006). What does the law say? TEACHING Exceptional Children, (Jan/Feb), 62-63. Make sure you are aware of the resources on the Florida Department of Education Bureau of Exceptional Education and Student Services Publications website found at http://www.fldoe.org/ese/. There are many important resources related to assessment and accommodations for exceptional students. Click on Publications and browse through the list. A publication that most educators should be aware of is Technical Paper 312783 Accommodations for Students with Disabilities Taking the Florida Comprehensive Assessment Test (FCAT). (Select and read at least one from the many options listed below.) Assistive Technology Consider the following resources when identifying possible accommodations as a member of an IEP development team, as a classroom teacher, a school administrator, or any other professional responsible for accommodations for students with exceptionalities. These are ideas and suggestions for assistive technology useful for assessment that you may not be aware of yet. Reed, P. (2004). Critical Issue: Enhancing system change and academic success through assistive technologies for k - 12 students with special needs. North Central Regional Educational Laboratory. Available online at http://www.ncrel.org/sdrs/areas/issues/methods/technlgy/te700.htm (Note: click on "technology devices in use" to view examples of assistive technology.) Reed, P.P. & Walser, P. (2001). Utilizing assistive technology in making testing accommodations. Wisconsin Assistive Technology Initiative. Available online at http://www.wati.org/AT_Services/pdf/Utilizing_AT_for_Accom.pdf Specific Exceptionalities Cawthon, S.W. (2006). 
Cawthon, S.W. (2006). National survey of accommodations and alternate assessments for students who are deaf or hard of hearing in the United States. Journal of Deaf Studies and Deaf Education, 11(3), 337-359.

Empirical and Research-oriented Studies Related to Assessment Accommodations
Koretz, D.W. & Barton, K. (2003). Assessing students with disabilities: Issues and evidence. CSE Technical Report CSE-TR-587. Office of Educational Research and Improvement, Washington, DC.
Ysseldyke, J. & Nelson, R.R. (2004). What we know and need to know about the consequences of high-stakes testing for students with disabilities. Exceptional Children, 71(1), 75-94.
Sireci, S.G., Scarpati, S.E., & Li, S. (2005). Test accommodations for students with disabilities: An analysis of the interaction hypothesis. Review of Educational Research, 75(4), 457-490.
Elliott, S.N., Kratochwill, T.R., & McKevitt, B.C. (2001). Experimental analysis of the effects of testing accommodations on the scores of students with and without disabilities. Journal of School Psychology, 39(1), 3-24.
Wagner, M., Friend, M., Bursuck, W.D., Kutash, K., Duchnowski, A.J., Sumi, W.C., & Epstein, M.H. (2006). Journal of Emotional and Behavioral Disorders, 14(1), 12-30.
Weston, T.J. (2003). The validity of oral accommodation in testing: NAEP validity studies. Working Paper Series, NCES-WP-2003-06. National Center for Education Statistics (ED), Washington, DC; National Assessment of Educational Progress, Princeton, NJ.

Considerations When Identifying and Implementing Accommodations for Students in ESOL Programs
As with research on assessment accommodations for students in ESE programs, research on accommodations for students in ESOL programs is not plentiful either. We again must be guided by principles that ensure the validity of results as we identify and implement accommodations. There are state and federal guidelines, measurement principles, and feasibility issues to consider. Make yourself familiar with the resources on the Florida Department of Education Office of Academic Achievement through Language Acquisition website at http://www.fldoe.org/aala/Default.asp. (Note: be sure to click on Documents and Publications for important information on assessment.) You will likely have use for these important resources in the future.

Also, you may find the following articles reporting empirical research in the area useful in your professional work. Consider reading one of these (optional) articles related to accommodations for students with limited English proficiency.
Abedi, J. & Hejri, F. (2004). Accommodations for students with limited English proficiency in the National Assessment of Educational Progress. Applied Measurement in Education, 17(4), 371-392.
Duncan, T.G., Parent, L.R., Chen, L., Ferrara, S., & Johnson, E. (2002). Study of a dual language test booklet in 8th grade mathematics. Paper presented at the Annual Meeting of the American Educational Research Association (New Orleans, LA).
Albus, D., Bielinski, J., Thurlow, M., & Liu, K. (2001). The effect of a simplified English language dictionary on a reading test. LEP Projects Report 1. National Center on Educational Outcomes, Special Education Programs, Washington, DC. Available from http://education.umn.edu/nceo/OnlinePubs/LEP1.html.
Hafner, A.L. (2001). Evaluating the impact of test accommodations on test scores of LEP students & non-LEP students. Paper presented at the Annual Meeting of the American Educational Research Association (Seattle, WA).

Here are a couple more websites with valuable resources.
Visit the Center for Applied Linguistics website (URL is http://www.cal.org/index.html) to find available resources relevant to your professional role. Another resource you may find useful is found under the Professional Development "PD Resources" section of the World-Class Instructional Design and Assessment website at http://www.wida.us. Select the first presentation, Comprehensive School Reform for English Language Learners (ELLs). After listening to the first presentation, you may want to select another one that would be of most use in your professional role.