The North Carolina Mathematics Tests Technical Report Grade 3 Pretest End-of-Grade Tests (Grades 3–8) High School Comprehensive Test Algebra I End-of-Course Test Geometry End-of-Course Test Algebra II End-of-Course Test March, 2006 Prepared by: Mildred Bazemore, Section Chief, Test Development Pam Van Dyk, Ph. D. Laura Kramer, Ph.D. North Carolina Department of Public Instruction Amber Yelton Robert Brown, Ph.D. Psychometric staff, Technical Outreach for Public Schools (TOPS) September 2004 In compliance with federal law, including the provisions of Title IX of the Education Amendments of 1972, the North Carolina Public Schools administers all state-operated educational programs, employment activities and admissions without discrimination because of race, national or ethnic origin, color, age, military service, disability, or gender, except where exemption is appropriate and allowed by law. Inquiries or complaints should be directed to: The Office of Curriculum and School Reform Services 6307 Mail Service Center Raleigh, NC 27699-6307 919-807-3761 (phone); 919-807-3767 (fax) 2 Table of Contents Chapter One: Introduction ……………………………………………………….. 12 1.1 Local Participation ……………………………………………………………… 12 1.2 The North Carolina Testing Program …………………………………………… 13 1.3 The North Carolina Mathematics Tests………………………………………….. 14 Chapter Two: Test Development Process ………………………………………. 16 2.1 Test Development Process for the North Carolina Testing Program …………….. 16 2.2 The Curriculum Connection …………………………………………………….. 18 2.3 Test Specifications ………………………………………………………………. 19 2.4 Item Development ……………………………………………………….……… 19 2.5 Item Format and Use of Manipulatives ………………………………………….. 20 2.6 Selection and Training of Item Writers ………………………………………….. 20 2.7 Reviewing Items for Field Testing ……………………………………………... 21 2.8 Assembling Field Test Forms …………………………………………………… 22 2.9 Sampling Procedures ……………………………………………………………. 23 2.10 Field Test Sample Characteristics ……………………………………………… 24 2.11 Item Analysis …………….……………………………………………….…… 25 2.12 Classical Measurement Analysis ………………………………………………. 25 2.13 Item Response Theory (IRT) Analysis ……………………………………….. 25 2.14 Differential Item Functioning Analysis ………………………………………. 27 2.15 Expert Review ………………………………………………………………… 28 2.16 Criteria for Inclusion in Item Pool ……………………………………………… 29 2.17 Item Pool Parameter Estimates ………………………………………………… 29 2.18 Operational Test Construction ………………………………………………… 29 3 2.19 Setting the Target p-value for Operational Tests ………………………………. 30 2.20 Comparison of Item Pool p-Values with Operational p-Values ……………….. 30 2.21 Review of Assembled Operational Tests ……………………………………… 31 2.22 Setting the Test Administration Time …………………………………………. 32 Chapter Three: Test Administration …………………………………………….. 33 3.1 Test Administration …………………………………………………………….. 33 3.2 Training for Test Administrators ……………………………………………….. 34 3.3 Preparation for Test Administration ……………………………………………. 34 3.4 Test Security and Handling Materials ………………………………………….. 34 3.5 Student Participation …………………………………………………………… 35 3.6 Alternate Assessments ………………………………………………………….. 36 3.7 Testing Accommodations ………………………………………………………. 36 3.8 Students with Limited English Proficiency ……………………………………… 36 3.9 Medical Exclusions …………………………………………………………….. 38 3.10 Reporting Student Scores ……………………………………………………… 38 3.11 Confidentiality of Student Test Scores ……….………………………………. 38 Chapter Four: Scaling and Standard-Setting for the North Carolina EOG and EOC Tests of Mathematics …………………………………………………….. 
40 4.1 Conversion of Raw Test Scores …………………………………………………. 40 4.2 Constructing a Developmental Scale ……………………………………………. 40 4.3 Comparison with and Linkage to the First Edition Scale ………………………… 44 4.4 Equating the Scales for the First and Second Editions of the North Carolina EOG Tests of Mathematics ………………………………………………..……………… 46 4.5 Setting the Standards……………………………………………………………. 47 4.6 Score Reporting for the North Carolina Tests …………………………………. 48 4.7 Achievement Level Descriptors ………………………………………………… 49 4 4.8 Achievement Level Cut Scores …………………………………………………. 49 4.9 Achievement Level Trends ……………………………………………………… 50 4.10 Percentile Ranking …………………………………………………………….. 52 Chapter Five: Reports ……………………………………………………………. 53 5.1 Use of Test Score Reports Provided by the North Carolina Testing Program …… 53 5.2 Reporting by Student …………………………………………………………… 53 5.3 Reporting by School ……………………………………………………………. 53 5.4 Reporting by the State …………………………………………………………. 54 Chapter Six: Descriptive Statistics and Reliability ………………………………. 55 6.1 Descriptive Statistics for the First Operational Administration of the Tests …….. 55 6.2 Means and Standard Deviations for the First Operational Administration ……… 55 6.3 Population Demographics for the First Operational Administration …………… 56 6.4 Scale Score Frequency Distributions …………………………………………… 56 6.5 Reliability of the North Carolina Tests ………………………………………… 62 6.6 Internal Consistency of the North Carolina Math Tests ………………………… 62 6.7 Standard Error of Measurement for the North Carolina Math Tests ……………. 64 6.8 Equivalency of Test Forms……………………………………………………… 75 Chapter Seven: Evidence of Validity ……………………………………………. 87 7.1 Evidence of Validity ………………………………………………………......... 87 7.2 Content Validity ……………………………………………………………….. 87 7.3 Criterion-Related Validity ……………………………………………………… 88 Chapter Eight: Quality Control …………………………………………………. 94 8.1 Quality Control Prior to Test Administration …………………………………… 94 8.2 Quality Control in Data Preparation and Test Administration …………………… 94 5 8.3 Quality Control in Data Input …………………………………………………… 95 8.4 Quality Control of Test Scores ………………………………………………….. 95 8.5 Quality Control in Reporting …………………………………………………… 95 Glossary of Key Terms …………………………………………………………… 96 References …………………………………………………………………………. 99 Appendix A: Item Development Guidelines ……………………………………… 102 Appendix B: Test Specification Summaries……………………………………..... 104 Appendix C: Math Developmental Scale Report with Excel Plots for First and Second Editions’ Scale Scores …………………………………….……………… 114 Appendix D: Sample Items ………………………………………………………... 131 Appendix E: Sample Frequency Distribution Tables for Math Scale Scores .…. 151 Appendix F: Testing Code of Ethics ……………………………………………… 159 6 List of Tables Table 1: Number of Items Field Tested for North Carolina EOG and EOC Tests of Mathematics Table 2: Field test population (2000) for grade 3 pretest, grades 3-8 end-of-grade tests, and end-of-course tests. 
Field test population (1997) for Grade 10 High School Comprehensive Test Table 3: Field test population demographics (2001) Table 4: Field test population demographics (2002) Table 5: Average item pool parameter estimates for the EOG and EOC Tests of Mathematics by grade or subject (2000) Table 6: Comparison of p-value of item pool with p-values of assembled forms averaged across forms Table 7: Number of items per test and time allotted by grade and subject Table 8: Population means and standard deviations derived from the Spring 2000 item calibration for the North Carolina End-of-Grade Tests of Mathematics, second edition Table 9: Average difference between adjacent grade means in units of the standard deviation of the lower grade and ratios between adjacent-grade standard deviations derived from the Spring 2000 item calibrations for the North Carolina EOG Tests of Mathematics Table 10: Replications of the average difference between adjacent-grade means in units of the standard deviation of the lower grade and ratios between adjacent-grade standard deviations derived from the Spring 2000 item calibration for the North Carolina EOG Tests of Mathematics Table 11: Comparison of the population means and standard deviations for the second edition with averages and standard deviations obtained from the operational administration of the first edition in the Spring 2000 item calibration for the North Carolina EOG Tests of Mathematics Table 12: Percent of students assigned to each achievement level by teachers Table 13: Administrative Procedures Act 16 NCAC 6D .0501 (Definitions related to Student Accountability Standards) Table 14: EOG and EOC Tests of Mathematics achievement levels and corresponding scale scores Table 15: Achievement level trends for Grade 3 Pretest 7 Table 16: Achievement level trends for Grade 3 Table 17: Achievement level trends for Grade 4 Table 18: Achievement level trends for Grade 5 Table 19: Achievement level trends for Grade 6 Table 20: Achievement level trends for Grade 7 Table 21: Achievement level trends for Grade 8 Table 22: Achievement level trends for Grade 10 High School Comprehensive Test Table 23: Achievement level trends for Algebra I Table 24: Achievement level trends for Geometry Table 25: Achievement level trends for Algebra II Table 26: Descriptive statistics by grade for the 2001 administration of the North Carolina EOG Tests of Mathematics and the 1998 administration of the Grade 10 High School Comprehensive Test Table 27: Mean scale score for the 2001 administration of the North Carolina EOC Mathematics tests Table 28: Population demographics for the 2001 administration of the North Carolina EOG and EOC Tests of Mathematics and the 1998 administration of the Grade 10 High School Comprehensive Test Table 29: Reliability indices averaged across North Carolina EOG and EOC Tests of Mathematics forms Table 30: Reliability indices averaged across North Carolina EOG and EOC Tests of Mathematics forms (Gender) Table 31: Reliability indices averaged across North Carolina EOG and EOC Tests of Mathematics forms (Ethnicity) Table 32: Reliability indices averaged across North Carolina EOG and EOC Tests of Mathematics forms (Other Characteristics) Table 33: Ranges of standard error of measurement for scale scores by grade or subject Table 34: Instructional Validity of the content of the North Carolina EOG Tests of Mathematics 8 Table 35: Pearson correlation coefficient table for variables used to establish criterionrelated validity for the North Carolina EOG Tests of 
Mathematics Table 36: Pearson correlation coefficient table for variables used to establish criterionrelated validity for the North Carolina EOC Tests of Mathematics 9 List of Figures Figure 1: Flow chart of the test development process used in development of North Carolina Tests Figure 2: Thinking skills framework used to develop the North Carolina End-of-Grade Tests (adapted from Marzano, et al., 1988) Figure 3: Typical item characteristic curve (ICC) for a 4-option multiple-choice item Figure 4: Comparison of the growth curves for the first and second editions of the North Carolina EOG Tests of Mathematics in the Spring 2000 item calibration Figure 5: Equipercentile equating functions between the first and second editions of the North Carolina EOG Tests of Mathematics scales derived from the Spring 2001 equating study for Grades 3–8 Figure 6: Math Scale Score Frequency Distribution Grade 3 Figure 7: Math Scale Score Frequency Distribution Grade 4 Figure 8: Math Scale Score Frequency Distribution Grade 5 Figure 9: Math Scale Score Frequency Distribution Grade 6 Figure 10: Math Scale Score Frequency Distribution Grade 7 Figure 11: Math Scale Score Frequency Distribution Grade 8 Figure 12: Algebra I Scale Score Frequency Distribution Figure 13: Geometry Scale Score Frequency Distribution Figure 14: Algebra II Scale Score Frequency Distribution Figure 15: Standard Errors of Measurement on the Grade 3 Pretest of Mathematics Test forms Figure 16: Standard Errors of Measurement on the Grade 3 Mathematics Test forms Figure 17: Standard Errors of Measurement on the Grade 4 Mathematics Test forms Figure 18: Standard Errors of Measurement on the Grade 5 Mathematics Test forms Figure 19: Standard Errors of Measurement on the Grade 6 Mathematics Test forms Figure 20: Standard Errors of Measurement on the Grade 7 Mathematics Test forms 10 Figure 21: Standard Errors of Measurement on the Grade 8 Mathematics Test forms Figure 22: Standard Errors of Measurement on the Grade 10 Mathematics Test forms Figure 23: Standard Errors of Measurement on the Algebra I Test forms Figure 24: Standard Errors of Measurement on the Geometry Test forms Figure 25: Standard Errors of Measurement on the Algebra II Test forms Figure 26: Test Characteristic Curves for the Grade 3 Pretest of Mathematics Test forms Figure 27: Test Characteristic Curves for the Grade 3 Mathematics Test forms Figure 28: Test Characteristic Curves for the Grade 4 Mathematics Test forms Figure 29: Test Characteristic Curves for the Grade 5 Mathematics Test forms Figure 30: Test Characteristic Curves for the Grade 6 Mathematics Test forms Figure 31: Test Characteristic Curves for the Grade 7 Mathematics Test forms Figure 32: Test Characteristic Curves for the Grade 8 Mathematics Test forms Figure 33: Test Characteristic Curves for the Grade 10 Mathematics Test forms Figure 34: Test Characteristic Curves for the Algebra I Test forms Figure 35: Test Characteristic Curves for the Geometry Test forms Figure 36: Test Characteristic Curves for the Algebra II Test forms Figure 37: Comparison of NAEP “proficient” scores and North Carolina End-of-Grade Tests of Mathematics scores for Grade 4 Figure 38: Comparison of NAEP “basic” scores and North Carolina End-of-Grade Tests of Mathematics scores for Grade 4 Figure 39: Comparison of NAEP “proficient” scores and North Carolina End-of-Grade Tests of Mathematics scores for Grade 8 Figure 40: Comparison of NAEP “basic” scores and North Carolina End-of-Grade Tests of Mathematics scores for Grade 8 11 Chapter One: 
Introduction

The General Assembly believes that all children can learn. It is the intent of the General Assembly that the mission of the public school community is to challenge with high expectations each child to learn, to achieve, and to fulfill his or her potential (G.S. 115C-105.20a). With that mission as its guide, the State Board of Education implemented the ABCs Accountability Program at grades K–8 effective with the 1996–1997 school year and at grades 9–12 effective with the 1997–1998 school year to test students' mastery of basic skills (reading, writing, and mathematics). The ABCs Accountability Program was developed under the Public School Laws mandating local participation in the program, the design of annual performance standards, and the development of student academic performance standards.

1.1 Local Participation

The School-Based Management and Accountability Program shall be based upon an accountability, recognition, assistance, and intervention process in order to hold each school and the school's personnel accountable for improved student performance in the school (G.S. 115C-105.21c). Schools are held accountable for student learning by reporting student performance results on North Carolina tests. Students' scores are compiled each year and released in a report card. Schools are then recognized for the performance of their students. Schools that consistently do not make adequate progress may receive intervention from the state.

In April 1999, the State Board of Education unanimously approved Statewide Student Accountability Standards. These standards provide four Gateway Standards for student performance at grades 3, 5, 8, and 11. Students in the 3rd, 5th, and 8th grades are required to demonstrate grade-level performance in reading, writing (5th and 8th grades only), and mathematics in order to be promoted to the next grade. The law regarding student academic performance states:

The State Board of Education shall develop a plan to create rigorous student academic performance standards for kindergarten through eighth grade and student academic standards for courses in grades 9-12. The performance standards shall align, whenever possible, with the student academic performance standards developed for the National Assessment of Educational Progress (NAEP). The plan also shall include clear and understandable methods of reporting individual student academic performance to parents (G.S. 115C-105.40).

1.2 The North Carolina Testing Program

The North Carolina Testing Program was designed to measure the extent to which students satisfy academic performance requirements. Tests developed by the North Carolina Department of Public Instruction's Test Development Section, when properly administered and interpreted, provide reliable and valid information that enables
• students to know the extent to which they have mastered expected knowledge and skills and how they compare to others;
• parents to know if their children are acquiring the knowledge and skills needed to succeed in a highly competitive job market;
• teachers to know if their students have mastered grade-level knowledge and skills in the curriculum and, if not, what weaknesses need to be addressed;
• community leaders and lawmakers to know if students in North Carolina schools are improving their performance over time and how our students compare with students from other states; and
• citizens to assess the performance of the public schools (North Carolina Testing Code of Ethics, 1997, revised 2000).
The North Carolina Testing Program was initiated in response to legislation passed by the North Carolina General Assembly. The following selection from Public School Laws (1994) describes the legislation. Public School Law 115C-174.10 states the following purposes of the North Carolina Testing Program: (1) to assure that all high school graduates possess the … skills and knowledge thought necessary to function as a member of society; (2) to provide a means of identifying strengths and weaknesses in the education process; and (3) to establish additional means for making the education system accountable to the public for results. Tests included in the North Carolina Testing Program are designed for use as federal, state, and local indicators of student performance. Interpretation of test scores in the North Carolina Testing Program provides information about a student’s performance on the test in percentiles, scale scores, and achievement levels. Percentiles provide an indicator of how a child performs relative to other children who took the test in the norming year, or the first year the test was administered. Percentiles range from 1 to 99. A percentile rank of 69 indicates that a child performed equal to or better than 69 percent of the children who took the test during the norming year. Scale scores are derived from a raw score or “number right” score for the test. Each test has a translation table that provides a scale score for each raw test score. Scale scores are reported alongside four achievement levels, which are predetermined academic achievement standards. The four achievement levels for the North Carolina Testing Program are shown below. 13 Level I: Students performing at this level do not have sufficient mastery of knowledge and skills in a particular subject area to be successful at the next grade level. Level II: Students performing at this level demonstrate inconsistent mastery of knowledge and skills in the subject area and are minimally prepared to be successful at the next grade level. Level III: Students performing at this level consistently demonstrate mastery of the grade level subject matter and skills and are well prepared for the next grade. Level IV: Students performing at this level consistently perform in a superior manner clearly beyond that required to be proficient at grade level. The North Carolina End-of-Grade (EOG) Tests include multiple-choice assessments of reading comprehension and mathematics in grades 3 through 8 and 10. The North Carolina End-of-Course (EOC) Tests include multiple-choice assessments of reading comprehension and mathematics in English I, Algebra I, Geometry, and Algebra II. In addition to the reading comprehension and mathematics tests, the North Carolina Testing Program includes science EOC tests (Biology, Chemistry, Physical Science, and Physics); social studies EOC tests which are currently under revision (Civics and Economics and U.S. History); writing assessments in grades 4, 7, and 10; the North Carolina Tests of Computer Skills; the North Carolina Competency Tests; and two alternate assessments (North Carolina Alternate Assessment Academic Inventory and the North Carolina Alternate Assessment Portfolio). The EOG reading comprehension and mathematics tests are used to monitor growth and student performance against absolute standards (performance composite) for student accountability. 
A student's EOG scores from the prior grade are used to determine his or her entering level of knowledge and skills and to determine the amount of growth during one school year. Beginning in 1996, a student's growth at grade 3 was determined by comparing the grade 3 EOG score with a grade 3 pretest administered during the first three weeks of the school year. The Grade Level Proficiency Guidelines, approved by the State Board of Education (February 1995), established Level III (of those achievement levels listed above) as the standard for each grade level. The EOC tests measure a student's mastery of course-level material.

1.3 The North Carolina Mathematics Tests

The purpose of this document is to provide an overview and technical documentation for the North Carolina Mathematics Tests, which include the Grade 3 Pretest, the End-of-Grade Mathematics Tests in grades 3-8, the High School Comprehensive Mathematics Test, and the End-of-Course (EOC) Mathematics Tests in Algebra I, Geometry, and Algebra II. Chapter One provides an overview of the North Carolina Mathematics Tests. Chapter Two describes the test development process. Chapter Three outlines the test administration. Chapter Four describes the construction of the developmental scale, the scoring of the tests, and the standard setting process. Chapter Five provides an outline of reporting of test results. Chapters Six and Seven provide the technical properties of the tests such as descriptive statistics from the first operational year, reliability indices, and evidence of validity. Chapter Eight is an overview of quality control procedures.

Chapter Two: Test Development

2.1 Test Development Process for the North Carolina Testing Program

In June of 2003, the State Board of Education codified the process used in developing all multiple-choice tests in the North Carolina Testing Program. The development of tests for the North Carolina Testing Program follows a prescribed sequence of events. A flow chart of those events is found in Figure 1.

Figure 1: Flow chart of the test development process used in development of North Carolina Tests

Curriculum Adoption (starting point of the process)
Step 1a: Develop Test Specifications (Blueprint)
Step 2b: Develop Test Items
Step 3b: Review Items for Tryouts
Step 4: Assemble Item Tryout Forms
Step 5b: Review Item Tryout Forms
Step 6b: Administer Item Tryouts
Step 7: Review Item Tryout Statistics
Step 8b: Develop New Items
Step 9b: Review Items for Field Test
Step 10: Assemble Field Test Forms
Step 11b: Review Field Test Forms
Step 12b: Administer Field Test
Step 13: Review Field Test Statistics
Step 14b: Conduct Bias Reviews
Step 15: Assemble Equivalent and Parallel Forms
Step 16b: Review Assembled Test
Step 17: Final Review of Test
Step 18ab: Administer Test as Pilot
Step 19: Score Test
Step 20ab: Establish Standards
Step 21b: Administer Test as Fully Operational
Step 22: Report Test Results

a Activities done only at implementation of new curriculum
b Activities involving NC teachers

Phase 1 (step 1) requires 4 months
Phase 2 (steps 2–7) requires 12 months
Phase 3 (steps 8–14) requires 20 months
Phase 4 (steps 15–20) requires 4 months for EOC and 9 months for EOG
Phase 5 (step 21) requires 4 months
Phase 6 (step 22) requires 1 month
TOTAL: 44–49 months

NOTES: Whenever possible, item tryouts should precede field testing items. Professional development opportunities are integral and ongoing to the curriculum and test development process.
17 2.2 The Curriculum Connection Using research conducted by the North Carolina Mathematics Framework Committee, the North Carolina Mathematics Standard Course of Study Committee constructed a curriculum focused on giving students the opportunity to acquire mathematical literacy. Mathematical literacy is necessary to function in an information age and has the primary roles of helping students • • • • cultivate the understanding and application of mathematical skills and concepts necessary to thrive in an ever-changing technological world; develop the essential elements of problem solving, communication, and reasoning; develop connections within their study of mathematics; and understand the major ideas of mathematics (Mathematics K–12 Standard Course of Study and Mathematics Competencies, NCDPI Publication, Instructional Services Division, www.ncpublicschools.org/curriculum/mathematics/). The North Carolina Mathematics Standard Course of Study clearly defines a curriculum focused on what students will need to know and be able to do to be successful and contributing citizens in our state and nation in the years ahead. As defined in the 1998 North Carolina Mathematics Standard Course of Study, the goals of mathematics education are for students to develop (1) strong mathematical problem solving and reasoning abilities; (2) a firm grounding in essential mathematical concepts and skills, including computation and estimation; (3) connections within mathematics and with other disciplines; (4) the ability to use appropriate tools, including technology, to solve mathematical problems; (5) the ability to communicate an understanding of mathematics effectively; and (6) positive attitudes and beliefs about mathematics. The elementary program of mathematics focuses on assisting students with a higher-level understanding of mathematics through the use of manipulative items, working independently and in groups, and conducting investigations and recording findings. Middle-grade students expand on these skills to compute with real numbers and to apply basic concepts in new and difficult situations. High school mathematics includes courses from Introductory Mathematics to Advanced Placement Calculus (North Carolina Standard Course of Study). The North Carolina State Board of Education adopted the revised mathematics component of the North Carolina Standard Course of Study (NCSCS) in 1998. Students in North Carolina schools are tested in mathematics in grades 3 through 8 and grade 10. In addition, students taking Algebra I, Algebra II, and Geometry in high school are tested at the end of these courses. Mathematics tests for these grades and courses are designed around the competency goals and objectives found in the NCSCS. 18 2.3 Test Specifications Delineating the purpose of a test must come before the test design. A clear statement of purpose provides the overall framework for test specifications, test blueprint, item development, tryout, and review. A clear statement of test purpose also contributes significantly to appropriate test use in practical contexts (Millman & Greene, 1993). The tests in the North Carolina Testing Program are designed in alignment with the NCSCS. The purpose of the North Carolina EOG and EOC Tests of Mathematics is legislated by General Statute 115C-174.10 and focuses on the measurement of individual student mathematical skills and knowledge as outlined in the NCSCS. 
Test specifications for the North Carolina mathematics tests are developed in accordance with the competency goals and objectives specified in the NCSCS. A summary of the test specifications is provided in Appendix B. These test specifications also are generally designed to include the following:
(1) the percentage of questions from higher or lower thinking skills and the classification of each test question into a level of difficulty;
(2) the percentage of test questions that measure a specific goal or objective; and
(3) the percentage of questions that require the use of a calculator and the percentage that do not allow the use of a calculator.

2.4 Item Development

Items on the North Carolina EOG and EOC Tests of Mathematics are developed using level of difficulty and thinking skill level. Item writers use these frameworks when developing items. The purpose of the categories is to ensure a balance of items across difficulty, as well as a balance of items across the different cognitive levels of learning in the North Carolina mathematics tests. For the purposes of guiding item writers to provide a variety of items, items were classified into three levels of difficulty: easy, medium, and hard. Easy items are those items that the item writer believes can be answered correctly by approximately 70% of the examinees. Medium items can be answered correctly by 50–60% of the examinees. Difficult items can be answered correctly by approximately 20–30% of the examinees. These targets are used for item pool development to ensure an adequate range of difficulty.

A more recent consideration for item development is the classification of items by thinking skill level, the cognitive skills that an examinee must use to solve a problem or answer a test question. Thinking skill levels are based on Dimensions of Thinking by Marzano, et al. (1988). In addition to their use in framing achievement tests, thinking skill levels also provide a practical framework for curriculum development, instruction, assessment, and staff development. Thinking skills begin with the basic skill of information-gathering and move to more complex thinking skills, such as integration and evaluation. Figure 2 below shows a visual representation of the framework.

Figure 2: Thinking skills framework used to develop the North Carolina End-of-Grade Tests (adapted from Marzano, et al., 1988)

Dimensions of Thinking:
• Content Area Knowledge
• Metacognition
• Critical and Creative Thinking
• Thinking Processes: concept formation, principle formation, comprehending, problem-solving, decision-making, research, composing, oral discourse
• Core Thinking Skills categories: focusing, information-gathering, remembering, organizing, analyzing, generating, integrating, evaluating

2.5 Item Format and Use of Manipulatives

Items on the North Carolina mathematics tests are four-foil, multiple-choice items. On the end-of-grade mathematics tests, thirty percent of the items are calculator inactive items and seventy percent are calculator active items. A small percentage of items on the end-of-grade mathematics tests require the use of a ruler or protractor. Formula sheets are provided for grades 6 through 8 and 10.

2.6 Selection and Training of Item Writers

Once the test blueprints were finalized from the test specifications for the revised editions of the North Carolina mathematics tests, North Carolina educators were recruited and trained to write new items for the state tests. The diversity among the item writers and their knowledge of the current NCSCS was addressed during recruitment.
The use of North Carolina educators to develop items ensured instructional validity of the items. Some items were developed through an external vendor; however, the vendor was encouraged to use North Carolina educators in addition to professional item writers to generate items that would align with the NCSCS for mathematics.

Training for item writers occurred during a 3-day period. Item writers received a packet of materials designed in accordance with the mathematics curriculum, which included information on content and procedural guidelines as well as information on stem and foil development. The item-writing guidelines are included in Appendix A. The items developed during the training were evaluated by content specialists, who then provided feedback to the item writers on the quality of their items.

2.7 Reviewing Items for Field Testing

To ensure that an item was developed to NCSCS standards, each item went through a detailed review process prior to being placed on a field test. A new group of North Carolina educators was recruited to review items. Once items had been through an educator review, test development staff members, with input from curriculum specialists, reviewed each item. Items were also reviewed by educators and/or staff familiar with the needs of students with disabilities and limited English proficiency. The criteria for evaluating each written item included the following:

1) Conceptual
• objective match (curricular appropriateness)
• thinking skill match
• fair representation
• lack of bias
• clear statement
• single problem
• one best answer
• common context in foils
• credible foils
• technical correctness

2) Language
• appropriate for age
• correct punctuation
• spelling and grammar
• lack of excess words
• no stem or foil clues
• no negative in foils

3) Format
• logical order of foils
• familiar presentation style, print size, and type
• correct mechanics and appearance
• equal length foils

4) Diagram/Graphics
• necessary
• clean
• relevant
• unbiased

The detailed review of items helped prevent the loss of items during field testing due to quality issues.

2.8 Assembling Field Test Forms

Prior to creating an operational test, items for each written subject/course area were assembled into field test forms. Field test forms were organized according to blueprints for the operational tests. Similar to the operational test review, North Carolina educators reviewed the assembled field test forms for clarity, correctness, potential bias, and curricular appropriateness. Field testing of mathematics Grade 3 Pretest, end-of-grade, and end-of-course test items occurred during the 1999–2000 school year. Rather than develop forms composed of field test items alone, field test items were instead embedded in operational forms from the previous curriculum. The three operational EOG and EOC base forms at each grade or course were embedded with 10–12 items each to create 45–51 separate test forms for each grade level or subject. In addition, there were 15–17 linking forms (administered at a grade below the nominal grade) and one research form which was used to examine context and location effects, resulting in 61 to 79 forms at each grade. The High School Comprehensive Mathematics items were field tested on whole forms in 1997. Table 1 below provides a breakdown of the number of grade-level forms, number of items per form, and number of total items per grade or subject.
Table 1: Number of Items Field Tested for North Carolina EOG and EOC Tests of Mathematics

Grade/Subject     Number of Grade-Level Forms   Number of Items per Form   Total Number of Items
Grade 3 Pretest   36                            11                         396
Grade 3           51                            12                         612
Grade 4           51                            12                         612
Grade 5           48                            12                         576
Grade 6           51                            12                         612
Grade 7           48                            12                         576
Grade 8           45                            12                         540
Grade 10          10                            80                         800
Algebra I         45                            12                         540
Geometry          45                            12                         540
Algebra II        39                            10                         390

2.9 Sampling Procedures

Sampling for field testing of the North Carolina Tests is typically accomplished using stratified random sampling with the goal being a selection of students that is representative of the entire student population in North Carolina. The development of the North Carolina Tests of Mathematics departed from random sampling during the first field test (2000) and instead used census sampling to embed field test items on an operational version of the mathematics tests. The sample for the High School Comprehensive Mathematics Test was selected through stratified random sampling to represent the general population characteristics. In 2001 and 2002, additional samples of students were selected at random to supplement the item pools. Field test sample characteristics for the three years are provided in the following section.

2.10 Field Test Sample Characteristics

Table 2: Field test population (2000) for grade 3 pretest, grades 3-8 end-of-grade tests, and end-of-course tests. Field test population (1997) for Grade 10 High School Comprehensive Test

Grade/Subject   N         % Male   % Female   % American Indian   % Black   % White   % Other   % LEP
3 Pretest       105,750   50.8     49.2       1.5                 30.8      60.8      6.9       1.7
Grade 3         105,900   51.3     48.7       1.4                 30.9      60.0      7.7       2.8
Grade 4         104,531   51.4     48.6       1.4                 30.6      61.0      7.0       2.3
Grade 5         103,109   50.7     49.3       1.4                 30.2      61.7      6.6       2.2
Grade 6         100,746   51.0     49.0       1.3                 31.0      61.4      6.3       1.8
Grade 7         98,424    50.4     49.6       1.4                 33.7      58.9      6.1       1.5
Grade 8         95,229    50.9     49.1       1.3                 30.8      61.7      6.1       1.8
Grade 10        10,691    48.0     52.0       1.2                 26.1      65.7      7.0       0.6
Algebra I       90,322    49.8     50.2       1.3                 28.6      64.1      6.1       0.7
Geometry        65,060    46.3     53.6       1.1                 22.8      70.7      5.5       0.4
Algebra II      52,701    46.9     53.1       1.1                 25.0      68.4      5.5       0.4

(LEP = Limited English Proficient)

To supplement the item pools created from the embedded field testing, additional stand-alone field tests were administered in subsequent years in grades 3 through 8. The field test population characteristics from the stand-alone field tests are provided below in Tables 3 and 4.

Table 3: Field test population demographics (2001)

Grade     N        % Male   % Female   % American Indian   % Black   % White   % Other   % LEP
Grade 3   10,397   51.4     48.6       0.8                 30.9      59.9      8.5       2.0
Grade 4   10,251   49.9     50.1       0.8                 29.1      61.9      8.2       1.9
Grade 5   8,019    50.1     49.9       1.1                 27.6      62.9      8.4       1.4
Grade 6   18,319   50.3     49.7       1.5                 29.3      62.1      7.2       1.0
Grade 7   16,885   49.5     50.5       1.5                 26.5      64.2      7.8       1.0
Grade 8   15,395   50.0     50.0       1.2                 25.5      66.1      7.2       1.1

Table 4: Field test population demographics (2002)

Grade     N        % Male   % Female   % American Indian   % Black   % White   % Other   % LEP
Grade 6   13,988   50.9     49.1       2.0                 29.8      60.7      8.9       1.1
Grade 8   17,501   49.3     50.7       1.3                 28.4      62.5      7.8       1.1

2.11 Item Analysis

Field testing provides important data for determining whether an item will be retained for use on an operational North Carolina EOG or EOC Test of Mathematics. The North Carolina Testing Program uses both classical measurement analysis and item response theory (IRT) analysis to determine if an item has sound psychometric properties.
These analyses provide information that assists North Carolina Testing Program staff and consultants in determining the extent to which an item can accurately measure a student's level of achievement. Field test data were analyzed by the North Carolina Department of Public Instruction (NCDPI) psychometric staff. Item statistics and descriptive information were then printed on labels and attached to the item record for each item. The item records contained the statistical, descriptive, and historical information for an item, a copy of the item as it was field tested, comments by reviewers, and curricular and psychometric notations.

2.12 Classical Measurement Analysis

For each item, the p-value (proportion of examinees answering an item correctly), the standard deviation of the p-value, and the point-biserial correlation between the item score and the total test score were computed using SAS. In addition, frequency distributions of the response choices were tabulated. While the p-value is an important statistic and one component used in determining the selection of an item, the North Carolina Testing Program also uses IRT to provide additional item parameters to determine the psychometric properties of the North Carolina mathematics tests.

2.13 Item Response Theory (IRT) Analysis

To provide additional information about item performance, the North Carolina Testing Program also uses IRT statistics to determine whether an item should be included on the test. IRT is being used with increasing frequency in large-scale achievement testing. "The reason for this may be the desire for item statistics to be independent of a particular group and for scores describing examinee proficiency to be independent of test difficulty, and for the need to assess reliability of tests without the tests being strictly parallel" (Hambleton, 1993, p. 148). IRT meets these needs and provides two additional advantages: the invariance of item parameters and the invariance of ability parameters. Regardless of the distribution of the sample, the parameter estimates will be linearly related to the parameters estimated with some other sample drawn from the same population. IRT allows the comparison of two students' ability estimates even though they may have taken different items.

An important characteristic of IRT is item-level orientation. IRT makes a statement about the relationship between the probability of answering an item correctly and the student's ability or the student's level of achievement. The relationship between a student's item performance and the set of traits underlying item performance can be described by a monotonically increasing function called an Item Characteristic Curve (ICC). This function specifies that as the level of the trait increases, the probability of a correct response to an item increases. The following figure shows the ICC for a typical 4-option multiple-choice item.

Figure 3: Typical item characteristic curve (ICC) for a 4-option multiple-choice item
[Three-parameter model: the probability of a correct response (0.0 to 1.0) is plotted against ability (approximately -3.0 to +2.5), producing an S-shaped curve.]

The three-parameter logistic model (3PL) of IRT, the model used in generating EOG item statistics, takes into account the difficulty of the item and the ability of the examinee. A student's probability of answering a given item correctly depends on the student's ability and the characteristics of the item.
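To make the shape of the ICC in Figure 3 concrete, the short sketch below evaluates the three-parameter logistic function described in the next section at several ability levels, using the "typical item" values given there (slope a = 1.00, threshold b = 0.00, asymptote c = 0.25, scaling factor D = 1.7). This is an illustrative sketch only; the operational item calibrations were produced with BILOG, not with code like this.

```python
import math

def p_correct(theta, a=1.00, b=0.00, c=0.25, D=1.7):
    """Three-parameter logistic (3PL) probability of a correct response.

    theta: examinee ability
    a: discrimination (slope), b: difficulty (threshold),
    c: lower asymptote (pseudo-guessing), D: scaling constant 1.7
    """
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

if __name__ == "__main__":
    # Trace out an ICC like the one shown in Figure 3 for a typical 4-option item.
    for theta in [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]:
        print(f"ability {theta:+.1f}  P(correct) = {p_correct(theta):.3f}")
```

As ability increases, the printed probabilities rise monotonically from near the asymptote (about 0.25) toward 1.0, which is exactly the S-shaped relationship the ICC depicts.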
The 3PL model has three assumptions: (1) unidimensionality—only one ability is assessed by the set of items (for example, a spelling test only assesses a student's ability to spell); (2) local independence—when abilities influencing test performance are held constant, an examinee's responses to any pair of items are statistically independent (conditional independence, i.e., the only reason an examinee scores similarly on several items is because of his or her ability); and (3) the ICC specified reflects the true relationship between the unobservable variable (ability) and the observable variable (item response).

The formula for the 3PL model is

Pi(θ) = ci + (1 – ci) / (1 + exp[–D ai (θ – bi)])

where
Pi(θ)—the probability that a randomly chosen examinee with ability (θ) answers item i correctly (this is an S-shaped curve with values between 0 and 1 over the ability scale);
a—the slope or the discrimination power of the item (the slope of a typical item is 1.00);
b—the threshold or the point on the ability scale where the probability of a correct response is 50% (the threshold of a typical item is 0.00);
c—the asymptote or the proportion of the examinees who got the item correct but did poorly on the overall test (the asymptote of a typical 4-choice item is 0.25); and
D—a scaling factor, 1.7, to make the logistic function as close as possible to the normal ogive function (Hambleton, 1983, p. 125).

The IRT parameter estimates for each item are computed using the BILOG computer program (Muraki, Mislevy, & Bock, 1991) with the default Bayesian prior distributions for the item parameters [a~lognormal(0, 0.5), b~N(0,2), and c~Beta(6,16)].

2.14 Differential Item Functioning Analysis

It is important to know the extent to which an item on a test performs differently for different students. As a third component of the item analysis, differential item functioning (DIF) analyses examine the relationship between the score on an item and group membership, while controlling for ability, to determine if an item is biased towards a particular gender or ethnic group. In developing the North Carolina mathematics tests, the North Carolina Testing Program staff used the Mantel-Haenszel procedure to examine DIF by examining j 2 × 2 contingency tables, where j is the number of different levels of ability actually achieved by the examinees (actual total scores received on the test). The focal group is the focus of interest, and the reference group serves as a basis for comparison for the focal group (Dorans & Holland, 1993; Camilli & Shepherd, 1994). For example, females might serve as the focal group and males might serve as the reference group to determine if an item is biased towards or against females.

The Mantel-Haenszel (MH) chi-square statistic (only used for 2 × 2 tables) tests the alternative hypothesis that a linear association exists between the row variable (score on the item) and the column variable (group membership). The chi-square distribution has one degree of freedom (df), and its significance is determined by the correlation between the row variable and the column variable (SAS Institute, 1985). The MH log odds ratio statistic, computed in SAS, was used to determine the direction of DIF. This measure was obtained by combining the odds ratios (aj) across levels with the formula for weighted averages (Camilli & Shepherd, 1994, p. 110).
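The operational DIF analyses were run in SAS; the sketch below is only a simplified, hypothetical illustration of how the Mantel-Haenszel common odds ratio combines the j score-level 2 × 2 tables, with invented counts. It is not the Testing Program's procedure, but it shows what "combining the odds ratios across levels" amounts to numerically.

```python
# Illustrative sketch of the Mantel-Haenszel common odds ratio used for DIF.
# Each score level j contributes one 2 x 2 table:
#   (right_ref, wrong_ref) for the reference group,
#   (right_foc, wrong_foc) for the focal group.
# The counts below are hypothetical; operational analyses used SAS.

def mh_common_odds_ratio(tables):
    """tables: list of (right_ref, wrong_ref, right_foc, wrong_foc), one per score level."""
    numerator = 0.0
    denominator = 0.0
    for right_ref, wrong_ref, right_foc, wrong_foc in tables:
        n_j = right_ref + wrong_ref + right_foc + wrong_foc
        if n_j == 0:
            continue  # skip empty score levels
        numerator += right_ref * wrong_foc / n_j
        denominator += wrong_ref * right_foc / n_j
    # Values above 1 favor the reference group; values below 1 favor the focal group.
    return numerator / denominator

example_tables = [
    (40, 60, 35, 65),   # low score level
    (70, 30, 66, 34),   # middle score level
    (90, 10, 88, 12),   # high score level
]
print(round(mh_common_odds_ratio(example_tables), 3))
```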
For this statistic, the null hypothesis of no relationship between score and group membership (that is, the odds of getting the item correct are equal for the two groups) was not rejected when the odds ratio equaled 1. For odds ratios greater than 1, the interpretation was that an individual at score level j of the reference group had a greater chance of answering the item correctly than an individual at score level j of the focal group. Conversely, for odds ratios less than 1, the interpretation was that an individual at score level j of the focal group had a greater chance of answering the item correctly than an individual at score level j of the reference group. The Breslow-Day test was used to test whether the odds ratios from the j levels of the score were all equal. When the null hypothesis was true, the statistic was distributed approximately as a chi-square with j-1 degrees of freedom (SAS Institute, 1985). The ethnic and gender bias flags were determined by examining the significance levels of items from several forms and identifying a typical point on the continuum of odds ratios that was statistically significant at the α = 0.05 level.

2.15 Expert Review

All items, statistics, and comments were reviewed by curriculum specialists and testing consultants. Items found to be inappropriate for curricular or psychometric reasons were deleted. In addition, items flagged for exhibiting ethnic or gender bias were then reviewed by a bias review committee. The bias review committee members, selected because of their knowledge of the curriculum area and their diversity, evaluated test items with a bias flag using the following questions:
1. Does the item contain any offensive gender, ethnic, religious, or regional content?
2. Does the item contain gender, ethnic, or cultural stereotyping?
3. Does the item contain activities that will be more familiar to one group than another?
4. Do the words in the item have a different meaning to one group than another?
5. Could there be group differences in performance that are unrelated to proficiency in the content areas?

An answer of yes to any of these questions resulted in the unique 5-digit item number being recorded on an item bias sheet along with the nature of the bias or sensitivity. Items that were consistently identified as exhibiting bias or sensitivity were deleted from the item pool. Items that were flagged by the bias review committee were then reviewed by curriculum specialists. If the curriculum specialists found the items measured content that was expected to be mastered by all students, the item was retained for test development. Items consistently identified as exhibiting bias by both review committees were deleted from the item pool.

2.16 Criteria for Inclusion in Item Pool

All of the item parameter data generated from the above analyses were used to determine if an item displayed sound psychometric properties. Items could potentially be flagged as exhibiting psychometric problems or bias due to ethnicity/race or gender according to the following criteria:
• weak prediction—the slope (a parameter) was less than 0.60;
• guessing—the asymptote (c parameter) was greater than 0.40;
• ethnic bias—the log odds ratio was greater than 1.5 (favored whites) or less than 0.67 (favored blacks); and
• gender bias—the log odds ratio was greater than 1.5 (favored females) or less than 0.67 (favored males).

Because the tests were to be used to evaluate the implementation of the curriculum, items were not flagged on the basis of the difficulty of the item (threshold).
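The sketch below simply encodes the four screening rules listed above so their effect on a single item record is easy to see. The item values and the function itself are hypothetical; the operational screening was carried out by NCDPI psychometric staff on the calibrated item statistics, not by code of this kind.

```python
# Hypothetical sketch of the item-screening rules described in section 2.16.

def flag_item(a, c, ethnic_odds, gender_odds):
    """Return the flags an item would receive under the stated criteria.

    a: IRT slope, c: IRT asymptote,
    ethnic_odds / gender_odds: DIF odds-ratio statistics for the two comparisons.
    """
    flags = []
    if a < 0.60:
        flags.append("weak prediction (slope < 0.60)")
    if c > 0.40:
        flags.append("guessing (asymptote > 0.40)")
    if ethnic_odds > 1.5 or ethnic_odds < 0.67:
        flags.append("ethnic bias flag")
    if gender_odds > 1.5 or gender_odds < 0.67:
        flags.append("gender bias flag")
    # Note: items were deliberately NOT flagged on difficulty (threshold b).
    return flags

# Hypothetical item with a low slope and a gender statistic beyond the 1.5 cutoff.
print(flag_item(a=0.55, c=0.18, ethnic_odds=1.10, gender_odds=1.62))
```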
The average item pool parameter estimates based on field test data are provided below.

2.17 Item Pool Parameter Estimates

Table 5: Average item pool parameter estimates for the EOG and EOC Tests of Mathematics by grade or subject (2000)

                  IRT Parameters                                          Bias (Odds Ratio Logit)
Grade/Subject     Threshold (b)   Slope (a)   Asymptote (c)   p-value     Ethnic/Race   Gender
3 Pretest          0.07           0.98        0.18            0.60        1.06          1.01
Grade 3           -0.30           0.99        0.11            0.66        1.11          1.03
Grade 4            0.03           0.92        0.11            0.59        1.09          1.01
Grade 5            0.28           1.02        0.13            0.54        1.09          1.00
Grade 6            0.44           1.02        0.14            0.51        1.07          1.03
Grade 7            0.51           0.98        0.15            0.49        1.11          1.02
Grade 8            0.58           1.01        0.14            0.46        1.10          1.03
Grade 10           1.26           1.12        0.22            0.41        1.01          1.01
Algebra I          0.62           0.94        0.18            0.50        1.10          1.02
Geometry           0.58           1.06        0.19            0.50        1.14          1.03
Algebra II         0.91           0.98        0.20            0.46        1.07          1.02

2.18 Operational Test Construction

The final item pool was based on approval by the (1) NCDPI Division of Instructional Services for curricular match and (2) NCDPI Division of Accountability Services/Test Development Section for psychometrically sound item performance. Once the final items were identified for the item pool, operational tests were constructed according to the test blueprints. For a summary of the test specifications, see Appendix B. For the EOG Tests of Mathematics, three forms were developed for operational administration for grades 3 through 6. For grades 7 and 8, two forms were developed. Three forms were developed for each of the EOC tests.

2.19 Setting the Target p-value for Operational Tests

P-value is a measure of the difficulty of an item. P-values can range from 0 to 1. The letter "p" symbolizes the proportion of examinees that answer an item correctly. So an item with a p-value of 0.75 was correctly answered by 75% of the students who answered the item during the field test, and one might expect that roughly 75 of 100 examinees will answer it correctly when the item is put on an operational test. An easy item has a p-value that is high, which means that a large proportion of the examinees got the item right during the field test. A difficult item has a low p-value, meaning that few examinees answered the item correctly during field testing.

The NCDPI psychometric staff must choose a target p-value for each operational test prior to assembling the tests. Ideally, the average p-value of a test would be 0.625, which is the theoretical average of a student getting 100% correct on the test and a student scoring a chance performance (25% for a 4-foil multiple-choice test): that is, (100 + 25)/2. The target was chosen by first looking at the distribution of the p-values for a particular item pool. While the goal is to set the target as close to 0.625 as possible, it is often the case that the target p-value is set between the ideal 0.625 and the average p-value of the item pool. The average p-value and the target p-value for operational forms are provided below for comparison.

2.20 Comparison of Item Pool p-Values with Operational p-Values

Table 6: Comparison of p-value of item pool with p-values of assembled forms averaged across forms

Grade/Subject   p-Value of Item Pool   p-Value of Forms
3 Pretest       0.60                   0.59
Grade 3         0.66                   0.66
Grade 4         0.59                   0.62
Grade 5         0.54                   0.59
Grade 6         0.51                   0.56
Grade 7         0.49                   0.51
Grade 8         0.46                   0.48
Grade 10        0.41                   0.41
Algebra I       0.50                   0.51
Geometry        0.50                   0.54
Algebra II      0.46                   0.47

To develop equivalent forms, the test forms were balanced on P+, the sum of the p-values of the items.
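Before the additional balancing constraints are described, the short sketch below makes the two quantities just discussed concrete: the ideal target p-value from section 2.19 and P+ for two forms. The item p-values and form assignments shown are invented for illustration; they are not drawn from the operational item pools.

```python
# Illustrative sketch only: the 0.625 target p-value from section 2.19
# and a P+ (sum of item p-values) comparison between two hypothetical forms.

def ideal_target_p(chance=0.25):
    """Midpoint between a perfect score (1.00) and chance performance on a 4-foil item."""
    return (1.00 + chance) / 2  # = 0.625

def p_plus(p_values):
    """P+ is the sum of the item p-values on a form."""
    return sum(p_values)

form_a = [0.62, 0.58, 0.71, 0.49, 0.66]   # hypothetical field-test p-values
form_b = [0.60, 0.63, 0.68, 0.52, 0.64]

print("ideal target p-value:", ideal_target_p())
print("Form A P+ =", round(p_plus(form_a), 2), " mean p =", round(p_plus(form_a) / len(form_a), 3))
print("Form B P+ =", round(p_plus(form_b), 2), " mean p =", round(p_plus(form_b) / len(form_b), 3))
```

Forms whose P+ values (and therefore average p-values) are close to one another and to the chosen target are treated as comparable in overall difficulty.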
All calculator active sections within a grade were equated, and all calculator inactive sections within a grade were equated. Finally, to the extent possible, the sections were balanced on slope.

2.21 Review of Assembled Operational Tests

Once forms were assembled to meet test specifications, target p-values, and item parameter targets, a group of North Carolina educators and curriculum supervisors reviewed the assembled forms. Each group of subject area teachers and curriculum supervisors worked independently of the test developers. The criteria for evaluating each group of forms included the following:
• the content of the test forms should reflect the goals and objectives of the North Carolina Standard Course of Study for the subject (curricular validity);
• the content of the test forms should reflect the goals and objectives as taught in North Carolina schools (instructional validity);
• items should be clearly and concisely written and the vocabulary appropriate to the target age level (item quality);
• the content of the test forms should be balanced in relation to ethnicity, gender, socioeconomic status, and geographic district of the state (free from test/item bias); and
• an item should have one and only one best answer; the distractors should appear plausible for someone who has not achieved mastery of the representative objective (one best answer).

Reviewers were instructed to take the tests (circling the correct responses in the booklet) and to provide comments and feedback next to each item. After reviewing all the forms, each reviewer independently completed a survey asking for his or her opinion as to how well the tests met the five criteria listed above. During the last part of the session, the group discussed the tests and made comments as a group. The test review ratings along with the comments were aggregated for review by NCDPI curriculum specialists and testing consultants. As a final review, test development staff members, with input from curriculum staff, content experts, and editors, conducted a final content and grammar check for each test form.

2.22 Setting the Test Administration Time

Additional important considerations in the construction of the North Carolina mathematics tests were the number of items to be included and the time necessary to complete the test. A timing study was conducted for the mathematics tests in the spring of 2000. Twenty-four items were administered to a volunteer sample of students in grades 3 and 6, and fifteen items were administered to a volunteer sample of high school students. The length of time necessary to complete the items was calculated, which provided a rough time-per-item estimate. In some cases it was necessary to reduce the number of items slightly so that test administration time was reasonable and comparable to previous test administrations. Adjustments to the length of the North Carolina mathematics tests were made prior to administering the test as an operational test. The total number of items and approximate testing time (in minutes) for each mathematics test are provided below.
Table 7: Number of items per test and time allotted by grade and subject

Grade/Subject   Number of Items   Time Allotted in Minutes (includes short breaks and general instructions)
3 Pretest       40                100
Grade 3         80                200
Grade 4         80                200
Grade 5         80                200
Grade 6         80                195
Grade 7         80                195
Grade 8         80                195
Grade 10        80                195
Algebra I       80                130
Geometry        72                130
Algebra II      60                130

Chapter Three: Test Administration

3.1 Test Administration

The North Carolina Grade 3 Mathematics Pretest, which measures grade 2 competencies in mathematics, is a multiple-choice test administered to all students in grade 3 within the first three weeks of the school year. The pretest allows schools to establish benchmarks to compare individual and group scale scores and achievement levels with the results from the regular end-of-grade test administered in the spring. In addition, a comparison of the results from the pretest and the results from the regular grade 3 end-of-grade test administration allows schools to measure growth in achievement in mathematics at the third-grade level for the ABCs accountability program. The grade 3 pretest measures the knowledge and skills specified for grade 2 from the mathematics goals and objectives of the 1998 North Carolina Standard Course of Study. The pretest is not designed to make student placement or diagnostic decisions.

The End-of-Grade Mathematics Tests are administered to students in grades 3 through 8 as part of the statewide assessment program. The standard for grade-level proficiency is a test score at or above Achievement Level Three on both the reading comprehension and mathematics tests. Effective with the 2000-2001 school year, the North Carolina End-of-Grade Mathematics Tests are multiple-choice tests that measure the goals and objectives of the mathematics curriculum adopted in 1998 by the North Carolina State Board of Education for each grade. The competency goals and objectives are organized into four strands: (1) number sense, numeration, and numerical operations; (2) spatial sense, measurement, and geometry; (3) patterns, relationships, and functions; and (4) data, probability, and statistics.

The North Carolina High School Comprehensive Mathematics Test is administered to students in grade 10. It is a multiple-choice test that measures knowledge, skills, and competencies in mathematics that the typical student should have mastered by the end of the tenth grade. The mathematics framework consists of three competencies—problem-solving, reasoning, and communication—and four strands: (1) number sense, numeration, and numerical operations; (2) spatial sense, measurement, and geometry; (3) patterns, relationships, and functions; and (4) statistics, probability, and discrete mathematics.

All end-of-course tests are administered within the final ten days of the course to students enrolled for credit in courses where end-of-course tests are required. The purpose of end-of-course tests is to sample a student's knowledge of subject-related concepts specified in the North Carolina Standard Course of Study and to provide a global estimate of the student's mastery of the material in a particular content area. The mathematics end-of-course (Algebra I, Geometry, and Algebra II) tests were developed to provide accurate measurement of individual student knowledge and skills specified in the mathematics component of the North Carolina Standard Course of Study.
3.2 Training for Test Administrators

The North Carolina Testing Program uses a train-the-trainer model to prepare test administrators to administer North Carolina tests. Regional Accountability Coordinators (RACs) receive training in test administration from NCDPI Testing Policy and Operations staff at regularly scheduled monthly training sessions. Subsequently, the RACs provide training on conducting a proper test administration to Local Education Agency (LEA) test coordinators. LEA test coordinators provide training to school test coordinators. The training includes information on the test administrators' responsibilities, proctors' responsibilities, preparing students for testing, eligibility for testing, policies for testing students with special needs (students with disabilities and students with limited English proficiency), test security (storing, inventorying, and returning test materials), and the Testing Code of Ethics.

3.3 Preparation for Test Administration

School test coordinators must be accessible to test administrators and proctors during the administration of secure state tests. The school test coordinator is responsible for monitoring test administrations within the building and responding to situations that may arise during test administrations. Only employees of the school system are permitted to administer secure state tests. Test administrators are school personnel who have professional training in education and the state testing program. Test administrators may not modify, change, alter, or tamper with student responses on the answer sheets or test books. Test administrators are to thoroughly read the Test Administrator's Manual prior to the actual test administration, discuss with students the purpose of the test, and read and study the codified North Carolina Testing Code of Ethics.

3.4 Test Security and Handling Materials

Compromised secure tests result in compromised test scores. To avoid contamination of test scores, the NCDPI maintains test security before, during, and after test administration at both the school system level and the individual school level. School systems are also mandated to provide a secure area for storing tests. The Administrative Procedures Act 16 NCAC 6D .0302 states, in part, that school systems shall (1) account to the department (NCDPI) for all tests received; (2) provide a locked storage area for all tests received; (3) prohibit the reproduction of all or any part of the tests; and (4) prohibit their employees from disclosing the content of, or discussing with students or others, specific items contained in the tests. Secure test materials may only be stored at each individual school for a short period prior to and after the test administration. Every effort must be made to minimize school personnel access to secure state tests prior to and after each test administration.

At the individual school, the principal shall account for all test materials received. As established by APA 16 NCAC 6D .0306, the principal shall store test materials in a secure locked area except when in use. The principal shall establish a procedure to have test materials distributed immediately prior to each test administration. Before each test administration, the building-level coordinator shall collect, count, and return all test materials to the secure, locked storage area. Any discrepancies are to be reported to the school system test coordinator immediately, and a report must be filed with the regional accountability coordinator.
3.5 Student Participation

The Administrative Procedures Act 16 NCAC 6D .0301 requires that all public school students enrolled in grades for which the SBE adopts a test, including every child with disabilities, shall participate in the testing program unless excluded from testing as provided by 16 NCAC 6G .0305(g).

Grade 3 Pretest and End-of-Grade Mathematics Tests (Grades 3-8)

All students in membership in grade 3, including students who have been retained at grade 3, are required to participate in the Grade 3 Mathematics Pretest. All students in membership in grades 3-8 are required to participate in the End-of-Grade Mathematics Tests.

High School Comprehensive Mathematics Test (Grade 10)

All students classified as tenth graders in the school system student information management system (SIMS, NCWise, etc.) must participate in the High School Comprehensive Mathematics Test. This also includes those students following the Occupational Course of Study (OCS) and those who are repeating grade 10.

Algebra I, Geometry, and Algebra II End-of-Course Tests

All students, including students with disabilities, enrolled in a course for credit must be administered the end-of-course test in the final ten days of the course. End-of-course tests are not required for graduation; however, students enrolled for credit in a course that has an end-of-course test must be administered the end-of-course test. Students who are repeating the course for credit must also be administered the EOC test. The student's most recent test score will be used for the purpose of state accountability. In addition, starting with the 2001-2002 school year, LEAs shall use results from all multiple-choice EOC tests (English I; Algebra I; Biology; U.S. History; Economic, Legal, and Political Systems; Algebra II; Chemistry; Geometry; Physics; and Physical Science) as at least twenty-five percent of the student's final grade for each respective course. LEAs shall adopt policies regarding the use of EOC test results in assigning final grades.

3.6 Alternate Assessments

The North Carolina Testing Program currently offers the North Carolina Alternate Assessment Academic Inventory (NCAAAI) and the North Carolina Alternate Assessment Portfolio (NCAAP) as two alternate assessments for the Grade 3 Pretest, the End-of-Grade Mathematics Tests (grades 3-8), the High School Comprehensive Mathematics Test, and the End-of-Course Mathematics Tests. The NCAAAI is an assessment process in which teachers use a checklist to evaluate student performance on curriculum benchmarks in the areas of reading, mathematics, and/or writing. Student performance data are collected at the beginning of the school year (baseline), in the middle of the school year (interim), and at the end of the school year (summative). The NCAAAI measures competencies on the North Carolina Standard Course of Study. The Individualized Education Program (IEP) team determines whether a student is eligible to participate in the NCAAAI.

The NCAAP is a yearlong assessment process that involves a representative and deliberate collection of student work and information that allows users to make judgments about what a student knows and is able to do, and about the progress that has been made in relation to the goals specified in the student's current IEP. The IEP team determines whether the disability of a student is a significant cognitive disability. The determination of a significant cognitive disability is one criterion for student participation in the NCAAP.
3.7 Testing Accommodations

On a case-by-case basis where appropriate documentation exists, students with disabilities and students with limited English proficiency may receive testing accommodations. The need for accommodations must be documented in a current Individualized Education Program (IEP), Section 504 Plan, or LEP Plan. The accommodations must be used routinely during the student's instructional program or similar classroom assessments. For information regarding appropriate testing procedures, test administrators who provide accommodations for students with disabilities must refer to the most recent publication of Testing Students with Disabilities and any published supplements or updates. The publication is available through the local school system or at www.ncpublicschools.org/accountability/testing. Test administrators must be trained in the use of the specified accommodations by the school system test coordinator or designee prior to the test administration.

3.8 Students with Limited English Proficiency

Per HSP-C-005, students identified as limited English proficient shall be included in the statewide testing program. Students identified as limited English proficient who have been assessed on the state-identified language proficiency test as below Intermediate High in reading may participate, for up to two years (24 months) in U.S. schools, in the NCAAAI as an alternate assessment in the areas of reading and mathematics at grades 3 through 8 and 10 and in high school courses in which an end-of-course test is administered. Students identified as limited English proficient who have been assessed on the state-identified language proficiency test as below Superior in writing, per HSP-A-011, may participate in the NCAAAI in writing for grades 4, 7, and 10 for up to two years (24 months) in U.S. schools. All students identified as limited English proficient must be assessed using the state-identified language proficiency test at initial enrollment and annually thereafter during the window of February 1 to April 30. A student who enrolls after January 1 does not have to be retested during the same school year. Limited English proficient students who are administered the NCAAAI shall not be assessed off-grade level.

In March 2004, the State Board of Education adopted a temporary rule to make the following changes with respect to limited English proficient students during their first year in U.S. schools.*

*Note: First year of enrollment in U.S. schools refers to the first school year that a student has been enrolled in a U.S. school. It does not refer to a 12-month period. If a student has been enrolled in any U.S. school prior to this school year, the student, regardless of his/her enrollment period, would be expected to be assessed in reading and mathematics.

Schools shall:
• continue to administer state reading and mathematics tests for LEP students who score at or above Intermediate High on the reading section of the language proficiency test during their first year in U.S. schools. Results from these assessments will be included in the ABCs and AYP.
• not require LEP students (who score below Intermediate High on the reading section of the language proficiency test) in their first year in U.S. schools to be assessed on the reading End-of-Grade tests, the High School Comprehensive Test in Reading, or the NC Alternate Assessment Academic Inventory (NCAAAI) for reading.
• for purposes of determining the 95% tested rule in reading, use the language proficiency test from the spring administration for these students.
• not count mathematics results in determining AYP or ABCs performance composite scores for LEP students who score below Intermediate High on the reading section of the language proficiency test in their first year in U.S. schools.
• include students previously identified as LEP, who have exited LEP identification during the last two years, in the calculations for determining the status of the LEP subgroup for AYP only if that subgroup already met the minimum number of 40 students required for a subgroup.

3.9 Medical Exclusions

In some rare cases students may be excused from the required state tests. The process for requesting special exceptions based on significant medical emergencies and/or conditions is as follows: for requests that involve significant medical emergencies and/or conditions, the LEA superintendent or charter school director is required to submit a justification statement that explains why the emergency and/or condition prevents participation in the respective test administration during the testing window and the subsequent makeup period. The request must include the name of the student, the name of the school, the LEA code, and the name of the test(s) for which the exception is being requested. Medical documents are not included in the request to NCDPI. The request is to be based on information housed at the central office. The student's records must remain confidential. Requests must be submitted prior to the end of the makeup period for the respective test(s). Requests are to be submitted for consideration by the LEA superintendent or charter school director.

3.10 Reporting Student Scores

According to APA 16 NCAC 6D .0302, school systems shall, at the beginning of the school year, provide information to students and parents or guardians advising them of the district-wide and state-mandated tests that students will be required to take during the school year. In addition, school systems shall provide information to students and parents or guardians to advise them of the dates the tests will be administered and how the results from the tests will be used. Also, information provided to parents about the tests shall include whether the State Board of Education or the local board of education requires the test. School systems shall report scores resulting from the administration of the district-wide and state-mandated tests to students and parents or guardians, along with available score interpretation information, within 30 days from the generation of the score at the school system level or receipt of the score and interpretive documentation from the NCDPI. At the time the scores are reported for tests required for graduation, such as the competency tests and the computer skills tests, the school system shall provide information to students and parents or guardians to advise whether or not the student has met the standard for the test. If a student fails to meet the standard for the test, the student and parents or guardians shall be informed of the following at the time of reporting: (1) the date(s) when focused remedial instruction will be available and (2) the date of the next testing opportunity.
3.11 Confidentiality of Student Test Scores

State Board of Education policy states that "any written material containing the identifiable scores of individual students on tests taken pursuant to these rules shall not be disseminated or otherwise made available to the public by any member of the State Board of Education, any employee of the State Board of Education, the State Superintendent of Public Instruction, any employee of the North Carolina Department of Public Instruction, any member of a local board of education, any employee of a local board of education, or any other person, except as permitted under the provisions of the Family Educational Rights and Privacy Act of 1974, 20 U.S.C. § 1232g."

Chapter Four: Scaling and Standard-Setting for the North Carolina EOG and EOC Tests of Mathematics

The North Carolina EOG and EOC Tests of Mathematics scores are reported as scale scores, achievement levels, and percentiles. Scale scores are advantageous in reporting because:
• scale scores can be used to compare test results when there have been changes in the curriculum or changes in the method of testing;
• scale scores on pretests or released test forms can be related to scale scores used on secure test forms administered at the end of the course;
• scale scores can be used to compare the results of tests that measure the same content area but are composed of items presented in different formats; and
• scale scores can be used to minimize differences among various forms of the tests.

4.1 Conversion of Raw Test Scores

Each student's score is determined by calculating the number of items he or she answered correctly and then converting that sum to a developmental scale score. Software developed at the L.L. Thurstone Psychometric Laboratory at the University of North Carolina at Chapel Hill converts raw scores (the total number of items answered correctly) to scale scores using the three IRT parameters (threshold, slope, and asymptote) for each item. The software implements the algorithm described by Thissen and Orlando (2001, pp. 119-130). Because different items are placed on each form of a subject's test, unique score conversion tables are produced for each form of a test for each grade or subject area. For example, grade 3 has three EOG Tests of Mathematics forms; therefore, the scanning and reporting program developed and distributed by the NCDPI uses three scale-score conversion tables. In addition to scale scores, there are also standard errors of measurement associated with each score. Because the EOC Tests of Mathematics are not developmental in nature, the scales are calibrated in the norming year to have a mean of 50 and a standard deviation of 10 for each test; otherwise, the procedures for computing scale scores are the same as for the EOG tests.

4.2 Constructing a Developmental Scale

The basis of a developmental scale is the specification of means and standard deviations for scores on that scale for each grade level. In the case of the North Carolina End-of-Grade Tests of Mathematics, the grade levels ranged from the Grade 3 Pretest (administered in the fall to students in the third grade) through grade 8. The data from which the scale score means are derived make use of special test forms, called linking forms, that are administered to students in adjacent grades. The difference in performance among grades on these forms is used to estimate the difference in proficiency among grades.
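As a concrete illustration of how such adjacent-grade linking estimates become a developmental scale, the short Python sketch below propagates average mean differences (expressed in units of the lower grade's standard deviation) and standard deviation ratios outward from an anchor grade. The input values are those reported in Tables 8 and 9 below, with grade 4 anchored at a mean of 252.9 and a standard deviation of 10.65 as described in the text; this is only an illustrative re-computation, not the software actually used for scaling, and it reproduces the Table 8 values to rounding.

grades = ["3P", "3", "4", "5", "6", "7", "8"]

# Average mean difference (in lower-grade SD units) and SD ratio for each
# adjacent pair, in grade order (values from Table 9 below).
mean_diff = [1.44, 0.47, 0.29, 0.31, 0.29, 0.30]
sd_ratio  = [1.02, 1.08, 1.20, 0.92, 1.06, 1.03]

# Anchor the scale at grade 4 (Table 8 below).
anchor = grades.index("4")
means = {"4": 252.90}
sds   = {"4": 10.65}

# Propagate upward: mean_hi = mean_lo + diff * sd_lo, sd_hi = ratio * sd_lo.
for i in range(anchor, len(grades) - 1):
    lo, hi = grades[i], grades[i + 1]
    means[hi] = means[lo] + mean_diff[i] * sds[lo]
    sds[hi] = sd_ratio[i] * sds[lo]

# Propagate downward from the anchor by inverting the same relations.
for i in range(anchor - 1, -1, -1):
    lo, hi = grades[i], grades[i + 1]
    sds[lo] = sds[hi] / sd_ratio[i]
    means[lo] = means[hi] - mean_diff[i] * sds[lo]

for g in grades:
    print(f"Grade {g:>2}: mean = {means[g]:6.2f}, SD = {sds[g]:5.2f}")
# e.g., the grade 3 pretest comes out near 234.4 (SD 9.7) and grade 8 near 267.1 (SD 12.8).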
The second edition of the North Carolina End-of-Grade Tests of Mathematics used IRT to compute these estimates following procedures described by Williams, Pommerich, and Thissen (1998). Table 8 shows the population means and standard deviations derived from the Spring 2000 item calibration for the North Carolina End-of-Grade Tests of Mathematics.

Table 8: Population means and standard deviations derived from the Spring 2000 item calibration for the North Carolina End-of-Grade Tests of Mathematics, second edition

Grade             Mean     Population Standard Deviation
Grade 3 Pretest   234.35   9.66
Grade 3           248.27   9.86
Grade 4           252.90   10.65
Grade 5           255.99   12.78
Grade 6           259.95   11.75
Grade 7           263.36   12.46
Grade 8           267.09   12.83

The values for the developmental scale shown in Table 8 are based on IRT estimates of differences between adjacent-grade means and ratios of adjacent-grade standard deviations computed using the computer program MULTILOG (Thissen, 1991); the estimates from MULTILOG were cross-checked against parallel estimates computed using the software IRTLRDIF (Thissen & Orlando, 2001). In the computation of estimates using either software system, the analysis of data from adjacent grades arbitrarily sets the mean and standard deviation of the population distribution of the lower grade to values of zero (0) and one (1), respectively; the values of the mean (μ) and standard deviation (σ) of the higher grade are estimated making use of the item response data and the three-parameter logistic IRT model (Thissen & Orlando, 2001). Table 9 shows the average difference between adjacent-grade means (μ), in units of the standard deviation of the lower grade, and the ratios between adjacent-grade standard deviations (σ) derived from the Spring 2000 item calibration for the North Carolina End-of-Grade Tests of Mathematics. The values in Table 9 are converted into the final scale, shown in Table 8, by setting the average scale score in grade 4 to be 252.9 with a standard deviation of 10.65 and then computing the values for the other grades such that the differences between the means for adjacent grades, in units of the standard deviation of the lower grade, are the same as those shown in Table 9.

Table 9: Average difference between adjacent-grade means in units of the standard deviation of the lower grade and ratios between adjacent-grade standard deviations derived from the Spring 2000 item calibrations for the North Carolina EOG Tests of Mathematics

Grades   Average Mean (μ) Difference   Average SD (σ) Ratio   (Useful) Replications
3P–3     1.44                          1.02                   11
3–4      0.47                          1.08                   17
4–5      0.29                          1.20                   14
5–6      0.31                          0.92                   10
6–7      0.29                          1.06                   13
7–8      0.30                          1.03                   3

The estimates shown in Table 9 derive from 3 to 17 replications of the between-grade difference; the numbers of replications for each grade pair are also shown in Table 9. Each replication is based on a different short embedded linking form among the item tryout forms administered in Spring 2000. The sample size for each linking form varied from 398 to 4,313 students in each grade. (Most sample sizes were in the planned range of 1,300 to 1,500 students.) The original field test design, as discussed earlier, was an embedded design and originally called for 12 to 17 twelve-item linking forms between each pair of grades, with sample sizes of approximately 1,500. However, due to logistical issues, some forms were administered to larger samples, and other forms (that were delivered later) were administered to smaller samples.
In addition, the forms were not necessarily administered to the random samples that were planned within each grade. Corrections were made for these sampling problems in the computation of the estimates shown in Table 8. The mean difference between grades 5 and 6 was corrected using an estimate of the regression, across replications, of the mean difference on the new scale against the mean difference on the first edition scale, after data analysis suggested that the matched samples in grades 5 and 6 were atypical in their performance. The mean difference between grades 7 and 8 and the standard deviation ratio for grade 5 relative to grade 4 were adjusted to smooth the relation between those values and the corresponding values for adjacent grades.

Table 10 shows, for each adjacent-grade pair, the values of the average difference between adjacent-grade means (μ), in units of the standard deviation of the lower grade, and the ratios of adjacent-grade standard deviations (σ) derived from the Spring 2000 item calibration for the North Carolina EOG Tests of Mathematics for each replication that provided useful data. In Table 10, the values for each grade pair are in decreasing order of the estimate of the difference between the means. There is some variation among the estimates across replications because some of the estimates are based on small samples and many of the estimates are based on non-random samples. However, when aggregated as in Table 9, they yield a useful developmental scale.

Table 10: Replications of the average difference between adjacent-grade means in units of the standard deviation of the lower grade and ratios between adjacent-grade standard deviations derived from the Spring 2000 item calibration for the North Carolina EOG Tests of Mathematics

Grades 3P–3    Grades 3–4    Grades 4–5    Grades 5–6    Grades 6–7    Grades 7–8
Mean    SD     Mean    SD    Mean    SD    Mean    SD    Mean    SD    Mean    SD
1.84   1.09    0.76   1.25   0.63   1.06   0.68   0.72   0.50   1.10   0.42   0.92
1.77   1.04    0.59   1.09   0.57   0.96   0.54   0.75   0.39   1.00   0.36   1.26
1.73   1.18    0.57   1.09   0.51   1.05   0.53   0.82   0.37   1.09   0.27   0.92
1.60   0.97    0.56   1.02   0.51   1.15   0.51   0.83   0.36   0.95
1.53   0.98    0.55   1.04   0.51   1.07   0.38   1.15   0.35   1.02
1.50   0.97    0.55   1.33   0.41   1.03   0.37   1.02   0.33   1.12
1.42   1.12    0.54   1.05   0.40   0.87   0.36   0.90   0.28   1.01
1.35   1.10    0.49   1.04   0.21   1.51   0.33   1.05   0.28   1.18
1.28   1.02    0.47   0.94   0.19   1.63   0.25   1.09   0.22   1.01
0.97   0.93    0.42   1.22   0.12   1.53   0.21   0.83   0.21   1.08
0.89   0.83    0.41   0.99   0.10   1.74                 0.21   1.01
               0.40   1.10  -0.02   1.68                 0.14   1.05
               0.40   0.97  -0.05   1.71                 0.12   1.14
               0.37   0.90  -0.09   1.57
               0.36   1.28
               0.35   1.01
               0.28   1.04

Curriculum revisions in the mathematics Standard Course of Study, adopted by the State Board of Education in 1998, resulted in changes in test specifications and in the subsequent second edition of the North Carolina EOG and EOC Tests of Mathematics. To ensure a continuous measure of academic performance among North Carolina students, the developmental scales from the first edition of the North Carolina EOG Tests of Mathematics were linked to the developmental scales from the second edition of the test.

4.3 Comparison with and Linkage to the First Edition Scale

The embedded nature of the Spring 2000 item calibration provided a basis for a preliminary linkage of the second edition developmental scale with that of the first edition. The results of that preliminary linkage were subsequently superseded by results obtained from a special study with the data collected in Spring 2001.
Table 11 shows a comparison of the population means and standard deviations for the second edition with the averages and standard deviations for the scale scores obtained from the operational administration of the first edition. For ease of comparison of the two scales, Figure 4 shows the two sets of averages plotted together, with 100 subtracted from the second-edition values so that the two scales cover approximately the same range. The developmental scales for the first and second editions of the EOG mathematics tests are somewhat dissimilar. The smaller rates of change observed in the calibration data for the second edition are likely due to incomplete implementation, in the 1999-2000 academic year, of the new curriculum, which was the basis for the academic content in the second edition.

Table 11: Comparison of the population means and standard deviations for the second edition with averages and standard deviations obtained from the operational administration of the first edition in the Spring 2000 item calibration for the North Carolina EOG Tests of Mathematics

                   First Edition                 Second Edition
Grade              Mean    Standard Deviation    Mean    Standard Deviation
Grade 3 Pretest    131.6   7.8                   234.4   9.7
Grade 3            143.5   11.1                  248.3   9.9
Grade 4            152.9   10.1                  252.9   10.7
Grade 5            159.5   10.1                  256.0   12.8
Grade 6            165.1   11.2                  260.0   11.8
Grade 7            171.0   11.5                  263.4   12.5
Grade 8            175.3   11.9                  267.1   12.8

Scoring tables for the second edition forms that were to be administered in the spring of 2001 were constructed after those forms were assembled in February-March 2001. Using the item parameters from the embedded item calibration and the population means and standard deviations in Table 11, scoring tables were constructed using the procedures described by Thissen, Pommerich, Billeaud, and Williams (1995) and Thissen and Orlando (2001). These procedures yield tables that translate summed scores into corresponding IRT scale scores on the developmental scale.

A side effect of the construction of those scoring tables was that the algorithm provided IRT model-based estimates of the proportions of the (Spring 2000) item calibration samples that would have obtained each summed score (and hence, each scale score) had students been administered the forms assembled for Spring 2001. Those score proportions were matched with the observed score distributions on the first-edition forms that were included in the item tryout, yielding equipercentile equating tables (Angoff, 1982) that match scores on the second edition with the scores at the same percentiles on the first edition.

This equipercentile matching also provided part of the basis for a preliminary translation of the cut scores between achievement levels from the first edition to the second edition. Additional information was also used to select the preliminary cut scores, in the form of the consistency in the patterns of the matched cut scores between EOG Levels I, II, III, and IV across grades.

At the time of the preliminary linkage, values were computed based on IRT model-based estimates of the score distributions that would have been obtained if the new second-edition forms had been administered operationally in the Spring of 2000 (which they were not). Those values were treated as though they reflected performance that would have occurred had the new curriculum been completely implemented in 1999-2000. As a result, preliminary estimates were put in place to accommodate the testing schedule and the associated decision-making that needed to occur prior to the Spring of 2001.
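The scoring-table procedures cited above (Thissen, Pommerich, Billeaud, & Williams, 1995; Thissen & Orlando, 2001) amount to computing, for every possible summed score, a model-implied proficiency estimate and mapping it onto the reporting scale. The Python sketch below illustrates the underlying computation with a few hypothetical three-parameter logistic items and a generic normal population distribution; it is not the operational scoring software, the item parameters are invented for illustration, and the final linear mapping onto the reporting metric is a simplification of the actual scale construction.

import numpy as np

def p_correct(theta, a, b, c):
    # Three-parameter logistic (3PL) item response function.
    return c + (1.0 - c) / (1.0 + np.exp(-1.7 * a * (theta - b)))

def summed_score_likelihoods(theta, a, b, c):
    # Lord-Wingersky recursion: P(summed score = s | theta) for s = 0..n_items.
    probs = np.ones(1)
    for ai, bi, ci in zip(a, b, c):
        p = p_correct(theta, ai, bi, ci)
        new = np.zeros(len(probs) + 1)
        new[:-1] += probs * (1.0 - p)   # item answered incorrectly
        new[1:]  += probs * p           # item answered correctly
        probs = new
    return probs

# Hypothetical 3PL parameters (slope, threshold, lower asymptote) for a short test.
a = np.array([1.0, 0.8, 1.2, 0.9, 1.1])
b = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])
c = np.full(5, 0.2)

# Quadrature over a normal population distribution of proficiency.
nodes = np.linspace(-4.0, 4.0, 81)
weights = np.exp(-0.5 * nodes**2)
weights /= weights.sum()

# Posterior mean (EAP) of proficiency for each possible summed score.
like = np.array([summed_score_likelihoods(t, a, b, c) for t in nodes])
posterior = like * weights[:, None]
eap = (posterior * nodes[:, None]).sum(axis=0) / posterior.sum(axis=0)

# Simplified linear mapping onto a reporting metric (e.g., mean 50, SD 10).
for s, score in enumerate(50.0 + 10.0 * eap):
    print(f"summed score {s}: scale score {score:5.1f}")

The marginal probability of each summed score in this sketch (posterior.sum(axis=0)) is also the quantity that yields the model-based score proportions used in the equipercentile matching described in this section.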
The graphic below shows the linking of the mathematics forms across seven grades, the scale of the latent proficiency for mathematics, and the operational scale for the first edition of the North Carolina EOG Tests of Mathematics (Williams, Pommerich, and Thissen, 1998).

Figure 4: Comparison of the growth curves for the first and second editions of the North Carolina EOG Tests of Mathematics in the Spring 2000 item calibration (average scale scores plotted by grade for the second edition and for the first edition plus 100; vertical lines indicate 1, 2, and 3 standard deviations on the second edition)

4.4 Equating the Scales for the First and Second Editions of the North Carolina EOG Tests of Mathematics

To ensure that the first and second edition scales were comparable, the scales for the first and second editions of the test were linked using statistical moderation and the technique of equipercentile equating. Because of the uncertainty surrounding the preliminary linkage between the scales for the first and second editions of the North Carolina EOG Tests of Mathematics, an equating study was performed in Spring 2001. In this study, the newly constructed second edition forms of the mathematics tests and selected forms from the first edition were administered to spiraled samples in the context of the item tryout of the new items to create additional second-edition forms. The purpose of this study was to provide data for linkage of the scales on the first and second editions using the newly constructed operational forms of the second edition of the test (which were not available until early Spring 2001). Figure 5 shows the equipercentile equating functions for grades 3-8 obtained using data from the equating study.

Figure 5: Equipercentile equating functions between the first and second editions of the North Carolina EOG Tests of Mathematics scales derived from the Spring 2001 equating study for Grades 3-8 (first edition scale plotted against second edition scale, with one function for each of grades 3 through 8)

The 5th through the 95th percentiles (5th, 10th, 15th, ..., 95th) were plotted to determine whether the fit was linear. Because the plots were found to be somewhat convex, an algorithm was applied to improve the fit. First, a line segment was placed between the 25th and 75th percentile pairs and extrapolated. Where the straight line deviated from the data, the fit was doglegged by either (1) placing a straight line between the 75th and 95th percentiles, or between the 5th and 25th percentiles, and extrapolating; or (2) when the data point went off-scale at either end, placing another straight line between the 95th percentile and the data point representing the maximum scale score on the new test and the maximum scale score on the old test. The same was done at the minimum ends of the scales, between the 5th percentile and the minimum score on both the old and new tests. All fitted points on the line were rounded to give integer score translation tables. This procedure resulted in score translation tables that matched or closely matched percentiles for the middle 90% of the data and matched the minimum and maximum scale scores on the two tests. The equated mean and standard deviation were compared to see how closely they matched the observed first edition test mean and standard deviation in the equating sample.
In all cases except grade 5, the means and standard deviations were similar. For grade 5, slight adjustments were made to improve the fit.

4.5 Setting the Standards

For tests developed under the North Carolina Testing Program, academic achievement standard setting, the process of determining cut scores for the different achievement levels, is typically accomplished through the use of contrasting groups. Contrasting groups is an examinee-based method of standard setting in which expert judges, who are knowledgeable about students' achievement in various domains outside of the testing situation, categorize students into the four achievement levels; these judgments are then compared to students' actual scores. For the North Carolina mathematics tests, North Carolina teachers served as the expert judges, under the rationale that teachers are able to make informed judgments about students' academic achievement because they have observed the breadth and depth of the students' work during the school year.

For the North Carolina EOG academic achievement standard setting, originally conducted for the first edition (1992), approximately 160,000 students were placed into categories by approximately 5,000 teachers. Teachers categorized students who participated in field testing into one of the four achievement levels, with the remainder categorized as not a clear example of any of the achievement levels. The resulting proportions of students expected to score in each of the four achievement levels were then applied to the first operational year to arrive at the cut scores for the first edition North Carolina EOG Tests of Mathematics. Table 12 shows the percentage of students classified into each achievement level by grade or course.

Table 12: Percent of students assigned to each achievement level by teachers (May 1992, unless otherwise specified)

Grade/Subject           Level I   Level II   Level III   Level IV
Grade 3                 12.0%     28.1%      40.6%       19.2%
Grade 4                 10.3%     27.2%      42.8%       19.6%
Grade 5                 13.0%     27.8%      40.8%       18.3%
Grade 6                 12.1%     28.1%      40.4%       19.4%
Grade 7                 12.4%     27.9%      39.8%       19.9%
Grade 8                 11.2%     28.8%      40.4%       19.6%
Algebra I (May 1993)    14.5%     32.5%      40.4%       12.6%
Geometry (May 1995)     17.2%     30.7%      34.3%       17.8%

When the contrasting groups approach was applied to standard setting for the second edition of the North Carolina mathematics tests, scale scores from the field test year were distributed from lowest to highest. Using the classifications for grade 3 as an example, 12% of 160,000 is 19,200 scores. Counting up to 19,200 on the cumulative frequency distribution gives the scale score below which 19,200 students scored; this scale score became the cut-off between Level I and Level II. The process continued for each of the levels until all cut scores had been derived. It should be noted that, to avoid inflating the number of students categorized as Level IV, the percentage categorized as No Clear Category was removed from the cut score calculations.

Since the administration of the first edition (1992) and the re-norming year (1998), the proportion of students in Level I has continued to decrease and the proportions of students in Levels III and IV have continued to increase. For example, from 1999 to 2000, 2% fewer students were in Level I than the year before, and from 2000 to 2001 there were 1.8% fewer students in Level I than from 1999 to 2000. It was anticipated that this trend would continue, with a similar decrease in the percentage of students in Level I from 2001 to 2002.
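The cumulative-frequency cut-score derivation described above can be sketched in a few lines of Python: the teacher-judged proportions are accumulated and applied to the cumulative frequency distribution of scale scores. The proportions below are the grade 3 values from Table 12; the score distribution itself is simulated for illustration (roughly matching the grade 3 operational mean and standard deviation) and is not actual student data.

import numpy as np

# Teacher-judged proportions for grade 3 (Table 12), Levels I-IV,
# with the "No Clear Category" group excluded.
proportions = np.array([0.120, 0.281, 0.406, 0.192])
proportions = proportions / proportions.sum()

# Hypothetical scale-score frequency distribution for ~160,000 students.
rng = np.random.default_rng(0)
scores = np.clip(np.round(rng.normal(250.6, 7.7, 160_000)), 218, 276).astype(int)
values, counts = np.unique(scores, return_counts=True)
cumulative = np.cumsum(counts)

# Counting up the cumulative distribution gives the three cut points
# that separate the four achievement levels.
cuts = []
for target in np.cumsum(proportions)[:-1] * cumulative[-1]:
    idx = np.searchsorted(cumulative, target)
    cuts.append(int(values[idx]))   # score at which the cumulative count reaches the target

print("Cut points between Levels I/II, II/III, and III/IV:", cuts)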
Rather than develop new standards for the second edition of the North Carolina EOG Tests of Mathematics, which would disrupt the continuous measurement and reporting of academic performance for students, the standards for the second edition were established by maintaining the historical trends mentioned above while making use of the equated scales. Interim academic achievement standards were set using the field test data; the final standards were set on the operational data.

4.6 Score Reporting for the North Carolina Tests

Scores from the North Carolina mathematics tests are reported as scale scores, achievement levels, and percentile ranks. The scale scores are computed through the use of raw-to-scale score conversion tables, and the scale score determines the achievement level in which a student falls. Score reports are generated at the local level to depict performance for individual students, classrooms, schools, and local education agencies. The data can be disaggregated by subgroups of gender and race/ethnicity, as well as by other demographic variables collected during the test administration. Demographic data are reported on variables such as free/reduced lunch status, limited English proficient status, migrant status, Title I status, disability status, and parents' levels of education. The results are reported in aggregate at the state level, usually at the end of June of each year. The NCDPI uses the data for school accountability, student accountability (grades 3, 5, and 8), and to satisfy other federal requirements under the No Child Left Behind Act of 2001.

4.7 Achievement Level Descriptors

The four achievement levels in the North Carolina Testing Program are defined below.

Table 13: Achievement Levels for the North Carolina Testing Program (Administrative Procedures Act 16 NCAC 6D .0501, Definitions related to Student Accountability Standards)

Level I: Students performing at this level do not have sufficient mastery of knowledge and skills in this subject area to be successful at the next grade level.
Level II: Students performing at this level demonstrate inconsistent mastery of knowledge and skills that are fundamental in this subject area and that are minimally sufficient to be successful at the next grade level.
Level III: Students performing at this level consistently demonstrate mastery of grade level subject matter and skills and are well prepared for the next grade level.
Level IV: Students performing at this level consistently perform in a superior manner clearly beyond that required to be proficient at grade level work.

4.8 Achievement Level Cut Scores

The achievement level cut scores for the North Carolina mathematics tests are shown in the table below.

Table 14: EOG and EOC Tests of Mathematics achievement levels and corresponding scale scores

Grade/Subject   Level I    Level II   Level III   Level IV
3 Pretest       211–219    220–229    230–239     240–260
Grade 3         218–237    238–245    246–254     255–276
Grade 4         221–239    240–246    247–257     258–285
Grade 5         221–242    243–249    250–259     260–295
Grade 6         228–246    247–253    254–264     265–296
Grade 7         231–249    250–257    258–266     267–307
Grade 8         235–253    254–260    261–271     272–310
Grade 10        141–159    160–171    172–188     189–226
Algebra I       23–44      45–54      55–65       66–87
Geometry        23–45      46–56      57–66       67–87
Algebra II      23–45      46–57      58–68       69–88

4.9 Achievement Level Trends

The percentage of students in each of the achievement levels is provided below by grade.
Table 15: Achievement level trends for Grade 3 Pretest

Grade 3 Pretest   1995   1996   1997   1998   1999   2000   2001   2002   2003
Level I              *      *    6.2    5.4    4.6    3.3    2.0    1.4    1.1
Level II             *      *   23.5   23.1   20.6   19.7   18.9   15.9   14.0
Level III            *      *   40.6   41.3   41.8   41.7   43.4   42.8   43.6
Level IV             *      *   29.7   30.2   32.9   35.3   35.8   40.0   41.3
*Test not administered

Table 16: Achievement level trends for Grade 3

Grade 3           1995   1996   1997   1998   1999   2000   2001   2002   2003
Level I            9.3    7.9    6.8    7.0    6.3    5.6    4.2    3.2    1.1
Level II          25.6   24.7   23.0   24.8   23.7   22.6   22.2   19.5   10.0
Level III         39.7   39.7   39.6   39.8   40.2   40.0   43.3   43.1   45.9
Level IV          25.4   27.7   30.7   28.4   29.8   31.8   30.3   34.2   42.9

Table 17: Achievement level trends for Grade 4

Grade 4           1995   1996   1997   1998   1999   2000   2001   2002   2003
Level I            8.6    7.2    6.4    4.0    2.9    2.1    1.2    0.9    0.7
Level II          22.9   21.3   19.1   16.8   14.4   13.4   12.0   10.2    4.5
Level III         41.3   43.6   41.9   41.7   43.0   43.7   46.7   45.9   35.6
Level IV          27.2   28.0   32.7   37.6   39.6   40.8   40.0   43.0   59.1

Table 18: Achievement level trends for Grade 5

Grade 5           1995   1996   1997   1998   1999   2000   2001   2002   2003
Level I            9.4    8.5    7.1    5.8    3.8    3.8    2.2    1.7    1.1
Level II          24.1   21.5   19.8   16.1   13.7   13.3   11.2    9.8    6.4
Level III         37.3   38.0   36.2   37.8   35.5   34.3   36.6   35.3   30.7
Level IV          29.2   32.0   36.8   40.2   46.9   48.6   50.1   53.2   61.8

Table 19: Achievement level trends for Grade 6

Grade 6           1995   1996   1997   1998   1999   2000   2001   2002   2003
Level I            8.2    7.0    6.6    5.0    4.3    4.1    3.3    2.2    1.7
Level II          24.1   20.5   20.7   16.7   14.6   14.9   13.8   11.4    8.2
Level III         42.5   43.0   40.5   40.7   39.8   38.1   40.5   39.2   34.5
Level IV          25.1   29.6   32.2   37.7   41.3   42.9   42.4   47.2   55.6

Table 20: Achievement level trends for Grade 7

Grade 7           1995   1996   1997   1998   1999   2000   2001   2002   2003
Level I            8.4    9.0    8.6    5.4    3.9    4.5    3.2    2.7    2.9
Level II          24.5   22.5   20.6   17.7   13.6   14.8   15.5   14.0   13.3
Level III         38.6   38.8   36.9   38.3   37.4   35.1   33.3   32.4   31.1
Level IV          28.5   29.7   34.0   38.6   45.0   45.6   48.0   50.9   52.7

Table 21: Achievement level trends for Grade 8

Grade 8           1995   1996   1997   1998   1999   2000   2001   2002   2003
Level I            8.2    8.8    9.0    5.4    5.4    4.8    5.3    4.2    4.5
Level II          24.2   23.5   22.1   18.3   17.0   14.6   15.2   13.5   11.3
Level III         40.1   38.7   38.4   37.6   37.9   36.5   36.8   35.7   34.1
Level IV          27.5   29.1   30.5   38.7   39.7   44.1   42.7   46.6   50.1

Table 22: Achievement level trends for Grade 10 High School Comprehensive Test

Grade 10          1995   1996   1997   1998   1999   2000   2001   2002   2003
Level I              *      *      *   11.9    8.8    8.8    9.5      *    8.3
Level II             *      *      *   32.5   30.2   29.4   28.9      *   27.0
Level III            *      *      *   41.0   45.2   45.4   44.9      *   47.5
Level IV             *      *      *   14.6   15.9   16.4   16.7      *   17.2
*Test not administered

Table 23: Achievement level trends for Algebra I

Algebra I         1995   1996   1997   1998   1999   2000   2001   2002   2003
Level I           13.9   15.1   14.0   10.8    9.1    9.0    3.2    2.7    2.7
Level II          32.1   31.8   30.6   27.7   25.5   22.1   20.8   18.4   18.7
Level III         40.0   38.7   39.7   41.9   43.4   38.8   44.6   41.2   40.9
Level IV          14.1   14.4   15.8   19.6   22.0   30.1   31.5   37.7   37.7

Table 24: Achievement level trends for Geometry

Geometry          1995   1996   1997   1998   1999   2000   2001   2002   2003
Level I              *      *      *      *   10.0    9.6    4.7    4.3    3.8
Level II             *      *      *      *   30.6   30.3   31.4   29.3   26.7
Level III            *      *      *      *   37.5   36.4   42.1   41.6   41.6
Level IV             *      *      *      *   20.9   23.6   21.9   24.8   27.8
*Test not administered

Table 25: Achievement level trends for Algebra II

Algebra II        1995   1996   1997   1998   1999   2000   2001   2002   2003
Level I              *      *      *      *   10.0    9.0    2.5    2.5    1.6
Level II             *      *      *      *   31.0   28.3   24.5   21.1   19.6
Level III            *      *      *      *   36.0   35.9   40.3   39.0   39.1
Level IV             *      *      *      *   23.0   26.7   32.6   37.8   39.6
*Test not administered

4.10 Percentile Ranking

The percentile rank for each scale score is the percentage of scores less than or equal to that score.
A percentile is a score or a point on the original measurement scale. If the percentile formula is applied to the frequency distribution of scores for grade 3 (see Appendix E for samples of frequency distribution tables), a score of 260 would have a percentile rank of 89. The percentile rank indicates how a student's score on a test compares with the scores of other students in the norming year. The percentile ranks for the scores on the North Carolina mathematics tests are calculated based on the first operational administration of the tests. The use of percentile rank reporting allows meaningful comparisons to be made among mathematics scores at the total test score level.

Chapter Five: Reports

5.1 Use of Test Score Reports Provided by the North Carolina Testing Program

The North Carolina Testing Program provides reports at the student level, school level, and state level. The North Carolina Testing Code of Ethics dictates that educators use test scores and reports appropriately. This means that educators recognize that a test score is only one piece of information and must be interpreted together with other scores and indicators. Test data help educators understand educational patterns and practices. Data analysis of test scores for decision-making purposes should be based upon disaggregation of data by student demographics and other student variables, as well as an examination of grading practices in relation to test scores, growth trends, and goal summaries for state-mandated tests.

5.2 Reporting by Student

The state provides scoring equipment in each school system so that administrators can score all state-required multiple-choice tests. This scoring generally takes place within two weeks after testing so the individual score report can be given to the student and parent before the end of the school year. Each student in grades 3-8 who takes the end-of-grade tests is given a "Parent/Teacher Report." This single sheet provides information on that student's performance on the reading and mathematics tests. A flyer titled "Understanding Your Child's EOG Score" is provided with each "Parent/Teacher Report." This publication offers information for understanding student scores as well as suggestions on what parents and teachers can do to help students in the areas of reading and mathematics. The student report also shows how that student's performance compares to the average scores for the school, the school system, and the state.

A four-level achievement scale is used for the tests. Achievement Level I represents insufficient mastery of the subject. Achievement Level II is inconsistent mastery of the subject. Achievement Level III is consistent mastery and the minimum goal for students. Achievement Level IV is superior mastery of the subject. Students achieving at Level III or Level IV are considered to be at or above grade level. Achievement Level III is the level at which students must score to be considered proficient and to pass to the next grade under the state Student Accountability Standards for grades 3, 5, and 8.

5.3 Reporting by School

Since 1997, student performance on end-of-grade tests for each elementary and middle school has been released by the state through the ABCs of School Accountability. High school student performance began to be reported in 1998 in the ABCs of School Accountability.
For each school, parents and others can see the actual performance for groups of students at the school in reading, mathematics, and writing; the percentage of students tested; whether the school met or exceeded the goals that were set for it; and the status designated by the state. Some schools that do not meet their goals and that have low numbers of students performing at grade level receive help from the state. Other schools, where goals have been reached or exceeded, receive bonuses for the certified staff and teacher assistants in the school.

Local school systems received their first results under No Child Left Behind (NCLB) in July 2003 as part of the state's ABCs accountability program. Under NCLB, each school is evaluated according to whether or not it met Adequate Yearly Progress (AYP). AYP is not only a goal for the school overall but also a goal for each subgroup of students in the school. Every subgroup must meet its goal for the school to meet AYP. AYP is only one part of the state's ABCs accountability model. Complete ABCs results are released in September and show how much growth students in every school made as well as the overall percentage of students who are proficient. The ABCs report is available on the Department of Public Instruction web site at http://abcs.ncpublicschools.org/abcs/. School principals can also provide information about the ABCs report to parents.

5.4 Reporting by the State

The state reports information on student performance in various ways. The North Carolina Report Cards provide information about K-12 public schools (including charter and alternative schools) for schools, school systems, and the state. Each report card includes a school or district profile and information about student performance, safe schools, access to technology, and teacher quality. Because North Carolina participates in the National Assessment of Educational Progress (NAEP), its student performance is included in annual reports released nationally on selected subjects. The state also releases state and local SAT scores each summer.

Chapter Six: Descriptive Statistics and Reliability

6.1 Descriptive Statistics for the First Operational Administration of the Tests

The second editions of the EOG and EOC Tests of Mathematics were administered for the first time in the spring of 2001. Descriptive statistics for the North Carolina Tests of Mathematics' first operational year and population demographics for the operational administration are provided below.
6.2 Means and Standard Deviations for the First Operational Administration of the Tests

Table 26: Descriptive statistics by grade for the 2001 administration of the North Carolina EOG Tests of Mathematics and the 1998 administration of the Grade 10 High School Comprehensive Test

Grade             N         Mean    Standard Deviation
Grade 3 Pretest   102,484   237.4   7.7
Grade 3           102,172   250.6   7.7
Grade 4           100,418   255.8   8.3
Grade 5           100,252   260.0   9.6
Grade 6           100,409   263.2   9.9
Grade 7            97,205   267.1   10.6
Grade 8            93,603   270.0   11.0
Grade 10           73,635   174.3   13.5

Table 27: Mean scale score for the 2001 administration of the North Carolina EOC Mathematics tests

Subject      N        Mean
Algebra I    93,116   61
Geometry     65,515   57
Algebra II   54,909   65

6.3 Population Demographics for the First Operational Administration

Table 28: Population demographics for the 2001 administration of the North Carolina EOG and EOC Tests of Mathematics and the 1998 administration of the Grade 10 High School Comprehensive Test

Grade/Subject   N         % Male   % Female   % American Indian   % Black   % White   % Other   % LEP
3 Pretest       102,484   51.0     49.0       1.5                 29.5      57.7      11.3      2.1
Grade 3         102,172   51.3     48.7       1.4                 30.9      60.0      7.7       2.8
Grade 4         100,418   51.4     48.6       1.4                 30.6      61.0      7.0       2.3
Grade 5         100,252   50.7     49.3       1.4                 30.2      61.7      6.6       2.2
Grade 6         100,409   51.0     49.0       1.3                 31.0      61.4      6.3       1.8
Grade 7          97,205   50.4     49.6       1.4                 33.7      58.9      6.1       1.5
Grade 8          93,603   50.9     49.1       1.3                 30.8      61.7      6.1       1.8
Grade 10         73,635   48.9     51.1       1.3                 27.5      66.7      4.5       0.6
Algebra I        93,116   49.8     50.2       1.3                 28.6      64.1      7.0       0.7
Geometry         65,515   46.4     53.6       1.1                 22.8      70.7      5.5       0.4
Algebra II       54,909   47.0     53.1       1.1                 25.0      68.4      5.5       0.4

6.4 Scale Score Frequency Distributions

The following figures present the frequency distributions of the developmental scale scores from the first statewide administration of the North Carolina EOG and EOC Tests of Mathematics. The frequency distributions are not smooth because of the conversion from raw scores to scale scores. Due to rounding in the conversion process, sometimes two raw scores in the middle of the distribution convert to the same scale score, resulting in the appearance of a spike in that particular scale score.
Figure 6: Math Scale Score Frequency Distribution, Grade 3 (2001; n = 102,172)
Figure 7: Math Scale Score Frequency Distribution, Grade 4 (2001; n = 100,418)
Figure 8: Math Scale Score Frequency Distribution, Grade 5 (2001; n = 100,252)
Figure 9: Math Scale Score Frequency Distribution, Grade 6 (2001; n = 100,409)
Figure 10: Math Scale Score Frequency Distribution, Grade 7 (2001; n = 97,205)
Figure 11: Math Scale Score Frequency Distribution, Grade 8 (2001; n = 93,603)
Figure 12: Algebra I Scale Score Frequency Distribution (2001; n = 93,116)
Figure 13: Geometry Scale Score Frequency Distribution (2001; n = 65,515)
Figure 14: Algebra II Scale Score Frequency Distribution (2001; n = 54,909)

6.5 Reliability of the North Carolina Mathematics Tests

Reliability refers to the consistency of a measure when the testing procedure is repeated on a population of individuals or groups. If any use is to be made of the information from a test, then that information should be stable, consistent, and dependable; in other words, the test results must be reliable. If decisions about individuals are to be made on the basis of test data, then it is desirable that the test results be reliable and that the tests exhibit a reliability coefficient of at least 0.85.
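Coefficient alpha, the internal consistency index reported in section 6.6 below, can be computed directly from a matrix of scored item responses. The following Python sketch uses simulated 0/1 responses rather than actual test records, so the resulting value is illustrative only:

import numpy as np

def coefficient_alpha(item_scores):
    # Cronbach's alpha for an (examinees x items) matrix of item scores.
    n_items = item_scores.shape[1]
    item_variances = item_scores.var(axis=0, ddof=1).sum()
    total_variance = item_scores.sum(axis=1).var(ddof=1)
    return (n_items / (n_items - 1)) * (1.0 - item_variances / total_variance)

# Simulated 0/1 responses: 1,000 examinees by 80 items (hypothetical data).
rng = np.random.default_rng(0)
ability = rng.normal(size=(1000, 1))
difficulty = rng.normal(size=(1, 80))
p_correct = 1.0 / (1.0 + np.exp(-(ability - difficulty)))
responses = (rng.random((1000, 80)) < p_correct).astype(int)

print(f"coefficient alpha = {coefficient_alpha(responses):.2f}")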
There are three broad categories of reliability coefficients recognized as appropriate indices for establishing reliability in tests: (a) coefficients derived from the administration of parallel forms in independent testing sessions (alternate-form coefficients); (b) coefficients obtained by administration of the same instrument on separate occasions (test-retest or stability coefficients); and (c) coefficients based on the relationships among scores derived from individual items or subsets of the items within a test, with all data accruing from a single administration of the test. The last coefficient is known as an internal consistency coefficient (Standards for Educational and Psychological Testing, AERA, APA, NCME, 1985, p. 27). An internal consistency coefficient, coefficient alpha, is the metric used to establish reliability for the North Carolina EOG and EOC Tests of Mathematics.

6.6 Internal Consistency of the North Carolina Mathematics Tests

The following table presents the coefficient alpha indices averaged across forms.

Table 29: Reliability indices averaged across North Carolina EOG and EOC Tests of Mathematics forms

Grade/Subject      Average Coefficient Alpha
Grade 3 Pretest*   0.82
Grade 3            0.96
Grade 4            0.96
Grade 5            0.95
Grade 6            0.96
Grade 7            0.95
Grade 8            0.94
Grade 10           0.94
Algebra I          0.94
Geometry           0.94
Algebra II         0.88
*The grade 3 pretest is 40 items (half of the total number of items on the grade 3 test).

As noted above, the North Carolina EOG and EOC Tests of Mathematics are highly reliable as a whole. In addition, it is important to note that this high degree of reliability extends across gender, ethnicity, LEP status, and disability status. Examining the coefficient alpha values for the different groups reveals that, across all mathematics test forms (including the EOG tests, the math section of the high school comprehensive test, and Algebra I), 87% of the values were at or above 0.94 and all were above 0.91.

Table 30: Reliability indices averaged across North Carolina EOG and EOC Tests of Mathematics forms (Gender)

Grade/Subject   Females   Males
Grade 3         0.96      0.95
Grade 4         0.95      0.96
Grade 5         0.95      0.95
Grade 6         0.95      0.96
Grade 7         0.94      0.95
Grade 8         0.95      0.95
Grade 10        0.94      0.95
Algebra I       0.94      0.95

Table 31: Reliability indices averaged across North Carolina EOG and EOC Tests of Mathematics forms (Ethnicity)

Grade/Subject   Asian   Black   Hispanic   Native American   Multi-Racial   White
Grade 3         0.97    0.95    0.97       0.95              0.95           0.95
Grade 4         0.97    0.94    0.97       0.95              0.94           0.95
Grade 5         0.97    0.94    0.97       0.94              0.95           0.95
Grade 6         0.97    0.94    0.97       0.95              0.95           0.95
Grade 7         0.95    0.94    0.92       0.93              0.93           0.94
Grade 8         0.96    0.96    0.96       0.94              0.95           0.95
Grade 10        0.96    0.93    0.94       0.93              0.94           0.94
Algebra I       0.95    0.91    0.94       0.93              0.94           0.94

Table 32: Reliability indices averaged across North Carolina EOG and EOC Tests of Mathematics forms (Other Characteristics)

Grade/Subject   No Disability   Disability   Not LEP   LEP
Grade 3         0.94            0.97         0.96      0.98
Grade 4         0.94            0.97         0.95      0.98
Grade 5         0.94            0.97         0.95      0.97
Grade 6         0.95            0.96         0.96      0.97
Grade 7         0.94            0.94         0.95      0.96
Grade 8         0.94            0.95         0.95      0.96
Grade 10        0.94            0.94         0.94      0.94
Algebra I       0.94            0.92         0.94      0.94

Although the North Carolina Testing Program administers alternate forms of the test, it is not possible to calculate alternate-forms reliabilities on the tests within the context of a natural test setting. Students take the test one time, and only those students in grades 3, 5, and 8 who do not achieve Level III are required to retake the test.
Thus, the natural population of re-testers has a sharp restriction in range, which would lower the observed correlation. Additionally, North Carolina students are extremely test-wise; attempting a special study on test-retest reliability, in which one of the administrations has no stakes for the student, would give questionable results with this population.

6.7 Standard Error of Measurement

The information provided by the standard error of measurement (SEM) for a given score is important because it assists in determining the accuracy of an examinee's obtained score. It allows a probabilistic statement to be made about an individual's test score. For example, if a score of 100 has an SEM of plus or minus two, then one can conclude that the obtained score of 100 is accurate within plus or minus 2 points with 68% confidence. In other words, a 68% confidence interval for a score of 100 is 98–102. If that student were to be retested, his or her score would be expected to fall in the range of 98–102 about 68% of the time.

The ranges of the standard error of measurement for scores on the North Carolina EOG and EOC Tests of Mathematics are provided in Table 33 below. For students with scores within 2 standard deviations of the mean (95% of the students), standard errors are typically 2 to 3 points. For most of the EOG Tests of Mathematics scale scores, the standard error of measurement in the middle range of scores, particularly at the cut point between Level II and Level III, is 2 to 3 points. Scores at the lower and higher ends of the scale (above the 97.5th percentile and below the 2.5th percentile) have standard errors of measurement of approximately 4 to 6 points. This is typical, as there is less measurement precision associated with more extreme scores.

Table 33: Ranges of standard error of measurement for scale scores by grade or subject

Grade/Subject   Standard Error of Measurement (Range)
3 Pretest       3–6
Grade 3         2–5
Grade 4         2–6
Grade 5         2–6
Grade 6         2–6
Grade 7         2–6
Grade 8         2–6
Grade 10        3–8
Algebra I       2–6
Geometry        2–5
Algebra II      3–7

Additionally, standard error curves are presented in the following figures. The x-axis in each figure is the θ estimate (the estimate of the test-taker's true ability) for examinees, expressed on a scale with mean 0 and standard deviation 1.

Figure 15: Standard Errors of Measurement on the Grade 3 Pretest of Mathematics Test forms
Figure 16: Standard Errors of Measurement on the Grade 3 Mathematics Test forms
Figure 17: Standard Errors of Measurement on the Grade 4 Mathematics Test forms
Figure 18: Standard Errors of Measurement on the Grade 5 Mathematics Test forms
Figure 19: Standard Errors of Measurement on the Grade 6 Mathematics Test forms
Figure 20: Standard Errors of Measurement on the Grade 7 Mathematics Test forms
Figure 21: Standard Errors of Measurement on the Grade 8 Mathematics Test forms
Figure 22: Standard Errors of Measurement on the Grade 10 Mathematics Test forms
Figure 23: Standard Errors of Measurement on the Algebra I Test forms
Figure 24: Standard Errors of Measurement on the Geometry Test forms
Figure 25: Standard Errors of Measurement on the Algebra II Test forms

6.8 Equivalency of Test Forms

North Carolina administers multiple forms of each test during each testing cycle. This serves several purposes. First, it allows North Carolina to fully test the breadth and depth of each curriculum.
6.8 Equivalency of Test Forms

North Carolina administers multiple forms of each test during each testing cycle. This serves several purposes. First, it allows North Carolina to fully test the breadth and depth of each curriculum. The curricula are extremely rich, and a single form that fully addressed each competency would be prohibitively long. Additionally, the use of multiple forms reduces the incidence of one student copying from the test of another student.

The tests are parallel in terms of content coverage at the goal level. That is, each form has the same number of items from the Number Sense, Numeration, and Numerical Operations strand (Goal 1) as every other form administered in that grade. The specific questions asked on each form are a random domain sample of the topics in that grade's goals, although care is taken not to overemphasize a particular topic on a single test form.

The tests are statistically equivalent at the total test score level. Additionally, the two parts of the mathematics tests, Calculator Active and Calculator Inactive, are also equivalent at the whole-score level. That is, all the Calculator Active portions of the tests for a given grade are equally difficult. However, due to the purposively random selection of items tested in each goal, the tests are not statistically equated at the goal level.

The use of multiple equivalent and parallel forms has given rise to several "urban legends," foremost among which is that "the red form is harder" (referring to the color of the front cover of one of the three test booklets). However, as the following figures show, the test forms are indeed equivalent.

Figure 26: Test Characteristic Curves for the Grade 3 Pretest of Mathematics Test forms

Figure 27: Test Characteristic Curves for the Grade 3 Mathematics Test forms

Figure 28: Test Characteristic Curves for the Grade 4 Mathematics Test forms

Figure 29: Test Characteristic Curves for the Grade 5 Mathematics Test forms

Figure 30: Test Characteristic Curves for the Grade 6 Mathematics Test forms

Figure 31: Test Characteristic Curves for the Grade 7 Mathematics Test forms

Figure 32: Test Characteristic Curves for the Grade 8 Mathematics Test forms

Figure 33: Test Characteristic Curves for the Grade 10 Mathematics Test forms

Figure 34: Test Characteristic Curves for the Algebra I Test forms

Figure 35: Test Characteristic Curves for the Geometry Test forms

Figure 36: Test Characteristic Curves for the Algebra II Test forms

For each grade's set of test forms, the test characteristic curves are very nearly coincident across much of the range of θ. Slight variations appear in the curves at the extremes, as the tests were designed to have maximum sensitivity in the middle of the range of examinee ability.
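For illustration, the following minimal sketch shows how test characteristic curves of the kind plotted in Figures 26 through 36 can be computed from three-parameter logistic (3PL) item parameters (slope, threshold, and asymptote, as defined in the Glossary). The item parameters, the 80-item form length, and the 1.7 scaling constant are illustrative assumptions, not the operational item pool.

# A minimal sketch of test characteristic curves (TCCs) built from 3PL item
# parameters (slope a, threshold b, asymptote c). The two hypothetical forms
# below stand in for parallel operational forms; the D = 1.7 scaling constant
# is the conventional normal-ogive scaling and an assumption here.
import numpy as np

def three_pl(theta: np.ndarray, a: float, b: float, c: float, D: float = 1.7) -> np.ndarray:
    """Probability of a correct response under the three-parameter logistic model."""
    return c + (1.0 - c) / (1.0 + np.exp(-D * a * (theta - b)))

def test_characteristic_curve(theta: np.ndarray, items: list[tuple[float, float, float]]) -> np.ndarray:
    """Expected summed score at each theta: the sum of the item probabilities."""
    return sum(three_pl(theta, a, b, c) for a, b, c in items)

theta = np.linspace(-3, 3, 121)
rng = np.random.default_rng(1)
# Two hypothetical 80-item forms drawn from the same kind of item pool
# (asymptote fixed at 0.15, the typical value cited for mathematics items).
form_a = [(rng.uniform(0.7, 1.6), rng.normal(0, 1), 0.15) for _ in range(80)]
form_b = [(rng.uniform(0.7, 1.6), rng.normal(0, 1), 0.15) for _ in range(80)]
gap = np.abs(test_characteristic_curve(theta, form_a) - test_characteristic_curve(theta, form_b))
print(f"largest expected-score gap between the two forms: {gap.max():.1f} points")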
Chapter Seven: Evidence of Validity

7.1 Evidence of Validity

The validity of a test is the degree to which evidence and theory support the interpretation of test scores. Validity provides a check on how well a test fulfills its function. For all forms of test development, the validity of the test is an issue to be addressed from the first stage of development through the analysis and reporting of scores. The process of validation involves accumulating evidence to provide a sound scientific basis for the proposed test score interpretations. It is those interpretations of test scores that are evaluated, rather than the test itself. Validation, when possible, should include several types of evidence, and the quality of the evidence is of primary importance (AERA, APA, NCME, 1985). For the North Carolina EOG and EOC Tests of Mathematics, evidence of validity is provided through content relevance and through the relationship of test scores to external variables.

7.2 Content Validity

Evidence of content validity begins with an explicit statement of the constructs or concepts being measured by the proposed test. The constructs or concepts measured by the North Carolina EOG Tests of Mathematics are categorized by four basic strands: Number Sense, Numeration, and Numerical Operations; Spatial Sense, Measurement, and Geometry; Patterns, Relationships, and Functions; and Data, Probability, and Statistics. All items developed for the North Carolina EOG Tests of Mathematics are written to measure those four constructs. Algebra I, Algebra II, and Geometry comprise the EOC Tests of Mathematics. These tests measure the levels of mathematics knowledge specific to the three areas, with particular focus on assessing students' ability to process information and engage in higher-order thinking. For test specification summaries, see Appendix B.

Almost all of the items are written by North Carolina teachers and other educators. Many of the first round of second-edition mathematics items were written under a contract with a major testing company, which handled the logistics, but the contract specified that at least half of the items be written by teachers from North Carolina. During the additional field tests, the vast majority of the items were written by North Carolina educators. Additionally, all items written are reviewed by at least two content-area teachers from North Carolina, and the state's teachers are involved in other aspects of item development and test review. North Carolina educators not only deliver the Standard Course of Study every day in their classrooms; they are also the most familiar with the way in which students learn and understand the material. Thus, North Carolina teachers are best able to recognize questions that not only match the Standard Course of Study for their particular course or grade but are also relevant and comprehensible to the students at that level.

Instructional Validity

DPI routinely administers questionnaires to teachers in an effort to evaluate the validity and appropriateness of the North Carolina End-of-Grade and End-of-Course Tests of Mathematics. Teachers are asked to evaluate the following statements using a five-point scale, with the highest rating being "to a superior degree" and the lowest rating being "not at all."

1. The test content reflects the goals and objectives of the Grade X Mathematics curriculum as outlined on the enclosed list of Grade X Mathematics objectives.
2. The test content reflects the goals and objectives of the Grade X Mathematics curriculum as Grade X is taught in my school or school system.
3. The items are clearly and concisely written, and the vocabulary is appropriate to the target age level.
4. The content is balanced in relation to ethnicity, race, sex, socioeconomic status, and geographic districts of the state.
5. Each of the items has one and only one answer that is best; however, the distractors appear plausible for someone who has not achieved mastery of the represented objective.

In the most recent administrations, responses to these statements reflect that the tests generally met the criteria to a "superior" or "high" degree. All tests and grades showed similar patterns in their responses; the results shown below are in aggregate.
Table 34: Instructional validity of the content of the North Carolina EOG Tests of Mathematics

Statement   % indicating "to a superior or high degree"
1           85%
2           58%
3           55%
4           85%
5           48%

7.3 Criterion-Related Validity

Analysis of the relationship of test scores to variables external to the test provides another important source of validity evidence. External variables may include measures of criteria that the test is expected to predict, as well as other tests hypothesized to measure the same constructs. The criterion-related validity of a test indicates the effectiveness of the test in predicting an individual's performance in a specific situation. The criterion used to evaluate the performance of a test can be measured at the same time as the test (concurrent validity) or at some later time (predictive validity).

For the North Carolina EOG and EOC Tests of Mathematics, teachers' judgments of student achievement, expected grades, and assigned achievement levels all serve as sources of evidence of concurrent validity. The Pearson correlation coefficient is used to provide a measure of association between the scale score and the variables listed above. The correlation coefficients for the North Carolina EOG and EOC Tests of Mathematics range from 0.49 to 0.89, indicating moderate to strong correlations between scale scores and the associated variables.* The tables below provide the Pearson correlation coefficients for the variables used to establish criterion-related validity for the North Carolina EOG and EOC Tests of Mathematics.

*Note: By comparison, the uncorrected correlation coefficient between SAT scores and freshman-year grades in college is variously reported as 0.35 to 0.55 (Camara & Echternacht, 2000).

Table 35: Pearson correlation coefficients for variables used to establish criterion-related validity for the North Carolina EOG Tests of Mathematics

Variable pair                                                         Grade 3   Grade 4   Grade 5   Grade 6   Grade 7   Grade 8
Teacher judgment of achievement level by assigned achievement level   0.59      0.55      0.54      0.58      0.55      0.58
Teacher judgment of achievement by expected grade                     0.70      0.70      0.67      0.63      0.61      0.60
Teacher judgment of achievement by math scale score                   0.65      0.61      0.63      0.64      0.62      0.62
Assigned achievement level by expected grade                          0.65      0.61      0.57      0.54      0.49      0.49
Expected grade by math scale score                                    0.69      0.68      0.67      0.61      0.58      0.56

Table 36: Pearson correlation coefficients for variables used to establish criterion-related validity for the North Carolina EOC Tests of Mathematics

Variable pair                                                    Algebra I   Geometry   Algebra II
Assigned achievement level by expected grade                     0.57        0.60       0.54
Teacher judgment of achievement by assigned achievement level    0.54        0.55       0.48
Expected grade by math scale score                               0.62        0.64       0.58
Teacher judgment of achievement by math scale score              0.58        0.59       0.53

The variables used in the tables above are defined as follows:

• Teacher Judgment of Achievement: Teachers were asked, for each student participating in the test, to evaluate the student's absolute ability, external to the test, based on their knowledge of the student's achievement. The categories that teachers could use correspond to the achievement level descriptors mentioned previously on page 49.

• Assigned Achievement Level: The achievement level assigned to a student on the basis of his or her test score, using the cut scores previously described on page 49.

• Expected Grade: Teachers were also asked to provide, for each student, the letter grade they anticipated the student would receive at the end of the grade or course.

• Math Scale Score: The converted raw-score-to-scale-score value obtained by each examinee.
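For illustration, the following minimal sketch shows a Pearson correlation of the kind reported in Tables 35 and 36, computed between a scale score and a teacher-judgment rating. The simulated student records and variable names are hypothetical stand-ins for the DPI student-level files.

# A minimal sketch of the concurrent-validity computation: a Pearson correlation
# between scale scores and an external rating such as teacher judgment of
# achievement. The data below are simulated, not DPI records.
import numpy as np

def pearson_r(x: np.ndarray, y: np.ndarray) -> float:
    """Pearson product-moment correlation coefficient."""
    x_c, y_c = x - x.mean(), y - y.mean()
    return float((x_c * y_c).sum() / np.sqrt((x_c**2).sum() * (y_c**2).sum()))

rng = np.random.default_rng(2)
scale_score = rng.normal(260, 10, size=5000)
# A hypothetical teacher judgment on a 1-4 achievement-level scale, loosely tied to the score.
teacher_judgment = np.clip(np.round((scale_score - 240) / 12 + rng.normal(0, 1, 5000)), 1, 4)
print(round(pearson_r(scale_score, teacher_judgment), 2))  # a moderate positive correlation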
DPI found moderate to strong correlations between scale scores in mathematics and variables such as teachers' judgment of student achievement, expected grade, and assigned achievement level (all measures of concurrent validity). The department also found generally low correlations between these scale scores and variables external to the test, such as gender, limited English proficiency status, and disability status, for grades 3 through 8, the High School Comprehensive Test of Mathematics (grade 10), and Algebra I. The vast majority of the correlations between scale scores and gender or limited English proficiency status were less extreme than ±0.10, and most of the correlations between scale scores and disability status were less extreme than ±0.30. None of these relationships approached the levels recorded for the selected measures of concurrent validity. These generalizations held across the full range of forms administered by DPI for all grades and subject areas.

An additional source of concurrent validity evidence is the correspondence between trends in students' performance on the National Assessment of Educational Progress (NAEP) and trends in end-of-grade scores. Although the scores themselves cannot and should not be compared directly, nor is it valid to compare the percent "proficient" on each test, the trends show corresponding increases in NAEP mathematics scores and in scores on the North Carolina EOG tests in mathematics. Figures 37 through 40 show the trends for students who scored "basic" or "proficient" on NAEP assessments in grades 4 and 8 compared with students who scored at Level III or above on the North Carolina End-of-Grade Tests of Mathematics in grades 4 and 8.

Figure 37: Comparison of NAEP "proficient" scores and North Carolina End-of-Grade Tests of Mathematics scores for Grade 4

Figure 38: Comparison of NAEP "basic" scores and North Carolina End-of-Grade Tests of Mathematics scores for Grade 4

Figure 39: Comparison of NAEP "proficient" scores and North Carolina End-of-Grade Tests of Mathematics scores for Grade 8

Figure 40: Comparison of NAEP "basic" scores and North Carolina End-of-Grade Tests of Mathematics scores for Grade 8

Chapter Eight: Quality Control Procedures

Quality control procedures for the North Carolina Testing Program are implemented throughout all stages of testing, including test development, test administration, score analysis, and reporting.

8.1 Quality Control Prior to Test Administration

Once test forms have been assembled, they are reviewed by a panel of subject experts. Once the review panel has approved a test form, the forms are configured to go through the printing process. Printers send a blue-lined proof back to NCDPI Test Development staff to review and adjust if necessary. Once all test answer sheets and booklets are printed, the test project manager conducts a spot check of test booklets to ensure that all test pages are included and that test items are in order.

8.2 Quality Control in Data Preparation and Test Administration

Student background information must be coded before testing begins. The school system may elect to (1) pre-code the answer sheets, (2) direct the test administrator to code the Student Background Information, or (3) direct the students to code the Student Background Information.
For the North Carolina multiple-choice tests, the school system may elect to pre-code some or all of the Student Background Information on SIDE 1 of the printed multiple-choice answer sheet. The pre-coded responses come from the schools' SIMS/NCWISE database. Pre-coded answer sheets provide schools with the opportunity to correct or update information in the SIMS/NCWISE database; in such cases, the test administrator ensures that the pre-coded information is accurate. The test administrator must know what information will be pre-coded on the student answer sheets in order to prepare for the test administration. Directions for instructing students to check the accuracy of these responses are located in the test administrator manuals. All corrections to pre-coded responses are provided to a person designated by the school system test coordinator to make such corrections. Students and test administrators must not change, alter, or erase pre-coding on students' answer sheets.

To ensure that all students participate in the required tests and to eliminate duplications, all students, regardless of whether they take the multiple-choice test or an alternate assessment, are required to complete the student background information on the answer sheets. When tests and answer sheets are received by the local schools, they are kept in a locked, secure location. Class rosters are reviewed for accuracy by the test administrator to ensure that students receive their own answer sheets. During test administration at the school level, proctors and test administrators circulate throughout the test facility (typically a classroom) to ensure that students are using the bubble sheets correctly. Once students have completed their tests, answer sheets are reviewed and, where appropriate, cleaned by local test coordinators (removal of stray marks, etc.).

8.3 Quality Control in Data Input

All answer sheets are then sent from the individual schools to the local test coordinator, where they are scanned in a secure facility. The use of a scanner provides the opportunity to program in a number of quality control mechanisms to ensure that errors overlooked in the manual check of data are identified and resolved. For example, if an answer sheet is unreadable by the scanner, the scanner stops the scan process until the error is resolved. In addition, if a student bubbles in two answers for the same question, the scan records the student's answer as an asterisk (*), indicating that the student has answered twice.
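For illustration, the following minimal sketch (not the WINSCAN program or the operational scanning software) shows the kind of answer-sheet check described above, in which a question with more than one bubble marked is recorded as an asterisk (*). The function names and the three-item example are hypothetical.

# A minimal sketch of resolving scanned bubble marks: a single clean mark is kept,
# a multiple mark is recorded as "*", and an omitted item is recorded as "-".
# Scoring simply counts matches against the answer key.
def resolve_response(marks: set[str]) -> str:
    """Collapse the marks detected for one question into a single recorded response."""
    if len(marks) == 1:
        return marks.pop()        # a single clean bubble: A, B, C, or D
    if len(marks) > 1:
        return "*"                # double-bubbled: recorded as "*" (will not match the key)
    return "-"                    # omitted item

def score_sheet(detected_marks: list[set[str]], key: list[str]) -> int:
    """Number-correct raw score for one answer sheet."""
    responses = [resolve_response(set(m)) for m in detected_marks]
    return sum(r == k for r, k in zip(responses, key))

# Hypothetical three-item example: a clean answer, a double bubble, and an omission.
print(score_sheet([{"B"}, {"A", "C"}, set()], ["B", "C", "D"]))  # prints 1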
8.4 Quality Control of Test Scores

Once all tests are scanned, the results are sent through a secure system to the Regional Accountability Coordinators, who check to ensure that all schools in all LEAs have completed and returned student test scores. The Regional Accountability Coordinators also conduct a spot check of the data and then send the data through a secure server to the North Carolina Department of Public Instruction Division of Accountability Services. Data are then imported into a file and cleaned. When a portion of the data are in, NCDPI runs a CHECK KEYS program to flag areas where answer keys may need a second check. In addition, as data come into the NCDPI Division of Accountability Services, Reporting Section staff import and clean the data to ensure that individual student files are complete.

8.5 Quality Control in Reporting

Scores can be reported at the school level only after NCDPI issues a certification statement. This is to ensure that school, district, and state-level quality control procedures have been employed. The certification statement is issued by the NCDPI Division of Accountability Services. The following certification statement is an example:

"The department hereby certifies the accuracy of the data from the North Carolina end-of-course tests for Fall 2004 provided that all NCDPI-directed test administration guidelines, rules, procedures, and policies have been followed at the district and schools in conducting proper test administrations and in the generation of the data. The LEAs may generate the required reports for the end-of-course tests as this completes the certification process for the EOC tests for the Fall 2004 semester."

Glossary of Key Terms

The terms below are defined by their application in this document and their common uses in the North Carolina Testing Program. Some of the terms refer to complex statistical procedures used in the process of test development. In an effort to avoid excessive technical jargon, the definitions have been simplified; they should not be considered exhaustive.

Accommodations: Changes made in the format or administration of the test to provide options to test takers who are unable to take the original test under standard test conditions.

Achievement levels: Descriptions of a test taker's competency in a particular area of knowledge or skill, usually defined as ordered categories on a continuum classified by broad ranges of performance.

Asymptote: An item statistic that describes the proportion of examinees that endorsed a question correctly but did poorly on the overall test. The asymptote for a theoretical four-choice item is 0.25 but can vary somewhat by test (for mathematics it is generally 0.15, and for social studies it is generally 0.22).

Biserial correlation: The relationship between an item score (right or wrong) and the total test score.

Common curriculum: Objectives that are unchanged between the old and new curricula.

Cut scores: Specific points on a score scale, such that scores at or above a given point are interpreted or acted upon differently from scores below that point.

Dimensionality: The extent to which a test item measures more than one ability.

Embedded test model: Using an operational test to field test new items or sections. The new items or sections are "embedded" in the test and appear to examinees as indistinguishable from the operational items.

Equivalent forms: Forms with statistically insignificant differences in difficulty (i.e., the red form is not harder).

Field test: A collection of items assembled to approximate how a test form will work. The statistics produced are used in interpreting item behavior/performance and allow for the calibration of item parameters used in equating tests.

Foil counts: The number of examinees who endorse each foil (e.g., the number who answer "A," the number who answer "B," etc.).

Item response theory: A method of test item analysis that takes into account the ability of the examinee and determines characteristics of the item relative to other items in the test. The NCDPI uses the three-parameter model, which provides slope, threshold, and asymptote.

Item tryout: A collection of a limited number of items of a new type, a new format, or a new curriculum. Only a few forms are assembled to determine the performance of the new items, and not all objectives are tested.

Mantel-Haenszel: A statistical procedure that examines differential item functioning (DIF), that is, the relationship between a score on an item and the different groups answering the item (e.g., gender, race). This procedure is used to identify individual items for further bias review.
Operational test: A test administered statewide with uniform procedures, full reporting of scores, and stakes for examinees and schools.

p-value: The difficulty of an item, defined as the proportion of examinees who answered the item correctly.

Parallel forms: Forms that cover the same curricular material as the other forms.

Percentile: The score on a test below which a given percentage of scores fall.

Pilot test: A test administered as if it were "the real thing" but with limited associated reporting or stakes for examinees or schools.

Quasi-equated: Item statistics are available for items that have been through item tryouts (although they could change after revisions); field test forms are developed using this information to maintain similar difficulty levels to the extent possible.

Raw score: The unadjusted score on a test, determined by counting the number of correct answers.

Scale score: A score to which raw scores are converted by numerical transformation. Scale scores allow for comparison of different forms of the test using the same scale.

Slope: The ability of a test item to distinguish between examinees of high and low ability.

Standard error of measurement: The standard deviation of an individual's observed scores, usually estimated from group data.

Test blueprint: The testing plan, which includes the number of items from each objective to appear on the test and the arrangement of objectives.

Threshold: The point on the ability scale where the probability of a correct response is fifty percent. The threshold for an item of average difficulty is 0.00.

WINSCAN Program: A proprietary computer program that contains the test answer keys and files necessary to scan and score the state multiple-choice tests. Student scores and local reports can be generated immediately using the program.

References

Camara, W. J., & Echternacht, G. (2000). The SAT I and High School Grades: Utility in Predicting Success in College. Research Notes RN-10, July 2000 (p. 6). The College Board Office of Research and Development.

Gregory, Robert J. (2000). Psychological Testing: History, Principles, and Applications. Needham Heights: Allyn & Bacon.

Hambleton, Ronald K. (1983). Applications of Item Response Theory. British Columbia: Educational Research Institute of British Columbia.

Hinkle, D. E., Wiersma, W., & Jurs, S. G. (1998). Applied Statistics for the Behavioral Sciences (pp. 69-70).

Muraki, E., Mislevy, R. J., & Bock, R. D. (1991). PC-BiMain: Analysis of item parameter drift, differential item functioning, and variant item performance [Computer software]. Mooresville, IN: Scientific Software, Inc.

Marzano, R. J., Brandt, R. S., Hughes, C. S., Jones, B. F., Presseisen, B. Z., Stuart, C., & Suhor, C. (1988). Dimensions of Thinking. Alexandria, VA: Association for Supervision and Curriculum Development.

Millman, J., & Greene, J. (1993). The Specification and Development of Tests of Achievement and Ability. In Robert Linn (Ed.), Educational Measurement (pp. 335-366). Phoenix: American Council on Education and Oryx Press.

Thissen, D., & Orlando, M. (2001). Item response theory for items scored in two categories. In D. Thissen & H. Wainer (Eds.), Test Scoring (pp. 73-140). Mahwah, NJ: Lawrence Erlbaum Associates.

Williams, V. S. L., Pommerich, M., & Thissen, D. (1998). A comparison of developmental scales based on Thurstone methods and item response theory. Journal of Educational Measurement, 35, 93-107.

Additional Resources

Anastasi, A. (1982). Psychological Testing. New York: Macmillan Publishing Company, Inc.
Averett, C. P. (1994). North Carolina End-of-Grade Tests: Setting standards for the achievement levels. Unpublished manuscript.

Berk, R. A. (1984). A Guide to Criterion-Referenced Test Construction. Baltimore: The Johns Hopkins University Press.

Berk, R. A. (1982). Handbook of Methods for Detecting Test Bias. Baltimore: The Johns Hopkins University Press.

Bock, R. D., Gibbons, R., & Muraki, E. (1988). Full information factor analysis. Applied Psychological Measurement, 12, 261-280.

Camilli, G., & Shepard, L. A. (1994). Methods for Identifying Biased Test Items. Thousand Oaks, CA: Sage Publications, Inc.

Campbell, D. T., & Fiske, D. W. (1959). Convergent and discriminant validation by the multitrait-multimethod matrix. Psychological Bulletin, 56, 81-105.

Cattell, R. B. (1956). Validation and intensification of the Sixteen Personality Factor Questionnaire. Journal of Clinical Psychology, 12, 105-214.

Dorans, N. J., & Holland, P. W. (1993). DIF detection and description: Mantel-Haenszel and standardization. In P. W. Holland & H. Wainer (Eds.), Differential Item Functioning (pp. 35-66). Hillsdale, NJ: Lawrence Erlbaum.

Haladyna, T. M. (1994). Developing and Validating Multiple-Choice Test Items. Hillsdale, NJ: Lawrence Erlbaum Associates, Publishers.

Hambleton, R. K., & Swaminathan, H. (1985). Item Response Theory: Principles and Applications. Kluwer-Nijhoff Publishing.

Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of Item Response Theory. Newbury Park, CA: Sage Publications, Inc.

Holland, P. W., & Wainer, H. (1993). Differential Item Functioning. Hillsdale, NJ: Lawrence Erlbaum Associates, Inc.

Joreskog, K. J., & Sorbom, D. (1986). PRELIS: A program for multivariate data screening and data summarization. Chicago, IL: Scientific Software, Inc.

Joreskog, K. J., & Sorbom, D. (1988). LISREL 7: A guide to the program and applications. Chicago, IL: SPSS, Inc.

Kubiszyn, T., & Borich, G. (1990). Educational Testing and Measurement. New York: HarperCollins Publishers.

Muraki, E., Mislevy, R. J., & Bock, R. D. (1991). PC-BiMain Manual. Chicago, IL: Scientific Software, Inc.

National Council of Teachers of Mathematics. (1989). Curriculum and Evaluation Standards for School Mathematics. Reston, VA: Author.

North Carolina Department of Public Instruction. (1992). Teacher Handbook for Mathematics. Raleigh, NC: Author.

North Carolina Department of Public Instruction. (1993). North Carolina End-of-Grade Testing Program: Background Information. Raleigh, NC: Author.

North Carolina Department of Public Instruction. (1996). North Carolina Testing Code of Ethics. Raleigh, NC: Author.

North Carolina State Board of Education. (1993). Public School Laws of North Carolina 1994. Raleigh, NC: The Michie Company.

Nunnally, J. (1978). Psychometric Theory. New York: McGraw-Hill Book Company.

Rosenthal, R., & Rosnow, R. L. (1984). Essentials of behavioral research: Methods and data analysis. New York: McGraw-Hill Book Company.

SAS Institute, Inc. (1985). The FREQ Procedure. In SAS User's Guide: Statistics, Version 5 Edition. Cary, NC: Author.

Traub, R. E. (1994). Reliability for the social sciences: Theory and applications. Thousand Oaks, CA: Sage Publications, Inc.

Appendix A: Item Development Guidelines

Content Guidelines

1. Items must be based on the goals and objectives outlined in the North Carolina Standard Course of Study in Mathematics and written for the appropriate grade level.
2. To the extent possible, each item written should measure a single concept, principle, procedure, or competency.
3. Write items that measure important or significant material instead of trivial material.
4. Keep the testing vocabulary consistent with the expected grade level of the students tested.
5. Avoid writing stems based on opinions.
6. Emphasize higher-level thinking skills using the taxonomy provided by the NCDPI.

Procedural Guidelines

7. Use the best-answer format.
8. Avoid writing complex multiple-choice items.
9. Format the items vertically, not horizontally.
10. Avoid errors of grammar, abbreviation, punctuation, and spelling.
11. Minimize student reading time.
12. Avoid tricky or misleading items.
13. Avoid the use of contractions.
14. Avoid the use of first or second person.

Stem Construction Guidelines

15. Items are to be written in the question format.
16. Ensure that the directions written in the stems are clear and that the wording lets the students know exactly what is being tested.
17. Avoid excessive verbiage when writing the stems.
18. Word the stems positively, avoiding any negative phrasing. The use of negatives such as NOT and EXCEPT is to be avoided.
19. Write the items so that the central idea and the phrasing are included in the stem instead of the foils.
20. Place the interrogative as close to the item foils as possible.

General Foil Development

21. Each item must contain four foils (A, B, C, D).
22. Order the answer choices in a logical order. Numbers should be listed in ascending or descending order.
23. Each item written should contain foils that are independent and not overlapping.
24. All foils in an item should be homogeneous in content and length.
25. Do not use the following as foils: "all of the above," "none of the above," "I don't know."
26. Word the foils positively, avoiding any negative phrasing. The use of negatives such as NOT and EXCEPT is to be avoided.
27. Avoid providing clues to the correct response. Avoid writing items where phrases in the stem (clang associations) are repeated in the foils.
28. Avoid including ridiculous options.
29. Avoid grammatical clues to the correct answer.
30. Avoid specific determiners, because they are so extreme that they are seldom the correct response. To the extent possible, specific determiners such as ALWAYS, NEVER, TOTALLY, and ABSOLUTELY should not be used when writing items. Qualifiers such as best, most likely, approximately, etc. should be bold and italic.
31. The correct responses for the items written should be evenly balanced among the response options. For a four-option multiple-choice item, the correct response should be located at each option position about 25% of the time (a simple check of this balance is sketched after this list).
32. The items written should contain one and only one best (correct) answer.

Distractor Development

33. Use plausible distractors. The best (correct) answer must clearly be the best (correct) answer, and the incorrect responses must clearly be inferior to the best (correct) answer. No distractor should be obviously wrong.
34. To the extent possible, use the common errors made by students as distractors. Give the reasoning for incorrect choices on the back of the item spec sheet.
35. Technically written phrases may be used, where appropriate, as plausible distractors.
36. True phrases that do not correctly respond to the stem may be used as plausible distractors where appropriate.
37. The use of humor should be avoided.
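For illustration, the following minimal sketch shows the key-balance check referenced in guideline 31: for a four-option item set, the proportion of items keyed to each option position should be near 25%. The answer key shown is hypothetical.

# A minimal sketch of a key-balance check: count how often each option
# position (A-D) holds the correct answer across a set of items.
from collections import Counter

def key_balance(answer_key: list[str], options: str = "ABCD") -> dict[str, float]:
    """Proportion of items keyed to each option position."""
    counts = Counter(answer_key)
    return {option: counts[option] / len(answer_key) for option in options}

hypothetical_key = list("ABCDDCBAABCDCBADABCD")   # a made-up 20-item key
for option, proportion in key_balance(hypothetical_key).items():
    print(f"{option}: {proportion:.0%}")          # each option lands near 25%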
Appendix B: Test Blueprint Summaries

Mathematics Grade 3: Test Blueprint Summary

Number Sense, Numeration, and Numerical Operations
Competency Goal One: The learner will read, write, model, and compute with rational numbers.
Goal One Summary: average number of items per form, 32; average number of items per class, 96; difficulty of pool (range), 0.498–0.758.

Spatial Sense, Measurement, and Geometry
Competency Goal Two: The learner will recognize, understand, and use basic geometric properties and standard units of metric and customary measurement.
Goal Two Summary: average number of items per form, 24; average number of items per class, 72; difficulty of pool (range), 0.513–0.808.

Patterns, Relationships, and Functions
Competency Goal Three: The learner will demonstrate an understanding of classification, patterning, and seriation.
Goal Three Summary: average number of items per form, 12; average number of items per class, 36; difficulty of pool (range), 0.483–0.693.

Data, Probability, and Statistics
Competency Goal Four: The learner will demonstrate an understanding of data collection, display, and interpretation.
Goal Four Summary: average number of items per form, 12; average number of items per class, 36; difficulty of pool (range), 0.577–0.880.

Mathematics Grade 4: Test Blueprint Summary

Number Sense, Numeration, and Numerical Operations
Competency Goal One: The learner will read, write, model, and compute with rational numbers.
Goal One Summary: average number of items per form, 30; average number of items per class, 90; difficulty of pool (range), 0.500–0.800.

Spatial Sense, Measurement, and Geometry
Competency Goal Two: The learner will demonstrate an understanding and use of the properties and relationships in geometry, and standard units of metric and customary measurement.
Goal Two Summary: average number of items per form, 23; average number of items per class, 69; difficulty of pool (range), 0.449–0.684.

Patterns, Relationships, and Functions
Competency Goal Three: The learner will demonstrate an understanding of patterns and relationships.
Goal Three Summary: average number of items per form, 11; average number of items per class, 33; difficulty of pool (range), 0.600–0.643.

Data, Probability, and Statistics
Competency Goal Four: The learner will demonstrate an understanding and use of graphing, probability, and data analysis.
Goal Four Summary: average number of items per form, 16; average number of items per class, 49; difficulty of pool (range), 0.468–0.735.

Mathematics Grade 5: Test Blueprint Summary

Number Sense, Numeration, and Numerical Operations
Competency Goal One: The learner will understand and compute with rational numbers.
Goal One Summary: average number of items per form, 32; average number of items per class, 96; difficulty of pool (range), 0.458–0.757.

Spatial Sense, Measurement, and Geometry
Competency Goal Two: The learner will understand and compute with rational numbers.
Goal Two Summary: average number of items per form, 23; average number of items per class, 69; difficulty of pool (range), 0.417–0.602.

Patterns, Relationships, and Functions
Competency Goal Three: The learner will demonstrate an understanding of patterns, relationships, and elementary algebraic representation.
Goal Three Summary: average number of items per form, 13; average number of items per class, 39; difficulty of pool (range), 0.401–0.576.

Data, Probability, and Statistics
Competency Goal Four: The learner will demonstrate an understanding and use of graphing, probability, and data analysis.
Goal Four Summary: average number of items per form, 15; average number of items per class, 45; difficulty of pool (range), 0.319–0.634.

Mathematics Grade 6: Test Blueprint Summary

Number Sense, Numeration, and Numerical Operations
Competency Goal One: The learner will understand and compute with rational numbers.
Goal One Summary: average number of items per form, 29; average number of items per class, 87; difficulty of pool (range), 0.353–0.700.

Spatial Sense, Measurement, and Geometry
Competency Goal Two: The learner will demonstrate an understanding and use of the properties and relationships in geometry and standard units of metric and customary measurement.
Goal Two Summary: average number of items per form, 23; average number of items per class, 69; difficulty of pool (range), 0.381–0.697.

Patterns, Relationships, and Functions
Competency Goal Three: The learner will demonstrate an understanding of patterns, relationships, and algebraic representations.
Goal Three Summary: average number of items per form, 14; average number of items per class, 42; difficulty of pool (range), 0.449–0.570.

Data, Probability, and Statistics
Competency Goal Four: The learner will demonstrate an understanding and use of graphing, probability, and data analysis.
Goal Four Summary: average number of items per form, 14; average number of items per class, 42; difficulty of pool (range), 0.421–0.647.

Mathematics Grade 7: Test Blueprint Summary

Number Sense, Numeration, and Numerical Operations
Competency Goal One: The learner will understand and compute with real numbers.
Goal One Summary: average number of items per form, 24; average number of items per class, 48; difficulty of pool (range), 0.378–0.665.

Spatial Sense, Measurement, and Geometry
Competency Goal Two: The learner will demonstrate an understanding and use of the properties and relationships in geometry and standard units of metric and customary measurement.
Goal Two Summary: average number of items per form, 14; average number of items per class, 28; difficulty of pool (range), 0.250–0.653.

Patterns, Relationships, and Functions
Competency Goal Three: The learner will demonstrate an understanding of patterns, relationships, and fundamental algebraic concepts.
Goal Three Summary: average number of items per form, 20; average number of items per class, 40; difficulty of pool (range), 0.412–0.583.

Data, Probability, and Statistics
Competency Goal Four: The learner will demonstrate an understanding and use of graphing, probability, and data analysis.
Goal Four Summary: average number of items per form, 22; average number of items per class, 44; difficulty of pool (range), 0.238–0.590.

Mathematics Grade 8: Test Blueprint Summary

Number Sense, Numeration, and Numerical Operations
Competency Goal One: The learner will understand and compute with real numbers.
Goal One Summary: average number of items per form, 32; average number of items per class, 74; difficulty of pool (range), 0.318–0.595.

Spatial Sense, Measurement, and Geometry
Competency Goal Two: The learner will demonstrate an understanding and use of the properties and relationships in geometry and standard units of metric and customary measurement.
Goal Two Summary: average number of items per form, 20; average number of items per class, 40; difficulty of pool (range), 0.309–0.571.

Patterns, Relationships, and Functions
Competency Goal Three: The learner will demonstrate an understanding of patterns, relationships, and fundamental algebraic concepts.
Goal Three Summary: average number of items per form, 11; average number of items per class, 22; difficulty of pool (range), 0.400–0.644.

Data, Probability, and Statistics
Competency Goal Four: The learner will demonstrate an understanding and use of graphing, probability, and data analysis.
Goal Four Summary: average number of items per form, 14; average number of items per class, 24; difficulty of pool (range), 0.334–0.572.

Algebra I: Test Blueprint Summary

Number Sense, Numeration, and Numerical Operations
Competency Goal One: The learner will perform operations with real numbers and polynomials to solve problems.
Goal One Summary: average number of items per form, 12; average number of items per class, 36; difficulty of pool (range), 0.567–0.620.

Spatial Sense, Measurement, and Geometry
Competency Goal Two: The learner will solve problems in a geometric context.
Goal Two Summary: average number of items per form, 4; average number of items per class, 12; difficulty of pool (range), 0.392–0.516.

Patterns, Relationships, and Functions
Competency Goal Three: The learner will graph and use relations and functions to solve problems.
Goal Three Summary: average number of items per form, 56; average number of items per class, 156; difficulty of pool (range), 0.423–0.581.

Data, Probability, and Statistics
Competency Goal Four: The learner will collect and interpret data to solve problems.
Goal Four Summary: average number of items per form, 8; average number of items per class, 24; difficulty of pool (range), 0.486–0.614.

Geometry: Test Blueprint Summary

Number Sense, Numeration, and Numerical Operations
Competency Goal One: The learner will perform operations with real numbers to solve problems in a geometric context.
Goal One Summary: average number of items per form, none; average number of items per class, none; difficulty of pool (range), ****.

Spatial Sense, Measurement, and Geometry
Competency Goal Two: The learner will use properties of geometric figures to solve problems and write proofs.
Goal Two Summary: average number of items per form, 61; average number of items per class, 189; difficulty of pool (range), 0.271–0.702.

Patterns, Relationships, and Functions
Competency Goal Three: The learner will graph and use relations and functions to solve problems.
Goal Three Summary: average number of items per form, 7; average number of items per class, 21; difficulty of pool (range), 0.392–0.453.

Data, Probability, and Statistics
Competency Goal Four: The learner will collect and interpret data to solve problems.
Goal Four Summary: average number of items per form, 3; average number of items per class, 9; difficulty of pool (range), 0.472.

Algebra II: Test Blueprint Summary

Number Sense, Numeration, and Numerical Operations
Competency Goal One: The learner will perform operations with real numbers and polynomials to solve problems.
Goal One Summary: average number of items per form, 8; average number of items per class, 24; difficulty of pool (range), 0.427–0.447.

Spatial Sense, Measurement, and Geometry
Competency Goal Two: The learner will solve problems in a geometric context.
Goal Two Summary: average number of items per form, 3; average number of items per class, 9; difficulty of pool (range), 0.477–0.537.

Patterns, Relationships, and Functions
Competency Goal Three: The learner will graph and use relations and functions to solve problems.
Goal Three Summary: average number of items per form, 40; average number of items per class, 120; difficulty of pool (range), 0.317–0.690.

Data, Probability, and Statistics
Competency Goal Four: The learner will collect and interpret data to solve problems.
Goal Four Summary: average number of items per form, 9; average number of items per class, 27; difficulty of pool (range), 0.357–0.566.

Appendix C: Math Developmental Scale Report with Excel Plots for First and Second Editions' Scale Scores

The Developmental Scale for the North Carolina End-of-Grade Mathematics Tests, Second Edition
David Thissen, Viji Sathy, Michael C. Edwards, & David Flora
L. L. Thurstone Psychometric Laboratory, The University of North Carolina at Chapel Hill

Following changes in the North Carolina curricular specifications for mathematics, a second edition of the North Carolina End-of-Grade tests in mathematics was designed, and an item tryout was administered using small sets of 12 items each, embedded in the operational End-of-Grade tests in the Spring of 2000. This report describes the use of data from that item tryout to construct a developmental scale for the second edition of the North Carolina End-of-Grade tests in mathematics.

The basis of a developmental scale is the specification of the means and standard deviations for scores on that scale for each grade level. In the case of the North Carolina End-of-Grade tests, the grade levels range from the grade 3 pretest (administered in the Fall to students in the 3rd grade) through grade 8. The data from which the scale-score means and standard deviations are derived make use of special test forms (called linking forms) that are administered to students in adjacent grades. The difference in performance among grades on these forms is used to estimate the difference in proficiency among grades. The second edition of the North Carolina EOG Tests of Mathematics used item response theory (IRT) to compute these estimates, following procedures described by Williams, Pommerich, and Thissen (1998). The population means and standard deviations derived from the Spring 2000 item calibration for the North Carolina EOG Mathematics tests are shown in Table 1.

Table 1. Population means and standard deviations derived from the Spring 2000 item calibration for the North Carolina EOG Tests of Mathematics, second edition

Grade        Population Mean   Population Standard Deviation
3 Pretest    234.35            9.66
3            248.27            9.86
4            252.90            10.65
5            255.99            12.78
6            259.95            11.75
7            263.36            12.46
8            267.09            12.83

The values for the developmental scale shown in Table 1 are based on IRT estimates of differences between adjacent-grade means and ratios of adjacent-grade standard deviations computed using the computer program MULTILOG (Thissen, 1991); the estimates from MULTILOG were cross-checked against parallel estimates computed using the software IRTLRDIF (Thissen, 2001). In the computation of estimates using either software system, the analysis of data from adjacent grades arbitrarily sets the mean and standard deviation of the population distribution of the lower grade to values of zero (0) and one (1), respectively; the values of the mean (µ) and standard deviation (σ) of the higher grade are estimated making use of the item response data and the three-parameter logistic IRT model (Thissen and Orlando, 2001).
Table 2 shows the average difference between adjacent-grade means (µ) in units of the standard deviation of the lower grade, and the ratios between adjacent-grade standard deviations (σ), derived from the Spring 2000 item calibration for the North Carolina EOG Tests of Mathematics. The values in Table 2 were converted into the final scale, shown in Table 1, by (arbitrarily) setting the average scale score in grade 4 to be 252.9, with a standard deviation of 10.65, and then computing the values for the other grades such that the differences between the means for adjacent grades, in units of the standard deviation of the lower grade, were the same as those shown in Table 2.

Table 2. Average difference between adjacent-grade means (µ) in units of the standard deviation of the lower grade and ratios between adjacent-grade standard deviations (σ), derived from the Spring 2000 item calibration for the North Carolina EOG Tests of Mathematics, second edition

Grades   Average µ Difference   Average σ Ratio   (Useful) Replications
3P–3     1.44                   1.02              11
3–4      0.47                   1.08              17
4–5      0.29                   1.20              14
5–6      0.31                   0.92              10
6–7      0.29                   1.06              13
7–8      0.30                   1.03              3

The estimates shown in Table 2 were derived from 3 to 17 replications of the between-grade difference; the numbers of replications for each grade pair are also shown in Table 2. Each replication was based on a (different) short embedded linking form from among the item tryout forms administered in the Spring of 2000. The sample size for each linking form varied from 398 to 4,313 students in each grade. (Most sample sizes were in the planned range of 1,300–1,500.) The original design of the embedded item calibration for the second edition called for 12 to 17 (12-item) linking forms between each pair of grades, with sample sizes around 1,500. However, some planned forms were not printed and distributed before the testing window began. As a result, some forms were administered to larger samples, and other forms (that were delivered late) were administered to smaller samples. In addition, the forms were not necessarily administered to the random samples that were planned within each grade. Corrections were made for these problems in the computation of the estimates shown in Table 2. The mean difference between grades 5 and 6 was corrected using an estimate of the regression, across replications, of the mean difference on the new scale against the mean difference on the old (operational) scale, after data analysis suggested that the matched samples in grades 5 and 6 were atypical in their performance. The mean difference between grades 7 and 8 and the standard deviation ratio for grade 5 relative to grade 4 were adjusted to smooth the relation between those values and the corresponding values for adjacent grades.
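For illustration, the following minimal sketch reproduces the chain computation described above: grade 4 is anchored at a mean of 252.9 and a standard deviation of 10.65, and the Table 2 mean differences (in lower-grade standard deviation units) and standard deviation ratios are applied upward and downward through the grade sequence. Because the Table 2 entries are rounded to two decimals, the results agree with Table 1 only to within about 0.01.

# A minimal sketch of the developmental-scale chain computation.
# Each entry: (lower grade, higher grade, mean difference in lower-grade SD units, SD ratio),
# taken from Table 2.
LINKS = [
    ("3P", "3", 1.44, 1.02),
    ("3",  "4", 0.47, 1.08),
    ("4",  "5", 0.29, 1.20),
    ("5",  "6", 0.31, 0.92),
    ("6",  "7", 0.29, 1.06),
    ("7",  "8", 0.30, 1.03),
]

def developmental_scale(anchor_grade="4", anchor_mean=252.9, anchor_sd=10.65):
    scale = {anchor_grade: (anchor_mean, anchor_sd)}
    # Walk upward from the anchor: higher mean = lower mean + diff * lower SD; higher SD = lower SD * ratio.
    for lower, higher, diff, ratio in LINKS:
        if lower in scale and higher not in scale:
            mean, sd = scale[lower]
            scale[higher] = (mean + diff * sd, sd * ratio)
    # Walk downward from the anchor by inverting the same two relationships.
    for lower, higher, diff, ratio in reversed(LINKS):
        if higher in scale and lower not in scale:
            mean, sd = scale[higher]
            lower_sd = sd / ratio
            scale[lower] = (mean - diff * lower_sd, lower_sd)
    return scale

for grade, (mean, sd) in developmental_scale().items():
    # Close to the Table 1 values; small discrepancies reflect the rounding of Table 2.
    print(f"Grade {grade}: mean {mean:.2f}, SD {sd:.2f}")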
Table 3 shows, for each adjacent-grade pair, the values of the average difference between adjacent-grade means (µ) in units of the standard deviation of the lower grade and the ratios of adjacent-grade standard deviations (σ) derived from the Spring 2000 item calibration for the North Carolina EOG Tests of Mathematics, for each replication that provided useful data. In Table 3 the values for each grade pair are listed in decreasing order of the estimate of the difference between the means. There is some variation among the estimates across replications, due to the fact that some of the estimates are based on small samples and many of the estimates are based on non-random samples. However, as aggregated in Table 2, a useful developmental scale was constructed.

Table 3: (Useful) replications of the average difference between adjacent-grade means (µ) in units of the standard deviation of the lower grade and ratios between adjacent-grade standard deviations (σ) derived from the Spring 2000 item calibration for the North Carolina EOG Tests of Mathematics, second edition

Grades 3P–3     Grades 3–4      Grades 4–5      Grades 5–6      Grades 6–7      Grades 7–8
Mean    SD      Mean    SD      Mean    SD      Mean    SD      Mean    SD      Mean    SD
1.84    1.09    0.76    1.25    0.63    1.06    0.68    0.72    0.50    1.10    0.42    0.92
1.77    1.04    0.59    1.09    0.57    0.96    0.54    0.75    0.39    1.00    0.36    1.26
1.73    1.18    0.57    1.09    0.51    1.05    0.53    0.82    0.37    1.09    0.27    0.92
1.60    0.97    0.56    1.02    0.51    1.15    0.51    0.83    0.36    0.95
1.53    0.98    0.55    1.04    0.51    1.07    0.38    1.15    0.35    1.02
1.50    0.97    0.55    1.33    0.41    1.03    0.37    1.02    0.33    1.12
1.42    1.12    0.54    1.05    0.40    0.87    0.36    0.90    0.28    1.01
1.35    1.10    0.49    1.04    0.21    1.51    0.33    1.05    0.28    1.18
1.28    1.02    0.47    0.94    0.19    1.63    0.25    1.09    0.22    1.01
0.97    0.93    0.42    1.22    0.12    1.53    0.21    0.83    0.21    1.08
0.89    0.83    0.41    0.99    0.10    1.74                    0.21    1.01
                0.40    1.10   -0.02    1.68                    0.14    1.05
                0.40    0.97   -0.05    1.71                    0.12    1.14
                0.37    0.90   -0.09    1.57
                0.36    1.28
                0.35    1.01
                0.28    1.04

Comparison with and Linkage to the First Edition Scale

The embedded nature of the Spring 2000 item calibration provided a basis for a preliminary linkage of the second edition developmental scale with that of the first edition. (The results of that preliminary linkage were subsequently superseded by results obtained from a special study with data collected in Spring 2001.) Table 4 shows a comparison of the population means and standard deviations for the second edition with the averages and standard deviations for the scale scores obtained from the operational administration of the first edition. For ease of comparison of the two scales, Figure 1 shows the two sets of averages plotted together, with 100 subtracted from the values of the new scale (which lie in the 200s) so that the two scales span approximately the same range. The developmental scales for the first and second editions of the mathematics test are somewhat dissimilar. The smaller rates of change observed in the calibration data for the second edition are likely due to incomplete implementation, in the 1999–2000 academic year, of the new curriculum upon which the second edition was based.

Table 4: Comparison of the population means and standard deviations for the second edition with the averages and standard deviations obtained from the operational administration of the first edition, in the Spring 2000 item calibration for the North Carolina EOG Tests of Mathematics

             First Edition                Second Edition
Grade        Mean     Std. Deviation      Mean      Std. Deviation
3 Pretest    131.6    7.8                 234.35    9.66
3            143.5    11.1                248.27    9.86
4            152.9    10.1                252.90    10.65
5            159.5    10.1                255.99    12.78
6            165.1    11.2                259.95    11.75
7            171.0    11.5                263.36    12.46
8            175.3    11.9                267.09    12.83

[The careful reader will note that, in Table 4, the second edition standard deviations are somewhat larger than those for the first edition. This is because the standard deviations for the second edition are values for the population distribution, while those for the first edition are standard deviations of the scale scores themselves; the latter must be somewhat smaller than the former for IRT scale scores.]

Figure 1: Comparison of the growth curves for the first and second editions of the North Carolina EOG Tests of Mathematics in the Spring 2000 item calibration (x-axis: grade; y-axis: new math scale scores, and old scale scores + 100; vertical lines indicate 1, 2, and 3 standard deviations on the second edition)
Scoring tables for the second edition forms that were to be administered in the Spring of 2001 were constructed after those forms were assembled in February–March 2001. Using the item parameters from the embedded item calibration and the population means and standard deviations in Table 1, we constructed scoring tables using the procedures described by Thissen, Pommerich, Billeaud, and Williams (1995) and Thissen and Orlando (2001). These procedures yield tables that translate summed scores into corresponding IRT scale scores on the developmental scale.

A side effect of the construction of those scoring tables was that the algorithm provides IRT model-based estimates of the proportions of the (Spring 2000) item calibration samples that would have obtained each summed score (and hence each scale score) had they been administered the forms assembled for Spring 2001. Those score proportions were matched with the observed score distributions on the first-edition forms that were included in the item tryout, yielding equipercentile equating tables (Angoff, 1982) that match scores on the second edition with the scores at the same percentiles on the first edition. This equipercentile matching also provided part of the basis for a preliminary translation of the cut scores between achievement levels from the first edition to the second edition. Additional information was also used to select the preliminary cut scores, in the form of the consistency in the patterns of the matched cut scores between EOG Levels I, II, III, and IV across grades.

At the time of the preliminary linkage between the first and second edition score scales, it was known that the linkage was based largely on statistical models and hypothetical computations. We computed values based on IRT model-based estimates of the score distributions that would have been obtained if the new second-edition forms had been administered operationally in the Spring of 2000 (which they were not), and treated those values as though they reflected performance that would have occurred had the new curriculum been completely implemented in 1999–2000 (which subsequent evidence indicated was unlikely). As a result, those preliminary estimates were in place only because the testing schedule and associated decision-making required some (albeit preliminary) cut scores prior to the inaugural administration of the second edition tests in the Spring of 2001.

The Equating Study

Because of the uncertainty surrounding the preliminary linkage between the scales for the first and second editions of the North Carolina End-of-Grade Mathematics tests, a special study, commonly known as the equating study, was performed in the Spring of 2001. In this study, the newly constructed second-edition forms of the mathematics tests and selected forms from the first edition were administered to spiraled samples in the context of the item tryout of new items to create additional second-edition forms. The purpose of this study was to provide data for linkage of the scales on the first and second editions using the newly constructed operational forms of the second edition of the test, which were not available until early Spring 2001.
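For illustration, the following minimal sketch shows the equipercentile matching referenced above (Angoff, 1982): a score on the second-edition scale is mapped to the first-edition score with the same percentile rank. The two simulated score samples are shaped like the grade 4 figures reported in Tables 4 and 5; they are stand-ins, not the equating-study data.

# A minimal sketch of equipercentile linking: map a score on the new scale to the
# old-scale score that sits at the same percentile rank. Simulated data only.
import numpy as np

def equipercentile_link(new_scores: np.ndarray, old_scores: np.ndarray, score: float) -> float:
    """Map `score` on the new scale to the old-scale score at the same percentile."""
    percentile = (new_scores <= score).mean() * 100.0
    return float(np.percentile(old_scores, percentile))

rng = np.random.default_rng(3)
new_form = rng.normal(255.8, 8.3, size=100_000)   # shaped like the grade 4 Spring 2001 statewide results
old_form = rng.normal(152.9, 10.1, size=100_000)  # shaped like the grade 4 first-edition scale
for s in (248, 256, 264):
    print(s, "->", round(equipercentile_link(new_form, old_form, s), 1))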
Figure 2 shows the equipercentile equating functions for grades 3–8 obtained using data from the equating study. [Strictly speaking, this is not equating, because the first and second editions of the test measure different things, that is, achievement on different curricula. It is more technically referred to as statistical moderation (Linn, 1993; Mislevy, 1992). However, the statistical procedures of equipercentile equating are used, so it is commonly referred to as equating.] The equating functions in Figure 2 are not coincident because they cannot be, given that the developmental scales follow different trajectories across grades (as shown in Figure 1).

Figure 2: Equipercentile equating functions between the first and second edition NC End-of-Grade Mathematics scales derived from the Spring 2001 "equating study" for grades 3–8 (x-axis: second edition scale, 200–320; y-axis: first edition scale, 100–220; one function per grade)

Nevertheless, within grades these curves yield translation tables that can be used to convert scores on the second edition of the test into equated scores on the first edition. Such converted scores may be used in the computation of year-to-year change for the ABCs accountability system for the transitional year, when the scores for the previous year are on the first-edition scale and the scores for the current year are on the second-edition scale. In addition, because these equipercentile relations translate between scores on the first-edition scale and the second-edition scale, they are used to translate the cut scores (between Achievement Levels I, II, III, and IV) from the old scale to the new scale.

The Effects of Significant Curricular Change

In the Spring of 2001, scores on the inaugural administration of the second edition of the North Carolina EOG Tests of Mathematics were substantially higher than had been expected given student performance on the test items in the Spring 2000 item tryout and calibration. Table 5 shows the average scores for each grade; across grades, statewide performance on the test was 2–4 scale-score points higher in Spring 2001 than was expected from the item tryout the previous year. This was an unprecedented level of change, given that annual increases of test scores on this scale had almost always been less than one point throughout the 1990s.

Table 5: Comparison of the population means and standard deviations for the second edition derived from the Spring 2000 item tryout with the averages and standard deviations obtained from the operational administration of the second edition in Spring 2001

             Item Tryout, 2000            Statewide Data, Spring 2001
Grade        Mean      Std. Deviation     Mean     Std. Deviation
3 Pretest    234.35    9.66               236.1    8.1
3            248.27    9.86               250.6    7.7
4            252.90    10.65              255.8    8.3
5            255.99    12.78              260.0    9.6
6            259.95    11.75              263.2    9.9
7            263.36    12.46              267.1    10.6
8            267.09    12.83              270.0    11.0

This result, along with a good deal of unpublished statistical and anecdotal evidence, suggested that the performance of students on the Spring 2000 item tryout was limited by incomplete implementation of the new mathematics curriculum in the 1999–2000 academic year. In the 2000–2001 academic year, when it was known that the second edition scores would be accepted, instruction in the new curriculum may have been much more thorough, so scores were higher. A consequence of these facts was that the preliminary cut scores for the Achievement Levels, which had been set using only data from the Spring 2000 item tryout, were substantially lower than the final cut scores set using data from the equating study in Spring 2001.
Because this was not known during the period of testing, and because scores were reported locally before the data from the equating study could be analyzed, there were misleading reports of very high percentages of students passing the test with scores in Achievement Levels III or IV. Those results were corrected after the data from the equating study were used to re-compute the cut scores, before the test score information was used for the ABCs accountability system.

This kind of experience is likely to follow any drastic change in the curriculum in any subject-matter area. If a test must be calibrated with data obtained before students have actually experienced the new curriculum, the test items will appear at that time to be more difficult than they will be when used operationally after the new curriculum has been implemented; the size of that change, however, cannot be known until after the first operational administration of the new test. Current requirements that scores be reported and used immediately after each administration of the End-of-Grade tests, including the first administration of a new edition, may therefore lead to unexpected results, as was the case for mathematics scores in Spring 2001.

The tables and plots that follow show, for each grade, the second-edition (new form) and first-edition (old form) scale scores at selected percentiles from the Spring 2001 equating study.

Grade 3
Percentile   Second-Edition Scale   First-Edition Scale
5th          238                    123
10th         241                    127
15th         243                    131
20th         244                    134
25th         246                    136
30th         247                    138
35th         247                    139
40th         248                    141
45th         249                    142
50th         250                    144
55th         252                    146
60th         253                    147
65th         253                    148
70th         255                    150
75th         255                    151
80th         257                    154
85th         259                    155
90th         261                    157
95th         265                    161

[Plot: Grade 3 equating-study scores, Form O (old form, first edition) versus Form A (new form, second edition).]

Grade 4
Percentile   Second-Edition Scale   First-Edition Scale
5th          242                    135
10th         245                    139
15th         247                    142
20th         248                    143
25th         249                    146
30th         251                    147
35th         252                    149
40th         253                    150
45th         254                    152
50th         256                    153
55th         257                    155
60th         258                    156
65th         259                    157
70th         260                    159
75th         261                    160
80th         264                    162
85th         266                    164
90th         268                    166
95th         271                    168

[Plot: Grade 4 equating-study scores, Form O (old form, first edition) versus Form A (new form, second edition).]

Grade 5
Percentile   Second-Edition Scale   First-Edition Scale
5th          244                    142
10th         248                    146
15th         249                    149
20th         251                    152
25th         252                    153
30th         254                    154
35th         255                    156
40th         256                    157
45th         257                    158
50th         259                    159
55th         260                    161
60th         261                    162
65th         263                    163
70th         264                    164
75th         266                    165
80th         268                    167
85th         270                    169
90th         273                    171
95th         277                    174

[Plot: Grade 5 equating-study scores, Form P (old form, first edition) versus Form A (new form, second edition).]

Grade 6
Percentile   Second-Edition Scale   First-Edition Scale
5th          247                    146
10th         249                    149
15th         252                    152
20th         253                    153.5
25th         254                    156
30th         256                    157
35th         257                    159
40th         258                    160
45th         260                    163
50th         261                    164
55th         262                    166
60th         263                    167
65th         265                    169
70th         267                    171
75th         268                    172
80th         270                    174
85th         272                    176
90th         275                    178
95th         278                    183

[Plot: Grade 6 equating-study scores, Form O (old form, first edition) versus Form A (new form, second edition).]

Grade 7
Percentile   Second-Edition Scale   First-Edition Scale
5th          250                    154
10th         253                    158
15th         255                    161
20th         257                    163
25th         258                    165
30th         260                    166
35th         261                    168
40th         263                    169
45th         264                    170
50th         265                    171
55th         266                    173
60th         268                    174
65th         270                    176
70th         272                    177
75th         273                    179
80th         275                    180
85th         277                    182
90th         282                    185
95th         286                    188

[Plot: Grade 7 equating-study scores, Form C (old form, first edition) versus Form A (new form, second edition).]
Grade 8
Percentile   Second-Edition Scale   First-Edition Scale
5th          252                    153
10th         254                    155
15th         256                    157
20th         258                    161
25th         259                    163
30th         261                    165
35th         263                    167
40th         264                    168
45th         265                    170
50th         266                    172
55th         269                    174
60th         271                    176
65th         272                    178
70th         274                    179
75th         276                    181
80th         278                    183
85th         281                    186
90th         284                    189
95th         289                    193

[Plot: Grade 8 equating-study scores, Form P (old form, first edition) versus Form A (new form, second edition).]

References

Angoff, W. H. (1982). Summary and derivation of equating methods used at ETS. In P. W. Holland & D. B. Rubin (Eds.), Test equating (pp. 55–69). New York: Academic Press.

Linn, R. L. (1993). Linking results of distinct assessments. Applied Measurement in Education, 6, 83–102.

Mislevy, R. J. (1992). Linking educational assessments: Concepts, issues, methods, and prospects. Princeton, NJ: Educational Testing Service.

Thissen, D. (1991). MULTILOG user's guide, Version 6. Chicago, IL: Scientific Software, Inc.

Thissen, D. (2001). IRTLRDIF v.2.0b: Software for the computation of the statistics involved in item response theory likelihood-ratio tests for differential item functioning. Unpublished manuscript.

Thissen, D., & Orlando, M. (2001). Item response theory for items scored in two categories. In D. Thissen & H. Wainer (Eds.), Test scoring (pp. 73–140). Mahwah, NJ: Lawrence Erlbaum Associates.

Thissen, D., Pommerich, M., Billeaud, K., & Williams, V. S. L. (1995). Item response theory for scores on tests including polytomous items with ordered responses. Applied Psychological Measurement, 19, 39–49.

Williams, V. S. L., Pommerich, M., & Thissen, D. (1998). A comparison of developmental scales based on Thurstone methods and item response theory. Journal of Educational Measurement, 35, 93–107.

Appendix D: Sample Items

[The sample items in this appendix appear as images in the original report.]

Appendix E: Sample Frequency Distribution Tables for Math Scale Scores (selected grades and subjects)

Grade 3 EOG (2001)
Math Scale Score   Frequency Count   Percent   Cumulative Frequency   Cumulative Percent
218                4                 0         4                      0
219                1                 0         5                      0
221                4                 0         9                      0.01
222                1                 0         10                     0.01
223                3                 0         13                     0.01
224                1                 0         14                     0.01
225                6                 0.01      20                     0.02
226                7                 0.01      27                     0.03
227                7                 0.01      34                     0.03
228                19                0.02      53                     0.05
229                63                0.06      116                    0.11
230                61                0.06      177                    0.17
231                102               0.1       279                    0.27
232                277               0.27      556                    0.54
233                239               0.23      795                    0.78
234                602               0.59      1397                   1.37
235                818               0.8       2215                   2.17
236                507               0.5       2722                   2.66
237                1590              1.56      4312                   4.22
238                1772              1.73      6084                   5.95
239                1688              1.65      7772                   7.61
240                1988              1.95      9760                   9.55
241                2966              2.9       12726                  12.46
242                3080              3.01      15806                  15.47
243                2845              2.78      18651                  18.25
244                4174              4.09      22825                  22.34
245                4177              4.09      27002                  26.43
246                4019              3.93      31021                  30.36
247                5833              5.71      36854                  36.07
248                4696              4.6       41550                  40.67
249                5818              5.69      47368                  46.36
250                5282              5.17      52650                  51.53
251                3734              3.65      56384                  55.19
252                4709              4.61      61093                  59.79
253                5469              5.35      66562                  65.15
254                4658              4.56      71220                  69.71
255                3771              3.69      74991                  73.4
256                3668              3.59      78659                  76.99
257                3762              3.68      82421                  80.67
258                3609              3.53      86030                  84.2
259                3332              3.26      89362                  87.46
260                2424              2.37      91786                  89.83
261                2216              2.17      94002                  92
262                2034              1.99      96036                  93.99
263                528               0.52      96564                  94.51
264                1241              1.21      97805                  95.73
265                1550              1.52      99355                  97.24
266                382               0.37      99737                  97.62
267                756               0.74      100493                 98.36
268                621               0.61      101114                 98.96
269                206               0.2       101320                 99.17
271                514               0.5       101834                 99.67
273                210               0.21      102044                 99.87
274                80                0.08      102124                 99.95
275                41                0.04      102165                 99.99
276                7                 0.01      102172                 100

Grade 5 EOG (2001) Math Scale Score Frequency Count 221 1 229 2 230 1 231 7 232 16 233 23 234 27 235 26 236 80 237 147 238 181 239 235 240 340 241 483 242 626 243 738 244 826 245 1727 246 1598 247 1790 248 2431 249 2076 250 3354 251 3094 252 3194 253 3252 254 4100 255 4206 256 4424 257 4478 258 3732 259 2852 260 4447 261 4402 262 3569
263 2930 264 4254 265 2110 266 4106 267 1947 268 3880 269 1821 270 1864 271 1739 272 1725 273 2135 Percent 0 0 0 0.01 0.02 0.02 0.03 0.03 0.08 0.15 0.18 0.23 0.34 0.48 0.62 0.74 0.82 1.72 1.59 1.79 2.42 2.07 3.35 3.09 3.19 3.24 4.09 4.2 4.41 4.47 3.72 2.84 4.44 4.39 3.56 2.92 4.24 2.1 4.1 1.94 3.87 1.82 1.86 1.73 1.72 2.13 153 Cumulative Frequency 1 3 4 11 27 50 77 103 183 330 511 746 1086 1569 2195 2933 3759 5486 7084 8874 11305 13381 16735 19829 23023 26275 30375 34581 39005 43483 47215 50067 54514 58916 62485 65415 69669 71779 75885 77832 81712 83533 85397 87136 88861 90996 Cumulative Percent 0 0 0 0.01 0.03 0.05 0.08 0.1 0.18 0.33 0.51 0.74 1.08 1.57 2.19 2.93 3.75 5.47 7.07 8.85 11.28 13.35 16.69 19.78 22.97 26.21 30.3 34.49 38.91 43.37 47.1 49.94 54.38 58.77 62.33 65.25 69.49 71.6 75.69 77.64 81.51 83.32 85.18 86.92 88.64 90.77 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 290 291 293 295 1519 989 1384 1240 718 718 583 275 492 386 202 126 227 100 147 50 65 13 22 1.52 0.99 1.38 1.24 0.72 0.72 0.58 0.27 0.49 0.39 0.2 0.13 0.23 0.1 0.15 0.05 0.06 0.01 0.02 92515 93504 94888 96128 96846 97564 98147 98422 98914 99300 99502 99628 99855 99955 100102 100152 100217 100230 100252 154 92.28 93.27 94.65 95.89 96.6 97.32 97.9 98.17 98.67 99.05 99.25 99.38 99.6 99.7 99.85 99.9 99.97 99.98 100 Algebra I EOC (2001) Scale Score Frequency Count Percent Cumulative Frequency Cumulative Percent 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 13 3 7 6 10 12 69 72 155 211 346 522 670 892 1089 1241 1490 1621 1671 1872 2510 2935 2766 2223 4456 2373 4678 3131 3881 4609 3847 2959 4510 4316 2701 3174 3635 2397 2739 2973 1886 2065 1524 1361 0.01 0 0.01 0.01 0.01 0.01 0.07 0.08 0.17 0.23 0.37 0.56 0.72 0.96 1.17 1.33 1.6 1.74 1.79 2.01 2.7 3.15 2.97 2.39 4.79 2.55 5.02 3.36 4.17 4.95 4.13 3.18 4.84 4.64 2.9 3.41 3.9 2.57 2.94 3.19 2.03 2.22 1.64 1.46 13 16 23 29 39 51 120 192 347 558 904 1426 2096 2988 4077 5318 6808 8429 10100 11972 14482 17417 20183 22406 26862 29235 33913 37044 40925 45534 49381 52340 56850 61166 63867 67041 70676 73073 75812 78785 80671 82736 84260 85621 0.01 0.02 0.02 0.03 0.04 0.05 0.13 0.21 0.37 0.6 0.97 1.53 2.25 3.21 4.38 5.71 7.31 9.05 10.85 12.86 15.55 18.7 21.68 24.06 28.85 31.4 36.42 39.78 43.95 48.9 53.03 56.21 61.05 65.69 68.59 72 75.9 78.48 81.42 84.61 86.63 88.85 90.49 91.95 155 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 1287 874 1324 718 610 514 488 406 253 218 192 203 50 108 84 36 45 22 19 23 10 11 1.38 0.94 1.42 0.77 0.66 0.55 0.52 0.44 0.27 0.23 0.21 0.22 0.05 0.12 0.09 0.04 0.05 0.02 0.02 0.02 0.01 0.01 86908 87782 89106 89824 90434 90948 91436 91842 92095 92313 92505 92708 92758 92866 92950 92986 93031 93053 93072 93095 93105 93116 156 93.33 94.27 95.69 96.46 97.12 97.67 98.2 98.63 98.9 99.14 99.34 99.56 99.62 99.73 99.82 99.86 99.91 99.93 99.95 99.98 99.99 100 Geometry EOC (2001) Scale Score 32 33 34 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 Frequency Count 4 2 2 14 15 103 127 185 321 404 527 617 763 1262 1056 1601 1745 1447 1446 2082 2643 2213 2767 2289 2847 2918 3013 2326 3588 2346 2869 2756 2056 2844 1354 2159 1586 1493 1597 873 1033 721 628 539 509 460 Percent 0.01 0 0 0.02 0.02 0.16 0.19 0.28 0.49 0.62 0.8 0.94 1.16 1.93 1.61 2.44 2.66 2.21 2.21 3.18 4.03 3.38 4.22 3.49 4.35 4.45 4.6 3.55 5.48 3.58 4.38 4.21 3.14 4.34 2.07 3.3 2.42 2.28 2.44 1.33 1.58 1.1 0.96 
0.82 0.78 0.7 157 Cumulative Frequency 4 6 8 22 37 140 267 452 773 1177 1704 2321 3084 4346 5402 7003 8748 10195 11641 13723 16366 18579 21346 23635 26482 29400 32413 34739 38327 40673 43542 46298 48354 51198 52552 54711 56297 57790 59387 60260 61293 62014 62642 63181 63690 64150 Cumulative Percent 0.01 0.01 0.01 0.03 0.06 0.21 0.41 0.69 1.18 1.8 2.6 3.54 4.71 6.63 8.25 10.69 13.35 15.56 17.77 20.95 24.98 28.36 32.58 36.08 40.42 44.88 49.47 53.02 58.5 62.08 66.46 70.67 73.81 78.15 80.21 83.51 85.93 88.21 90.65 91.98 93.56 94.66 95.61 96.44 97.21 97.92 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 362 228 170 140 149 52 97 62 35 15 38 4 7 4 2 0.55 0.35 0.26 0.21 0.23 0.08 0.15 0.09 0.05 0.02 0.06 0.01 0.01 0.01 0 64512 64740 64910 65050 65199 65251 65348 65410 65445 65460 65498 65502 65509 65513 65515 158 98.47 98.82 99.08 99.29 99.52 99.6 99.75 99.84 99.89 99.92 99.97 99.98 99.99 100 100 Appendix F: Testing Code of Ethics Testing Code of Ethics (16 NCAC 6D .0306) Testing Code of Ethics Introduction In North Carolina, standardized testing is an integral part of the educational experience of all students. When properly administered and interpreted, test results provide an independent, uniform source of reliable and valid information, which enables: • students to know the extent to which they have mastered expected knowledge and skills and how they compare to others; • parents to know if their children are acquiring the knowledge and skills needed to succeed in a highly competitive job market; • teachers to know if their students have mastered grade-level knowledge and skills in the curriculum and, if not, what weaknesses need to be addressed; • community leaders and lawmakers to know if students in North Carolina schools are improving their performance over time and how the students compare with students from other states or the nation; and • citizens to assess the performance of the public schools. Testing should be conducted in a fair and ethical manner, which includes: Security • assuring adequate security of the testing materials before, during, and after testing and during scoring • assuring student confidentiality Preparation • teaching the tested curriculum and test-preparation skills • training staff in appropriate testing practices and procedures • providing an appropriate atmosphere Administration • developing a local policy for the implementation of fair and ethical testing practices and for resolving questions concerning those practices • assuring that all students who should be tested are tested • utilizing tests which are developmentally appropriate • utilizing tests only for the purposes for which they were designed Scoring, Analysis and Reporting • interpreting test results to the appropriate audience • providing adequate data analyses to guide curriculum implementation and improvement Because standardized tests provide only one valuable piece of information, such information should be used in conjunction with all other available information known about a student to assist 159 in improving student learning. The administration of tests required by applicable statutes and the use of student data for personnel/program decisions shall comply with the Testing Code of Ethics (16 NCAC 6D .0306), which is printed on the next three pages. Testing Code of Ethics Testing Code of Ethics (16 NCAC 6D .0306) .0306 TESTING CODE OF ETHICS (a) This Rule shall apply to all public school employees who are involved in the state testing program. 
(b) The superintendent or superintendent's designee shall develop local policies and procedures to ensure maximum test security in coordination with the policies and procedures developed by the test publisher. The principal shall ensure test security within the school building.
  (1) The principal shall store test materials in a secure, locked area. The principal shall allow test materials to be distributed immediately prior to the test administration. Before each test administration, the building level test coordinator shall accurately count and distribute test materials. Immediately after each test administration, the building level test coordinator shall collect, count, and return all test materials to the secure, locked storage area.
  (2) "Access" to test materials by school personnel means handling the materials but does not include reviewing tests or analyzing test items. The superintendent or superintendent's designee shall designate the personnel who are authorized to have access to test materials.
  (3) Persons who have access to secure test materials shall not use those materials for personal gain.
  (4) No person may copy, reproduce, or paraphrase in any manner or for any reason the test materials without the express written consent of the test publisher.
  (5) The superintendent or superintendent's designee shall instruct personnel who are responsible for the testing program in testing administration procedures. This instruction shall include test administrations that require procedural modifications and shall emphasize the need to follow the directions outlined by the test publisher.
  (6) Any person who learns of any breach of security, loss of materials, failure to account for materials, or any other deviation from required security procedures shall immediately report that information to the principal, building level test coordinator, school system test coordinator, and state level test coordinator.
(c) Preparation for testing.
  (1) The superintendent shall ensure that school system test coordinators:
    (A) secure necessary materials;
    (B) plan and implement training for building level test coordinators, test administrators, and proctors;
    (C) ensure that each building level test coordinator and test administrator is trained in the implementation of procedural modifications used during test administrations; and
    (D) in conjunction with program administrators, ensure that the need for test modifications is documented and that modifications are limited to the specific need.
  (2) The principal shall ensure that the building level test coordinators:
    (A) maintain test security and accountability of test materials;
    (B) identify and train personnel, proctors, and backup personnel for test administrations; and
    (C) encourage a positive atmosphere for testing.
  (3) Test administrators shall be school personnel who have professional training in education and the state testing program.
  (4) Teachers shall provide instruction that meets or exceeds the standard course of study to meet the needs of the specific students in the class. Teachers may help students improve test-taking skills by:
    (A) helping students become familiar with test formats using curricular content;
    (B) teaching students test-taking strategies and providing practice sessions;
    (C) helping students learn ways of preparing to take tests; and
    (D) using resource materials such as test questions from test item banks, testlets, and linking documents in instruction and test preparation.
(d) Test administration.
  (1) The superintendent or superintendent's designee shall:
    (A) assure that each school establishes procedures to ensure that all test administrators comply with test publisher guidelines;
    (B) inform the local board of education of any breach of this code of ethics; and
    (C) inform building level administrators of their responsibilities.
  (2) The principal shall:
    (A) assure that school personnel know the content of state and local testing policies;
    (B) implement the school system's testing policies and procedures and establish any needed school policies and procedures to assure that all eligible students are tested fairly;
    (C) assign trained proctors to test administrations; and
    (D) report all testing irregularities to the school system test coordinator.
  (3) Test administrators shall:
    (A) administer tests according to the directions in the administration manual and any subsequent updates developed by the test publisher;
    (B) administer tests to all eligible students;
    (C) report all testing irregularities to the school system test coordinator; and
    (D) provide a positive test-taking climate.
  (4) Proctors shall serve as additional monitors to help the test administrator assure that testing occurs fairly.
(e) Scoring. The school system test coordinator shall:
  (1) ensure that each test is scored according to the procedures and guidelines defined for the test by the test publisher;
  (2) maintain quality control during the entire scoring process, which consists of handling and editing documents, scanning answer documents, and producing electronic files and reports. Quality control shall address at a minimum accuracy and scoring consistency.
  (3) maintain security of tests and data files at all times, including:
    (A) protecting the confidentiality of students at all times when publicizing test results; and
    (B) maintaining test security of answer keys and item-specific scoring rubrics.
(f) Analysis and reporting. Educators shall use test scores appropriately. This means that the educator recognizes that a test score is only one piece of information and must be interpreted together with other scores and indicators. Test data help educators understand educational patterns and practices. The superintendent shall ensure that school personnel analyze and report test data ethically and within the limitations described in this paragraph.
  (1) Educators shall release test scores to students, parents, legal guardians, teachers, and the media with interpretive materials as needed.
  (2) Staff development relating to testing must enable personnel to respond knowledgeably to questions related to testing, including the tests, scores, scoring procedures, and other interpretive materials.
  (3) Items and associated materials on a secure test shall not be in the public domain. Only items that are within the public domain may be used for item analysis.
  (4) Educators shall maintain the confidentiality of individual students. Publicizing test scores that contain the names of individual students is unethical.
  (5) Data analysis of test scores for decision-making purposes shall be based upon:
    (A) disaggregation of data based upon student demographics and other collected variables;
    (B) examination of grading practices in relation to test scores; and
    (C) examination of growth trends and goal summary reports for state-mandated tests.
(g) Unethical testing practices include, but are not limited to, the following practices:
  (1) encouraging students to be absent the day of testing;
  (2) encouraging students not to do their best because of the purposes of the test;
  (3) using secure test items or modified secure test items for instruction;
  (4) changing student responses at any time;
  (5) interpreting, explaining, or paraphrasing the test directions or the test items;
  (6) reclassifying students solely for the purpose of avoiding state testing;
  (7) not testing all eligible students;
  (8) failing to provide needed modifications during testing, if available;
  (9) modifying scoring programs including answer keys, equating files, and lookup tables;
  (10) modifying student records solely for the purpose of raising test scores;
  (11) using a single test score to make individual decisions; and
  (12) misleading the public concerning the results and interpretations of test data.
(h) In the event of a violation of this Rule, the SBE may, in accordance with the contested case provisions of Chapter 150B of the General Statutes, impose any one or more of the following sanctions:
  (1) withhold ABCs incentive awards from individuals or from all eligible staff in a school;
  (2) file a civil action against the person or persons responsible for the violation for copyright infringement or for any other available cause of action;
  (3) seek criminal prosecution of the person or persons responsible for the violation; and
  (4) in accordance with the provisions of 16 NCAC 6C .0312, suspend or revoke the professional license of the person or persons responsible for the violation.

History Note: Authority G.S. 115C-12(9)c.; 115C-81(b)(4); Eff. November 1, 1997; Amended Eff. August 1, 2000.