The North Carolina Mathematics Tests
Technical Report
Grade 3 Pretest
End-of-Grade Tests (Grades 3–8)
High School Comprehensive Test
Algebra I End-of-Course Test
Geometry End-of-Course Test
Algebra II End-of-Course Test
March 2006
Prepared by:
Mildred Bazemore, Section Chief, Test Development
Pam Van Dyk, Ph.D.
Laura Kramer, Ph.D.
North Carolina Department of Public Instruction
Amber Yelton
Robert Brown, Ph.D.
Psychometric staff, Technical Outreach for Public Schools (TOPS)
September 2004
In compliance with federal law, including the provisions of Title IX of the Education
Amendments of 1972, the North Carolina Public Schools administers all state-operated
educational programs, employment activities and admissions without discrimination because
of race, national or ethnic origin, color, age, military service, disability, or gender, except
where exemption is appropriate and allowed by law.
Inquiries or complaints should be directed to:
The Office of Curriculum and School Reform Services
6307 Mail Service Center
Raleigh, NC 27699-6307
919-807-3761 (phone); 919-807-3767 (fax)
Table of Contents
Chapter One: Introduction ……………………………………………………….. 12
1.1 Local Participation ……………………………………………………………… 12
1.2 The North Carolina Testing Program …………………………………………… 13
1.3 The North Carolina Mathematics Tests………………………………………….. 14
Chapter Two: Test Development Process ………………………………………. 16
2.1 Test Development Process for the North Carolina Testing Program …………….. 16
2.2 The Curriculum Connection …………………………………………………….. 18
2.3 Test Specifications ………………………………………………………………. 19
2.4 Item Development ……………………………………………………….……… 19
2.5 Item Format and Use of Manipulatives ………………………………………….. 20
2.6 Selection and Training of Item Writers ………………………………………….. 20
2.7 Reviewing Items for Field Testing ……………………………………………... 21
2.8 Assembling Field Test Forms …………………………………………………… 22
2.9 Sampling Procedures ……………………………………………………………. 23
2.10 Field Test Sample Characteristics ……………………………………………… 24
2.11 Item Analysis …………….……………………………………………….…… 25
2.12 Classical Measurement Analysis ………………………………………………. 25
2.13 Item Response Theory (IRT) Analysis ……………………………………….. 25
2.14 Differential Item Functioning Analysis ………………………………………. 27
2.15 Expert Review ………………………………………………………………… 28
2.16 Criteria for Inclusion in Item Pool ……………………………………………… 29
2.17 Item Pool Parameter Estimates ………………………………………………… 29
2.18 Operational Test Construction ………………………………………………… 29
2.19 Setting the Target p-value for Operational Tests ………………………………. 30
2.20 Comparison of Item Pool p-Values with Operational p-Values ……………….. 30
2.21 Review of Assembled Operational Tests ……………………………………… 31
2.22 Setting the Test Administration Time …………………………………………. 32
Chapter Three: Test Administration …………………………………………….. 33
3.1 Test Administration …………………………………………………………….. 33
3.2 Training for Test Administrators ……………………………………………….. 34
3.3 Preparation for Test Administration ……………………………………………. 34
3.4 Test Security and Handling Materials ………………………………………….. 34
3.5 Student Participation …………………………………………………………… 35
3.6 Alternate Assessments ………………………………………………………….. 36
3.7 Testing Accommodations ………………………………………………………. 36
3.8 Students with Limited English Proficiency ……………………………………… 36
3.9 Medical Exclusions …………………………………………………………….. 38
3.10 Reporting Student Scores ……………………………………………………… 38
3.11 Confidentiality of Student Test Scores ……….………………………………. 38
Chapter Four: Scaling and Standard-Setting for the North Carolina EOG and EOC
Tests of Mathematics …………………………………………………….. 40
4.1 Conversion of Raw Test Scores …………………………………………………. 40
4.2 Constructing a Developmental Scale ……………………………………………. 40
4.3 Comparison with and Linkage to the First Edition Scale ………………………… 44
4.4 Equating the Scales for the First and Second Editions of the North Carolina EOG Tests of
Mathematics ………………………………………………..……………… 46
4.5 Setting the Standards……………………………………………………………. 47
4.6 Score Reporting for the North Carolina Tests …………………………………. 48
4.7 Achievement Level Descriptors ………………………………………………… 49
4.8 Achievement Level Cut Scores …………………………………………………. 49
4.9 Achievement Level Trends ……………………………………………………… 50
4.10 Percentile Ranking …………………………………………………………….. 52
Chapter Five: Reports ……………………………………………………………. 53
5.1 Use of Test Score Reports Provided by the North Carolina Testing Program …… 53
5.2 Reporting by Student …………………………………………………………… 53
5.3 Reporting by School ……………………………………………………………. 53
5.4 Reporting by the State …………………………………………………………. 54
Chapter Six: Descriptive Statistics and Reliability ………………………………. 55
6.1 Descriptive Statistics for the First Operational Administration of the Tests …….. 55
6.2 Means and Standard Deviations for the First Operational Administration ……… 55
6.3 Population Demographics for the First Operational Administration …………… 56
6.4 Scale Score Frequency Distributions …………………………………………… 56
6.5 Reliability of the North Carolina Tests ………………………………………… 62
6.6 Internal Consistency of the North Carolina Math Tests ………………………… 62
6.7 Standard Error of Measurement for the North Carolina Math Tests ……………. 64
6.8 Equivalency of Test Forms……………………………………………………… 75
Chapter Seven: Evidence of Validity ……………………………………………… 87
7.1 Evidence of Validity ………………………………………………………......... 87
7.2 Content Validity ……………………………………………………………….. 87
7.3 Criterion-Related Validity ……………………………………………………… 88
Chapter Eight: Quality Control ……………………………………………………. 94
8.1 Quality Control Prior to Test Administration …………………………………… 94
8.2 Quality Control in Data Preparation and Test Administration …………………… 94
8.3 Quality Control in Data Input …………………………………………………… 95
8.4 Quality Control of Test Scores ………………………………………………….. 95
8.5 Quality Control in Reporting …………………………………………………… 95
Glossary of Key Terms …………………………………………………………… 96
References …………………………………………………………………………. 99
Appendix A: Item Development Guidelines ……………………………………… 102
Appendix B: Test Specification Summaries……………………………………..... 104
Appendix C: Math Developmental Scale Report with Excel Plots for First and Second
Editions’ Scale Scores …………………………………….……………… 114
Appendix D: Sample Items ………………………………………………………... 131
Appendix E: Sample Frequency Distribution Tables for Math Scale Scores .…. 151
Appendix F: Testing Code of Ethics ……………………………………………… 159
List of Tables
Table 1: Number of Items Field Tested for North Carolina EOG and EOC Tests of Mathematics
Table 2: Field test population (2000) for grade 3 pretest, grades 3-8 end-of-grade tests, and end-of-course tests. Field test population (1997) for Grade 10 High School Comprehensive Test
Table 3: Field test population demographics (2001)
Table 4: Field test population demographics (2002)
Table 5: Average item pool parameter estimates for the EOG and EOC Tests of Mathematics by grade or subject (2000)
Table 6: Comparison of p-value of item pool with p-values of assembled forms averaged across forms
Table 7: Number of items per test and time allotted by grade and subject
Table 8: Population means and standard deviations derived from the Spring 2000 item calibration for the North Carolina End-of-Grade Tests of Mathematics, second edition
Table 9: Average difference between adjacent grade means in units of the standard deviation of the lower grade and ratios between adjacent-grade standard deviations derived from the Spring 2000 item calibrations for the North Carolina EOG Tests of Mathematics
Table 10: Replications of the average difference between adjacent-grade means in units of the standard deviation of the lower grade and ratios between adjacent-grade standard deviations derived from the Spring 2000 item calibration for the North Carolina EOG Tests of Mathematics
Table 11: Comparison of the population means and standard deviations for the second edition with averages and standard deviations obtained from the operational administration of the first edition in the Spring 2000 item calibration for the North Carolina EOG Tests of Mathematics
Table 12: Percent of students assigned to each achievement level by teachers
Table 13: Administrative Procedures Act 16 NCAC 6D .0501 (Definitions related to Student Accountability Standards)
Table 14: EOG and EOC Tests of Mathematics achievement levels and corresponding scale scores
Table 15: Achievement level trends for Grade 3 Pretest
Table 16: Achievement level trends for Grade 3
Table 17: Achievement level trends for Grade 4
Table 18: Achievement level trends for Grade 5
Table 19: Achievement level trends for Grade 6
Table 20: Achievement level trends for Grade 7
Table 21: Achievement level trends for Grade 8
Table 22: Achievement level trends for Grade 10 High School Comprehensive Test
Table 23: Achievement level trends for Algebra I
Table 24: Achievement level trends for Geometry
Table 25: Achievement level trends for Algebra II
Table 26: Descriptive statistics by grade for the 2001 administration of the North Carolina EOG Tests of Mathematics and the 1998 administration of the Grade 10 High School Comprehensive Test
Table 27: Mean scale score for the 2001 administration of the North Carolina EOC Mathematics tests
Table 28: Population demographics for the 2001 administration of the North Carolina EOG and EOC Tests of Mathematics and the 1998 administration of the Grade 10 High School Comprehensive Test
Table 29: Reliability indices averaged across North Carolina EOG and EOC Tests of Mathematics forms
Table 30: Reliability indices averaged across North Carolina EOG and EOC Tests of Mathematics forms (Gender)
Table 31: Reliability indices averaged across North Carolina EOG and EOC Tests of Mathematics forms (Ethnicity)
Table 32: Reliability indices averaged across North Carolina EOG and EOC Tests of Mathematics forms (Other Characteristics)
Table 33: Ranges of standard error of measurement for scale scores by grade or subject
Table 34: Instructional validity of the content of the North Carolina EOG Tests of Mathematics
Table 35: Pearson correlation coefficient table for variables used to establish criterion-related validity for the North Carolina EOG Tests of Mathematics
Table 36: Pearson correlation coefficient table for variables used to establish criterion-related validity for the North Carolina EOC Tests of Mathematics
List of Figures
Figure 1: Flow chart of the test development process used in development of North Carolina Tests
Figure 2: Thinking skills framework used to develop the North Carolina End-of-Grade Tests (adapted from Marzano, et al., 1988)
Figure 3: Typical item characteristic curve (ICC) for a 4-option multiple-choice item
Figure 4: Comparison of the growth curves for the first and second editions of the North Carolina EOG Tests of Mathematics in the Spring 2000 item calibration
Figure 5: Equipercentile equating functions between the first and second editions of the North Carolina EOG Tests of Mathematics scales derived from the Spring 2001 equating study for Grades 3–8
Figure 6: Math Scale Score Frequency Distribution Grade 3
Figure 7: Math Scale Score Frequency Distribution Grade 4
Figure 8: Math Scale Score Frequency Distribution Grade 5
Figure 9: Math Scale Score Frequency Distribution Grade 6
Figure 10: Math Scale Score Frequency Distribution Grade 7
Figure 11: Math Scale Score Frequency Distribution Grade 8
Figure 12: Algebra I Scale Score Frequency Distribution
Figure 13: Geometry Scale Score Frequency Distribution
Figure 14: Algebra II Scale Score Frequency Distribution
Figure 15: Standard Errors of Measurement on the Grade 3 Pretest of Mathematics Test forms
Figure 16: Standard Errors of Measurement on the Grade 3 Mathematics Test forms
Figure 17: Standard Errors of Measurement on the Grade 4 Mathematics Test forms
Figure 18: Standard Errors of Measurement on the Grade 5 Mathematics Test forms
Figure 19: Standard Errors of Measurement on the Grade 6 Mathematics Test forms
Figure 20: Standard Errors of Measurement on the Grade 7 Mathematics Test forms
Figure 21: Standard Errors of Measurement on the Grade 8 Mathematics Test forms
Figure 22: Standard Errors of Measurement on the Grade 10 Mathematics Test forms
Figure 23: Standard Errors of Measurement on the Algebra I Test forms
Figure 24: Standard Errors of Measurement on the Geometry Test forms
Figure 25: Standard Errors of Measurement on the Algebra II Test forms
Figure 26: Test Characteristic Curves for the Grade 3 Pretest of Mathematics Test forms
Figure 27: Test Characteristic Curves for the Grade 3 Mathematics Test forms
Figure 28: Test Characteristic Curves for the Grade 4 Mathematics Test forms
Figure 29: Test Characteristic Curves for the Grade 5 Mathematics Test forms
Figure 30: Test Characteristic Curves for the Grade 6 Mathematics Test forms
Figure 31: Test Characteristic Curves for the Grade 7 Mathematics Test forms
Figure 32: Test Characteristic Curves for the Grade 8 Mathematics Test forms
Figure 33: Test Characteristic Curves for the Grade 10 Mathematics Test forms
Figure 34: Test Characteristic Curves for the Algebra I Test forms
Figure 35: Test Characteristic Curves for the Geometry Test forms
Figure 36: Test Characteristic Curves for the Algebra II Test forms
Figure 37: Comparison of NAEP “proficient” scores and North Carolina End-of-Grade Tests of Mathematics scores for Grade 4
Figure 38: Comparison of NAEP “basic” scores and North Carolina End-of-Grade Tests of Mathematics scores for Grade 4
Figure 39: Comparison of NAEP “proficient” scores and North Carolina End-of-Grade Tests of Mathematics scores for Grade 8
Figure 40: Comparison of NAEP “basic” scores and North Carolina End-of-Grade Tests of Mathematics scores for Grade 8
Chapter One: Introduction
The General Assembly believes that all children can learn. It is the intent
of the General Assembly that the mission of the public school community
is to challenge with high expectations each child to learn, to achieve, and
to fulfill his or her potential (G.S. 115C-105.20a).
With that mission as its guide, the State Board of Education implemented the ABCs
Accountability Program at grades K–8 effective with the 1996–1997 school year and
grades 9–12 effective during the 1997–1998 school year to test students’ mastery of basic
skills (reading, writing, and mathematics). The ABCs Accountability Program was
developed under the Public School Laws mandating local participation in the program,
the design of annual performance standards, and the development of student academic
performance standards.
1.1 Local Participation
The School-Based Management and Accountability Program shall be
based upon an accountability, recognition, assistance, and intervention
process in order to hold each school and the school’s personnel
accountable for improved student performance in the school (G.S. 115C-105.21c).
Schools are held accountable for student learning by reporting student performance
results on North Carolina tests. Students' scores are compiled each year and released in a
report card. Schools are then recognized for the performance of their students. Schools
that consistently do not make adequate progress may receive intervention from the state.
In April 1999, the State Board of Education unanimously approved Statewide Student
Accountability Standards. These standards provide four Gateway Standards for student
performance at grades 3, 5, 8, and 11. Students in the 3rd, 5th, and 8th grades are required
to demonstrate grade-level performance in reading, writing (5th and 8th grades only), and
mathematics in order to be promoted to the next grade. The law regarding student
academic performance states:
The State Board of Education shall develop a plan to create rigorous
student academic performance standards for kindergarten through eighth
grade and student academic standards for courses in grades 9-12. The
performance standards shall align, whenever possible, with the student
academic performance standards developed for the National Assessment
of Educational Progress (NAEP). The plan also shall include clear and
understandable methods of reporting individual student academic
performance to parents (G.S. 115C-105.40).
1.2 The North Carolina Testing Program
The North Carolina Testing Program was designed to measure the extent to which
students satisfy academic performance requirements. Tests developed by the North
Carolina Department of Public Instruction’s Test Development Section, when properly
administered and interpreted, provide reliable and valid information that enables
• students to know the extent to which they have mastered expected knowledge and skills and how they compare to others;
• parents to know if their children are acquiring the knowledge and skills needed to succeed in a highly competitive job market;
• teachers to know if their students have mastered grade-level knowledge and skills in the curriculum and, if not, what weaknesses need to be addressed;
• community leaders and lawmakers to know if students in North Carolina schools are improving their performance over time and how our students compare with students from other states; and
• citizens to assess the performance of the public schools (North Carolina Testing Code of Ethics, 1997, revised 2000).
The North Carolina Testing Program was initiated in response to legislation passed by the
North Carolina General Assembly. The following selection from Public School Laws
(1994) describes the legislation. Public School Law 115C-174.10 states the following
purposes of the North Carolina Testing Program:
(1) to assure that all high school graduates possess the … skills and
knowledge thought necessary to function as a member of society; (2) to
provide a means of identifying strengths and weaknesses in the education
process; and (3) to establish additional means for making the education
system accountable to the public for results.
Tests included in the North Carolina Testing Program are designed for use as federal,
state, and local indicators of student performance. Interpretation of test scores in the
North Carolina Testing Program provides information about a student’s performance on
the test in percentiles, scale scores, and achievement levels. Percentiles provide an
indicator of how a child performs relative to other children who took the test in the
norming year, or the first year the test was administered. Percentiles range from 1 to 99.
A percentile rank of 69 indicates that a child performed equal to or better than 69 percent
of the children who took the test during the norming year.
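To make the reporting convention concrete, the percentile-rank calculation described above can be sketched as follows (Python). This is illustrative only; the norming scores and the 1–99 reporting cap are assumptions based on the description above, not the actual scoring procedure.

def percentile_rank(score, norming_scores):
    # Percent of norming-year examinees whose score this score equals or exceeds,
    # reported within the 1-99 range described above.
    at_or_below = sum(1 for s in norming_scores if s <= score)
    rank = round(100 * at_or_below / len(norming_scores))
    return min(99, max(1, rank))

# Hypothetical norming-year scale scores (not actual data)
norming = [238, 242, 247, 250, 252, 254, 255, 258, 260, 263]
print(percentile_rank(254, norming))  # 60: equal to or better than 6 of the 10 scores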
Scale scores are derived from a raw score or “number right” score for the test. Each test
has a translation table that provides a scale score for each raw test score. Scale scores are
reported alongside four achievement levels, which are predetermined academic
achievement standards. The four achievement levels for the North Carolina Testing
Program are shown below.
Level I: Students performing at this level do not have sufficient mastery
of knowledge and skills in a particular subject area to be successful at the
next grade level.
Level II: Students performing at this level demonstrate inconsistent
mastery of knowledge and skills in the subject area and are minimally
prepared to be successful at the next grade level.
Level III: Students performing at this level consistently demonstrate
mastery of the grade level subject matter and skills and are well prepared
for the next grade.
Level IV: Students performing at this level consistently perform in a
superior manner clearly beyond that required to be proficient at grade
level.
The North Carolina End-of-Grade (EOG) Tests include multiple-choice assessments of
reading comprehension and mathematics in grades 3 through 8 and 10. The North
Carolina End-of-Course (EOC) Tests include multiple-choice assessments of reading
comprehension and mathematics in English I, Algebra I, Geometry, and Algebra II. In
addition to the reading comprehension and mathematics tests, the North Carolina Testing
Program includes science EOC tests (Biology, Chemistry, Physical Science, and
Physics); social studies EOC tests which are currently under revision (Civics and
Economics and U.S. History); writing assessments in grades 4, 7, and 10; the North
Carolina Tests of Computer Skills; the North Carolina Competency Tests; and two
alternate assessments (North Carolina Alternate Assessment Academic Inventory and the
North Carolina Alternate Assessment Portfolio).
The EOG reading comprehension and mathematics tests are used to monitor growth and
student performance against absolute standards (performance composite) for student
accountability. A student’s EOG scores from the prior grade are used to determine his or
her entering level of knowledge and skills and to determine the amount of growth during
one school year. Beginning in 1996, a student’s growth at grade 3 was determined by
comparing the grade 3 EOG score with a grade 3 pretest administered during the first
three weeks of the school year. The Grade Level Proficiency Guidelines, approved by the
State Board of Education (February, 1995), established Level III (of those achievement
levels listed above) as the standard for each grade level. The EOC tests measure a
student’s mastery of course-level material.
1.3 The North Carolina Mathematics Tests
The purpose of this document is to provide an overview and technical documentation for
the North Carolina Mathematics Tests, which include the Grade 3 Pretest, the End-of-Grade Mathematics Tests in grades 3-8, the High School Comprehensive Mathematics
Test, and End-of-Course (EOC) Mathematics Tests in Algebra I, Geometry, and Algebra
II. Chapter One provides an overview of the North Carolina Mathematics Tests. Chapter
Two describes the test development process. Chapter Three outlines the test
administration. Chapter Four describes the construction of the developmental scale, the
scoring of the tests, and the standard setting process. Chapter Five provides an outline of
reporting of test results. Chapters Six and Seven provide the technical properties of the
tests such as descriptive statistics from the first operational year, reliability indices, and
evidence of validity. Chapter Eight is an overview of quality control procedures.
Chapter Two: Test Development
2.1 Test Development Process for the North Carolina Testing Program
In June of 2003, the State Board of Education codified the process used in developing all
multiple-choice tests in the North Carolina Testing Program. The development of tests
for the North Carolina Testing Program follows a prescribed sequence of events. A flow
chart of those events is found in Figure 1.
Figure 1: Flow chart of the test development process used in development of North Carolina Tests
Curriculum Adoption
Step 1a: Develop Test Specifications (Blueprint)
Step 2b: Develop Test Items
Step 3b: Review Items for Tryouts
Step 4: Assemble Item Tryout Forms
Step 5b: Review Item Tryout Forms
Step 6b: Administer Item Tryouts
Step 7: Review Item Tryout Statistics
Step 8b: Develop New Items
Step 9b: Review Items for Field Test
Step 10: Assemble Field Test Forms
Step 11b: Review Field Test Forms
Step 12b: Administer Field Test
Step 13: Review Field Test Statistics
Step 14b: Conduct Bias Reviews
Step 15: Assemble Equivalent and Parallel Forms
Step 16b: Review Assembled Test
Step 17: Final Review of Test
Step 18ab: Administer Test as Pilot
Step 19: Score Test
Step 20ab: Establish Standards
Step 21b: Administer Test as Fully Operational
Step 22: Report Test Results
a Activities done only at implementation of new curriculum
b Activities involving NC teachers
Phase 1 (step 1) requires 4 months
Phase 2 (steps 2-7) requires 12 months
Phase 3 (steps 8-14) requires 20 months
Phase 4 (steps 15-20) requires 4 months for EOC and 9 months for EOG
Phase 5 (step 21) requires 4 months
Phase 6 (step 22) requires 1 month
TOTAL 44-49 months
NOTES: Whenever possible, item tryouts should precede the field testing of items. Professional development opportunities are an integral, ongoing part of the curriculum and test development process.
2.2 The Curriculum Connection
Using research conducted by the North Carolina Mathematics Framework Committee,
the North Carolina Mathematics Standard Course of Study Committee constructed a
curriculum focused on giving students the opportunity to acquire mathematical literacy.
Mathematical literacy is necessary to function in an information age, and its primary role is to help students
• cultivate the understanding and application of mathematical skills and concepts necessary to thrive in an ever-changing technological world;
• develop the essential elements of problem solving, communication, and reasoning;
• develop connections within their study of mathematics; and
• understand the major ideas of mathematics (Mathematics K–12 Standard Course of Study and Mathematics Competencies, NCDPI Publication, Instructional Services Division, www.ncpublicschools.org/curriculum/mathematics/).
The North Carolina Mathematics Standard Course of Study clearly defines a curriculum
focused on what students will need to know and be able to do to be successful and
contributing citizens in our state and nation in the years ahead. As defined in the 1998
North Carolina Mathematics Standard Course of Study, the goals of mathematics
education are for students to develop
(1) strong mathematical problem solving and reasoning abilities;
(2) a firm grounding in essential mathematical concepts and skills, including
computation and estimation;
(3) connections within mathematics and with other disciplines;
(4) the ability to use appropriate tools, including technology, to solve mathematical
problems;
(5) the ability to communicate an understanding of mathematics effectively; and
(6) positive attitudes and beliefs about mathematics.
The elementary program of mathematics focuses on assisting students with a higher-level
understanding of mathematics through the use of manipulative items, working
independently and in groups, and conducting investigations and recording findings.
Middle-grade students expand on these skills to compute with real numbers and to apply
basic concepts in new and difficult situations. High school mathematics includes courses
from Introductory Mathematics to Advanced Placement Calculus (North Carolina
Standard Course of Study). The North Carolina State Board of Education adopted the
revised mathematics component of the North Carolina Standard Course of Study
(NCSCS) in 1998. Students in North Carolina schools are tested in mathematics in grades
3 through 8 and grade 10. In addition, students taking Algebra I, Algebra II, and
Geometry in high school are tested at the end of these courses. Mathematics tests for
these grades and courses are designed around the competency goals and objectives found
in the NCSCS.
2.3 Test Specifications
Delineating the purpose of a test must come before the test design. A clear statement of
purpose provides the overall framework for test specifications, test blueprint, item
development, tryout, and review. A clear statement of test purpose also contributes
significantly to appropriate test use in practical contexts (Millman & Greene, 1993). The
tests in the North Carolina Testing Program are designed in alignment with the NCSCS.
The purpose of the North Carolina EOG and EOC Tests of Mathematics is legislated by
General Statute 115C-174.10 and focuses on the measurement of individual student
mathematical skills and knowledge as outlined in the NCSCS.
Test specifications for the North Carolina mathematics tests are developed in accordance
with the competency goals and objectives specified in the NCSCS. A summary of the test
specifications is provided in Appendix B. These test specifications also are generally
designed to include the following:
(1) Percentage of questions from higher or lower thinking skills and classification of
each test question into level of difficulty;
(2) Percentage of test questions that measure a specific goal or objective;
(3) Percentage of questions that require the use of a calculator and percentage that
do not allow the use of a calculator.
2.4 Item Development
Items on the North Carolina EOG and EOC Tests of Mathematics are developed using
level of difficulty and thinking skill level. Item writers use these frameworks when
developing items. The purpose of the categories is to ensure a balance of items across
difficulty, as well as a balance of items across the different cognitive levels of learning in
the North Carolina mathematics tests.
For the purposes of guiding item writers to provide a variety of items, items were
classified into three levels of difficulty: easy, medium, and hard. Easy items are those
items that the item writer believes can be answered correctly by approximately 70% of
the examinees. Medium items can be answered correctly by 50–60% of the examinees.
Difficult items can be answered correctly by approximately 20–30% of the examinees.
These targets are used for item pool development to ensure an adequate range of
difficulty.
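For illustration, the difficulty targets above can be restated as a simple classification rule applied to p-values; the cutoffs below simply mirror the percentages in this paragraph, and the item values shown are hypothetical.

def difficulty_category(p_value):
    # Classify an item by the proportion of examinees expected to answer it correctly.
    if p_value >= 0.70:
        return "easy"     # roughly 70% or more answer correctly
    if 0.50 <= p_value <= 0.60:
        return "medium"   # 50-60% answer correctly
    if 0.20 <= p_value <= 0.30:
        return "hard"     # 20-30% answer correctly
    return "outside target ranges"

# Hypothetical p-values
for p in (0.74, 0.55, 0.27, 0.44):
    print(p, difficulty_category(p))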
A more recent consideration for item development is the classification of items by
thinking skill level, the cognitive skills that an examinee must use to solve a problem or
answer a test question. Thinking skill levels are based on Dimensions of Thinking by
Marzano, et al. (1988). In addition to using thinking skill levels in framing achievement
tests, they are also a practical framework for curriculum development, instruction,
assessment, and staff development. Thinking skills begin with the basic skill of
information-gathering and move to more complex thinking skills, such as integration and
evaluation. Figure 2 below shows a visual representation of the framework.
Figure 2: Thinking skills framework used to develop the North Carolina End-of-Grade Tests (adapted from
Marzano, et al., 1988)
Dimensions of Thinking
• Content Area Knowledge
• Metacognition
• Critical and Creative Thinking
• Thinking Processes: Concept Formation, Principle Formation, Comprehending, Problem-solving, Decision-making, Research, Composing, Oral Discourse
• Core Thinking Skills (Categories): Focusing, Information-gathering, Remembering, Organizing, Analyzing, Generating, Integrating, Evaluating
2.5 Item Format and Use of Manipulatives
Items on the North Carolina mathematics tests are four-foil, multiple-choice items. On
the end-of-grade mathematics tests, thirty percent of the items are calculator inactive
items and seventy percent are calculator active items. A small percentage of items on the
end-of-grade mathematics tests require the use of a ruler or protractor. Formula sheets are
provided for grades 6 through 8 and 10.
2.6 Selection and Training of Item Writers
Once the test blueprints were finalized from the test specifications for the revised editions
of the North Carolina mathematics tests, North Carolina educators were recruited and
trained to write new items for the state tests. The diversity among the item writers and
their knowledge of the current NCSCS was addressed during recruitment. The use of
North Carolina educators to develop items ensured instructional validity of the items.
Some items were developed through an external vendor; however, the vendor was
encouraged to use North Carolina educators in addition to professional item writers to
generate items that would align with the NCSCS for mathematics.
Training for item writers occurred during a 3-day period. Item writers received a packet
of materials designed in accordance with the mathematics curriculum, which included
information on content and procedural guidelines as well as information on stem and foil
development. The item-writing guidelines are included in Appendix A. The items
developed during the training were evaluated by content specialists, who then provided
feedback to the item writers on the quality of their items.
2.7 Reviewing Items for Field Testing
To ensure that an item was developed to NCSCS standards, each item went through a
detailed review process prior to being placed on a field test. A new group of North
Carolina educators was recruited to review items. Once items had been through an
educator review, test development staff members, with input from curriculum specialists,
reviewed each item. Items were also reviewed by educators and/or staff familiar with the
needs of students with disabilities and limited English proficiency.
The criteria for evaluating each written item included the following:
1) Conceptual
• objective match (curricular appropriateness)
• thinking skill match
• fair representation
• lack of bias
• clear statement
• single problem
• one best answer
• common context in foils
• credible foils
• technical correctness
2) Language
• appropriate for age
• correct punctuation
• spelling and grammar
• lack of excess words
• no stem or foil clues
• no negative in foils
3) Format
• logical order of foils
• familiar presentation style, print size, and type
• correct mechanics and appearance
• equal length foils
4) Diagram/Graphics
• necessary
• clean
• relevant
• unbiased
The detailed review of items helped prevent the loss of items during field testing due to
quality issues.
2.8 Assembling Field Test Forms
Prior to creating an operational test, items for each written subject/course area were
assembled into field test forms. Field test forms were organized according to blueprints
for the operational tests. Similar to the operational test review, North Carolina educators
reviewed the assembled field test forms for clarity, correctness, potential bias, and
curricular appropriateness.
Field testing of mathematics Grade 3 Pretest, end-of-grade, and end-of-course test items
occurred during the 1999–2000 school year. Rather than develop forms composed of
field test items alone, field test items were instead embedded in operational forms from
the previous curriculum. The three operational EOG and EOC base forms at each grade
or course were embedded with 10–12 items each to create 45–51 separate test forms for
each grade level or subject. In addition, there were 15–17 linking forms (administered at
a grade below the nominal grade) and one research form which was used to examine
context and location effects, resulting in 61 to 79 forms at each grade. The High School
Comprehensive Mathematics items were field tested on whole forms in 1997. Table 1
below provides a breakdown of the number of grade-level forms, number of items per
form, and number of total items per grade or subject.
Table 1: Number of Items Field Tested for North Carolina EOG and EOC Tests of Mathematics
Grade/Subject      Number of Grade Level Forms   Number of Items per Form   Total Number of Items
Grade 3 Pretest    36                            11                         396
Grade 3            51                            12                         612
Grade 4            51                            12                         612
Grade 5            48                            12                         576
Grade 6            51                            12                         612
Grade 7            48                            12                         576
Grade 8            45                            12                         540
Grade 10           10                            80                         800
Algebra I          45                            12                         540
Geometry           45                            12                         540
Algebra II         39                            10                         390
2.9 Sampling Procedures
Sampling for field testing of the North Carolina Tests is typically accomplished using
stratified random sampling with the goal being a selection of students that is
representative of the entire student population in North Carolina. The development of the
North Carolina Tests of Mathematics departed from random sampling during the first
field test (2000) and instead used census sampling to embed field test items on an
operational version of the mathematics tests. The sample for the High School
Comprehensive Mathematics Test was selected through stratified random sampling to
represent the general population characteristics. In 2001 and 2002, additional samples of
students were selected at random to supplement the item pools. Field test sample
characteristics for the three years are provided in the following section.
2.10 Field Test Sample Characteristics
Table 2: Field test population (2000) for grade 3 pretest, grades 3-8 end-of-grade tests and end-of-course
tests. Field test population (1997) for Grade 10 High School Comprehensive Test
Grade/Subject     N         % Male   % Female   % American Indian   % Black   % White   % Other   % LEP
3 Pretest         105,750   50.8     49.2       1.5                 30.8      60.8      6.9       1.7
Grade 3           105,900   51.3     48.7       1.4                 30.9      60.0      7.7       2.8
Grade 4           104,531   51.4     48.6       1.4                 30.6      61.0      7.0       2.3
Grade 5           103,109   50.7     49.3       1.4                 30.2      61.7      6.6       2.2
Grade 6           100,746   51.0     49.0       1.3                 31.0      61.4      6.3       1.8
Grade 7           98,424    50.4     49.6       1.4                 33.7      58.9      6.1       1.5
Grade 8           95,229    50.9     49.1       1.3                 30.8      61.7      6.1       1.8
Grade 10          10,691    48.0     52.0       1.2                 26.1      65.7      7.0       0.6
Algebra I         90,322    49.8     50.2       1.3                 28.6      64.1      6.1       0.7
Geometry          65,060    46.3     53.6       1.1                 22.8      70.7      5.5       0.4
Algebra II        52,701    46.9     53.1       1.1                 25.0      68.4      5.5       0.4
(LEP = Limited English Proficient)
To supplement the item pools created from the embedded field testing, additional stand-alone field tests were administered in subsequent years in grades 3 through 8. The field
test population characteristics from the stand-alone field tests are provided below in
tables 3 and 4.
Table 3: Field test population demographics (2001)
Grade     N        % Male   % Female   % American Indian   % Black   % White   % Other   % LEP
Grade 3   10,397   51.4     48.6       0.8                 30.9      59.9      8.5       2.0
Grade 4   10,251   49.9     50.1       0.8                 29.1      61.9      8.2       1.9
Grade 5   8,019    50.1     49.9       1.1                 27.6      62.9      8.4       1.4
Grade 6   18,319   50.3     49.7       1.5                 29.3      62.1      7.2       1.0
Grade 7   16,885   49.5     50.5       1.5                 26.5      64.2      7.8       1.0
Grade 8   15,395   50.0     50.0       1.2                 25.5      66.1      7.2       1.1
Table 4: Field test population demographics (2002)
Grade     N        % Male   % Female   % American Indian   % Black   % White   % Other   % LEP
Grade 6   13,988   50.9     49.1       2.0                 29.8      60.7      8.9       1.1
Grade 8   17,501   49.3     50.7       1.3                 28.4      62.5      7.8       1.1
2.11 Item Analysis
Field testing provides important data for determining whether an item will be retained for
use on an operational North Carolina EOG or EOC Test of Mathematics. The North
Carolina Testing Program uses both classical measurement analysis and item response
theory (IRT) analysis to determine if an item has sound psychometric properties. These
analyses provide information that assists North Carolina Testing Program staff and
consultants in determining the extent to which an item can accurately measure a student’s
level of achievement.
Field test data were analyzed by the North Carolina Department of Public Instruction
(NCDPI) psychometric staff. Item statistics and descriptive information were then printed
on labels and attached to the item record for each item. The item records contained the
statistical, descriptive, and historical information for an item, a copy of the item as it was
field tested, comments by reviewers, and curricular and psychometric notations.
2.12 Classical Measurement Analysis
For each item, the p-value (proportion of examinees answering an item correctly), the
standard deviation of the p-value, and the point-biserial correlation between the item
score and the total test score were computed using SAS. In addition, frequency
distributions of the response choices were tabulated. While the p-value is an important
statistic and one component used in determining the selection of an item, the North
Carolina Testing Program also uses IRT to provide additional item parameters to
determine the psychometric properties of the North Carolina mathematics tests.
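The analyses described above were run in SAS; the Python sketch below shows, under simplified assumptions, how the same classical statistics (item p-value, item score standard deviation, and point-biserial correlation with the total score) could be computed from scored responses. The function names and data are hypothetical, not part of the documented procedure.

import math

def point_biserial(item_scores, total_scores):
    # Pearson correlation between a 0/1 item score and the total test score.
    n = len(item_scores)
    mx = sum(item_scores) / n
    my = sum(total_scores) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(item_scores, total_scores)) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x in item_scores) / n)
    sy = math.sqrt(sum((y - my) ** 2 for y in total_scores) / n)
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0

def classical_item_stats(responses):
    # responses: one list of 0/1 item scores per examinee.
    totals = [sum(r) for r in responses]
    stats = []
    for i in range(len(responses[0])):
        item = [r[i] for r in responses]
        p = sum(item) / len(item)
        sd = math.sqrt(p * (1 - p))  # standard deviation of a 0/1 item score
        stats.append((round(p, 2), round(sd, 2), round(point_biserial(item, totals), 2)))
    return stats

# Hypothetical scored responses for four examinees on three items
print(classical_item_stats([[1, 1, 0], [1, 0, 0], [1, 1, 1], [0, 0, 0]]))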
2.13 Item Response Theory (IRT) Analysis
To provide additional information about item performance, the North Carolina Testing
Program also uses IRT statistics to determine whether an item should be included on the
test. IRT is, with increasing frequency, being used with large-scale achievement testing.
“The reason for this may be the desire for item statistics to be independent of a particular
group and for scores describing examinee proficiency to be independent of test difficulty,
and for the need to assess reliability of tests without the tests being strictly parallel”
(Hambleton, 1993, p. 148). IRT meets these needs and provides two additional
advantages: the invariance of item parameters and the invariance of ability parameters.
Regardless of the distribution of the sample, the parameter estimates will be linearly
related to the parameters estimated with some other sample drawn from the same
population. IRT allows the comparison of two students’ ability estimates even though
they may have taken different items. An important characteristic of IRT is item-level
orientation. IRT makes a statement about the relationship between the probability of
answering an item correctly and the student’s ability or the student’s level of
achievement. The relationship between a student’s item performance and the set of traits
underlying item performance can be described by a monotonically increasing function
called an Item Characteristic Curve (ICC). This function specifies that as the level of the
trait increases, the probability of a correct response to an item increases. The following
figure shows the ICC for a typical 4-option multiple-choice item.
Figure 3: Typical item characteristic curve (ICC) for a 4-option multiple-choice item
[Figure: an S-shaped curve labeled "Three Parameter Model," plotting the probability of a correct response (y-axis, 0.0 to 1.0) against ability (x-axis, approximately -3.0 to 3.0).]
The three-parameter logistic model (3PL) of IRT, the model used in generating EOG item
statistics, takes into account the difficulty of the item and the ability of the examinee. A
student’s probability of answering a given item correctly depends on the student’s ability
and the characteristics of the item. The 3PL model has three assumptions:
(1) unidimensionality—only one ability is assessed by the set of items (for example, a
spelling test only assesses a student’s ability to spell);
(2) local independence—when abilities influencing test performance are held
constant, an examinee’s responses to any pair of items are statistically
independent (conditional independence, i.e., the only reason an examinee scores
similarly on several items is because of his or her ability); and
(3) the ICC specified reflects the true relationship among the unobservable variable
(ability) and the observable variable (item response).
The formula for the 3PL model is

Pi(θ) = ci + (1 - ci) / (1 + exp[-D ai (θ - bi)])

where
Pi(θ)—the probability that a randomly chosen examinee with ability (θ)
answers item i correctly (this is an S-shaped curve with values
between 0 and 1 over the ability scale)
a—the slope or the discrimination power of the item (the slope of a typical
item is 1.00)
b—the threshold or the point on the ability scale where the probability of a
correct response is 50% (the threshold of a typical item is 0.00)
c—the asymptote or the proportion of the examinees who got the item
correct but did poorly on the overall test (the asymptote of a typical
4-choice item is 0.25)
D—a scaling factor, 1.7, to make the logistic function as close as possible
to the normal ogive function (Hambleton, 1983, p.125).
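As an illustration of the response function above (not of the BILOG estimation procedure), the 3PL probability can be evaluated directly; the parameter values used below are the "typical" values noted in the definitions.

import math

def p_correct_3pl(theta, a, b, c, D=1.7):
    # Probability of a correct response under the three-parameter logistic model.
    return c + (1 - c) / (1 + math.exp(-D * a * (theta - b)))

# A typical 4-choice item: a = 1.00, b = 0.00, c = 0.25
for theta in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(theta, round(p_correct_3pl(theta, 1.00, 0.00, 0.25), 3))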
The IRT parameter estimates for each item are computed using the BILOG computer
program (Muraki, Mislevy, & Bock, 1991) using the default Bayesian prior distributions
for the item parameters [a~lognormal(0, 0.5), b~N(0,2), and c~Beta(6,16)].
2.14 Differential Item Functioning Analysis
It is important to know the extent to which an item on a test performs differently for
different students. As a third component of the item analysis, differential item functioning
(DIF) analyses examine the relationship between the score on an item and group
membership while controlling for ability to determine if an item is biased towards a
particular gender or ethnic group. In developing the North Carolina mathematics tests, the
North Carolina Testing Program staff used the Mantel-Haenszel procedure to examine
DIF by examining j 2 × 2 contingency tables, where j is the number of different levels of
ability actually achieved by the examinees (actual total scores received on the test). The
focal group is the focus of interest, and the reference group serves as a basis for
comparison for the focal group (Dorans & Holland, 1993; Camilli & Shepherd, 1994).
For example, females might serve as the focal group and males might serve as the
reference group to determine if an item is biased towards or against females.
The Mantel-Haenszel (MH) chi-square statistic (only used for 2 × 2 tables) tests the
alternative hypothesis that a linear association exists between the row variable (score on
the item) and the column variable (group membership). The Χ2 distribution has one
degree of freedom (df ) and its significance is determined by the correlation between the
row variable and the column variable (SAS Institute, 1985).
The MH Log Odds Ratio statistic in SAS was used to determine the direction of DIF.
This measure was obtained by combining the odds ratios (aj) across levels with the
formula for weighted averages (Camilli & Shepherd, 1994, p. 110).
For this statistic, the null hypothesis of no relationship between score and group
membership (that is, that the odds of getting the item correct are equal for the two groups) was not
rejected when the odds ratio equaled 1. For odds ratios greater than 1, the interpretation
was that an individual at score level j of the Reference Group had a greater chance of
answering the item correctly than an individual at score level j of the Focal Group.
Conversely, for odds ratios less than 1, the interpretation was that an individual at score
level j of the Focal Group had a greater chance of answering the item correctly than an
individual at score level j of the Reference Group. The Breslow-Day Test was used to test
whether the odds ratios from the j levels of the score were all equal. When the null
hypothesis was true, the statistic was distributed approximately as a chi-square with j-1
degrees of freedom (SAS Institute, 1985).
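A simplified sketch of the Mantel-Haenszel common odds ratio pooled across the j score levels is shown below. It is illustrative only (the chi-square and Breslow-Day tests described above are omitted), and the 2 × 2 counts are hypothetical.

def mantel_haenszel_odds_ratio(tables):
    # tables: one (ref_correct, ref_incorrect, focal_correct, focal_incorrect)
    # tuple per score level j. Values above 1 favor the reference group.
    numerator = denominator = 0.0
    for ref_c, ref_i, foc_c, foc_i in tables:
        n = ref_c + ref_i + foc_c + foc_i
        if n == 0:
            continue
        numerator += ref_c * foc_i / n
        denominator += ref_i * foc_c / n
    return numerator / denominator if denominator else float("inf")

# Hypothetical counts at three score levels (reference group vs. focal group)
levels = [(40, 20, 35, 25), (55, 10, 50, 15), (30, 30, 28, 32)]
print(round(mantel_haenszel_odds_ratio(levels), 2))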
The ethnic and gender bias flags were determined by examining the significance levels of
items from several forms and identifying a typical point on the continuum of odds ratios
that was statistically significant at the α = 0.05 level.
2.15 Expert Review
All items, statistics, and comments were reviewed by curriculum specialists and testing
consultants. Items found to be inappropriate for curricular or psychometric reasons were
deleted. In addition, items flagged for exhibiting ethnic or gender bias were then
reviewed by a bias review committee.
The bias review committee members, selected because of their knowledge of the
curriculum area and their diversity, evaluated test items with a bias flag using the
following questions:
1. Does the item contain any offensive gender, ethnic, religious, or regional content?
2. Does the item contain gender, ethnic, or cultural stereotyping?
3. Does the item contain activities that will be more familiar to one group than
another?
4. Do the words in the item have a different meaning to one group than another?
5. Could there be group differences in performance that are unrelated to proficiency
in the content areas?
An answer of yes to any of these questions resulted in the unique 5-digit item number
being recorded on an item bias sheet along with the nature of the bias or sensitivity. Items
that were consistently identified as exhibiting bias or sensitivity were deleted from the
item pool.
Items that were flagged by the bias review committee were then reviewed by curriculum
specialists. If the curriculum specialists found that an item measured content that was expected to be
mastered by all students, the item was retained for test development. Items consistently
identified as exhibiting bias by both review committees were deleted from the item pool.
2.16 Criteria for Inclusion in Item Pool
All of the item parameter data generated from the above analyses were used to determine
if an item displayed sound psychometric properties. Items could potentially be flagged
as exhibiting psychometric problems or bias due to ethnicity/race or gender according to
the following criteria:
• weak prediction—the slope (a parameter) was less than 0.60;
• guessing—the asymptote (c parameter) was greater than 0.40;
• ethnic bias—the log odds ratio was greater than 1.5 (favored whites) or less than 0.67 (favored blacks); and
• gender bias—the log odds ratio was greater than 1.5 (favored females) or less than 0.67 (favored males).
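Expressed as a screening routine, the flagging criteria above might look like the sketch below; the routine and the item records passed to it are illustrative, not the actual review process.

def flag_item(a, c, ethnic_odds, gender_odds):
    # Return the reasons, if any, that an item would be flagged under the criteria above.
    flags = []
    if a < 0.60:
        flags.append("weak prediction (a < 0.60)")
    if c > 0.40:
        flags.append("guessing (c > 0.40)")
    if not 0.67 <= ethnic_odds <= 1.5:
        flags.append("ethnic bias (odds ratio outside 0.67-1.5)")
    if not 0.67 <= gender_odds <= 1.5:
        flags.append("gender bias (odds ratio outside 0.67-1.5)")
    return flags

# Hypothetical item records: (a, c, ethnic odds ratio, gender odds ratio)
for record in [(0.95, 0.18, 1.10, 1.02), (0.45, 0.22, 1.03, 0.98), (1.05, 0.15, 1.62, 1.00)]:
    print(record, flag_item(*record) or "retained")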
Because the tests were to be used to evaluate the implementation of the curriculum, items
were not flagged on the basis of the difficulty of the item (threshold). The average item
pool parameter estimates based on field test data are provided below.
2.17 Item Pool Parameter Estimates
Table 5: Average item pool parameter estimates for the EOG and EOC Tests of Mathematics by grade or
subject (2000)
Grade/Subject   Threshold (b)   Slope (a)   Asymptote (c)   p-value   Ethnic/Race Bias   Gender Bias
3 Pretest       0.07            0.98        0.18            0.60      1.06               1.01
Grade 3         -0.30           0.99        0.11            0.66      1.11               1.03
Grade 4         0.03            0.92        0.11            0.59      1.09               1.01
Grade 5         0.28            1.02        0.13            0.54      1.09               1.00
Grade 6         0.44            1.02        0.14            0.51      1.07               1.03
Grade 7         0.51            0.98        0.15            0.49      1.11               1.02
Grade 8         0.58            1.01        0.14            0.46      1.10               1.03
Grade 10        1.26            1.12        0.22            0.41      1.01               1.01
Algebra I       0.62            0.94        0.18            0.50      1.10               1.02
Geometry        0.58            1.06        0.19            0.50      1.14               1.03
Algebra II      0.91            0.98        0.20            0.46      1.07               1.02
(Threshold, slope, and asymptote are IRT parameters; the bias columns report odds ratio logits.)
2.18 Operational Test Construction
The final item pool was based on approval by the (1) NCDPI Division of Instructional
Services for curricular match and (2) NCDPI Division of Accountability Services/Test
Development Section for psychometrically sound item performance. Once the final items
were identified for the item pool, operational tests were constructed according to the test
blueprints. For a summary of the test specifications, see Appendix B. For EOG Tests of
Mathematics, three forms were developed for operational administration for grades 3 through
6. For grades 7 and 8, two forms were developed. Three forms were developed for each of the
EOC tests.
2.19 Setting the Target p-value for Operational Tests
P-value is a measure of the difficulty of an item. P-values can range from 0 to 1. The letter “p”
symbolizes the proportion of examinees that answer an item correctly. So an item with a p-value of 0.75 was correctly answered by 75% of the students who answered the item during
the field test, and one might expect that roughly 75 of 100 examinees will answer it correctly
when the item is put on an operational test. An easy item has a p-value that is high, which
means that a large proportion of the examinees got the item right during the field test. A
difficult item has a low p-value, meaning that few examinees answered the item correctly
during field testing.
The NCDPI psychometric staff must choose a target p-value for each operational test prior to
assembling the tests. Ideally, the average p-value of a test would be 0.625, the midpoint between a perfect score (100% correct) and chance performance (25% correct on a 4-foil multiple-choice test): that is, (100 + 25)/2 = 62.5%.
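The arithmetic above generalizes to any number of answer choices; the snippet below simply restates the midpoint calculation and is not part of the documented procedure.

def ideal_target_p(num_choices=4):
    # Midpoint between a perfect score (1.0) and chance performance (1/num_choices).
    chance = 1.0 / num_choices
    return (1.0 + chance) / 2

print(ideal_target_p(4))  # 0.625 for a four-foil multiple-choice test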
The target was chosen by first looking at the distribution of the p-values for a particular item
pool. While the goal is to set the target as close to 0.625 as possible, it is often the case that the
target p-value is set between the ideal 0.625 and the average p-value of the item pool. The
average p-value and the target p-value for operational forms are provided below for
comparison.
2.20 Comparison of Item Pool p-Values with Operational p-Values
Table 6: Comparison of p-value of item pool with p-values of assembled forms averaged across forms
Grade/Subject   p-Value of Item Pool   p-Value of Forms
3 Pretest       0.60                   0.59
Grade 3         0.66                   0.66
Grade 4         0.59                   0.62
Grade 5         0.54                   0.59
Grade 6         0.51                   0.56
Grade 7         0.49                   0.51
Grade 8         0.46                   0.48
Grade 10        0.41                   0.41
Algebra I       0.50                   0.51
Geometry        0.50                   0.54
Algebra II      0.46                   0.47
To develop equivalent forms, the test forms were balanced on P+, the sum of the p-values of
the items. All calculator active sections within a grade were equated, and all calculator inactive
sections within a grade were equated. Finally, to the extent possible, the sections were
balanced on slope.
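A minimal sketch of checking two forms for equivalence on P+ is shown below; the p-values, forms, and notion of "small difference" are hypothetical, and the actual balancing procedure is not documented at this level of detail.

def p_plus(item_p_values):
    # P+ for a form or section: the sum of its items' p-values.
    return sum(item_p_values)

# Hypothetical p-values for the calculator-active sections of two forms
form_a = [0.72, 0.55, 0.48, 0.61, 0.66]
form_b = [0.70, 0.57, 0.50, 0.60, 0.65]
difference = abs(p_plus(form_a) - p_plus(form_b))
print(round(difference, 2), "difference in P+; forms are considered balanced when this is small")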
2.21 Review of Assembled Operational Tests
Once forms were assembled to meet test specifications, target p-values, and item
parameter targets, a group of North Carolina educators and curriculum supervisors then
reviewed the assembled forms. Each group of subject area teachers and curriculum
supervisors worked independently of the test developers. The criteria for evaluating each
group of forms included the following:
• the content of the test forms should reflect the goals and objectives of the North Carolina Standard Course of Study for the subject (curricular validity);
• the content of test forms should reflect the goals and objectives as taught in North Carolina schools (instructional validity);
• items should be clearly and concisely written and the vocabulary appropriate to the target age level (item quality);
• content of the test forms should be balanced in relation to ethnicity, gender, socioeconomic status, and geographic district of the state (free from test/item bias); and
• an item should have one and only one best answer that is right; the distractors should appear plausible for someone who has not achieved mastery of the representative objective (one best answer).
Reviewers were instructed to take the tests (circling the correct responses in the booklet) and
to provide comments and feedback next to each item. After reviewing all the forms, each
reviewer independently completed the survey asking for his or her opinion as to how well the
tests met the five criteria listed above. During the last part of the session, the group discussed
the tests and made comments as a group. The test review ratings along with the comments
were aggregated for review by NCDPI curriculum specialists and testing consultants. As a
final review, test development staff members, with input from curriculum staff, content
experts, and editors, conducted a final content and grammar check for each test form.
2.22 Setting the Test Administration Time
Additional important considerations in the construction of the North Carolina mathematics
tests were the number of items to be included and the time necessary to complete the test. A
timing study was conducted for the mathematics tests in the spring of 2000. Twenty-four items were administered to a volunteer sample of students in
Grades 3 and 6. Fifteen items were administered to a volunteer sample of high school
students. The length of time necessary to complete the items was calculated. This provided a
rough time-per-item estimate. In some cases it was necessary to reduce the number of items
slightly so that test administration time was reasonable and comparable to previous test
administrations. Adjustments to the length of the North Carolina mathematics tests were made
prior to administering the test as an operational test. The total number of items and
approximate testing time (in minutes) for each mathematics test are provided below.
Table 7: Number of items per test and time allotted by grade and subject
Grade/Subject   Number of Items   Time Allotted in Minutes (includes short breaks and general instructions)
3 Pretest       40                100
Grade 3         80                200
Grade 4         80                200
Grade 5         80                200
Grade 6         80                195
Grade 7         80                195
Grade 8         80                195
Grade 10        80                195
Algebra I       80                130
Geometry        72                130
Algebra II      60                130
Chapter Three: Test Administration
3.1 Test Administration
The North Carolina Grade 3 Mathematics Pretest, which measures grade 2 competencies
in mathematics, is a multiple-choice test administered to all students in grade 3 within the
first three weeks of the school year. The pretest allows schools to establish benchmarks to
compare individual and group scale scores and achievement levels with the results from
the regular end-of-grade test administered in the spring. In addition, a comparison of the
results from the pretest and the results from the regular grade 3 end-of-grade test
administration allows schools to measure growth in achievement in mathematics at the
third-grade level for the ABCs accountability program. The grade 3 pretest measures the
knowledge and skills specified for grade 2 from the mathematics goals and objectives of
the 1998 North Carolina Standard Course of Study. The pretest is not designed to make
student placement or diagnostic decisions.
The End-of-Grade Mathematics Tests are administered to students in grades 3 through 8
as part of the statewide assessment program. The standard for grade-level proficiency is a
test score at or above Achievement Level Three on both reading comprehension and
mathematics tests. Effective with the 2000-2001 school year, the North Carolina End-of-Grade Mathematics Tests are multiple-choice tests that measure the goals and objectives
of the mathematics curriculum adopted in 1998 by the North Carolina State Board of
Education for each grade. The competency goals and objectives are organized into four
strands: (1) number sense, numeration, and numerical operations; (2) spatial sense,
measurement, and geometry; (3) patterns, relationships, and functions; and (4) data,
probability, and statistics.
The North Carolina High School Comprehensive Mathematics Test is administered to
students in grade 10. It is a multiple-choice test that measures knowledge, skills, and
competencies in mathematics that the typical student should have mastered by the end of
the tenth grade. The mathematics framework consists of three competencies (problem solving,
reasoning, and communication) and four strands: (1) number sense, numeration, and numerical
operations; (2) spatial sense, measurement, and geometry; (3) patterns, relationships, and
functions; and (4) statistics, probability, and discrete mathematics.
All end-of-course tests are administered within the final ten days of the course to students
enrolled for credit in courses where end-of-course tests are required. The purpose of
end-of-course tests is to sample a student’s knowledge of subject-related concepts specified in
the North Carolina Standard Course of Study and to provide a global estimate of the
student’s mastery of the material in a particular content area. The mathematics end-of-course
tests (Algebra I, Geometry, and Algebra II) were developed to provide accurate
measurement of individual student knowledge and skills specified in the mathematics
component of the North Carolina Standard Course of Study.
3.2 Training for Test Administrators
The North Carolina Testing Program uses a train-the-trainer model to prepare test
administrators to administer North Carolina tests. Regional Accountability Coordinators
(RACs) receive training in test administration from NCDPI Testing Policy and
Operations staff at regularly scheduled monthly training sessions. Subsequently, the
RACs provide training on conducting a proper test administration to Local Education
Agency (LEA) test coordinators. LEA test coordinators provide training to school test
coordinators. The training includes information on the test administrators’
responsibilities, proctors’ responsibilities, preparing students for testing, eligibility for
testing, policies for testing students with special needs (students with disabilities and
students with limited English proficiency), test security (storing, inventorying, and
returning test materials), and the Testing Code of Ethics.
3.3 Preparation for Test Administration
School test coordinators must be accessible to test administrators and proctors during the
administration of secure state tests. The school test coordinator is responsible for
monitoring test administrations within the building and responding to situations that may
arise during test administrations. Only employees of the school system are permitted to
administer secure state tests. Test administrators are school personnel who have
professional training in education and the state testing program. Test administrators may
not modify, change, alter, or tamper with student responses on the answer sheets or test
books. Test administrators are to: thoroughly read the Test Administrator’s Manual prior
to actual test administration; discuss with students the purpose of the test; and read and
study the codified North Carolina Testing Code of Ethics.
3.4 Test Security and Handling Materials
Compromised secure tests result in compromised test scores. To avoid contamination of
test scores, the NCDPI maintains test security before, during, and after test administration
at both the school system level and the individual school. School systems are also
mandated to provide a secure area for storing tests. The Administrative Procedures Act
16 NCAC 6D .0302 states, in part, that
school systems shall (1) account to the department (NCDPI) for all tests
received; (2) provide a locked storage area for all tests received; (3)
prohibit the reproduction of all or any part of the tests; and (4) prohibit
their employees from disclosing the content of or discussing with students
or others specific items contained in the tests. Secure test materials may
only be stored at each individual school for a short period prior to and
after the test administration. Every effort must be made to minimize
school personnel access to secure state tests prior to and after each test
administration.
At the individual school, the principal shall account for all test materials received. As
established by APA 16 NCAC 6D .0306, the principal shall store test materials in a
secure locked area except when in use. The principal shall establish a procedure to have
test materials distributed immediately prior to each test administration. After each test
administration, the building level coordinator shall collect, count, and return all test
materials to the secure, locked storage area. Any discrepancies are to be reported to the
school system test coordinator immediately and a report must be filed with the regional
accountability coordinator.
3.5 Student Participation
The Administrative Procedures Act 16 NCAC 6D .0301 requires that all public school
students in enrolled grades for which the SBE adopts a test, including every child with
disabilities, shall participate in the testing program unless excluded from testing as
provided by 16 NCAC 6G .0305(g).
Grade 3 Pretest and End-of-Grade Mathematics Tests (Grades 3-8)
All students in membership in grade 3, including students who have been retained at
grade 3, are required to participate in the Grade 3 Mathematics Pretest. All students in
membership in grades 3-8 are required to participate in the End-of-Grade Mathematics
Tests.
High School Comprehensive Mathematics Test (Grade 10)
All students classified as tenth graders in the school system student information
management system (SIMS, NCWise, etc.) must participate in the High School
Comprehensive Mathematics Test. This also includes those students following the
Occupational Course of Study (OCS) and those who are repeating grade 10.
Algebra I, Geometry, and Algebra II End-of-Course Tests
All students, including students with disabilities, enrolled in a course for credit must be
administered the end-of-course test in the final ten days of the course. End-of-course tests
are not required for graduation; however, students enrolled for credit in a course that has
an end-of-course test must be administered the end-of-course test. Students who are
repeating the course for credit must also be administered the EOC test. The student’s
most recent test score will be used for the purpose of state accountability. In addition,
starting with the 2001-2002 school year, LEAs shall use results from all multiple-choice
EOC tests (English I; Algebra I; Biology; U.S. History; Economic, Legal, and Political
Systems; Algebra II; Chemistry; Geometry; Physics; and Physical Science) as at least
twenty-five percent of the student’s final grade for each respective course. LEAs shall
adopt policies regarding the use of EOC test results in assigning final grades.
3.6 Alternate Assessments
The North Carolina Testing Program currently offers the North Carolina Alternate
Assessment Academic Inventory (NCAAAI) and the North Carolina Alternate
Assessment Portfolio (NCAAP) as two alternate assessments for Grade 3 Pretest, the
End-of-Grade Mathematics Tests (grades 3-8), the High School Comprehensive
Mathematics Test, and End-of-Course Mathematics Tests.
The NCAAAI is an assessment process in which teachers utilize a checklist to evaluate
student performance on curriculum benchmarks in the areas of reading, mathematics
and/or writing. Student performance data are collected at the beginning of the school year
(baseline), in the middle of the school year (interim) and at the end of the school year
(summative). The NCAAAI measures competencies on the North Carolina Standard
Course of Study. The Individualized Education Program (IEP) team determines if a
student is eligible to participate in the NCAAAI.
The NCAAP is a yearlong assessment process that involves a representative and
deliberate collection of student work/information that allows the users to make judgments
about what a student knows and is able to do, and the progress that has been made in
relation to the goals specified in the student’s current IEP. The IEP team determines if the
disability of a student is a significant cognitive disability. The determination of a
significant cognitive disability is one criterion for student participation in the NCAAP.
3.7 Testing Accommodations
On a case-by-case basis where appropriate documentation exists, students with
disabilities and students with limited English proficiency may receive testing
accommodations. The need for accommodations must be documented in a current
Individualized Education Program (IEP), Section 504 Plan, or LEP Plan. The
accommodations must be used routinely during the student’s instructional program or
similar classroom assessments. For information regarding appropriate testing procedures,
test administrators who provide accommodations for students with disabilities must refer
to the most recent publication of Testing Students with Disabilities and any published
supplements or updates. The publication is available through the local school system or at
www.ncpublicschools.org/accountability/testing. Test administrators must be trained in the
use of the specified accommodations by the school system test coordinator or designee
prior to the test administration.
3.8 Students with Limited English Proficiency
Per HSP-C-005, students identified as limited English proficient shall be included in the
statewide testing program. Students identified as limited English proficient who have
been assessed on the state-identified language proficiency test as below Intermediate
High in reading may participate for up to 2 years (24 months) in U.S. schools in the
NCAAAI as an alternate assessment in the areas of reading and mathematics at grades 3
through 8 and 10 and in high school courses in which an end-of-course test is
administered. Students identified as limited English proficient who have been assessed on
the state-identified language proficiency test as below Superior, per HSP-A-011, in
writing may participate in the NCAAAI in writing for grades 4, 7, and 10 for up to 2
years (24 months) in U.S. schools. All students identified as limited English proficient
must be assessed using the state-identified language proficiency test at initial enrollment
and annually thereafter during the window of February 1 to April 30. A student who
enrolls after January 1 does not have to be retested during the same school year. Limited
English proficient students who are administered the NCAAAI shall not be assessed off
grade level. In March 2004, the State Board of Education adopted a temporary rule to
make the following changes with respect to limited English proficient students during
their first year in U.S. schools.*
*Note: First year of enrollment in U.S. schools refers to the first school year that a student has been enrolled in a
U.S. school. It does not refer to a 12-month period. If a student has been enrolled in any U.S. school prior to this
school year, the student, regardless of his/her enrollment period, would be expected to be assessed in reading and
mathematics.
Schools shall:
• continue to administer state reading and mathematics tests for LEP students who score at or above Intermediate High on the reading section of the language proficiency test during their first year in U.S. schools. Results from these assessments will be included in the ABCs and AYP.
• not require LEP students (who score below Intermediate High on the reading section of the language proficiency test) in their first year in U.S. schools to be assessed on the reading End-of-Grade tests, High School Comprehensive Test in Reading, or the NC Alternate Assessment Academic Inventory (NCAAAI) for reading.
• for purposes of determining the 95% tested rule in reading, use the language proficiency test from the spring administration for these students.
• not count mathematics results in determining AYP or ABCs performance composite scores for LEP students who score below Intermediate High on the reading section of the language proficiency test in their first year in U.S. schools.
• include students previously identified as LEP, who have exited LEP identification during the last two years, in the calculations for determining the status of the LEP subgroup for AYP only if that subgroup already met the minimum number of 40 students required for a subgroup.
3.9 Medical Exclusions
In some rare cases students may be excused from the required state tests. The process for
requesting special exceptions based on significant medical emergencies and/or conditions
is as follows:
For requests that involve significant medical emergencies and/or conditions, the LEA
superintendent or charter school director is required to submit a justification statement
that explains why the emergency and/or condition prevents participation in the respective
test administration during the testing window and the subsequent makeup
period. The request must include the name of the student, the name of the school, the
LEA code, and the name of the test(s) for which the exception is being requested.
Medical documents are not included in the request to NCDPI. The request is to be based
on information housed at the central office. The student’s records must remain
confidential. Requests must be submitted prior to the end of the makeup period for the
respective test(s). Requests are to be submitted for consideration by the LEA
superintendent or charter school director.
3.10 Reporting Student Scores
According to APA 16 NCAC 6D .0302, school systems shall, at the beginning of the
school year, provide information to students and parents or guardians advising them of
the district-wide and state mandated tests that students will be required to take during the
school year. In addition, school systems shall provide information to students and parents
or guardians to advise them of the dates the tests will be administered and how the results
from the tests will be used. Also, information provided to parents about the tests shall
include whether the State Board of Education or local board of education requires the
test. School systems shall report scores resulting from the administration of the district-wide and state-mandated tests to students and parents or guardians along with available
score interpretation information within 30 days from the generation of the score at the
school system level or receipt of the score and interpretive documentation from the
NCDPI.
At the time the scores are reported for tests required for graduation such as competency
tests and the computer skills tests, the school system shall provide information to students
and parents or guardians to advise whether or not the student has met the standard for the
test. If a student fails to meet the standard for the test, the students and parents or
guardians shall be informed of the following at the time of reporting: (1) the date(s) when
focused remedial instruction will be available and (2) the date of the next testing
opportunity.
3.11 Confidentiality of Student Test Scores
State Board of Education policy states that “any written material containing the
identifiable scores of individual students on tests taken pursuant to these rules shall not be
disseminated or otherwise made available to the public by any member of the State Board
of Education, any employee of the State Board of Education, the State Superintendent of
Public Instruction, any employee of the North Carolina Department of Public Instruction,
any member of a local board of education, any employee of a local board of education, or
any other person, except as permitted under the provisions of the Family Educational
Rights and Privacy Act of 1974, 20 U.S.C. § 1232g.”
Chapter Four: Scaling and Standard-Setting for the
North Carolina EOG and EOC Tests of Mathematics
The North Carolina EOG and EOC Tests of Mathematics scores are reported as scale scores,
achievement levels, and percentiles. Scale scores are advantageous in reporting because:
• scale scores can be used to compare test results when there have been changes in the curriculum or changes in the method of testing;
• scale scores on pretests or released test forms can be related to scale scores used on secure test forms administered at the end of the course;
• scale scores can be used to compare the results of tests that measure the same content area but are composed of items presented in different formats; and
• scale scores can be used to minimize differences among various forms of the tests.
4.1 Conversion of Raw Test Scores
Each student’s score is determined by calculating the number of items he or she answered
correctly and then converting the sum to a developmental scale score. Software developed at
the L.L. Thurstone Psychometric Laboratory at the University of North Carolina at Chapel
Hill converts raw scores (total number of items answered correctly) to scale scores using the
three IRT parameters (threshold, slope, and asymptote) for each item. The software
implements the algorithm described by Thissen and Orlando (2001, pp. 119-130). Because
different items are placed on each form of a subject’s test, unique score conversion tables are
produced for each form of a test for each grade or subject area. For example, grade 3 has three
EOG Tests of Mathematics forms. Therefore, the scanning and reporting program developed
and distributed by the NCDPI uses three scale-score conversion tables. In addition to scaled
scores, there are also standard errors of measurement associated with each. Because the EOC
Tests of Mathematics are not developmental in nature, the scales are calibrated in the norming
year to have a mean of 50 and a standard deviation of 10 for each test; otherwise, the
procedures for computing scale scores are the same as for the EOG tests.
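The Thissen and Orlando summed-score algorithm itself is not reproduced in this report, but the general idea of turning each possible raw (summed) score into an IRT-based scale score can be sketched as follows. This is a minimal illustration under simplified assumptions: the five items and their slope, threshold, and asymptote values are invented, a coarse quadrature grid stands in for the actual numerical method, and the grade 4 developmental mean (252.9) and standard deviation (10.65) are used only as an example anchor. It is not the software developed at the L.L. Thurstone Psychometric Laboratory.

import numpy as np

def p_3pl(theta, a, b, c):
    # 3PL item response function: guessing asymptote plus a logistic curve
    return c + (1.0 - c) / (1.0 + np.exp(-1.7 * a * (theta - b)))

def summed_score_table(a, b, c, scale_mean=252.9, scale_sd=10.65, n_quad=61):
    """Return an EAP-based scale score for every possible summed (raw) score."""
    theta = np.linspace(-4.0, 4.0, n_quad)          # quadrature points on the latent scale
    prior = np.exp(-0.5 * theta ** 2)               # standard normal prior (unnormalized)
    prior /= prior.sum()

    # Lord-Wingersky recursion: likelihood of each summed score at each theta
    like = np.ones((1, n_quad))
    for ai, bi, ci in zip(a, b, c):
        p = p_3pl(theta, ai, bi, ci)
        new = np.zeros((like.shape[0] + 1, n_quad))
        new[:-1] += like * (1.0 - p)                # item answered incorrectly
        new[1:] += like * p                         # item answered correctly
        like = new

    posterior = like * prior                        # one posterior over theta per summed score
    eap = (posterior * theta).sum(axis=1) / posterior.sum(axis=1)
    return scale_mean + scale_sd * eap              # map theta onto the reporting scale

# Illustrative 5-item form with made-up slope (a), threshold (b), and asymptote (c) values
a = np.array([1.0, 0.8, 1.2, 0.9, 1.1])
b = np.array([-0.5, 0.0, 0.3, 0.8, 1.2])
c = np.array([0.20, 0.25, 0.20, 0.20, 0.25])
for raw, scale in enumerate(summed_score_table(a, b, c)):
    print(raw, round(float(scale), 1))

Because different items appear on each form, running a calculation of this kind with each form's item parameters yields the form-specific score conversion tables described above.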
4.2 Constructing a Developmental Scale
The basis of a developmental scale is the specification of means and standard deviations
for scores on that scale for each grade level. In the case of the North Carolina End-of-Grade Tests of Mathematics, the grade levels ranged from the Grade 3 Pretest
(administered in the Fall to students in the third grade) through grade 8. The data from
which the scale score means are derived make use of special test forms, called linking
forms, that are administered to students in adjacent grades. The difference in performance
among grades on these forms is used to estimate the difference in proficiency among
grades.
The second edition of the North Carolina End-of-Grade Tests of Mathematics used IRT
to compute these estimates following procedures described by Williams, Pommerich, and
Thissen (1998). Table 8 shows the population means and standard deviations derived
from the Spring 2000 item calibration for the North Carolina End-of-Grade Tests of
Mathematics.
Table 8: Population means and standard deviations derived from the Spring 2000 item calibration for the
North Carolina End-of-Grade Tests of Mathematics, second edition

Grade        Mean      Population Standard Deviation
3 Pretest    234.35    9.66
Grade 3      248.27    9.86
Grade 4      252.90    10.65
Grade 5      255.99    12.78
Grade 6      259.95    11.75
Grade 7      263.36    12.46
Grade 8      267.09    12.83
The values for the developmental scale shown in Table 8 are based on IRT estimates of
differences between adjacent-grade means and ratios of adjacent-grade standard
deviations computed using the computer program MULTILOG (Thissen, 1991); the
estimates from MULTILOG were cross-checked against parallel estimates computed
using the software IRTLRDIF (Thissen & Orlando, 2001). In the computation of
estimates using either software system, the analysis of data from adjacent grades
arbitrarily sets the means and standard deviation of the population distribution of the
lower grade to values of zero (0) and one (1), respectively; the values of the mean (μ) and
standard deviation (σ) of the higher grade are estimated making use of the item response
data and the three parameter logistic IRT model (Thissen & Orlando, 2001). Table 9
shows the average difference between adjacent-grade means (μ) in units of the standard
deviation of the lower grade and ratios between adjacent-grade standard deviations
(σ) derived from the Spring 2000 item calibration for the North Carolina End-of-Grade
Tests of Mathematics. The values in Table 9 are converted into the final scale, shown in
Table 8, by setting the average scale score in grade 4 to be 252.9 with a standard
deviation of 10.65 and then computing the values for the other grades such that the
differences between the means for adjacent grades, in units of the standard deviation of
the lower grade, are the same as those shown in Table 9.
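The conversion just described can be illustrated with a short calculation: starting from the grade 4 anchor (mean 252.9, standard deviation 10.65) and applying the adjacent-grade mean differences and standard deviation ratios reported in Table 9 reproduces the Table 8 values within rounding. The sketch below is illustrative only.

# Adjacent-grade mean differences (in lower-grade SD units) and SD ratios from Table 9
mean_diff = {("3P", "3"): 1.44, ("3", "4"): 0.47, ("4", "5"): 0.29,
             ("5", "6"): 0.31, ("6", "7"): 0.29, ("7", "8"): 0.30}
sd_ratio = {("3P", "3"): 1.02, ("3", "4"): 1.08, ("4", "5"): 1.20,
            ("5", "6"): 0.92, ("6", "7"): 1.06, ("7", "8"): 1.03}

mean = {"4": 252.90}   # grade 4 anchor for the developmental scale
sd = {"4": 10.65}

# Upward from grade 4: higher mean = lower mean + difference * lower SD; higher SD = lower SD * ratio
for lo, hi in [("4", "5"), ("5", "6"), ("6", "7"), ("7", "8")]:
    mean[hi] = mean[lo] + mean_diff[(lo, hi)] * sd[lo]
    sd[hi] = sd[lo] * sd_ratio[(lo, hi)]

# Downward from grade 4 by inverting the same relationships
for lo, hi in [("3", "4"), ("3P", "3")]:
    sd[lo] = sd[hi] / sd_ratio[(lo, hi)]
    mean[lo] = mean[hi] - mean_diff[(lo, hi)] * sd[lo]

for grade in ["3P", "3", "4", "5", "6", "7", "8"]:
    print(grade, round(mean[grade], 2), round(sd[grade], 2))
# Reproduces Table 8 within rounding (e.g., grade 3: 248.27 / 9.86; grade 8: 267.10 / 12.83)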
Table 9: Average difference between adjacent-grade means in units of the standard deviation of the lower
grade and ratios between adjacent-grade standard deviations derived from the Spring 2000 item calibrations
for the North Carolina EOG Tests of Mathematics

Grades    Average Mean (μ) Difference    Average (σ) Ratio    (Useful) Replications
3P–3      1.44                           1.02                 11
3–4       0.47                           1.08                 17
4–5       0.29                           1.20                 14
5–6       0.31                           0.92                 10
6–7       0.29                           1.06                 13
7–8       0.30                           1.03                 3
The estimates shown in Table 9 derive from 3 to 17 replications of the between-grade
difference; the numbers of replications for each grade pair are also shown in Table 9.
Each replication is based on a different short embedded linking form among the item
tryout forms administered in Spring 2000. The sample size for each linking form varied
from 398 to 4,313 students in each grade. (Most sample sizes were in the planned range
of 1,300 to 1,500 students.)
The original field test design, as discussed earlier, was an embedded design and
originally called for 12 to 17 twelve-item linking forms between each pair of grades, with
sample sizes of approximately 1,500. However, due to logistical issues, some forms were
administered to larger samples, and other forms (that were delivered later) were
administered to smaller samples. In addition, the forms were not necessarily administered
to the random samples that were planned within each grade. Corrections were made for
these sampling problems in the computation of the estimates shown in Table 8. The mean
difference between grades 5 and 6 was corrected using an estimate of the regression,
across replications, of the mean difference on the new scale against the mean difference
on the first edition scale, after data analysis suggested that the matched samples in grades
5 and 6 were atypical in their performance. The mean difference between grades 7 and 8
and the standard deviation ratio for grade 5 relative to grade 4 were adjusted to smooth
the relation between those values and the corresponding values for adjacent grades.
Table 10 shows for each adjacent-grade pair the values of the average difference between
adjacent-grade means (μ) in units of the standard deviation of the lower grade and ratios
of adjacent-grade standard deviations (σ) derived from the Spring 2000 item calibration
for the North Carolina EOG Tests of Mathematics for each replication that provided
useful data. In Table 10, the values for each grade-pair are in decreasing order of the
estimate of the difference between the means. There is some variation among the
estimates across replications because some of the estimates are based on small samples
and many are based on non-random samples. When aggregated as in Table 9, however,
they yield a useful developmental scale.
Table 10: Replications of the average difference between adjacent-grade means in units of the standard
deviation of the lower grade and ratios between adjacent-grade standard deviations derived from the Spring
2000 item calibration for the North Carolina EOG Tests of Mathematics

3P–3           3–4            4–5            5–6            6–7            7–8
Mean    SD     Mean    SD     Mean    SD     Mean    SD     Mean    SD     Mean    SD
1.84   1.09    0.76   1.25    0.63   1.06    0.68   0.72    0.50   1.10    0.42   0.92
1.77   1.04    0.59   1.09    0.57   0.96    0.54   0.75    0.39   1.00    0.36   1.26
1.73   1.18    0.57   1.09    0.51   1.05    0.53   0.82    0.37   1.09    0.27   0.92
1.60   0.97    0.56   1.02    0.51   1.15    0.51   0.83    0.36   0.95
1.53   0.98    0.55   1.04    0.51   1.07    0.38   1.15    0.35   1.02
1.50   0.97    0.55   1.33    0.41   1.03    0.37   1.02    0.33   1.12
1.42   1.12    0.54   1.05    0.40   0.87    0.36   0.90    0.28   1.01
1.35   1.10    0.49   1.04    0.21   1.51    0.33   1.05    0.28   1.18
1.28   1.02    0.47   0.94    0.19   1.63    0.25   1.09    0.22   1.01
0.97   0.93    0.42   1.22    0.12   1.53    0.21   0.83    0.21   1.08
0.89   0.83    0.41   0.99    0.10   1.74                   0.21   1.01
               0.40   1.10   -0.02   1.68                   0.14   1.05
               0.40   0.97   -0.05   1.71                   0.12   1.14
               0.37   0.90   -0.09   1.57
               0.36   1.28
               0.35   1.01
               0.28   1.04
Curriculum revisions in the mathematics Standard Course of Study, adopted by the State
Board of Education in 1998, resulted in changes in test specifications and a subsequent
second edition of the North Carolina EOG and EOC Tests of Mathematics. To ensure a
continuous measure of academic performance among North Carolina students,
developmental scales from the first edition of the North Carolina EOG Tests of
Mathematics were linked to developmental scales from the second edition of the test.
4.3 Comparison with and Linkage to the First Edition Scale
The embedded nature of the Spring 2000 item calibration provided a basis for a
preliminary linkage of the second edition developmental scale with that of the first
edition. The results of that preliminary linkage were subsequently superseded by results
obtained from a special study with the data collected in Spring 2001. Table 11 shows a
comparison of the population means and standard deviations for the second edition with
the averages and standard deviations for the scale scores obtained from the operational
administration of the first edition. For ease of comparison of the two scales, Figure 4
shows the two sets of averages plotted together, with 100 subtracted from the values of
the new (second edition) scale so that the two sets span approximately the same range. The developmental scales for
the first and second editions of the EOG mathematics tests are somewhat dissimilar. The
smaller rates of change observed in the calibration data for the second edition are likely
due to incomplete implementation in the 1999–2000 academic year of the new
curriculum, which was the basis for the academic content in the second edition.
Table 11: Comparison of the population means and standard deviations for the second edition with
averages and standard deviations obtained from the operational administration of the first edition in the
Spring 2000 item calibration for the North Carolina EOG Tests of Mathematics

             First Edition                     Second Edition
Grade        Mean     Standard Deviation       Mean     Standard Deviation
3 Pretest    131.6    7.8                      234.4    9.7
Grade 3      143.5    11.1                     248.3    9.9
Grade 4      152.9    10.1                     252.9    10.7
Grade 5      159.5    10.1                     256.0    12.8
Grade 6      165.1    11.2                     260.0    11.8
Grade 7      171.0    11.5                     263.4    12.5
Grade 8      175.3    11.9                     267.1    12.8
Scoring tables for the second edition forms that were to be administered in the Spring of
2001 were constructed after those forms were assembled in February–March 2001. Using
the item parameters from the embedded item calibration and the population means and
standard deviations in Table 11, scoring tables were constructed using the procedures
described by Thissen, Pommerich, Billeaud, and Williams (1995), and Thissen and
Orlando (2001). These procedures yield tables that translate summed scores into
corresponding IRT scale scores on the developmental scale.
A side effect of the construction of those scoring tables was that the algorithm provided
IRT model-based estimates of the proportions of the (Spring 2000) item calibration
samples that would have obtained each summed score (and hence, each scale score) had
students been administered the forms assembled for Spring 2001. Those score proportions
were matched with the observed score distributions on the first-edition forms
that were included in the item tryout, yielding equipercentile equating tables (Angoff,
1982) that match scores on the second edition with the scores at the same percentiles on
the first edition.
This equipercentile matching also provided part of the basis for a preliminary translation
of the cut scores between achievement levels from the first edition to the second edition.
Additional information was also used to select the preliminary cut scores, in the form of
the consistency in the patterns of the matched cut scores between the EOG Levels I, II,
III, and IV across grades.
At the time of the preliminary linkage, values were computed based on IRT modeled
estimates of the score distributions that would have been obtained if the new second-edition forms had been administered operationally in the Spring of 2000 (which they
were not). Those values were treated as though they reflected performance that would
have happened had the new curriculum been completely implemented in 1999–2000. As
a result, preliminary estimates were put in place to accommodate the testing schedule and
associated decision-making that needed to occur prior to the Spring of 2001. The graphic
below shows the linking of the mathematics forms across seven grades, the scale of the
latent proficiency for mathematics, and the operational scale for the first edition of the
North Carolina EOG Tests of Mathematics (Williams, Pommerich, and Thissen, 1998).
Figure 4: Comparison of the growth curves for the first and second editions of the North Carolina EOG
Tests of Mathematics in the Spring 2000 item calibration
[Figure: the new (second edition) average scale scores and the old (first edition) averages plus 100, plotted by grade; y-axis: new math scale scores (and old + 100), approximately 200 to 310; x-axis: grade. Vertical lines indicate 1, 2, and 3 standard deviations on the second edition.]
4.4 Equating the Scales for the First and Second Editions of the North Carolina
EOG Tests of Mathematics
To ensure that the first and second edition scales were comparable, the scales for the first
and second editions of the test were linked using statistical moderation and the statistical
technology of equipercentile equating. Because of the uncertainty surrounding the
preliminary linkage between the scales for the first and second editions of the North
Carolina EOG Tests of Mathematics, an equating study was performed in Spring 2001. In
this study, the newly constructed second edition forms of the mathematics tests and
selected forms from the first edition were administered to spiraled samples in the context
of the item tryout of the new items to create additional second-edition forms. The purpose
of this study was to provide data for linkage of the scales on the first and second editions
using the newly constructed operational forms of the second edition of the test (which
were not available until early Spring 2001). Figure 5 shows the equipercentile equating
functions for grades 3–8 obtained using data from the equating study.
Figure 5: Equipercentile equating functions between the first and second editions of the North Carolina
EOG Tests of Mathematics scales derived from the Spring 2001 equating study for Grades 3–8
[Figure: equipercentile equating functions for Grades 3 through 8; x-axis: Second Edition Scale (approximately 200 to 320); y-axis: First Edition Scale (approximately 100 to 220).]
The 5th through the 95th percentiles (5th, 10th, 15th, … ,95th) were plotted to determine
whether the fit was linear. Because the plotted points were found to be somewhat convex, an
algorithm was applied to improve the fit. First, a line segment was placed between the
25th and 75th percentile pairs and extrapolated. When the straight line deviated from the
data, the fit was doglegged by either:
(1) placing a straight line between the 75th and 95th percentiles or the 5th and 25th
percentiles and extrapolating; or
(2) when the data point went off-scale at either end, another straight line was placed
between the 95th percentile and the data point representing the maximum scale score on
the new test and maximum scale score on the old test.
This was also done at the minimum ends of the scales between the 5th percentile and the
minimum score on both old and new tests. All fitted points on the line were rounded to
give integer score translation tables. This procedure resulted in score translation tables
that matched or closely matched percentiles for the middle 90% of the data and matched
the minimum and maximum scale scores on the two tests. The equated mean and
standard deviation were compared to see how closely they matched the observed first
edition test mean and standard deviation in the equating sample. In all cases except grade 5,
means and standard deviations were similar. For grade 5, slight adjustments were made to
improve the fit.
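A highly simplified sketch of the equipercentile idea is shown below: scores at the same percentiles in the two equating samples are paired, and a second-edition score is translated by interpolating between the paired points. The simulated samples and the plain linear interpolation are stand-ins for the operational Spring 2001 data and the dogleg fitting described above.

import numpy as np

def equipercentile_table(new_scores, old_scores, percentiles=range(5, 100, 5)):
    """Pair the scale scores that sit at the same percentiles on the two editions."""
    pct = np.array(list(percentiles))
    new_q = np.percentile(new_scores, pct)
    old_q = np.percentile(old_scores, pct)
    return list(zip(pct, new_q, old_q))

def first_edition_equivalent(score, new_scores, old_scores):
    """Translate a second-edition score to the first-edition scale by linear
    interpolation between matched percentile points (a simplification of the
    dogleg fitting described above)."""
    table = equipercentile_table(new_scores, old_scores)
    new_points = [row[1] for row in table]
    old_points = [row[2] for row in table]
    return int(round(float(np.interp(score, new_points, old_points))))

# Illustrative use with simulated samples (grade 3 means and SDs from Tables 11 and 26)
rng = np.random.default_rng(0)
new_sample = rng.normal(250.6, 7.7, 5000)    # second-edition grade 3 scores (simulated)
old_sample = rng.normal(143.5, 11.1, 5000)   # first-edition grade 3 scores (simulated)
print(first_edition_equivalent(260, new_sample, old_sample))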
4.5 Setting the Standards
For tests developed under the North Carolina Testing Program, academic achievement
standard setting, the process of determining cut scores for the different achievement
levels, is typically accomplished through the use of contrasting groups. Contrasting
groups is an examinee-based method of standard setting, which involves categorizing
students into the four achievement levels by expert judges who are knowledgeable of
students’ achievement in various domains outside of the testing situation and then
comparing these judgments to students’ actual scores. For the North Carolina
mathematics tests, North Carolina teachers were considered as expert judges under the
rationale that teachers were able to make informed judgments about students’ academic
achievement because they had observed the breadth and depth of the students’ work
during the school year.
For the North Carolina EOG academic achievement standard setting, originally
conducted for the first edition (1992), approximately 160,000 students were placed into
categories by approximately 5,000 teachers. Teachers categorized students who
participated in field testing into one of the four achievement levels with the remainder
categorized as not a clear example of any of the achievement levels. The resulting
proportions of students expected to score in each of the four achievement levels were
then applied to the first operational year to arrive at the cut scores for the first edition
North Carolina EOG Tests of Mathematics. Table 12 shows the percentage of students
classified into each achievement level by grade or course.
Table 12: Percent of students assigned to each achievement level by teachers (May, 1992, unless otherwise
specified)

Grade/Subject            Level I    Level II    Level III    Level IV
Grade 3                  12.0%      28.1%       40.6%        19.2%
Grade 4                  10.3%      27.2%       42.8%        19.6%
Grade 5                  13.0%      27.8%       40.8%        18.3%
Grade 6                  12.1%      28.1%       40.4%        19.4%
Grade 7                  12.4%      27.9%       39.8%        19.9%
Grade 8                  11.2%      28.8%       40.4%        19.6%
Algebra I (May, 1993)    14.5%      32.5%       40.4%        12.6%
Geometry (May, 1995)     17.2%      30.7%       34.3%        17.8%
When applying the contrasting groups approach to standard setting for the second edition
of the North Carolina mathematics tests, scale scores from the field test year were
distributed from lowest to highest. If the classifications for grade 3 were used as an
example, 12% of 160,000 would be 19,200 scores. Counting up to 19,200 on the
cumulative frequency distribution gives the scale score below which 19,200 students
scored. This scale score became the cut-off between Level I and Level II. The process
continued for each of the levels until all cut scores had been derived. It should be noted
that to avoid an inflation of children categorized as Level IV, the percentage categorized
as No Clear Category was removed from the cut score calculations.
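A minimal sketch of this counting procedure is given below. The simulated field-test scores are illustrative only; the operational work used the actual score distributions, and students judged to be in no clear category were removed before the proportions were applied.

import numpy as np

def contrasting_groups_cuts(scale_scores, level_proportions):
    """Derive cut scores by applying teacher-judged achievement level proportions
    to the cumulative frequency distribution of field-test scale scores."""
    scores = np.sort(np.asarray(scale_scores))
    cuts = []
    cumulative = 0.0
    for proportion in level_proportions[:-1]:       # no cut is needed above the top level
        cumulative += proportion
        index = int(round(cumulative * len(scores))) - 1
        cuts.append(int(round(scores[index])))      # scale score at that cumulative count
    return cuts                                     # [I/II cut, II/III cut, III/IV cut]

# Grade 3 illustration using the Table 12 proportions and simulated field-test scores
rng = np.random.default_rng(1)
field_test_scores = rng.normal(248.3, 9.9, 160_000)
print(contrasting_groups_cuts(field_test_scores, [0.120, 0.281, 0.406, 0.192]))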
Since the administration of the first edition (1992) and the re-norming year (1998), the
proportions of students in Level I have continued to decrease and the proportions of
students in Levels III and IV have continued to increase. For example, from 1999 to
2000, 2% fewer children were in Level I than the year before. From 2000 to 2001 there
were 1.8% fewer children in Level I than from 1999 to 2000. To continue this trend, it
was anticipated that a similar percentage of fewer children would be in Level I from 2001
to 2002. Rather than develop new standards for the second edition of the North Carolina
EOG Tests of Mathematics which would disrupt the continuous measure and reporting of
academic performance for students, the standards for the second edition were established
by maintaining the historical trends mentioned above while making use of the equated
scales. Interim academic achievement standards were set using the field test data. The
final standards were set on the operational data.
4.6 Score Reporting for the North Carolina Tests
Scores from the North Carolina mathematics tests are reported as scale scores,
achievement levels, and percentile ranks. The scale scores are computed through the use
of raw-to-scale score conversion tables. The scale score determines the achievement level
in which a student falls.
Score reports are generated at the local level to depict performance for individual
students, classrooms, schools, and local education agencies. The data can be
disaggregated by subgroups of gender and race/ethnicity, as well as other demographic
variables collected during the test administration. Demographic data are reported on
variables such as free/reduced lunch status, limited English proficient status, migrant
status, Title I status, disability status, and parents’ levels of education. The results are
reported in aggregate at the state level usually at the end of June of each year. The
NCDPI uses the data for school accountability, student accountability (grades 3, 5, and
8), and to satisfy other federal requirements under the No Child Left Behind Act of 2001.
4.7 Achievement Level Descriptors
The four achievement levels in the North Carolina Testing Program are defined below.
Table 13: Administrative Procedures Act 16 NCAC 6D .0501 (Definitions related to Student Accountability
Standards)
Achievement Levels for the North Carolina Testing Program

Level I: Students performing at this level do not have sufficient mastery of knowledge and skills in this subject area to be successful at the next grade level.
Level II: Students performing at this level demonstrate inconsistent mastery of knowledge and skills that are fundamental in this subject area and that are minimally sufficient to be successful at the next grade level.
Level III: Students performing at this level consistently demonstrate mastery of grade level subject matter and skills and are well prepared for the next grade level.
Level IV: Students performing at this level consistently perform in a superior manner clearly beyond that required to be proficient at grade level work.
4.8 Achievement Level Cut Scores
The achievement level cut scores for the North Carolina mathematics tests are shown in the
table below.
Table 14: EOG and EOC Tests of Mathematics achievement levels and corresponding scale scores

Grade/Subject    Level I     Level II    Level III    Level IV
3 Pretest        211–219     220–229     230–239      240–260
Grade 3          218–237     238–245     246–254      255–276
Grade 4          221–239     240–246     247–257      258–285
Grade 5          221–242     243–249     250–259      260–295
Grade 6          228–246     247–253     254–264      265–296
Grade 7          231–249     250–257     258–266      267–307
Grade 8          235–253     254–260     261–271      272–310
Grade 10         141–159     160–171     172–188      189–226
Algebra I        23–44       45–54       55–65        66–87
Geometry         23–45       46–56       57–66        67–87
Algebra II       23–45       46–57       58–68        69–88
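Reporting software can map a scale score to an achievement level with a simple range lookup. The sketch below uses the grade 3 ranges from Table 14; the function name and structure are illustrative.

# Grade 3 achievement level ranges from Table 14
GRADE_3_LEVELS = {
    "Level I": (218, 237),
    "Level II": (238, 245),
    "Level III": (246, 254),
    "Level IV": (255, 276),
}

def achievement_level(scale_score, levels=GRADE_3_LEVELS):
    """Return the achievement level whose scale-score range contains the score."""
    for level, (low, high) in levels.items():
        if low <= scale_score <= high:
            return level
    return None  # score outside the reported range

print(achievement_level(250))  # Level III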
4.9 Achievement Level Trends
The percentage of students in each of the achievement levels is provided below by grade.
Table 15: Achievement level trends for Grade 3 Pretest

Grade 3 Pretest   1995   1996   1997   1998   1999   2000   2001   2002   2003
Level I           *      *      6.2    5.4    4.6    3.3    2.0    1.4    1.1
Level II          *      *      23.5   23.1   20.6   19.7   18.9   15.9   14.0
Level III         *      *      40.6   41.3   41.8   41.7   43.4   42.8   43.6
Level IV          *      *      29.7   30.2   32.9   35.3   35.8   40.0   41.3
*Test not administered

Table 16: Achievement level trends for Grade 3

Grade 3     1995   1996   1997   1998   1999   2000   2001   2002   2003
Level I     9.3    7.9    6.8    7.0    6.3    5.6    4.2    3.2    1.1
Level II    25.6   24.7   23.0   24.8   23.7   22.6   22.2   19.5   10.0
Level III   39.7   39.7   39.6   39.8   40.2   40.0   43.3   43.1   45.9
Level IV    25.4   27.7   30.7   28.4   29.8   31.8   30.3   34.2   42.9

Table 17: Achievement level trends for Grade 4

Grade 4     1995   1996   1997   1998   1999   2000   2001   2002   2003
Level I     8.6    7.2    6.4    4.0    2.9    2.1    1.2    0.9    0.7
Level II    22.9   21.3   19.1   16.8   14.4   13.4   12.0   10.2   4.5
Level III   41.3   43.6   41.9   41.7   43.0   43.7   46.7   45.9   35.6
Level IV    27.2   28.0   32.7   37.6   39.6   40.8   40.0   43.0   59.1

Table 18: Achievement level trends for Grade 5

Grade 5     1995   1996   1997   1998   1999   2000   2001   2002   2003
Level I     9.4    8.5    7.1    5.8    3.8    3.8    2.2    1.7    1.1
Level II    24.1   21.5   19.8   16.1   13.7   13.3   11.2   9.8    6.4
Level III   37.3   38.0   36.2   37.8   35.5   34.3   36.6   35.3   30.7
Level IV    29.2   32.0   36.8   40.2   46.9   48.6   50.1   53.2   61.8

Table 19: Achievement level trends for Grade 6

Grade 6     1995   1996   1997   1998   1999   2000   2001   2002   2003
Level I     8.2    7.0    6.6    5.0    4.3    4.1    3.3    2.2    1.7
Level II    24.1   20.5   20.7   16.7   14.6   14.9   13.8   11.4   8.2
Level III   42.5   43.0   40.5   40.7   39.8   38.1   40.5   39.2   34.5
Level IV    25.1   29.6   32.2   37.7   41.3   42.9   42.4   47.2   55.6

Table 20: Achievement level trends for Grade 7

Grade 7     1995   1996   1997   1998   1999   2000   2001   2002   2003
Level I     8.4    9.0    8.6    5.4    3.9    4.5    3.2    2.7    2.9
Level II    24.5   22.5   20.6   17.7   13.6   14.8   15.5   14.0   13.3
Level III   38.6   38.8   36.9   38.3   37.4   35.1   33.3   32.4   31.1
Level IV    28.5   29.7   34.0   38.6   45.0   45.6   48.0   50.9   52.7

Table 21: Achievement level trends for Grade 8

Grade 8     1995   1996   1997   1998   1999   2000   2001   2002   2003
Level I     8.2    8.8    9.0    5.4    5.4    4.8    5.3    4.2    4.5
Level II    24.2   23.5   22.1   18.3   17.0   14.6   15.2   13.5   11.3
Level III   40.1   38.7   38.4   37.6   37.9   36.5   36.8   35.7   34.1
Level IV    27.5   29.1   30.5   38.7   39.7   44.1   42.7   46.6   50.1

Table 22: Achievement level trends for Grade 10 High School Comprehensive Test

Grade 10    1995   1996   1997   1998   1999   2000   2001   2002   2003
Level I     *      *      *      11.9   8.8    8.8    9.5    *      8.3
Level II    *      *      *      32.5   30.2   29.4   28.9   *      27.0
Level III   *      *      *      41.0   45.2   45.4   44.9   *      47.5
Level IV    *      *      *      14.6   15.9   16.4   16.7   *      17.2
*Test not administered

Table 23: Achievement level trends for Algebra I

Algebra I   1995   1996   1997   1998   1999   2000   2001   2002   2003
Level I     13.9   15.1   14.0   10.8   9.1    9.0    3.2    2.7    2.7
Level II    32.1   31.8   30.6   27.7   25.5   22.1   20.8   18.4   18.7
Level III   40.0   38.7   39.7   41.9   43.4   38.8   44.6   41.2   40.9
Level IV    14.1   14.4   15.8   19.6   22.0   30.1   31.5   37.7   37.7

Table 24: Achievement level trends for Geometry

Geometry    1995   1996   1997   1998   1999   2000   2001   2002   2003
Level I     *      *      *      *      10.0   9.6    4.7    4.3    3.8
Level II    *      *      *      *      30.6   30.3   31.4   29.3   26.7
Level III   *      *      *      *      37.5   36.4   42.1   41.6   41.6
Level IV    *      *      *      *      20.9   23.6   21.9   24.8   27.8
*Test not administered

Table 25: Achievement level trends for Algebra II

Algebra II  1995   1996   1997   1998   1999   2000   2001   2002   2003
Level I     *      *      *      *      10.0   9.0    2.5    2.5    1.6
Level II    *      *      *      *      31.0   28.3   24.5   21.1   19.6
Level III   *      *      *      *      36.0   35.9   40.3   39.0   39.1
Level IV    *      *      *      *      23.0   26.7   32.6   37.8   39.6
*Test not administered
4.10 Percentile Ranking
The percentile rank for each scale score is the percentage of scores less than or equal to that
score. A percentile is a score or a point on the original measurement scale. If the percentile
formula is applied to the frequency distribution of scores for grade 3 (see Appendix E for
samples of frequency distribution tables), a score of 260 would have a percentile rank of 89.
The percentile rank provides information about a student’s score on a test relative to the scores of
other students in the norming year. The percentile ranks for the scores on the North Carolina
mathematics tests are calculated based on the first operational administration of the tests. The
use of percentile rank reporting allows a meaningful comparison to be made among
mathematics scores at the total test score level.
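A minimal sketch of the percentile rank calculation is shown below. The simulated norming scores (drawn from the grade 3 mean and standard deviation in Table 26) are illustrative only; the operational ranks are based on the actual frequency distribution from the first administration, under which a grade 3 score of 260 falls at about the 89th percentile.

import numpy as np

def percentile_rank(score, norming_scores):
    """Percentage of scores in the norming distribution at or below the given score."""
    norming_scores = np.asarray(norming_scores)
    return round(100.0 * float(np.mean(norming_scores <= score)))

# Illustration with simulated grade 3 scores (mean 250.6, SD 7.7 from Table 26)
rng = np.random.default_rng(2)
simulated_grade3 = rng.normal(250.6, 7.7, 100_000)
print(percentile_rank(260, simulated_grade3))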
Chapter Five: Reports
5.1 Use of Test Score Reports Provided by the North Carolina Testing Program
The North Carolina Testing Program provides reports at the student level, school level,
and state level. The North Carolina Testing Code of Ethics dictates that educators use test
scores and reports appropriately. This means that educators recognize that a test score is
only one piece of information and must be interpreted together with other scores and
indicators. Test data help educators understand educational patterns and practices. Data
analysis of test scores for decision-making purposes should be based upon disaggregation
of data by student demographics and other student variables as well as an examination of
grading practices in relation to test scores, growth trends, and goal summaries for state
mandated tests.
5.2 Reporting by Student
The state provides scoring equipment in each school system so that administrators can
score all state-required multiple-choice tests. This scoring generally takes place within
two weeks after testing so the individual score report can be given to the student and
parent before the end of the school year.
Each student in grades 3-8 who takes the end-of-grade tests is given a “Parent/Teacher
Report.” This single sheet provides information on that student’s performance on the
reading and mathematics tests. A flyer titled, “Understanding Your Child’s EOG Score,”
is provided with each “Parent/Teacher Report.” This publication offers information for
understanding student scores as well as suggestions on what parents and teachers can do
to help students in the areas of reading and mathematics.
The student report also shows how that student’s performance compared to the average
scores for the school, the school system, and the state. A four-level achievement scale is
used for the tests.
Achievement Level I represents insufficient mastery of the subject.
Achievement Level II is inconsistent mastery of the subject.
Achievement Level III is consistent mastery and the minimum goal for students.
Achievement Level IV is superior mastery of the subject.
Students achieving at Level III or Level IV are considered to be at or above grade level.
Achievement Level III is the level students must score to be considered proficient and to
pass to the next grade under state Student Accountability Standards for grades 3, 5, and 8.
5.3 Reporting by School
Since 1997, the student performance on end-of-grade tests for each elementary and
middle school has been released by the state through the ABCs of School Accountability.
High school student performance began to be reported in 1998 in the ABCs of School
Accountability. For each school, parents and others can see the actual performance for
groups of students at the school in reading, mathematics, and writing; the percentage of
students tested; whether the school met or exceeded goals that were set for it; and the
status designated by the state.
Some schools that do not meet their goals and that have low numbers of students
performing at grade level receive help from the state. Other schools, where goals have
been reached or exceeded, receive bonuses for the certified staff and teacher assistants in
that school. Local school systems received their first results under No Child Left Behind
(NCLB) in July 2003 as part of the state’s ABCs accountability program. Under NCLB,
each school is evaluated according to whether or not it met Adequate Yearly Progress
(AYP). AYP is not only a goal for the school overall, but also for each subgroup of
students in the school. Every subgroup must meet its goal for the school to meet AYP.
AYP is only one part of the state’s ABCs accountability model. Complete ABCs results
are released in September and show how much growth students in every school made as
well as the overall percentage of students who are proficient. The ABCs report is
available on the Department of Public Instruction web site at
http://abcs.ncpublicschools.org/abcs/. School principals can also provide information
about the ABCs report to parents.
5.4 Reporting by the State
The state reports information on student performance in various ways. The North
Carolina Report Cards provide information about K-12 public schools (including charters
and alternative schools) for schools, school systems, and the state. Each report card
includes a school or district profile and information about student performance, safe
schools, access to technology, and teacher quality.
As a participating state in the National Assessment of Educational Progress (NAEP),
North Carolina student performance is included in annual reports released nationally on
selected subjects. The state also releases state and local SAT scores each summer.
Chapter Six: Descriptive Statistics and Reliability
6.1 Descriptive Statistics for the First Operational Administration of the Tests
The second editions of the EOG and EOC Tests of Mathematics were administered for
the first time in the spring of 2001. Descriptive statistics for the North Carolina Tests of
Mathematics’ first operational year and operational administration population
demographics are provided below.
6.2 Means and Standard Deviations for the First Operational Administration of the
Tests
Table 26: Descriptive statistics by grade for the 2001 administration of the North Carolina EOG Tests of
Mathematics and the 1998 administration of the Grade 10 High School Comprehensive Test

Grade        N          Mean     Standard Deviation
3 Pretest    102,484    237.4    7.7
Grade 3      102,172    250.6    7.7
Grade 4      100,418    255.8    8.3
Grade 5      100,252    260.0    9.6
Grade 6      100,409    263.2    9.9
Grade 7      97,205     267.1    10.6
Grade 8      93,603     270.0    11.0
Grade 10     73,635     174.3    13.5
Table 27: Mean scale score for the 2001 administration of the North Carolina EOC Mathematics tests

Subject       N         Mean
Algebra I     93,116    61
Geometry      65,515    57
Algebra II    54,909    65
6.3 Population Demographics for the First Operational Administration
Table 28: Population demographics for the 2001 administration of the North Carolina EOG and EOC Tests
of Mathematics and the 1998 administration of the Grade 10 High School Comprehensive Test

Grade/Subject   N         % Male   % Female   % American Indian   % Black   % White   % Other   % LEP
3 Pretest       102,484   51.0     49.0       1.5                 29.5      57.7      11.3      2.1
Grade 3         102,172   51.3     48.7       1.4                 30.9      60.0      7.7       2.8
Grade 4         100,418   51.4     48.6       1.4                 30.6      61.0      7.0       2.3
Grade 5         100,252   50.7     49.3       1.4                 30.2      61.7      6.6       2.2
Grade 6         100,409   51.0     49.0       1.3                 31.0      61.4      6.3       1.8
Grade 7         97,205    50.4     49.6       1.4                 33.7      58.9      6.1       1.5
Grade 8         93,603    50.9     49.1       1.3                 30.8      61.7      6.1       1.8
Grade 10        73,635    48.9     51.1       1.3                 27.5      66.7      4.5       0.6
Algebra I       93,116    49.8     50.2       1.3                 28.6      64.1      7.0       0.7
Geometry        65,515    46.4     53.6       1.1                 22.8      70.7      5.5       0.4
Algebra II      54,909    47.0     53.1       1.1                 25.0      68.4      5.5       0.4
6.4 Scale Score Frequency Distributions
The following figures present the frequency distributions of the developmental scale
scores from the first statewide administration of the North Carolina EOG and EOC Tests
of Mathematics. The frequency distributions are not smooth because of the conversion
from raw scores to scale scores. Due to rounding in the conversion process, two raw scores
in the middle of the distribution sometimes convert to the same scale score, resulting
in the appearance of a spike at that particular scale score.
Figure 6: Math Scale Score Frequency Distribution Grade 3
[Figure: 2001 math scale score distribution for Grade 3 (n = 102,172); x-axis: scale score; y-axis: number of students.]
Figure 7: Math Scale Score Frequency Distribution Grade 4
[Figure: 2001 math scale score distribution for Grade 4 (n = 100,418); x-axis: scale score; y-axis: number of students.]
Figure 8: Math Scale Score Frequency Distribution Grade 5
[Figure: 2001 math scale score distribution for Grade 5 (n = 100,252); x-axis: scale score; y-axis: number of students.]
Figure 9: Math Scale Score Frequency Distribution Grade 6
[Figure: 2001 math scale score distribution for Grade 6 (n = 100,409); x-axis: scale score; y-axis: number of students.]
Figure 10: Math Scale Score Frequency Distribution Grade 7
[Figure: 2001 math scale score distribution for Grade 7 (n = 97,205); x-axis: scale score; y-axis: number of students.]
Figure 11: Math Scale Score Frequency Distribution Grade 8
[Figure: 2001 math scale score distribution for Grade 8 (n = 93,603); x-axis: scale score; y-axis: number of students.]
Figure 12: Algebra I Scale Score Frequency Distribution
[Figure: 2001 Algebra I scale score distribution (n = 93,116); x-axis: scale score; y-axis: number of students.]
Figure 13: Geometry Scale Score Frequency Distribution
[Figure: 2001 Geometry scale score distribution (n = 65,515); x-axis: scale score; y-axis: number of students.]
Figure 14: Algebra II Scale Score Frequency Distribution
[Figure: 2001 Algebra II scale score distribution (n = 54,909); x-axis: scale score; y-axis: number of students.]
6.5 Reliability of the North Carolina Mathematics Tests
Reliability refers to the consistency of a measure when the testing procedure is repeated
on a population of individuals or groups. If any use is to be made of the information from
a test, that information should be stable, consistent, and dependable; in other words, the
test results must be reliable. If decisions about individuals are to be made on the basis of
test data, then it is desirable that the test results be reliable and that tests exhibit a
reliability coefficient of at least 0.85.
There are three broad categories of reliability coefficients recognized as appropriate
indices for establishing reliability in tests: (a) coefficients derived from the administration
of parallel forms in independent testing sessions (alternate-form coefficients); (b)
coefficients obtained by administration of the same instrument on separate occasions
(test-retest or stability coefficients); and (c) coefficients based on the relationships among
scores derived from individual items or subsets of the items within a test, all data
accruing from a single administration of the test. The last coefficient is known as an
internal consistency coefficient (Standards for Educational and Psychological Testing,
AERA, APA, NCME, 1985, p.27). An internal consistency coefficient, coefficient alpha,
is the metric used to establish reliability for the North Carolina EOG and EOC Tests of
Mathematics.
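Coefficient alpha can be computed directly from a students-by-items matrix of scored (0/1) responses, as sketched below. The simulated response matrix is illustrative only; the operational values in the tables that follow were computed from the actual item-level data.

import numpy as np

def coefficient_alpha(item_scores):
    """Cronbach's (coefficient) alpha from a students x items matrix of 0/1 item scores."""
    item_scores = np.asarray(item_scores, dtype=float)
    n_items = item_scores.shape[1]
    item_variances = item_scores.var(axis=0, ddof=1)
    total_variance = item_scores.sum(axis=1).var(ddof=1)
    return (n_items / (n_items - 1.0)) * (1.0 - item_variances.sum() / total_variance)

# Illustration with simulated responses from a hypothetical 80-item form
rng = np.random.default_rng(3)
ability = rng.normal(size=2000)
difficulty = rng.normal(size=80)
prob_correct = 1.0 / (1.0 + np.exp(-(ability[:, None] - difficulty[None, :])))
responses = (rng.random((2000, 80)) < prob_correct).astype(int)
print(round(coefficient_alpha(responses), 2))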
6.6 Internal Consistency of the North Carolina Mathematics Tests
The following table presents the coefficient alpha indices averaged across forms.
Table 29: Reliability indices averaged across North Carolina EOG and EOC Tests of Mathematics forms

Grade/Subject    Average Coefficient Alpha
3 Pretest*       0.82
Grade 3          0.96
Grade 4          0.96
Grade 5          0.95
Grade 6          0.96
Grade 7          0.95
Grade 8          0.94
Grade 10         0.94
Algebra I        0.94
Geometry         0.94
Algebra II       0.88
*The grade 3 pretest is 40 items (half of the total number of items on the grade 3 test).
As noted above, the North Carolina EOG and EOC Tests of Mathematics are highly
reliable as a whole. In addition, it is important to note that this high degree of reliability
extends across gender, ethnicity, LEP status, and disability. Looking at coefficients alpha
for the different groups reveals that in all test forms for mathematics tests, including EOG
tests, the math section of the high school comprehensive test, and Algebra I, 87% of the
values were at or above 0.94 and all were above 0.91.
Table 30: Reliability indices averaged across North Carolina EOG and EOC Test of Mathematics forms
(Gender)

Grade / Subject    Females    Males
Grade 3            0.96       0.95
Grade 4            0.95       0.96
Grade 5            0.95       0.95
Grade 6            0.95       0.96
Grade 7            0.94       0.95
Grade 8            0.95       0.95
Grade 10           0.94       0.95
Algebra I          0.94       0.95
Table 31: Reliability indices averaged across North Carolina EOG and EOC Test of Mathematics forms
(Ethnicity)

Grade / Subject   Asian   Black   Hispanic   Native American   Multi-Racial   White
Grade 3           0.97    0.95    0.97       0.95              0.95           0.95
Grade 4           0.97    0.94    0.97       0.95              0.94           0.95
Grade 5           0.97    0.94    0.97       0.94              0.95           0.95
Grade 6           0.97    0.94    0.97       0.95              0.95           0.95
Grade 7           0.95    0.94    0.92       0.93              0.93           0.94
Grade 8           0.96    0.96    0.96       0.94              0.95           0.95
Grade 10          0.96    0.93    0.94       0.93              0.94           0.94
Algebra I         0.95    0.91    0.94       0.93              0.94           0.94
Table 32: Reliability indices averaged across North Carolina EOG and EOC Tests of Mathematics forms (Other Characteristics)

Grade/Subject    No Disability    Disability    Not LEP    LEP
Grade 3          0.94             0.97          0.96       0.98
Grade 4          0.94             0.97          0.95       0.98
Grade 5          0.94             0.97          0.95       0.97
Grade 6          0.95             0.96          0.96       0.97
Grade 7          0.94             0.94          0.95       0.96
Grade 8          0.94             0.95          0.95       0.96
Grade 10         0.94             0.94          0.94       0.94
Algebra I        0.94             0.92          0.94       0.94
Although the North Carolina Testing Program administers alternate forms of the test, it is not possible to calculate alternate-forms reliabilities within the context of a natural test setting. Students take the test once, and only those students in grades 3, 5, and 8 who do not achieve Level III are required to retake it. Thus, the natural population of re-testers has a sharp restriction in range, which would lower the observed correlation. Additionally, North Carolina students are extremely test-wise; attempting a special test-retest reliability study with this population, in which one of the administrations carries no stakes for the student, would give questionable results.
6.7 Standard Error of Measurement
The information provided by the standard error of measurement (SEM) for a given score
is important because it assists in determining the accuracy of an examinee’s obtained
score. It allows a probabilistic statement to be made about an individual's test score. For example, if a score of 100 has an SEM of plus or minus 2 points, then one can conclude that the obtained score of 100 is accurate to within plus or minus 2 points at a 68% level of confidence. In other words, a 68% confidence interval for a score of 100 is 98–102: if that student were retested, his or her score would be expected to fall in the range 98–102 about 68% of the time.
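The arithmetic behind such a statement is straightforward. The short Python sketch below is illustrative only (the standard deviation and reliability values are not taken from this report); it computes the classical SEM from a form's score standard deviation and reliability and builds the corresponding confidence interval.

import math

def standard_error_of_measurement(sd, reliability):
    """Classical SEM: the score standard deviation times sqrt(1 - reliability)."""
    return sd * math.sqrt(1.0 - reliability)

def confidence_interval(score, sem, z=1.0):
    """Interval around an observed score; z = 1.0 gives roughly 68% coverage,
    z = 1.96 gives roughly 95% coverage."""
    return score - z * sem, score + z * sem

sem = standard_error_of_measurement(sd=10.0, reliability=0.95)  # about 2.2 points
print(confidence_interval(100, sem))                            # roughly (97.8, 102.2)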
The ranges of the standard error of measurement for scores on the North Carolina EOG and EOC Tests of Mathematics are provided in Table 33 below. For students with scores within 2 standard deviations of the mean (95% of students), standard errors are typically 2 to 3 points. For most of the EOG Tests of Mathematics scale scores, the standard error of measurement in the middle range of scores, particularly at the cut point between Level II and Level III, is 2 to 3 points. Scores at the lower and higher ends of the scale (above the 97.5th percentile and below the 2.5th percentile) have standard errors of measurement of approximately 4 to 6 points. This is typical: measurement precision decreases as scores become more extreme.
Table 33: Ranges of standard error of measurement for scale scores by grade or subject

Grade/Subject    Standard Error of Measurement (Range)
3 Pretest        3–6
Grade 3          2–5
Grade 4          2–6
Grade 5          2–6
Grade 6          2–6
Grade 7          2–6
Grade 8          2–6
Grade 10         3–8
Algebra I        2–6
Geometry         2–5
Algebra II       3–7
Additionally, standard error curves are presented in the following figures. In each figure the x-axis is the θ estimate (the estimate of the test-taker's true ability) for examinees, expressed on a standardized (mean 0, standard deviation 1) scale.
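As context for reading these figures, the standard error at a given θ is the reciprocal square root of the test information at that θ. The following Python sketch computes such a curve under a three-parameter logistic model of the kind described in the glossary; the item parameters are randomly generated and purely illustrative, not the operational item pool.

import numpy as np

def p_3pl(theta, a, b, c):
    """Three-parameter logistic probability of a correct response."""
    return c + (1.0 - c) / (1.0 + np.exp(-1.7 * a * (theta - b)))

def standard_error_curve(theta, a, b, c):
    """Test standard error at each theta: 1 / sqrt(sum of item information)."""
    p = p_3pl(theta[:, None], a, b, c)
    info = (1.7 * a) ** 2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2
    return 1.0 / np.sqrt(info.sum(axis=1))

# Hypothetical 60-item form: slopes, thresholds, and asymptotes are invented
rng = np.random.default_rng(0)
a, b, c = rng.uniform(0.8, 1.6, 60), rng.normal(0.0, 1.0, 60), np.full(60, 0.15)
theta = np.linspace(-3, 3, 13)
print(np.round(standard_error_curve(theta, a, b, c), 3))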
Figure 15: Standard Errors of Measurement on the Grade 3 Pretest of Mathematics Test forms
Figure 16: Standard Errors of Measurement on the Grade 3 Mathematics Test forms
Figure 17: Standard Errors of Measurement on the Grade 4 Mathematics Test forms
Figure 18: Standard Errors of Measurement on the Grade 5 Mathematics Test forms
Figure 19: Standard Errors of Measurement on the Grade 6 Mathematics Test forms
Figure 20: Standard Errors of Measurement on the Grade 7 Mathematics Test forms
Figure 21: Standard Errors of Measurement on the Grade 8 Mathematics Test forms
Figure 22: Standard Errors of Measurement on the Grade 10 Mathematics Test forms
Figure 23: Standard Errors of Measurement on the Algebra I Test forms
Figure 24: Standard Errors of Measurement on the Geometry Test forms
Figure 25: Standard Errors of Measurement on the Algebra II Test forms
6.8 Equivalency of Test Forms
North Carolina administers multiple forms of each test during each testing cycle. This
serves several purposes. First, it allows North Carolina to fully test the breadth and depth
of each curriculum. The curricula are extremely rich, and administering a single form that
fully addressed each competency would be prohibitively long. Additionally, the use of
multiple forms reduces the opportunity for one student to copy from another.
The tests are parallel in terms of content coverage at the goal level. That is, each form has
the same number of items from the Number Sense, Numeration, and Numerical
Operations strand (Goal 1) as every other form administered in that grade. The specific
questions asked on each form are a random domain sample of the topics in that grade’s
goals, although care is taken to not overemphasize a particular topic on a single test form.
The tests are statistically equivalent at the total test score level. Additionally, the two
parts of the mathematics tests, Calculator Active and Calculator Inactive, are also
equivalent at the whole-score level. That is, all the Calculator Active portions of the tests
for a given grade are equally difficult. However, due to the purposively random selection
of items tested in each goal, the tests are not statistically equated at the goal level.
The use of multiple equivalent and parallel forms has given rise to several “urban
legends,” foremost among which is that “The red form is harder” (referring to the color of
the front cover of one of the three test booklets). However, as the following figures show,
the tests are indeed equivalent.
Figure 26: Test Characteristic Curves for the Grade 3 Pretest of Mathematics Test forms
Figure 27: Test Characteristic Curves for the Grade 3 Mathematics Test forms
Figure 28: Test Characteristic Curves for the Grade 4 Mathematics Test forms
Figure 29: Test Characteristic Curves for the Grade 5 Mathematics Test forms
Figure 30: Test Characteristic Curves for the Grade 6 Mathematics Test forms
Figure 31: Test Characteristic Curves for the Grade 7 Mathematics Test forms
Figure 32: Test Characteristic Curves for the Grade 8 Mathematics Test forms
Figure 33: Test Characteristic Curves for the Grade 10 Mathematics Test forms
Figure 34: Test Characteristic Curves for the Algebra I Test forms
Figure 35: Test Characteristic Curves for the Geometry Test forms
Figure 36: Test Characteristic Curves for the Algebra II Test forms
For each grade’s set of test forms, the test characteristic curves are very nearly coincident
for much of the range of θ. Slight variations appear in the test curves at the extremes, as
the tests were designed to have maximum sensitivity in the middle of the range of
examinee ability.
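A test characteristic curve is simply the expected number-correct score at each θ, obtained by summing the item response functions for the items on a form. The following Python sketch, with randomly generated and purely illustrative item parameters, shows the computation for two hypothetical forms assembled from the same pool; forms built to the same statistical targets produce curves that are nearly coincident through the middle of the θ range.

import numpy as np

def p_3pl(theta, a, b, c):
    """Three-parameter logistic probability of a correct response."""
    return c + (1.0 - c) / (1.0 + np.exp(-1.7 * a * (theta - b)))

def test_characteristic_curve(theta, a, b, c):
    """Expected number-correct score at each theta for one form."""
    return p_3pl(theta[:, None], a, b, c).sum(axis=1)

theta = np.linspace(-3, 3, 13)
for seed in (1, 2):   # two hypothetical 60-item forms drawn from the same pool
    r = np.random.default_rng(seed)
    a, b, c = r.uniform(0.8, 1.6, 60), r.normal(0.0, 1.0, 60), np.full(60, 0.15)
    print(np.round(test_characteristic_curve(theta, a, b, c), 1))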
Chapter Seven: Evidence of Validity
7.1 Evidence of Validity
The validity of a test is the degree to which evidence and theory support the interpretation of
test scores. Validity provides a check on how well a test fulfills its function. For all forms of
test development, the validity of the test is an issue to be addressed from the first stage of
development through analysis and reporting of scores. The process of validation involves
accumulating evidence to provide a sound scientific basis for the proposed test score
interpretations. It is these interpretations of test scores that are evaluated, rather than the test itself.
Validation, when possible, should include several types of evidence and the quality of the
evidence is of primary importance (AERA, APA, NCME, 1985). For the North Carolina EOG
and EOC Tests of Mathematics, evidence of validity is provided through content relevance
and relationship of test scores to other external variables.
7.2 Content Validity
Evidence of content validity begins with an explicit statement of the constructs or
concepts being measured by the proposed test. The constructs or concepts measured by
the North Carolina EOG Tests of Mathematics are categorized by four basic strands:
Number Sense, Numeration, and Numerical Operations; Spatial Sense, Measurement, and
Geometry; Patterns, Relationships, and Functions; and Data, Probability, and Statistics.
All items developed for the North Carolina EOG Tests of Mathematics are written to
measure those four constructs.
Algebra I, Algebra II, and Geometry comprise the EOC Tests of Mathematics. These
tests measure the different levels of mathematics knowledge specific to the three areas
with particular focus on assessing students’ ability to process information and engage in
higher order thinking.
For test specification summaries, see Appendix B.
Almost all of the items are written by North Carolina teachers and other educators. Many
of the first round of the second edition math items were written under a contract with a
major testing company to handle the logistics, but the contract specified that at least half
of the items be written by teachers from North Carolina. During the additional field tests,
the vast majority of the items were written by North Carolina educators.
Additionally, all items written are reviewed by at least two content-area teachers from
North Carolina, and the state’s teachers are involved in other aspects of item development
and test review. North Carolina educators not only deliver the Standard Course of Study every day in their classrooms; they are also the most familiar with the ways in which students learn and understand the material. Thus, North Carolina teachers are best able to recognize questions that not only match the Standard Course of Study for their particular course or grade, but are also relevant and comprehensible to the students at that level.
Instructional Validity
DPI routinely administers questionnaires to teachers in an effort to evaluate the validity and
appropriateness of the North Carolina End-of-Grade and End-of-Course Tests of
Mathematics. Teachers are asked to evaluate the following statements using a five-point scale,
with the highest score being “to a superior degree,” and the lowest score being “not at all.”
1. The test content reflects the goals and objectives of the Grade X Mathematics
curriculum as outlined on the enclosed list of Grade X Mathematics objectives.
2. The test content reflects the goals and objectives of the Grade X Mathematics
curriculum as Grade X is taught in my school or school system.
3. The items are clearly and concisely written, and the vocabulary is appropriate to the
target age level.
4. The content is balanced in relation to ethnicity, race, sex, socioeconomic status, and
geographic districts of the state.
5. Each of the items has one and only one answer that is best; however, the distractors
appear plausible for someone who has not achieved mastery of the represented
objective.
In the most recent administrations, responses to statements reflect that the tests generally met
these criteria to a “superior” or “high” degree. All tests and grades showed similar patterns in
their responses; the results shown below are in aggregate.
Table 34: Instructional Validity of the content of the North Carolina EOG Tests of Mathematics

Statement    % indicating to a superior or high degree
1            85%
2            58%
3            55%
4            85%
5            48%
7.3 Criterion-Related Validity
Analysis of the relationship of test scores to variables external to the test provides another
important source of validity evidence. External variables may include measures of some
criteria that the test is expected to predict, as well as relationships to other tests
hypothesized to measure the same constructs.
Criterion-related validity of a test indicates the effectiveness of a test in predicting an
individual’s behavior in a specific situation. The criterion for evaluating the performance
of a test can be measured at the same time (concurrent validity) or at some later time
(predictive validity).
For the North Carolina EOG and EOC Tests of Mathematics, teachers’ judgments of
student achievement, expected grade, and assigned achievement levels all serve as
sources of evidence of concurrent validity. The Pearson correlation coefficient is used to
provide a measure of association between the scale score and those variables listed above.
The correlation coefficients for the North Carolina EOG and EOC Tests of Mathematics
range from 0.49 to 0.89, indicating moderate to strong correlations between EOG scale scores and their associated variables.* The tables below provide the Pearson correlation
coefficients for variables used to establish criterion-related validity for the North Carolina
EOG and EOC Tests of Mathematics.
*Note: By comparison, the uncorrected correlation coefficient between SAT score and freshman year
grades in college is variously reported as 0.35 to 0.55 (Camera & Echternacht, 2000).
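For reference, the coefficients reported in Tables 35 and 36 are ordinary Pearson product-moment correlations. The sketch below shows the computation on a handful of made-up score pairs; the data and the numeric coding of expected grades are hypothetical and purely illustrative.

import numpy as np

def pearson_r(x, y):
    """Pearson product-moment correlation between two paired variables."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    xd, yd = x - x.mean(), y - y.mean()
    return (xd * yd).sum() / np.sqrt((xd ** 2).sum() * (yd ** 2).sum())

# Hypothetical paired data: scale scores and expected grades coded A=4 ... F=0
scale_scores    = [241, 256, 262, 249, 270, 255, 238, 265]
expected_grades = [1, 3, 3, 2, 4, 3, 1, 4]
print(round(pearson_r(scale_scores, expected_grades), 2))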
Table 35: Pearson correlation coefficient table for variables used to establish criterion-related validity for the North Carolina EOG Tests of Mathematics

Variable pair                                                           Grade 3   Grade 4   Grade 5   Grade 6   Grade 7   Grade 8
Teacher Judgment of Achievement Level by Assigned Achievement Level    0.59      0.55      0.54      0.58      0.55      0.58
Teacher Judgment of Achievement by Expected Grade                      0.70      0.70      0.67      0.63      0.61      0.60
Teacher Judgment of Achievement by Math Scale Score                    0.65      0.61      0.63      0.64      0.62      0.62
Assigned Achievement Level by Expected Grade                           0.65      0.61      0.57      0.54      0.49      0.49
Expected Grade by Math Scale Score                                     0.69      0.68      0.67      0.61      0.58      0.56
Table 36: Pearson correlation coefficient table for variables used to establish criterion-related validity for the North Carolina EOC Tests of Mathematics

Variable pair                                                      Algebra I   Geometry   Algebra II
Assigned Achievement Level by Expected Grade                       0.57        0.60       0.54
Teacher Judgment of Achievement by Assigned Achievement Level      0.54        0.55       0.48
Expected Grade by Math Scale Score                                 0.62        0.64       0.58
Teacher Judgment of Achievement by Math Scale Score                0.58        0.59       0.53
The variables used in the tables above are as follows.
• Teacher Judgment of Achievement: Teachers were asked, for each student participating in the test, to evaluate the student's absolute ability, external to the test, based on their knowledge of their students' achievement. The categories that teachers could use correspond to the achievement level descriptors mentioned previously on page 49.
• Assigned Achievement Level: The achievement level assigned to a student based on his or her test score, based on the cut scores previously described on page 49.
• Expected Grade: Teachers were also asked to provide for each student the letter grade that they anticipated each student would receive at the end of the grade or course.
• Math Scale Score: The converted raw-score-to-scale-score value obtained by each examinee.
DPI found moderate to strong correlations between scale scores in mathematics and
variables such as teachers’ judgment of student achievement, expected grade, and
assigned achievement levels (all measures of concurrent validity). The department also
found generally low correlations among these scale scores and variables external to the
test such as gender, limited English proficiency, and disability for grades 3 through 8, the
High School Comprehensive Test of Mathematics (grade 10), and Algebra I. The vast
majority of the correlations between scale scores and gender or limited English proficiency status
were less extreme than ± 0.10, and most of the correlations between scale scores and
disability status were less extreme than ± 0.30. None of these relationships approached
the levels recorded for the selected measures of concurrent validity. These generalizations
held across the full range of forms administered by DPI for all the grades and subject
areas.
An additional source of concurrent validity is the trend between students’ progress on the
National Assessment of Educational Progress (NAEP) and their progress on end-of-grade
scores. Although the scores themselves cannot and should not be compared directly, nor
is it valid to compare the percent “proficient” on each test, the trends show corresponding
increases in both NAEP math scores and scores on the North Carolina EOG tests in
mathematics.
Figures 37 through 40 show the trends for students who scored at the "basic" or "proficient" level on NAEP assessments in grades 4 and 8, compared to students who scored at Level III or above on the North Carolina End-of-Grade Tests of Mathematics in grades 4 and 8.
Figure 37: Comparison of NAEP "proficient" scores and North Carolina End-of-Grade Tests of Mathematics scores for Grade 4
Figure 38: Comparison of NAEP "basic" scores and North Carolina End-of-Grade Tests of Mathematics scores for Grade 4
Figure 39: Comparison of NAEP "proficient" scores and North Carolina End-of-Grade Tests of Mathematics scores for Grade 8
Figure 40: Comparison of NAEP "basic" scores and North Carolina End-of-Grade Tests of Mathematics scores for Grade 8
Chapter Eight: Quality Control Procedures
Quality control procedures for the North Carolina testing program are implemented
throughout all stages of testing. This includes quality control for test development, test
administration, score analysis, and reporting.
8.1 Quality Control Prior to Test Administration
Once test forms have been assembled, they are reviewed by a panel of subject experts.
After the review panel has approved a test form, the form is configured to go through the printing process. Printers send a blue-lined form back to NCDPI Test Development staff to review and adjust if necessary. Once all test answer sheets and
booklets are printed, the test project manager conducts a spot check of test booklets to
ensure that all test pages are included and test items are in order.
8.2 Quality Control in Data Preparation and Test Administration
Student background information must be coded before testing begins. The school system
may elect to: (1) pre-code the answer sheets, (2) direct the test administrator to
code the Student Background Information, or (3) direct the students to code the Student
Background Information. For the North Carolina multiple-choice tests, the school system
may elect to pre-code some or all of the Student Background Information on SIDE 1 of
the printed multiple-choice answer sheet. The pre-coded responses come from the
schools’ SIMS/NCWISE database. Pre-coded answer sheets provide schools with the
opportunity to correct or update information in the SIMS/NCWISE database. In such
cases, the test administrator ensures that the pre-coded information is accurate. The test
administrator must know what information will be pre-coded on the student answer
sheets to prepare for the test administration. Directions for instructing students to check
the accuracy of these responses are located in test administrator manuals. All corrections
for pre-coded responses are provided to a person designated by the school system test
coordinator to make such corrections. The students and the test administrator must not
change, alter, or erase pre-coding on students’ answer sheets. To ensure that all students
participate in the required tests and to eliminate duplications, students, regardless of
whether they take the multiple-choice test or an alternate assessment, are required to
complete the student background information on the answer sheets.
When tests and answer sheets are received by the local schools, they are kept in a locked,
secure location. Class rosters are reviewed for accuracy by the test administrator to
ensure that students receive their answer sheets. During test administration at the school
level, proctors and test administrators circulate throughout the test facility (typically a
classroom) to ensure that students are using the bubble sheets correctly. Once students
have completed their tests, answer sheets are reviewed and where appropriate cleaned by
local test coordinators (removal of stray marks, etc.).
8.3 Quality Control in Data Input
All answer sheets are then sent from individual schools to the Local Test Coordinator and are scanned in a secure facility. The use of a scanner provides the opportunity
to program in a number of quality control mechanisms to ensure that errors overlooked in
the manual check of data are identified and resolved. For example, if the answer sheet is
unreadable by the scanner, the scanner stops the scan process until the error is resolved.
In addition, if a student bubbles in two answers for the same question, the scan records
the student’s answer as a (*) indicating that the student has answered twice.
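The double-bubble rule described above can be stated compactly. The sketch below is a hypothetical illustration of that rule only; it is not the scanner's actual software, and the function name and data layout are invented.

def scan_item_response(bubbles):
    """Collapse the bubbled choices for one item into a single recorded response.

    bubbles maps each choice ("A"-"D") to whether it was filled in. Returns the
    single filled choice, "*" when more than one choice is filled (a double
    bubble), or "" when the item was left blank.
    """
    filled = [choice for choice, marked in bubbles.items() if marked]
    if len(filled) > 1:
        return "*"
    return filled[0] if filled else ""

# Hypothetical scanned item in which the student filled in both B and C
print(scan_item_response({"A": False, "B": True, "C": True, "D": False}))  # prints "*"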
8.4 Quality Control of Test Scores
Once all tests are scanned, they are then sent through a secure system to the Regional
Accountability Coordinators who check to ensure that all schools in all LEAs have
completed and returned student test scores. The Regional Accountability Coordinators
also conduct a spot check of data and then send the data through a secure server to the
North Carolina Department of Public Instruction Division of Accountability Services.
Data are then imported into a file and cleaned. When a portion of the data are in, NCDPI
runs a CHECK KEYS program to flag areas where answer keys may need a second
check. In addition, as data come into the NCDPI Division of Accountability Services,
Reporting Section staff import and clean data to ensure that individual student files are
complete.
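This report does not describe the internals of the CHECK KEYS program, so the following is only a generic key-checking heuristic, sketched to illustrate the kind of flag such a program can raise: an item is flagged when the keyed option is not the option chosen most often by the highest-scoring examinees. All names and data are hypothetical, and this is not NCDPI's actual procedure.

import numpy as np

def flag_suspect_keys(responses, keys, top_fraction=0.27):
    """Return indices of items whose keyed answer is not the modal choice
    among the highest-scoring examinees (a generic mis-key screen)."""
    responses = np.asarray(responses)
    keys = np.asarray(keys)
    totals = (responses == keys).sum(axis=1)                  # provisional number-correct
    cutoff = np.quantile(totals, 1.0 - top_fraction)
    top = responses[totals >= cutoff]                         # highest-scoring group
    flags = []
    for j, key in enumerate(keys):
        choices, counts = np.unique(top[:, j], return_counts=True)
        if choices[counts.argmax()] != key:
            flags.append(j)
    return flags

# Hypothetical data: four examinees, three items, keys A, B, C
resp = [["A", "B", "C"], ["A", "B", "D"], ["A", "C", "C"], ["B", "B", "C"]]
print(flag_suspect_keys(resp, ["A", "B", "C"]))               # [] means no flags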
8.5 Quality Control in Reporting
Scores can only be reported at the school level after NCDPI issues a certification
statement. This is to ensure that school, district, and state-level quality control procedures
have been employed. The certification statement is issued by the NCDPI Division of
Accountability. The following certification statement is an example:
“The department hereby certifies the accuracy of the data from the North Carolina end-of-course tests for Fall 2004 provided that all NCDPI-directed test administration
guidelines, rules, procedures, and policies have been followed at the district and schools
in conducting proper test administrations and in the generation of the data. The LEAs
may generate the required reports for the end-of-course tests as this completes the
certification process for the EOC tests for the Fall 2004 semester.”
Glossary of Key Terms
The terms below are defined by their application in this document and their common uses
in the North Carolina Testing Program. Some of the terms refer to complex statistical
procedures used in the process of test development. In an effort to avoid the use of
excessive technical jargon, definitions have been simplified; however, they should not be
considered exhaustive.
Accommodations
Changes made in the format or administration of the
test to provide options to test takers who are unable to
take the original test under standard test conditions.
Achievement levels
Descriptions of a test taker’s competency in a
particular area of knowledge or skill, usually defined
as ordered categories on a continuum classified by
broad ranges of performance.
Asymptote
An item statistic that describes the proportion of
examinees that endorsed a question correctly but did
poorly on the overall test. Asymptote for a theoretical
four-choice item is 0.25 but can vary somewhat by test.
(For math it is generally 0.15 and for social studies it is
generally 0.22).
Biserial correlation
The relationship between an item score (right or
wrong) and a total test score.
Common curriculum
Objectives that are unchanged between the old and
new curricula
Cut scores
A specific point on a score scale, such that scores at or
above that point are interpreted or acted upon
differently from scores below that point.
Dimensionality
The extent to which a test item measures more than
one ability.
Embedded test model
Using an operational test to field test new items or
sections. The new items or sections are “embedded”
into the new test and appear to examinees as being
indistinguishable from the operational test.
Equivalent forms
Statistically insignificant differences between forms
(i.e., the red form is not harder).
Field test
A collection of items to approximate how a test form
will work. Statistics produced will be used in
interpreting item behavior/performance and allow for
the calibration of item parameters used in equating
tests.
Foil counts
Number of examinees that endorse each foil (e.g.
number who answer “A,” number who answer “B,”
etc.)
Item response theory
A method of test item analysis that takes into account
the ability of the examinee, and determines
characteristics of the item relative to other items in the
test. The NCDPI uses the 3-parameter model, which
provides slope, threshold, and asymptote.
Item tryout
A collection of a limited number of items of a new
type, a new format, or a new curriculum. Only a few
forms are assembled to determine the performance of
new items and not all objectives are tested.
Mantel-Haenszel
A statistical procedure that examines the differential
item functioning (DIF) or the relationship between a
score on an item and the different groups answering
the item (e.g. gender, race). This procedure is used to
identify individual items for further bias review.
Operational test
Test is administered statewide with uniform
procedures and full reporting of scores, and stakes for
examinees and schools.
p-value
Difficulty of an item defined by using the proportion of
examinees who answered an item correctly.
Parallel forms
Covers the same curricular material as other forms
Percentile
The score on a test below which a given percentage of
scores fall.
Pilot test
Test is administered as if it were “the real thing” but
has limited associated reporting or stakes for
examinees or schools.
Quasi-equated
Item statistics are available for items that have been
through item tryouts (although they could change after
revisions); and field test forms are developed using this
information to maintain similar difficulty levels to the
extent possible.
Raw score
The unadjusted score on a test determined by counting
the number of correct answers.
Scale score
A score to which raw scores are converted by
numerical transformation. Scale scores allow for
comparison of different forms of the test using the
same scale.
Slope
The ability of a test item to distinguish between
examinees of high and low ability.
Standard error of
measurement
The standard deviation of an individual’s observed
scores, usually estimated from group data.
Test blueprint
The testing plan, which includes numbers of items
from each objective to appear on test and arrangement
of objectives.
Threshold
The point on the ability scale where the probability of
a correct response is fifty percent. Threshold for an
item of average difficulty is 0.00.
WINSCAN Program
Proprietary computer program that contains the test
answer keys and files necessary to scan and score state
multiple-choice tests. Student scores and local reports
can be generated immediately using the program.
References
Camera, W. J. & Echternacht, G. (2000). The SAT I and High School Grades: Utility in
Predicting Success in College. Research Notes RN-10, July 2000 (p.6). The
College Board Office of Research and Development.
Gregory, Robert J. (2000). Psychological Testing: History, Principles, and Applications.
Needham Heights: Allyn & Bacon.
Hambleton, Ronald K. (1983). Applications of Item Response Theory. British Columbia:
Educational Research Institute of British Columbia.
Hinkle, D.E., Wiersma, W., & Jurs, S. G. (1998). Applied Statistics for the Behavioral
Sciences (pp. 69-70).
Muraki, E., Mislevy, R.J., & Bock, R.D. (1991). PC-BiMain: Analysis of item parameter
drift, differential item functioning, and variant item performance [Computer
software]. Mooresville, IN: Scientific Software, Inc.
Marzano, R.J., Brandt, R.S., Hughes, C.S., Jones, B.F., Presseisen, B.Z., Stuart, C., &
Suhor, C. (1988). Dimensions of Thinking. Alexandria, VA: Association for
Supervision and Curriculum Development.
Millman, J., and Greene, J. (1993). The Specification and Development of Tests of
Achievement and Ability. In Robert Linn (ed.), Educational Measurement (pp.
335-366). Phoenix: American Council on Education and Oryx Press.
Thissen, D., & Orlando, M. (2001). Item response theory for items scored in two
categories. In D. Thissen & H. Wainer (Eds), Test Scoring (pp. 73-140). Mahwah,
NJ: Lawrence Erlbaum Associates.
Williams, V.S.L., Pommerich, M., & Thissen, D. (1998). A comparison of developmental
scales based on Thurstone methods and item response theory. Journal of
Educational Measurement, 35, 93-107.
Additional Resources
Anastasi, A. (1982). Psychological Testing. New York: Macmillan Publishing Company,
Inc.
Averett, C.P. (1994). North Carolina End-of-Grade Tests: Setting standards for the
achievement levels. Unpublished manuscript.
Berk, R.A. (1984). A Guide to Criterion-Referenced Test Construction. Baltimore: The
Johns Hopkins University Press.
Berk, R.A. (1982). Handbook of Methods for Detecting Test Bias. Baltimore: The Johns
Hopkins University Press.
Bock, R.D., Gibbons, R., & Muraki, E. (1988). Full information factor analysis. Applied
Psychological Measurement, 12, 261-280.
Camilli, G. & Shepard, L.A. (1994). Methods for Identifying Biased Test Items. Thousand
Oaks, CA: Sage Publications, Inc.
Campbell, D.T. & Fiske, D.W. (1959). Convergent and discriminant validation by the
multitrait-multimethod matrix. Psychological Bulletin, 56, 81-105.
Cattell, R.B. (1956). Validation and intensification of the Sixteen Personality Factor
Questionnaire. Journal of Clinical Psychology, 12, 105-214.
Dorans, N.J. & Holland, P.W. (1993). DIF Detection and description: Mantel-Haenszel
and standardization. In P.W. Holland and H. Wainer (Eds.), Differential Item
Functioning (pp. 35-66). Hillsdale, NJ: Lawrence Erlbaum.
Haladyna, T.M. (1994). Developing and Validating Multiple-Choice Test Items.
Hillsdale, NJ: Lawrence Erlbaum Associates, Publishers.
Hambleton, R.K. & Swaminathan, H. (1985). Item Response Theory: Principles and
Applications. Kluwer-Nijhoff Publishing.
Hambleton, R.K., Swaminathan, H., & Rogers, H.J. (1991). Fundamentals of Item
Response Theory. Newbury Park, CA: Sage Publications, Inc.
Holland, P.W. & Wainer, H. (1993). Differential Item Functioning. Hillsdale, NJ:
Lawrence Erlbaum Associates, Inc.
Joreskog, K.J. & Sorbom, D. (1986). PRELIS: A program for multivariate data screening
and data summarization. Chicago, IL: Scientific Software, Inc.
Joreskog, K.J. & Sorbom, D. (1988). LISREL 7: A guide to the program and applications.
Chicago, IL: SPSS, Inc.
Kubiszyn, T. & Borich, G. (1990). Educational Testing and Measurement. New York:
HarperCollins Publishers.
Muraki, E., Mislevy, R.J., & Bock, R.D. PC-Bimain Manual. (1991). Chicago, IL:
Scientific Software, Inc.
National Council of Teachers of Mathematics. Curriculum and Evaluation Standards for
School Mathematics. (1989). Reston, VA: Author.
North Carolina Department of Public Instruction. (1992). Teacher Handbook for
Mathematics. Raleigh, NC: Author.
North Carolina Department of Public Instruction. (1993). North Carolina End-of-Grade
Testing Program: Background Information. Raleigh, NC: Author.
North Carolina Department of Public Instruction. (1996). North Carolina Testing Code of
Ethics. Raleigh, NC: Author.
North Carolina State Board of Education. (1993). Public School Laws of North Carolina
1994. Raleigh, NC: The Michie Company.
Nunnally, J. (1978). Psychometric Theory. New York: McGraw-Hill Book Company.
Rosenthal, R. & Rosnow, R.L. (1984). Essentials of behavioral research: Methods and
data analysis. New York: McGraw-Hill Book Company.
SAS Institute, Inc. (1985). The FREQ Procedure. In SAS User's Guide: Statistics,
Version 5 Edition. Cary, NC: Author.
Traub, R.E. (1994). Reliability for the social sciences: Theory and applications.
Thousand Oaks, CA: Sage Publications, Inc.
Appendix A: Item Development Guidelines
Content Guidelines
1. Items must be based on the goals and objectives outlined in the North
Carolina Standard Course of Study in Mathematics and written for the
appropriate grade level.
2. To the extent possible, each item written should measure a single concept,
principle, procedure, or competency.
3. Write items that measure important or significant material instead of trivial
material.
4. Keep the testing vocabulary consistent with the expected grade level of
students tested.
5. Avoid writing stems based on opinions.
6. Emphasize higher level thinking skills using the taxonomy provided by the
NCDPI.
Procedural Guidelines
7. Use the best answer format.
8. Avoid writing complex multiple-choice items.
9. Format the items vertically, not horizontally.
10. Avoid errors of grammar, abbreviations, punctuation, and spelling.
11. Minimize student reading time.
12. Avoid tricky or misleading items.
13. Avoid the use of contractions.
14. Avoid the use of first or second person.
Stem Construction Guidelines
15. Items are to be written in the question format.
16. Ensure that the directions written in the stems are clear and that the wording
lets the students know exactly what is being tested.
17. Avoid excessive verbiage when writing the stems.
18. Word the stems positively, avoiding any negative phrasing. The use of
negatives such as NOT and EXCEPT is to be avoided.
19. Write the items so that the central idea and the phrasing are included in the
stem instead of the foils.
20. Place the interrogative as close to the item foils as possible.
General Foil Development
21. Each item must contain four foils (A, B, C, D).
22. Order the answer choices in a logical order. Numbers should be listed in
ascending or descending order.
23. Each item written should contain foils that are independent and not
overlapping.
24. All foils in an item should be homogeneous in content and length.
25. Do not use the following as foils: all of the above, none of the above, I don’t
know.
26. Word the foils positively, avoiding any negative phrasing. The use of
negatives such as NOT and EXCEPT is to be avoided.
27. Avoid providing clues to the correct response. Avoid writing items where
phrases in the stem (clang associations) are repeated in the foils.
28. Avoid including ridiculous options.
29. Avoid grammatical clues to the correct answer.
30. Avoid specific determiners because they are so extreme that they are seldom
the correct response. To the extent possible, specific determiners such as
ALWAYS, NEVER, TOTALLY, and ABSOLUTELY should not be used
when writing items. Qualifiers such as best, most likely, approximately, etc.
should be bold and italic.
31. The correct response for items written should be evenly balanced among the
response options. For a 4-option multiple-choice item, each correct response
should be located at each option position about 25% of the time.
32. The items written should contain one and only one best (correct) answer.
Distractor Development
33. Use plausible distractors. The best (correct) answer must clearly be the best
(correct) answer and the incorrect responses must clearly be inferior to the
best (correct) answer. No distractor should be obviously wrong.
34. To the extent possible, use the common errors made by students as distractors.
Give your reasoning for incorrect choices on the back of the item spec sheet.
35. Technically written phrases may be used, where appropriate, as plausible
distractors.
36. True phrases that do not correctly respond to the stem may be used as
plausible distractors where appropriate.
37. The use of humor should be avoided.
Appendix B: Test Blueprint Summaries
Mathematics Grade 3: Test Blueprint Summary

Number Sense, Numeration, and Numerical Operations
Competency Goal One: The learner will read, write, model, and compute with rational numbers.
Goal One Summary: average number of items per form, 32; average number of items per class, 96; difficulty of pool (range), 0.498–0.758.

Spatial Sense, Measurement, and Geometry
Competency Goal Two: The learner will recognize, understand, and use basic geometric properties and standard units of metric and customary measurement.
Goal Two Summary: average number of items per form, 24; average number of items per class, 72; difficulty of pool (range), 0.513–0.808.

Patterns, Relationships, and Functions
Competency Goal Three: The learner will demonstrate an understanding of classification, patterning, and seriation.
Goal Three Summary: average number of items per form, 12; average number of items per class, 36; difficulty of pool (range), 0.483–0.693.

Data, Probability, and Statistics
Competency Goal Four: The learner will demonstrate an understanding of data collection, display, and interpretation.
Goal Four Summary: average number of items per form, 12; average number of items per class, 36; difficulty of pool (range), 0.577–0.880.
Mathematics Grade 4: Test Blueprint Summary

Number Sense, Numeration, and Numerical Operations
Competency Goal One: The learner will read, write, model, and compute with rational numbers.
Goal One Summary: average number of items per form, 30; average number of items per class, 90; difficulty of pool (range), 0.500–0.800.

Spatial Sense, Measurement, and Geometry
Competency Goal Two: The learner will demonstrate an understanding and use of the properties and relationships in geometry, and standard units of metric and customary measurement.
Goal Two Summary: average number of items per form, 23; average number of items per class, 69; difficulty of pool (range), 0.449–0.684.

Patterns, Relationships, and Functions
Competency Goal Three: The learner will demonstrate an understanding of patterns and relationships.
Goal Three Summary: average number of items per form, 11; average number of items per class, 33; difficulty of pool (range), 0.600–0.643.

Data, Probability, and Statistics
Competency Goal Four: The learner will demonstrate an understanding and use of graphing, probability, and data analysis.
Goal Four Summary: average number of items per form, 16; average number of items per class, 49; difficulty of pool (range), 0.468–0.735.
Mathematics Grade 5: Test Blueprint Summary

Number Sense, Numeration, and Numerical Operations
Competency Goal One: The learner will understand and compute with rational numbers.
Goal One Summary: average number of items per form, 32; average number of items per class, 96; difficulty of pool (range), 0.458–0.757.

Spatial Sense, Measurement, and Geometry
Competency Goal Two: The learner will understand and compute with rational numbers.
Goal Two Summary: average number of items per form, 23; average number of items per class, 69; difficulty of pool (range), 0.417–0.602.

Patterns, Relationships, and Functions
Competency Goal Three: The learner will demonstrate an understanding of patterns, relationships, and elementary algebraic representation.
Goal Three Summary: average number of items per form, 13; average number of items per class, 39; difficulty of pool (range), 0.401–0.576.

Data, Probability, and Statistics
Competency Goal Four: The learner will demonstrate an understanding and use of graphing, probability, and data analysis.
Goal Four Summary: average number of items per form, 15; average number of items per class, 45; difficulty of pool (range), 0.319–0.634.
Mathematics Grade 6: Test Blueprint Summary

Number Sense, Numeration, and Numerical Operations
Competency Goal One: The learner will understand and compute with rational numbers.
Goal One Summary: average number of items per form, 29; average number of items per class, 87; difficulty of pool (range), 0.353–0.700.

Spatial Sense, Measurement, and Geometry
Competency Goal Two: The learner will demonstrate an understanding and use of the properties and relationships in geometry and standard units of metric and customary measurement.
Goal Two Summary: average number of items per form, 23; average number of items per class, 69; difficulty of pool (range), 0.381–0.697.

Patterns, Relationships, and Functions
Competency Goal Three: The learner will demonstrate an understanding of patterns, relationships, and algebraic representations.
Goal Three Summary: average number of items per form, 14; average number of items per class, 42; difficulty of pool (range), 0.449–0.570.

Data, Probability, and Statistics
Competency Goal Four: The learner will demonstrate an understanding and use of graphing, probability, and data analysis.
Goal Four Summary: average number of items per form, 14; average number of items per class, 42; difficulty of pool (range), 0.421–0.647.
Mathematics Grade 7: Test Blueprint Summary

Number Sense, Numeration, and Numerical Operations
Competency Goal One: The learner will understand and compute with real numbers.
Goal One Summary: average number of items per form, 24; average number of items per class, 48; difficulty of pool (range), 0.378–0.665.

Spatial Sense, Measurement, and Geometry
Competency Goal Two: The learner will demonstrate an understanding and use of the properties and relationships in geometry and standard units of metric and customary measurement.
Goal Two Summary: average number of items per form, 14; average number of items per class, 28; difficulty of pool (range), 0.250–0.653.

Patterns, Relationships, and Functions
Competency Goal Three: The learner will demonstrate an understanding of patterns, relationships, and fundamental algebraic concepts.
Goal Three Summary: average number of items per form, 20; average number of items per class, 40; difficulty of pool (range), 0.412–0.583.

Data, Probability, and Statistics
Competency Goal Four: The learner will demonstrate an understanding and use of graphing, probability, and data analysis.
Goal Four Summary: average number of items per form, 22; average number of items per class, 44; difficulty of pool (range), 0.238–0.590.
Mathematics Grade 8: Test Blueprint Summary

Number Sense, Numeration, and Numerical Operations
Competency Goal One: The learner will understand and compute with real numbers.
Goal One Summary: average number of items per form, 32; average number of items per class, 74; difficulty of pool (range), 0.318–0.595.

Spatial Sense, Measurement, and Geometry
Competency Goal Two: The learner will demonstrate an understanding and use of the properties and relationships in geometry and standard units of metric and customary measurement.
Goal Two Summary: average number of items per form, 20; average number of items per class, 40; difficulty of pool (range), 0.309–0.571.

Patterns, Relationships, and Functions
Competency Goal Three: The learner will demonstrate an understanding of patterns, relationships, and fundamental algebraic concepts.
Goal Three Summary: average number of items per form, 11; average number of items per class, 22; difficulty of pool (range), 0.400–0.644.

Data, Probability, and Statistics
Competency Goal Four: The learner will demonstrate an understanding and use of graphing, probability, and data analysis.
Goal Four Summary: average number of items per form, 14; average number of items per class, 24; difficulty of pool (range), 0.334–0.572.
Algebra I: Test Blueprint Summary

Number Sense, Numeration, and Numerical Operations
Competency Goal One: The learner will perform operations with real numbers and polynomials to solve problems.
Goal One Summary: average number of items per form, 12; average number of items per class, 36; difficulty of pool (range), 0.567–0.620.

Spatial Sense, Measurement, and Geometry
Competency Goal Two: The learner will solve problems in a geometric context.
Goal Two Summary: average number of items per form, 4; average number of items per class, 12; difficulty of pool (range), 0.392–0.516.

Patterns, Relationships, and Functions
Competency Goal Three: The learner will graph and use relations and functions to solve problems.
Goal Three Summary: average number of items per form, 56; average number of items per class, 156; difficulty of pool (range), 0.423–0.581.

Data, Probability, and Statistics
Competency Goal Four: The learner will collect and interpret data to solve problems.
Goal Four Summary: average number of items per form, 8; average number of items per class, 24; difficulty of pool (range), 0.486–0.614.
Geometry: Test Blueprint Summary

Number Sense, Numeration, and Numerical Operations
Competency Goal One: The learner will perform operations with real numbers to solve problems in a geometric context.
Goal One Summary: average number of items per form, none; average number of items per class, none; difficulty of pool (range), ****.

Spatial Sense, Measurement, and Geometry
Competency Goal Two: The learner will use properties of geometric figures to solve problems and write proofs.
Goal Two Summary: average number of items per form, 61; average number of items per class, 189; difficulty of pool (range), 0.271–0.702.

Patterns, Relationships, and Functions
Competency Goal Three: The learner will graph and use relations and functions to solve problems.
Goal Three Summary: average number of items per form, 7; average number of items per class, 21; difficulty of pool (range), 0.392–0.453.

Data, Probability, and Statistics
Competency Goal Four: The learner will collect and interpret data to solve problems.
Goal Four Summary: average number of items per form, 3; average number of items per class, 9; difficulty of pool (range), 0.472.
Algebra II: Test Blueprint Summary

Number Sense, Numeration, and Numerical Operations
Competency Goal One: The learner will perform operations with real numbers and polynomials to solve problems.
Goal One Summary: average number of items per form, 8; average number of items per class, 24; difficulty of pool (range), 0.427–0.447.

Spatial Sense, Measurement, and Geometry
Competency Goal Two: The learner will solve problems in a geometric context.
Goal Two Summary: average number of items per form, 3; average number of items per class, 9; difficulty of pool (range), 0.477–0.537.

Patterns, Relationships, and Functions
Competency Goal Three: The learner will graph and use relations and functions to solve problems.
Goal Three Summary: average number of items per form, 40; average number of items per class, 120; difficulty of pool (range), 0.317–0.690.

Data, Probability, and Statistics
Competency Goal Four: The learner will collect and interpret data to solve problems.
Goal Four Summary: average number of items per form, 9; average number of items per class, 27; difficulty of pool (range), 0.357–0.566.
Appendix C: Math Developmental Scale Report with Excel Plots for First and Second
Editions’ Scale Scores
The Developmental Scale for the North Carolina End-of-Grade Mathematics Tests,
Second Edition
David Thissen, Viji Sathy, Michael C. Edwards, & David Flora
L.L. Thurstone Psychometric Laboratory
The University of North Carolina at Chapel Hill
Following changes in the North Carolina curricular specifications for mathematics, a
second edition of the North Carolina End-of-Grade tests in mathematics was designed,
and an item tryout was administered using small sets of 12 items each, embedded in the
operational End-of-Grade tests in the Spring of 2000. This report describes the use of
data from that item tryout to construct a developmental scale for the second edition of the
North Carolina End-of-Grade tests in mathematics.
The basis of a developmental scale is the specification of the means and standard
deviations for scores on that scale for each grade level. In the case of the North Carolina
End-of-Grade tests the grade levels range from the grade 3 pretest (administered in the
Fall to students in the 3rd grade) through grade 8. The data from which the scale-score
means and standard deviations are derived make use of special test forms (called linking
forms) that are administered to students in adjacent grades. The difference in
performance among grades on these forms is used to estimate the difference in
proficiency among grades. The second edition of the North Carolina EOG Tests of
Mathematics used item response theory (IRT) to compute these estimates following
procedures described by Williams, Pommerich, and Thissen (1998). The population
means and standard deviations derived from the Spring 2000 item calibration for the
North Carolina EOG Mathematics tests are shown in Table 1.
Table 1. Population means and standard deviations derived from the Spring 2000
item calibration for the North Carolina EOG Tests of Mathematics, second edition
Grade        Population Mean    Population Standard Deviation
3 Pretest    234.35             9.66
3            248.27             9.86
4            252.90             10.65
5            255.99             12.78
6            259.95             11.75
7            263.36             12.46
8            267.09             12.83
The values for the developmental scale shown in Table 1 are based on IRT estimates of
differences between adjacent-grade means and ratios of adjacent-grade standard
deviations computed using the computer program MULTILOG (Thissen, 1991); the
estimates from MULTILOG were cross-checked against parallel estimates computed using
the software IRTLRDIF (Thissen, 2001). In the computation of estimates using either
software system, the analysis of data from adjacent grades arbitrarily sets the mean and
standard deviation of the population distribution of the lower grade to values of zero (0)
and one (1), respectively; the values of the mean (µ) and standard deviation (σ) of the
higher grade are estimated making use of the item response data and the three-parameter
logistic IRT model (Thissen and Orlando, 2001). Table 2 shows the average difference
between adjacent-grade means (µ) in units of the standard deviation of the lower grade,
and ratios between adjacent-grade standard deviations (σ), derived from the Spring 2000
item calibration for the North Carolina EOG Tests of Mathematics. The values in Table 2
were converted into the final scale, shown in Table 1, by (arbitrarily) setting the average
scale score in grade 4 to be 252.9, with a standard deviation of 10.65, and then computing
the values for the other grades such that the differences between the means for adjacent
grades in units of the standard deviation of the lower grade were the same as those shown
in Table 2.
Table 2. Average difference between adjacent-grade means (µ) in units of the
standard deviation of the lower grade and ratios between adjacent-grade standard
deviations (σ), derived from the Spring 2000 item calibration for the North Carolina
EOG Tests of Mathematics, second edition
Grades    Average µ Difference    Average σ Ratio    (Useful) Replications
3P-3      1.44                    1.02               11
3-4       0.47                    1.08               17
4-5       0.29                    1.20               14
5-6       0.31                    0.92               10
6-7       0.29                    1.06               13
7-8       0.30                    1.03               3
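The conversion just described can be checked with a few lines of arithmetic. The Python sketch below simply restates Table 2, anchors grade 4 at a mean of 252.9 and a standard deviation of 10.65, and walks the adjacent-grade differences up and down the grade sequence; it reproduces the Table 1 values to rounding. Variable names are illustrative only.

grades = ["3P", "3", "4", "5", "6", "7", "8"]
mu_diff = {("3P", "3"): 1.44, ("3", "4"): 0.47, ("4", "5"): 0.29,
           ("5", "6"): 0.31, ("6", "7"): 0.29, ("7", "8"): 0.30}
sd_ratio = {("3P", "3"): 1.02, ("3", "4"): 1.08, ("4", "5"): 1.20,
            ("5", "6"): 0.92, ("6", "7"): 1.06, ("7", "8"): 1.03}

mean, sd = {"4": 252.9}, {"4": 10.65}
# Upward from grade 4: higher-grade SD = lower-grade SD * ratio;
# higher-grade mean = lower-grade mean + difference * lower-grade SD.
for lo, hi in [("4", "5"), ("5", "6"), ("6", "7"), ("7", "8")]:
    sd[hi] = sd[lo] * sd_ratio[(lo, hi)]
    mean[hi] = mean[lo] + mu_diff[(lo, hi)] * sd[lo]
# Downward from grade 4 by inverting the same relations.
for lo, hi in [("3", "4"), ("3P", "3")]:
    sd[lo] = sd[hi] / sd_ratio[(lo, hi)]
    mean[lo] = mean[hi] - mu_diff[(lo, hi)] * sd[lo]

for g in grades:
    print(g, round(mean[g], 2), round(sd[g], 2))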
The estimates shown in Table 2 were derived from 3 to 17 replications of the between-grade difference; the numbers of replications for each grade pair are also shown in Table
2. Each replication was based on a (different) short embedded linking form from among
the item tryout forms administered in the Spring 2000. The sample size for each linking
form varied from 398 to 4,313 students in each grade. (Most sample sizes were in the
planned range of 1,300–1,500.)
The original design of the embedded item calibration for the second edition called for 12
to 17 (12-item) linking forms between each pair of grades, with sample sizes around
1,500. However, some planned forms were not printed and distributed before the testing
window began. As a result, some forms were administered to larger samples, and other
forms (that were delivered late) were administered to smaller samples. In addition, the
forms were not necessarily administered to the random samples that were planned within
each grade. Corrections were made for these problems in the computation of the
estimates shown in Table 2. The mean difference between grades 5 and 6 was corrected
using an estimate of the regression across replications of the mean difference on the new
scale against the mean difference on the old (operational) scale after data analysis
suggested that the matched samples in grades 5 and 6 were atypical in their performance.
The mean difference between grades 7 and 8 and the standard deviation ratio for grade 5
relative to grade 4 were adjusted to smooth the relation between those values and the
corresponding values for adjacent grades.
Table 3 shows, for each adjacent-grade pair, the values of the average difference between
adjacent-grade means (µ) in units of the standard deviation of the lower grade and ratios
of adjacent-grade standard deviations (σ) derived from the Spring 2000 item calibration
for the North Carolina EOG Tests of Mathematics for each replication that provided
useful data. In Table 3 the values of each grade-pair are in decreasing order of the
estimate of the difference between the means. There is some variation among the
estimates across replications due to the fact that some of the estimates are based on small
samples and many of the estimates are based on non-random samples. However, as
aggregated in Table 2, a useful developmental scale was constructed.
Table 3: (Useful) replications of the average difference between adjacent-grade means (µ) in units of the
standard deviation of the lower grade and ratios between adjacent-grade standard deviations (σ) derived
from the Spring 2000 item calibration for the North Carolina EOG Tests of Mathematics, second edition
Each entry below is the estimated mean difference (µ) followed, in parentheses, by the estimated ratio of standard deviations (σ).

Grade 3P–3 (11 replications): 1.84 (1.09), 1.77 (1.04), 1.73 (1.18), 1.60 (0.97), 1.53 (0.98), 1.50 (0.97), 1.42 (1.12), 1.35 (1.10), 1.28 (1.02), 0.97 (0.93), 0.89 (0.83)

Grades 3–4 (17 replications): 0.76 (1.25), 0.59 (1.09), 0.57 (1.09), 0.56 (1.02), 0.55 (1.04), 0.55 (1.33), 0.54 (1.05), 0.49 (1.04), 0.47 (0.94), 0.42 (1.22), 0.41 (0.99), 0.40 (1.10), 0.40 (0.97), 0.37 (0.90), 0.36 (1.28), 0.35 (1.01), 0.28 (1.04)

Grades 4–5 (14 replications): 0.63 (1.06), 0.57 (0.96), 0.51 (1.05), 0.51 (1.15), 0.51 (1.07), 0.41 (1.03), 0.40 (0.87), 0.21 (1.51), 0.19 (1.63), 0.12 (1.53), 0.10 (1.74), -0.02 (1.68), -0.05 (1.71), -0.09 (1.57)

Grades 5–6 (10 replications): 0.68 (0.72), 0.54 (0.75), 0.53 (0.82), 0.51 (0.83), 0.38 (1.15), 0.37 (1.02), 0.36 (0.90), 0.33 (1.05), 0.25 (1.09), 0.21 (0.83)

Grades 6–7 (13 replications): 0.50 (1.10), 0.39 (1.00), 0.37 (1.09), 0.36 (0.95), 0.35 (1.02), 0.33 (1.12), 0.28 (1.01), 0.28 (1.18), 0.22 (1.01), 0.21 (1.08), 0.21 (1.01), 0.14 (1.05), 0.12 (1.14)

Grades 7–8 (3 replications): 0.42 (0.92), 0.36 (1.26), 0.27 (0.92)
Comparison with and Linkage to the First Edition Scale
The embedded nature of the Spring 2000 item calibration provided a basis for a
preliminary linkage of the second edition developmental scale with that of the first
edition. (The results of that preliminary linkage were subsequently superseded by results
obtained from a special study with data collected in Spring 2001.) Table 4 shows a
comparison of the population means and standard deviations for the second edition with
the averages and standard deviations for the scale scores obtained from the operational
administration of the first edition. For ease of comparison of the two scales, Figure 1
shows the two sets of averages plotted together, with the two scales offset by 100 points so that they span approximately the same range. The developmental scales for
the first and second editions of the mathematics test are somewhat dissimilar. The smaller
rates of change observed in the calibration data for the second edition are likely due to
incomplete implementation in the 1999–2000 academic year of the new curriculum upon
which the second edition was based.
Table 4: Comparison of the population means and standard deviations for the second
edition with the averages and standard deviations obtained from the operational
administration of the first edition in the Spring 2000 item calibration for the North
Carolina EOG Tests of Mathematics
Grade        First Edition Mean    First Edition SD    Second Edition Mean    Second Edition SD
3 Pretest    131.6                 7.8                 234.35                 9.66
3            143.5                 11.1                248.27                 9.86
4            152.9                 10.1                252.90                 10.65
5            159.5                 10.1                255.99                 12.78
6            165.1                 11.2                259.95                 11.75
7            171.0                 11.5                263.36                 12.46
8            175.3                 11.9                267.09                 12.83
[The careful reader will note that, in Table 4, the second edition standard deviations are
somewhat larger than those for the first edition. This is due to the fact that the standard
deviations for the second edition are the values for the population distribution and those
for the first edition are standard deviations of the scale scores themselves; the latter must
be somewhat smaller than the former for IRT scale scores.]
[Figure: average scale score (y-axis, approximately 200–310) by grade (x-axis), showing the new (second edition) averages and the old (first edition) averages plus 100.]
Figure 1. Comparison of the growth curves for the first and second editions of the North Carolina EOG Tests of Mathematics in the Spring 2000 item calibration (Vertical lines indicate 1, 2, and 3 standard deviations on the second edition)
Scoring tables for the second edition forms that were to be administered in the Spring of
2001 were constructed after those forms were assembled in February–March 2001. Using
the item parameters from the embedded item calibration and the population means and
standard deviations in Table 1, we constructed scoring tables using the procedures
described by Thissen, Pommerich, Billeaud, and Williams (1995) and Thissen and
Orlando (2001). These procedures yield tables that translate summed scores into
corresponding IRT scale scores on the developmental scale.
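The scoring algorithm itself is not reproduced in this report. As a rough sketch of the idea behind the Thissen and Orlando (2001) approach, the code below uses the standard recursive computation of summed-score likelihoods under a three-parameter logistic model and a normal population to produce one ability estimate per summed score. All item parameters are invented; the result is on the θ metric, and the reported developmental scale score would be the grade's population mean plus its standard deviation times θ.

import numpy as np

def summed_score_table(a, b, c, n_quad=41):
    """EAP estimate of theta for each summed score 0..n_items on a 3PL form."""
    theta = np.linspace(-4.0, 4.0, n_quad)
    weights = np.exp(-0.5 * theta ** 2)
    weights /= weights.sum()                                   # normal population weights
    p = c[:, None] + (1 - c[:, None]) / (1 + np.exp(-1.7 * a[:, None] * (theta - b[:, None])))
    L = np.zeros((len(a) + 1, n_quad))                         # L[s] = P(summed score s | theta)
    L[0] = 1.0
    for i in range(len(a)):                                    # recursive build-up, item by item
        L[: i + 2] = L[: i + 2] * (1 - p[i]) + np.vstack([np.zeros(n_quad), L[: i + 1] * p[i]])
    posterior = L * weights
    return (posterior * theta).sum(axis=1) / posterior.sum(axis=1)

# Hypothetical 10-item form with invented item parameters
rng = np.random.default_rng(3)
a, b, c = rng.uniform(0.8, 1.6, 10), rng.normal(0.0, 1.0, 10), np.full(10, 0.15)
print(np.round(summed_score_table(a, b, c), 2))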
A side effect of constructing those scoring tables is that the algorithm also provides
IRT model-based estimates of the proportions of the (Spring 2000) item calibration
samples that would have obtained each summed score (and hence, each scale score) had
they been administered the forms assembled for Spring 2001. Those score proportions
were matched with the observed score distributions on the first-edition forms that were
included in the item tryout, yielding equipercentile equating tables (Angoff, 1982) that
match scores on the second edition with the scores at the same percentiles on the first
edition.
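A minimal sketch of that equipercentile matching, assuming the model-based second-edition score distribution and the observed first-edition distribution are available as parallel arrays (all names here are illustrative, and the smoothing used in practice is omitted):

    import numpy as np

    def equipercentile_table(new_scores, new_probs, old_scores, old_probs):
        """Match each second-edition score to the first-edition score at the same percentile.

        A discrete, unsmoothed sketch; the operational tables followed the
        procedures cited in the text (Angoff, 1982).
        """
        new_cdf = np.cumsum(new_probs) / np.sum(new_probs)
        old_cdf = np.cumsum(old_probs) / np.sum(old_probs)
        table = {}
        for score, pctile in zip(new_scores, new_cdf):
            # First old-edition score whose cumulative proportion reaches this percentile.
            idx = int(np.searchsorted(old_cdf, pctile))
            table[score] = old_scores[min(idx, len(old_scores) - 1)]
        return table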
This equipercentile matching also provided part of the basis for a preliminary translation
of the cut scores between achievement levels from the first edition to the second edition.
Additional information was also used to select the preliminary cut scores, in the form of
the consistency in the patterns of the matched cut scores between the EOG Levels I, II,
III, and IV across grades.
At the time of the preliminary linkage between the first and second edition score scales, it
was known that that linkage was based largely on statistical models and hypothetical
computations. We computed values based on IRT modeled estimates of the score
distributions that would have been obtained if the new second-edition forms had been
administered operationally in the Spring of 2000 (which they were not), and treated those
values as though they reflected performance that would have happened had the new
curriculum been completely implemented in 1999–2000 (which subsequent evidence
indicated was unlikely). As a result, those preliminary estimates were in place only
because the testing schedule and associated decision-making required some (albeit
preliminary) cut scores prior to the inaugural administration of the second edition tests in
the Spring of 2001.
The Equating Study
Because of the uncertainty surrounding the preliminary linkage between the scales for the
first and second editions of the North Carolina End-of-Grade Mathematics tests, a special
study commonly known as the equating study was performed in the Spring of 2001. In
this study, the newly-constructed second edition forms of the mathematics tests and
selected forms from the first edition were administered to spiraled samples in the context
of the item tryout of new items to create additional second-edition forms. The purpose of
this study was to provide data for linkage of the scales on the first and second editions
using the newly-constructed operational forms of the second edition of the test, which
were not available until early Spring 2001.
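In a spiraled administration, the forms are distributed in rotation within each classroom, so each form is answered by a randomly equivalent group of students. A minimal sketch of that assignment logic (illustrative only, not the actual packaging procedure):

    from itertools import cycle

    def spiral_forms(students, forms):
        """Assign forms in rotation so each form reaches a randomly equivalent group."""
        rotation = cycle(forms)
        return {student: next(rotation) for student in students}

    # Example: spiraling a new second-edition form with a first-edition form.
    # spiral_forms(["s1", "s2", "s3", "s4"], ["Form A (new)", "Form O (old)"])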
Figure 2 shows the equipercentile equating functions for grades 3–8 obtained using data
from the equating study. [Strictly speaking, this is not equating because the first and
second editions of the test measure different things, i.e., achievement on different
curricula. It is more technically referred to as statistical moderation (Linn, 1993; Mislevy,
1992). However, the statistical procedures of equipercentile equating are used, so it is
commonly referred to as equating.] The equating functions in Figure 2 are not coincident;
they cannot be, because the developmental scales follow different trajectories across
grades (as shown in Figure 1).
Figure 2: Equipercentile equating functions between the first and second edition NC End-of-Grade Mathematics scales derived from the Spring 2001 “equating study” for Grades 3–8. [Plot: second edition scale (about 200–320) on the horizontal axis; first edition scale (about 100–220) on the vertical axis; one curve for each of grades 3 through 8.]
Nevertheless, within grades these curves yield translation tables that can be used to
convert scores on the second edition of the test into equated scores on the first edition.
Such converted scores may be used in the computation of year-to-year change for the
ABCs accountability system for the transitional year when the scores for the previous
year are on the first-edition scale and the scores for the current year are on the
second-edition scale. In addition, because these equipercentile relations translate between
scores on the first-edition scale and the second-edition scale, they are used to translate
the cut scores (between Achievement Levels I, II, III, and IV) from the old scale to the new
scale.
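A minimal sketch of carrying a cut score from the old scale to the new one with such a table, assuming the equating relation is available as paired (old, new) values and using linear interpolation between tabulated points (both are simplifying assumptions; the operational cut scores came from the full equating tables):

    import numpy as np

    def translate_cut_score(old_cut, old_values, new_values):
        """Map an old-scale cut score to the new scale via the equating relation."""
        # old_values and new_values are equipercentile-matched score pairs,
        # sorted in increasing order of the old scale.
        return float(np.interp(old_cut, old_values, new_values))

    # Illustration with three Grade 3 percentile points tabulated later in this
    # report (5th, 50th, and 95th percentiles):
    # translate_cut_score(144, [123, 144, 161], [238, 250, 265])  -> 250.0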
The Effects of Significant Curricular Change
In the Spring of 2001, scores on the inaugural administration of the second edition of the
North Carolina EOG Tests of Mathematics were substantially higher than had been
expected given student performance on the test items in the Spring 2000 item tryout and
calibration. Table 5 shows the average scores for each grade; across grades the statewide
performance on the test was 2–4 scale-score points higher in Spring 2001 than was
expected from the item tryout the previous year. This was an unprecedented level of
change given that annual increases of test scores on this scale had almost always been
less than one point throughout the 1990s.
Table 5: Comparison of the population means and standard deviations for the second edition
(from the Spring 2000 item tryout) with the averages and standard deviations obtained from
the operational administration of the second edition in the Spring of 2001.

Grade       Item Tryout, 2000         Statewide Data, Spring 2001
            Mean      Std. Dev.       Mean      Std. Dev.
3 Pretest   234.35     9.66           236.1      8.1
3           248.27     9.86           250.6      7.7
4           252.90    10.65           255.8      8.3
5           255.99    12.78           260.0      9.6
6           259.95    11.75           263.2      9.9
7           263.36    12.46           267.1     10.6
8           267.09    12.83           270.0     11.0
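To make the size of that difference concrete, the grade-by-grade gaps implied by Table 5 can be computed directly (values copied from the table; the snippet is only a convenience):

    # Spring 2001 statewide means minus the means expected from the Spring 2000
    # item tryout (Table 5), grade 3 pretest through grade 8.
    expected = [234.35, 248.27, 252.90, 255.99, 259.95, 263.36, 267.09]
    observed = [236.1, 250.6, 255.8, 260.0, 263.2, 267.1, 270.0]
    print([round(o - e, 2) for o, e in zip(observed, expected)])
    # [1.75, 2.33, 2.9, 4.01, 3.25, 3.74, 2.91]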
This result, along with a good deal of unpublished statistical and anecdotal evidence,
suggested that performance of the students on the Spring 2000 item tryout was limited by
incomplete implementation of the new mathematics curriculum in the 1999–2000
academic year. In the 2000–2001 academic year, when it was known that the second
edition scores would be accepted, instruction in the new curriculum may have been much
more thorough, so scores were higher.
A consequence of these facts was that the preliminary cut scores for the Achievement
Levels, which had been set using only data from the Spring 2000 item tryout, were
substantially lower than the final cut scores set using data from the equating study in
Spring 2001. Because this was not known during the period of testing, and because scores
were reported locally before data from the equating study could be analyzed, there were
misleading reports of very high percentages of students passing the test with scores in
Achievement Levels III or IV. These results were corrected after the data from the
equating study were used to re-compute the cut scores before the test score information
was used for the ABCs accountability system.
However, this kind of experience is likely to follow any drastic change in the
curriculum in any subject-matter area. If the test must be calibrated based on data
obtained before students actually experience the new curriculum, the test items will
appear at that time to be more difficult than they will be when used operationally after the
new curriculum has been implemented. However, the amount of that change cannot be
known until after the first operational administration of any new test. Current
requirements that scores be reported and used immediately after each administration of
the End-of-Grade tests, including the first administration for a new edition, may lead to
unexpected results, as was the case for mathematics scores in Spring 2001.
Percentile   Grade 3 Second-Edition   Grade 3 First-Edition
             Data Points              Data Points
5th          238                      123
10th         241                      127
15th         243                      131
20th         244                      134
25th         246                      136
30th         247                      138
35th         247                      139
40th         248                      141
45th         249                      142
50th         250                      144
55th         252                      146
60th         253                      147
65th         253                      148
70th         255                      150
75th         255                      151
80th         257                      154
85th         259                      155
90th         261                      157
95th         265                      161

[Figure: Grade 3 scatter plot of Form O (old form) scores (about 120–165) against Form A (new form) scores (about 235–270).]
Percentile   Grade 4 Second-Edition   Grade 4 First-Edition
             Scale                    Scale
5th          242                      135
10th         245                      139
15th         247                      142
20th         248                      143
25th         249                      146
30th         251                      147
35th         252                      149
40th         253                      150
45th         254                      152
50th         256                      153
55th         257                      155
60th         258                      156
65th         259                      157
70th         260                      159
75th         261                      160
80th         264                      162
85th         266                      164
90th         268                      166
95th         271                      168

[Figure: Grade 4 scatter plot of Form O (old form) scores (about 130–170) against Form A (new form) scores (about 240–275).]
Percentile   Grade 5 Second-Edition   Grade 5 First-Edition
             Scale                    Scale
5th          244                      142
10th         248                      146
15th         249                      149
20th         251                      152
25th         252                      153
30th         254                      154
35th         255                      156
40th         256                      157
45th         257                      158
50th         259                      159
55th         260                      161
60th         261                      162
65th         263                      163
70th         264                      164
75th         266                      165
80th         268                      167
85th         270                      169
90th         273                      171
95th         277                      174

[Figure: Grade 5 scatter plot of Form P (old form) scores (about 140–175) against Form A (new form) scores (about 240–280).]
Percentile   Grade 6 Second-Edition   Grade 6 First-Edition
             Scale                    Scale
5th          247                      146
10th         249                      149
15th         252                      152
20th         253                      153.5
25th         254                      156
30th         256                      157
35th         257                      159
40th         258                      160
45th         260                      163
50th         261                      164
55th         262                      166
60th         263                      167
65th         265                      169
70th         267                      171
75th         268                      172
80th         270                      174
85th         272                      176
90th         275                      178
95th         278                      183

[Figure: Grade 6 scatter plot of Form O (old form) scores (about 145–185) against Form A (new form) scores (about 245–280).]
Percentile   Grade 7 Second-Edition   Grade 7 First-Edition
             Scale                    Scale
5th          250                      154
10th         253                      158
15th         255                      161
20th         257                      163
25th         258                      165
30th         260                      166
35th         261                      168
40th         263                      169
45th         264                      170
50th         265                      171
55th         266                      173
60th         268                      174
65th         270                      176
70th         272                      177
75th         273                      179
80th         275                      180
85th         277                      182
90th         282                      185
95th         286                      188

[Figure: Grade 7 scatter plot of Form C (old form) scores (about 150–190) against Form A (new form) scores (about 245–290).]
Percentile   Grade 8 Second-Edition   Grade 8 First-Edition
             Scale                    Scale
5th          252                      153
10th         254                      155
15th         256                      157
20th         258                      161
25th         259                      163
30th         261                      165
35th         263                      167
40th         264                      168
45th         265                      170
50th         266                      172
55th         269                      174
60th         271                      176
65th         272                      178
70th         274                      179
75th         276                      181
80th         278                      183
85th         281                      186
90th         284                      189
95th         289                      193

[Figure: Grade 8 scatter plot of Form P (old form) scores (about 150–195) against Form A (new form) scores (about 250–295).]
References

Angoff, W. H. (1982). Summary and derivation of equating methods used at ETS. In P. W. Holland & D. B. Rubin (Eds.), Test equating (pp. 55-69). New York: Academic Press.

Linn, R. L. (1993). Linking results of distinct assessments. Applied Measurement in Education, 6, 83-102.

Mislevy, R. J. (1992). Linking educational assessments: Concepts, issues, methods, and prospects. Princeton, NJ: Educational Testing Service.

Thissen, D. (1991). MULTILOG user's guide—Version 6. Chicago, IL: Scientific Software, Inc.

Thissen, D. (2001). IRTLRDIF v.2.0b: Software for the computation of the statistics involved in item response theory likelihood-ratio tests for differential item functioning. Unpublished manuscript.

Thissen, D., & Orlando, M. (2001). Item response theory for items scored in two categories. In D. Thissen & H. Wainer (Eds.), Test scoring (pp. 73-140). Mahwah, NJ: Lawrence Erlbaum Associates.

Thissen, D., Pommerich, M., Billeaud, K., & Williams, V. S. L. (1995). Item response theory for scores on tests including polytomous items with ordered responses. Applied Psychological Measurement, 19, 39-49.

Williams, V. S. L., Pommerich, M., & Thissen, D. (1998). A comparison of developmental scales based on Thurstone methods and item response theory. Journal of Educational Measurement, 35, 93-107.
Appendix D: Sample Items
Appendix E: Sample Frequency Distribution Tables for Math Scale Scores (selected grades
and subjects)
Grade 3 EOG (2001)

Math Scale Score   Frequency Count   Percent   Cumulative Frequency   Cumulative Percent
218                4                 0         4                      0
219                1                 0         5                      0
221                4                 0         9                      0.01
222                1                 0         10                     0.01
223                3                 0         13                     0.01
224                1                 0         14                     0.01
225                6                 0.01      20                     0.02
226                7                 0.01      27                     0.03
227                7                 0.01      34                     0.03
228                19                0.02      53                     0.05
229                63                0.06      116                    0.11
230                61                0.06      177                    0.17
231                102               0.1       279                    0.27
232                277               0.27      556                    0.54
233                239               0.23      795                    0.78
234                602               0.59      1397                   1.37
235                818               0.8       2215                   2.17
236                507               0.5       2722                   2.66
237                1590              1.56      4312                   4.22
238                1772              1.73      6084                   5.95
239                1688              1.65      7772                   7.61
240                1988              1.95      9760                   9.55
241                2966              2.9       12726                  12.46
242                3080              3.01      15806                  15.47
243                2845              2.78      18651                  18.25
244                4174              4.09      22825                  22.34
245                4177              4.09      27002                  26.43
246                4019              3.93      31021                  30.36
247                5833              5.71      36854                  36.07
248                4696              4.6       41550                  40.67
249                5818              5.69      47368                  46.36
250                5282              5.17      52650                  51.53
251                3734              3.65      56384                  55.19
252                4709              4.61      61093                  59.79
253                5469              5.35      66562                  65.15
254                4658              4.56      71220                  69.71
255                3771              3.69      74991                  73.4
256                3668              3.59      78659                  76.99
257                3762              3.68      82421                  80.67
258                3609              3.53      86030                  84.2
259                3332              3.26      89362                  87.46
260                2424              2.37      91786                  89.83
261                2216              2.17      94002                  92
262                2034              1.99      96036                  93.99
263                528               0.52      96564                  94.51
264                1241              1.21      97805                  95.73
265                1550              1.52      99355                  97.24
266                382               0.37      99737                  97.62
267                756               0.74      100493                 98.36
268                621               0.61      101114                 98.96
269                206               0.2       101320                 99.17
271                514               0.5       101834                 99.67
273                210               0.21      102044                 99.87
274                80                0.08      102124                 99.95
275                41                0.04      102165                 99.99
276                7                 0.01      102172                 100
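The Percent and Cumulative columns in these tables follow directly from the frequency counts; a minimal sketch of that bookkeeping (names are illustrative):

    def frequency_table(scores, counts):
        """Add percent, cumulative frequency, and cumulative percent columns."""
        total = sum(counts)
        rows, cumulative = [], 0
        for score, n in zip(scores, counts):
            cumulative += n
            rows.append((score, n, round(100 * n / total, 2),
                         cumulative, round(100 * cumulative / total, 2)))
        return rows

For example, with the statewide total of 102,172 grade 3 examinees, the 528 scores of 263 contribute 100 × 528 / 102,172, or about 0.52 percent, matching the table above.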
Grade 5 EOG (2001)
Math Scale Score   Frequency Count   Percent   Cumulative Frequency   Cumulative Percent
221                1                 0         1                      0
229                2                 0         3                      0
230                1                 0         4                      0
231                7                 0.01      11                     0.01
232                16                0.02      27                     0.03
233                23                0.02      50                     0.05
234                27                0.03      77                     0.08
235                26                0.03      103                    0.1
236                80                0.08      183                    0.18
237                147               0.15      330                    0.33
238                181               0.18      511                    0.51
239                235               0.23      746                    0.74
240                340               0.34      1086                   1.08
241                483               0.48      1569                   1.57
242                626               0.62      2195                   2.19
243                738               0.74      2933                   2.93
244                826               0.82      3759                   3.75
245                1727              1.72      5486                   5.47
246                1598              1.59      7084                   7.07
247                1790              1.79      8874                   8.85
248                2431              2.42      11305                  11.28
249                2076              2.07      13381                  13.35
250                3354              3.35      16735                  16.69
251                3094              3.09      19829                  19.78
252                3194              3.19      23023                  22.97
253                3252              3.24      26275                  26.21
254                4100              4.09      30375                  30.3
255                4206              4.2       34581                  34.49
256                4424              4.41      39005                  38.91
257                4478              4.47      43483                  43.37
258                3732              3.72      47215                  47.1
259                2852              2.84      50067                  49.94
260                4447              4.44      54514                  54.38
261                4402              4.39      58916                  58.77
262                3569              3.56      62485                  62.33
263                2930              2.92      65415                  65.25
264                4254              4.24      69669                  69.49
265                2110              2.1       71779                  71.6
266                4106              4.1       75885                  75.69
267                1947              1.94      77832                  77.64
268                3880              3.87      81712                  81.51
269                1821              1.82      83533                  83.32
270                1864              1.86      85397                  85.18
271                1739              1.73      87136                  86.92
272                1725              1.72      88861                  88.64
273                2135              2.13      90996                  90.77
274                1519              1.52      92515                  92.28
275                989               0.99      93504                  93.27
276                1384              1.38      94888                  94.65
277                1240              1.24      96128                  95.89
278                718               0.72      96846                  96.6
279                718               0.72      97564                  97.32
280                583               0.58      98147                  97.9
281                275               0.27      98422                  98.17
282                492               0.49      98914                  98.67
283                386               0.39      99300                  99.05
284                202               0.2       99502                  99.25
285                126               0.13      99628                  99.38
286                227               0.23      99855                  99.6
287                100               0.1       99955                  99.7
288                147               0.15      100102                 99.85
290                50                0.05      100152                 99.9
291                65                0.06      100217                 99.97
293                13                0.01      100230                 99.98
295                22                0.02      100252                 100
Algebra I EOC (2001)
Scale Score   Frequency Count   Percent   Cumulative Frequency   Cumulative Percent
31            13                0.01      13                     0.01
32            3                 0         16                     0.02
33            7                 0.01      23                     0.02
34            6                 0.01      29                     0.03
35            10                0.01      39                     0.04
36            12                0.01      51                     0.05
37            69                0.07      120                    0.13
38            72                0.08      192                    0.21
39            155               0.17      347                    0.37
40            211               0.23      558                    0.6
41            346               0.37      904                    0.97
42            522               0.56      1426                   1.53
43            670               0.72      2096                   2.25
44            892               0.96      2988                   3.21
45            1089              1.17      4077                   4.38
46            1241              1.33      5318                   5.71
47            1490              1.6       6808                   7.31
48            1621              1.74      8429                   9.05
49            1671              1.79      10100                  10.85
50            1872              2.01      11972                  12.86
51            2510              2.7       14482                  15.55
52            2935              3.15      17417                  18.7
53            2766              2.97      20183                  21.68
54            2223              2.39      22406                  24.06
55            4456              4.79      26862                  28.85
56            2373              2.55      29235                  31.4
57            4678              5.02      33913                  36.42
58            3131              3.36      37044                  39.78
59            3881              4.17      40925                  43.95
60            4609              4.95      45534                  48.9
61            3847              4.13      49381                  53.03
62            2959              3.18      52340                  56.21
63            4510              4.84      56850                  61.05
64            4316              4.64      61166                  65.69
65            2701              2.9       63867                  68.59
66            3174              3.41      67041                  72
67            3635              3.9       70676                  75.9
68            2397              2.57      73073                  78.48
69            2739              2.94      75812                  81.42
70            2973              3.19      78785                  84.61
71            1886              2.03      80671                  86.63
72            2065              2.22      82736                  88.85
73            1524              1.64      84260                  90.49
74            1361              1.46      85621                  91.95
75            1287              1.38      86908                  93.33
76            874               0.94      87782                  94.27
77            1324              1.42      89106                  95.69
78            718               0.77      89824                  96.46
79            610               0.66      90434                  97.12
80            514               0.55      90948                  97.67
81            488               0.52      91436                  98.2
82            406               0.44      91842                  98.63
83            253               0.27      92095                  98.9
84            218               0.23      92313                  99.14
85            192               0.21      92505                  99.34
86            203               0.22      92708                  99.56
87            50                0.05      92758                  99.62
88            108               0.12      92866                  99.73
89            84                0.09      92950                  99.82
90            36                0.04      92986                  99.86
91            45                0.05      93031                  99.91
92            22                0.02      93053                  99.93
93            19                0.02      93072                  99.95
94            23                0.02      93095                  99.98
95            10                0.01      93105                  99.99
96            11                0.01      93116                  100
Geometry EOC (2001)
Scale Score   Frequency Count   Percent   Cumulative Frequency   Cumulative Percent
32            4                 0.01      4                      0.01
33            2                 0         6                      0.01
34            2                 0         8                      0.01
36            14                0.02      22                     0.03
37            15                0.02      37                     0.06
38            103               0.16      140                    0.21
39            127               0.19      267                    0.41
40            185               0.28      452                    0.69
41            321               0.49      773                    1.18
42            404               0.62      1177                   1.8
43            527               0.8       1704                   2.6
44            617               0.94      2321                   3.54
45            763               1.16      3084                   4.71
46            1262              1.93      4346                   6.63
47            1056              1.61      5402                   8.25
48            1601              2.44      7003                   10.69
49            1745              2.66      8748                   13.35
50            1447              2.21      10195                  15.56
51            1446              2.21      11641                  17.77
52            2082              3.18      13723                  20.95
53            2643              4.03      16366                  24.98
54            2213              3.38      18579                  28.36
55            2767              4.22      21346                  32.58
56            2289              3.49      23635                  36.08
57            2847              4.35      26482                  40.42
58            2918              4.45      29400                  44.88
59            3013              4.6       32413                  49.47
60            2326              3.55      34739                  53.02
61            3588              5.48      38327                  58.5
62            2346              3.58      40673                  62.08
63            2869              4.38      43542                  66.46
64            2756              4.21      46298                  70.67
65            2056              3.14      48354                  73.81
66            2844              4.34      51198                  78.15
67            1354              2.07      52552                  80.21
68            2159              3.3       54711                  83.51
69            1586              2.42      56297                  85.93
70            1493              2.28      57790                  88.21
71            1597              2.44      59387                  90.65
72            873               1.33      60260                  91.98
73            1033              1.58      61293                  93.56
74            721               1.1       62014                  94.66
75            628               0.96      62642                  95.61
76            539               0.82      63181                  96.44
77            509               0.78      63690                  97.21
78            460               0.7       64150                  97.92
79            362               0.55      64512                  98.47
80            228               0.35      64740                  98.82
81            170               0.26      64910                  99.08
82            140               0.21      65050                  99.29
83            149               0.23      65199                  99.52
84            52                0.08      65251                  99.6
85            97                0.15      65348                  99.75
86            62                0.09      65410                  99.84
87            35                0.05      65445                  99.89
88            15                0.02      65460                  99.92
89            38                0.06      65498                  99.97
90            4                 0.01      65502                  99.98
91            7                 0.01      65509                  99.99
92            4                 0.01      65513                  100
93            2                 0         65515                  100
Appendix F: Testing Code of Ethics
Testing Code of Ethics (16 NCAC 6D .0306)
Testing Code of Ethics
Introduction
In North Carolina, standardized testing is an integral part of the educational experience of all
students. When properly administered and interpreted, test results provide an independent,
uniform source of reliable and valid information, which enables:
• students to know the extent to which they have mastered expected knowledge and skills and
how they compare to others;
• parents to know if their children are acquiring the knowledge and skills needed to succeed
in a highly competitive job market;
• teachers to know if their students have mastered grade-level knowledge and skills in the
curriculum and, if not, what weaknesses need to be addressed;
• community leaders and lawmakers to know if students in North Carolina schools are
improving their performance over time and how the students compare with students from
other states or the nation; and
• citizens to assess the performance of the public schools.
Testing should be conducted in a fair and ethical manner, which includes:
Security
• assuring adequate security of the testing materials before, during, and after
testing and during scoring
• assuring student confidentiality
Preparation
• teaching the tested curriculum and test-preparation skills
• training staff in appropriate testing practices and procedures
• providing an appropriate atmosphere
Administration
• developing a local policy for the implementation of fair and ethical testing practices and
for resolving questions concerning those practices
• assuring that all students who should be tested are tested
• utilizing tests which are developmentally appropriate
• utilizing tests only for the purposes for which they were designed
Scoring, Analysis and Reporting
• interpreting test results to the appropriate audience
• providing adequate data analyses to guide curriculum implementation and improvement
Because standardized tests provide only one valuable piece of information, such information
should be used in conjunction with all other available information known about a student to assist
in improving student learning. The administration of tests required by applicable statutes and the
use of student data for personnel/program decisions shall comply with the Testing Code of Ethics
(16 NCAC 6D .0306), which is printed on the next three pages.
Testing Code of Ethics
Testing Code of Ethics (16 NCAC 6D .0306)
.0306 TESTING CODE OF ETHICS
(a) This Rule shall apply to all public school employees who are involved in the state
testing program.
(b) The superintendent or superintendent’s designee shall develop local policies and
procedures to ensure maximum test security in coordination with the policies and
procedures developed by the test publisher. The principal shall ensure test security
within the school building.
(1) The principal shall store test materials in a secure, locked area. The principal
shall allow test materials to be distributed immediately prior to the test administration.
Before each test administration, the building level test coordinator shall accurately count
and distribute test materials. Immediately after each test administration, the building
level test coordinator shall collect, count, and return all test materials to the secure,
locked storage area.
(2) “Access” to test materials by school personnel means handling the materials
but does not include reviewing tests or analyzing test items. The superintendent or
superintendent’s designee shall designate the personnel who are authorized to have
access to test materials.
(3) Persons who have access to secure test materials shall not use those
materials for personal gain.
(4) No person may copy, reproduce, or paraphrase in any manner or for any
reason the test materials without the express written consent of the test publisher.
(5) The superintendent or superintendent’s designee shall instruct personnel who
are responsible for the testing program in testing administration procedures. This
instruction shall include test administrations that require procedural modifications and
shall emphasize the need to follow the directions outlined by the test publisher.
(6) Any person who learns of any breach of security, loss of materials, failure to
account for materials, or any other deviation from required security procedures shall
immediately report that information to the principal, building level test coordinator, school
system test coordinator, and state level test coordinator.
(c) Preparation for testing.
(1) The superintendent shall ensure that school system test coordinators:
(A) secure necessary materials;
(B) plan and implement training for building level test coordinators, test
administrators, and proctors;
(C) ensure that each building level test coordinator and test administrator
is trained in the implementation of procedural modifications used during test
administrations; and
(D) in conjunction with program administrators, ensure that the need for
test modifications is documented and that modifications are limited to the specific
need.
(2) The principal shall ensure that the building level test coordinators:
(A) maintain test security and accountability of test materials;
(B) identify and train personnel, proctors, and backup personnel for test
administrations; and
(C) encourage a positive atmosphere for testing.
(3) Test administrators shall be school personnel who have professional training
in education and the state testing program.
(4) Teachers shall provide instruction that meets or exceeds the standard course
of study to meet the needs of the specific students in the class. Teachers may help
students improve test-taking skills by:
(A) helping students become familiar with test formats using curricular
content;
(B) teaching students test-taking strategies and providing practice
sessions;
(C) helping students learn ways of preparing to take tests; and
(D) using resource materials such as test questions from test item banks,
testlets, and linking documents in instruction and test preparation.
(d) Test administration.
(1) The superintendent or superintendent’s designee shall:
(A) assure that each school establishes procedures to ensure that all test
administrators comply with test publisher guidelines;
(B) inform the local board of education of any breach of this code of
ethics; and
(C) inform building level administrators of their responsibilities.
(2) The principal shall:
(A) assure that school personnel know the content of state and local
testing policies;
(B) implement the school system’s testing policies and procedures and
establish any needed school policies and procedures to assure that all eligible
students are tested fairly;
(C) assign trained proctors to test administrations; and
(D) report all testing irregularities to the school system test coordinator.
(3) Test administrators shall:
(A) administer tests according to the directions in the administration
manual and any subsequent updates developed by the test publisher;
(B) administer tests to all eligible students;
(C) report all testing irregularities to the school system test coordinator;
and
(D) provide a positive test-taking climate.
(4) Proctors shall serve as additional monitors to help the test administrator
assure that testing occurs fairly.
(e) Scoring. The school system test coordinator shall:
(1) ensure that each test is scored according to the procedures and guidelines
defined for the test by the test publisher;
(2) maintain quality control during the entire scoring process, which consists of
handling and editing documents, scanning answer documents, and producing electronic
files and reports. Quality control shall address, at a minimum, accuracy and scoring
consistency.
(3) maintain security of tests and data files at all times, including:
(A) protecting the confidentiality of students at all times when publicizing
test results; and
(B) maintaining test security of answer keys and item-specific scoring
rubrics.
( f ) Analysis and reporting. Educators shall use test scores appropriately. This means
that the educator recognizes that a test score is only one piece of information and must
be interpreted together with other scores and indicators. Test data help educators
understand educational patterns and practices. The superintendent shall
ensure that school personnel analyze and report test data ethically and within the
limitations described in this paragraph.
(1) Educators shall release test scores to students, parents, legal guardians,
teachers, and the media with interpretive materials as needed.
(2) Staff development relating to testing must enable personnel to respond
knowledgeably to questions related to testing, including the tests, scores, scoring
procedures, and other interpretive materials.
(3) Items and associated materials on a secure test shall not be in the public
domain. Only items that are within the public domain may be used for item analysis.
(4) Educators shall maintain the confidentiality of individual students. Publicizing
test scores that contain the names of individual students is unethical.
(5) Data analysis of test scores for decision-making purposes shall be based
upon:
(A) disaggregation of data based upon student demographics and other
collected variables;
(B) examination of grading practices in relation to test scores; and
(C) examination of growth trends and goal summary reports for state-mandated tests.
(g) Unethical testing practices include, but are not limited to, the following practices:
(1) encouraging students to be absent the day of testing;
(2) encouraging students not to do their best because of the purposes of the test;
(3) using secure test items or modified secure test items for instruction;
(4) changing student responses at any time;
(5) interpreting, explaining, or paraphrasing the test directions or the test items;
(6) reclassifying students solely for the purpose of avoiding state testing;
(7) not testing all eligible students;
(8) failing to provide needed modifications during testing, if available;
(9) modifying scoring programs including answer keys, equating files, and lookup
tables;
(10) modifying student records solely for the purpose of raising test scores;
(11) using a single test score to make individual decisions; and
(12) misleading the public concerning the results and interpretations of test data.
(h) In the event of a violation of this Rule, the SBE may, in accordance with the
contested case provisions of Chapter 150B of the General Statutes, impose any one or
more of the following sanctions:
(1) withhold ABCs incentive awards from individuals or from all eligible staff in a
school;
(2) file a civil action against the person or persons responsible for the violation for
copyright infringement or for any other available cause of action;
(3) seek criminal prosecution of the person or persons responsible for the
violation; and
(4) in accordance with the provisions of 16 NCAC 6C .0312, suspend or revoke
the professional license of the person or persons responsible for the violation.
History Note: Authority G.S. 115C-12(9)c.; 115C-81(b)(4);
Eff. November 1, 1997;
Amended Eff. August 1, 2000.