Collaborative Conference for Student Achievement
March 4, 2014
NCDPI/Division of Accountability Services
Test Development Section
1
Hope Lung – Section Chief
Dan Auman – English Language Arts
Test Measurement Specialist
Michael Gallagher, Ph.D. – Mathematics
Test Measurement Specialist
Garron Gianopulos, Ph.D. – Psychometrician
Jami-Jon Pearson, Ph.D. – Psychometrician
2
• Overview of End-of-Grade (EOG) &
End-of-Course (EOC) Assessments
• Measurement Principles
• How EOG/EOC Tests Are Built
• Reporting
• Frequently Asked Questions (FAQ)
3
The Mission of the Accountability
Services Division is to…
(a) promote the academic achievement of all North Carolina public school students and (b) assist stakeholders in understanding and gauging this achievement against state and national standards
4
• reliable and valid assessment instruments
• suitable assessment instruments for all students
• accurate and statistically appropriate reports
5
Frequently Asked Question
Question:
Did the test determine the content standards?
Answer:
NO. Content standards drive item and test development.
6
Statewide Summative
Types of Assessment
• Classroom/Formative (local)
  – Frequency: Minute-to-minute, daily
  – Purpose/Reporting: Descriptive feedback to adjust instruction & support learning
• Interim/Benchmark (local)
  – Frequency: Multiple per year
  – Evaluates: Mastery of a set of standards
  – Purpose/Reporting: Inform stakeholders (e.g., teachers, students, parents)
• Summative EOG/EOC (statewide)
  – Frequency: End of year/course
  – Evaluates: Overall mastery
  – Purpose/Reporting: High-stakes accountability
8
Why is statewide assessment important?
• Is the environment conducive for learning?
• Are we moving all students forward?
• Are we preparing our students to be successful?
• Is the state providing a sound basic education to every student?
9
Frequently Asked Question
Question:
Are there close to 200 state-mandated tests in NC?
Answer:
NO. See the testing calendar and assessment options at http://www.ncpublicschools.org/accountability/policies/
10
Guiding Philosophy
• Scores serve as one piece of evidence
(snapshot) regarding students’ knowledge
& skills.
• Instruction and placement decisions should be made based on a cumulative body of evidence.
• Scores are most reliable at the scale score level.
12
13
EOG/EOC Measurement
Two important goals are to…
1) Achieve the most reliable and accurate snapshot of student achievement with minimal impact on instructional time within legislative guidelines.
• Fewer items per assessment
• Shorter test administration time
14
EOG/EOC Measurement
Two important goals are to…
2) Remove bias using valid and reliable psychometric methods throughout the test development process.
15
EOG/EOC Measurement
• Validity depends on evidence that the test is measuring what it is designed to measure
• Reliability depends on scores being consistent
• Must have reliability in order to have validity
16
• Content Validity
– Items are carefully aligned to the content standards
– The NCDPI also contracts to have independent alignment studies of its assessments
• Concurrent Validity
– Correlation of student performance with other measures
17
Reliability refers to the consistency of the scores on a test (if a test were repeated, a student would obtain close to the same score).
• The state uses coefficient alpha to estimate reliability
• The industry standard for state assessments used for accountability purposes is a coefficient alpha of .85 or higher
– The EOGs & EOCs exceed that value
18
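Coefficient (Cronbach's) alpha can be illustrated with a small computation. Below is a minimal sketch, assuming a tiny hypothetical matrix of item scores (rows = students, columns = items); it illustrates the formula only and is not NCDPI's scoring code.

```python
import numpy as np

def coefficient_alpha(scores):
    """Cronbach's alpha for a students-by-items score matrix."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                          # number of items
    item_vars = scores.var(axis=0, ddof=1)       # variance of each item
    total_var = scores.sum(axis=1).var(ddof=1)   # variance of students' total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical 0/1 responses for 5 students on 4 items (illustration only)
responses = [[1, 1, 0, 1],
             [1, 0, 0, 1],
             [0, 0, 0, 0],
             [1, 1, 1, 1],
             [1, 1, 0, 0]]
print(round(coefficient_alpha(responses), 2))  # -> 0.73
```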
Subject    Cronbach's Alpha
ELA        .88–.92
Math       .90–.93
Science    .90–.92
* Minimum of .85 is standard
19
When interpreting score averages…
• Classroom-level data are more reliable than student-level data
• School-level data are more reliable than classroom-level data
• LEA-level data are more reliable than school-level data
• State-level data are more reliable than LEA-level data
More observations = more reliable inferences
20
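One way to see why aggregate averages are more trustworthy is that the standard error of a mean shrinks as more scores are averaged. A minimal sketch, assuming an illustrative scale-score standard deviation of 10 and round student counts at each level (not actual NC figures):

```python
import math

score_sd = 10.0  # illustrative scale-score standard deviation (assumption)
for level, n in [("student", 1), ("classroom", 25), ("school", 400), ("LEA", 10000)]:
    se = score_sd / math.sqrt(n)  # standard error of the mean of n scores
    print(f"{level:10s} n = {n:6d}  SE of mean = {se:.2f}")
```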
• Recall our goals
– Minimizing impact on instructional time
– Valid and reliable overall scale score
• Giving reliable scores at the sub-topic* level would require MORE
– Items and testing time (double or triple the number of items per test)
21
*Topic and *Sub-Topic
Umbrella terms used for curriculum clarification
• 1st and 2nd level of curriculum structure
– CCSS ELA Strand/Standard
• RI.4.3 – Strand RI (grade 4), Standard 3
– CCSS Math is Domain/Cluster/Standard
• 3.NF.A.2 – Grade 3, Domain NF, Cluster A, Standard 2
– ES Science is Standard/Clarifying Objective
• 5.E.1 – Standard 5 (Earth Systems), Clarifying
Objective 1
22
Common Questions
State Standards
Items/Forms
Technical Quality
Standard Setting
The answer will dictate what types of data are available and how those data can be used.
23
EOG/EOC OVERVIEW: Common Questions
1. Who writes the items?
• Teachers who successfully complete an online training system for item writing are contracted to write items for specific content standards.
2. How are the test specifications determined?
• The NCDPI convenes teachers and content experts (Test Specification Panel) to provide input on what percent of the test should cover groups of the content standards (Blueprint).
24
EOG/EOC OVERVIEW: Common Questions
• Blueprint
– Priority of Topics and Sub-topics determined by the Test Specification Panel
• NC Teachers
• DPI-Curriculum
• DPI-Exceptional Children
• DPI-Test Development
• Outside content experts (e.g., professors)
25
EOG/EOC OVERVIEW: Common Questions
3. Why are items field tested?
• Before being placed on a test form, the psychometricians need item statistics to control the overall difficulty and reliability of a form.
4. Is one test form harder than another form?
• No, all of the forms (online and paper) of a given test for a grade/subject are equivalent.
26
State Standards: What we measure…
• Common Core State Standards (ELA & Math) & Essential Standards (Science)
– More rigorous standards aligned to College and Career Readiness
• Global competitiveness
– Same standards across states
• Could compare achievement
• Continuity
– Adopted by State Board of Education (SBE) in 2010
27
28
Frequently Asked Question
Question:
Does anyone review the item statistics for each item each year?
Answer:
YES. Item statistics are reviewed for every item on every form after each semester and year-long test cycle, as soon as a representative data sample is received by Accountability Services.
29
The Life of an EOG/EOC item…
State Board of Education (SBE) Policy GCS-A-013 specifies the process used to develop tests for the North Carolina Testing Program:
• Content Standards are adopted by the SBE
• Test Specifications are determined with input from teachers and content experts
• Items are developed, aligned to the content standards, and then piloted/field tested (item tryout)
– Roughly a 3-to-1 ratio of items developed to items needed
• Items deemed to meet technical criteria based on the field test statistics are then placed on a test form
30
Item Characteristics
• Level of difficulty
– Easy, medium, hard
• Level of cognitive complexity
– Example: Depth of Knowledge (DOK)
• Different types of items
31
• Multiple-Choice (MC)
• Technology Enhanced (TE)
– Drag and Drop (DD)
– Text Identify (TI)
– String Replacement (SR)
• Constructed Response (CR)
– Gridded Response (GR)/Numeric Entry (NE)
– Short Answer (SA)
32
Item Review
17 Steps
• Teachers
• Content Lead
• Content Specialist
• DPI-Curriculum
• EC/ESL/VI
• Production
• Editing
• Subject-specific Test Measurement Specialist (TMS)
33
The Life of an EOG/EOC test form…
• Test forms are reviewed for technical quality and content alignment
– Teachers, content experts, editing staff, and NCDPI curriculum staff
– Psychometricians affirm that forms of a given test are equivalent with respect to content and difficulty (forms are not harder or easier than each other)
34
The Life of an EOG/EOC test form…
• Tests are administered to students
• In the first year of administration, the standard setting is conducted and achievement levels are set
• Psychometricians continue to monitor the forms to ensure equivalency
35
Form Review
27 Steps
• Content Lead
• Content Manager
• Content Specialist
• Production, Editing
• Psychometrician
• IT Staff
• Outside Content Specialist
• Subject-specific Test Measurement Specialist (TMS)
36
Operational Item Review Operational Form Review
EOG/EOC Form Construction
• All forms of each test are built to the same test specification
• Topic Level:
– Forms within a subject have the same distribution of items by topic
• Sub-topic Level:
– Forms within a subject may or may not have the same distribution of items by sub-topic
38
Three-Year Timeline
• 2010-11 Item Writing/Reviewing/Pilot Test
• 2011-12 Field Test
• 2012-13 Operational Forms
– First year requires standard setting
– Continually add new forms each successive year (pre-equated to first-year forms)
39
If you would like to participate in item writing or review…
Website
• https://center.ncsu.edu/nc/login/index.php
40
41
How do we know the EOG/EOCs are technically sound?
We adhere to the Standards for Educational and
Psychological Testing
• American Psychological Association (APA)
• American Educational Research Association (AERA)
• National Council on Measurement in Education (NCME)
42
How do we know the EOG/EOCs are technically sound?
The development process and the analysis plan are reviewed by Technical Advisors,
National/State Leaders in…
• Educational Assessment
• Standard Setting
• Psychometrics
• District Level Accountability/Research
• Exceptional Children Research
• Others
43
How do we know the EOG/EOCs are technically sound?
Evidence from the tests is submitted to the U.S. Department of Education (USDOE) for Peer Review. Some examples are…
• Valid & Reliable
• Fair & Accessible to all students
• Technical Quality of items & forms
• Alignment
• Content Standards
44
45
Frequently Asked Question
Question:
Are the cut scores (Levels 1-4) determined by NCDPI staff?
Answer:
NO. Cut scores are determined by NC educators at a standard setting meeting. In fact, the meeting itself is managed and facilitated by an outside vendor. The NC State Board of Education (SBE) must approve the recommended cut scores or request specific changes.
46
Standard Setting: General Assessments
• A nationally recognized company (vendor) conducted the study
• The vendor managed the work while the NCDPI observed
• Multi-day in-person meeting
• Used a proven test-based method
– Item Mapping (Bookmark)
– Three requirements: Item Map, ALDs, & panelists (educators) to make content judgments
47
Item Map – Ordered Item Booklet (OIB)
• Review the OIB
– Items (2012-13 administration) ordered in ascending difficulty as calculated from actual student performance
– Each item appears on a single page
– Panelists identify the last item that a "just barely" proficient student would answer correctly more often than not
48
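A rough illustration of how a bookmark placement can translate into a cut score on the scale appears below. The item scale locations, the bookmark page, and the midpoint rule are all illustrative assumptions for the sketch, not the vendor's actual bookmark procedure.

```python
# Ordered item booklet: illustrative item scale locations in ascending difficulty
ordered_item_difficulties = [-1.8, -1.2, -0.7, -0.3, 0.1, 0.4, 0.9, 1.5]

def cut_from_bookmark(difficulties, bookmark_page):
    """Illustrative rule: place the cut between the bookmarked item (the last one a
    'just barely' student answers correctly more often than not) and the next item."""
    last_mastered = difficulties[bookmark_page - 1]
    next_harder = difficulties[bookmark_page]
    return (last_mastered + next_harder) / 2  # midpoint, purely for illustration

print(cut_from_bookmark(ordered_item_difficulties, bookmark_page=5))  # -> 0.25
```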
Item Map – Ordered Item Booklet (OIB)
[Graphic: EOG/EOC items ordered along the score scale]
49
Achievement Level Descriptors (ALDs) Development & Purpose
• Reviewed current content standards
• Developed ALDs
– Knowledge/skills expected at each level
– Eventually used in score reporting
– Used to define entry points to each level
• Developed “Just Barely” Descriptors
– These define the entry points of each achievement level.
– These represent the minimum knowledge and skill needed to be proficient.
50
NC READY Achievement Levels
Achievement Level 4 - Superior Command
Achievement Level 3 - Solid Command
Achievement Level 2 - Partial Command
Achievement Level 1 - Limited Command
51
Overview: Proficiency cut placement
[Graphic: achievement level cuts (Superior, Solid, Partial, and Limited Command) placed along the EOG/EOC item scale]
What do the scores mean?
52
Overview: Proficiency cut placement
[Graphic: achievement level cuts (Superior, Solid, Partial, and Limited Command) along the EOG/EOC item scale, with an "On Track" label]
ALDs capture the meaning of the scale levels.
53
Standard Setting: Teacher Panels
• Judgment of experts in content and student population
– Over 200 NC Educators participated
• 16-20 per panel
– Post-secondary educators were represented
• Experts in content
– Each region was represented
– Each grade-level and content area was represented
• Grade-band groups
– 3-5; 6-8; High School
54
Three Rounds of Judgment
• Round 1
– Panelists set cuts, then broke into small groups to discuss the cut scores and why they chose them
• Round 2
– Set cuts, then reviewed external validity measures
• EXPLORE/ACT linking study
• Percentage of 2013 test takers in each level (initial impact data)
• Round 3
– Set the final teacher grade-level panel cut, received final impact data, and completed an evaluation survey
55
Vertical Articulation
• Last step
– Subset of teacher panelists
– Reviewed results across grades and/or subjects
– Small adjustments made to smooth the percentages across grades within each content area
56
Standard Setting: SBE Approval
• After all standard setting meetings, the next phase…
– Policymakers: SBE, NCDPI (Curriculum, Exceptional Children, & Accountability)
– Presented "teacher panel" recommendations to the SBE for approval
• Ultimately a policy decision by SBE
– Presented September 2013
– Adopted October 2013
57
Individual Student Report
Class Rosters
Goal Summary Reports
Accountability Reports
58
Frequently Asked Question
Question:
Did the NCDPI delay 2012-13 scores in case the test was bad?
Answer:
NO. There must be a standard setting process to determine Levels 1-4 when new assessments are administered. A full year’s worth of data is necessary before teachers can be convened for the standard setting process.
59
Frequently Asked Question
Question:
Did NCDPI require districts to use a 0-100 score as at least 20% of the student's final grade during 2012-13?
Answer:
NO. The NCDPI waived NC State Board of Education policy GCS-C-003 during the 2012-13 school year.
(http://sbepolicy.dpi.state.nc.us/Policies/GCS-C-003.asp?Acr=GCS&Cat=C&Pol=003)
60
Different Reports with which you may be familiar…
Level – Report
• Student – Individual Student Report
• Classroom – Goal Summary, Class Roster
• School – Goal Summary, Class Roster
• LEA – Accountability Reports
• State – Accountability Reports
61
Example: Goal Report at Teacher Level
[Report image: profile based on blueprint; profile based on Common Core categories]
62
Common Problems with Reporting,
Interpreting, and Using Subscores
• Subscores are usually highly correlated, so differences between subscores are often small.
• Subscores at the topic level are often not specific enough to inform instructional practice.
• Subscores at the sub-topic level are often very specific, but usually have too few items to be reliable.
63
• Scores based on fewer than 10 items are typically too unreliable to interpret. Mean scores from spiraled forms improve reliability.
64
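The Spearman-Brown formula shows why short sub-topic scores are much less reliable than the full-length scale score. A minimal sketch with illustrative numbers (a full-test alpha of .90 and 50 items; not actual NCDPI values):

```python
def spearman_brown(reliability_full, n_items_full, n_items_short):
    """Predicted reliability when a test of n_items_full items is shortened to n_items_short."""
    ratio = n_items_short / n_items_full
    return (ratio * reliability_full) / (1 + (ratio - 1) * reliability_full)

full_alpha, full_length = 0.90, 50  # illustrative assumptions
for short_length in (25, 10, 5):
    print(short_length, round(spearman_brown(full_alpha, full_length, short_length), 2))
# -> 25 items ~.82, 10 items ~.64, 5 items ~.47
```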
Goal Summary Reports provide valid data about curriculum implementation only when…
• all forms in the spiral are administered within the same classroom/school/LEA,
• there are at least five students per form (with multiple forms), and
• approximately equal numbers of students have taken each form.
They cannot provide valid data when there is not enough participation or when only one form is used (a rough check of these conditions is sketched below).
65
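The participation conditions above can be checked mechanically. A minimal sketch, assuming a hypothetical dictionary of form IDs and student counts for one school; the "roughly balanced" threshold is an assumption for illustration, not an NCDPI rule:

```python
def spiral_counts_usable(form_counts, min_per_form=5):
    """True only if multiple forms were given, every form has at least min_per_form
    students, and the per-form counts are roughly balanced."""
    counts = list(form_counts.values())
    if len(counts) < 2:
        return False                        # only one form: form effects cannot average out
    if min(counts) < min_per_form:
        return False                        # too few students on at least one form
    return max(counts) <= 2 * min(counts)   # "approximately equal" -- illustrative cutoff

print(spiral_counts_usable({"Form A": 24, "Form B": 22, "Form C": 25}))  # True
print(spiral_counts_usable({"Form A": 40, "Form B": 3}))                 # False
```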
Four Steps to Interpreting the Goal
Summary Reports
1. Transform the percentage of items per form to the raw number of items.
2. Transform weighted mean percent correct to raw subscore correct.
3. Determine the room for improvement by subtracting the raw subscore correct from the number of items per reporting category.
4. Triangulate results with other measures.
66
Math I Example from Fall 2013
67
Step 1
Transform the percentage of items per form to the raw number of items (49 × .306 ≈ 15 items).
68
Step 2
Transform weighted mean percent correct to raw subscore correct (15 × .458 = 6.87 items).
69
Step 3
Determine the room for improvement by subtracting the raw subscore correct from the number of items per reporting category (15 - 6.87 = 8.13).
Room for Improvement: 8.13
70
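The arithmetic in Steps 1-3 can be written out directly, using the figures from the Math I example above (49 items on the form, 30.6% of items in the reporting category, 45.8% weighted mean percent correct):

```python
items_on_form = 49
pct_items_in_category = 0.306          # percentage of items in this reporting category
weighted_mean_pct_correct = 0.458      # weighted mean percent correct from the report

items_in_category = round(items_on_form * pct_items_in_category)      # Step 1: 15 items
raw_subscore_correct = items_in_category * weighted_mean_pct_correct  # Step 2: 6.87 items
room_for_improvement = items_in_category - raw_subscore_correct       # Step 3: 8.13 items

print(items_in_category, round(raw_subscore_correct, 2), round(room_for_improvement, 2))
# -> 15 6.87 8.13
```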
Interpreting Goal Summary Reports
Room for Improvement by reporting category: 8.13, 12.67, 1.74, 1.96, 2.93
71
Interpreting Goal Summary Reports
Actual Correlations
• Functions: r = .73
• Algebra: r = .68
Be more confident in the longer, more reliable subscores.
72
Always Remember…
• Test items on forms used from year to year are different.
– Tests are equivalent at the total score level, not at the sub-topic level.
– Thus, forms from year to year may have more or less difficult items on a particular topic or sub-topic.
73
Tests are only one source of data
• Augment school-level Goal Summary Reports with other sources of evidence like classroom observation and interaction.
– Goal Summary Reports
– Classroom observations
– Formative Assessment
– Benchmark Assessment
– Pacing guides
– Lesson plans
– Peer observations
74
Comparing Scores:
Examples of “Apples to Apples”
• Same year
– State to LEA average scale score
– State to School average scale score
– State to LEA or School percent correct for the same topic
• Across years
– State, LEA, or School average scale scores within curriculum cycle across years
– Student percentiles across years
75
• Released Forms: http://www.ncpublicschools.org/accountability/testing/releasedforms
• Test Specifications: http://www.ncpublicschools.org/accountability/
• Technical Reports (coming in May): http://www.ncpublicschools.org/accountability/testing/technicalnotes
76
• Guidelines, Practice, and Examples for Math Gridded Response Items: http://www.ncpublicschools.org/accountability/
• NC Final Exams (test specs, released forms/items, reference sheets): http://www.ncpublicschools.org/accountability/common-exams/
• NC Testing Program Overview: http://www.ncpublicschools.org/docs/accountability/nctestoverview1314.pdf
77
Resources
• NC READY Accountability Report: http://www.ncaccountabilitymodel.org/
• Achievement Level Information (cut scores & descriptors): http://www.ncpublicschools.org/accountability/testing/shared/achievelevel/
78
79