What Parents and Teachers Need to Know
About Interpreting Test Results
Lee Ann R. Sharman, M.S.
ORBIDA Lecture Series
April 13, 2010
You’re Not Paying Attention!
By the end of our session, you will understand these key terms/concepts:
◦ Different strokes for different folks: all tests are not equal!
◦ Basic statistics you MUST know
Reliability and validity
The Bell Curve – a beautiful thing
◦ Error, and why it matters
◦ Common mistakes that lead to poor decisions
Entitlement decisions (eligibility)
Skills assessment (diagnostic)
Screening and Progress Monitoring (RtI)
Instructional planning, accommodations and modifications
Curriculum evaluation – is it working?
Increased focus on data-based decision making to measure outcomes
Working smarter: ask the right questions!
Understand the differences between types of tests and what they were designed to measure:
◦ Curriculum-based measures (DIBELS, AIMSweb)
◦ Teacher-made criterion referenced tests
◦ Published criterion referenced tests
◦ Norm-referenced tests of Achievement
   OAKS
   Woodcock-Johnson III
◦ Norm-referenced tests of Cognitive Ability
The test you choose depends on what questions you want to answer.
School records (file reviews)
Interviews
Medical and Developmental histories
Error analyses
Use of portfolios
Observations
The Snapshot: Point in Time Performance
Measuring Improvement (Change) and Growth
…You can use a hammer to push in a screw, but a screwdriver will be easier and more efficient
What OAKS is:
• OAKS is a “Point in Time” measurement, intended to be used more as a Summative Assessment. It’s a SNAPSHOT.
• Gives information on group achievement towards state standards to stakeholders: “Are enough students in our district meeting benchmarks?”
What OAKS is NOT:
OAKS is not intended to give information (see OARs) that will inform instruction or interventions. It is not:
◦ A tool designed for Progress Monitoring
◦ A measure of aptitude or ability
◦ A comprehensive measure of identified content
◦ All models involve tiers of interventions, progress monitoring, and cut scores to determine who is a “responder” (or not).
◦ DIBELS is a commonly used tool for progress monitoring
A few different models, but in this case we refer to measurement of the cognitive abilities underlying areas of unexpected low academic achievement
Specific cognitive abilities (processing measures, e.g. Rapid Automatic Naming, Phonemic Awareness, Long-Term Retrieval) predict reading, writing, and math acquisition ability
1. What kind of test is it (e.g., norm- or criterion-referenced)?
2. What is it used for – the purpose?
3. Is it valid for the stated purpose (what it measures or doesn’t measure)?
4. Is the person administering the test a qualified administrator?
5. Are the results valid (test conditions optimal, etc.)?
Parental permission…true informed consent
Screening for sensory impairments or physical problems
File review of school records
Parent/caregiver interview
Documented interventions and quality instruction
Behavioral assessment or observation
Summary and recommendations
I. Identifying data – the Who, What, When
II. Background Information
   A. Student history
   B. Reason for Referral
   C. Classroom Observation
   D. Parent Information
   E. Instruction received/strategies implemented
III. Test Results
IV. Test Interpretation
V. Summary and Conclusions
   A. Summary
   B. Recommendations for instruction
   C. Recommendations for further assessment, if needed
Must haves:
◦ Skilled examiner
◦ Optimal test conditions
◦ Cultural bias – be aware
◦ Validity/reliability
◦ Appropriate measures for goal
◦ Home/Environmental issues
◦ Sensory acuity problems
◦ Previous educational history
◦ Language factors
Second language and/or language disorders
◦ Social/Emotional/Behavioral issues
The Matthew Effect – poor reading skills depress IQ scores (Stanovich)… “The rich get richer”
The Flynn Effect – IQ is increasing in the population over time; tests are renormed to reflect this phenomenon
The devil is in those details…learn the basic principles of statistics
Statistics are used to measure things and describe relationships between things, using numbers
1. Standard Scores (SS) and Scaled Scores (ss)
2. Percentile Ranks (% rank)
3. Age and Grade Equivalents (AE/GE)
4. Relative Proficiency Index (RPI)
OR, The Normal Frequency Distribution
Mean and standard deviation of the test used reported
Standard scores, percentile ranks, and standard errors of measures, with explanations of each
Both composite or broad scores and subtest scores, with an explanation of each
Information about developmental ceilings, functional levels, skill sequences, and instructional needs, so that assessment/curriculum linkages can be used to write the IEP goals
These are raw scores which have been transformed to have a given mean (average) and standard deviation (set range or unit of scores). The student’s test score is compared to that average. A standard score expresses how far a student’s score lies above or below the average of the total distribution of scores.
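To make that transformation concrete, here is a minimal sketch in Python (the raw-score mean and SD are invented for illustration; real tests use norm tables keyed to age or grade rather than this simple formula):

# Assumed norming-sample values, for illustration only
raw_score = 42        # student's raw score
norm_mean = 36.0      # mean raw score of same-age peers in the norming sample
norm_sd = 6.0         # standard deviation of those raw scores

z = (raw_score - norm_mean) / norm_sd   # distance from the mean, in SD units
standard_score = 100 + 15 * z           # SS scale: mean 100, SD 15
scaled_score = 10 + 3 * z               # ss scale: mean 10, SD 3
print(z, standard_score, scaled_score)  # 1.0, 115.0, 13.0 -> one SD above average

On the familiar mean-100, SD-15 scale, roughly two thirds of the norm group falls between 85 and 115 and about 95% between 70 and 130, which is what the bell curve illustrates.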
Composite or Cluster scores
Standard or scaled scores
Raw scores
Similar to SS, but in a different form. Allows us to determine a student’s position (relative ranking) compared to the standardized sample
Percentile rank is NOT the same as a percent score! PR refers to a percentage of persons; PC refers to a percentage of test items correct.
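A rough sketch of the difference, assuming the usual bell-curve (normal) scale that norm tables are built on:

from statistics import NormalDist

standard_score = 110                         # on the mean-100, SD-15 scale
percentile_rank = NormalDist(100, 15).cdf(standard_score) * 100
print(round(percentile_rank))                # ~75: did as well as or better than ~75% of peers

percent_correct = 18 / 20 * 100              # 90% of items right on a classroom test
print(percent_correct)                       # says nothing about standing in the norm group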
Valuable statistic, found only on WJ-III
◦ Written as a percentage or number out of 90, indicating the student’s expected percent of success on tasks that students in the comparison group perform with 90% success. Correlated with Independent, Instructional, and Frustration levels (see sample)
Making faulty comparisons: Compare only data sets measuring the same content, with good content/construct validity, that are NORMED ON THE SAME POPULATION
Using an AE/GE as a measure of the child’s proficiency/skill mastery of grade level material
Error exists! Don’t forget about the confidence intervals
◦ SEM creates uncertainty around reporting 1 number (see the sketch after this list)
Confusing Percentile RANKS with Percentages:
◦ PR = relative ranking out of 100
◦ Percentage = percentage correct
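As noted above, a small sketch of how the SEM turns one reported number into a range (the reliability and SD values here are typical but assumed; real SEMs come from the test manual):

reliability = 0.96                            # test reliability coefficient (assumed)
sd = 15                                       # SD of the standard-score scale
sem = sd * (1 - reliability) ** 0.5           # standard error of measurement = 3.0

obtained = 85                                 # the single number that gets reported
band_68 = (obtained - sem, obtained + sem)                # +/- 1 SEM
band_95 = (obtained - 1.96 * sem, obtained + 1.96 * sem)  # +/- ~2 SEM
print(band_68)   # (82.0, 88.0)
print(band_95)   # roughly (79.1, 90.9) -- plausibly anywhere from well below average to average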
Age equivalents are developed by figuring out what the average test score is (the mean) for a group of children of a certain age taking the test; not the same as skills
Grade equivalents are developed by figuring out what the average test score is (the mean) for a student in each grade.
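A minimal sketch of that idea (the norm-table values are invented): an age equivalent is simply the age whose average raw score matches the student’s raw score, which is why it is not a statement about mastered skills.

# Invented table: average raw score earned by each age group in the norming sample
mean_raw_by_age = {7: 21, 8: 27, 9: 33, 10: 38, 11: 42}

def age_equivalent(raw_score):
    # Return the age whose average raw score is closest to this raw score
    return min(mean_raw_by_age, key=lambda age: abs(mean_raw_by_age[age] - raw_score))

print(age_equivalent(34))   # 9 -- a 10-year-old scoring 34 gets "AE 9", even though
                            # that score is only a few points below the age-10 average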
Commonly used
Misleading
Misunderstood
Difficult to explain
May have little relevance
Avoid in favor of Standard Scores/%Ranks
“When assessed with teacher made tests, Sally locates information within the text with 60% accuracy.”
VS.
“Sally’s performance on the OLSAT falls at the 60th percentile rank.”
Are the student’s skills better developed in one part of a domain than another?
For example:
“While Susan’s Broad Math score was within the low average range, she performed at an average level on a subtest that assesses problem solving, but scored well below average on a subtest that assesses basic math calculation.”
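When weighing a contrast like Susan’s, one common rule of thumb (sketched here with assumed SEM values; publishers provide the actual SEMs and comparison tables) is to ask whether the gap between the two subtest scores is larger than the error around both of them:

import math

problem_solving = 100              # subtest standard score (average)
calculation = 74                   # subtest standard score (well below average)
sem_ps, sem_calc = 4.0, 4.5        # SEMs for each subtest (assumed for the example)

se_difference = math.sqrt(sem_ps ** 2 + sem_calc ** 2)   # error of the difference
gap = abs(problem_solving - calculation)
print(gap, round(1.96 * se_difference, 1))
# 26 vs. ~11.8 -- the gap exceeds the 95% error band, so the strength/weakness
# pattern is unlikely to be measurement error alone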
Test scores don’t support the teacher report of a weakness?
◦ First, look at differences in task demands of the testing situation, and in the classroom, when hypothesizing a reason for the difference.
◦ Look at student’s Proficiency score (RPI) vs. Standard Score (SS)
“Although weaknesses in mathematics were noted as a concern by Billy’s teacher, Billy scored in the average range on assessments of math skills. These tests required Billy to perform calculations and to solve word problems that were read aloud to him. It was noted he often paused for 10 seconds or more before starting paper and pencil tasks in mathematics.”
“Billy’s teacher stated that he does well in spelling. However, he scored well below average on a subtest of spelling skills. Billy appeared to be bored while taking the spelling test, so a lack of vigilance in his effort may have depressed his score. Also, the school spelling tests use words he has been practicing for a week. The lower score may indicate that he is not maintaining the correct spelling of words in long-term memory, and is not able to correctly encode new words he has not had time to study.”
1. What kind of test is it (e.g., norm- or criterion-referenced)?
2. What is it used for – the purpose?
3. Is it valid for the stated purpose (what it measures or doesn’t measure)?
4. Is the person administering the test a qualified administrator?
5. Are the results valid (test conditions optimal, etc.)?
The good news: We are moving away from the old “Test and Place” mentality.
The challenge: School teams are using more comprehensive data sets, which require more knowledge to interpret
More good news: The best decisions are made using multiple sources of good information
“The true utility of assessment is the extent to which it enables us to find the match between the student and an intervention that is effective in getting him or her on track to reach a meaningful and important goal. The true validity of any assessment tool should be evaluated by the impact it has on student outcomes.”
(Cummings/McKenna 2007)
“…one of the problems of writing about intelligence is how to remind readers often enough how little an IQ score tells you about whether or not the human being next to you is someone whom you will admire or cherish.”
(Herrnstein and Murray)