The Art of Interpreting Test Results


So Much Data, So Little Time:

What Parents and Teachers Need to Know About Interpreting Test Results

Lee Ann R. Sharman, M.S.

ORBIDA Lecture Series April 13, 2010

You’re Not Paying Attention!

 By the end of our session, you will understand these key terms/concepts:
◦ Different strokes for different folks: all tests are not equal!
◦ Basic statistics you MUST know
◦ Reliability and validity
◦ The Bell Curve – a beautiful thing
◦ Error, and why it matters
◦ Common mistakes that lead to poor decisions

 Entitlement decisions (eligibility)
 Skills assessment (diagnostic)
 Screening and Progress Monitoring (RtI)
 Instructional planning, accommodations and modifications
 Curriculum evaluation – is it working?

 Increased focus on data-based decision making to measure outcomes

Working smarter: ask the right questions!

 Understand the differences between types of tests and what they were designed to measure:
◦ Curriculum-based measures (DIBELS, AIMSweb)
◦ Teacher-made criterion-referenced tests
◦ Published criterion-referenced tests
◦ Norm-referenced tests (e.g., OAKS; Woodcock-Johnson III Tests of Achievement and Cognitive Ability)
 The test you choose depends on what questions you want to answer.

 School records (file reviews)
 Interviews
 Medical and Developmental histories
 Error analyses
 Use of portfolios
 Observations

The Snapshot: Point in Time Performance

Measuring Improvement (Change) and Growth

 …You can use a hammer to push in a screw, but a screwdriver will be easier and more efficient




•OAKS is a “Point in Time” measurement, intended to be used more as a Summative Assessment.

•Gives stakeholders information on group achievement toward state standards: “Are enough students in our district meeting benchmarks?”

 What OAKS is NOT:
◦ A tool intended to give information that will inform instruction or interventions (see OARs)
◦ A tool designed for Progress Monitoring
◦ A measure of aptitude or ability
◦ A comprehensive measure of identified content

 ◦

Response to Intervention – RtI

All models involve tiers of interventions, progress monitoring, and cut scores to determine who is a “responder” (or not).

◦ DIBELS is a commonly used tool for progress monitoring

 A few different models, but in this case we refer to measurement of the cognitive abilities underlying areas of unexpected low academic achievement
 Specific cognitive abilities (processing measures, e.g. Rapid Automatic Naming, Phonemic Awareness, Long-Term Retrieval) predict reading, writing, and math acquisition ability


1. What kind of test is it? (e.g. Norm- or Criterion-referenced)
2. What is it used for – the purpose?
3. Is it valid for the stated purpose (what it measures or doesn’t measure)?
4. Is the person administering the test a qualified administrator?
5. Are the results valid (test conditions optimal, etc.)?

 Parental permission…true informed consent
 Screening for sensory impairments or physical problems
 File review of school records
 Parent/caregiver interview
 Documented interventions and quality instruction
 Intellectual and academic assessment
 Behavioral assessment or observation
 Summary and recommendations

I. Identifying data – the Who, What, When
II. Background Information
   A. Student history
   B. Reason for Referral
   C. Classroom Observation
   D. Parent Information
   E. Instruction received/strategies implemented
III. Test Results
IV. Test Interpretation
V. Summary and Conclusions
   A. Summary
   B. Recommendations for instruction
   C. Recommendations for further assessment, if needed

 Must haves: ◦ Skilled examiner ◦ Optimal test conditions ◦ Cultural bias – be aware ◦ Validity/reliability ◦ Appropriate measures for goal

Kids are more than the scores – the “rule

◦ Home/Environmental issues
◦ Sensory acuity problems
◦ Previous educational history
◦ Language factors  Second language and/or language disorders
◦ Social/Emotional/Behavioral issues

 The Matthew Effect (Stanovich) – “The rich get richer”: poor reading skills depress IQ scores
 The Flynn Effect – IQ is increasing in the population over time; tests are renormed to reflect this phenomenon

 The devil is in those details…learn the basic principles of statistics


Simply stated:

 Statistics are used to measure things and describe relationships between things, using numbers





 Standard Scores (SS) and Scaled Scores (ss)
 Percentile Ranks (% rank)
 Age and Grade Equivalents (AE/GE)
 Relative Proficiency Index (RPI)

OR, The Normal Frequency Distribution
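The bell curve’s value is that its proportions never change: the same fraction of scores always falls within a given number of standard deviations of the mean. A minimal sketch, using only the Python standard library, of the familiar “68-95-99.7” rule:

```python
import math

# The bell curve in numbers: the fraction of scores falling within
# 1, 2, and 3 standard deviations of the mean on a normal distribution
# (the "68-95-99.7" rule).

def within_sds(k):
    """Proportion of a normal distribution within k SDs of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within {k} SD: {within_sds(k):.1%}")
# within 1 SD: 68.3%, within 2 SD: 95.4%, within 3 SD: 99.7%
```

These fixed proportions are what make standard scores and percentile ranks comparable across different tests.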

 Mean and standard deviation of the test used
 Standard scores, percentile ranks, and standard errors of measurement, with explanations of each
 Both composite or broad scores and subtest scores, with an explanation of each

 Information about developmental ceilings, functional levels, skill sequences, and instructional needs upon which assessment/curriculum linkages can be used to write the IEP goals

 These are raw scores which have been transformed to have a given mean (average) and standard deviation (set range or unit of scores). The student’s test score is compared to that average. A standard score expresses how far a student’s score lies above or below the average of the total distribution of scores.
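The transformation described above can be sketched in a few lines of Python. The norm-group mean, SD, and raw score below are invented for illustration; real values come from a test’s norming tables:

```python
# Sketch of how a standard score is derived from a raw score, assuming
# the norm group's raw-score mean and SD are known. All numbers here
# are hypothetical, not drawn from any real test.

def standard_score(raw, norm_mean, norm_sd, target_mean=100, target_sd=15):
    """Convert a raw score to a standard score (default scale: mean 100, SD 15)."""
    z = (raw - norm_mean) / norm_sd          # distance from average, in SD units
    return target_mean + z * target_sd

# A raw score of 38 when the norm group averages 30 (SD 8):
print(standard_score(38, norm_mean=30, norm_sd=8))  # 115.0 (one SD above average)
```

The same raw score can yield different standard scores against different norm groups, which is why scores should only be compared when the tests were normed on the same population.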

Composite or Cluster scores
Standard or scaled scores
Raw scores

 Similar to SS, but in a different form. Allows us to determine a student’s position (relative ranking) compared to the standardized sample
 Percentile rank is NOT the same as a percent score! PR refers to a percentage of persons; PC refers to a percentage of test items correct.
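To make the distinction concrete, here is a hypothetical sketch (Python; it assumes scores are normally distributed on a mean-100, SD-15 scale): a student can answer 60% of items correctly yet rank nowhere near the 60th percentile:

```python
import math

# Illustrates why a percentile RANK is not a percent-correct score.
# The numbers are invented for illustration.

def percentile_rank(standard_score, mean=100.0, sd=15.0):
    """Percent of the norm group scoring at or below this standard score,
    assuming scores are normally distributed."""
    z = (standard_score - mean) / sd
    return 100 * 0.5 * (1 + math.erf(z / math.sqrt(2)))

pct_correct = 100 * 30 / 50           # 30 of 50 items right -> 60% correct
pr = percentile_rank(90)              # standard score 90
print(f"Percent correct: {pct_correct:.0f}%")
print(f"Percentile rank of SS 90: ~{pr:.0f}")   # roughly the 25th percentile
```

Percent correct describes items; percentile rank describes people.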

 Valuable statistic, found only on the WJ-III
◦ Written as a fraction out of 90 (e.g. 75/90), indicating the student’s expected percent of success on tasks that students in the comparison group would perform with 90% success
◦ Correlated with Independent, Instructional, and Frustration levels (see sample)

 Making faulty comparisons: compare only data sets measuring the same content, with good content/construct validity, that are NORMED ON THE SAME POPULATION
 Using an AE/GE as a measure of the child’s proficiency/skill mastery of grade-level material
 Error exists! Don’t forget about the confidence intervals
◦ SEM creates uncertainty around reporting 1 number
 Confusing Percentile RANKS with Percentages:
◦ PR = relative ranking out of 100
◦ Percentage = percentage of test items correct
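The SEM point can be made concrete with a short sketch (Python; the obtained score and SEM below are hypothetical — real manuals report SEMs per test and age band). An obtained score is a single sample from a band of plausible “true” scores:

```python
# Why one number is not enough: the standard error of measurement (SEM)
# defines a confidence interval around the obtained score. Values here
# are invented for illustration.

def confidence_interval(obtained, sem, z=1.96):
    """Range likely to contain the student's 'true' score
    (default: 95% interval, assuming normally distributed error)."""
    margin = z * sem
    return (obtained - margin, obtained + margin)

low, high = confidence_interval(obtained=92, sem=3)
print(f"Obtained SS 92, 95% interval: {low:.1f} to {high:.1f}")  # 86.1 to 97.9
```

A band that wide can straddle an eligibility cut score, which is why decisions should never rest on a single reported number.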

 Age equivalents are developed by figuring out the average test score (the mean) for a group of children of a certain age taking the test; not the same as skills
 Grade equivalents are developed by figuring out the average test score (the mean) for students in each grade.
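A minimal sketch of that derivation, with invented norm data: the grade equivalent is simply the grade whose average raw score is closest to the student’s, which is why it says nothing about mastery of grade-level skills:

```python
# How a grade equivalent is assigned: the mean raw score of the norm
# sample at each grade becomes the "equivalent" for that score.
# The norm data below are invented for illustration.

grade_means = {1: 12, 2: 20, 3: 27, 4: 33, 5: 38}  # mean raw score per grade

def grade_equivalent(raw):
    """Return the grade whose norm-group mean is closest to this raw score.
    Note: matching a grade's AVERAGE score is not the same as having
    mastered that grade's curriculum."""
    return min(grade_means, key=lambda g: abs(grade_means[g] - raw))

print(grade_equivalent(26))  # 3: scored like an average 3rd grader on THIS test
```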

 Commonly used
 Misleading
 Misunderstood
 Difficult to explain
 May have little relevance
 Avoid in favor of Standard Scores/%Ranks

 “When assessed with teacher-made tests, Sally locates information within the text with 60% accuracy.” VS.

 “Sally’s performance on the OLSAT falls at the 60th percentile rank.”

 Are the student’s skills better developed in one part of a domain than another?

For example: “While Susan’s Broad Math score was within the low average range, she performed at an average level on a subtest that assesses problem solving, but scored well below average on a subtest that assesses basic math calculation.”

 What if test scores don’t support the teacher’s report of a weakness?

◦ First, look at differences in task demands of the testing situation, and in the classroom, when hypothesizing a reason for the difference.

◦ Look at student’s Proficiency score (RPI) vs. Standard Score (SS)

 “Although weaknesses in mathematics were noted as a concern by Billy’s teacher, Billy scored in the average range on assessments of math skills. These tests required Billy to perform calculations and to solve word problems that were read aloud to him. It was noted he often paused for 10 seconds or more before starting paper and pencil tasks in mathematics.”

 “Billy’s teacher stated that he does well in spelling. However, he scored well below average on a subtest of spelling skills. Billy appeared to be bored while taking the spelling test, so a lack of vigilance in his effort may have depressed his score. Also, the school spelling tests use words he has been practicing for a week.”

 The lower score may indicate that he is maintaining the correct spelling of words in long-term memory, and is not able to correctly encode new words he has not had time to study.


1. What kind of test is it? (e.g. Norm- or Criterion-referenced)
2. What is it used for – the purpose?
3. Is it valid for the stated purpose (what it measures or doesn’t measure)?
4. Is the person administering the test a qualified administrator?
5. Are the results valid (test conditions optimal, etc.)?

 The good news: We are moving away from the old “Test and Place” mentality.  The challenge: School teams are using more comprehensive data sets, which require more knowledge to interpret  More good news: The best decisions are made using multiple sources of good information

 “The true utility of assessment is the extent to which it enables us to find the match between the student and an intervention that is effective in getting him or her on track to reach a meaningful and important goal. The true validity of any assessment tool should be evaluated by the impact it has on student outcomes.” (Cummings/McKenna 2007)

 “…one of the problems of writing about intelligence is how to remind readers often enough how little an IQ score tells you about whether or not the human being next to you is someone whom you will admire or cherish.” (Herrnstein and Murray)