The Art of Interpreting Test Results


So Much Data, So Little Time:

What Parents and Teachers Need to Know

About Interpreting Test Results

Lee Ann R. Sharman, M.S.

ORBIDA Lecture Series

April 13, 2010

You’re Not Paying Attention!

By the end of our session, you will understand these key terms/concepts:

◦ Different strokes for different folks: all tests are not equal!

◦ Basic statistics you MUST know

  ▪ Reliability and validity

  ▪ The Bell Curve – a beautiful thing

◦ Error, and why it matters

◦ Common mistakes that lead to poor decisions

  ▪ Entitlement decisions (eligibility)

  ▪ Skills assessment (diagnostic)

  ▪ Screening and progress monitoring (RtI)

  ▪ Instructional planning, accommodations, and modifications

  ▪ Curriculum evaluation – is it working?

• Increased focus on data-based decision making to measure outcomes

Working smarter: ask the right questions!

• Understand the differences between types of tests and what they were designed to measure:

◦ Curriculum-based measures (DIBELS, AIMSweb)

◦ Teacher-made criterion-referenced tests

◦ Published criterion-referenced tests

◦ Norm-referenced tests of achievement

  ▪ OAKS

  ▪ Woodcock-Johnson III

◦ Norm-referenced tests of cognitive ability

The test you choose depends on what questions you want to answer.

• School records (file reviews)

• Interviews

• Medical and developmental histories

• Error analyses

• Use of portfolios

• Observations

• The Snapshot: Point in Time Performance

• Measuring Improvement (Change) and Growth

• …You can use a hammer to push in a screw, but a screwdriver will be easier and more efficient

What OAKS is:

• OAKS is a “Point in Time” measurement, intended to be used more as a Summative Assessment. It’s a SNAPSHOT.

• It gives stakeholders information on group achievement toward state standards: “Are enough students in our district meeting benchmarks?”

What OAKS is NOT:

• OAKS is not intended to give information (see OARs) that will inform instruction or interventions

• A tool designed for progress monitoring

• A measure of aptitude or ability

• A comprehensive measure of identified content

Response to Intervention – RtI

◦ All models involve tiers of interventions, progress monitoring, and cut scores to determine who is a “responder” (or not).

◦ DIBELS is a commonly used tool for progress monitoring

A few different models exist, but in this case we refer to measurement of the cognitive abilities underlying areas of unexpected low academic achievement.

Specific cognitive abilities (processing measures, e.g. Rapid Automatic Naming, Phonemic Awareness, Long-Term Retrieval) predict reading, writing, and math acquisition.

1. WHAT KIND OF TEST IS THIS? (e.g. norm- or criterion-referenced)

2. What is it used for – the purpose?

3. Is it valid for the stated purpose (what it measures or doesn’t measure)?

4. Is the person administering the test a qualified administrator?

5. Are the results valid (test conditions optimal, etc.)?

• Parental permission…true informed consent

• Screening for sensory impairments or physical problems

• File review of school records

• Parent/caregiver interview

• Documented interventions and quality instruction

• Intellectual and academic assessment

• Behavioral assessment or observation

• Summary and recommendations

I. Identifying data – the Who, What, When

II. Background Information

  A. Student history

  B. Reason for Referral

  C. Classroom Observation

  D. Parent Information

  E. Instruction received/strategies implemented

III. Test Results

IV. Test Interpretation

V. Summary and Conclusions

  A. Summary

  B. Recommendations for instruction

  C. Recommendations for further assessment, if needed

• Must haves:

  ▪ Skilled examiner

  ▪ Optimal test conditions

  ▪ Cultural bias – be aware

  ▪ Validity/reliability

  ▪ Appropriate measures for the goal

Kids are more than the scores – the “rule outs”:

◦ Home/Environmental issues

◦ Sensory acuity problems

◦ Previous educational history

◦ Language factors

  ▪ Second language and/or language disorders

◦ Social/Emotional/Behavioral issues

The Matthew Effect – poor reading skills depress IQ scores (Stanovich)…“The rich get richer.”

The Flynn Effect – IQ is increasing in the population over time; tests are renormed to reflect this phenomenon

The devil is in those details…learn the basic principles of statistics.

Simply stated:

• Statistics are used to measure things and describe relationships between things, using numbers

1. Standard Scores (SS) and Scaled Scores (ss)

2. Percentile Ranks (% rank)

3. Age and Grade Equivalents (AE/GE)

4. Relative Proficiency Index (RPI)

The Bell Curve, OR, The Normal Frequency Distribution
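Because standard scores are built to follow this curve, the spread of scores is highly predictable. The sketch below is illustrative only (not part of the lecture); it uses SciPy’s normal distribution and the common mean-100, SD-15 standard score scale.

```python
# A minimal sketch (not from the lecture) of why the bell curve is so
# useful: with standard scores (mean 100, SD 15), the normal model
# predicts how test scores spread around the average.
from scipy.stats import norm

mean, sd = 100, 15
within_1sd = norm.cdf(115, mean, sd) - norm.cdf(85, mean, sd)
within_2sd = norm.cdf(130, mean, sd) - norm.cdf(70, mean, sd)

print(f"About {within_1sd:.0%} of students score between 85 and 115")  # ~68%
print(f"About {within_2sd:.0%} of students score between 70 and 130")  # ~95%
```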

• The mean and standard deviation of the test used

• Standard scores, percentile ranks, and standard errors of measurement, with explanations of each

• Both composite (broad) scores and subtest scores, with an explanation of each

• Information about developmental ceilings, functional levels, skill sequences, and instructional needs, so that assessment/curriculum linkages can be used to write the IEP goals

These are raw scores that have been transformed to have a given mean (average) and standard deviation (a set range or unit of scores). The student’s test score is compared to that average: a standard score expresses how far a student’s score lies above or below the average of the total distribution of scores.
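As a hedged illustration of that transformation (the norm-group numbers below are invented, and the mean-100/SD-15 target scale is simply the convention most norm-referenced tests use), this sketch converts a raw score to a standard score by way of a z-score.

```python
# Illustrative sketch: converting a raw score to a standard score.
# norm_mean/norm_sd describe the norm group; 100/15 is the usual SS scale.

def standard_score(raw, norm_mean, norm_sd, target_mean=100, target_sd=15):
    z = (raw - norm_mean) / norm_sd      # distance from the norm-group average, in SD units
    return target_mean + z * target_sd   # re-expressed on the familiar SS scale

# A raw score of 42, where the norm group averaged 50 with SD 8,
# sits exactly one SD below average: standard score 85.
print(standard_score(42, norm_mean=50, norm_sd=8))  # -> 85.0
```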

Levels of scores: Raw scores → Standard or scaled scores → Composite or cluster scores

Similar to SS, but in a different form: percentile ranks allow us to determine a student’s position (relative ranking) compared to the standardization sample.

Percentile rank is NOT the same as a percent score! PR refers to a percentage of persons; PC refers to a percentage of test items correct.
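The toy sketch below makes that distinction concrete. The norm-group scores are invented; SciPy’s percentileofscore computes the percentage of the norm group scoring at or below a given raw score.

```python
# Invented norm-group raw scores on a hypothetical 40-item test.
from scipy.stats import percentileofscore

norm_sample = [12, 15, 18, 20, 22, 25, 27, 30, 33, 35]

raw = 27
pr = percentileofscore(norm_sample, raw, kind="weak")  # % of peers at or below this score
pct_correct = raw / 40 * 100                           # % of the 40 items answered correctly

print(f"Percentile rank: {pr:.0f}")            # 70 -> outscored 70% of the norm group
print(f"Percent correct: {pct_correct:.0f}%")  # 68% -> mastery of the item set
```

The same raw score yields two very different numbers because they answer two different questions: standing among peers versus items answered correctly.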

Valuable statistic, found only on the WJ-III

◦ Written as a number out of 90 (e.g. 75/90), indicating the percent success the student is predicted to have on tasks that students in the comparison group would perform with 90% success. Correlated with Independent, Instructional, and Frustration levels (see sample)

• Making faulty comparisons: compare only data sets measuring the same content, with good content/construct validity, that are NORMED ON THE SAME POPULATION

• Using an AE/GE as a measure of the child’s proficiency/skill mastery of grade-level material

• Error exists! Don’t forget about the confidence intervals (see the sketch after this list)

  ◦ The SEM creates uncertainty around reporting a single number

• Confusing Percentile RANKS with Percentages:

  ◦ PR = relative ranking out of 100

  ◦ Percentage = percentage of items correct
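The sketch below (with illustrative numbers, not figures from the talk) shows the classical test theory arithmetic behind a confidence interval: SEM = SD × √(1 − reliability), and a 95% band is roughly the observed score ± 1.96 SEM.

```python
import math

def confidence_interval(observed, sd=15, reliability=0.90, z=1.96):
    """95% band around an observed standard score, via classical test theory."""
    sem = sd * math.sqrt(1 - reliability)   # standard error of measurement
    return observed - z * sem, observed + z * sem

low, high = confidence_interval(observed=85)
print(f"An SS of 85 is best reported as a band: roughly {low:.0f} to {high:.0f}")
# -> roughly 76 to 94; the single number hides real measurement error
```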

Age equivalents are developed by finding the average test score (the mean) for a group of children of a certain age taking the test; an AE is not the same as a measure of skills.

Grade equivalents are developed the same way, by finding the average test score (the mean) for students in each grade.
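A toy sketch (all numbers invented) shows how mechanical the lookup is: the age equivalent is just the age whose group-average raw score is closest to the child’s raw score, which says nothing about which skills were mastered.

```python
# Invented norm-table means: average raw score for each age group.
mean_raw_by_age = {7: 18, 8: 24, 9: 30, 10: 35}

def age_equivalent(raw_score):
    """Return the age whose group-average raw score is closest."""
    return min(mean_raw_by_age, key=lambda age: abs(mean_raw_by_age[age] - raw_score))

# A 10-year-old scoring 24 earns an AE of 8, but that only means 24
# matched the *average* 8-year-old's total; it does not mean the child
# has mastered (or lacks) any particular grade-level skills.
print(age_equivalent(24))  # -> 8
```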

• Commonly used

• Misleading

• Misunderstood

• Difficult to explain

• May have little relevance

• Avoid in favor of Standard Scores/%Ranks

“When assessed with teacher-made tests, Sally locates information within the text with 60% accuracy.”

VS.

“Sally’s performance on the OLSAT falls at the 60th %ile rank.”

Are the student’s skills better developed in one part of a domain than another?

For example:

“While Susan’s Broad Math score was within the low average range, she performed at an average level on a subtest that assesses problem solving, but scored well below average on a subtest that assesses basic math calculation.”

Test scores don’t support the teacher’s report of a weakness?

◦ First, look at differences in task demands in the testing situation and in the classroom when hypothesizing a reason for the difference.

◦ Look at the student’s Proficiency score (RPI) vs. Standard Score (SS)

“Although weaknesses in mathematics were noted as a concern by Billy’s teacher, Billy scored in the average range on assessments of math skills. These tests required Billy to perform calculations and to solve word problems that were read aloud to him. It was noted he often paused for 10 seconds or more before starting paper and pencil tasks in mathematics.”

“Billy’s teacher stated that he does well in spelling. However, he scored well below average on a subtest of spelling skills. Billy appeared to be bored while taking the spelling test, so a lack of vigilance in his effort may have depressed his score. Also, the school spelling tests use words he has been practicing for a week. The lower score may indicate that he is maintaining the correct spelling of practiced words in long-term memory, but is not able to correctly encode new words he has not had time to study.”

1. WHAT KIND OF TEST IS THIS? (e.g. norm- or criterion-referenced)

2. What is it used for – the purpose?

3. Is it valid for the stated purpose (what it measures or doesn’t measure)?

4. Is the person administering the test a qualified administrator?

5. Are the results valid (test conditions optimal, etc.)?

The good news: We are moving away from the old “Test and Place” mentality.

The challenge: School teams are using more comprehensive data sets, which require more knowledge to interpret.

More good news: The best decisions are made using multiple sources of good information.

“The true utility of assessment is the extent to which it enables us to find the match between the student and an intervention that is effective in getting him or her on track to reach a meaningful and important goal. The true validity of any assessment tool should be evaluated by the impact it has on student outcomes.”

(Cummings/McKenna 2007)

“…one of the problems of writing about intelligence is how to remind readers often enough how little an IQ score tells you about whether or not the human being next to you is someone whom you will admire or cherish.”

(Herrnstein and Murray)
