Norms and Criteria

According to Pythagorean tradition, the circle represents the spiritual realm and the square represents material existence. So the ideal human body represents the marriage of matter and spirit, reflected in its geometric proportions. A first norm? A first criterion?

Topics
- Basic issues or decisions to be made
  - Should we “adjust” the raw score before proceeding?
  - Norm or criterion interpretation . . . or both?
- Types of norms
  - Percentiles
  - Standard scores
  - Developmental norms
- Norm groups
- Criterion-referencing and performance standards
- Dynamic assessment and self-referencing on repeated measures

“Adjusting” the Raw Score
We have already noted that the most immediate result from an assessment or test is the raw score. Sometimes, before we proceed to discuss the meaning of the score (i.e., interpret it from either a norm or criterion perspective), the raw score is adjusted. Usually this is done by researchers, not classroom teachers.
Two special considerations:
- Correction for guessing
  - Only for selected-response items
  - Use has faded
- Factoring in item difficulty
  - Students get a higher “Theta” score based on doing well on the more difficult items of a test. In fact, you may be looking at a test score report given in percentiles or standard scores and not realize you are looking at a transformation from a Theta score rather than the traditional raw score.

Interpreting Student Performance: Norms or Criteria . . .
Intelligent interpretation of student performance is crucial for the use of educational assessment information. We have been building toward this with previous discussions of:
- Building / choosing good tests
- Determining reliability
- Determining validity
So now we are set to explore some methods of interpretation. These methods fall into two basic categories or approaches:
- Norm-referenced: compare this student with others.
- Criterion-referenced: compare this student with some judgment regarding expected performance level, irrespective of others.
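The two approaches above can be sketched in a few lines of code. This is only an illustration under stated assumptions: the function names, the norm-group scores, and the cut score of 30 are all hypothetical and do not come from any real test.

```python
def percent_at_or_below(score, norm_scores):
    """Norm-referenced view: percent of the norm group scoring at or below `score`."""
    at_or_below = sum(1 for s in norm_scores if s <= score)
    return 100.0 * at_or_below / len(norm_scores)


def meets_standard(score, cut_score):
    """Criterion-referenced view: compare with a judged cut score, ignoring other students."""
    return score >= cut_score


# Hypothetical raw scores for a small norm group.
norm_group = [12, 15, 18, 20, 22, 25, 27, 30, 33, 35]
student = 27

print(percent_at_or_below(student, norm_group))  # 70.0
print(meets_standard(student, cut_score=30))     # False
```

The same raw score reads quite differently under the two views: this student outscores most of the comparison group yet falls short of the judged standard.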

Percentiles and Percentile Rank
- Definition: % of cases “at or below” a given score
- As we noted earlier, the two terms differ conceptually; in practice, however, they are often used interchangeably.
- Strengths
  - Easy to describe
  - Easy to compute
- Weaknesses
  - Confusion with a “percentage-right score”
  - Inequality of units (see next slide)

[Illustration: the inequality of units in percentiles]

Norm-Referenced Systems: Transforming Scores
Remember, the z score tells how many standard deviation units a score is away from the mean. I can take any set of scores I want to “norm” (i.e., make judgments by comparing scores to each other) and create the z distribution. But z scores are hard for lay people to interpret (a range from -3 to +3 has little meaning to them). So, how about if I transform them! Zowee, Batman!
You will hear people call these transformed scores many names, such as “standard scores”, “norms”, “normed scores”, and “scaled scores”. Double Zowee!
Becoming a standard score
- Definition: conversion from a z-score into a system with a “nice, arbitrarily chosen” M and SD (see the illustration of the conversion process on the next slide)

[Illustration: the conversion from raw score to standard score]

Cherchez la femme . . . Look for the woman, er . . . table
In ordinary practice, you simply use a table in the test manual to convert a raw score to a standard score. So it is important to understand how the conversion works, but you would likely never do it yourself (unless you developed your own test, e.g., the “McEwing Test of Procrastination”, which I will someday get around to constructing).

Standard Scores: really a family of scores . . . some examples
- Intelligence . . . the IQ score: one of the most widely implemented, controversial, and misunderstood test scores ever used and abused.
- Historically, schools (and the public) used the ratio IQ: IQ = (MA / CA) × 100, where MA is mental age and CA is chronological age.
- Today, we use the deviation IQ (most appropriately called the “school ability index”); M = 100 and SD = 15 or 16.
- The “father” of the IQ test was Alfred Binet. Binet developed the test at the request of a national commission that wanted to identify students in need of help in coping with the school curriculum (see next slide).

Alfred Binet (1857-1911): self-taught French psychologist
In 1905 Binet had children do tasks such as following commands, copying patterns, naming objects, and putting things in order. He gave the test to Paris schoolchildren and created a standard “intelligence scale” based on his data. For example, a 6-year-old child who passed all the tasks usually passed by 6-year-olds (but no tasks beyond) would have a mental age that exactly matched his chronological age, 6.0.
In accordance with the commission’s charge, he reasoned that students testing below age level should be given help to achieve at levels more like their age peers.
Binet stressed that intellectual development progressed at variable rates and could be affected by the environment (and was therefore not based solely on genetics). He also argued that intelligence was malleable rather than fixed and that IQ testing could only be used on children with comparable backgrounds.
Along with collaborator Théodore Simon, Binet published revisions of his intelligence scale in 1908 and 1911, the last appearing just before his death.

Lewis M. Terman (1877-1956): school principal; college professor at Stanford
Terman admired Binet’s work. During World War I, Terman served in the United States Army conducting psychological tests. He and his students developed the Alpha and Beta tests, which were used to allocate soldiers to the most appropriate areas of military service.
Terman also adopted William Stern's suggestion to multiply the mental age / chronological age ratio by 100 (to get rid of the decimal) and to call the resulting score an intelligence quotient, or IQ. Today we usually refer to this approach to intelligence as the ratio IQ.
In keeping with his army experiences, when Terman moved to testing classroom children, he proposed using his “Stanford-Binet IQ Test” to classify children and put them on the appropriate job track. Terman believed IQ was inherited and was the strongest predictor of one's ultimate success in life.
By the way, Terman “claimed” that he himself had an IQ of 180 . . .

Terman the Researcher
Terman administered IQ tests, written in English, to Spanish-speakers and non-schooled African-Americans. From his research he concluded:
“High-grade or border-line deficiency . . . is very, very common among Spanish-Indian and Mexican families of the Southwest and also among negroes. Their dullness seems to be racial, or at least inherent in the family stocks from which they come . . . . Children of this group should be segregated into separate classes . . . . They cannot master abstractions but they can often be made into efficient workers . . . from a eugenic point of view they constitute a grave problem because of their unusually prolific breeding.” (The Measurement of Intelligence, 1916, pp. 91-92)

Part of the stated goals of the Stanford-Binet IQ Test
Use of the Stanford-Binet scale in American schools would (according to Chapter I of the test manual itself) “allow for the scientific diagnosis and classification of children to be placed in special classes; bring tens of thousands of high-grade defectives under the surveillance and protection of society; reduce delinquency; help the schools respond to children of superior intelligence; assist in assigning children to school grades; and help determine vocational fitness . . . .” (White, 2000)
Next: a table related to the deviation IQ is on the next slide.
What do you notice?

Deviation IQ Reference Chart

Intelligence Level             % of Pop. Under Level   Point Value (15SD)   Point Value (16SD)
Idiot                          ~0.0000001%             <10                  <4
Profound Moron                 ~0.000001%              <16                  <10
Exceptional Moron              ~0.00001%               <22                  <17
Moron                          ~0.0001%                <29                  <24
Extremely Retarded             ~0.001%                 <36                  <32
Highly Retarded                ~0.01%                  <44                  <40
Retarded                       ~0.1%                   <54                  <50
Significantly Below Average    ~1%                     <65                  <63
Below Average                  ~10%                    <81                  <79
Average                        ~50%                    ~100                 ~100
Above Average                  ~90%                    >119                 >121
Significantly Above Average    ~99%                    >135                 >137
Gifted                         ~99.9%                  >146                 >150
Highly Gifted                  ~99.99%                 >156                 >160
Extremely Gifted               ~99.999%                >164                 >168
Genius                         ~99.9999%               >171                 >176
Exceptional Genius             ~99.99999%              >178                 >183
Profound Genius                ~99.999999%             >184                 >190
Savant                         ~99.9999999%            >190                 >196

Sources:
Wechsler, D. (1944). The Measurement of Adult Intelligence. Baltimore: The Williams & Wilkins Company.
Reber, A.S. (1995). The Penguin Dictionary of Psychology, 2nd ed. Toronto: Penguin Books.
I.Q. Basics, I.Q. Comparison Site.

Got Vygotsky? Self-taught Russian psychologist (1896-1934)
Lev Vygotsky graduated with a law degree from Moscow University. After graduation, he taught literature in secondary school and psychology at a teachers’ college. While Vygotsky had no formal training in psychology, ideas related to developmental psychology fascinated him.
Vygotsky’s thoughts were influenced by Marxist theorists. Marxists believe that one can only understand individuals in the context of their social-historical environment. Similarly, mental abilities and processes were viewed in terms of the historical sequence of events that produced them.
Upon his death from tuberculosis, his ideas were repudiated by the Soviet government, which banned his work because he had done some research with intelligence tests (intelligence tests were condemned by the Communist Party). Vygotsky was actually criticizing the tests when he used them in his research, but this point was lost on the government officials.
When the Cold War ended, Vygotsky's works were opened to the West.

Vygotsky and IQ: IQ is culturally inherited . . . not genetically inherited . . .
Rather than seeing intelligence as much the same across cultures, Vygotsky saw intellectual abilities as being much more specific to the culture (think “family, community, nation”) in which the child was reared (Vasta, R., Haith, M.M., Miller, S.A., 1995).
Culture makes two sorts of contributions to the child’s intellectual development. First, children acquire much of the content of their thinking (e.g., knowledge) from it. Second, children acquire the processes or means of their thinking (e.g., tools of intellectual adaptation) from the surrounding culture. Therefore, culture provides the child with the means to decide both “what” to think and “how” to think.
Vygotsky elaborates this “culture as intelligence” idea as follows: “Every function in the child’s cultural development appears twice: first, between people (interpsychological) and then inside the child (intra-psychological). All the higher functions originate as actual relationships between individuals.” (Vygotsky, 1978)
One might conclude that the “richer” the personal interactions, the “richer” the mind of the person. We will come back to this idea later; for now, let us look at more examples of standard scores.

More Standard Scores of Interest . . .
- T-scores, SATs, GREs
- NCEs (Normal Curve Equivalents)
  - Recall that the percentile rank scale is not an equal-interval scale; NCEs solve this problem by converting percentile ranks to an equal-interval scale.
  - NCEs range from 1 to 99 with a mean of 50.
  - The major advantage of NCEs over percentile ranks is that NCEs can be averaged.
  - Used almost exclusively to meet federal reporting requirements for achievement testing.
- Stanines
  - Widely used in schools, so we will look at them in more detail in the next slide.

More on Stanines: contraction of “standard nine” . . .
Stanines divide the normal distribution into 9 units, each of which covers the same length along the base of the normal curve (except the two units that cover the tails). Stanines have M = 5 and SD = 2 and range from 1 (lowest) to 9 (highest).
Stanines can be used to convert any test score into a single-digit number. This was valuable when paper punch cards were the standard method of storing this kind of information. However, because all stanines are integers, two scores in a single stanine are sometimes further apart than two scores in adjacent stanines. This reduces their value.
Stanine scores are useful in comparing a student's performance across different content areas. For example, a 6 in Mathematics and an 8 in Reading generally indicate a meaningful difference in a student's learning for the two respective content areas.
While stanine scores are good at signifying broad differences in performance, they should be used cautiously when making any finer distinctions about performance.

Stanines Defined Descriptively: NOT RECOMMENDED
Stanines facilitate using words rather than numbers in presenting statistical data. Most people like words, but this practice is arbitrary and less accurate: “Bill tested considerably below average.”

Pros & Cons of Standard Scores
- Strengths
  - Wide applicability
  - Nice statistical properties
- Weaknesses
  - May be hard to explain to laypersons
  - Need to know M and SD of original test
- Note: teachers often build their narrative reports on these standard scores using the “accepted descriptive words” rather than the numbers.

Developmental Norms: another area of real and potential misuse . . .
- Main examples:
  - Grade equivalents: 4.5 = fourth grade, fifth month
  - Mental ages (age equivalents): 5.10 = fifth year, tenth month
- Others: stage theories (Piaget), physical measures (height in relation to age)
- Strengths
  - “Natural” interpretation (is this really a strength?)
  - Looks at the multi-level growth parents / teachers want
- Weaknesses
  - Limited to growth functions
  - Commonly misused (see next slide)

According to Margaret J. Kay, Ed.D., Psychologist:
The practice of using grade equivalency scores to identify learning disabled children in educational reports and IEPs is widespread and misleading.
The normative data for most tests are usually collected at one point every year. How, then, are grade equivalents obtained for every month? They are extrapolated at the upper and lower ends of the growth curve. This estimation produces scores that are systematically too low in the Fall and too high in the Spring.
Problems associated with this practice are:
- A high probability of over-identifying learning disabled children exists if screening is conducted in the Fall.
- A high probability of under-identifying learning disabled children occurs if screening is conducted in the Spring.

Norm Groups: to whom are my students being compared . . .
Look for a detailed description in the test manual to ascertain the norming group. It might be one or a combination of the following:
- Users (all previous test takers, e.g., ACT)
- Subgroup (e.g., ACT scores achieved by men)
- Local (students in the district)
- Institutional (State)
- National
- International

Example: National vs. Local Norms
Sally’s score (the x below) is at the 55th percentile when compared to national test takers, but her score is at the 45th percentile compared to local test takers.

Usefulness of Standardization Group
To what extent do the norms provide a meaningful framework? Two issues:
- Stability: usually not a problem because the norms are developed based on so many cases.
- Representativeness: compare data on the norm group with data on the target group.
  - Typical variables for comparison: age, gender, ability, education, geographic region, city size, racial/ethnic group, socioeconomic status

Criterion-referenced
- “Criterion-referenced” refers to the nature of the interpretation, not the nature of the test.
- Requires a well-defined content domain.
- Often more complicated than it first sounds.
- Often uses “rubrics”: guides for defining performance levels.
- Ohio likes to use the term “Performance Standards”.

Performance Standards
- Outgrowths of standards-based education
- Common terms: advanced, proficient, basic
- Each division requires a cut-score
- Cut-scores are determined
  - By groups of people
  - Using one of several different methods
  - Basically by judgments
- Ohio uses the term “benchmark”: the specific component of the knowledge or skill identified by an academic content, performance, or operational standard.

Self-Referencing on Repeated Measures . . . some call this dynamic assessment . . . has elements of both norm & criterion
Dynamic assessment is an interactive approach that embeds intervention within the assessment procedure. Dynamic assessment is a product of research by developmental psychologist Lev Vygotsky.
- Main features:
  - Improved task performance becomes the criterion for the student; her/his own past performance constitutes the norm
  - Simple counts, brief tasks, repeated frequently, results graphed
- Has many potential uses:
  - Documenting Special Education student progress
  - Assessing Basic Skill progress
  - Monitoring School Attitude changes
- Also known as CBA: “The term curriculum-based assessment means simply measurement that uses direct observation and recording of a student's performance in the local curriculum as a basis for gathering information to make instructional decisions” (Deno, 1987).
- And also known as CBM (see next slide)

Curriculum-Based Measurement: Student Progress Monitoring . . .
Curriculum-Based Measurement (CBM) is a method teachers use to find out how students are progressing in basic academic areas while there is still time to intervene. CBM can be helpful to teachers and students because it provides current, week-by-week information on academic progress.
The teacher using CBM finds out how well a child is progressing in learning the content for the academic year in time to modify his/her instructional strategies. If a student’s performance is not meeting expectations, the teacher then changes the way of teaching to try to find the type and amount of instruction this particular student needs to make sufficient progress toward meeting the academic goals.
This assessment approach allows the student to see immediate progress and may be more motivational than “punitive” tests and quizzes. This powerful assessment approach can also be shared with parents to document their child’s progress. See next slide for an example progress chart.

[Figure: Kim's Progress in Words Read per Minute. Vertical axis: Words/Min (0 to 40). Horizontal axis: Day (1 to 28). Label: Intervention.]

Closing Thoughts on . . . Dynamic Assessment & Vygotsky
If we accept Vygotsky’s view of intellectual development, we might conclude that it is, in fact, learning that leads to intellectual development (as opposed to the other way around).
In Vygotsky’s view, the standard IQ test only indicates what a child can achieve on his/her own. He calls this the ‘level of actual development.’ While such a measure is undoubtedly important, it is also incomplete. Given appropriate help from an adult, children can increase their thinking ability. What the child can achieve with this outside help is referred to as the ‘level of potential development.’ (Vasta, R., Haith, M.M., Miller, S.A., 1995)
As educators, are we not interested in increasing this potential rather than labelling and sorting children based on IQ scores?

Practical Advice
1. Understand relations among types of norms.
2. Be cautious about IQ scores & grade equivalents.
3. Know the nature of the norm group(s).
4. Know what process was used to develop the performance standards (e.g., benchmarks) in a criterion-referenced test.
5. Consider using dynamic assessment as part of your assessment repertoire.
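
As a companion to the first piece of advice, the relations among the norm types discussed in these slides can be sketched in code. This is a minimal illustration, not how any published test actually builds its tables. The function names and the example test (M = 40, SD = 8) are hypothetical. The percentile rank step assumes normally distributed scores. The NCE standard deviation of 21.06 is the conventional value and does not appear in the slides. The stanine rule uses half-SD bands, which follows from M = 5 and SD = 2.

```python
import math
from statistics import NormalDist


def z_score(raw, mean, sd):
    """How many standard deviation units a raw score lies from the mean."""
    return (raw - mean) / sd


def standard_score(z, new_mean, new_sd):
    """Re-express a z-score in a system with a chosen M and SD."""
    return new_mean + new_sd * z


def stanine(z):
    """Stanine (M = 5, SD = 2): half-SD bands, with the tails folded into 1 and 9."""
    band = math.floor(2 * z + 5.5)
    return max(1, min(9, band))


def percentile_rank(z):
    """Percent of cases at or below z, ASSUMING normally distributed scores."""
    return 100 * NormalDist().cdf(z)


# Hypothetical test: M = 40, SD = 8; the student's raw score is 48.
z = z_score(48, mean=40, sd=8)                 # 1.0
print(standard_score(z, 50, 10))               # T-score: 60.0
print(standard_score(z, 100, 15))              # deviation IQ (15 SD): 115.0
print(round(standard_score(z, 50, 21.06), 2))  # NCE (assumed SD 21.06): 71.06
print(stanine(z))                              # 7
print(round(percentile_rank(z), 1))            # 84.1
```

Every one of these scores is the same z-score in different clothing, which is why a test manual can offer them all from a single conversion table.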