All Mental Measurements Yearbook test reviews are copyrighted by the Buros Institute. Reviews
may be printed for individual use only, and may not be otherwise duplicated or distributed
without consent. Information on citations of this test review may be viewed on the Buros website
under FAQ.
[18083510]
KeyMath-3 Diagnostic Assessment.
Purpose: Designed to assess understanding and applications of mathematics concepts and skills.
Population: Ages 4-6 to 21-11.
Publication Dates: 1971-2007.
Acronym: KeyMath-3 DA.
Scores, 14: Basic Concepts (Numeration, Algebra, Geometry, Measurement, Data Analysis and
Probability, Total), Operations (Mental Computation and Estimation, Addition and Subtraction,
Multiplication and Division, Total), Applications (Foundations of Problem Solving, Applied
Problem Solving, Total), Total.
Administration: Individual.
Forms, 2: A, B.
Price Data, 2008: $699 per complete kit including manual (2007, 371 pages), Form A and Form
B test easels, 25 Form A & Form B record forms, and carrying bag; $399 per single form
(specify form) kit including manual, test easels, 25 record forms, and carrying bag; $76 per 25
record forms (specify form); $259 per KeyMath-3 ASSIST scoring software.
Time: (30-40) minutes for Grades PK-2; (75-90) minutes for Grades 3 and up.
Author: Austin J. Connolly.
Publisher: Pearson.
Cross References: For reviews by G. Gage Kingsbury and James A. Wollack of a previous
edition, see 14:194; see also T5:139 (15 references) and T4:1355 (5 references); for reviews by
Michael D. Beck and Carmen J. Finley of an earlier edition, see 11:191 (26 references); see also
T3:1250 (12 references); for an excerpted review by Alex Bannatyne of an earlier edition, see
8:305 (10 references).
Review of the KeyMath-3 Diagnostic Assessment by THERESA GRAHAM, Adjunct Faculty,
University of Nebraska-Lincoln, Lincoln, NE:
DESCRIPTION. The KeyMath-3 Diagnostic Assessment (herein referred to as KeyMath-3 DA)
is an untimed, norm-referenced test that provides a comprehensive assessment of key
mathematical concepts and skills for individuals ranging in age from 4 years, 6 months through
21 years. The KeyMath-3 DA is composed of two parallel forms (Form A and Form B) with 10
subtests. The subtests, based on National Council of Teachers of Mathematics (NCTM)
Principles and Standards for School Mathematics (NCTM, 2000), include the following areas:
Basic Concepts (Numeration, Algebra, Geometry, Measurement, Data Analysis and Probability),
Operations (Mental Computation and Estimation, Addition and Subtraction, Multiplication and
Division), and Applications (Foundations of Problem Solving, Applied Problem Solving).
The KeyMath-3 DA was revised to update test items to correspond with the NCTM standards
and to create parallel testing forms (Form A and B), which allow educators to monitor the
progress of an individual by administering the forms in alternation. In addition, the KeyMath-3 DA was
designed to provide a link to the KeyMath-3 Essential Resources (KeyMath-3 ER). KeyMath-3
ER is an instructional program that includes lessons and activities directly related to the 10
subtests included in the KeyMath-3 DA. Finally, the KeyMath-3 DA provides updated normative
and interpretive data.
ADMINISTRATION AND SCORING. The materials for the KeyMath-3 DA consist of two
easels for each form, a manual, and record forms for both forms. The easels are self-standing and
very easy to use, including tabs for the different subtests and instructions on start points and
establishing basal and ceiling. The KeyMath-3 DA manual describes the general testing
guidelines and scoring information. Answers and score summaries, including raw score, scale
score, standard score, confidence interval, grade/age equivalent, and percentile rank, can be
recorded on the record form.
The Numeration subtest is always administered first with the start point determined by the grade
level. The start points for the other subtests are determined by the ceiling item on the Numeration
subtest. Stop points are determined by the ceiling set and ceiling item.
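To make the start- and stop-point logic concrete, the following is a minimal Python sketch of ceiling scanning as it works in individually administered, item-ordered tests of this kind. The specific rule used here (a ceiling after four consecutive errors) is an illustrative assumption only; the actual KeyMath-3 DA basal and ceiling rules are those printed on the easels and in the manual.

```python
# Illustrative sketch of ceiling-scanning logic for an item-ordered subtest.
# The 4-consecutive-errors rule below is an assumption for illustration;
# the KeyMath-3 DA manual and easels define the actual basal/ceiling rules.

def ceiling_item(responses, ceiling_run=4):
    """responses: ordered list of (item_number, score) pairs, score 1 or 0.
    Returns the item at which testing stops: the last item of the first run
    of `ceiling_run` consecutive errors, or the last item administered."""
    errors_in_a_row = 0
    for item_number, score in responses:
        errors_in_a_row = errors_in_a_row + 1 if score == 0 else 0
        if errors_in_a_row == ceiling_run:
            return item_number
    return responses[-1][0]

# Example: four consecutive errors on items 7-10 stop the subtest at item 10.
resp = [(5, 1), (6, 1), (7, 0), (8, 0), (9, 0), (10, 0)]
print(ceiling_item(resp))  # 10
```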
It is generally recommended that all subtests be administered in the order that they are presented.
However, the test author notes that there may be reasonable circumstances in which only a specific
area (e.g., the Basic Concepts area) would be administered. According to the manual, average test times
for the inventory range from 30-40 minutes for younger examinees to 75-90 minutes for older
examinees. Both of these estimates seem very conservative given the number of subtests and
breadth of items included. It may be that the estimates do not include rest breaks, which may be
necessary for examinees of any age. Test fatigue may become an issue for some examinees.
Because all of the correct responses are written on the test easel, scoring is easily done while the
test is being administered. A "1" is circled for correct responses, and a "0" is circled for incorrect
responses. Detailed scoring rules are provided in the manual.
A subtest raw score is determined by subtracting the total number of incorrect items from the
highest ceiling item. Appendices in the testing manual provide all of the normative and
interpretive tables for both testing forms to convert a raw score to a standard score and to assess
score profile, subtest comparison, and functional range. Additional tables are provided to convert
a raw score to a growth scale value (GSV) using the optional ASSIST scoring and reporting
software. Finally, because the KeyMath-3 DA was developed simultaneously with the KeyMath-3
ER, results on the KeyMath-3 DA can be linked to instructional programs included in the
KeyMath-3 ER. Although all of the scoring can be done easily by hand, examiners may want to
purchase the ASSIST scoring and reporting software, especially if they plan to follow examinee
progress over time or if they plan to make an instructional plan using the KeyMath-3 ER. The
software reportedly provides progress reports and graphical displays of performance over time.
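The raw-score rule described above reduces to a one-line computation; here is a minimal sketch (the example numbers are hypothetical):

```python
def subtest_raw_score(ceiling_item_number, total_errors):
    """Raw-score rule as described in the review: subtract the total number
    of incorrect items from the highest ceiling item. Items below the start
    point are thereby credited as correct."""
    return ceiling_item_number - total_errors

# Hypothetical example: ceiling reached at item 43 with 6 errors -> raw score 37.
print(subtest_raw_score(43, 6))  # 37
```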
DEVELOPMENT. The development of the KeyMath-3 DA consisted of surveying relevant
professionals regarding the content, administration, and use. In response to survey results, five
NCTM content standards (Number and Operations, Algebra, Geometry, Measurement, Data
Analysis and Probability) and five process standards (Problem Solving, Reasoning and Proof,
Communication, Connections, and Representation) were used to frame the 10 subtests and to
generate about 550 new items. These items were initially piloted on a group of 250 students in
Grades 1 through 8. From the pilot, many items were modified, the number of extended response
items was scaled back, and start points and sequencing were determined.
Prior to the standardization studies, two tryout studies were conducted to pretest the items for
item difficulty, grade-level performance, and gender/ethnicity bias. In the first tryout study, there
were 1,238 participants ranging from prekindergarten through ninth grade. Although the author
notes that the "samples were also controlled by race/ethnicity" (manual, p. 45), no information
was given regarding socioeconomic status. Moreover, the sample did not include participants in
grades higher than ninth grade. In the administration, the subtests were divided into two forms to
reduce testing time.
Item difficulty, item fit, and item discrimination were determined using Rasch analyses. In
addition, item distractors for the multiple-choice items were evaluated. Finally, a panel of
reviewers representing different cultural and geographical backgrounds reviewed the test for
cultural sensitivity. Based on the results of the first tryout study, the test author states that some
items were added, modified, or dropped on the basis of the analyses and reviews performed. However,
examples of specific changes and the exact number of changes are not provided.
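For readers unfamiliar with the Rasch analyses mentioned here, the model places examinee ability and item difficulty on a common logit scale, with the probability of a correct response depending only on their difference. The sketch below is a generic illustration of the model, not the publisher's implementation:

```python
import math

def rasch_p(theta, b):
    """Rasch (one-parameter logistic) model: probability of a correct
    response for ability theta on an item of difficulty b (both in logits)."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# An examinee whose ability matches the item difficulty has a 50% chance:
print(rasch_p(0.0, 0.0))             # 0.5
# An easier item (b = -1) is answered correctly more often:
print(round(rasch_p(0.0, -1.0), 3))  # 0.731

# Item "fit" is commonly judged by comparing observed responses against these
# model probabilities (e.g., residuals aggregated into infit/outfit statistics);
# misfitting items become candidates for revision or removal.
```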
A second tryout study was conducted with items from only two of the subtests (Foundations of
Problem Solving and Applied Problem Solving) because the test author noted that many of the
items in these subtests had been modified as a result of the first tryout study. In this study, 1,196
individuals were included (again ranging from prekindergarten through ninth grade). A separate
smaller tryout study (N = 199, kindergarten through second grade children) was performed to
assess the wording of items in the Mental Computation and Estimation subtest. Although the test
author states that changes were made as a result of these studies, specific results are not
provided.
For the standardization study, two forms (Form A and Form B) were developed, each consisting
of 444 items. A few of the items were the same between the two forms (the exact number is not
noted). The rest of the items were similar in content but differed slightly in specific numbers
used. The example given is an item from the Mental Computation and Estimation subtest, in which
only the numbers in a simple subtraction problem were changed.
The standardization study consisted of two administrations, spring and fall. In the spring
administration, about half of the sample took Form A and 280 examinees took both Forms A and
B. Rasch analyses assessed item difficulty and item fit. In addition, the parallel forms were
compared for similarity in difficulty. As a result, 74 items were dropped after the spring
administration either for poor fit (19 items), difficulty difference between the two forms (14
items), or reduction of item concentration at the higher end of the assessment (41 items),
resulting in 372 items. No mention was made as to how many (if any) of the remaining items
were altered.
TECHNICAL.
Standardization. The sample used to standardize the KeyMath-3 DA consisted of 3,630
individuals ranging in age from 4 years, 6 months through 21 years (1,565 in the spring
administration; 1,540 in the fall administration). The author's goal was to test at least 110
individuals in each grade level in each season. The sample was recruited to match the U.S.
distributions of ethnicity, mother's educational level, geographic region, and special education
status. Tables are provided in the manual with information regarding the distribution of the sample
in terms of ethnicity, gender, education level, and geographic region.
Generally, examinees were given either Form A or Form B. A sample of 280 examinees was administered
both forms. Because Form A and Form B were not equivalent, raw scores could not be pooled
and these data were considered to be separate normative data. Raw scores for examinees who
took both forms were converted to w-ability scores via a joint Rasch calibration of the two
forms. This procedure allows standard scores between the two forms to be compared.
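The linking logic can be sketched briefly. A joint Rasch calibration puts Form A and Form B ability estimates on one logit scale, and a linear transformation then rescales logits to the reporting metric. Because the review does not give the actual w-ability constants, the slope and center below are placeholders:

```python
# Sketch of rescaling jointly calibrated Rasch abilities (logits) to a
# reporting scale. SLOPE and CENTER are illustrative placeholders; the
# actual constants of the KeyMath-3 w-ability scale are not given in the review.
SLOPE, CENTER = 9.1024, 500.0

def w_ability(theta_logits):
    return SLOPE * theta_logits + CENTER

# Because Forms A and B were calibrated jointly (linked through the 280
# examinees who took both), a Form A theta and a Form B theta sit on the
# same scale, so their w-ability scores are directly comparable:
print(round(w_ability(-1.2), 1), round(w_ability(0.8), 1))
```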
Reliability. Reliability was assessed in terms of internal consistency, test-retest reliability, and
alternate-form reliability. Internal consistency was determined using the split-half method with
appropriate adjustments made because examinees at the same grade or age may take different
sets of test items. Reliability coefficients are presented for the different age groups and grades for
fall and spring. Coefficients ranged from .86 to .99 for the total test scores depending on age
group. The data from each form (A and B) were pooled to provide an estimate of the population
variance for each age and grade. The pooled variance was used to adjust the reliability for each
form to better approximate the reliability of the population.
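As a generic illustration of the split-half method (without the author's pooled-variance adjustments), half-test scores are correlated across examinees and the half-test correlation is stepped up with the Spearman-Brown formula; the toy data below are invented:

```python
from statistics import correlation  # Python 3.10+

def spearman_brown(r_half):
    """Step a half-test correlation up to full-length reliability."""
    return 2 * r_half / (1 + r_half)

def split_half_reliability(item_scores):
    """item_scores: list of per-examinee lists of 0/1 item scores.
    Splits items into odd/even halves and applies Spearman-Brown."""
    odd  = [sum(row[0::2]) for row in item_scores]
    even = [sum(row[1::2]) for row in item_scores]
    return spearman_brown(correlation(odd, even))

# Toy data: 5 examinees x 6 items.
data = [
    [1, 1, 1, 1, 1, 0],
    [1, 1, 1, 0, 1, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 0, 0],
]
print(round(split_half_reliability(data), 2))  # 0.4
```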
Alternate-form reliability was assessed by counterbalancing Form A and Form B for a subset of
the standardized sample (N = 280, see above). The median alternate-form reliabilities are .82 and
.85, suggesting that similar scores would be obtained whether Form A or Form B is taken. Finally, test-retest
reliability was assessed with a group of 103 examinees (ranging from Pre-K to Grade 12 and
divided into two grade ranges). However, little information is given regarding the exact
distribution of ages of the examinees included. The retest interval ranged from 6 to 28 days.
The median subtest test-retest reliability coefficients ranged from .86 to .88, demonstrating high
stability over time.
Validity. Content validity was established in several ways. First, the KeyMath-3 DA was created
using the NCTM principles and standards, utilizing NCTM materials. In addition, over 400
educators and professional consultants provided feedback regarding the content of the
assessment.
Construct validity was assessed by demonstrating that raw scores on the KeyMath-3 DA increase
rapidly among the younger ages and begin to plateau in high school. However, raw scores
increased by an average of only 2 points in the high school grades, a finding that may suggest the test
does not adequately measure higher mathematical knowledge as taught in the high school grades.
KeyMath-3 DA was shown to be related to a number of different measures of mathematical
achievement, including the KeyMath Revised, Normative Update: A Diagnostic Inventory of
Essential Mathematics (KeyMath-R/NU; Connolly, 1998), Kaufman Test of Educational
Achievement, Second Edition (KTEA-II; Kaufman & Kaufman, 2004), the Iowa Tests of Basic
Skills (ITBS; Hoover, Dunbar, & Frisbie, 2001), the Measures of Academic Progress (MAP;
Northwest Evaluation Association, 2004), and the Group Mathematics Assessment and
Diagnostic Evaluation (GMADE; Williams, 2004). For both the KeyMath-R/NU and the KTEA-II,
administration was counterbalanced with an average test interval of 9-11 days. For the other
inventories, data were gathered from school records with an average test interval of 31-44 days.
Tables summarizing the demographics of the study sample used are provided in the test manual.
The correlation of the total test scores on the KeyMath-3 DA and the KeyMath-R/NU averaged
.92, with subtest correlations ranging from the low .70s to the low .90s. However, average
standard scores from these two measures differed such that the scores should not be considered
interchangeable. Performance on the KeyMath-3 DA was moderately to
highly correlated with the remaining measures of mathematics achievement.
Finally, analyses were performed to assess whether performance on the KeyMath-3 DA
distinguished between examinees representative of special populations, including examinees
identified as gifted, diagnosed with ADHD, diagnosed with a Learning Disability (with math
alone, reading alone, and math and reading combined), and identified with a mild intellectual
disability. As expected, no differences were found between examinees identified with ADHD
and the general population. However, performance on the KeyMath-3
DA differed significantly among the rest of the special population groups and the general
population group. All three of these assessments of validity suggest that the KeyMath-3 DA is a
sufficiently valid measure of mathematics achievement.
COMMENTARY. The KeyMath-3 DA was designed to provide a comprehensive measurement
of mathematics achievement that mirrored the standards and principles outlined by the NCTM
among individuals in prekindergarten through high school. Indeed, the author did a good job of
relying on the NCTM framework and assessing not only procedural knowledge but also
problem-solving ability and conceptual knowledge. However, because the test takes a long
time to administer and must be administered individually, it may not be useful in a
general setting. It may be better suited for cases where a deficit is suspected. It may also be
useful if discrepant abilities are suspected in order to identify strengths and weaknesses. In
addition, the test seems to have a significant ceiling effect and may not be as useful for high
school students who are performing at or above age/grade level.
At the younger end of the continuum, the connection between the KeyMath-3 DA and the
KeyMath-3 ER is a valuable tool for educators who are looking for instructional programs that
address areas of deficit or build on areas of strength. The KeyMath-3 DA materials are easy to
use, and the manual instructions make it an easy instrument to administer, score, and interpret. It
is a reasonably reliable and valid tool to assess comprehensive understanding of mathematics and
can also be used to pinpoint and monitor special areas of mathematical knowledge. It may also
be useful for program evaluation.
REVIEWER'S REFERENCES
Connolly, A. J. (1998). KeyMath Revised, normative update manual. Circle Pines, MN: AGS
Publishing.
Hoover, H. D., Dunbar, S. B., & Frisbie, D. A. (2001). Iowa Tests of Basic Skills (ITBS). Rolling
Meadows, IL: Riverside Publishing.
Kaufman, A. S., & Kaufman, N. L. (2004). Kaufman Test of Educational Achievement, Second
Edition (KTEA-II). Circle Pines, MN: AGS Publishing.
National Council of Teachers of Mathematics. (2000). Principles and standards for school
mathematics. Reston, VA: Author.
Northwest Evaluation Association. (2004). Measures of Academic Progress (MAP). Lake
Oswego, OR: Author.
Williams, K. (2004). The Group Mathematics Assessment and Diagnostic Evaluation (GMADE).
Parsippany, NJ: Dale Seymour Publications.
Review of the KeyMath-3 Diagnostic Assessment by SUZANNE LANE, Professor of Research
Methodology, and DEBRA MOORE, Ph.D. Candidate, University of Pittsburgh, Pittsburgh, PA:
DESCRIPTION. The KeyMath-3 Diagnostic Assessment (KeyMath-3 DA) is a norm-referenced,
individually administered measure of mathematical concepts and skills for prekindergarten
through Grade 12 students (ages 4 years, 6 months through 21 years). The test's purpose is to
provide diagnostic information that can be used to tailor intervention programs for students and
to monitor student performance over time. The assessment is designed for use with the
KeyMath-3 Essential Resources (KeyMath-3 ER) instructional programs. The KeyMath-3 DA is
a revised version of the KeyMath Revised (KeyMath-R; 14:194). An algebra subtest has been
added, as well as other new content, so that the test is aligned with the standards outlined in the
Principles and Standards for School Mathematics (NCTM, 2000). There are also new normative
and interpretative data.
The test consists of two parallel forms (Form A and Form B), each with 372 items divided across
three content areas: Basic Concepts, Operations, and Applications. These three content areas are
divided into 10 subtests ranging from Numeration to Applied Problem Solving. Items are
presented one at a time on an easel. The test administrator reads the directions and the item and
then records the examinee's answer. Items within each subtest are arranged in order of increasing
difficulty. The first subtest, Numeration, has a starting point determined by grade level.
Successive subtests have starting points based on the ceiling item from the Numeration subtest.
The test is untimed, but the author estimates 30 to 40 minutes for elementary grades and 75 to 90
minutes for secondary grades.
After administration, the examiner can create score reports by hand or with the use of an optional
software package that creates the same score reports plus some additional reports. The reports
refer the examiner to tables in the technical manual that provide scale scores, standard scores,
percentile ranks, grade equivalents, and age equivalents, as well as five descriptive categories
referring to the level of student performance.
DEVELOPMENT. The Principles and Standards for School Mathematics (NCTM, 2000), state
math standards, and the KeyMath-R guided the revisions. A test blueprint that delineates the 10
subtests and the math topics and objectives within each subtest formed the basis of item
development. A pilot study was conducted in the spring of 2004 and the results were used to
modify items, order items by difficulty, and to determine the starting points for each grade for
the tryout.
A tryout was conducted during the fall and winter of the 2004-2005 (October-January) school
year. A random sampling procedure, stratified by grade and gender, was used in obtaining the
tryout sample of over 2,400 students in prekindergarten through Grade 9. A total of 496 items
were divided into subtests across two forms. The data were analyzed using the Rasch model. The
item difficulty distribution for each subtest was examined; items were added to fill gaps in
the distribution and removed from overly dense regions. No information,
however, is provided on the targeted item difficulty distribution and the extent to which the
targets were achieved. Analyses of item fit, differential item functioning, and reliability were
also conducted. Finally, fairness reviews were conducted and the results were used in decisions
regarding item modifications and deletions. A second tryout was conducted for two subtests that
had considerable modifications as a result of the first tryout.
TECHNICAL.
Standardization. A nationally representative sample of 3,630 individuals aged 4 years, 6 months
to 21 years, 11 months participated in the standardization of Forms A and B in both the spring
and fall of 2006. The norm sample was chosen using a stratified random sampling procedure
with geographical region, gender, SES, race, and special education status as stratifying variables.
Sample percentages within each stratifying category matched closely the 2004 U.S. Census data.
Students with a verified "specific learning disability" (manual, p. 61) were slightly
overrepresented, and Hispanics in the low socioeconomic status category were somewhat
underrepresented in the norm sample.
Item difficulty, item fit, and DIF values from the spring standardization results were used to
determine the final item set and item order. A table identifying the number of items dropped due
to various reasons (e.g., poor fit, difficulty difference) is provided. Methods used to determine
optimal start points, determine basal and ceiling set rules, and examine the accuracy of the
scoring rules are also described clearly in the manual.
To evaluate the comparability of Forms A and B, means and standard deviations of the subtest,
area, and total test raw scores by grade and season are provided. The majority of the subtest
means across forms are within 2 points and the standard deviations are similar. To develop the
norms, half of the norm sample was administered Form A and the other half was administered
Form B. The two forms were then calibrated jointly, with a sample of 280 examinees who were
administered both forms serving as the link. A linear transformation of the ability estimates for
each examinee from the joint calibration was conducted to obtain w-ability scores. Subtest age
norms were developed for each of the 17 age groups using normalizing translations that
converted w-ability scores to scale scores. For each of the subtests, a growth curve for each odd-value scale score (scores range from 1 to 19) was plotted against age and then smoothed. These
procedures were used to calculate the w-ability score values for the 62 age groups reported in the
final norm tables. Using linear interpolation, scale score equivalents for intervening w-ability
scores were obtained. Finally, w-ability scores were converted back to their corresponding raw
score values. Grade equivalents were obtained for K-9 and age equivalents were obtained for
ages 4 years, 6 months through 15 years, 11 months.
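Two steps in this norming pipeline lend themselves to a brief sketch: the normalizing translation (matching each w-ability score's within-group percentile to the same percentile of a normal scale-score distribution) and the linear interpolation used for intervening w-ability values. This is a generic illustration of the two techniques, not the publisher's procedure, and the numbers are invented:

```python
from statistics import NormalDist

def normalizing_translation(w_scores, mean=10, sd=3):
    """Map each observed w-ability score in an age group to a scale score
    (mean 10, SD 3) whose normal-curve percentile matches the score's
    empirical mid-rank percentile. Generic normalizing-translation sketch."""
    n = len(w_scores)
    table = {}
    for i, w in enumerate(sorted(w_scores), start=1):
        pct = (i - 0.5) / n                       # mid-rank percentile
        table[w] = mean + sd * NormalDist().inv_cdf(pct)
    return table

def interpolate(w, w_lo, s_lo, w_hi, s_hi):
    """Linear interpolation of the scale score for an intervening w value."""
    t = (w - w_lo) / (w_hi - w_lo)
    return s_lo + t * (s_hi - s_lo)

# Invented w-ability scores for one age group; the median lands on scale 10.
table = normalizing_translation([478, 482, 485, 490, 493, 497, 501])
print(round(table[490], 2))                        # 10.0
# Scale score for an intervening w of 487, given table entries at 480 and 490:
print(round(interpolate(487, 480, 8.0, 490, 11.0), 2))  # 10.1
```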
Reliability. The split-half method was used to estimate internal-consistency reliability. Internal
consistency coefficients for subtests, areas, and total test across Forms A and B for both the fall
and the spring administration are presented by grade level and by age. The coefficients range
from .60 to .95 with the large majority of estimates in the .80s. The lowest coefficients were for
primary grades.
To obtain alternate-forms reliability, the sample was divided into two subgroups, K-5 and 6-12.
The test forms were administered, using a counterbalanced design, to 280 examinees at intervals
averaging 10 days. Alternate-form reliabilities for subtests for both samples range from the
middle .70s to the low .90s, and alternate-form reliabilities for areas were in the high .80s to mid
.90s.
To obtain test-retest reliability coefficients, the sample was again divided into two subgroups.
The test forms were administered, on average, 17 days apart to 103 examinees. Test-retest
reliabilities for subtests in the K-5 subgroup range from .65 to .95 with the majority in the middle
.80s. Test-retest reliabilities for the subtests in the 6-12 subgroup range from .70 to .92 with the
majority in the high .80s.
Validity. Validity evidence provided pertains to the test content, adequacy of detecting
performance growth, internal structure of the test, external structure of the test, and adequacy of
potential decisions based on the test scores. Content validity evidence is provided by a well-defined test blueprint that specifies objectives within the 10 subtests and was informed by the
Principles and Standards for School Mathematics (NCTM, 2000), state math standards, and
information provided by educational practitioners and math curriculum consultants. Items were
also reviewed for content and bias.
Performance growth across grades was verified by several methods. First, mean raw scores on
subtests, areas, and total test showed an increase across grades. As expected there was more
rapid growth during the early grades and less growth at the upper grades. Second, the median
growth scale value (GSV) corresponding to the total test raw score was plotted for each grade
and the median GSV increased from K through 12.
To provide internal structure evidence, correlations among subtest scores were obtained as well
as correlations among the subtests and area and total standard scores. It was expected that the
correlations between subtests in the same area would be higher than correlations between
subtests in different areas. However, correlations between subtest scores were similar regardless
of area, with most exceeding .60. For the K-2 group, for example, the correlations between areas
range from .72 to .83 and the correlations between area and total test range from .84 to .97.
External validity evidence was obtained by examining the relationship between the KeyMath-3
DA test and the earlier edition of the test (KeyMath-R/NU), the individually administered
Kaufman Test of Educational Achievement, Second Edition (KTEA-II), and three group-administered mathematics tests. The results indicated that the subtest, area, and total test scores are
related to these other measures and in some cases there is an expected differential pattern of
correlations when using the subtest and area scores.
KeyMath-3 DA is often used with special populations, so average standard scores were
obtained for six groups that represent different diagnoses or special education classifications
(e.g., attention-deficit, giftedness, math learning disability). Expected differences between the
weighted means for the special populations and the general population were obtained. For
example, weighted means for the math learning disability group were generally 1 standard
deviation lower than the general population means.
COMMENTARY. The KeyMath-3 DA is an individually administered test that provides
valuable diagnostic information to teachers regarding students' strengths and weaknesses. This
latest edition better reflects the mathematics being taught in schools by the addition of the
problem-solving subtests, the linkage of items to the NCTM process standards, and use of
calculators at the upper grade levels. Because the revision for this test was based heavily on the
NCTM process standards, the use of external reviewers to verify the link between the items and
the NCTM standards would have been a welcome addition. However, there is ample information
about the content of the test to help potential users determine if it is appropriate for their
purposes. The inclusion of items that ask students to explain their procedures and understanding
is noteworthy. For some of these items, additional follow-up questions are provided to the
examiner in order to prompt students to provide fuller explanations, which is an attractive feature
for an individually administered test.
The manual provides an excellent overview of the available normative scores and the way in
which to interpret and use these scores. The score profile clearly displays information that will
aid the user in identifying instructional needs for the individual student. The confidence band for
each of the scores is drawn on the profile, aiding in accurate interpretation by considering
measurement error. Further, a 68% confidence band is used for the subtest scores, whereas a
90% or 95% confidence band is used for the area and total scores. The use of different
confidence bands depending on the nature of the interpretation and potential decision reflects the
care of the test author in providing the most appropriate information to the test user.
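The bands themselves follow from classical test theory: the standard error of measurement is SEM = SD × √(1 − reliability), and a band is the observed score plus or minus the normal deviate for the chosen confidence level times the SEM. A minimal sketch with placeholder reliabilities and standard deviations (not values from the manual):

```python
import math
from statistics import NormalDist

def confidence_band(score, sd, reliability, level=0.68):
    """Classical-test-theory confidence band around an observed score:
    score +/- z * SEM, where SEM = sd * sqrt(1 - reliability)."""
    sem = sd * math.sqrt(1.0 - reliability)
    z = NormalDist().inv_cdf(0.5 + level / 2.0)  # two-sided normal deviate
    return score - z * sem, score + z * sem

# Placeholder values only: a subtest scale score (SD 3, reliability .85)
# with a 68% band, and an area standard score (SD 15, reliability .95)
# with a 90% band, mirroring the policy described above.
print(confidence_band(12, 3, 0.85, level=0.68))
print(confidence_band(104, 15, 0.95, level=0.90))
```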
The manual provides a clear and comprehensive description of the design and the technical
quality of the test. The procedures and samples for the standardization were well documented. It
should be noted, however, that the spring standardization sample was used to determine the final
item set and item order. It was assumed that because the Rasch item difficulties are not
dependent on characteristics of the sample, values obtained from the spring standardization
would generalize to the entire data set. Theoretically, the difficulty estimates should be invariant
across samples; however, this finding does not always occur in practice. An examination of the
invariance of item parameters across the spring and fall samples could have provided evidence to
support this assumption. Reliability coefficients for subtest scores are moderate to high and are
generally high for the area and total scores, indicating that users can have confidence in using the
area and total scores for making relatively high-stakes decisions and using the subtest scores in
examining a student's strengths and weaknesses.
Overall, the validity evidence supports using the test for its intended purpose. The author,
however, may want to explore the extent to which each subtest measures something unique so
that patterns of performance on the subtests can better inform the design of individual student
intervention programs.
SUMMARY. The KeyMath-3 Diagnostic Assessment is an individually administered test that is
well developed and provides scores that can inform the design of individual student intervention
programs and monitor performance over time. The manual is well written and provides clear
guidelines for administering, scoring, and interpreting the scores. The technical information
provided in the manual is comprehensive, clearly presented, and supports the use of the
instrument for its purpose.
REVIEWERS' REFERENCE
National Council of Teachers of Mathematics. (2000). Principles and standards for school
mathematics. Reston, VA: Author.