[18083510] KeyMath-3 Diagnostic Assessment. Purpose: Designed to assess understanding and applications of mathematics concepts and skills. Population: Ages 4-6 to 21-11. Publication Dates: 1971-2007. Acronym: KeyMath-3 DA. Scores, 14: Basic Concepts (Numeration, Algebra, Geometry, Measurement, Data Analysis and Probability, Total), Operations (Mental Computation and Estimation, Addition and Subtraction, Multiplication and Division, Total), Applications (Foundations of Problem Solving, Applied Problem Solving, Total), Total. Administration: Individual. Forms, 2: A, B. Price Data, 2008: $699 per complete kit including manual (2007, 371 pages), Form A and Form B test easels, 25 Form A and Form B record forms, and carrying bag; $399 per single-form kit (specify form) including manual, test easels, 25 record forms, and carrying bag; $76 per 25 record forms (specify form); $259 per KeyMath-3 ASSIST scoring software. Time: (30-40) minutes, Grades PK-2; (75-90) minutes, Grades 3 and up. Author: Austin J. Connolly. Publisher: Pearson. Cross References: For reviews by G. Gage Kingsbury and James A. Wollack of a previous edition, see 14:194; see also T5:139 (15 references) and T4:1355 (5 references); for reviews by Michael D. Beck and Carmen J. Finley of an earlier edition, see 11:191 (26 references); see also T3:1250 (12 references); for an excerpted review by Alex Bannatyne of an earlier edition, see 8:305 (10 references).

Review of the KeyMath-3 Diagnostic Assessment by THERESA GRAHAM, Adjunct Faculty, University of Nebraska-Lincoln, Lincoln, NE:

DESCRIPTION. The KeyMath-3 Diagnostic Assessment (herein referred to as the KeyMath-3 DA) is an untimed, norm-referenced test that provides a comprehensive assessment of key mathematical concepts and skills for individuals ranging in age from 4 years, 6 months through 21 years, 11 months. The KeyMath-3 DA is composed of two parallel forms (Form A and Form B), each containing 10 subtests. The subtests, based on the National Council of Teachers of Mathematics (NCTM) Principles and Standards for School Mathematics (NCTM, 2000), are organized into the following areas: Basic Concepts (Numeration, Algebra, Geometry, Measurement, Data Analysis and Probability), Operations (Mental Computation and Estimation, Addition and Subtraction, Multiplication and Division), and Applications (Foundations of Problem Solving, Applied Problem Solving). The KeyMath-3 DA was revised to update test items to correspond with the NCTM standards and to create parallel forms (Forms A and B), which allow educators to monitor an individual's progress by administering the forms in alternation. In addition, the KeyMath-3 DA was designed to provide a link to the KeyMath-3 Essential Resources (KeyMath-3 ER), an instructional program that includes lessons and activities directly related to the 10 subtests of the KeyMath-3 DA. Finally, the KeyMath-3 DA provides updated normative and interpretive data.

ADMINISTRATION AND SCORING. The materials for the KeyMath-3 DA consist of two easels for each form, a manual, and record forms for both forms. The easels are self-standing and very easy to use, with tabs for the different subtests and instructions on start points and on establishing basal and ceiling levels.
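The basal/ceiling mechanics lend themselves to a compact sketch. The example below is illustrative only: the four-consecutive-errors ceiling rule is an assumed placeholder (the manual specifies the actual basal and ceiling rules), and the raw-score step applies the rule described later in this review, namely the highest (ceiling) item minus the total number of errors.

```python
def score_subtest(responses: dict[int, int], ceiling_run: int = 4) -> int:
    """Compute a subtest raw score from item responses.

    responses maps item number (1-based, in administered order) to
    1 (correct) or 0 (incorrect). ceiling_run=4 is a hypothetical
    ceiling rule, not the published one.
    """
    run = 0
    ceiling_item = max(responses)  # default: last item administered
    for item in sorted(responses):
        run = run + 1 if responses[item] == 0 else 0
        if run == ceiling_run:  # ceiling set reached; testing stops here
            ceiling_item = item
            break
    errors = sum(1 for i in responses if i <= ceiling_item and responses[i] == 0)
    # Raw score = highest (ceiling) item minus total errors
    return ceiling_item - errors
```

For example, an examinee whose ceiling item is 23 and who made 5 errors along the way would earn a raw score of 18.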
The KeyMath-3 DA manual describes the general testing guidelines and scoring information. Answers and score summaries, including raw score, scale score, standard score, confidence interval, grade/age equivalent, and percentile rank, can be recorded on the record form. The Numeration subtest is always administered first, with the start point determined by grade level. The start points for the other subtests are determined by the ceiling item on the Numeration subtest. Stop points are determined by the ceiling set and ceiling item. It is generally recommended that all subtests be administered in the order in which they are presented; however, the test author notes that there may be reasonable circumstances in which only a specific area (e.g., the Basic Concepts area) would be administered.

According to the manual, average testing times range from 30-40 minutes for younger examinees to 75-90 minutes for older examinees. Both estimates may understate actual testing time, given the number of subtests and the breadth of items included. It may be that the estimates do not include rest breaks, which may be necessary for examinees of any age; test fatigue may become an issue for some examinees.

Because all of the correct responses are printed on the test easel, scoring is easily done while the test is being administered. A "1" is circled for correct responses, and a "0" is circled for incorrect responses. Detailed scoring rules are provided in the manual. A subtest raw score is determined by subtracting the total number of incorrect items from the number of the highest (ceiling) item, as sketched above. Appendices in the testing manual provide all of the normative and interpretive tables for both forms, which are used to convert a raw score to a standard score and to assess score profile, subtest comparisons, and functional range. Additional tables are provided to convert a raw score to a growth scale value (GSV) when using the optional ASSIST scoring and reporting software. Finally, because the KeyMath-3 DA was developed simultaneously with the KeyMath-3 ER, results on the KeyMath-3 DA can be linked to instructional programs included in the KeyMath-3 ER. Although all of the scoring can be done easily by hand, examiners may want to purchase the ASSIST scoring and reporting software, especially if they plan to follow examinee progress over time or to build an instructional plan using the KeyMath-3 ER. The software reportedly provides progress reports and graphical displays of performance over time.

DEVELOPMENT. The development of the KeyMath-3 DA began with surveys of relevant professionals regarding the test's content, administration, and use. In response to the survey results, five NCTM content standards (Number and Operations, Algebra, Geometry, Measurement, Data Analysis and Probability) and five process standards (Problem Solving, Reasoning and Proof, Communication, Connections, and Representation) were used to frame the 10 subtests and to generate about 550 new items. These items were initially piloted on a group of 250 students in Grades 1 through 8. From the pilot, many items were modified, the number of extended-response items was scaled back, and start points and sequencing were determined. Prior to the standardization studies, two tryout studies were conducted to pretest the items for item difficulty, grade-level performance, and gender/ethnicity bias. In the first tryout study, there were 1,238 participants ranging from prekindergarten through ninth grade.
Although the author notes that the "samples were also controlled by race/ethnicity" (manual, p. 45), no information was given regarding socioeconomic status. Moreover, the sample did not include participants in grades higher than ninth grade. In the administration, the subtests were divided into two forms to reduce testing time. Item difficulty, item fit, and item discrimination were determined using Rasch analyses. In addition, item distractors for the multiple-choice items were evaluated. Finally, a panel of reviewers representing different cultural and geographical backgrounds reviewed the test for cultural sensitivity. Based on the results of the first tryout study, the test author states that some items were added and others were modified or dropped; however, examples of specific changes and the exact number of changes are not provided. A second tryout study was conducted with items from only two of the subtests (Foundations of Problem Solving and Applied Problem Solving) because the test author noted that many of the items in these subtests had been modified as a result of the first tryout study. In this study, 1,196 individuals were included (again ranging from prekindergarten through ninth grade). A separate, smaller tryout study (N = 199; children in kindergarten through Grade 2) was performed to assess the wording of items in the Mental Computation and Estimation subtest. Although the test author states that changes were made as a result of these studies, specific results are not provided.

For the standardization study, two forms (Form A and Form B) were developed, each consisting of 444 items. A few of the items were the same on both forms (the exact number is not noted); the rest were similar in content but differed slightly in the specific numbers used. The example given is an item from the Mental Computation and Estimation subtest in which only the numbers in a subtraction problem were changed. The standardization study consisted of two administrations, spring and fall. In the spring administration, about half of the sample took Form A, and 280 examinees took both Forms A and B. Rasch analyses assessed item difficulty and item fit, and the parallel forms were compared for similarity in difficulty. As a result, 74 items were dropped after the spring administration for poor fit (19 items), difficulty differences between the two forms (14 items), or reduction of item concentration at the higher end of the assessment (41 items), resulting in 372 items. No mention was made as to how many (if any) of the remaining items were altered.

TECHNICAL. Standardization. The sample used to standardize the KeyMath-3 DA consisted of 3,630 individuals ranging in age from 4 years, 6 months through 21 years (1,565 in the spring administration; 1,540 in the fall administration). The author's goal was to test at least 110 individuals at each grade level in each season. The sample was recruited to match U.S. distributions of ethnicity, mother's educational level, geographic region, and special education status. Tables in the manual describe the distribution of the sample in terms of ethnicity, gender, education level, and geographic region. Generally, examinees were given either Form A or Form B; a subsample of 280 was administered both forms. Because Forms A and B were not equivalent, raw scores could not be pooled, and the data from the two forms were treated as separate normative data sets.
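For reference, the Rasch analyses mentioned here (and the joint calibration described next) model the probability that an examinee answers an item correctly solely in terms of examinee ability (theta_p) and item difficulty (b_i). This is the standard one-parameter formulation, not a formula quoted from the manual:

```latex
P(X_{pi} = 1 \mid \theta_p, b_i) = \frac{e^{\theta_p - b_i}}{1 + e^{\theta_p - b_i}}
```

Because abilities and difficulties share a common logit scale, calibrating Forms A and B jointly places the items of both forms on one scale, which is what makes cross-form score comparisons possible.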
Raw scores for examinees who took both forms were converted to w-ability scores via a joint Rasch calibration of the two forms, a procedure that allows standard scores from the two forms to be compared.

Reliability. Reliability was assessed in terms of internal consistency, test-retest reliability, and alternate-form reliability. Internal consistency was estimated using the split-half method, with appropriate adjustments made because examinees at the same grade or age may take different sets of test items. Reliability coefficients are presented for the different age groups and grades for fall and spring; coefficients for the total test score ranged from .86 to .99, depending on age group. The data from the two forms were pooled to provide an estimate of the population variance for each age and grade, and the pooled variance was used to adjust the reliability for each form to better approximate the reliability in the population. Alternate-form reliability was assessed by counterbalancing Form A and Form B for a subset of the standardization sample (N = 280, see above). The median alternate-form reliabilities are .82 and .85, suggesting that similar scores would be obtained from either form. Finally, test-retest reliability was assessed with a group of 103 examinees (ranging from Pre-K to Grade 12 and divided into two grade ranges); however, little information is given regarding the exact distribution of ages of the examinees included. The retest interval ranged from 6 to 28 days. The median subtest test-retest reliability coefficients ranged from .86 to .88, demonstrating high stability over time.

Validity. Content validity was established in several ways. First, the KeyMath-3 DA was created using the NCTM principles and standards, utilizing NCTM materials. In addition, over 400 educators and professional consultants provided feedback regarding the content of the assessment. Construct validity was assessed by demonstrating that raw scores on the KeyMath-3 DA increase rapidly at the younger ages and begin to plateau in high school. However, raw scores increased by an average of only 2 points across the high school grades, a finding suggesting that the test may not adequately measure the higher-level mathematics taught in high school. The KeyMath-3 DA was shown to be related to a number of different measures of mathematics achievement, including the KeyMath Revised, Normative Update: A Diagnostic Inventory of Essential Mathematics (KeyMath-R/NU; Connolly, 1998), the Kaufman Test of Educational Achievement, Second Edition (KTEA-II; Kaufman & Kaufman, 2004), the Iowa Tests of Basic Skills (ITBS; Hoover, Dunbar, & Frisbie, 2001), the Measures of Academic Progress (MAP; Northwest Evaluation Association, 2004), and the Group Mathematics Assessment and Diagnostic Evaluation (GMADE; Williams, 2004). For both the KeyMath-R/NU and the KTEA-II, administration was counterbalanced, with average test intervals of 9-11 days. For the other inventories, data were gathered from school records, with average test intervals of 31-44 days. Tables summarizing the demographics of the study samples are provided in the test manual. The correlation between total test scores on the KeyMath-3 DA and the KeyMath-R/NU averaged .92, with subtest correlations ranging from the low .70s to the low .90s. However, average standard scores from these two measures differed, such that the scores should not be considered interchangeable.
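This caution is worth underlining, because a high correlation by itself says nothing about agreement in level: the Pearson correlation is unaffected by a constant shift between two score scales. A toy illustration (the numbers are invented for demonstration only):

```python
import numpy as np

# Invented scores: new_test is simply old_test shifted down by 5 points.
old_test = np.array([85.0, 90.0, 100.0, 110.0, 120.0])
new_test = old_test - 5.0

print(np.corrcoef(old_test, new_test)[0, 1])  # 1.0: perfectly correlated
print(old_test.mean() - new_test.mean())      # 5.0: yet the means differ
```

Two tests can thus rank examinees almost identically while yielding systematically different standard scores, which is why the caution against treating the scores as interchangeable matters in practice.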
For the other measures, performance on the KeyMath-3 DA was moderately to highly correlated with performance on the other mathematics achievement measures. Finally, analyses were performed to assess whether performance on the KeyMath-3 DA distinguished among examinees from special populations, including examinees identified as gifted, diagnosed with ADHD, diagnosed with a learning disability (in math alone, reading alone, or math and reading combined), and identified with a mild intellectual disability. As expected, examinees identified with ADHD did not differ from what would be expected in the general population; performance on the KeyMath-3 DA did, however, differ significantly between each of the remaining special population groups and the general population. All three of these lines of validity evidence suggest that the KeyMath-3 DA is a sufficiently valid measure of mathematics achievement.

COMMENTARY. The KeyMath-3 DA was designed to provide a comprehensive measure of mathematics achievement, mirroring the standards and principles outlined by the NCTM, for individuals in prekindergarten through high school. Indeed, the author did a good job of grounding the test in the NCTM framework and of assessing not only procedural knowledge but also problem-solving ability and conceptual knowledge. However, because the test takes a long time to administer and must be administered individually, it may not be useful in a general setting; it may be better suited to cases where a deficit is suspected, or where discrepant abilities are suspected and strengths and weaknesses need to be identified. In addition, the test appears to have a significant ceiling effect and may not be as useful for high school students who are performing at or above age/grade level. At the younger end of the continuum, the connection between the KeyMath-3 DA and the KeyMath-3 ER is a great tool for educators who are looking for instructional programs that address areas of deficit or strength. The KeyMath-3 DA materials are easy to use, and the manual's instructions make it an easy instrument to administer, score, and interpret. It is a reasonably reliable and valid tool for assessing comprehensive understanding of mathematics and can also be used to pinpoint and monitor specific areas of mathematical knowledge. It may also be useful for program evaluation.

REVIEWER'S REFERENCES

Connolly, A. J. (1998). KeyMath Revised, Normative Update manual. Circle Pines, MN: AGS Publishing.

Hoover, H. D., Dunbar, S. B., & Frisbie, D. A. (2001). Iowa Tests of Basic Skills (ITBS). Rolling Meadows, IL: Riverside Publishing.

Kaufman, A. S., & Kaufman, N. L. (2004). Kaufman Test of Educational Achievement, Second Edition (KTEA-II). Circle Pines, MN: AGS Publishing.

National Council of Teachers of Mathematics. (2000). Principles and standards for school mathematics. Reston, VA: Author.

Northwest Evaluation Association. (2004). Measures of Academic Progress (MAP). Lake Oswego, OR: Author.

Williams, K. (2004). The Group Mathematics Assessment and Diagnostic Evaluation (GMADE). Parsippany, NJ: Dale Seymour Publications.

Review of the KeyMath-3 Diagnostic Assessment by SUZANNE LANE, Professor of Research Methodology, and DEBRA MOORE, Ph.D. Candidate, University of Pittsburgh, Pittsburgh, PA:
DESCRIPTION. The KeyMath-3 Diagnostic Assessment (KeyMath-3 DA) is a norm-referenced, individually administered measure of mathematical concepts and skills for students in prekindergarten through Grade 12 (ages 4 years, 6 months through 21 years). The test's purpose is to provide diagnostic information that can be used to tailor intervention programs for students and to monitor student performance over time. The assessment is designed for use with the KeyMath-3 Essential Resources (KeyMath-3 ER) instructional programs. The KeyMath-3 DA is a revised version of the KeyMath Revised (KeyMath-R; 14:194). An Algebra subtest and other new content have been added to align the test with the standards outlined in the Principles and Standards for School Mathematics (NCTM, 2000), and there are new normative and interpretative data. The test consists of two parallel forms (Form A and Form B), each with 372 items divided across three content areas: Basic Concepts, Operations, and Applications. These three content areas are divided into 10 subtests, ranging from Numeration to Applied Problem Solving. Items are presented one at a time on an easel; the test administrator reads the directions and the item and then records the examinee's answer. Items within each subtest are arranged in order of increasing difficulty. The first subtest, Numeration, has a starting point determined by grade level, and successive subtests have starting points based on the ceiling item from the Numeration subtest. The test is untimed, but the author estimates 30 to 40 minutes for elementary grades and 75 to 90 minutes for secondary grades. After administration, the examiner can create score reports by hand or with an optional software package that produces the same score reports plus some additional ones. The reports refer the examiner to tables in the technical manual that provide scale scores, standard scores, percentile ranks, grade equivalents, and age equivalents, as well as five descriptive categories describing the level of student performance.

DEVELOPMENT. The Principles and Standards for School Mathematics (NCTM, 2000), state math standards, and the KeyMath-R guided the revisions. A test blueprint that delineates the 10 subtests and the math topics and objectives within each subtest formed the basis of item development. A pilot study was conducted in the spring of 2004, and the results were used to modify items, order items by difficulty, and determine the starting points for each grade for the tryout. A tryout was conducted during the fall and winter (October through January) of the 2004-2005 school year. A random sampling procedure, stratified by grade and gender, was used to obtain the tryout sample of over 2,400 students in prekindergarten through Grade 9. A total of 496 items were divided into subtests across two forms. The data were analyzed using the Rasch model. The item difficulty distribution for each subtest was examined; items were added to fill gaps in the distribution and removed from dense areas of the distribution (one way to operationalize such gaps is sketched below). No information, however, is provided on the targeted item difficulty distribution and the extent to which the targets were achieved. Analyses of item fit, differential item functioning, and reliability were also conducted. Finally, fairness reviews were conducted, and the results were used in decisions regarding item modifications and deletions. A second tryout was conducted for two subtests that had considerable modifications as a result of the first tryout.
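As a concrete reading of the gap-filling step above, one could sort the calibrated Rasch item difficulties and flag any empty stretch of the difficulty continuum wider than a chosen threshold. The 0.5-logit threshold below is invented for illustration; the manual does not report a target spacing.

```python
import numpy as np

def difficulty_gaps(b_values, max_gap=0.5):
    """Return (low, high) stretches of the Rasch difficulty scale wider
    than max_gap logits that contain no items; such stretches are
    candidates for new-item writing, while clusters of near-equal
    difficulties are candidates for pruning."""
    b = np.sort(np.asarray(b_values, dtype=float))
    return [(float(lo), float(hi)) for lo, hi in zip(b[:-1], b[1:]) if hi - lo > max_gap]

# Example: difficulty_gaps([-2.1, -1.9, -0.4, 0.0, 1.8])
# -> [(-1.9, -0.4), (0.0, 1.8)]
```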
TECHNICAL. Standardization. A nationally representative sample of 3,630 individuals aged 4 years, 6 months to 21 years, 11 months participated in the standardization of Forms A and B in the spring and fall of 2006. The norm sample was chosen using a stratified random sampling procedure with geographical region, gender, SES, race, and special education status as stratifying variables. Sample percentages within each stratifying category closely matched the 2004 U.S. Census data; students with a verified "specific learning disability" (manual, p. 61) were slightly overrepresented, and Hispanics in the low socioeconomic status category were somewhat underrepresented in the norm sample. Item difficulty, item fit, and DIF values from the spring standardization results were used to determine the final item set and item order. A table identifying the number of items dropped for various reasons (e.g., poor fit, difficulty difference) is provided. Methods used to determine optimal start points, set basal and ceiling rules, and examine the accuracy of the scoring rules are also described clearly in the manual. To evaluate the comparability of Forms A and B, means and standard deviations of the subtest, area, and total test raw scores by grade and season are provided; the majority of the subtest means across forms are within 2 points, and the standard deviations are similar.

To develop the norms, half of the norm sample was administered Form A and the other half Form B. The two forms were then calibrated jointly, with the sample of 280 examinees who were administered both forms serving as the link. A linear transformation of each examinee's ability estimate from the joint calibration was conducted to obtain w-ability scores. Subtest age norms were developed for each of the 17 age groups using normalizing transformations that converted w-ability scores to scale scores. For each subtest, a growth curve for each odd-value scale score (scale scores range from 1 to 19) was plotted against age and then smoothed. These procedures were used to calculate the w-ability score values for the 62 age groups reported in the final norm tables; scale score equivalents for intervening w-ability scores were obtained using linear interpolation. Finally, w-ability scores were converted back to their corresponding raw score values. Grade equivalents were obtained for Grades K-9, and age equivalents were obtained for ages 4 years, 6 months through 15 years, 11 months.

Reliability. The split-half method was used to estimate internal-consistency reliability. Internal consistency coefficients for subtests, areas, and the total test across Forms A and B for both the fall and the spring administrations are presented by grade level and by age. The coefficients range from .60 to .95, with the large majority of estimates in the .80s; the lowest coefficients were for the primary grades. To obtain alternate-form reliability, the sample was divided into two grade subgroups, K-5 and 6-12. The test forms were administered, using a counterbalanced design, to 280 examinees at intervals averaging 10 days. Alternate-form reliabilities for subtests in both subgroups range from the middle .70s to the low .90s, and alternate-form reliabilities for areas range from the high .80s to the mid-.90s. To obtain test-retest reliability coefficients, the sample was again divided into two subgroups, and the test forms were administered, on average, 17 days apart to 103 examinees.
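Computationally, each coefficient reported below is a Pearson correlation between paired score sets, and the split-half coefficients additionally apply the Spearman-Brown correction to step a half-length correlation up to full test length. A minimal sketch (array names are illustrative, not from the publisher's software):

```python
import numpy as np

def retest_reliability(scores_t1, scores_t2):
    """Test-retest (or alternate-form) reliability: the correlation
    between examinees' scores on two administrations."""
    return float(np.corrcoef(scores_t1, scores_t2)[0, 1])

def spearman_brown(r_half):
    """Step a half-test correlation up to a full-length reliability estimate."""
    return 2 * r_half / (1 + r_half)
```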
Test-retest reliabilities for subtests in the K-5 subgroup range from .65 to .95, with the majority in the middle .80s; test-retest reliabilities for subtests in the 6-12 subgroup range from .70 to .92, with the majority in the high .80s.

Validity. The validity evidence provided pertains to the test content, the adequacy of detecting performance growth, the internal structure of the test, the external structure of the test, and the adequacy of potential decisions based on the test scores. Content validity evidence is provided by a well-defined test blueprint that specifies objectives within the 10 subtests and was informed by the Principles and Standards for School Mathematics (NCTM, 2000), state math standards, and information provided by educational practitioners and math curriculum consultants. Items were also reviewed for content and bias. Performance growth across grades was verified by several methods. First, mean raw scores on subtests, areas, and the total test increased across grades; as expected, growth was more rapid in the early grades and slower in the upper grades. Second, the median growth scale value (GSV) corresponding to the total test raw score was plotted for each grade, and the median GSV increased from Grades K through 12. To provide internal structure evidence, correlations among subtest scores were obtained, as well as correlations among the subtests and the area and total standard scores. It was expected that correlations between subtests in the same area would be higher than correlations between subtests in different areas; however, correlations between subtest scores were similar regardless of area, with most exceeding .60. For the K-2 group, for example, the correlations between areas range from .72 to .83, and the correlations between areas and the total test range from .84 to .97. External validity evidence was obtained by examining the relationship between the KeyMath-3 DA and the earlier edition of the test (KeyMath-R/NU), the individually administered Kaufman Test of Educational Achievement, Second Edition (KTEA-II), and three group-administered mathematics tests. The results indicated that subtest, area, and total test scores are related to these other measures, and in some cases there is an expected differential pattern of correlations when using the subtest and area scores. Because the KeyMath-3 DA is often used with special populations, average standard scores were obtained for six groups representing different diagnoses or special education classifications (e.g., attention-deficit disorder, giftedness, math learning disability). Expected differences between the weighted means for the special populations and the general population were observed; for example, weighted means for the math learning disability group were generally 1 standard deviation below the general population means.

COMMENTARY. The KeyMath-3 DA is an individually administered test that provides valuable diagnostic information to teachers regarding students' strengths and weaknesses. This latest edition better reflects the mathematics being taught in schools through the addition of the problem-solving subtests, the linkage of items to the NCTM process standards, and the use of calculators at the upper grade levels. Because the revision was based heavily on the NCTM process standards, the use of external reviewers to verify the link between the items and the NCTM standards would have been a welcome addition.
However, there is ample information about the content of the test to help potential users determine whether it is appropriate for their purposes. The inclusion of items that ask students to explain their procedures and understanding is noteworthy; for some of these items, follow-up questions are provided so the examiner can prompt students to give fuller explanations, an attractive feature for an individually administered test.

The manual provides an excellent overview of the available normative scores and of how to interpret and use them. The score profile clearly displays information that will aid the user in identifying instructional needs for the individual student. The confidence band for each score is drawn on the profile, aiding accurate interpretation by accounting for measurement error. Further, a 68% confidence band is used for the subtest scores, whereas a 90% or 95% confidence band is used for the area and total scores. The use of different confidence bands depending on the nature of the interpretation and potential decision reflects the care of the test author in providing the most appropriate information to the test user.

The manual provides a clear and comprehensive description of the design and the technical quality of the test, and the procedures and samples for the standardization are well documented. It should be noted, however, that the spring standardization sample was used to determine the final item set and item order. It was assumed that, because Rasch item difficulties are not dependent on characteristics of the sample, values obtained from the spring standardization would generalize to the entire data set. Theoretically, the difficulty estimates should be invariant across samples; in practice, however, this invariance does not always hold. An examination of the invariance of item parameters across the spring and fall samples could have provided evidence to support this assumption. Reliability coefficients for subtest scores are moderate to high and are generally high for the area and total scores, indicating that users can have confidence in using the area and total scores for relatively high-stakes decisions and in using the subtest scores to examine a student's strengths and weaknesses. Overall, the validity evidence supports using the test for its intended purpose. The author, however, may want to explore the extent to which each subtest measures something unique, so that patterns of performance on the subtests can better inform the design of individual student intervention programs.

SUMMARY. The KeyMath-3 Diagnostic Assessment is an individually administered test that is well developed and provides scores that can inform the design of individual student intervention programs and monitor performance over time. The manual is well written and provides clear guidelines for administering, scoring, and interpreting the scores. The technical information provided in the manual is comprehensive, clearly presented, and supports the use of the instrument for its purpose.

REVIEWER'S REFERENCE

National Council of Teachers of Mathematics. (2000). Principles and standards for school mathematics. Reston, VA: Author.