PSYCHOMETRIC ANALYSIS OF VOCABULARY TESTS

A Psychometric Analysis of Childhood Vocabulary Tests

By Ellen L. Bogue

Honors Thesis

Submitted to the Department of Speech and Hearing Science and the College of Applied Health Sciences in partial fulfillment of the requirements for James Scholar distinction.

May, 2011

Research Mentor: Dr. Laura S. DeThorne

Literature Review

The present study provides a review of the psychometric properties of 10 commonly used child vocabulary measures. Standardized tests, defined as measures that provide normative data through use of standardized administration materials and procedures, are commonly used by both clinicians and investigators within speech-language pathology. Standardized tests are employed across ages, settings, and disorders for a variety of purposes. Within speech-language pathology, normed measures are often used to screen for deficits, identify specific areas of strengths and weaknesses, plan intervention, and monitor language progress (Plante & Vance, 1995; Merrell & Plante, 1997). A child’s performance on a standardized test will have different implications depending on the purpose for which the test is being used, as DeThorne & Schaefer (2004) note in their discussion of high- versus low-stakes testing. A high-stakes situation is one in which the outcome will determine diagnosis, educational placement, or treatment eligibility, while screening or research studies would be considered low-stakes testing environments, at least for the individual being tested. Because standardized tests are commonly used for a variety of high-stakes purposes, it is important to understand a test’s effectiveness for its intended purpose. Understanding the effectiveness of standardized measures is contingent on their psychometric properties.
Three aspects of test psychometrics are considered here: those related to the composition of the normative sample, evidence of reliability, and evidence of validity. Each of these areas will be reviewed, followed by a general discussion of language assessment procedures, with specific focus on vocabulary.

Psychometric Properties of Test Construction

Standardization Sample

The first area of psychometric analysis relates to the quality of the normative sample, which must be adequate in size, representativeness, and recency. For a standardization sample to be representative, it must be large enough to encompass “the full range of the distribution” (Andersson, 2005). A common sample size criterion is 100 or more individuals for each normative group in the sample, which is most often divided by age (Andersson, 2005; DeThorne & Schaefer, 2004; McCauley & Swisher, 1984). For example, if a test provides normative data in six-month increments for children between the ages of 2 and 4 years, then there should be at least 100 children in each of the following age groups: 2;0-2;6, 2;7-3;0, 3;1-3;6, and 3;7-4;0.

A representative sample is also one that includes individuals who are characteristic of the population for whom a test has been developed, particularly in regard to cultural language variation. Current guidelines for test development recommend that the standardization sample contain groups that are in proportion with the overall population with regard to race/ethnicity, geographic region, parent education level/socioeconomic status, and gender. With regard to including individuals with disabilities in the normative sample, professional views are mixed. Some suggest that children with disabilities should be included so that the full range of language abilities is accurately represented in the sample (Andersson, 2005; DeThorne & Schaefer, 2004).
This position argues that failing to include children with language disabilities in the normative sample would serve to inflate the normative data and potentially increase false negatives in the assessment process. However, there is also indication that including children with language disabilities in normative samples decreases test sensitivity (Pena, Spaulding, & Plante, 2006). In sum, the disagreement surrounding the inclusion of individuals with disabilities within test construction highlights the frequent trade-offs inherent in maximizing sensitivity and specificity. Because of this trade-off, we have decided not to require the inclusion of children with language disabilities as an explicit criterion, but instead focus on the need to include key information on sensitivity and specificity as validity evidence.

The final aspect of the normative sample to be addressed here is recency. Because characteristics of a population can change over time, it is also important for the normative sample to be fairly recent; a test will not be effective if a child is being measured against an outdated sample. A prime example from the realm of cognitive testing is the Flynn effect, which is the tendency of IQ scores to increase by three points per decade (Flynn, 1999). We could argue that vocabulary is particularly sensitive to changes over time, as new words are coined and meanings easily change within the span of a generation. In addition, the objects often portrayed within vocabulary test picture plates, such as telephones and computers, change in expected appearance.

Reliability

Another important criterion in a test’s psychometric evaluation is reliability, a measure of a test’s consistency across examiners (interrater reliability), across test items (internal consistency), and over time (test-retest reliability).
Interrater reliability measures the consistency with which separate administrators score a test, ensuring that scores do not vary considerably between different raters. Internal consistency compares a child’s scores on one subset of test items (e.g., odd-numbered items) with another (e.g., even-numbered items), thus measuring the consistency of the construct being evaluated across items. Test-retest reliability reflects the correlation between an individual’s scores on the same test administered after a certain time period. This evaluates how reliable a child’s scores would be on subsequent administrations of the same test. Across types of reliability, a common criterion is a coefficient of greater than or equal to .90 within each normative group (Andersson, 2005; DeThorne & Schaefer, 2004; McCauley & Swisher, 1984).

Related to reliability is the concept of standard error of measure (SEM), which is derived from internal consistency and allows the calculation of a confidence interval (CI) for an individual child’s score based on the inherent error of test construction and administration (DeThorne & Schaefer, 2004). Confidence intervals typically correspond to standard deviations of the normal curve, forming either a 68% or 95% CI. For example, if a child receives a standard score of 94 and the SEM is ±4 standardized points, then the child’s 68% CI would be 90 to 98. The statistical interpretation of such a confidence interval is that if the child could be administered the same test on 100 occasions, the true score would fall between 90 and 98 on 68% of those occasions. A smaller SEM results in a tighter confidence interval, which corresponds to greater confidence in the child’s score. For this reason it is important that the SEM be reported in the test’s manual for each normative group.

Validity

The third area of psychometric evaluation is validity.
Validity is a measure of how well a test assesses the construct it claims to test, which is the most important and relevant measure of a test’s effectiveness. A test can be reliable without being valid, but cannot be valid without being reliable. Like reliability, evidence for validity takes many forms, but unlike reliability, the criteria are difficult to quantify. Forms of validity evidence include developmental trends, correlation with similar tests, factor analyses, group comparisons, and predictive validity.

Regarding developmental trends, a skill like language development is expected to improve with age. Consequently, raw scores on a language test are expected to increase with age. As such, evidence of a positive association between age and raw scores provides a basic form of validity evidence. While language raw scores should improve with age, there are a number of other factors that also develop with age, so this particular form of validity is far from sufficient in documenting the validity of a test’s construction (DeThorne & Schaefer, 2004).

A second form of support for validity evidence comes from a test’s correlation with other tests designed to assess a similar construct. For example, a newly developed vocabulary test would be expected to correlate highly with other commonly used vocabulary measures. This form of validity only has strength if the test being used as a comparison is psychometrically strong; a high correlation with a poorly constructed test only means that the new test is similarly flawed.

Also significant to the evaluation of validity are factor analyses, group comparisons, and predictive validity. Factor analyses determine how closely different items are related to each other and are measuring the same skill (DeThorne & Schaefer, 2004). Because factor analysis is most commonly applied to multidimensional assessments, it will not be reviewed here.
Particularly germane to the validity of standardized vocabulary tests, however, is the concept of group comparisons. As the name implies, group comparisons involve administering the test to relevant subgroups of the population. Relevant in this case would refer to children with vocabulary deficits compared to typically developing peers. Because these two groups, by definition, differ in their vocabulary abilities, a difference between these two groups that favored the typically developing group would provide evidence of a language test’s validity. Despite the strength of this approach, group differences can still mask a substantial amount of individual variation. Said another way, subgroup distributions can overlap substantially even when means differ. Ultimately it is the extent of overlap between groups that governs a test’s diagnostic accuracy.

Related most directly to diagnostic accuracy is a test’s evidence of sensitivity and specificity. Sensitivity measures how well a test identifies individuals who have a deficit, while specificity measures how well the test classifies those without impairment as typically developing. Since one of the most common uses of standardized tests is to determine whether or not a child has a language impairment, it is very important for a test to be strong in both sensitivity and specificity. Unfortunately, high sensitivity (i.e., identifying all children with true language disability) often increases the likelihood of over-identifying typically developing children as impaired (i.e., false positives), thereby leading to lower specificity. Similarly, high specificity often increases the likelihood of missing children with true impairment (i.e., false negatives), thereby leading to lower sensitivity. Both sensitivity and specificity are likely to vary based on the cutoff criteria used to categorize a child as impaired.
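The dependence of sensitivity and specificity on the cutoff can be made concrete with a brief sketch. The Python example below is illustrative only: the function name `sensitivity_specificity` and both score lists are hypothetical, invented for demonstration on a standard-score scale (mean of 100, standard deviation of 15), and are not drawn from any test manual. It assumes that a child scoring at or below the cutoff is classified as impaired.

```python
from typing import Sequence, Tuple


def sensitivity_specificity(
    impaired_scores: Sequence[float],
    typical_scores: Sequence[float],
    cutoff: float,
) -> Tuple[float, float]:
    """Return (sensitivity, specificity) for a given cutoff score.

    Assumption: a score at or below the cutoff is classified as impaired.
    """
    # Sensitivity: proportion of truly impaired children who are flagged.
    true_positives = sum(score <= cutoff for score in impaired_scores)
    # Specificity: proportion of typically developing children not flagged.
    true_negatives = sum(score > cutoff for score in typical_scores)
    return (
        true_positives / len(impaired_scores),
        true_negatives / len(typical_scores),
    )


# Hypothetical standard scores, invented purely for illustration.
impaired = [70, 74, 78, 80, 82, 84, 86, 88, 90, 95]
typical = [80, 84, 88, 92, 96, 100, 104, 108, 112, 120]

# Compare cutoffs placed 1.0, 1.5, and 2.0 standard deviations below the mean.
for sd_below in (1.0, 1.5, 2.0):
    cutoff = 100 - 15 * sd_below
    sens, spec = sensitivity_specificity(impaired, typical, cutoff)
    print(f"cutoff {cutoff:5.1f}: sensitivity {sens:.2f}, specificity {spec:.2f}")
```

With these invented scores, moving the cutoff further below the mean raises specificity while lowering sensitivity, which is the trade-off described above.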
Consequently, a test would ideally report sensitivity and specificity for each normative group based on the most commonly employed cutoff criteria (i.e., 1.0 to 1.9 standard deviations below the mean; Eickhoff, Betz, & Ristow, 2010). The test’s sensitivity and specificity would also meet criteria for acceptable values. Hargrove (2006) suggests 80% sensitivity and 90% specificity, while Plante and Vance (1995) link acceptability to the purpose of testing. For diagnostic purposes, Plante and Vance suggest that 90% accuracy is “good” and 80% is “fair”; presumably these values apply both to sensitivity and specificity. For screening, Plante and Vance recommend 90-100% sensitivity, with specificity of 80% for a “good” rating and 70% for “fair.”

Unlike sensitivity and specificity values, which relate to present conditions, predictive validity attempts to determine a test’s ability to predict an individual’s performance over time as well as in related areas, such as success in school and reading ability. Although an important form of validity evidence, such information is rarely provided in test manuals, perhaps due to the required longitudinal nature of such data. For this reason, and because criteria for adequate predictive validity have not been established, we have not focused on predictive validity in our test review.

Strengths & Weaknesses

Standardized measures are useful in a number of ways but also demonstrate certain weaknesses. Their prime strength lies in the fact that standardized tests provide both a quantitative depiction of a child’s abilities and normative data against which an individual’s score can be compared. Assuming that the test is psychometrically sound, having a quantitative score makes it easier to compare a child with his or her peers and determine whether or not the child’s performance can be considered within typical limits.
While many problems may be minimized by a well-constructed test, there are also issues that will arise regardless of how carefully a test has been developed. According to McCauley & Swisher (1984), the results of a standardized test can only provide an estimation of a child’s language skill, and therefore should not be relied upon as the sole means of assessment. Since there is so much individual variability, it is important to separate a child’s scores from the child himself or herself, because there are many different factors involved in an individual child’s performance on a standardized test. For example, test results may be influenced by other factors unique to the child such as attention, frustration tolerance, and test anxiety (Fleege, Charlesworth, Burts, & Hart, 1992; Speltz, DeKlyen, Calderon, Greenberg, & Fisher, 1999).

Related to the concept that vocabulary tests do not provide a pure reflection of vocabulary knowledge or potential, it is also true that test scores may not accurately reflect everyday language functioning. Normed measures only provide information about a child’s abilities on a single day with an examiner, most likely in the context of a clinic; tests cannot necessarily predict how a child uses language outside the clinic in more familiar situations. Watkins and DeThorne (2000) highlight the concern that tests may not clearly indicate “functional language needs” and also note that tests do not provide specific information regarding what appropriate vocabulary targets should be. Finally, vocabulary tests are also limited in effectiveness with nonmainstream populations whose dialectal differences may lead to poorer test performance but not be indicative of impairment (Watkins & DeThorne, 2000).

Parent Report Measures

Due in part to shortfalls on the part of standardized tests, parent report measures are an important additional form of evidence in any assessment of child language.
Parent report measures are those that are intended to be completed by the caregiver of the child under consideration. These measures often take the form of checklists or Likert scale items which are given to parents to complete. Like other methods of assessment, parent report measures have both advantages and disadvantages. In terms of advantages, parent report measures are cost- and time-effective for the clinician and less stressful for the child than standardized administration. According to Watkins & DeThorne (2000), such measures can also be useful in identifying areas of strength and weakness and offer inherent social validity, meaning that they will represent how well a child is functioning in his or her community. Parent report measures are also valuable because parents are able to observe their child’s language skills in a variety of settings and across time. While some question how accurate or reliable a parent’s report is, Dinnebeil and Rule (1994) demonstrated the validity of parents’ estimations of their children’s skills through a review of the literature concerning the congruence of parents’ and professionals’ judgments. Results of their review of 23 studies demonstrated a strong positive correlation between the two, with a mean correlation coefficient of .73.

Despite such evidence, parent report, like any single measure, is limited in its perspective. One limitation of parent report is that normative data are not usually available, thereby making it difficult to compare a child to his or her peer group. One exception in the realm of parent report measures of vocabulary is the MacArthur Communicative Development Inventory (Fenson, Marchman, Thal, Dale, Reznick, & Bates, 2007), which provides normative data to accompany its vocabulary checklist.
Caregivers complete the checklist on a standard form and then the examiner is able to compare those results to norms collected in comparable circumstances. Given the normative data, paired with consistent scoring and administration procedures, the MacArthur is included in the present review. In sum, the most valid assessment results are likely to emerge from a combination of assessment methods, integrating results from standardized testing, parent report, and behavioral observation (Watkins & DeThorne, 2000).

Significance of Vocabulary

One important and frequently tested component of child language is vocabulary. Vocabulary measures are most commonly used to determine if intervention is necessary, to track an individual’s progress, and to describe overall language functioning (Watkins & DeThorne, 2000). In addition to their clinical role, standardized measures of vocabulary are utilized in research for similar purposes, as well as to characterize clinical populations of interest. Vocabulary is also a common and important component of IQ assessments, and either poor or exceptional vocabulary skills can have a large impact on IQ scores, which could in turn have an effect on a child’s prognosis and educational placement. As an example of the extent to which vocabulary and IQ are often confounded, the PPVT (test titles and their abbreviations are provided in column 1 of Table 2), a measure evaluated in this review, was originally marketed as an IQ measure.

Beyond individual assessment points, it is important to remember the significance of vocabulary development in a child’s life. Vocabulary skills play a large role in overall language ability and are associated with achievement in reading (Catts, Fey, & Tomblin, 1997; Lombardino et al., 1997), both of which also have a significant impact on a child’s performance in school.
In addition, vocabulary proficiency is linked to other developmental areas, such as social status in school, and it has been shown that a poor vocabulary is associated with more negative perceptions by others; in a study by Gertner, Rice, and Hadley (1994), the PPVT-R was found to be the best predictor of peer popularity among preschoolers. Given the importance of vocabulary in children’s everyday lives, reliable and valid forms of assessment are critical.

Previous Research

Previous research demonstrates that many standardized tests may fall short of psychometric expectations. McCauley & Swisher’s (1984) review of 30 preschool language and articulation measures, three of which (the EOWPVT, PPVT, and PPVT-R) were tests of vocabulary, found that many of the tests failed to meet psychometric criteria. The 30 tests were evaluated on the basis of ten criteria, and half of the criteria were met by fewer than six tests. Results of a review of 21 tests of child language, two of which were vocabulary tests, published ten years later (Plante & Vance, 1994) suggested that there was “little improvement in overall quality” of tests since McCauley and Swisher’s study. The Plante and Vance review used the same ten criteria as McCauley and Swisher’s study and similarly found that of the 21 tests reviewed, no test met more than seven criteria.

Studies regarding the diagnostic accuracy of various global language measures have also been performed, suggesting that while some tests might be fairly accurate in identifying an impairment, others may not be as precise. For example, Spaulding, Plante, & Farinella (2006) have demonstrated that for a majority of the 43 tests in their study, ten of which (the Boehm-3, Boehm-P3, CREVT-2, EOWPVT, EVT, PPVT-III, ROWPVT, Test of Word Knowledge, WORD: A, and WORD: R) were standardized vocabulary measures, the scores of children with a previously identified language impairment were not consistently at the low end of the distribution.
For four of the vocabulary measures involved in the study (the CREVT-2, EVT, PPVT-III, and Boehm-3), mean differences between language-impaired and normative or control groups were less than 1.5 standard deviations. Similar to Spaulding et al., the results of a study of vocabulary tests conducted by Grey, Plante, Vance, & Henrichsen (1999) suggested that none of the four tests included in the study was a strong indicator of specific language impairment. In their study, the PPVT-III, ROWPVT, EOWPVT, and EOWPVT-R (all of which are reviewed in the present study) were administered to preschool-aged children with specific language impairment (SLI) and to preschool-aged children with normal language. While the children with SLI did score lower than the children with normal language, they still scored within the normal range.

Although vocabulary tests have been included in reviews of standardized language measures, we were not able to find a comprehensive review focused on uni-dimensional vocabulary measures. The aim of the present study is to review common child vocabulary measures on the basis of specific psychometric criteria in terms of the standardization sample, reliability, and validity. Because vocabulary measures are so commonly used and vocabulary is such a significant component of language development, it is important that vocabulary tests be reliable and valid evaluations. The ultimate goal of this review, then, is to aid clinicians and researchers in making decisions regarding which measures to use for their specific purposes.

Methods/Results

Test Selection

Assessment tools were included in this review if they met three inclusionary criteria derived from the literature on test development (e.g., McCauley & Swisher, 1984; Plante & Vance, 1994; DeThorne & Schaefer, 2004). First, each test needed to be standardized in the sense that it employed prescribed materials and procedures and provided normative data.
Second, the test needed to be a uni-dimensional test of vocabulary in children under 18 years of age. Multi-dimensional tests that included a vocabulary subtest as part of a larger, more comprehensive assessment were not included, for example, the Clinical Evaluation of Language Fundamentals – Fourth Edition (Semel, Wiig, & Secord, 2003), the Preschool Language Scale – Fourth Edition (Zimmerman, Steiner, & Pond, 2002), and the Test of Language Development – Fourth Edition (Hammill & Newcomer, 2008). Finally, tests also had to have been developed or revised within the past 20 years.

Based on these inclusionary criteria, relevant measures were first identified through test inventories from the University of Illinois Applied Health Sciences Library and the Speech and Language Clinic and through literature review via online databases (e.g., PsycInfo, ERIC, PubMed) and ancestral searches. These search procedures led to 10 standardized vocabulary measures, summarized in Table 1, which served as the focus of this review.

The measures summarized in Table 1 are all assessments of semantic knowledge, three targeting receptive only, four expressive only, and three tapping both receptive and expressive knowledge. In terms of required tasks, all but the MCDI-III and the WORD-2 (versions A and E) include a picture-labeling component, for example, “Show me X” or “What is this?” In contrast, the WORD tests involve associations, synonyms, semantic absurdities, antonyms, definitions, and flexible word use. Unlike all of the other measures, the MCDI-III is a parent report measure that includes a checklist of common early vocabulary. Caregivers are asked to fill in bubbles next to the words their children say and/or understand. Although different in terms of administration procedures from the other measures, the MCDI-III is included here because it is based on standardized administration and includes normative data.
Review Process

Each test was evaluated on the basis of its psychometric properties, including the makeup of the standardization sample as well as evidence of reliability and validity, largely following the criteria set forth by DeThorne & Schaefer (2004). The evaluation was based exclusively on information provided in the test manuals, which were individually reviewed by the investigator through multiple passes. The specification of evaluation criteria is summarized below according to the three primary areas of standardization sample, reliability, and validity.

Standardization Sample. The standardization sample was considered adequate based on three criteria taken from DeThorne & Schaefer (2004): adequacy of size, comparison to census data, and recency. Each individual criterion is highlighted below with a summary of performance. Information regarding the normative sample of individual tests is summarized in columns 1-3 of Table 2.

Size. First, in terms of sample size, there had to be at least 100 individuals in each normed subgroup, meaning that each subgroup for which norms are provided (whether this be by age, gender, grade level, etc.) had to include 100 children. Although 100 is a somewhat arbitrary value, the rationale is that each normative group needs to be large enough to capture the inherent variability associated with any trait. Of the 10 measures reviewed, six failed to meet the size criterion. The EOWPVT and ROWPVT, normed using the same sample, failed to include 100 individuals for only one age group, age two, for which only 60 children were included, and the WORD-2:E missed the cut-off by only two individuals (n=98) for the 11;6-11;11 group. The MAVA failed to meet the size criterion for three groups: 12;0-12;11 for the Receptive test (n=78), and 11;0-11;11 (n=92) and 12;0-12;11 (n=84) for the Expressive test.
The CREVT-2 failed to meet the size criterion because it did not present numbers for each six-month normative age group, instead presenting data per year. Finally, the MCDI-III failed due to individual group sizes that ranged from 29 to 79. In the case of the MCDI-III, it seems important to mention that this measure presents norms separately for girls and boys, and uses six-month, rather than twelve-month, intervals.

Census Data. The second criterion, also concerned with representativeness, was that data from the standardization sample had to be provided in conjunction with U.S. census data from within five years of the test’s development in order to assist comparisons in terms of race/ethnicity, geographic region, parent education level/socioeconomic status, and gender. Results are summarized in column 3 of Table 2. This review did not evaluate how adequately the standardization sample matched the census data due to limited precedent in this area, and to the complexity in determining what constitutes adequate group representation. All 10 tests presented their normative sample data in comparison to census data, with proportions falling within 20%. However, it should be noted that since each test was required to be representative with regard to U.S. census data from within five years of the test’s development, most of the tests’ samples were not based on the most recent census data.

Recency. The third and final criterion for the standardization sample is that it had been collected within the last 15 years; see column 4 of Table 2. This criterion was important given the shift in vocabulary knowledge over time, as well as the easily dated images often used to evaluate it. Consider how the iconic image of ‘telephone’ has changed over the last twenty years, as well as the new vocabulary associated with it (e.g., texting, IM, Twitter). All 10 measures had standardization sample data collected within the past 15 years.
However, though the current version of the MCDI-III was published in 2007 and states that the standardization sample has been updated since past editions (pp. 52-53), no year is given for when the updated sample was collected. Thus, since the last edition was published in 2007, we are inferring that the sample of the 2007 edition meets the recency criterion.

To summarize results for evaluation of the tests’ normative samples, four of the 10 tests fully met all three criteria: the Boehm-3, EVT-2, PPVT-4, and WORD-2:A.

Reliability. Turning now from the standardization sample characteristics to statistical test consistency, each test’s reliability was evaluated in terms of inter-examiner reliability, internal consistency, test-retest reliability, and standard error of measurement. Based on prior review criteria (DeThorne & Schaefer, 2004; McCauley & Swisher, 1984), inter-examiner reliability, internal consistency, and test-retest reliability were considered acceptable if the correlation coefficients were at or above .90 for each normative group. In contrast, there is no clear precedent for what the standard error of measure should be, perhaps because it is in part derived from internal consistency. Given that SEM is derived from reliability, an additional cut-off was not set for this value; instead it was expected that test manuals would provide SEM for each normed subgroup so that examiners could calculate confidence intervals for resulting standard scores.

Internal Consistency. The first aspect of reliability reviewed was internal consistency, a reflection of how reliable a test is across items. Five of the 10 tests reviewed failed to meet the ≥.90 standard for one of two different reasons: either values fell below the .90 criterion or data were not presented for each normed subgroup.
The Boehm-3, WORD-2:A, and WORD-2:E fell into the first category: all presented internal consistency data for each normative group, but the values reported were lower than .90 for at least one subgroup. For the WORD-2:A, no individual subtest value was greater than .84, with the lowest value at .72. Similarly, the WORD-2:E had no value greater than .84, with a range of .69-.80. The Boehm-3 values ranged from .80-.91. The CREVT-2 failed both because it did not present data for each normative group (values were presented per year) and because of low values: 76% (65/86) of the values were above .90, with the lower coefficients ranging from .78 to .89 for our target age range of 18 years or younger. The MCDI-III provided strong values of .95 through .96 across the different vocabulary scales (i.e., Words and Gestures-Words Understood, Words and Gestures-Words Produced, and Words and Sentences-Words Produced); however, these values were collapsed across normative subgroups.

Test-retest. None of the tests reviewed met the ≥.90 criterion for test-retest reliability, although all of the tests did present some test-retest data. Reasons for failing to meet the criterion fell into two categories: either failing to meet the .90 criterion across all subgroups, or failing to provide relevant data for each subgroup. Falling into the first category, the WORD-2:E met criterion for all but two normative groups; 10;6-10;11 and 11;0-11;5 had coefficients of .85 and .87 respectively. Similarly, the WORD-2:A met criterion for all groups except for 12;0-12;11 and 16;0-16;11 (.78 and .89 respectively). Values for the Boehm-3 ranged from .70-.89. Six measures reported reliability coefficients based on collapsed subgroups, which can mask substantial variability: the EOWPVT, ROWPVT, CREVT-2, EVT-2, PPVT-4, and MAVA. Specifically, the CREVT-2, EVT-2, and PPVT-4, though they presented collapsed data, all presented values that were greater than .90.
Similarly, the MAVA reported values collapsed across subgroups that exceeded .90; however, it differed in that values were derived from three examiners rather than pairs. The EOWPVT and ROWPVT also collapsed age groups, but with some coefficients that fell below .90: values ranged from .73 to .91 for the EOWPVT and from .85 to .97 for the ROWPVT. The MCDI-III provided a description of its test-retest reliability, which suggested that its normative subgroups had been collapsed and that not all values met the .90 standard. Specifically, the manual stated that for the Words and Gestures portion of the measure, vocabulary comprehension correlations were “in the upper .80s” except for the 12-month-old group, for which the correlation was .61 (p. 101). Similarly, correlations for vocabulary production were reportedly “low” in the 8-10 month group, and were “in the mid-.80s” for later months. For CDI: Words and Sentences, test-retest correlations were reported “above .90” at each age (p. 101).

SEM. Of the 10 reviewed tests, five passed criterion for the presence of SEM for each normed subgroup: the Boehm-3, EOWPVT, EVT-2, PPVT-4, and ROWPVT. Five tests failed, either because they provided no SEM data at all or because they reported it in a way that prevented a meaningful comparison to our criterion. With regard to the former, the MAVA did not present SEM data in the test manual, although according to the manual (p. 28) it includes software that provides 90% confidence intervals for each administration of the test. Although such output is useful for interpreting an individual score, an explicit list of SEM values by normed subgroup is needed to make a priori decisions about a test’s use. Similarly, the MCDI-III manual failed to report SEM values, although it did provide standard deviations for each normative group and an explanation of how SEM is calculated.
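
To illustrate why subgroup-level SEM values matter for score interpretation, the following is a minimal sketch of the classical test theory relationship SEM = SD × sqrt(1 − r) and of the confidence interval an examiner could build from it. The standard-score scale (SD of 15), the reliability of .90, the observed score of 85, and the function names are illustrative assumptions only; none of these values is taken from a reviewed test manual.

```python
import math


def standard_error_of_measurement(sd: float, reliability: float) -> float:
    """Classical test theory: SEM = SD * sqrt(1 - reliability)."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must be between 0 and 1")
    return sd * math.sqrt(1.0 - reliability)


def confidence_interval(score: float, sem: float, z: float = 1.645):
    """Symmetric interval around an observed score.

    z = 1.645 corresponds to a two-sided 90% interval under a normal model.
    """
    half_width = z * sem
    return (score - half_width, score + half_width)


if __name__ == "__main__":
    # Hypothetical subgroup: standard-score SD of 15, reliability of .90.
    sem = standard_error_of_measurement(sd=15.0, reliability=0.90)
    low, high = confidence_interval(score=85.0, sem=sem)
    print(f"SEM = {sem:.2f}; 90% CI = {low:.1f} to {high:.1f}")
```

Because the interval width depends directly on the subgroup's reliability, a manual that reports only collapsed reliability values (or no SEM at all) leaves the examiner unable to carry out this calculation for the age group of the child being tested.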
The CREVT-2 did not report data for each six-month normative group and instead provided SEM data per year. The WORD-2:A and WORD-2:E reported SEM in terms of test-retest reliability rather than internal consistency, thereby making the values difficult to compare.

Inter-examiner. Only one test, the MAVA, met the .90 criterion for inter-examiner reliability. Of the remaining measures, four (the Boehm-3, EVT-2, PPVT-4, and MCDI-III) did not report inter-examiner reliability at all. The CREVT-2 failed to provide values for each normative group, but for the groups reported (Forms A and B for each subtest) all values were greater than .90. The EOWPVT and ROWPVT provided inter-examiner data, though normative groups were collapsed, and it was unclear from the manual what aspects of inter-examiner reliability the reported values represent; the reliability of scoring, response evaluation, and administration between examiners were all calculated in some way. The WORD-2:A and WORD-2:E reported inter-examiner reliability as Percent Identical and Percent Different comparisons, and thus their data could not be compared with this study’s criterion. However, the Percent Identical comparisons were high, ranging from 96.4% to 99.8% for the WORD-2:A and from 96.3% to 99.6% for the WORD-2:E.

The results of the review of all reliability evidence are available in columns five through eight of Table 2. In sum, none of the 10 tests included in this study fully met all criteria for reliability. However, four tests met two reliability criteria (internal consistency and SEM): the EOWPVT, ROWPVT, PPVT-4, and EVT-2. It is worth noting that these four tests actually represent two pairs of tests that were developed together: the EOWPVT and ROWPVT, and the EVT-2 and PPVT-4.

Validity. Reliability provides a necessary but not sufficient indication of a test’s validity.
Consequently, additional indices of test validity are required, although clear criteria have not been established. Accordingly, tests here were reviewed for evidence of developmental trends, group differences, and correlation with similar tests, with the emphasis that data must be present to allow test users to make their own decisions about the adequacy of evidence for individual purposes. For tests to meet criterion, evidence of each of these forms of validity simply had to be provided in the test manuals, with the following additional specifications. For developmental trends, there had to be a discernible increase between the raw scores of each age group, no matter how small. For group differences, tests had to present data on children with a language impairment compared to typical peers or normative data. For correlations with similar tests, evidence of a moderate to large correlation (>0.3; Cohen, 1988) with at least one other standardized vocabulary measure or vocabulary subtest of a global language measure was required. Finally, sensitivity and specificity were taken into account. Following the criteria set forth by Hargrove (2006), 80% sensitivity and 90% specificity were required of the measures in this study. The results of the review of all validity evidence are available in columns nine through twelve of Table 2.

Developmental Trends. When looking at the mean scores across age groups, all but one of the tests reviewed demonstrated an increase in raw scores across age groups. However, only the EVT-2, PPVT-4, EOWPVT, ROWPVT, and CREVT-2 explicitly discussed developmental trends as a form of validity evidence. The CREVT-2 failed to meet criteria because scores stayed the same between ages 15 and 16 years for the Receptive portion of both Forms A and B, between 13 and 14 years for the Expressive portion of Form A, and between 11 and 12 years and between 15 and 16 years for the Expressive portion of Form B.

Test Comparison.
Although all 10 tests presented evidence of correlation with other tests purported to measure similar abilities, only 9 met the specified criteria. Specifically, the Boehm-3 failed to meet this criterion because it was only compared to an earlier version of the Boehm, which did not provide external validation. Although a large variety of other tests were reported for this form of validity, including measures of IQ, literacy, and academic achievement, all reviewed tests other than the Boehm-3 included at least one other language-based measure, such as other measures included in this study or global language measures (e.g., CELF-4, PLS-4). However, the types of measures to which the tests were compared, as well as the strengths of their correlations, varied widely.

Group Comparisons. Although the tests reviewed provided evidence of group comparisons on a wide variety of populations, including racial and SES comparisons, the current study required at least one mean comparison between a typical language group and a language-impaired group. Five of the tests reviewed passed this criterion: the CREVT-2, EOWPVT, EVT-2, PPVT-4, and ROWPVT. The Boehm-3 failed because it did not provide any evidence for group comparison. The MAVA failed because, although it discussed a field study with students receiving special education services, it did not present the values of the comparison or specify whether the special education group was language-impaired. The MCDI-III failed because it provided the results of comparisons of groups of differing maternal education and birth order, but not of a language-impaired group. Similarly, the WORD-2:A and WORD-2:E only provided evidence of ethnic and racial differences, and thus failed to meet our criterion as well.

Sensitivity/Specificity.
Although it may well be considered the most important information to have when determining the validity of a test, sensitivity and specificity evidence was only presented by one test, the MAVA, with the remaining 9 tests failing to pass the specified criterion. The present study follows the criteria of 80% sensitivity and 90% specificity set forth by Hargrove (2006). The MAVA presented extremely high sensitivity and specificity for both -1 SD and -1.5 SD cut-offs for both the Receptive and Expressive subtests, passing these criteria. For the Receptive portion, sensitivity values were 97% and 100% for the -1 SD and -1.5 SD cut-offs, respectively, and specificity values were 100% and 85%. Expressive values for sensitivity and specificity were all 100% except for sensitivity at the -1.5 SD cut-off, which was 83%.

To summarize, none of the 10 tests analyzed in this study passed all validity criteria. However, the EVT-2, PPVT-4, EOWPVT, ROWPVT, and MAVA did emerge as the strongest measures in the realm of validity evidence, passing three of the four validity criteria each.

Discussion

In this review, 10 commonly used standardized vocabulary tests were evaluated on the basis of their standardization sample, reliability, and validity evidence. In terms of the standardization sample, most tests came reasonably close to meeting criteria. Specifically, all ten tests passed in terms of representativeness and recency, suggesting that current test developers are, at the very least, attempting to keep their standardization samples in proportion with the current population. Six tests failed to meet the criterion of at least 100 individuals per normative subgroup; however, most failed at only one or two subgroups and were usually very close to having an adequate number. Evidence of reliability and validity was less encouraging.
Specifically, none of the ten tests passed all reliability criteria, though four (the EOWPVT, EVT-2, PPVT-4, and ROWPVT) passed two of the four criteria. Evidence of test-retest and inter-examiner reliability were particular areas of need. With regard to validity, five tests met three of the four designated criteria: the EOWPVT, EVT-2, MAVA, PPVT-4, and ROWPVT. Only the MAVA reported sensitivity and specificity data, which is arguably one of the most informative pieces of validity evidence a test can provide, at least for the purpose of diagnostic accuracy. Given these results, the remainder of the discussion will highlight limitations of our review, as well as implications of our findings, both for clinical practice and for test development.

Limitations in Our Criteria

Although this evaluation did discriminate between tests, there are inherent limitations in the criteria used. First, cut-off values, such as those employed for reliability criteria, inherently create an arbitrary dichotomy out of a continuous variable. For example, there is a negligible difference between a reliability value of .89 and one of .90. However, the dichotomous pass/fail distinction was consistent with prior literature (Andersson, 2005; DeThorne & Schaefer, 2004; McCauley & Swisher, 1984) and was considered a useful way to simplify a large amount of complex data. That said, we incorporated information within the text regarding how far values fell from the target value, so that readers could make better informed judgments for their individual purposes. A second limitation that should be mentioned here relates to the criteria for validity, which were qualitative rather than quantitative in nature. In other words, the criteria specified what information should be provided, but were less specific about what the data should look like.
For example, data regarding group differences were required without specification of how large a difference between children with and without language impairment would be an adequate indication of validity. Though limited, our review approach was generally consistent with prior standards for validity evidence (Andersson, 2005; DeThorne & Schaefer, 2004; McCauley & Swisher, 1984), as specifics for validity are largely contingent on the purpose of the assessment and the characteristics of the child being considered.

Recommended Measures

Although the selection of measures should always be determined on an individual basis, of the vocabulary tests reviewed, the PPVT-4, EVT-2, ROWPVT, EOWPVT, and MAVA provided the strongest psychometric evidence overall. It is worth noting that the PPVT-4 and EVT-2 are receptive and expressive measures based on the same normative sample, as are the ROWPVT and EOWPVT. Consequently, the five recommended measures only represent three sets of normative data. It should also be noted that the MAVA, while passing fewer total criteria than the aforementioned measures, did present extremely high sensitivity and specificity data, suggesting that this test is also a relatively well-developed one for diagnostic purposes. Measures meeting the lowest number of criteria across all three areas of the standardization sample, reliability, and validity were the MCDI-III and CREVT-2. As the EOWPVT, EVT-2, MAVA, PPVT-4, and ROWPVT meet the highest number of criteria, these measures are recommended. This list includes two expressive vocabulary measures, the EOWPVT and EVT-2; two receptive measures, the ROWPVT and PPVT-4; and one test that has both an expressive and a receptive component, the MAVA.

Suggestions for Test Development

The results of this study suggest several areas in which standardized measures need improvement.
Based on the criteria used in this review, it is clear that stronger reliability values (>.90), as well as more consistent methods of measuring and reporting reliability data, particularly for test-retest and inter-examiner reliability, are common areas in need of improvement. Another area of significant deficit is that of sensitivity and specificity, as these are important and useful measures of how well a test can discriminate between individuals with typical abilities and those who are language impaired.

Recommended Best Practices in Vocabulary Assessment

Though tests were evaluated in this study for general purposes, in truth, the appropriateness of any measure must be judged on a case-by-case basis. Consequently, the strength of individual tests will vary with age, specific form of impairment, and the purpose of assessment. For example, one test may be generally strong, but include an unacceptably small number of individuals in its 3;0-3;11 normative group. Thus, this measure may not be the best option to use if testing a three-year-old child. Specifically, the EOWPVT and ROWPVT, normed using the same standardization sample, only included 60 individuals in the two-year-old normative subgroup; therefore, although these are otherwise strong measures, they may be less appropriate for use with a two-year-old child. Additionally, it is important to remember that even well-constructed tests will suffer from limitations inherent to most, if not all, standardized measures. Consequently, assessment, particularly for high-stakes purposes such as educational placement, should never be dependent on a single form of assessment. Best practice is contingent on integrating multiple forms of assessment, incorporating both parent report and observational measures as well (Watkins & DeThorne, 2000).

Summary

No test reviewed in this study met all criteria.
The aforementioned weaknesses of the reviewed measures, coupled with the limitations inherent in all standardized tests, highlight the fact that, while standardized tests are legitimate and useful forms of assessment, no one form of assessment can be relied upon solely. Even a ‘perfectly’ developed test only measures performance on one day, under specific circumstances, and may not accurately represent an individual’s strengths and weaknesses. It is important that examiners are aware of a test’s psychometric properties and know how to identify its weaknesses in order to properly select an assessment measure.

References

Andersson, L. (2005). Determining the adequacy of tests of children’s language. Communication Disorders Quarterly, 26(4), 207-225.
Boehm, A. E. (2000). Boehm Test of Basic Concepts, Third Edition. San Antonio, TX: The Psychological Corporation.
Bowers et al. (2005). The WORD Test-2, Adolescent. East Moline, IL: LinguiSystems, Inc.
Bowers et al. (2004). The WORD Test-2, Elementary. East Moline, IL: LinguiSystems, Inc.
Brownell, R. (2000). Expressive One-Word Picture Vocabulary Test. Novato, CA: Academic Therapy Publications.
Brownell, R. (2000). Receptive One-Word Picture Vocabulary Test. Novato, CA: Academic Therapy Publications.
Catts, H. W., Fey, M., & Tomblin, B. Language basis of reading disabilities. Presented at the Fourth Annual Meeting of the Society for the Scientific Study of Reading, Chicago, IL.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Erlbaum.
Dale, P. S. (1991). The validity of a parent report measure of vocabulary and syntax at 24 months. Journal of Speech and Hearing Research, 34, 565-571.
DeThorne, L. S., & Schaefer, B. A. (2004). A guide to child nonverbal IQ measures. American Journal of Speech-Language Pathology, 13, 275-290.
Dinnebeil, L. A., & Rule, S. (1994). Congruence between parents’ and professionals’ judgments about the development of young children with disabilities: A review of the literature. Topics in Early Childhood Special Education, 14, 1-25.
Dunn, L. M., & Dunn, D. M. (2007). Peabody Picture Vocabulary Test, Fourth Edition. San Antonio, TX: Pearson.
Eickhoff, J., Betz, S. K., & Ristow, J. (2010, June). Clinical procedures used by speech language pathologists to diagnose SLI. Poster session presented at the Symposium on Research in Child Language Disorders, Madison, WI.
Fenson, L., Marchman, V. A., Thal, D. J., Dale, P. S., Reznick, J. S., & Bates, E. (2007). MacArthur-Bates Communicative Development Inventory, Third Edition. Baltimore, MD: Paul H. Brookes Publishing Co.
Fleege, P. O., Charlesworth, R., Burts, D. C., & Hart, C. H. (1992). Stress begins in kindergarten: A look at behavior during standardized testing. Journal of Research in Childhood Education, 7, 20-26.
Flynn, J. R. (1999). Searching for justice: The discovery of IQ gains over time. American Psychologist, 54, 5-20.
Gertner, B. L., Rice, M. L., & Hadley, P. A. (1994). Influence of communicative competence on peer preferences in a preschool classroom. Journal of Speech and Hearing Research, 37, 913-923.
Gray, S., Plante, E., Vance, R., & Henrichsen, M. (1999). The diagnostic accuracy of four vocabulary tests administered to preschool-age children. Language, Speech, and Hearing Services in Schools, 30, 196-206.
Hammill, D. D., & Newcomer, P. L. (2008). Test of Language Development - Intermediate, Fourth Edition. San Antonio, TX: Pearson.
Hargrove, P. (2006). EBP tutorial #10: EBP metrics for assessment. Language Learning and Education, 23-24.
Lombardino, L. J., Riccio, C. A., Hynd, G. W., & Pinheiro, S. B. (1997). Linguistic deficits in children with reading disabilities. American Journal of Speech-Language Pathology, 6, 71-78.
McCauley, R. J., & Strand, E. A. (2008). A review of standardized tests of nonverbal oral and speech motor performance in children. American Journal of Speech-Language Pathology, 17, 81-91.
McCauley, R. J., & Swisher, L. (1984). Psychometric review of language and articulation tests for preschool children. Journal of Speech and Hearing Disorders, 49, 34-42.
Merrell, A. W., & Plante, E. (1997). Norm-referenced test interpretation in the diagnostic process. Language, Speech, and Hearing Services in Schools, 28, 50-58.
Montgomery, J. K. (2008). Montgomery Assessment of Vocabulary Acquisition. Greenville, SC: Super Duper Publications.
Pena, E. D., Spaulding, T. J., & Plante, E. (2006). The composition of normative groups and diagnostic decision making: Shooting ourselves in the foot. American Journal of Speech-Language Pathology, 15, 247-254.
Plante, E., & Vance, R. (1994). Selection of preschool language tests: A data-based approach. Language, Speech, and Hearing Services in Schools, 25, 15-24.
Plante, E., & Vance, R. (1995). Diagnostic accuracy of two tests of preschool language. American Journal of Speech-Language Pathology, 4(2), 70-76.
Semel, E., Wiig, E. H., & Secord, W. A. (2003). Clinical Evaluation of Language Fundamentals, Fourth Edition. San Antonio, TX: Pearson.
Spaulding, T. J., Plante, E., & Farinella, K. A. (2006). Eligibility criteria for language impairment: Is the low end of normal always appropriate? Language, Speech, and Hearing Services in Schools, 37, 61-72.
Speltz, M. L., DeKlyen, M., Calderon, R., Greenberg, M. T., & Fisher, P. A. (1999). Neuropsychological characteristics and test behavior in boys with early onset conduct problems. Journal of Abnormal Psychology, 108, 315-325.
Wallace, G., & Hammill, D. D. (2002). Comprehensive Receptive and Expressive Vocabulary Test, Second Edition. Austin, TX: PRO-ED.
Watkins, R. V., & DeThorne, L. S. (2000). Assessing children’s vocabulary skills: From word knowledge to word-learning potential.
Seminars in Speech and Language, 21(3), 235-245.
Wiig, E. H., & Secord, W. (1992). Test of Word Knowledge. San Antonio, TX: The Psychological Corporation.
Williams, K. T. (2000). Expressive Vocabulary Test, Second Edition. San Antonio, TX: Pearson.
Zimmerman, I. L., Steiner, V. G., & Pond, R. E. (2002). Preschool Language Scale, Fourth Edition. San Antonio, TX: Pearson.

Table 1
Summary of Childhood Vocabulary Measures

Test | Age Range | Testing Time | Subtests | Picture Plate Description | Price
Boehm Test of Basic Concepts (Boehm-3) | Kindergarten to Second Grade | 1 session: 45 minutes; 2 sessions: 30 minutes each | | Full color drawings | $81
Comprehensive Receptive and Expressive Vocabulary Test (CREVT-2) | Expressive: 4;0-89;11; Receptive: 5;0-89;11 | Both subtests: 20-30 minutes; one subtest: 10-15 minutes | Expressive & Receptive subtests | Color photographs, six pictures per plate | $279
Expressive One-Word Picture Vocabulary Test (EOWPVT) | 2;0-18;11 | 10-15 minutes | | Full color drawings, one picture per plate | $175 for Fourth Edition
Expressive Vocabulary Test (EVT-2) | 2;6-81+ | 10-20 minutes | | Full color drawings, one picture per plate | $414 for Forms A & B, $224 for one form
MacArthur-Bates Communicative Development Inventory (MCDI-III) | CDI: Words and Gestures: 8-18 mos; CDI: Words and Sentences: 16-30 mos | N/A | | N/A | $121.95
Montgomery Assessment of Vocabulary Acquisition (MAVA) | 3;0-12;11 | 30-40 minutes for both tests | Receptive & Expressive tests | Full color drawings; Receptive: four pictures per plate; Expressive: one picture per plate | $199
Peabody Picture Vocabulary Test (PPVT-4) | 2;6-81+ | 10-15 minutes | | Full color drawings, one picture per plate | $414 for Forms A & B, $224 for one form
Receptive One-Word Picture Vocabulary Test (ROWPVT) | 2;0-18;11 | 10-15 minutes | | Full color line drawings; four pictures per plate | $175 for Fourth Edition
The WORD Test-2: Adolescent (WORD-2:A) | 12;0-17;11 | 30 minutes | Tasks: Associations, Synonyms, Semantic Absurdities, Antonyms, Definitions, Flexible Word Use | N/A | $160
The WORD Test-2: Elementary (WORD-2:E) | 6;0-11;11 | 30 minutes | Tasks: Associations, Synonyms, Semantic Absurdities, Antonyms, Definitions, Flexible Word Use | N/A | $160

Table 2
Evaluation of Each Measure Based on Specific Psychometric Criteria

Column groups: Normative Sample (Sizeable, Representative, Recent); Reliability (Internal, Test-retest, SEM, Inter-examiner); Validity (Developmental, Test Comparison, Group Comparisons, Sensitivity & Specificity)

Vocabulary Test | Sizeable | Representative | Recent | Internal | Test-retest | SEM | Inter-examiner | Developmental | Test Comparison | Group Comparisons | Sensitivity & Specificity
Boehm Test of Basic Concepts, Third Edition (Boehm-3)* | + p.87 | + p.87-91 | + p.86 | - p.100-101 | - p.101-102 | + p.101 | 0 | + p.95 | + p.106-109 | 0 | 0
Comprehensive Receptive and Expressive Vocabulary Test (CREVT-2) | p.38 + p.36 + p.43-45 p.51-52 p.45 p.55 p.71-73 + p.70, 78-79 + p.74 0
Expressive One-Word Picture Vocabulary Test (EOWPVT) | - p.58 | + p.57 | + p.53 | + p.63 | - p.63, 65 | + p.67 | - p.65-66 | + p.61, 73 | + p.71, 73-77 | + p.76-78 | 0
Expressive Vocabulary Test (EVT-2)† | + p.50 | + p.49-57 | + p.38 | + p.65, 66 | - p.66-68 | + p.65-66, 68-69 | 0 | + p.58, 60, 69 | + p.70-75 | + p.74-79 | 0
MacArthur-Bates Communicative Development Inventory, Third Edition (MCDI-III) | - p.52-53 | + p.53 | + p.52-53 | - p.100-101 | - p.101 | - p.101-102 | 0 | + p.55-84 | + p.106-107 | - p.85-95 | 0
Montgomery Assessment of Vocabulary Acquisition (MAVA) | p.53, 75 + p.53-59, 74-81 + + p.65-66, 86-87 p.69, 89-90 0 p.69, 90 + + p.63, 84-85 p.67, 87-88 p.62, 83-84
Peabody Picture Vocabulary Test (PPVT-4)† | + p.39 | + p.39-44 | + p.32 | + p.53-54 | - p.55-57 | + p.54 | 0 | + p.50-51, 58 | + p.58-63, 87-88 | + p.63-68 | 0
Receptive One-Word Picture Vocabulary Test (ROWPVT) | - p.52 | + p.50-51 | + p.46 | + p.55-56 | - p.56, 58 | + p.59 | - p.56-57 | + p.54, 65 | + p.62-68, 70 | + p.69-70 | 0
The WORD Test-2: Adolescent (WORD-2:A) | + p.55, 63 + p.55 + p.54 p.71 p.70 p.59, 72 + p.63 0 p.55, 65 + p.55 + p.54 p.80 p.79 p.61, 81 + p.65 0 + p.58-59, 80-82 + p.59, 95-97 0
The WORD Test-2: Elementary (WORD-2:E) | p.57-58, 70 p.57-58, 70 + p.67-68, 88-89 0

Key: + = specified criteria met; - = specified criteria not met; 0 = no evidence provided in the test manual
* Spanish version also available; available in Forms E & F
† Available in Forms A & B