A Psychometric Analysis of Childhood Vocabulary Tests
By
Ellen L. Bogue
Honors Thesis
Submitted to the Department of Speech and Hearing Science
and the College of Applied Health Sciences
in partial fulfillment of the requirements for James Scholar distinction.
May, 2011
Research Mentor: Dr. Laura S. DeThorne
Literature Review
The present study provides a review of the psychometric properties of 10 commonly used
child vocabulary measures. Standardized tests, defined as measures which provide normative
data through use of standardized administration materials and procedures, are commonly used by
both clinicians and investigators within speech-language pathology. Standardized tests are
employed across ages, settings, and disorders for a variety of purposes. Within speech-language
pathology, normed measures are often used to screen for deficits, identify specific areas of
strengths and weaknesses, plan intervention, and monitor language progress (Plante & Vance,
1995; Merrell & Plante, 1997). A child’s performance on a standardized test will have different
implications depending on the purpose for which the test is being used, as DeThorne & Schaefer
(2004) note in their discussion of high- versus low-stakes testing. A high-stakes situation is one
in which the outcome will determine diagnosis, educational placement, or treatment eligibility,
while screening or research studies would be considered low-stakes testing environments, at least
for the individual being tested. Because standardized tests are commonly used for a variety of
high-stakes purposes, it is important to be able to understand a test’s effectiveness for its
intended purpose. Understanding the effectiveness of standardized measures is contingent on
their psychometric properties. Three aspects of test psychometrics are considered here: those
related to the constituency of the normative sample, evidence of reliability, and evidence of
validity. Each of these areas will be reviewed, followed by a general discussion of language
assessment procedures, with specific focus on vocabulary in particular.
Psychometric Properties of Test Construction
Standardization Sample
The first area of psychometric analysis relates to the quality of the normative sample,
which must be adequate in size, representativeness, and recency. In order for a standardization
sample to be representative, it must be relatively large in size in order to encompass “the full
range of the distribution” (Andersson, 2005). A common sample size criterion is 100 or more
individuals for each normative group in the sample, which is most often divided by age
(Andersson, 2005; DeThorne & Schaefer, 2004; McCauley & Swisher, 1984). For example, if a
test provides normative data in six-month increments for children between the ages of 2 and 4 years,
then there should be at least 100 children in each of the following age groups: 2;0-2;5, 2;6-2;11,
3;0-3;5, and 3;6-3;11.
A representative sample is also one that includes individuals who are characteristic of the
population for whom a test has been developed, particularly in regard to cultural language
variation. Current guidelines for test development recommend that the standardization sample
contain groups that are in proportion with the overall population with regard to race/ethnicity,
geographic region, parent education level/socioeconomic status, and gender. With regard to
including individuals with disabilities in the normative sample, professional views are mixed.
Some suggest that children with disabilities should be included in the sample so that the full
range of language abilities is accurately represented in the sample (Andersson, 2005; DeThorne
& Schaefer, 2004). This position argues that failing to include children with language
disabilities in the normative sample would serve to inflate the normative data and potentially
increase false negatives in the assessment process. However, there is also indication that
including children with language disabilities in normative samples decreases test sensitivity
(Pena, Spaulding, & Plante, 2006). In sum, the disagreement surrounding the inclusion of
individuals with disabilities within test construction highlights the frequent trade-offs inherent in
maximizing sensitivity and specificity. Because of this trade-off, we have decided not to require
the inclusion of children with language disabilities as an explicit criterion, but instead focus on
the need to include key information on sensitivity and specificity as validity evidence.
The final aspect of the normative sample to be addressed here is recency. Because
characteristics of a population can change over time, it is also important for the normative
sample to be fairly recent – a test will not be effective if a child is being measured against an
outdated sample. A prime example from the realm of cognitive testing is the Flynn effect, which
is the tendency of IQ scores to increase by three points per decade (Flynn, 1999). We could
argue that vocabulary is particularly sensitive to changes over time, as new words are coined and
meanings easily change within the span of a generation. Likewise, the objects often portrayed in
vocabulary test picture plates, such as telephones and computers, change in expected
appearance.
Reliability
Another important area of a test’s psychometric evaluation is reliability, a
measure of a test’s consistency across examiners (interrater reliability), across test items (internal
consistency), and over time (test-retest reliability). Interrater reliability measures the consistency
with which separate administrators score a test, ensuring that scores do not vary considerably
between different raters. Internal consistency compares a child’s scores on one subset of test
items (e.g., odd-numbered items) with another (e.g., even-numbered items), thus measuring the
consistency of the construct being evaluated across items. Test-retest reliability reflects the
correlation between an individual’s scores on the same test administered after a certain time
period. This evaluates how reliable a child’s scores would be on subsequent administrations of
the same test. Across types of reliability, a common criterion is a coefficient of greater than or
equal to .90 within each normative group (Andersson, 2005; DeThorne & Schaefer, 2004;
McCauley & Swisher, 1984). Related to reliability is the concept of standard error of measure
(SEM), which is derived from internal reliability and allows the calculation of a confidence
interval (CI) for an individual child’s score based on the inherent error of test construction and
administration (DeThorne & Schaefer, 2004). Confidence intervals typically correspond to
standard deviations of the normal curve forming either a 68% or 95% CI. For example, if a child
receives a standard score of 94 and the SEM is ±4 standardized points, then the child’s 68% CI
would be 90 to 98. The statistical interpretation of such a confidence interval is that if the child
could be administered the same test on 100 occasions, the resulting intervals would contain the
child’s true score approximately 68% of the time. A smaller SEM results in a tighter confidence interval, which corresponds
to greater confidence in the child’s score. For this reason it is important that the SEM be
reported in the test’s manual for each normative group.
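
To make the preceding example concrete, the following is a minimal sketch, in Python, of how SEM and a confidence interval follow from internal reliability. The split-half correlation, the standard-score SD of 15, and the score of 94 with an SEM of roughly 4 points are illustrative assumptions, not values from any reviewed test; the Spearman-Brown formula is one standard way to step a split-half (odd/even) correlation up to full-test reliability.

    import math

    def spearman_brown(r_half):
        # Step a split-half (e.g., odd vs. even items) correlation up to an
        # estimate of full-test internal reliability.
        return 2 * r_half / (1 + r_half)

    def standard_error_of_measure(sd, reliability):
        # SEM = SD * sqrt(1 - reliability)
        return sd * math.sqrt(1 - reliability)

    def confidence_interval(score, sem, z=1.0):
        # z = 1.0 gives a 68% CI; z = 1.96 gives a 95% CI.
        return (score - z * sem, score + z * sem)

    reliability = spearman_brown(0.87)                # hypothetical odd/even correlation -> ~.93
    sem = standard_error_of_measure(15, reliability)  # standard-score SD of 15 -> SEM of ~4
    print(confidence_interval(94, 4, z=1.0))          # (90.0, 98.0), as in the example above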
Validity
The third area of psychometric evaluation is validity. Validity is a measure of how well a
test assesses the construct it claims to test, which is the most important and relevant measure of a
test’s effectiveness. A test can be reliable without being valid, but cannot be valid without being
reliable. Like reliability, evidence for validity takes many forms, but unlike reliability, the
criteria are difficult to quantify. Forms of validity evidence include developmental trends,
correlation with similar tests, factor analyses, group comparisons, and predictive validity.
Regarding developmental trends, a skill like language development is expected to improve with
age. Consequently, raw scores on a language test are expected to increase with age. As such,
evidence of a positive association between age and raw scores provides a basic form of validity
evidence. While language raw scores should improve with age, there are a number of other
factors that also develop with age, so this particular form of validity is far from sufficient in
documenting the validity of a test’s construction (DeThorne & Schaefer, 2004). A second form
of support for validity evidence comes from a test’s correlation with other tests designed to
assess a similar construct. For example, a newly developed vocabulary test would be expected to
correlate highly with other commonly used vocabulary measures. This form of validity only has
strength if the test being used as a comparison is psychometrically strong – a high correlation
with a poorly constructed test only means that the new test is similarly flawed.
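
As a rough sketch of these two forms of validity evidence, the Python snippet below computes an age-to-raw-score correlation (a developmental trend) and a correlation with a second, established measure (convergent evidence). All ages and scores are fabricated for illustration.

    import numpy as np

    # Hypothetical data: ages in years, raw scores on a new vocabulary test,
    # and the same children's scores on an established comparison measure.
    ages = np.array([2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 5.5])
    new_test = np.array([18, 24, 31, 35, 42, 47, 51, 58])
    comparison = np.array([20, 27, 30, 38, 41, 49, 54, 56])

    trend_r = np.corrcoef(ages, new_test)[0, 1]            # should be strongly positive
    convergent_r = np.corrcoef(new_test, comparison)[0, 1]
    print(f"developmental trend r = {trend_r:.2f}")
    print(f"convergent validity r = {convergent_r:.2f}")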
Also significant to the evaluation of validity are factor analyses, group comparisons, and
predictive validity. Factor analyses determine how closely different items are related to each
other and are measuring the same skill (DeThorne & Schaefer, 2004). Because factor analysis is
most commonly applied to multidimensional assessments, it will not be reviewed here.
Particularly germane, however, to the validity of standardized vocabulary tests is the concept of
group comparisons. As the name implies, group comparisons involve administering the test to
relevant subgroups of the population. Relevant in this case would refer to children with
vocabulary deficits compared to typically developing peers. Because these two groups, by
definition, differ in their vocabulary abilities, a difference between these two groups that favored
the typically developing group would provide evidence of a language test’s validity. Despite the
strength of this approach, group differences can still mask a substantial amount of individual
variation. Said another way, subgroup distributions can overlap substantially even when means
differ. Ultimately it is the extent of overlap between groups that governs a test’s diagnostic
accuracy.
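
The simulation below, a sketch with assumed normal score distributions, illustrates this point: two groups whose means differ by a full standard deviation still overlap substantially, so a sizable share of the lower-scoring group falls within the other group's range. The means, SD, and sample sizes are illustrative assumptions only.

    import numpy as np

    rng = np.random.default_rng(0)
    # Hypothetical standard-score distributions (mean 100, SD 15 for typical
    # peers; mean 85 for a language-impaired group, i.e., one SD lower).
    typical = rng.normal(100, 15, 10_000)
    impaired = rng.normal(85, 15, 10_000)

    effect_size = (typical.mean() - impaired.mean()) / 15  # ~1.0 SD difference
    above_mean = (impaired > typical.mean()).mean()        # ~16% of impaired scores
    print(f"group difference: {effect_size:.2f} SD")
    print(f"impaired children scoring above the typical mean: {above_mean:.0%}")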
Related most directly to diagnostic accuracy is a test’s evidence of sensitivity and
specificity. Sensitivity measures how well a test identifies individuals who have a deficit, while
specificity measures how well the test classifies those without impairment as typically
developing. Since one of the most common uses of standardized tests is to determine whether or
not a child has a language impairment, it is very important for a test to be strong in both
sensitivity and specificity. Unfortunately, high sensitivity (i.e., identifying all children with true
language disability) often increases the likelihood of over-identifying typically developing
children as impaired (i.e., false positives), thereby leading to lower specificity. Similarly, high
specificity often increases the likelihood of missing children with true impairment (i.e., false
negatives), thereby leading to lower sensitivity. Both sensitivity and specificity are likely to vary
based on the cutoff criteria used to categorize a child as impaired. Consequently, a test would
ideally report sensitivity and specificity for each normative group based on the most commonly
employed cutoff criteria (i.e., -1.0 to -1.9 standard deviations below the mean; Eickhoff, Betz, &
Ristow, 2010). The test’s sensitivity and specificity would ideally also meet criteria for acceptable
values. Hargrove (2006) suggests 80% sensitivity and 90% specificity, while Plante and Vance
(1995) link acceptability to the purpose of testing. For diagnostic purposes, Plante and Vance
suggest that 90% accuracy is “good” and 80% is “fair;” presumably these values apply both to
sensitivity and specificity. For screening, Plante and Vance recommend 90-100% sensitivity,
80% specificity for a “good” rating, and 70% for “fair.”
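
A minimal sketch of how sensitivity and specificity follow from a cutoff, again with assumed score distributions, shows the trade-off directly: moving the cutoff toward the mean raises sensitivity at the cost of specificity. The cutoffs and distributions below are illustrative assumptions, not values from any reviewed measure.

    import numpy as np

    rng = np.random.default_rng(1)
    impaired = rng.normal(85, 15, 5_000)   # children with true impairment (hypothetical)
    typical = rng.normal(100, 15, 5_000)   # typically developing children (hypothetical)

    for sd_cutoff in (-1.0, -1.25, -1.5):
        cutoff = 100 + sd_cutoff * 15
        sensitivity = (impaired < cutoff).mean()   # impaired children correctly flagged
        specificity = (typical >= cutoff).mean()   # typical children correctly passed
        print(f"cutoff {sd_cutoff:+.2f} SD: "
              f"sensitivity {sensitivity:.0%}, specificity {specificity:.0%}")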
Unlike sensitivity and specificity values, which relate to present conditions, predictive
validity attempts to determine a test’s ability to predict an individual’s performance over time as
well as in related areas, such as success in school and reading ability. Although an important
form of validity evidence, such information is rarely provided in test manuals, perhaps due to the
required longitudinal nature of such data. For this reason, and because criteria for
adequate predictive validity have not been established, we have not focused on predictive
validity in our test review.
Strengths & Weaknesses
Standardized measures are useful in a number of ways but also demonstrate certain
weaknesses. Their prime strength lies in the fact that standardized tests provide both a
quantitative depiction of a child’s abilities as well as normative data against which an
individual’s score can be compared. Assuming that the test is psychometrically sound, having a
quantitative score makes it easier to compare a child with his or her peers and determine whether
or not the child’s performance can be considered within typical limits.
While many problems may be minimized by a well constructed test, there are also issues
that will arise regardless of how carefully a test has been developed. According to McCauley &
Swisher (1984), the results of a standardized test can only provide an estimation of a child’s
language skill, and therefore should not be relied upon as the sole means of assessment. Since
there is so much individual variability, it is important to separate a child’s scores from the child
himself, because there are many different factors involved in an individual child’s performance
on a standardized test. For example, test results may be influenced by other factors unique to the
child such as attention, frustration tolerance, and test anxiety (Fleege, Charlesworth, Burts, &
Hart, 1992; Speltz, DeKlyen, Calderon, Greenberg, & Fisher, 1999). Related to the concept that
vocabulary tests do not provide a pure reflection of vocabulary knowledge or potential, it is also
true that test scores may not accurately reflect everyday language functioning. Normed measures
only provide information about a child’s abilities on a single day with an examiner, most likely
in the context of a clinic – tests cannot necessarily predict how a child uses language outside the
clinic in more familiar situations. Watkins and DeThorne (2000) highlight the concern that tests
may not clearly indicate “functional language needs” and also note that tests do not provide
specific information regarding what appropriate vocabulary targets should be. Finally,
vocabulary tests are also limited in effectiveness with nonmainstream populations whose
dialectal differences may lead to poorer test performance without being indicative of impairment
(Watkins & DeThorne, 2000).
Parent Report Measures
Due in part to shortfalls on the part of standardized tests, parent report measures are an
important additional form of evidence in any assessment of child language. Parent report
measures are those that are intended to be completed by the caregiver of the child under
consideration. These measures often take the form of checklists or Likert scale items which are
given to parents to complete. Like other methods of assessment, there are both advantages and
disadvantages to parent report measures. In terms of advantages, parent report measures are
cost- and time-effective for the clinician and less stressful for the child than standardized
administration. According to Watkins & DeThorne (2000), such measures can also be useful in
identifying areas of strength and weakness and offer inherent social validity, meaning that they
will represent how well a child is functioning in his or her community. Parent report measures
are also valuable because parents are able to observe their child’s language skills in a variety of
settings and across time. While some question how accurate or reliable a parent’s report is,
Dinnebeil and Rule (1994) demonstrated the validity of parents’ estimations of their children’s
skills through a review of the literature concerning the congruence of parents’ and professionals’
judgments. Across the 23 studies they reviewed, the two showed a strong positive correlation,
with a mean correlation coefficient of .73. Despite
such evidence, parent report, like any single measure, is limited in its perspective. One
limitation of parent report is that normative data are not usually available, thereby making it
difficult to compare a child to his or her peer group. One exception in the realm of parent report
measures of vocabulary is the MacArthur Communicative Development Inventory (Fenson,
Marchman, Thal, Dale, Reznick, & Bates, 2007), which provides normative data in
accompaniment of its vocabulary checklist. Caregivers complete the checklist on a standard
form and then the examiner is able to compare those results to norms collected in comparable
circumstances. Given the normative data, paired with consistent scoring and administration
procedures, the MacArthur is included in the present review. In sum, the most valid assessment
results are likely to emerge from a combination of assessment methods, integrating results from
standardized testing, parent report, and behavioral observation (Watkins & DeThorne, 2000).
Significance of Vocabulary
One important and frequently tested component of child language is vocabulary.
Vocabulary measures are most commonly used to determine if intervention is necessary, to track
an individual’s progress, and to describe overall language functioning (Watkins & DeThorne,
2000). In addition to their clinical role, standardized measures of vocabulary are utilized in
research for similar purposes, as well as to characterize clinical populations of interest.
Vocabulary is also a common and important component of IQ assessments, and either poor or
exceptional vocabulary skills can have a large impact on IQ scores, which could in turn have an
effect on a child’s prognosis and educational placement. As an example of the extent to which
vocabulary and IQ are often confounded, the PPVT (test titles and their abbreviations are
provided in column 1 of Table 2), a measure evaluated in this review, was originally marketed as
an IQ measure.
Beyond individual assessment points, it is important to remember the significance of
vocabulary development in a child’s life. Vocabulary skills play a large role in overall language
ability and are associated with achievement in reading (Catts, Fey, & Tomblin, 1997;
Lombardino, et al., 1997) both of which will also have a significant impact on a child’s
performance in school. In addition, vocabulary proficiency is linked to other developmental
areas, such as social status in school, and it has been shown that a poor vocabulary is associated
with more negative perceptions by others; in a study by Gertner, Rice, and Hadley (1994), the
PPVT-R was found to be the best predictor of peer popularity among preschoolers. Given the
importance of vocabulary in children’s everyday lives, reliable and valid forms of assessment are
critical.
Previous Research
Previous research demonstrates that many standardized tests may fall short of
psychometric expectations. In McCauley and Swisher’s (1984) review of 30 preschool language
and articulation measures, three of which (the EOWPVT, PPVT, and PPVT-R) were tests of
vocabulary, the measures broadly failed to meet psychometric criteria. The 30 tests were evaluated on the basis of ten
criteria, and half of the criteria were met by fewer than six tests. Results of a review of 21 tests
of child language, two of which were vocabulary tests, published ten years later (Plante &
Vance, 1994) suggested that there was “little improvement in overall quality” of tests since
McCauley and Swisher’s study. The Plante and Vance review used the same ten criteria as
McCauley and Swisher’s study and similarly found that of the 21 tests reviewed, no test met
more than seven criteria.
Studies regarding the diagnostic accuracy of various global language measures have also
been performed, suggesting that while some tests might be fairly accurate in identifying an
impairment, others may not be as precise. For example, Spaulding, Plante, and Farinella (2006)
demonstrated that for a majority of the 43 tests in their study, ten of which (the Boehm-3,
Boehm-P3, CREVT-2, EOWPVT, EVT, PPVT-III, ROWPVT, Test of Word Knowledge,
WORD: A, and WORD: R) were standardized vocabulary measures, the scores of children with
a previously identified language impairment were not consistently at the low end of the
distribution. For four of the vocabulary measures involved in the study (The CREVT-2, EVT,
PPVT-III, and Boehm-3), mean differences between language-impaired and normative or control
groups were less than 1.5 standard deviations.
Similar to Spaulding et al., the results of a study of vocabulary tests conducted by Gray,
Plante, Vance, and Henrichsen (1999) suggested that none of the four tests included in the study
was a strong indicator of specific language impairment. In their study, the PPVT-III, ROWPVT,
EOWPVT, and EOWPVT-R (all of which are reviewed in the present study) were administered
to preschool-aged children with specific language impairment (SLI) and to preschool-aged
children with normal language. While the children with SLI did score lower than the children
with normal language, they still scored within the normal range. Although vocabulary tests have
been included in reviews of standardized language measures, we were not able to find a
comprehensive review focused on uni-dimensional vocabulary measures.
The aim of the present study is to review common child vocabulary measures on the basis
of specific psychometric criteria in terms of the standardization sample, reliability, and validity.
As vocabulary measures are so commonly used and as vocabulary is such a significant
component of language development, it is important that vocabulary tests are reliable and valid
evaluations. The ultimate goal of this review, then, is to aid clinicians and researchers in making
decisions regarding which measures to use for their specific purposes.
Methods/Results
Test Selection
Assessment tools were included in this review if they met three inclusionary criteria
derived from the literature on test development (e.g., McCauley & Swisher, 1984; Plante &
Vance, 1994; DeThorne & Schaefer, 2004). First, each test needed to be standardized in the
sense that it employed prescribed materials and procedures and provided normative data.
Second, the test needed to be a uni-dimensional test of vocabulary in children under 18 years of
age. Multi-dimensional tests that included a vocabulary subtest as part of a larger, more
comprehensive assessment were not included; examples include the Clinical Evaluation of Language
Fundamentals – Fourth Edition (Semel, Wiig, & Secord, 2003), the Preschool Language Scale –
Fourth Edition (Zimmerman, Steiner, & Pond, 2002), and the Test of Language Development –
Fourth Edition (Hammill & Newcomer, 2008). Finally, tests also had to have been developed or
revised within the past 20 years.
Based on these inclusionary criteria, relevant measures were first identified through test
inventories from the University of Illinois Applied Health Sciences Library and the Speech and
Language Clinic and through literature review via online databases (e.g., PsycInfo, ERIC,
PubMed) and ancestral searches. These search procedures led to 10 standardized vocabulary
measures, summarized in Table 1, which served as the focus of this review. The measures
summarized in Table 1 are all assessments of semantic knowledge, three targeting receptive
only, four expressive only, and three tapping both receptive and expressive knowledge. In terms
of required tasks, all but the MCDI-III and the WORD-2, versions A and E, include a picture-labeling component, for example, “Show me X” or “What is this?” In contrast, the WORD tests
involve associations, synonyms, semantic absurdities, antonyms, definitions, and flexible word
use. Unlike all of the other measures, the MCDI-III is a parent report measure that includes a
checklist of common early vocabulary. Caregivers are asked to fill in bubbles next to the words
their children say and/or understand. Although different in terms of administration procedures
from the other measures, the MCDI-III is included here because it is based on standardized
administration and includes normative data.
Review Process
Each test was evaluated on the basis of its psychometric properties, including the make-up of the standardization sample as well as evidence of reliability and validity, largely following
the criteria set forth by DeThorne & Schaefer (2004). The evaluation was based exclusively on
information provided in the test manuals, which were individually reviewed by the investigator
through multiple passes. The specification of evaluation criteria is summarized below according
to the three primary areas of standardization sample, reliability, and validity.
Standardization Sample.
The standardization sample was considered adequate based on three criteria taken from
DeThorne & Schaefer (2004): adequacy of size, comparison to census data, and recency. Each
individual criterion is highlighted below with a summary of performance. Information regarding
the normative sample of individual tests is summarized in columns 2-4 of Table 2.
Size.
First, in terms of sample size, there had to be at least 100 individuals in each normed
subgroup, meaning that each subgroup for which norms are provided (whether this be by age,
gender, grade level, etc.) had to include 100 children. Although 100 is a somewhat arbitrary
value, the rationale is that each normative group needs to be large enough to capture the inherent
variability associated with any trait. Of the 10 measures reviewed, six failed to meet the size
criterion. The EOWPVT and ROWPVT, normed using the same sample, failed to include 100
individuals for only one age group – age two – because only 60 children were included, and the
WORD-2:E missed the cut-off by only two individuals (n=98) for the 11;6-11;11 group. The
MAVA failed to meet the size criterion for three groups: 12;0-12;11 for the Receptive test
(n=78), and 11;0-11;11 (n=92) and 12;0-12;11 (n=84) for the Expressive test. The CREVT-2
failed to meet the size criterion because it did not present numbers for each six-month normative
age group, instead presenting data per year. Finally, the MCDI-III failed due to individual group
sizes that ranged from 29 to 79. In the case of the MCDI-III, it seems important to mention that
this measure presents norms separately for girls and boys, and uses six-month, rather than
twelve-month, intervals.
Census Data.
The second criterion, also concerned with representativeness, was that data from the
standardization sample had to be provided in conjunction with U.S. census data from within five
years of the test’s development in order to assist comparisons in terms of race/ethnicity,
geographic region, parent education level/socioeconomic status, and gender. Results are
summarized in column 3 of Table 2. This review did not evaluate how adequately the
standardization sample matched the census data due to limited precedent in this area, and to the
complexity in determining what constitutes adequate group representation. All 10 tests
presented their normative sample data in comparison to census data, with proportions falling
within 20% of the census proportions. However, it should be noted that since each test was required to be representative
with regard to U.S. census data from within five years of the test’s development, most of the
tests’ samples were not based on the most recent census data.
Recency.
The third and final criterion for the standardization sample was that it had been collected
within the last 15 years; see column 4 of Table 2. This criterion was important given the shift in
vocabulary knowledge over time, as well as the easily dated images often used to evaluate it.
Think of how the iconic image of ‘telephone’ has changed over the last twenty years, as well as
the new vocabulary associated with it (e.g., texting, IM, Twitter). All 10 measures had
standardization sample data collected within the past 15 years. However, though the current
version of the MCDI-III was published in 2007 and states that the standardization sample has
been updated since past editions (p. 52-53), no year is given for when the updated sample was
collected. Thus, since the current edition was published in 2007, we are inferring that the
sample of the 2007 edition meets the recency criterion.
To summarize results for evaluation of the tests’ normative samples, four of the 10 tests
fully met all three criteria: the Boehm-3, EVT-2, PPVT-4, and WORD-2:A.
Reliability.
Turning now from the standardization sample characteristics to statistical test
consistency, each test’s reliability was evaluated in terms of inter-examiner reliability, internal
consistency, test-retest reliability, and standard error of measurement. Based on prior review
criteria (DeThorne & Schaefer, 2004; McCauley & Swisher, 1984), inter-examiner reliability,
internal consistency, and test-retest reliability were considered acceptable if the correlation
coefficients were at or above .90 for each normative group. In contrast, there is no clear
precedent for what the standard error of measure should be, perhaps because it is in part derived
from internal consistency. Given that SEM is derived from reliability, an additional cut-off was
not set for this value; instead it was expected that test manuals provided SEM for each normed
subgroup so that examiners could calculate confidence intervals for resulting standard scores.
Internal Consistency.
The first aspect of reliability reviewed was internal consistency, a reflection of how
reliable a test is across items. Five of the 10 tests reviewed failed to meet the >.90 standard for
one of two reasons: either values fell below the .90 criterion or data were not presented
for each normed subgroup. The Boehm-3, WORD-2:A, and WORD-2:E fell into the first
category, all presenting internal consistency data for each normative group, but the values
reported were lower than .90 for at least one subgroup. For the WORD-2:A, no individual
subtest value was greater than .84, with the lowest value at .72. Similarly, the WORD-2:E had
no value greater than .84, with a range of .69-.80. The Boehm-3 values ranged from .80-.91.
The CREVT-2 failed both because it did not present data for each normative group (values were
presented per year) and because of low values – 76% (65/86) of the values were above .90, with
the lower coefficients ranging from .78 to .89 for our target age range of 18 years or younger.
The MCDI-III provided strong values of .95 through .96 across the different vocabulary scales
(i.e., Words and Gestures-Words Understood, Words and Gestures-Words Produced, and Words
and Sentences-Words Produced); however these values were collapsed across normative
subgroups.
Test-retest.
None of the tests reviewed met the >.90 criterion for test-retest reliability, although all of
the tests did present some test-retest data. Reasons for failing to meet criteria fell into two
categories: either failing to meet the .90 criterion across all subgroups, or failing to provide
relevant data for each subgroup. Falling into the first category, the WORD-2:E met criterion for
all but two normative groups; 10;6-10;11 and 11;0-11;5 had coefficients of .85 and .87
respectively. Similarly, the WORD-2:A met criterion for all groups except for 12;0-12;11 and
16;0-16;11 (.78 and .89 respectively). Values for the Boehm-3 ranged from .70-.89. Six
measures reported reliability coefficients based on collapsed subgroups, which can mask
substantial variability: the EOWPVT, ROWPVT, CREVT-2, EVT-2, PPVT-4, and MAVA.
Specifically, the CREVT-2, EVT-2, and PPVT-4, though they presented collapsed data, all
presented values that were greater than .90. Similarly, the MAVA reported values collapsed
across subgroups that exceeded .90; however it was different in that values were derived from
three examiners rather than pairs. The EOWPVT and ROWPVT also collapsed age groups, but
with coefficients that fell below .90: .73 to .91 for the EOWPVT and from .85 to .97 for the
ROWPVT. The MCDI-III provided a description of its test-retest reliability, which suggested
that its normative subgroups had been collapsed and that not all values met the .90 standard.
Specifically the manual stated that for the Words and Gestures portion of the measure,
vocabulary comprehension correlations were “in the upper .80s” except for the 12-month-old
group, for which the correlation was .61 (p.101). Similarly, correlations for vocabulary
production were reportedly “low” in the 8-10 month group, and were “in the mid-.80s” for later
months. For CDI: Words and Sentences, test-retest correlations were reported “above .90” at
each age (p. 101).
SEM.
Of the 10 reviewed tests, five passed criterion for the presence of SEM for each normed
subgroup: the Boehm-3, EOWPVT, EVT-2, PPVT-4, and ROWPVT. Five tests failed, either
due to a failure to provide SEM data at all or due to reporting it in a way that prevented a
meaningful comparison to our criterion. With regard to the former, the MAVA did not present
SEM data in the test manual, though according to the manual (p. 28), it does reportedly include
software that provides 90% confidence intervals for each administration of the test. Although
useful for interpretation of an individual score, an explicit list of SEM values by normed
subgroup is needed to make a priori decisions about a test’s use. Similarly, the MCDI-III manual
failed to report SEM values, although it did provide standard deviations for each
normative group and an explanation of how SEM is calculated. The CREVT-2 did not report
data for each six-month normative group, and instead provided SEM data per year. The WORD-2:A and WORD-2:E tests reported SEM in terms of test-retest reliability rather than internal
consistency, thereby making the values difficult to compare.
Inter-examiner.
Only one test, the MAVA, met the criterion of >.90 for inter-examiner reliability. Of the
remaining measures, four, the Boehm-3, EVT-2, PPVT-4, and MCDI-III, did not report
inter-examiner reliability at all. The CREVT-2 failed to provide values for each normative
group, but for the groups reported (Form A and B for each subtest) all values were greater than
.90. The EOWPVT and ROWPVT provided inter-examiner data, though normative groups were
collapsed, and it was unclear from the manual what aspects of inter-examiner reliability the
reported values represent; the reliability of scoring, response evaluation, and administration
between examiners were all calculated in some way. The WORD-2:A and WORD-2:E reported
inter-examiner reliability as Percent Identical and Percent Different comparisons, and thus their
data was not able to be compared with this study’s criterion. However, the Percent Identical
comparisons were high, ranging from 96.4-99.8% for the WORD-2:A and 96.3-99.6% for the
WORD-2:E.
The results related to the review of all reliability evidence are available in columns five
through eight in Table 2. In sum, none of the 10 tests included in this study fully met all criteria
for reliability. However, four tests met two reliability criteria (internal reliability and SEM):
EOWPVT, ROWPVT, PPVT-4, and EVT-2. It is worth noting that these four tests actually
represent two pairs of tests that were developed together, the EOWPVT and ROWPVT and the
EVT-2 and PPVT-4.
Validity.
Reliability provides a necessary but not sufficient indication of a test’s validity.
Consequently, additional indices of test validity are required, although clear criteria have not
been established. Accordingly, tests here were reviewed for evidence of developmental trends,
group differences, and correlation with similar tests, stressing that data must be present to allow
test users to make their own decisions about adequacy of evidence for individual purposes. In
order for tests to meet criterion, evidence of each of these forms of validity simply had to be
provided in the test manuals, with the following additional specifications: for developmental
trends, there had to be a discernible increase between the raw scores of each age group, no
matter how small. For group differences, tests had to present data on children with a language
impairment compared to typical peers or normative data. Last, for correlations with similar tests,
evidence of a moderate to large correlation (>0.3; Cohen, 1988) with at least one other
standardized vocabulary measure or vocabulary subtest of a global language measure was
required. Finally, sensitivity and specificity were taken into account. Following the criteria set
forth by Hargrove (2006), 80% sensitivity and 90% specificity were required of the measures in
this study. The results related to the review of all validity evidence are available in columns nine
through twelve of Table 2.
Developmental Trends.
When looking at the mean scores across age groups, all but one of the tests reviewed
demonstrated an increase in raw scores with age. However, only
the EVT-2, PPVT-4, EOWPVT, ROWPVT, and CREVT-2 explicitly discussed the
developmental trends as a form of validity evidence. The CREVT-2 failed to meet criteria
because scores stayed the same between ages 15 and 16 years for the Receptive portion of both
forms A and B, between ages 13 and 14 years for the Expressive portion of form A, and between
ages 11 and 12 and 15 and 16 years for the Expressive portion of form B.
Test Comparison.
Although all 10 tests presented evidence of correlation with other tests purported to
measure similar abilities, only 9 met the specified criteria. Specifically, the Boehm-3 failed to
meet this criterion because it was only compared to an earlier version of the Boehm, which did
not provide external validation. Although a large variety of other tests were reported for this
form of validity, including measures of IQ, literacy, and academic achievement, all reviewed
tests, other than the Boehm-3, included at least one other language-based measure, such as other
measures included in this study, as well as global language measures (e.g. CELF-4, PLS-4).
However, the types of measures to which the tests were compared as well as the strengths of
their correlations varied widely.
Group Comparisons.
Although the tests reviewed provided evidence of group comparisons on a wide variety
of populations, including racial and SES comparisons, the current study required at least one
mean comparison between a typical language group and a language-impaired group. Five of the
tests reviewed passed this criterion: the CREVT-2, EOWPVT, EVT-2, PPVT-4, and ROWPVT.
The Boehm-3 failed because it did not provide any evidence for group comparison, while the
MAVA failed because while it discussed a field study with students receiving special education
services, it did not present the values of the comparison or specify whether the special education
group was language-impaired. The MCDI-III failed because it provided the results of
comparisons of groups of differing maternal education and birth order, but not a language-impaired group. Similarly, the WORD-2:A and WORD-2:E only provided evidence of ethnic
and racial differences, and thus failed to meet our criterion as well.
Sensitivity/Specificity.
Although it may well be considered the most important information to have when
determining the validity of a test, sensitivity and specificity evidence was only presented by one
test, the MAVA, with the remaining 9 tests failing to pass the specified criterion. The present
study follows the criteria of 80% sensitivity and 90% specificity set forth by Hargrove (2006).
The MAVA presented extremely high sensitivity and specificity for both -1 and -1.5 SD cutoffs
for both the Receptive and Expressive subtests, passing these criteria. For the Receptive portion,
sensitivity values were 97% and 100% for -1 S.D. and -1.5 S.D. cut-offs, respectively, and
specificity was 100% and 85%. Expressive values for sensitivity and specificity were all 100%
except for sensitivity at the -1.5 S.D. cut-off, which was 83%.
To summarize, none of the 10 tests analyzed in this study passed all validity criteria.
However, the EVT-2, PPVT-4, EOWPVT, ROWPVT, and MAVA did emerge as the strongest
measures in the realm of validity evidence, passing three of the four validity criteria each.
Discussion
In this review, 10 commonly used standardized vocabulary tests were evaluated on the
basis of their standardization sample, reliability, and validity evidence. In terms of the
standardization sample, most tests came reasonably close to meeting criteria. Specifically, all ten
tests passed in terms of representativeness and recency, suggesting that current test developers
are, at the very least, attempting to have their standardization samples in proportion with the
current population. Six tests failed to meet the criteria of at least 100 individuals per normative
subgroup; however, most failed only at one or two subgroups and were usually very close to
having an adequate number. Evidence of reliability and validity were less encouraging.
Specifically, none of the ten tests passed all reliability criteria, though four, the EOWPVT, EVT-2, PPVT-4, and ROWPVT, passed two of the four criteria. Evidence of test-retest and inter-examiner reliability were particular areas of need. With regard to validity, five tests met three of
the four designated criteria: the EOWPVT, EVT-2, MAVA, PPVT-4, and ROWPVT. Only the
MAVA reported sensitivity and specificity data, which is arguably one of the most informative
pieces of validity evidence a test can provide, at least for the purpose of diagnostic accuracy.
Given these results, the remainder of the discussion will highlight limitations of our review, as
well as implications of our findings, both in terms of clinical practices as well as of test
development.
Limitations in Our Criteria
Although this evaluation did discriminate between tests, there are inherent limitations in
the criteria used. First, cutoff values, such as those employed for reliability criteria, inherently
create an arbitrary dichotomy out of a continuous variable. For example, there is a negligible
difference between a reliability value of .89 and .90. However, the dichotomous pass/fail
distinction was consistent with prior literature (Andersson, 2005; DeThorne & Schaefer, 2004;
McCauley & Swisher, 1984) and was considered a useful way to simplify a lot of complex data.
That said, we incorporated information within the text regarding how far values fell from the
target value, so that readers could make better informed judgments for their individual purposes.
A second limitation that should be mentioned here relates to the criteria for validity, which were
qualitative rather than quantitative in nature. In other words, the criteria specified what
information should be provided, but offered less specification of what the data should look like.
For example, data regarding group differences were required without specification of how much
of a group difference between children with and without language impairment was an adequate
indication of validity. Though limited, our review approach was generally consistent with prior
standards for validity evidence (Andersson, 2005; DeThorne & Schaefer, 2004; McCauley &
Swisher, 1984) as specifics for validity are largely contingent on the purpose of the assessment
and the characteristics of the child being considered.
Recommended Measures
Although the selection of measures should always be determined on a case-by-case
basis, of the vocabulary tests reviewed, the PPVT-4, EVT-2, ROWPVT, EOWPVT,
and MAVA provided the strongest psychometric evidence overall. It is worth noting that the
PPVT-4 and EVT-2 are receptive and expressive measures based on the same normative sample,
as are the ROWPVT and EOWPVT. Consequently, the five recommended measures only
represent three sets of normative data. It should also be noted that the MAVA, while passing
fewer total criteria than the aforementioned measures, did present extremely high sensitivity and
specificity data, suggesting that this test is also a relatively well-developed one for diagnostic
purposes. Measures meeting the lowest number of criteria across all three areas of the
standardization sample, reliability, and validity were the MCDI-III and CREVT-2.
As the EOWPVT, EVT-2, MAVA, PPVT-4, and ROWPVT meet the highest number of
criteria, these measures are recommended. This list includes two expressive vocabulary
measures, the EOWPVT and EVT-2, two receptive measures, the ROWPVT and PPVT-4, and
one test that has both an expressive and receptive component, the MAVA.
Suggestions for Test Development
The results of this study suggest several areas in which standardized measures need
improvement. Based on the criteria used in this review, it is clear that stronger reliability values
(>.90), as well more consistent methods of measuring and reporting reliability data, particularly
in test-retest and inter-examiner reliability, are common areas in need of improvement. Another
area of significant deficit is that of sensitivity and specificity, as these are important and useful
measures of how well a test can discriminate between individuals with typical abilities and those
who are language impaired.
Recommended Best Practices in Vocabulary Assessment
Though tests were evaluated in this study for general purposes, in truth, the
appropriateness of any measure must be judged on a case-by-case basis. Consequently, the strength
of individual tests will vary with age, specific form of impairment, and the purpose of
assessment. For example, one test may be generally strong, but include an unacceptably small
number of individuals in its 3;0-3;11 normative group. Thus, this measure may not be the best
option to use if testing a three-year-old child. Specifically, the EOWPVT and ROWPVT,
normed using the same standardization sample, only included 60 individuals in the two-year-old
normative subgroup, and therefore although these are otherwise strong measures, they may be
less appropriate for use with a two-year-old child. Additionally, it is important to remember that
even well-constructed tests will suffer from limitations inherent to most, if not all, standardized
measures. Consequently, assessment, particularly for high-stakes purposes such as educational
placement, should never be dependent on a single form of assessment. Best practice is
contingent on integrating multiple forms of assessment, incorporating both parent report and
observational measures as well (Watkins & DeThorne, 2000).
Summary
No test reviewed in this study met all criteria. The aforementioned
weaknesses of standardized measures, coupled with the weaknesses inherent in all such tests,
highlight the fact that, while standardized tests are legitimate and useful forms of assessment, no
one form of assessment can be relied upon solely. Even a ‘perfectly’ developed test only
measures performance on one day, under specific circumstances, and may not accurately
represent an individual’s strengths and weaknesses. It is important that examiners are
aware of a test’s psychometric properties and weaknesses in order to select an assessment
measure properly.
References
Andersson, L. (2005). Determining the adequacy of tests of children’s language. Communication
Disorders Quarterly 26(4), 207-225.
Boehm, A. E. (2000). Boehm Test of Basic Concepts, Third Edition. San Antonio, TX: The
Psychological Corporation.
Bowers et al. (2005). The WORD Test-2, Adolescent. East Moline, IL: LinguiSystems, Inc.
Bowers et al. (2004). The WORD Test-2, Elementary. East Moline, IL: LinguiSystems, Inc.
Brownell, R. (2000). Expressive One-Word Picture Vocabulary Test. Novato, CA: Academic
Therapy Publications.
Brownell, R. (2000). Receptive One-Word Picture Vocabulary Test. Novato, CA: Academic
Therapy Publications.
Catts, H. W., Fey, M., & Tomblin, B. (1997). Language basis of reading disabilities. Paper
presented at the Fourth Annual Meeting of the Society for the Scientific Study of Reading, Chicago, IL.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ:
Erlbaum.
Dale, P. S. (1991). The validity of a parent report measure of vocabulary and syntax at 24
months. Journal of Speech and Hearing Research 34, 565-571.
DeThorne, L. S. & Schaefer, B. A. (2004). A guide to child nonverbal IQ measures. American
Journal of Speech-Language Pathology 13, 275-290.
Dinnebeil, L.A., & Rule, S. (1994). Congruence between parents’ and professionals’ judgments
about the development of young children with disabilities: A review of the literature.
Topics in Early Childhood Special Education, 14, 1-25.
Dunn, L. M., & Dunn, D. M. (2007) Peabody Picture Vocabulary Test, Fourth Edition. San
Antonio, TX: Pearson.
Eickhoff, J., Betz, S. K., & Ristow, J. (2010, June). Clinical procedures used by speech
language pathologists to diagnose SLI. Poster session presented at the Symposium on
Research in Child Language Disorders, Madison, WI.
Fenson, L., Marchman, V. A., Thal, D. J., Dale, P. S., Reznick, J. S., & Bates, E. (2007).
MacArthur-Bates Communicative Development Inventory, Third Edition. Baltimore, MD:
Paul H. Brookes Publishing Co.
Fleege, P. O., Charlesworth, R., Burts, D. C., & Hart, C. H. (1992). Stress begins in
kindergarten: a look at behavior during standardized testing. Journal of Research in
Childhood Education, 7, 20–26.
Flynn, J. R. (1999). Searching for justice: The discovery of IQ gains over time. American
Psychologist, 54, 5-20.
Gertner, B. L., Rice, M. L., & Hadley, P. A. (1994). Influence of communicative competence on
peer preferences in a preschool classroom. Journal of Speech and Hearing Research, 37,
913-923.
Gray, S., Plante, E., Vance, R., & Henrichsen, M. (1999). The diagnostic accuracy of four
vocabulary tests administered to preschool-age children. Language, Speech, and Hearing
Services in Schools, 30, 196-206.
Hammill, D. D., & Newcomer, P. L. (2008). Test of language development – Intermediate
Fourth edition. San Antonio, TX: Pearson.
Hargrove, P. (2006). EBP tutorial #10: EBP metrics for assessment. Language Learning and
Education, 23-24.
Lombardino, L. J., Riccio, C. A., Hynd, G. W., & Pinheiro, S. B. (1997). Linguistic deficits in
children with reading disabilities. American Journal of Speech-Language Pathology, 6,
71-78.
McCauley, R. J. & Strand, E. A. (2008). A review of standardized tests of nonverbal oral and
speech motor performance in children. American Journal of Speech-Language Pathology
17, 81-91.
McCauley, R. J. & Swisher, L. (1984). Psychometric review of language and articulation tests
for preschool children. Journal of Speech and Hearing Disorders 49, 34-42.
Merrell, A. W., & Plante, E. (1997). Norm-referenced test interpretation in the diagnostic
process. Language, Speech, and Hearing Services in Schools 28, 50-58.
Montgomery, J. K. (2008) Montgomery Assessment of Vocabulary Acquisition. Greenville, SC:
Super Duper Publications.
Pena, E. D., Spaulding, T. J., & Plante, E. (2006). The composition of normative groups and
diagnostic decision making: Shooting ourselves in the foot. American Journal of Speech-Language Pathology, 15, 247-254.
Plante, E. & Vance, R. (1994). Selection of preschool language tests: A data-based approach.
Language, Speech, and Hearing Services in Schools 25, 15-24.
Plante, E & Vance, R. (1995). Diagnostic accuracy of two tests of preschool language. American
Journal of Speech-Language Pathology 4(2), 70-76.
Semel, E., Wiig, E. H., & Secord, W. A. (2003). Clinical evaluation of language fundamentals,
Fourth edition. San Antonio, TX: Pearson.
Spaulding, T. J., Plante, E., & Farinella, K. A. (2006). Eligibility criteria for language impairment:
Is the low end of normal always appropriate? Language, Speech, and Hearing Services in
Schools 37, 61-72.
Speltz, M. L., DeKlyen, M., Calderon, R., Greenberg, M. T., & Fisher, P. A. (1999).
Neuropsychological characteristics and test behavior in boys with early onset conduct
problems. Journal of Abnormal Psychology, 108, 315–325.
Wallace, G., & Hammill, D. D. (2002). Comprehensive Receptive and Expressive Vocabulary
Test, Second Edition. Austin, TX: PRO-ED.
Watkins, R. V. & DeThorne, L. S. (2000). Assessing children’s vocabulary skills: From word
knowledge to word-learning potential. Seminars in Speech and Language 21(3), 235-245.
Wiig, E. H. & Secord, W. (1992). Test of Word Knowledge. San Antonio, TX: The Psychological
Corporation.
Williams, K. T. (2000). Expressive Vocabulary Test, Second Edition. San Antonio, TX: Pearson.
Zimmerman, I. L., Steiner, V. G., & Pond, R. E. (2002). Preschool language scale, Fourth
edition. San Antonio, TX: Pearson.
Table 1
Summary of Childhood Vocabulary Measures

Boehm Test of Basic Concepts (Boehm-3)
Age range: Kindergarten – Second Grade
Testing time: 1 session – 45 minutes; 2 sessions – 30 minutes each
Picture plate description: Full color drawings
Price: $81

Comprehensive Receptive and Expressive Vocabulary Test (CREVT-2)
Age range: Expressive: 4;0-89;11; Receptive: 5;0-89;11
Testing time: Both subtests – 20-30 minutes; one subtest – 10-15 minutes
Subtests: Expressive & Receptive subtests
Picture plate description: Color photographs, six pictures per plate
Price: $279

Expressive One-Word Picture Vocabulary Test (EOWPVT)
Age range: 2;0-18;11
Testing time: 10-15 minutes
Picture plate description: Full color drawings, one picture per plate
Price: $175 for Fourth Edition

Expressive Vocabulary Test (EVT-2)
Age range: 2;6-81+
Testing time: 10-20 minutes
Picture plate description: Full color drawings, one picture per plate
Price: $414 for Forms A & B; $224 for one form

MacArthur-Bates Communicative Development Inventory (MCDI-III)
Age range: CDI: Words and Gestures: 8-18 mos; CDI: Words and Sentences: 16-30 mos
Testing time: N/A
Picture plate description: N/A
Price: $121.95

Montgomery Assessment of Vocabulary Acquisition (MAVA)
Age range: 3;0-12;11
Testing time: 30-40 minutes for both tests
Subtests: Receptive & Expressive tests
Picture plate description: Full color drawings; Receptive: four pictures per plate; Expressive: one picture per plate
Price: $199

Peabody Picture Vocabulary Test (PPVT-4)
Age range: 2;6-81+
Testing time: 10-15 minutes
Picture plate description: Full color drawings, one picture per plate
Price: $414 for Forms A & B; $224 for one form

Receptive One-Word Picture Vocabulary Test (ROWPVT)
Age range: 2;0-18;11
Testing time: 10-15 minutes
Picture plate description: Full color line drawings; four pictures per plate
Price: $175 for Fourth Edition

The WORD Test-2: Adolescent (WORD-2:A)
Age range: 12;0-17;11
Testing time: 30 minutes
Tasks: Associations, Synonyms, Semantic Absurdities, Antonyms, Definitions, Flexible Word Use
Picture plate description: N/A
Price: $160

The WORD Test-2: Elementary (WORD-2:E)
Age range: 6;0-11;11
Testing time: 30 minutes
Tasks: Associations, Synonyms, Semantic Absurdities, Antonyms, Definitions, Flexible Word Use
Picture plate description: N/A
Price: $160
Table 2
Evaluation of Each Measure Based on Specific Psychometric Criteria

Columns 2-4 address the normative sample (Sizeable, Representative, Recent); columns 5-8 address reliability (Internal consistency, Test-retest, SEM, Inter-examiner); columns 9-12 address validity (Developmental trends, Test comparison, Group comparisons, Sensitivity & specificity).

Vocabulary Test                                          Siz.  Rep.  Rec.  Int.  T-R  SEM  I-E  Dev.  T.C.  G.C.  S&S
Boehm Test of Basic Concepts, Third Edition (Boehm-3)*    +     +     +     -     -    +    0    +     -     0     0
Comprehensive Receptive and Expressive Vocabulary
  Test (CREVT-2)                                          -     +     +     -     -    -    -    -     +     +     0
Expressive One-Word Picture Vocabulary Test (EOWPVT)      -     +     +     +     -    +    -    +     +     +     0
Expressive Vocabulary Test (EVT-2)†                       +     +     +     +     -    +    0    +     +     +     0
MacArthur-Bates Communicative Development Inventory,
  Third Edition (MCDI-III)                                -     +     +     -     -    -    0    +     +     -     0
Montgomery Assessment of Vocabulary Acquisition (MAVA)    -     +     +     +     -    0    +    +     +     -     +
Peabody Picture Vocabulary Test (PPVT-4)†                 +     +     +     +     -    +    0    +     +     +     0
Receptive One-Word Picture Vocabulary Test (ROWPVT)       -     +     +     +     -    +    -    +     +     +     0
The WORD Test-2: Adolescent (WORD-2:A)                    +     +     +     -     -    -    -    +     +     -     0
The WORD Test-2: Elementary (WORD-2:E)                    -     +     +     -     -    -    -    +     +     -     0

Key: + = specified criterion met; - = specified criterion not met; 0 = no evidence provided in the test manual
*Spanish version also available; available in Forms E & F
†Available in Forms A & B