Test Validity-WISC-IV

By Jill Hutzel, A.M, K.W & L.K.
What is Validity?
“The concept of validity has evolved over more than 60
years, and various definitions of validity have been
proposed. A rather basic definition of validity is ‘the
degree to which a test measures what it is supposed to
measure.’ Although this definition is relatively common
and straightforward, it oversimplifies the issues a bit. A
better definition, reflecting the most contemporary
perspective, is that validity is ‘the degree to which
evidence and theory support the interpretations of test
scores entailed by the proposed uses’ of a test (AERA,
APA, & NCME, 1999, p.9).”
(Furr & Bacharach, 2008)
1.6 Procedures Used to Generate Test Content
In general, “one type of validity evidence relates to the
match between the actual content of a test and the
content that should be included in the test. If a test is to
be interpreted as a measure of a particular construct, then
the content of the test should reflect the important facets
of the construct. The supposed psychological nature of
the construct should dictate the appropriate content of
the test. Validity evidence of this type is sometimes
referred to as content validity…” (Furr & Bacharach, 2008).
1.6 Continued…
According to the WISC-IV technical manual (Wechsler, 2004), examining
relationships between a test’s content and the construct it is
intended to measure provides a major source of evidence about
the validity of the test. Evidence of content validity is not based
on statistics or empirical testing; rather, it is based on the
degree to which the test items adequately represent and relate
to the trait or function that is being measured. Test content also
involves wording and format of items, as well as procedures for
administering and scoring the test. Comprehensive literature
and expert reviews were conducted to examine the content of
the WISC-IV.
1.6 Continued…
A number of concurrent studies were conducted to provide additional
evidence for the scale’s reliability and validity. Retest data are reported for all
ages and for five separate age groups (6:0-7:11, 8:0-9:11, 10:0-11:11, 12:0-13:11, and 14:0-16:11). Evidence for the convergent and discriminant validity
of the WISC-IV is provided by correlational studies with a number (8 total)
of different instruments (e.g., WPPSI-III, WAIS-III, WIAT-II). Evidence of
construct validity was provided through a series of exploratory and
confirmatory *factor-analytic studies and mean comparisons using matched
samples of clinical and nonclinical children.
*Factor analysis is a mathematical procedure used to explain the pattern of intercorrelations
among a set of variables (such as individual test items, entire tests, subtests, or rating scales) by
deriving the smallest number of meaningful variables or factors. It is based on the assumption
that a significant correlation between two variables indicates a common underlying factor shared
by both variables.
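The assumption behind factor analysis can be illustrated with a small simulation. This is only a sketch with made-up data, not WISC-IV data: the loadings (0.8 and 0.7), the sample size, and the names var_a, var_b, and var_c are all assumptions chosen for illustration. Two variables that share a common underlying factor correlate with each other, while a variable that does not share the factor is essentially uncorrelated with them.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def simulate(n=5000, seed=0):
    """Simulate three variables; only the first two share a common factor."""
    rng = random.Random(seed)
    factor = [rng.gauss(0, 1) for _ in range(n)]
    # Hypothetical loadings: var_a loads 0.8 and var_b loads 0.7 on the factor.
    # The noise term is scaled so each variable has unit variance.
    var_a = [0.8 * f + math.sqrt(1 - 0.8 ** 2) * rng.gauss(0, 1) for f in factor]
    var_b = [0.7 * f + math.sqrt(1 - 0.7 ** 2) * rng.gauss(0, 1) for f in factor]
    # var_c is pure noise and does not depend on the factor at all.
    var_c = [rng.gauss(0, 1) for _ in range(n)]
    return var_a, var_b, var_c


var_a, var_b, var_c = simulate()
r_ab = pearson(var_a, var_b)  # both variables share the factor
r_ac = pearson(var_a, var_c)  # no shared factor
print(f"r(a, b) = {r_ab:.2f}, r(a, c) = {r_ac:.2f}")
```

Under these assumptions the expected correlation between var_a and var_b is the product of their loadings (0.8 x 0.7 = 0.56). A full factor analysis works in the opposite direction: it starts from the observed correlations and estimates the factors and loadings.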
1.10 What evidence supports the interpretation of performance based on test items
[response processes]? This may include comparisons to previous test versions, expert
reviews, or an examination of incorrect or unintended correct responses.
•Validity evidence based on the response processes of children
supports the conclusion that children engage the expected
cognitive processes when responding to the designated subtest
tasks (Wechsler, 2004)
o Strong theoretical and empirical evidence supports
the response-process validity of the subtests carried
over from previous versions of the test into the
WISC-IV
o Validity for newly introduced subtests was
supported with extensive literature reviews, empirical
examination, and expert consultation
1.10 Continued…
•Further evidence of validity was acquired through both empirical and
qualitative examination of response processes during the test’s
development (Wechsler, 2004):
*The response frequencies for multiple-choice items were examined
for trends in commonly given incorrect responses
-These common incorrect responses were then examined to
determine whether there was an issue with the response
*Children were also asked to explain their answers and the reasoning
behind their grouping of pictures on the Picture Concepts subtest
-These explanations were paired with the response frequencies to
identify the common rationales for children’s correct and incorrect
responses
-If this revealed possibly acceptable, yet unintended, responses,
item content was changed and the items were
reexamined
1.10 Continued…
* Children’s reactions and comments were also noted
-During trials of the Letter-Number Sequencing subtest, some children commented
on embedded words. This provided examiners with evidence that the word
acted as a cue in the working memory task
* Direct questioning was also used
-Children were questioned about their problem-solving strategies for the
Matrix Reasoning subtest
-Appropriate adjustments were then made to remove any distracters in the
items
* Finally, the WISC-IV was compared with the following tests in order to
determine the relationship between them (Williams, Weiss, & Rolfhus,
2003):
-WISC-III, WPPSI-III, WAIS-III, WASI, WIAT-II, Children’s Memory Scale
(CMS), Gifted Rating Scale (GRS), BarOn Emotional Quotient-Inventory:
Youth Version (Bar-On EQ-I:YV), and Adaptive Behavior Assessment
System-Second Edition (ABAS-II)
•Data from the standardization sample, response frequencies for each age group and
ability level, and the questioning of participants about their responses were used to confirm or alter
the assigned point values of subtests within the verbal domain (Wechsler, 2004)
1.11 What evidence is provided that supports the test’s internal structure (e.g.,
intercorrelations, factor analysis, cross-validation, etc.)?
•In a classic article, Campbell and Fiske (1959) presented a theoretical
methodology for interpreting the patterns of correlations in a multitrait-multimethod matrix to provide evidence of convergent validity and
discriminant validity. Their original methodology was based on the
examination of correlational patterns in a matrix where relatively high
correlations (convergent validity) are predicted for some variables and relatively
low correlations (discriminant validity) are predicted for other variables. Data
supporting a priori hypotheses about the pattern of the relationships provide
evidence of construct validity.
•Several a priori hypotheses were made regarding the intercorrelation studies.
-1st – all subtests would show some degree of correlation to one
another based on the assumption that the subtests are measuring a
general intelligence factor. It was assumed that all subtests would
have at least low to moderate correlations with each other
-2nd – it was expected that the subtests contributing to a specific scale
would have higher correlations with each other than with subtests
comprising other scales.
1.11 Continued…
-3rd – evidence from previous studies indicates that some subtests are more
related to general intelligence than other subtests. For instance, Block
Design, Similarities, Vocabulary, and Picture Completion have all been shown to
have high g loadings. The following predictions were made based on this
evidence. It was expected that subtests with high g loadings, regardless of scale
membership, would have relatively high correlations with each other. It was also
expected that the correlation between two high g-loading subtests from the same
scale would be higher than the correlation between two high g-loading subtests
from different scales.
Table 5.1 presents the average intercorrelations of the subtests and sums of scaled scores
for the composites for the 11 age groups
-Statistically, all intersubtest correlations are significant. The pattern of WISC-IV
intercorrelations is very similar to that found for the WISC-III and other Wechsler
intelligence tests, in which most of the subtests have significant correlations
with other subtests
-The same general pattern of intercorrelations also appears across the 11 age
groups. These data generally support the expectation that subtests of similar
functioning correlate more highly with each other than with subtests
measuring different types of functioning, providing initial evidence of both the
convergent and discriminant validity of the WISC-IV
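The convergent/discriminant pattern described above can be checked by comparing the average within-scale correlation with the average between-scale correlation. The sketch below is only an illustration. The four subtests are assigned to their WISC-IV index scales (VCI and PRI), but the correlation values are invented for the example and are NOT taken from Table 5.1. The helper name mean_correlations is also hypothetical.

```python
from itertools import combinations

# Scale membership of four subtests (VCI = Verbal Comprehension,
# PRI = Perceptual Reasoning).
scale_of = {
    "Similarities": "VCI",
    "Vocabulary": "VCI",
    "Block Design": "PRI",
    "Matrix Reasoning": "PRI",
}

# Made-up intercorrelations for illustration only (not WISC-IV results).
corr = {
    frozenset(("Similarities", "Vocabulary")): 0.70,
    frozenset(("Block Design", "Matrix Reasoning")): 0.55,
    frozenset(("Similarities", "Block Design")): 0.45,
    frozenset(("Similarities", "Matrix Reasoning")): 0.48,
    frozenset(("Vocabulary", "Block Design")): 0.42,
    frozenset(("Vocabulary", "Matrix Reasoning")): 0.46,
}


def mean_correlations(scale_of, corr):
    """Return (mean within-scale r, mean between-scale r) over all subtest pairs."""
    within, between = [], []
    for a, b in combinations(scale_of, 2):
        r = corr[frozenset((a, b))]
        if scale_of[a] == scale_of[b]:
            within.append(r)   # same scale: convergent evidence expected
        else:
            between.append(r)  # different scales: lower r expected
    return sum(within) / len(within), sum(between) / len(between)


within_mean, between_mean = mean_correlations(scale_of, corr)
print(f"within-scale mean r = {within_mean:.3f}")
print(f"between-scale mean r = {between_mean:.3f}")
```

With these invented values the within-scale mean exceeds the between-scale mean, which is the pattern the second a priori hypothesis predicts. All between-scale correlations are still positive, consistent with the first hypothesis of a general intelligence factor.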
1.11 Continued…Factor Analysis
-Both exploratory and confirmatory factor analyses were conducted to evaluate the internal structure.
The initial step was to determine whether the pattern of obtained results matched the hypothesized
factor structure. Then the stability of the factor structure was examined across different age groups.
Finally, the predicted model was tested against alternative models using confirmatory factor analytic
methods.
-The first series of exploratory factor analyses included only the core subtests; the second series utilized
both the core and supplemental subtests. Analyses with core subtests were conducted on the 2200
children included in the normative sample. Analyses with core and supplemental subtests were conducted on
the 1525 children from the normative sample who also completed Arithmetic. To further examine the
stability of the factor structure, the sample was divided into four separate age groups of 2- or 3-year
intervals (ages 6-7, 8-10, 11-13, and 14-16). Factor analyses were then conducted separately for the four
age groups.
-Table 5.3 reports the results of the factor analysis of the core subtests for all ages and by age group.
Results were consistent with the predicted factor model. The primary loading of each core subtest falls
clearly on its corresponding factor as predicted. None of the secondary loadings in the analyses for all
ages exceeded .20. The other results for the core subtest analysis were as predicted across age groups (table on page 51).
Tables on pages 53 and 54; Table 5.4 on pages 55 and 56
-Table 5.4 reports the factor analysis results of the core and supplemental subtests for all ages
and by age group
-For all ages, a small secondary loading was observed for Picture Completion on the verbal
comprehension factor. The results by age group are somewhat more complex. The
secondary loading of Picture Completion on the verbal comprehension factor appears in all
age groups. Information exhibits a small secondary loading on the working memory factor
at ages 6-7 and 8-10. Picture Concepts again splits between the verbal comprehension and
perceptual reasoning factors for ages 6-7, as it did in the previous analysis. Arithmetic, clearly a
working memory subtest at all ages, becomes factorially more complex at ages 11-13 and
14-16, as indicated by the small secondary loadings of this subtest on the verbal
comprehension and perceptual reasoning factors.
1.11 Continued…Cross Validation Analysis
-The replicability of factor scores for the Verbal Comprehension, Perceptual Reasoning, Working Memory, and
Processing Speed factors was verified by a cross-validation procedure.
-It was designed to test the consistency of the factor scores across random subsamples of children. A random group of
440 children (40 per age group) was drawn, without replacement, from the WISC-IV standardization sample and set aside as a
cross-validation sample. Seven random samples of 440 children each were then drawn, without replacement, to serve as
derivation samples. Using all 15 subtests, regression-type factor score coefficients were calculated for each of these
seven derivation samples with the principal axis factoring procedure.
-At the conclusion of the factor analysis for each derivation sample, factor score coefficients for each of the four WISC-IV factors were obtained. The seven sets of factor score coefficients for the derivation samples were expected to yield
comparable factor scores when applied to the cross-validation sample if the scores were truly stable.
-Factor scores were then calculated for the cross-validation sample. Each child had seven factor scores for each of the
four WISC-IV factors (one per set of derivation-sample factor score coefficients). These factor scores were then
intercorrelated within the cross-validation sample. The target correlations were those based on factor scores
derived to measure the same factor (e.g., Verbal Comprehension). These correlations were inspected for each WISC-IV factor.
-The median correlations for targeted factors were:
o .999 for Verbal Comprehension
o .999 for Perceptual Reasoning
o .999 for Working Memory
o .999 for Processing Speed
These results demonstrate the stability of the VCI, PRI, WMI, and PSI scores across samples and support the four-factor
structure.
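The logic of the cross-validation check can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the procedure used in the manual: it uses simulated scores on 3 subtests instead of all 15, two hypothetical sets of factor score coefficients instead of seven, and it skips the principal axis factoring step that would produce those coefficients. The names factor_scores, coef_sample_1, and coef_sample_2 are invented for the example.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def factor_scores(rows, coefficients):
    """Weighted sum of each child's subtest scores (regression-type scoring)."""
    return [sum(c * s for c, s in zip(coefficients, row)) for row in rows]


rng = random.Random(1)

# Simulated cross-validation sample: 440 children, 3 subtests that share
# one common factor g (made-up data, not the standardization sample).
cross_validation = []
for _ in range(440):
    g = rng.gauss(0, 1)
    cross_validation.append([0.7 * g + 0.7 * rng.gauss(0, 1) for _ in range(3)])

# Hypothetical coefficient sets, as if estimated from two derivation samples.
coef_sample_1 = [0.40, 0.35, 0.25]
coef_sample_2 = [0.42, 0.33, 0.25]

# Apply both coefficient sets to the SAME children, then correlate the results.
scores_1 = factor_scores(cross_validation, coef_sample_1)
scores_2 = factor_scores(cross_validation, coef_sample_2)
r = pearson(scores_1, scores_2)
print(f"correlation between the two sets of factor scores: {r:.4f}")
```

If the coefficients estimated in different derivation samples are similar, the resulting factor scores for the same children correlate close to 1.0, which is the kind of result the median correlations of .999 reflect.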
1.11 Continued…Correlations with the WAIS-III
-The WISC-IV and the WAIS-III were administered to 198 children aged 16, in
counterbalanced order, with a testing interval of 10-67 days and a mean testing interval of
22 days.
-The WISC-IV mean composite scores range from .5 to 2.7 points below 100. With the
exception of the WMI, the WAIS-III composite scores are higher than those of the WISC-IV.
The main effect for test was statistically significant and all differences for composites
exhibited small effect sizes.
-The corrected correlation coefficients between the two instruments range from .73 (PRI-POI) to .89 (FSIQ-FSIQ) for the composites and from .56 (Symbol Search) to .83
(Vocabulary) for subtests.
-Subtest correlations were higher in this study than in the WPPSI-III study. The results are
consistent with those found between the WISC-III and WAIS-III.
-The magnitude of correlations suggests that the two instruments measure highly similar
constructs.
1.11 Continued…Correlations with WISC-III
-Both the WISC-IV and the WISC-III were administered in counterbalanced order to 244
children from ages 6-16; the test-retest interval was 5 to 42 days.
-The corrected correlation between the WISC-III VIQ and WISC-IV VCI is .87, and the correlation
between the WISC-III PIQ and the WISC-IV PRI is .74. The lower correlation between the PIQ
and PRI reflects important changes made to this composite in the WISC-IV. Tasks that were
primarily visual and spatial were replaced with fluid reasoning tasks, making the PRI a
stronger measure of fluid reasoning than the PIQ; for this reason, a moderate correlation
was expected.
-The WISC-III FSIQ and the WISC-IV FSIQ correlate highly (r=.89).
-The older WISC-III norms provided slightly inflated estimates for today’s children. The
overall difference between the WISC-III and WISC-IV FSIQ scores is 2.5 points, with WISC-III scores the
higher of the two.
1.15 Does test performance predict adequate criterion performance, and if so, what
evidence is provided for predictive/concurrent validity?
-Criterion validity is an alternative perspective that de-emphasizes the conceptual
meaning or interpretation of test scores. Test users may simply wish to use a test to
differentiate between groups of people or to make predictions about future
outcomes. (E.g., is the test “valid” enough for its intended purpose?)
From the traditional three-faceted view of validity, criterion validity refers to the
degree to which test scores can predict specific criterion variables. The key to validity
is the empirical association between test scores and scores on the relevant criterion
variable. *All that matters is the test’s ability to differentiate groups or predict specific
outcomes.
-Another alternative perspective on validity emphasizes the need to learn what test
scores mean, rather than testing specific hypotheses about test scores. That is, test
developers and users can evaluate a test by assuming that the meaning of test scores is
itself an interesting and important question to be addressed. This approach is also
called an “inductive” approach. In this approach, researchers “allow constructs to
evolve and change as a planned part of the test construction process itself” (Tellegen &
Waller, in press).
1.15 Continued…
-The inductive approach to validity might be most relevant within a research context,
and it can be seen as a back-and-forth process. In an applied context, test developers
and test users will probably focus on a test for the purpose of a well-specified use
(e.g., predicting job performance). In a research context, test developers and test users
might be interested in a new area of interest and in developing a theoretical foundation
for that area. For such cases, test construction and evaluation work together with the
researcher’s evolving understanding of the constructs being measured.
-The third alternative perspective on test validity strongly emphasizes the connection
between tests and psychological constructs. *A test is a valid measure of a construct if
and only if the intended construct truly influences the respondent’s performance on
the test.
*Overall, constructs not only exist and are a crucial part of validity, but they should be
the guiding forces in the test development and validation process.
1.24 Information Provided on Possible Unintended Consequences of Testing
Psychological and educational tests are commonly administered with the belief that
those being assessed will derive some benefit from the test results. Tests also assist
professionals and institutions in meeting professional and legal standards governing
diagnosis and classification, program and institutional planning, and research and
evaluation activities. Professionals also should acknowledge intended and unintended
consequences of relying on informal judgments instead of those informed by test
results. Although information about the consequences of testing may influence
decisions about test use (whether or not to use a test), adverse consequences do not in
themselves detract from the validity of the intended test interpretations.
Put simply, everything needs to be taken into consideration when creating tests.
Not all tests will affect everyone in the same way; some tests will affect one
person more than another. For example, does the test construct benefit males more
than females, or vice versa?
1.24 Continued…
“The theoretical assumptions that scientists make are particularly shaped by value
judgments, and even the labels that scientists attach to their theoretical concepts are
particularly shaped by values” (Furr & Bacharach, 2008).
For example, which labels are considered scientifically correct? Are the labels biased,
or are the tests themselves biased?
“Value judgments have potentially subtle (and sometimes not so subtle) influences on
the scientific process. Proponents of consequential validity argue that such influences
should be recognized and evaluated as clearly as possible in a testing context” (Furr &
Bacharach, 2008)
In other words, do values and/or personal judgments influence test constructs?
1.24 Continued…
•Some of the possible unintended consequences of testing are:
1. It may fail to measure the processes underlying a child’s response.
2. It cannot capture the multidimensional nature of intelligence by using a single number
to quantify IQ.
•There may be some errors in administration, such as failure to query, record
responses verbatim, add individual subtest scores correctly, transform raw scores to
scaled scores correctly, add scaled scores correctly, transform scaled scores to IQs
correctly, or report Composite scores and Full Scale IQ scores correctly.
•Some other unintended consequences of the WISC-IV may be failure to provide
full conversion tables, failure to provide a psychometric basis, differing numbers of
children used to compute standardizations, limited range of scores, limited criterion
validity studies, possible difficulties in scoring responses, somewhat large
practice effects, poor quality of some test materials, occasionally
confusing guidelines, and the inclusion of Cancellation as a subtest
(Sattler, 2008).
Additional studies supporting the validity of a test are always necessary when
testing children, and especially children with special needs.
Two studies show the performance of children with special needs on the
WISC-III and WISC-IV. Mayes and Calhoun (2006) “studied a sample of
118 children with attention-deficit/hyperactivity disorder. They obtained a
mean Full Scale IQ of 108.0 with a range of 24 points between their Processing
Speed Composite and Perceptual Reasoning Composite and the same range
between their Working Memory Composite and Perceptual Reasoning
Composite.”
In a similar study, “Falk, Silverman, and Moran (2004) studied a sample
of 103 intellectually gifted children. They obtained a Full Scale IQ of 127.2,
with a range of 27 points between their Processing Speed Composite and
Perceptual Reasoning Composite” (Sattler, 2008).
References
American Educational Research Association, American Psychological Association, & National Council
on Measurement in Education. (1999). Standards for educational and psychological testing.
Washington, DC: American Educational Research Association.
Campbell, D. T., & Fiske, D. W. (1959). Convergent and discriminant validation by the multitrait-multimethod matrix. Psychological Bulletin, 56, 81-105.
Falk, R. F., Silverman, L. K., & Moran, D. (2004, November). Using two WISC-IV indices to identify the
gifted. Paper presented at the 51st Annual Convention of the National Association for
Gifted Children, Salt Lake City, UT. Retrieved March 9, 2007, from
http://www.gifteddevleopment.com/PDF_files/WISC-IVindices.pdf.
Furr, R. M., & Bacharach, V. R. (2008). Psychometrics: An Introduction. Thousand Oaks,
CA: Sage Publications. ISBN: 978-1-412-927604
Mayes, S. D., & Calhoun, S. L. (2006). WISC-IV and WISC-III profiles in children with ADHD. Journal of
Attention Disorders, 9, 486-493.
Sattler, J. M. (2008a). Assessment of children: Cognitive foundations (5th ed.).
San Diego: Author.
Tellegen, A., & Waller, N. G. (in press). Exploring personality through test construction:
Development of the Multidimensional Personality Questionnaire. Minneapolis:
University of Minnesota Press.
Wechsler, D. (2004). WISC-IV Technical and Interpretive Manual. San
Antonio, TX: Psychological Corporation.
Williams, P., Weiss, L., & Rolfhus, E. (2003). WISC-IV Technical Report #2:
Psychometric Properties. WISC-IV Technical Manual #2. San Antonio, TX:
Psychological Corporation.