Test Validity: WISC-IV
By Jill Hutzel, A.M., K.W., & L.K.

What is Validity?
"The concept of validity has evolved over more than 60 years, and various definitions of validity have been proposed. A rather basic definition of validity is 'the degree to which a test measures what it is supposed to measure.' Although this definition is relatively common and straightforward, it oversimplifies the issues a bit. A better definition, reflecting the most contemporary perspective, is that validity is 'the degree to which evidence and theory support the interpretations of test scores entailed by the proposed uses' of a test (AERA, APA, & NCME, 1999, p. 9)." (Furr & Bacharach, 2008)

1.6 Procedures Used to Generate Test Content
In general, "one type of validity evidence relates to the match between the actual content of a test and the content that should be included in the test. If a test is to be interpreted as a measure of a particular construct, then the content of the test should reflect the important facets of the construct. The supposed psychological nature of the construct should dictate the appropriate content of the test. Validity evidence of this type is sometimes referred to as content validity…" (Furr & Bacharach, 2008).

1.6 Continued…
According to the WISC-IV technical manual (Wechsler, 2004), examining relationships between a test's content and the construct it is intended to measure provides a major source of evidence about the validity of the test. Evidence of content validity is not based on statistics or empirical testing; rather, it is based on the degree to which the test items adequately represent and relate to the trait or function being measured. Test content also involves the wording and format of items, as well as procedures for administering and scoring the test. Comprehensive literature and expert reviews were conducted to examine the content of the WISC-IV.

1.6 Continued…
A number of concurrent studies were conducted to provide additional evidence for the scale's reliability and validity. Retest data are reported for all ages and for five separate age groups (6:0-7:11, 8:0-9:11, 10:0-11:11, 12:0-13:11, & 14:0-16:11). Evidence for the convergent and discriminant validity of the WISC-IV is provided by correlational studies with eight different instruments (e.g., WPPSI-III, WAIS-III, WIAT-II). Evidence of construct validity was provided through a series of exploratory and confirmatory *factor-analytic studies and mean comparisons using matched samples of clinical and nonclinical children.

*Factor analysis is a mathematical procedure used to explain the pattern of intercorrelations among a set of variables (such as individual test items, entire tests, subtests, or rating scales) by deriving the smallest number of meaningful variables, or factors. It is based on the assumption that a significant correlation between two variables indicates a common underlying factor shared by both variables.
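To make the footnote above concrete, here is a minimal Python sketch of how a single common factor can be recovered from an intercorrelation matrix. The subtest scores are simulated, and the simple eigendecomposition stands in for the more elaborate factoring procedures the test developers actually used; this illustrates the idea, not their method.

    import numpy as np

    rng = np.random.default_rng(0)
    n_children = 500
    g = rng.normal(size=n_children)  # a hypothetical common factor
    # Four simulated "subtests" that all share variance with g
    subtests = np.column_stack(
        [0.7 * g + 0.5 * rng.normal(size=n_children) for _ in range(4)]
    )

    r = np.corrcoef(subtests, rowvar=False)  # intercorrelation matrix
    eigvals, eigvecs = np.linalg.eigh(r)     # eigendecomposition (ascending order)
    # Loadings of each subtest on the single largest factor (sign is arbitrary)
    loadings = eigvecs[:, -1] * np.sqrt(eigvals[-1])
    print(np.round(r, 2))
    print(np.round(loadings, 2))  # sizable, similar loadings -> one shared factor

Because every simulated subtest shares variance with g, the four intercorrelations are all positive and a single factor accounts for them, which is exactly the pattern the footnote describes.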
1.10 What evidence supports the interpretation of performance based on test items [response processes]?
This may include comparisons to previous test versions, expert reviews, or an examination of incorrect or unintended correct responses.
• Validity evidence based on the response processes of children supports that children engage the expected cognitive processes when responding to the designated subtest tasks (Wechsler, 2004).
  o Strong theoretical and empirical evidence from previous versions supports the validity, based on response processes, of the earlier subtests that were carried over into the WISC-IV.
  o Validity for newly introduced subtests was supported with extensive literature reviews, empirical examination, and expert consultation.

1.10 Continued…
• Further evidence of validity was acquired through both empirical and qualitative examination of response processes during the test's development (Wechsler, 2004):
  * The response frequencies for multiple-choice items were examined for trends in commonly given incorrect responses.
    - These common incorrect responses were then examined to determine whether there was an issue with the item.
  * Children were also asked to explain their answers and the reasoning behind their grouping of pictures on the Picture Concepts subtest.
    - These explanations were paired with the response frequencies to identify the common rationales for children's correct and incorrect responses.
    - If possibly acceptable, yet unintended, responses emerged from this process, item content was changed and the items were reexamined.

1.10 Continued…
  * Children's reactions and comments were also noted.
    - During trials of Letter-Number Sequencing, some children commented on embedded words. This provided examiners with evidence that the word acted as a cue in the working memory task.
  * Direct questioning was also used.
    - Children were questioned about their problem-solving strategies for the Matrix Reasoning subtest.
    - Appropriate adjustments were then made to remove any distractors in the items.
  * Finally, the WISC-IV was compared to the following other tests in order to determine the relationships between them (Williams, Weiss, & Rolfhus, 2003): WISC-III, WPPSI-III, WAIS-III, WASI, WIAT-II, Children's Memory Scale (CMS), Gifted Rating Scales (GRS), BarOn Emotional Quotient Inventory: Youth Version (BarOn EQ-i:YV), and Adaptive Behavior Assessment System-Second Edition (ABAS-II).
• Data from the standardization sample, response frequencies for each age group and ability level, and the questioning of participants' responses were used to confirm or alter the assigned point values of subtests within the verbal domain (Wechsler, 2004).

1.11 What evidence is provided that supports the test's internal structure (e.g., intercorrelations, factor analysis, cross-validation, etc.)?
• In a classic article, Campbell and Fiske (1959) presented a theoretical methodology for interpreting the patterns of correlations in a multitrait-multimethod matrix to provide evidence of convergent validity and discriminant validity. Their original methodology was based on the examination of correlational patterns in a matrix where relatively high correlations (convergent validity) are predicted for some variables and relatively low correlations (discriminant validity) are predicted for other variables. Data supporting a priori hypotheses about the pattern of the relationships provide evidence of construct validity. A toy illustration of this logic appears below.
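The following sketch illustrates the Campbell and Fiske pattern using simulated data and invented measure names: correlations between measures of the same construct (convergent) should be relatively high, while correlations between measures of different constructs (discriminant) should be relatively low.

    import numpy as np

    rng = np.random.default_rng(1)
    n = 300
    verbal = rng.normal(size=n)  # hypothetical verbal construct
    speed = rng.normal(size=n)   # hypothetical processing-speed construct
    scores = {
        "vocab":  verbal + 0.4 * rng.normal(size=n),
        "simil":  verbal + 0.4 * rng.normal(size=n),
        "coding": speed + 0.4 * rng.normal(size=n),
        "search": speed + 0.4 * rng.normal(size=n),
    }
    names = list(scores)
    r = np.corrcoef(np.column_stack(list(scores.values())), rowvar=False)
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            print(f"{names[i]}-{names[j]}: r = {r[i, j]:.2f}")
    # Expected pattern: vocab-simil and coding-search are high (convergent);
    # the cross-construct pairings are low (discriminant).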
• Several a priori hypotheses were made regarding the intercorrelation studies.
  - First, all subtests would show some degree of correlation with one another, based on the assumption that the subtests measure a general intelligence factor; it was assumed that all subtests would have at least low to moderate correlations with each other.
  - Second, it was expected that the subtests contributing to a specific scale would have higher correlations with each other than with subtests comprising other scales.

1.11 Continued…
  - Third, evidence from previous studies indicates that some subtests are more related to general intelligence than others. For instance, Block Design, Similarities, Vocabulary, and Picture Completion have all been shown to have high g loadings. The following predictions were made based on this evidence: it was expected that subtests with high g loadings, regardless of scale membership, would have relatively high correlations with each other, and that the correlation between two high-g-loading subtests from the same scale would be higher than the correlation between two high-g-loading subtests from different scales.
• Table 5.1 presents the average intercorrelations of the subtests and the sums of scaled scores for the composites for the 11 age groups.
  - All intersubtest correlations are statistically significant. The pattern of WISC-IV intercorrelations is very similar to that found for the WISC-III and other Wechsler intelligence tests, in which most of the subtests have significant correlations with other subtests.
  - The same general pattern of intercorrelations also appears across the 11 age groups. These data generally support the expectation that subtests measuring similar functions correlate more highly with each other than with subtests measuring different types of functioning, providing initial evidence of both the convergent and discriminant validity of the WISC-IV.

1.11 Continued… Factor Analysis
  - Both exploratory and confirmatory factor analyses were conducted to evaluate the internal structure. The initial step was to determine whether the pattern of obtained results matched the hypothesized factor structure. Then the stability of the factor structure was examined across different age groups. Finally, the predicted model was tested against alternative models using confirmatory factor-analytic methods.
  - The first series of exploratory factor analyses included only the core subtests; the second series utilized both the core and supplemental subtests. Analyses with the core subtests were conducted on the 2,200 children in the normative sample. Analyses with the core and supplemental subtests were conducted on the 1,525 children from the normative sample who also completed Arithmetic. To further examine the stability of the factor structure, the sample was divided into four separate age groups spanning 2- or 3-year intervals (ages 6-7, 8-10, 11-13, and 14-16). Factor analyses were then conducted separately for the four age groups.
  - Table 5.3 reports the results of the factor analysis of the core subtests for all ages and by age group. Results were consistent with the predicted factor model: the primary loading of each core subtest falls clearly on its corresponding factor, as predicted, and none of the secondary loadings in the analyses for all ages exceeded .20. The other results for the core subtest analysis were as predicted across age groups. A sketch of this kind of loading check follows.
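The short routine below shows the kind of check just described: does each subtest's largest loading fall on its assigned factor, and do its secondary loadings stay below .20? The loading matrix and subtest-to-factor assignments are invented for the example; they are not the actual Table 5.3 values.

    import numpy as np

    factors = ["VC", "PR", "WM", "PS"]
    assigned = {"Similarities": "VC", "Block Design": "PR",
                "Digit Span": "WM", "Coding": "PS"}
    loadings = {                      # hypothetical rotated loadings
        "Similarities": [0.78, 0.12, 0.08, 0.03],
        "Block Design": [0.10, 0.71, 0.09, 0.11],
        "Digit Span":   [0.14, 0.08, 0.66, 0.05],
        "Coding":       [0.02, 0.13, 0.07, 0.70],
    }
    for subtest, row in loadings.items():
        row = np.array(row)
        primary = factors[int(np.argmax(np.abs(row)))]
        secondary_ok = np.sort(np.abs(row))[-2] < 0.20  # second-largest loading
        print(subtest, "->", primary,
              "matches" if primary == assigned[subtest] else "MISMATCH",
              "| secondary < .20:", secondary_ok)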
  - Table 5.4 reports the factor-analysis results for the core and supplemental subtests for all ages and by age group. For all ages, a small secondary loading was observed for Picture Completion on the Verbal Comprehension factor. The results by age group are somewhat more complex: the secondary loading of Picture Completion on the Verbal Comprehension factor appears in all age groups; Information exhibits a small secondary loading on the Working Memory factor at ages 6-7 and 8-10; Picture Concepts again splits between the Verbal Comprehension and Perceptual Reasoning factors at ages 6-7, as it did in the previous analysis; and Arithmetic, clearly a working memory subtest at all ages, becomes factorially more complex at ages 11-13 and 14-16, as indicated by the small secondary loadings of this subtest on the Verbal Comprehension and Perceptual Reasoning factors.

1.11 Continued… Cross-Validation Analysis
  - The replicability of factor scores for the Verbal Comprehension, Perceptual Reasoning, Working Memory, and Processing Speed factors was verified by a cross-validation procedure designed to test the consistency of the factor scores across random subsamples of children.
  - A random group of 440 children (40 per age group) was drawn, without replacement, from the WISC-IV standardization sample and set aside as a cross-validation sample. Seven random samples of 440 children each were then drawn, without replacement, to serve as derivation samples. Using all 15 subtests, regression-type factor-score coefficients were calculated for each of these seven derivation samples with the principal-axis factoring procedure.
  - At the conclusion of the factor analysis for each derivation sample, factor-score coefficients for each of the four WISC-IV factors were obtained. If the scores were truly stable, the seven sets of factor-score coefficients from the derivation samples were expected to yield comparable factor scores when applied to the cross-validation sample.
  - Factor scores were then calculated for the cross-validation sample. Each child had seven factor scores for each of the four WISC-IV factors (one based on the factor-score coefficients from each derivation sample). These factor scores were then intercorrelated across the children in the cross-validation sample. The target correlations were those based on factor scores derived to measure the same factor (e.g., Verbal Comprehension). These correlations were inspected for each WISC-IV factor.
  - The median correlations for targeted factors were:
    o .999 for Verbal Comprehension
    o .999 for Perceptual Reasoning
    o .999 for Working Memory
    o .999 for Processing Speed
  These results demonstrate the stability of the VCI, PRI, WMI, and PSI scores across samples and support the four-factor structure. A simplified sketch of this resampling logic follows.
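Here is a simplified sketch of that resampling logic, assuming simulated data: draw disjoint derivation samples, estimate scoring weights in each, apply every set of weights to one held-out sample, and correlate the resulting scores. A first-principal-component weight vector stands in for the regression-type factor-score coefficients obtained from principal-axis factoring, and the sample sizes merely echo those reported above.

    import numpy as np

    rng = np.random.default_rng(2)
    pool = rng.normal(size=(3520, 15))       # simulated stand-in for the standardization sample
    pool += pool[:, :1]                      # inject a shared factor across all 15 "subtests"
    holdout, rest = pool[:440], pool[440:]   # cross-validation sample and remainder

    score_sets = []
    for k in range(7):                       # seven disjoint derivation samples
        deriv = rest[k * 440:(k + 1) * 440]
        r = np.corrcoef(deriv, rowvar=False)
        _, vecs = np.linalg.eigh(r)
        weights = vecs[:, -1]                # weights for the largest factor
        weights = weights * (np.sign(weights.sum()) or 1.0)  # fix the arbitrary sign
        score_sets.append(holdout @ weights)  # score the held-out children

    # Correlate the seven versions of the "same" factor score
    print(np.round(np.corrcoef(np.vstack(score_sets)), 3))  # near 1.0 -> stable scores

If the weights estimated in different derivation samples produce nearly identical rank-orderings of the held-out children, the off-diagonal correlations approach 1.0, which is the stability result the manual reports.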
1.11 Continued… Correlations with the WAIS-III
  - The WISC-IV and the WAIS-III were administered to 198 children aged 16, in counterbalanced order, with a testing interval of 10-67 days and a mean testing interval of 22 days.
  - The WISC-IV mean composite scores range from .5 to 2.7 points below 100. With the exception of the WMI, the WAIS-III composite scores are higher than those of the WISC-IV. The main effect for test was statistically significant, and all differences for composites exhibited small effect sizes.
  - The corrected correlation coefficients between the two instruments range from .73 (PRI-POI) to .89 (FSIQ-FSIQ) for the composites and from .56 (Symbol Search) to .83 (Vocabulary) for the subtests.
  - Subtest correlations were higher in this study than in the WPPSI-III study. The results are consistent with those found between the WISC-III and the WAIS-III.
  - The magnitude of the correlations suggests that the two instruments measure highly similar constructs.

1.11 Continued… Correlations with the WISC-III
  - Both the WISC-IV and the WISC-III were administered in counterbalanced order to 244 children aged 6-16; the test-retest interval was 5 to 42 days.
  - The corrected correlation between the WISC-III VIQ and the WISC-IV VCI is .87, and the corrected correlation between the WISC-III PIQ and the WISC-IV PRI is .74. The lower correlation between the PIQ and the PRI reflects important changes made to this composite in the WISC-IV: tasks that were primarily visual and spatial were replaced with fluid reasoning tasks, making the PRI a stronger measure of fluid reasoning than the PIQ. For this reason, a moderate correlation was expected.
  - The WISC-III FSIQ and the WISC-IV FSIQ correlate highly (r = .89).
  - The older WISC-III norms provided slightly inflated estimates for today's children. The overall difference between the FSIQ scores is 2.5 points, with the WISC-III scores the higher of the two.

1.15 Does test performance predict adequate criterion performance, and if so, what evidence is provided for predictive/concurrent validity?
  - Criterion validity is an alternative perspective that de-emphasizes the conceptual meaning or interpretation of test scores. Test users may simply wish to use a test to differentiate between groups of people or to make predictions about future outcomes (i.e., is the test "valid" enough for its intended purpose?). From the traditional three-faceted view of validity, criterion validity refers to the degree to which test scores can predict specific criterion variables. The key to this form of validity is the empirical association between test scores and scores on the relevant criterion variable; all that matters is the test's ability to differentiate groups or predict specific outcomes.
  - Another alternative perspective on validity emphasizes the need to learn what test scores mean, rather than testing specific hypotheses about test scores. That is, test developers and users can evaluate a test by assuming that the meaning of test scores is itself an interesting and important question to be addressed. This is also called an "inductive" approach. From this approach, researchers "allow constructs to evolve and change as a planned part of the test construction process itself" (Tellegen & Waller, in press).

1.15 Continued…
  - The inductive approach to validity might be most relevant within a research context, and it can be seen as a back-and-forth process. In an applied context, test developers and test users will probably focus on a test for the purpose of a well-specified use (e.g., predicting job performance). In a research context, test developers and test users might be interested in a new area of interest and in developing a theoretical foundation for that area. In such cases, test construction and evaluation work together with the researcher's evolving understanding of the constructs being measured.
  - A third alternative perspective on test validity strongly emphasizes the connection between tests and psychological constructs.
    * A test is a valid measure of a construct if and only if the intended construct truly influences the respondent's performance on the test.
    * Overall, constructs not only exist and are a crucial part of validity, but they should also be the guiding forces in the test development and validation process.

1.24 Information Provided on Possible Unintended Consequences of Testing
Psychological and educational tests are commonly administered with the belief that those being assessed will derive some benefit from the test results.
Tests also assist professionals and institutions in meeting professional and legal standards governing diagnosis and classification, program and institutional planning, and research and evaluation activities. Professionals should also acknowledge the intended and unintended consequences of relying on informal judgments instead of judgments informed by test results. Although information about the consequences of testing may influence decisions about test use (whether or not to use a test), adverse consequences do not in themselves detract from the validity of the intended test interpretations. In short, when creating tests, these consequences need to be taken into consideration. Not all tests will affect everyone the same way; some tests will affect one person more than another. For example, does the test construct benefit males more than females, or vice versa?

1.24 Continued…
"The theoretical assumptions that scientists make are partially shaped by value judgments, and even the labels that scientists attach to their theoretical concepts are partially shaped by values" (Furr & Bacharach, 2008). For example, which labels are considered scientifically correct? Are the labels biased, or are the tests themselves biased? "Value judgments have potentially subtle (and sometimes not so subtle) influences on the scientific process. Proponents of consequential validity argue that such influences should be recognized and evaluated as clearly as possible in a testing context" (Furr & Bacharach, 2008). In other words, do values and/or personal judgments influence test constructs?

1.24 Continued…
• Some of the possible unintended consequences of testing are:
  1. The test may fail to measure the processes underlying a child's responses.
  2. The test cannot capture the multidimensional nature of intelligence with a single number quantifying IQ.
• There may also be errors in administration, such as failure to query, record responses verbatim, add individual subtest scores correctly, transform raw scores to scaled scores correctly, add scaled scores correctly, transform scaled scores to IQs correctly, or report composite scores and Full Scale IQ scores correctly. (A toy illustration of this scoring arithmetic appears after this section.)
• Some other unintended consequences of the WISC-IV may be: the failure to provide full conversion tables, the failure to provide a psychometric basis, differing numbers of children used to compute standardizations, a limited range of scores, limited criterion validity studies, possible difficulties in scoring responses, somewhat large practice effects, the poor quality of some test materials, occasionally confusing guidelines, and the inclusion of Cancellation as a subtest (Sattler, 2008).

Additional studies to support the validity of a test are always necessary when testing children, especially children with special needs. Two studies examined the performance of children with special needs on the WISC-III and WISC-IV. Mayes and Calhoun (2006) "studied a sample of 118 children with attention-deficit/hyperactivity disorder. They obtained a mean Full Scale IQ of 108.0 with a range of 24 points between their Processing Speed Composite and Perceptual Reasoning Composite and the same range between their Working Memory Composite and Perceptual Reasoning Composite." In a similar study, "Falk, Silverman, and Moran (2004) studied a sample of 103 intellectually gifted children. They obtained a Full Scale IQ of 127.2, with a range of 27 points between their Processing Speed Composite and Perceptual Reasoning Composite" (Sattler, 2008).
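As referenced in the administration-errors bullet above, here is a toy illustration of the scoring arithmetic those errors can corrupt: summing subtest scaled scores and converting the sum to a composite score. The scaled scores and the conversion values are invented placeholders, not the actual WISC-IV norm tables.

    # Hypothetical scaled scores for three Perceptual Reasoning subtests
    scaled_scores = {"Block Design": 11, "Picture Concepts": 9, "Matrix Reasoning": 12}
    sum_of_scaled = sum(scaled_scores.values())  # error-prone step: adding scaled scores

    # Invented lookup standing in for the real sum-to-composite conversion table
    composite_table = {30: 98, 31: 100, 32: 102, 33: 104}
    composite = composite_table[sum_of_scaled]   # error-prone step: table conversion
    print(f"Sum of scaled scores: {sum_of_scaled} -> composite score: {composite}")

A one-point slip at either step (a miscopied scaled score or a misread table row) propagates directly into the reported composite, which is why these mechanical steps appear on the list of unintended consequences.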
References
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1999). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.
Campbell, D. T., & Fiske, D. W. (1959). Convergent and discriminant validation by the multitrait-multimethod matrix. Psychological Bulletin, 56, 81-105.
Falk, R. F., Silverman, L. K., & Moran, D. (2004, November). Using two WISC-IV indices to identify the gifted. Paper presented at the 51st Annual Convention of the National Association for Gifted Children, Salt Lake City, UT. Retrieved March 9, 2007, from http://www.gifteddevleopment.com/PDF_files/WISC-IVindices.pdf
Furr, R. M., & Bacharach, V. R. (2008). Psychometrics: An introduction. Thousand Oaks, CA: Sage Publications.
Mayes, S. D., & Calhoun, S. L. (2006). WISC-IV and WISC-III profiles in children with ADHD. Journal of Attention Disorders, 9, 486-493.
Sattler, J. M. (2008). Assessment of children: Cognitive foundations (5th ed.). San Diego, CA: Author.
Tellegen, A., & Waller, N. G. (in press). Exploring personality through test construction: Development of the Multidimensional Personality Questionnaire. Minneapolis: University of Minnesota Press.
Wechsler, D. (2004). WISC-IV technical and interpretive manual. San Antonio, TX: Psychological Corporation.
Williams, P., Weiss, L., & Rolfhus, E. (2003). WISC-IV technical report #2: Psychometric properties. San Antonio, TX: Psychological Corporation.