Test Construction and Reliability: WISC-IV
By Jill Hutzel, A.M., K.W., & L.K.

What Does This Test Measure (3.2)?
• The Wechsler Intelligence Scale for Children-Fourth Edition (WISC-IV) was designed to measure intellectual functioning in specific cognitive areas:
• Verbal Comprehension
• Perceptual Reasoning
• Working Memory
• Processing Speed
• The test also provides a composite score (e.g., the Full Scale IQ) that represents a child's general intellectual ability.
These four index scores measure a child's overall:
• Crystallized Ability (Gc): acquired skills and knowledge that are developmentally dependent on exposure to the culture
• Visual Processing (Gv): a facility for visualizing and manipulating figures and responding appropriately to spatial forms
• Fluid Reasoning (Gf): a broad pattern of reasoning, seriation, sorting, and classifying
• Processing Speed (Gs): an ability to scan and react to simple tasks rapidly (Sattler, 2008)

What Are the Test Specifics (3.3)?
• Age of examinees: 6:0 to 16:11
• Number of subtests: 10 core subtests across 4 indexes:
  (VCI) Similarities, Vocabulary, Comprehension
  (PRI) Block Design, Picture Concepts, Matrix Reasoning
  (WMI) Digit Span, Letter-Number Sequencing
  (PSI) Coding, Symbol Search
• Number of supplemental subtests: 5:
  (VCI) Information, Word Reasoning
  (PRI) Picture Completion
  (WMI) Arithmetic
  (PSI) Cancellation
• Administration time: approximately 65 to 80 minutes
• Qualification of examiners: graduate- or professional-level training in psychological assessment

Procedures to Norm and Standardize Test Scores (3.4)
• The test was developed in five general stages: Conceptual Development, Pilot, National Tryout, Standardization, and Final Assembly and Evaluation.
• Sample size: 2,200 children ages 6:0 to 16:11 (the Arithmetic subtest was normed on a subsample of 1,100 children, 100 per age group).
• To provide evidence of the scale's validity, additional children were administered the WISC-IV along with other cognitive
measures, including the WISC-III, WAIS-III, WPPSI-III, WASI, WIAT-II, CMS, GRS, BarOn EQ, and ABAS-II.

Description of the Sample
• To ensure the standardization sample included representative proportions of children, it was stratified on selected demographic variables: sex, age, race/ethnicity, parent education level, and geographic region. Researchers used the March 2000 census data from the U.S. Bureau of the Census.
• Age: 2,200 children divided into 11 age groups (6:0-6:11, 7:0-7:11, … 16:11), with 200 participants in each.
• Sex: equal numbers of males and females in each age group (100 of each).
• Race: the proportions of racial groups were based on the racial proportions of children within the corresponding age group of the U.S. population according to the census.
• Parent education level: the sample was divided according to five parent education levels based on years of education completed.
• Geographic region: divided into the four major geographic regions specified by the census reports.

Procedures Used to Develop Test Items (3.5)
(Conduct and document review by relevant, independent experts, including the review process and the experts' qualifications, relevant experience, and demographics)
• Specific procedures were used in the WISC-IV research program to optimize the quality of the obtained data and to assist in formulating the final scoring criteria.
• One of the first steps was to recruit examiners with extensive experience testing children and adolescents. Potential examiners completed a questionnaire supplying information about their educational and professional experience, administration experience with various intellectual measures, and certification and licensing status. The majority were certified or licensed professionals working in private or public facilities.
• Potential standardization examiners were provided training materials consisting of a training video, a summary of common administration and scoring errors, and a two-part training quiz.
The content of the training quiz included questions on administration and scoring rules, as well as a task requiring the examiner to identify administration and scoring errors in a fictitious test protocol.

3.5 continued…
• Selected examiners scored at least 90% correct on both parts of the training quiz. Any errors or omissions on the training quiz were reviewed with the examiner. As an oversight measure, examiners were required to submit a review case prior to testing additional children. Every attempt was made to discuss administration and scoring errors on the review case with the examiner within 48 hours of its submission. Subsequent cases were reviewed within 72 hours of receipt when possible, and any errors resulting in loss or inaccuracy of data were discussed with the examiner. A periodic newsletter was sent to all examiners, alerting them to potentially problematic areas.
• All scorers had a minimum of a bachelor's degree and attended a 5-day training program led by members of the research team. Scorers were required to score at least 90% correct on a quiz that required them to identify scoring errors in a fictitious protocol. Each protocol collected during the national tryout and standardization stages of development was rescored and entered into a database by two qualified scorers working independently. Any discrepancies between the two scorers were resolved daily by a third scorer (resolver). Resolvers were chosen based on their demonstrated exceptional scoring accuracy and previous scoring experience.

3.6 (Empirical analyses and/or expert judgment as to the appropriateness of test items, content, and response formats for different groups of test takers)
• To ensure the validity of the WISC-IV, 16 special group studies were conducted during the WISC-IV standardization. The results of the special group studies provide support for the validity and clinical utility of the WISC-IV.
The majority of results are consistent with expectations based on previous research and the theoretical foundations of the scale's development. It is expected that future investigations using the WISC-IV in different clinical settings and populations will provide additional evidence of the scale's utility for clinical diagnosis and intervention purposes.

3.7 (Procedures used to develop, review, and try out items from the item pool)
• Early in the development process, 45 assessment professionals from eight major cities met in focus groups with members of a marketing research firm to refine revision goals and assist in formulating the scale's working blueprint.
• A telephone survey (N = 308) was also conducted with users of the WISC-III as well as professionals in child and adolescent assessment. The research team, an advisory panel composed of nationally recognized experts in school psychology and clinical neuropsychology, and clinical measurement consultants from Harcourt Assessment reviewed the feedback from the focus groups and telephone survey. Based on the findings, the working blueprint was established and the first research version of the scale was developed for use in the initial pilot study.

3.7 continued…
• The primary goal of the pilot stage was to produce a version of the scale for use in the subsequent national tryout stage. A number of research questions were addressed through a series of five pilot studies (N = 255, 151, 110, 389, and 197) and three mini pilot studies (N = 31, 16, and 34). Each of these studies used a research version of the scale that included various groupings of subtests retained from the WISC-III along with new, experimental subtests being considered for inclusion at the national tryout stage.
• The primary research questions at this stage of development focused on issues such as the content and relevance of items, adequacy of subtest floors and ceilings, clarity of instructions to the examiner and child, identification of response processes, administration procedures, scoring criteria, item bias, and other relevant psychometric properties.

3.8 (Selection procedures and demographics of item tryout and/or standardization sample)
• The national tryout stage used a version of the scale with 16 subtests. Data were obtained from a stratified sample of 1,270 children who reflected key demographic variables in the national population. An analysis by the U.S. Bureau of the Census (1998) provided the basis for stratification along the following variables: age, sex, race, parent education level, and geographic region.
• Using this larger, more representative sample of children, research questions from the pilot phase were reexamined and additional issues were addressed. Refinements to the item order were made based on more precise estimates of the items' relative difficulty, and exploratory and confirmatory factor analyses were conducted to determine the underlying factor structure of the scale.
• In addition, data were collected at this stage from a number of special groups (children identified as intellectually gifted, children with intellectual disability or learning disorders, and children with ADHD) to provide additional evidence regarding the adequacy of the subtest floors and ceilings, as well as the clinical utility of the scale. An oversample of 252 African American children and 186 Hispanic children was collected to allow for a statistical examination of item bias using IRT methods of analysis.

3.8 continued…
• After reviewing the accumulated evidence from the pilot and national tryout studies, a standardization edition of the WISC-IV was created.
• The standardization sample consisted of 2,200 children divided into 11 age groups, each consisting of 200 participants. Stratification was based on data collected by the U.S. Bureau of the Census in March 2000 along the variables of age, sex, race/ethnicity, parent education level, and geographic region.
• For each age group, the proportions of Whites, African Americans, Hispanics, Asians, and other racial groups were based on the racial proportions of children within the corresponding age group of the U.S. population according to the March 2000 census data.
• The sample was stratified according to five parent education levels based on the number of years of school completed. If the child resided with only one parent or guardian, the educational level of that parent or guardian was assigned. If the child resided with two parents, a parent and a guardian, or two guardians, the average of both individuals' educational levels was used, with partial levels rounded up to the next highest level.

Evidence for Internal Consistency (2.7)
According to the Technical Manual…
• The evidence for internal consistency was obtained using the normative sample and the split-half method. "The split-half method is done by sorting the items on a test into two parallel subtests of equal size. Then you compute a composite score for each subtest and correlate the two composite scores. By doing so, you have created two parallel tests from the items within one test. It is possible to use these subtest scores to compute an estimate of total test reliability" (Furr & Bacharach, 2008).
• As stated in the WISC-IV Technical Manual, the split-half method was used on all subtests except Coding, Symbol Search, and Cancellation, because these are Processing Speed subtests. Test-retest stability coefficients were therefore used as the reliability estimates for these particular subtests.
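The split-half procedure quoted above can be sketched in Python. This is an illustrative sketch, not the publisher's implementation: the odd/even item split and the Spearman-Brown step-up are standard textbook choices (Furr & Bacharach, 2008), and the function name and sanity-check data are invented here.

```python
import numpy as np

def split_half_reliability(item_scores):
    """Split-half reliability: sort the items into two half-tests
    (odd vs. even positions here), compute a composite score for each
    half, correlate the two composites, then step the half-test
    correlation up to full-test length with the Spearman-Brown formula."""
    scores = np.asarray(item_scores, dtype=float)
    odd_half = scores[:, 0::2].sum(axis=1)   # composite of items 1, 3, 5, ...
    even_half = scores[:, 1::2].sum(axis=1)  # composite of items 2, 4, 6, ...
    r_half = np.corrcoef(odd_half, even_half)[0, 1]
    return 2 * r_half / (1 + r_half)         # Spearman-Brown step-up

# Sanity check: if every item ranks the examinees identically, the two
# half scores correlate perfectly and the full-test estimate is 1.0.
perfect = np.tile(np.arange(6).reshape(-1, 1), (1, 8))
print(split_half_reliability(perfect))  # → 1.0
```

The Spearman-Brown step is what turns the correlation between two half-length tests into an estimate for the full-length test; without it, the raw half-test correlation would understate the reliability of the complete subtest.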
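The test-retest stability coefficients mentioned above are Pearson product-moment correlations between the first and second administrations; elsewhere in the manual, coefficients from the five age bands are pooled using Fisher's z transformation. A minimal sketch of both computations follows; all scores and coefficients here are invented for illustration and are not WISC-IV values.

```python
import math

def pearson_r(x, y):
    """Pearson's product-moment correlation: r = SP / sqrt(SSx * SSy),
    where SP is the sum of cross-products of deviation scores and
    SSx, SSy are the sums of squared deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sp = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ssx = sum((a - mx) ** 2 for a in x)
    ssy = sum((b - my) ** 2 for b in y)
    return sp / math.sqrt(ssx * ssy)

def fisher_average(rs):
    """Pool several correlations: transform each r to Fisher's z
    (z = atanh r), average the z values, and back-transform (tanh)."""
    zs = [math.atanh(r) for r in rs]
    return math.tanh(sum(zs) / len(zs))

# Invented first/second-administration scaled scores for one subtest
first = [8, 10, 12, 9, 14, 11, 13]
second = [9, 11, 13, 9, 15, 12, 13]
stability = pearson_r(first, second)

# Invented stability coefficients for five age bands, pooled via Fisher's z
overall = fisher_average([0.86, 0.89, 0.91, 0.90, 0.88])
print(round(overall, 2))  # → 0.89
```

Averaging through Fisher's z rather than averaging the raw correlations is preferred because the sampling distribution of z is approximately normal, while that of r is skewed for large correlations.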
2.7 continued…
• The reliability coefficients for the WISC-IV composite scales range from .88 (Processing Speed) to .97 (Full Scale). These coefficients are generally higher than those of the individual subtests that make up the composite scales. The average reliability coefficient for the Processing Speed composite is slightly lower (.88) because it is based on test-retest reliabilities, which tend to be lower than split-half reliabilities. The reliability coefficients for the WISC-IV composite scales are identical to or slightly better than those of the corresponding WISC-III scales.
• The evidence of internal consistency reliability was obtained using the split-half method with children ranging in age from 6 to 16. The overall average coefficients are as follows: Verbal Comprehension (VCI) .94, Perceptual Reasoning (PRI) .92, Working Memory (WMI) .92, Processing Speed (PSI) .88, and Full Scale (FSIQ) .97.

Test-Retest Approaches (2.9)
(Are alternate-form or test-retest approaches used, and if so, what were the results? Were separately timed administrations used to investigate a practice effect, and if so, what were the results? Additional information includes procedures used to estimate this type of reliability)
• Yes, a test-retest approach was used. According to Wechsler (2004), the sample consisted of 243 children, with 18 to 27 participants in each of the 11 age groups. Each participant was given two separate WISC-IV administrations, with 13 to 63 days between test and retest (mean interval of 32 days).

2.9 continued…
• The sample consisted of: 52.3% female vs.
47.7% male; 74.1% White, 7.8% African American, 11.1% Hispanic, 7.0% other.
• Parent education level: 0-8 years: 4.9%; 9-11 years: 9.1%; 12 years: 25.9%; 13-15 years: 36.2%; 16 or more years: 23.9%.

2.9 continued…
• Pearson's product-moment correlation was used to estimate test-retest reliability for five age groups: 6-7, 8-9, 10-11, 12-13, and 14-16 (Wechsler, 2004):

  r = SP / sqrt(SSx * SSy)

• Table 4.4 in the WISC-IV Integrated Technical and Interpretive Manual displays (Williams, Weiss, & Rolfhus, 2003):
  - Mean subtest scaled scores and composite scores with standard deviations
  - Standard differences (effect sizes) between the first and second testings
  - Correlation coefficients corrected for the variability of the standardization sample

2.9 continued…
• Fisher's z transformation was used to calculate test-retest coefficients for the overall sample (Wechsler, 2004).
• The standard difference was calculated as the mean score difference between the first and second testing sessions divided by the pooled standard deviation.
• Effect size: a measure intended to capture the absolute magnitude of a treatment effect, independent of the size of the sample(s) being used (Gravetter & Wallnau, 2009). Cohen's d = mean difference / standard deviation.
• Comprehension had the smallest effect size (.08), Picture Completion the largest (.60); the FSIQ effect size was .46.

2.9 continued…
• RESULTS: The WISC-IV scores have adequate stability across time for all five age groups (Wechsler, 2004).
• Corrected stability coefficients:
  - Excellent (.92): Vocabulary
  - Good (.80): Block Design, Similarities, Digit Span, Coding, Letter-Number Sequencing, Matrix Reasoning, Comprehension, Symbol Search, Picture Completion, Information, Word Reasoning
  - Adequate (.70): other subtests
• Composite scores have better stability than individual subtest scores: good (.80) or better.

2.9 continued…
• Retest score means for the WISC-IV subtests are higher than the scores from
the first testing session, possibly due to practice effects arising from the short interval between test and retest.
• Practice effects: retest gains were smaller for the VCI and WMI subtests than for the PRI and PSI subtests.
• Score differences between test and retest, primarily due to practice effects: VCI +2.1, PRI +5.2, WMI +2.6, PSI +7.1, FSIQ +5.6.

Stability of the WISC-IV in a Sample of Elementary and Middle School Children
• Ryan, Glass, and Bartels (2010) investigated test-retest stability of the WISC-IV in 43 elementary and middle school students in a rural location, tested on two separate occasions roughly 11 months apart.
• They believed that the stability found in the WISC-IV standardization sample does not generalize to clinically realistic test-retest intervals or to other populations.

2.9 continued…
• Participants: 76 students from a small private school in a Midwestern community; 43 were retested (25 female, 18 male).
• Stability coefficients ranged from .26 (Picture Concepts) to .84 (Vocabulary); the FSIQ coefficient was .88.

2.9 continued…
• Results: stability coefficients from the standardization sample were slightly larger than those from this study (Ryan et al., 2010): FSIQ .91 in the standardization sample vs. .88 in this sample. As in the standardization sample, composite scores were slightly more stable than individual subtest scores, with the FSIQ the most stable.
• Ryan et al.
(2010) believe that the 11-month test-retest interval, compared with the 32-day interval, accounted for an overall smaller stability coefficient and an overall smaller practice effect.
• This study supports Wechsler's statistical evidence that (Ryan et al., 2010):
  - The FSIQ is the most stable score provided by the WISC-IV over time.
  - Over long test-retest intervals, only the FSIQ has sufficient stability for interpretation.
  - Individual subtest scores should NOT be used for diagnostic and/or decision-making purposes.

Evidence Provided for Both Interrater Consistency & Consistency Over Repeated Measurements (2.10)
According to the WISC-IV Technical Manual…
• The test-retest sample for the WISC-IV was composed of 243 children, with 18-27 participants from each of the 11 age groups.
• The WISC-IV was given once to all of the children, then administered a second time 13-63 days later (mean interval of 32 days). The sample was 52.3% female and 47.7% male.
• "The test-retest reliability was calculated for five age groups (6:0-7:11, 8:0-9:11, 10:0-11:11, 12:0-13:11, and 14:0-16:11) using Pearson's product-moment correlation." The test-retest coefficients for the overall sample were calculated using Fisher's z transformation.
• "The standard difference was calculated using the mean score difference between two testings divided by the pooled standard deviation."
• The retest means for all seven scaled process scores are higher than those from the first testing, with effect sizes ranging from .14 to .41. "In general, test-retest gains are less pronounced for the process scores in the Working Memory domain than for the process scores in the Perceptual and Processing Speed domains" (p.
136 of the Technical Manual).

2.10 continued…
• In a study by Ryan, Glass, and Bartels, 76 students in a Midwestern community took the WISC-IV; 43 of the students agreed to take a second WISC-IV examination, and those 43 students were the participants in the investigation.
• According to Ryan, Glass, and Bartels (2010), all but one of the dependent-samples t-tests failed to find significant differences between scores from the first WISC-IV administration and the second.
• "Stability coefficients in the present sample were consistently smaller than those reported in the WISC-IV Technical and Interpretive Manual (Wechsler, 2003b) for children 8 to 9 years of age."
• The study did have limitations: it was conducted in a rural community composed mainly of white students attending a private school, so it is not a good representation of an ethnically diverse population (Ryan, Glass, & Bartels, 2010).

References
• Furr, R. M., & Bacharach, V. R. (2008). Psychometrics: An introduction. Thousand Oaks, CA: Sage Publications. ISBN 978-1-412-927604.
• Gravetter, F., & Wallnau, L. (2009). Statistics for the behavioral sciences (8th ed.). Belmont, CA: Wadsworth Cengage Learning.
• Ryan, J., Glass, L., & Bartels, J. (2010). Stability of the WISC-IV in a sample of elementary and middle school children. Applied Neuropsychology, 17, 68-72.
• Sattler, J. M. (2008). Assessment of children: Cognitive foundations (5th ed.). San Diego, CA: Author.
• Wechsler, D. (2004). WISC-IV Technical and Interpretive Manual. San Antonio, TX: Psychological Corporation.
• Williams, P., Weiss, L., & Rolfhus, E. (2003). WISC-IV Technical Report #1: Psychometric properties. San Antonio, TX: Psychological Corporation.
• Williams, P., Weiss, L., & Rolfhus, E. (2003). WISC-IV Technical Report #2: Psychometric properties. San Antonio, TX: Psychological Corporation.