Test Construction and
By Jill Hutzel, A.M, K.W & L.K
What Does This Test Measure (3.2)?
• The Wechsler Intelligence Scale for Children-Fourth
Edition (WISC-IV) was designed to measure
intellectual functioning in specific cognitive areas such as:
• Verbal Comprehension
• Perceptual Reasoning
• Working Memory
• Processing Speed
This test also provides a composite score (e.g., Full Scale IQ)
that represents a child’s general intellectual ability.
These Four Index Scores
measure a child’s overall:
Crystallized Ability (Gc)- acquired skills and knowledge
that are developmentally dependent on exposure to the culture
Visual Processing (Gv)- a facility for visualizing and
manipulating figures and responding appropriately to spatial forms
Fluid Reasoning (Gf)- a broad pattern of reasoning, seriation,
sorting, and classifying
Processing Speed (Gs)- an ability to scan and react to simple
tasks rapidly
(Sattler, 2008)
What Are the Test Specifics (3.3)?
Age of Examinees- 6:0- 16:11
Number of Subtests- 10 subtests and 4 indexes including:
(VCI) Similarities, Vocabulary, Comprehension
(PRI) Block Design, Picture Concepts, Matrix Reasoning
(WMI) Digit Span, Letter-Number Sequencing
(PSI) Coding & Symbol Search
Number of Supplemental Subtests- 5 including:
(VCI) Information & Word Reasoning
(PRI) Picture Completion
(WMI) Arithmetic
(PSI) Cancellation
Administration time is approximately 65 to 80 minutes
Qualification of Examiners- Graduate or Professional level of
training in psychological assessment
Procedures to Norm and Standardize Test Scores
This test was developed in 5 general stages: Conceptual Development, Pilot, National Tryout,
Standardization, and Final Assembly and Evaluation
Sample size: 2,200 children ages 6:0 to 16:11 (the Arithmetic subtest was normed on a subsample of
1,100 children, 100 children per age group)
To provide evidence of the scale’s validity, additional children were administered the WISC-IV
and other cognitive measures, including the WISC-III, WAIS-III, WPPSI-III, WASI, WIAT-II, CMS, GRS,
BarOn EQ, and ABAS-II
Description of the Sample- To ensure that the standardization sample included representative
proportions of children according to selected demographic variables (sex, age,
race/ethnicity, parent education level, and geographic region), researchers used March 2000
data from the U.S. Bureau of the Census
AGE- 2,200 children divided into 11 age groups, with 200 participants in each age group (e.g., 6:0-6:11,
7:0-7:11… 16:11)
SEX- Equal number of males and females in each age group (100 in each group)
RACE- The proportions of racial groups were based on the racial proportions of children within
that age group of the US population according to the Census
Parent Education Level- The sample was divided according to 5 parent education levels based on
years of education completed
Geographic Region- Divided into the 4 major geographic regions specified by the census reports
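The age-by-sex layout described above can be checked with a few lines of arithmetic. This is only a minimal sketch, not the publisher's sampling procedure: the names `AGE_BANDS` and `build_design` are hypothetical, and only the counts (11 age groups, 200 children per group, 100 per sex) come from the slide.

```python
# Sketch of the standardization design: 11 one-year age bands, 200 children each,
# split evenly by sex. Names are illustrative; the counts come from the slide.

AGE_BANDS = [f"{years}:0-{years}:11" for years in range(6, 17)]  # 6:0-6:11 ... 16:0-16:11
PER_SEX_PER_BAND = 100


def build_design():
    """Return a dict mapping (age_band, sex) to the target number of children."""
    return {
        (band, sex): PER_SEX_PER_BAND
        for band in AGE_BANDS
        for sex in ("female", "male")
    }


design = build_design()
total_children = sum(design.values())
children_per_band = total_children // len(AGE_BANDS)

print(len(AGE_BANDS), children_per_band, total_children)  # prints: 11 200 2200
```

Race/ethnicity, parent education level, and geographic region would add further cells to this grid, with targets taken from the census proportions rather than split evenly.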
Procedures Used to Develop Test Items
3.5 (Conduct and document review by relevant, independent experts, including
review process and experts’ qualifications, relevant experiences, and demographics)
• Specific procedures were utilized in the WISC IV research program to optimize the
quality of obtained data and to assist in the formulation of final scoring criteria.
• One of the first steps was to recruit examiners with extensive experience testing
children and adolescents. Potential examiners completed a questionnaire by
supplying information about their educational and professional experience,
administration experience with various intellectual measures, certification, and
licensing status. The majority were certified or licensed professionals working in
private or public facilities.
• Potential standardization examiners were provided training material, which
consisted of a training video, a summary of common administration and scoring
errors, and a two-part training quiz. The content of the training quiz included
questions on administration and scoring rules as well as a task that required the
examiner to identify administration and scoring errors in a fictitious test protocol.
3.5 continued…
• Selected examiners scored at least 90% correct on both parts of the training quiz.
Any errors or omissions on the training quiz were reviewed with the examiner. As an
oversight measure, examiners were required to submit a review case prior to testing
additional children. Every attempt was made to discuss administration and scoring
errors on the review case with the examiner within 48 hours of its submission.
Subsequent cases were reviewed within 72 hours of receipt if possible, and any
errors resulting in loss or inaccuracy of data were discussed with the examiner. A
periodic newsletter was sent to all examiners, alerting them to potentially
problematic areas.
• All scorers had a minimum of a bachelor’s degree and attended a 5-day training
program led by members of the research team. Scorers were required to score at
least 90% correct on a quiz that required them to identify scoring errors in a
fictitious protocol. Each protocol collected during the national tryout and
standardization stages of development was rescored and entered into a database by
two qualified scorers working independently. Any discrepancies between the two
scorers were resolved daily by a third scorer (resolver). The resolvers were chosen
based on their demonstration of exceptional scoring accuracy and previous scoring experience.
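The double-scoring step described above (two independent scorers, with a resolver settling disagreements) can be pictured with a small sketch. Everything here is hypothetical: the function names, the item identifiers, and the scores are made up for illustration and do not reflect the research team's actual database.

```python
# Sketch of double scoring with a resolver. All names and data are hypothetical.

def find_discrepancies(scorer_a, scorer_b):
    """Return the item ids on which two independent scorers disagree."""
    return sorted(item for item in scorer_a if scorer_a[item] != scorer_b[item])


def resolve(scorer_a, scorer_b, resolver_scores):
    """Keep agreed scores; use the resolver's score where the two scorers differ."""
    final = {}
    for item, score in scorer_a.items():
        if score == scorer_b[item]:
            final[item] = score
        else:
            final[item] = resolver_scores[item]
    return final


# Made-up protocol: the two scorers disagree on item_2 only.
scorer_a = {"item_1": 2, "item_2": 1, "item_3": 0}
scorer_b = {"item_1": 2, "item_2": 0, "item_3": 0}
resolver_scores = {"item_2": 1}

flagged = find_discrepancies(scorer_a, scorer_b)
final_scores = resolve(scorer_a, scorer_b, resolver_scores)
print(flagged)
print(final_scores)
```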
3.6 (Empirical analyses and/or expert judgment as to the appropriateness of test
items, content, and response formats for different groups of test takers)
• To support the validity of the WISC-IV, 16 special
group studies were conducted during the WISC-IV
standardization. The results from the special group
studies provide support for the validity and clinical utility of
the WISC-IV. The majority of results are consistent
with expectations based on previous research and
the theoretical foundations of the scale’s development.
It is expected that future investigations utilizing the
WISC-IV in different clinical settings and populations
will provide additional evidence of the scale’s utility
for clinical diagnosis and intervention purposes.
3.7 (Procedures used to develop, review, and tryout items from item pool)
• Early in the development process, 45 assessment
professionals from eight major cities met in a focus group with
members of a marketing research firm to refine revision goals
and assist in the formulation of the scale’s working blueprint.
• Also, a telephone survey (N = 308) was conducted with users of
the WISC-III as well as professionals in child and adolescent
assessment. The research team, an advisory panel
composed of nationally recognized experts in school psychology
and clinical neuropsychology, and clinical measurement
consultants from Harcourt Assessment reviewed the
feedback from the focus groups and telephone surveys. Based
on the findings, the working blueprint was established and the
first research version of the scale was developed for use in
the initial pilot study.
3.7 continued…
• The primary goal of the pilot stage was to produce a version of the
scale for use in the subsequent national tryout stage. A number of
research questions were addressed through a series of five pilot
studies (N = 255, 151, 110, 389, and 197) and three mini-pilot
studies (N = 31, 16, and 34). Each of these studies utilized a research
version of the scale that included various groupings of subtests
retained from the WISC III and new, experimental subtests that
were being considered for inclusion at the national tryout stage.
• The primary research questions at this stage of development
focused on such issues as content and relevance of items, adequacy
of subtest floors and ceilings, clarity of instructions to the
examiner and child, identification of response processes,
administration procedures, scoring criteria, item bias and other
relevant psychometric properties.
3.8 (selection procedures and demographics of item tryout and/or
standardization sample)
The national tryout stage utilized a version of the scale with 16 subtests. Data
were obtained from a stratified sample of 1,270 children who reflected key
demographic variables in the national population. An analysis of data gathered by the
U.S. Bureau of the Census (1998) provided the basis for stratification along the
following variables: age, sex, race, parent education level, and geographic region.
Using this larger, more representative sample of children, research questions
from the pilot phase were reexamined, and additional issues were
addressed. Refinements to the item order were made based on more precise
estimates of their relative difficulty level, and exploratory and confirmatory
factor analyses were conducted to determine the underlying factor structure
of the scale.
In addition, data were collected at this stage from a number of special groups
(children identified as intellectually gifted, children with intellectual disability
or learning disorders, and children with ADHD) to provide additional evidence
regarding the adequacy of the subtest floors and ceilings, as well as clinical
utility of the scale. An oversample of 252 African American children and 186
Hispanic children was collected to allow for a statistical examination of item
bias using IRT methods of analysis.
3.8 Continued…
• After reviewing the accumulated evidence from the pilot and national
tryout studies, a standardization edition of the WISC-IV was created.
• The standardization sample consisted of 2,200 children who were divided
into 11 age groups, each consisting of 200 participants.
An analysis of data gathered by the U.S. Bureau of the Census in March 2000
provided the basis for stratification along the variables of age, sex, race/ethnicity,
parent education level, and
geographic region. For each age group, the proportions of Whites, African
Americans, Hispanics, Asians, and other racial groups were based on the
racial proportions of children within the corresponding age group of the
U.S. population according to March 2000 census data. The sample was
stratified according to five parent education levels based on the number
of years of school completed. If the child resided with only one parent or
guardian, the educational level of that parent or guardian was assigned. If
the child resided with two parents, a parent and a guardian, or two
guardians, the average of both individuals’ educational levels was used,
with partial levels rounded up to the next highest level.
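The parent education rule above (one caregiver: use that level; two caregivers: average and round partial levels up) can be written as a short function. This is only a sketch under stated assumptions: the function name `assigned_education_level` is hypothetical, and coding the five levels as the integers 1 to 5 is an assumption made for illustration.

```python
import math


def assigned_education_level(levels):
    """Return the education level assigned to a child.

    `levels` holds one or two integer codes (assumed here to run from 1 to 5),
    one per parent or guardian living with the child.
    """
    if len(levels) == 1:
        # Single parent or guardian: that person's level is assigned.
        return levels[0]
    if len(levels) == 2:
        # Two caregivers: average the levels, rounding partial levels up.
        return math.ceil(sum(levels) / 2)
    raise ValueError("expected one or two education levels")


print(assigned_education_level([4]))     # one caregiver at level 4
print(assigned_education_level([2, 3]))  # average 2.5 rounds up
print(assigned_education_level([3, 5]))  # average is exactly 4
```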
Evidence for Internal Consistency
According to the Technical Manual…
• The evidence for internal consistency was obtained using the
normative sample and the split-half method. “The split-half
method is done by sorting the items on a test into two parallel
subtests of equal size. Then you compute a composite score for
each subtest and correlate the two composite scores. By doing so,
you have created two parallel tests from the items within one test.
It is possible to use these subtest scores to compute an estimate
of total test reliability” (Furr & Bacharach, 2008).
• As stated in the WISC-IV technical manual, the split-half method
was used on all subtests except Coding, Symbol Search, and
Cancellation, because these are speeded (Processing Speed) subtests,
for which split-half estimates are not appropriate.
Test-retest stability coefficients were therefore used as the
reliability estimates for these particular subtests.
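The split-half procedure quoted above can be sketched as follows. Two things here are assumptions rather than details from the slide: the items are split odd versus even, and the half-test correlation is stepped up to full-test length with the Spearman-Brown formula, which is the conventional way to turn a split-half correlation into a reliability estimate. The item scores are made up.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sp = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return sp / math.sqrt(ss_x * ss_y)


def split_half_reliability(item_scores):
    """Return (half-test correlation, Spearman-Brown corrected estimate).

    `item_scores` is a list with one list of item scores per child.
    Items in odd positions form one half and items in even positions the other.
    """
    half_a = [sum(child[0::2]) for child in item_scores]
    half_b = [sum(child[1::2]) for child in item_scores]
    r_half = pearson_r(half_a, half_b)
    r_full = (2 * r_half) / (1 + r_half)  # Spearman-Brown step-up
    return r_half, r_full


# Made-up data: 6 children, 6 items scored 0 or 1.
scores = [
    [1, 1, 1, 1, 1, 0],
    [1, 1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 0, 0],
]
r_half, r_full = split_half_reliability(scores)
print(round(r_half, 3), round(r_full, 3))
```

The corrected estimate is larger than the half-test correlation because each half contains only half as many items as the full subtest.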
2.7 continued…
• The reliability coefficients for the WISC IV composite scales range
from .88 (Processing Speed) to .97 (Full Scale). These coefficients
are generally higher than those of the individual subtests that
comprise these composite scales. The average reliability coefficient
for the Processing Speed composite scale is slightly lower (.88)
because it is based on test-retest reliabilities, which tend to be
lower than split-half reliabilities. The reliability coefficients for
the WISC-IV composite scales are identical to or slightly better than
those of the corresponding scales in the WISC-III.
• The evidence of Internal Consistency Reliability was obtained using
the split-half method with a group of children ranging in
age from 6 to 16. The overall averages for these special clinical
groups are as follows: Verbal Comprehension (VCI) .94, Perceptual
Reasoning (PRI) .92, Working Memory (WMI) .92, Processing Speed
(PSI) .88, and Full Scale (FSIQ) .97.
Test-Retest Approaches (Are alternate-form or test-retest approaches used, and if so, what were the results? Were separately timed
administrations used to investigate a practice effect, and if so, what were the
results? Additional information includes procedures used to estimate this type of
• Test-Retest Approaches
Yes. A test-retest approach was used.
According to Wechsler (2004), the sample consisted of:
243 children
18 to 27 participants in each of the 11 age groups
Each participant was given two separate WISC-IV administrations:
Ranging from 13 to 63 days between test and retest
(Mean interval of 32 days)
2.9 Continued…
• The sample consisted of:
•52.3% Female vs. 47.7% Male
•74.1% White
•7.8% African American
•11.1% Hispanic
•7.0% Other
• The Parent Education Level:
•0 – 8 Years (Y): 4.9%
•9 – 11 Y: 9.1%
•12 Y: 25.9%
•13 – 15 Y: 36.2%
•> 16 Y: 23.9%
2.9 Continued…
• Used Pearson’s product-moment correlation to estimate
TEST-RETEST RELIABILITY for 5 different age groups
(Wechsler, 2004)
Age groups: 6-7, 8-9, 10-11, 12-13, 14-16
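r = SP / sqrt(SSx × SSy), where SP is the sum of products of deviations and SSx and SSy are the sums of squared deviations for the first and second testing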
Table 4.4 in the WISC-IV Integrated Technical and Interpretive
Manual displays:
• Mean subtest scaled scores and composite scores with SD
• Standard Differences (effect sizes) between the first and
second testings
• Correlation coefficients corrected for the variability of the
standardization sample
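The test-retest correlation and the variability correction mentioned above can be sketched as follows. The Pearson formula is the one given on the slide (SP over the square root of SSx times SSy). The correction is an assumption: the slide does not state which formula the manual used, so this sketch uses one common correction for differences in variability (often called Thorndike's Case 2). The scores are made up, and the reference SD of 3 is the standard deviation of Wechsler subtest scaled scores.

```python
import math
import statistics


def pearson_r(first, second):
    """Pearson r computed as SP / sqrt(SSx * SSy)."""
    n = len(first)
    mean_x = sum(first) / n
    mean_y = sum(second) / n
    sp = sum((x - mean_x) * (y - mean_y) for x, y in zip(first, second))
    ss_x = sum((x - mean_x) ** 2 for x in first)
    ss_y = sum((y - mean_y) ** 2 for y in second)
    return sp / math.sqrt(ss_x * ss_y)


def correct_for_variability(r, sample_sd, reference_sd):
    """Adjust r for a sample whose SD differs from the reference SD.

    Assumed formula (Thorndike Case 2); not confirmed as the manual's method.
    """
    u = reference_sd / sample_sd
    return (r * u) / math.sqrt(1 - r ** 2 + (r ** 2) * (u ** 2))


# Made-up scaled scores for five children at test and retest.
first_testing = [8, 10, 12, 9, 11]
second_testing = [9, 11, 12, 10, 13]

r = pearson_r(first_testing, second_testing)
sample_sd = statistics.stdev(first_testing)
r_corrected = correct_for_variability(r, sample_sd, reference_sd=3.0)
print(round(r, 3), round(r_corrected, 3))
```

When the retest sample is less variable than the standardization sample, the corrected coefficient is larger than the raw one; when the two SDs match, the correction leaves r unchanged.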
Continuation and Chart Follows
2.9 Continued…
• (Williams, Weiss, & Rolfhus, 2003)
• Used Fisher’s Z Transformation to calculate TEST-RETEST COEFFICIENTS for the Overall Sample
(Wechsler, 2004)
• Standard Difference calculated using:
(The mean score difference between the first and
second testing session) divided by (the pooled
standard deviation)
Effect Size – A measure intended to provide a
measurement of the absolute magnitude of a
treatment effect, independent of the size of the
sample(s) being used (Gravetter & Wallnau, 2009)
Cohen’s d = mean difference/standard deviation
Comprehension had the smallest effect size (.08),
Picture Completion had the largest (.60),
FSIQ had an effect size of (.46)
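The standard difference defined above (mean score difference divided by the pooled standard deviation) can be sketched as a short function. The slide does not say how the standard deviations were pooled, so this sketch assumes the square root of the average of the two variances, which applies when both testings include the same number of children. The input values are made up and are not taken from Table 4.4.

```python
import math


def pooled_sd(sd_first, sd_second):
    """Pooled SD for two testings with equal n (assumed pooling formula)."""
    return math.sqrt((sd_first ** 2 + sd_second ** 2) / 2)


def standard_difference(mean_first, sd_first, mean_second, sd_second):
    """Retest mean minus first-testing mean, divided by the pooled SD."""
    return (mean_second - mean_first) / pooled_sd(sd_first, sd_second)


# Made-up subtest values: scaled-score means of 10.0 and 10.9, SD of 3.0 at both testings.
d = standard_difference(mean_first=10.0, sd_first=3.0, mean_second=10.9, sd_second=3.0)
print(round(d, 2))  # prints: 0.3
```

A positive value means the retest mean was higher than the first-testing mean, which is the direction expected from practice effects.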
2.9 Continued…
The WISC-IV scores have adequate stability across time for all five age groups
(Wechsler, 2004)
• Corrected Stability Coefficient
•Excellent (.92)
•Good (.80)
-Block Design
-Digit Span
-Letter-Number Sequencing
-Matrix Reasoning
-Symbol Search
-Picture Completion
-Word Reasoning
•Adequate (.70)
-Other subtests
Composite Scores have better stability than individual subtest scores
Good (.80) or better
2.9 Continued…
• Retest score means for the subtests of the WISC-IV are higher than the
scores from the first testing session, possibly because of practice effects
arising from the short interval between test and retest
•Re-test gains were smaller for the VCI and WMI subtests compared to the
PRI and PSI subtests
•Score Differences between test-retest primarily due to practice effects:
VCI +2.1, PRI +5.2, WMI +2.6, PSI +7.1 , FSIQ +5.6
– Stability of the WISC-IV in a Sample of Elementary and Middle School Children
Ryan, Glass, and Bartels (2010) investigated test-retest stability of the WISC-IV in 43
elementary/middle school students in a rural location, tested on two separate occasions,
roughly 11 months apart
The authors believed that the stability found in the WISC-IV standardization sample does not
generalize to clinically realistic test-retest intervals and does not generalize to other
2.9 Continued…
– 76 students from a small private school in a Midwestern community
– 43 were retested
25 female
18 male
Stability Coefficients ranged from:
.26 (Picture Concepts)
.84 (Vocabulary)
.88 (FSIQ)
Table follows-
2.9 Continued…
• Results
2.9 Continued…
• Stability Coefficients from the standardization sample were slightly
larger than from the sample in this study (Ryan et al., 2010)
-FSIQ .91 in the Standardization Sample Vs. .88 in the
abovementioned sample
-Similar to the standardization sample, the
composite scores were slightly more stable than
individual subtest scores, with the FSIQ being the most stable
• Ryan et al. (2010) believe that the test-retest interval of 11 months,
compared to the 32 day test-retest interval, accounted for an overall
smaller stability coefficient and an overall smaller practice effect
• This study supports Wechsler’s statistical evidence that (Ryan et al., 2010):
-The FSIQ is the most stable score provided by the WISC-IV
over time
-During long test-retest intervals, only the FSIQ has sufficient
stability for interpretation
-Individual subtest scores should NOT be used for any
diagnostic and/or decision-making purposes
Evidence Provided for Both Interrater
Consistency & Consistency Over Repeated
Measurements (2.10)
According to the WISC IV Technical Manual…
The test-retest sample for the WISC-IV was composed of 243 children. There were 18-27
participants from each of the 11 age groups.
The WISC-IV was given one time to all of the children. The test was then administered a
second time anywhere from 13-63 days later. The mean interval was 32 days. The sample was
52.3% female and 47.7% male.
“The test-retest reliability was calculated for five age groups (6:0-7:11, 8:0-9:11, 10:0-11:11,
12:0-13:11, and 14:0-16:11) using Pearson’s product-moment correlation.” The coefficients of
the test-retest for the general sample were calculated using Fisher’s z transformation.
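Averaging the five age-group coefficients through Fisher's z transformation, as described above, can be sketched as follows. Weighting each group by n minus 3 is an assumption (it is the usual weight for Fisher z values); the slide does not say whether the manual weighted the groups. The coefficients and group sizes are made up.

```python
import math


def average_correlation(rs, ns):
    """Average correlations via Fisher's z, weighting each group by n - 3."""
    zs = [math.atanh(r) for r in rs]      # r to z
    weights = [n - 3 for n in ns]         # assumed weighting scheme
    mean_z = sum(w * z for w, z in zip(weights, zs)) / sum(weights)
    return math.tanh(mean_z)              # z back to r


# Made-up stability coefficients and sample sizes for five age groups.
group_rs = [0.85, 0.88, 0.90, 0.87, 0.91]
group_ns = [45, 50, 48, 50, 50]

overall_r = average_correlation(group_rs, group_ns)
print(round(overall_r, 3))
```

Averaging in the z metric rather than averaging the r values directly avoids the bias that comes from the skewed sampling distribution of high correlations.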
“The standard difference was calculated using the mean score difference between two
testing divided by the pooled standard deviation.”
The mean retest scores for all seven scaled process scores are higher than those
from the first testing, with effect sizes ranging from .14 to .41. “In general, test-retest
gains are less pronounced for the process scores in the Working Memory domain than for the
process scores in the Perceptual and Processing Speed domains” (p. 136 of the technical manual).
2.10 Continued…
• In a study by Ryan, Glass, and Bartels (2010), 76 students in a
Midwestern community took the WISC-IV. Forty-three of the students agreed to
take a second WISC-IV examination, and those 43 students were the
participants of the investigation.
• According to Ryan, Glass, and Bartels (2010), all but one of the
dependent-samples t-tests failed to find significant
differences in scores between the first and second administrations of
the WISC-IV.
• “Stability coefficients in the present sample were consistently smaller than
those reported in the WISC-IV Technical and Interpretive Manual
(Wechsler, 2003b) for children 8 to 9 years of age.”
• This study did have some limitations. It was conducted in a rural
community with mainly white students attending a private
school, so the sample is not a good representation of an ethnically diverse
population (Ryan, Glass, & Bartels, 2010).
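The dependent-samples t-tests mentioned above compare each child's first and second scores. The following is only a sketch of the statistic itself, with made-up scores and a hypothetical function name; it does not reproduce the analysis in Ryan et al. (2010), and judging significance would still require comparing t against a critical value for the given degrees of freedom.

```python
import math


def paired_t(first, second):
    """Return (t, df) for a dependent-samples t-test on paired scores."""
    diffs = [b - a for a, b in zip(first, second)]
    n = len(diffs)
    mean_diff = sum(diffs) / n
    ss_diff = sum((d - mean_diff) ** 2 for d in diffs)
    sd_diff = math.sqrt(ss_diff / (n - 1))   # sample SD of the difference scores
    standard_error = sd_diff / math.sqrt(n)
    return mean_diff / standard_error, n - 1


# Made-up composite scores for five children at the first and second testing.
first_scores = [98, 105, 110, 92, 101]
second_scores = [100, 104, 113, 95, 103]

t, df = paired_t(first_scores, second_scores)
print(round(t, 3), df)
```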
Furr, R. M., & Bacharach, V. R. (2008). Psychometrics: An Introduction. Thousand Oaks,
CA: Sage Publications. ISBN: 978-1-412-927604
Gravetter, F., & Wallnau, L. (2009). Statistics for the Behavioral
Sciences-Eighth Edition. Wadsworth, CA: Cengage Learning.
Ryan, J., Glass, L., & Bartels, J. (2010). Stability of the WISC-IV in a Sample of
Elementary and Middle School Children. Applied Neuropsychology,
17: 68-72.
Sattler, J. M. (2008a). Assessment of children: Cognitive foundations (5th ed.).
San Diego: Author.
Wechsler, D. (2004). WISC-IV Technical and Interpretive Manual. San
Antonio, TX: Psychological Corporation.
Williams, P., Weiss, L., & Rolfhus, E. (2003). WISC-IV Technical Report # 1
Psychometric Properties. WISC-IV Technical Manual # 1. San Antonio,
TX: Psychological Corporation.
Williams, P., Weiss, L., & Rolfhus, E. (2003). WISC-IV Technical Report # 2
Psychometric Properties. WISC-IV Technical Manual # 2. San Antonio,
TX: Psychological Corporation.