COHEN-BASED SUMMARY OF PSYCHOLOGICAL TESTING & ASSESSMENT
Bachelor of Science in Psychology (University of San Jose-Recoletos)
CHAPTER 1: PSYCHOLOGICAL TESTING AND ASSESSMENT
TESTING AND ASSESSMENT
 Roots can be found in early twentieth-century France: in 1905, Alfred Binet published a test designed to help place Paris schoolchildren.
 In WWI, the military used the test to screen large numbers of recruits quickly for intellectual and emotional problems.
 In WWII, the military depended even more on tests to screen recruits for service.
PSYCHOLOGICAL TESTING VS. PSYCHOLOGICAL ASSESSMENT

DEFINITION
 Testing: the process of measuring psychology-related variables by means of devices or procedures designed to obtain a sample of behavior.
 Assessment: the gathering and integration of psychology-related data for the purpose of making a psychological evaluation, accomplished with an array of tools.

OBJECTIVE
 Testing: to obtain some gauge, usually numerical in nature.
 Assessment: to answer a referral question, solve a problem, or arrive at a decision through the use of tools of evaluation.

PROCESS
 Testing: may be individualized or group.
 Assessment: typically individualized.

ROLE OF EVALUATOR
 Testing: the tester is not key to the process; may be substituted.
 Assessment: the assessor is key in the process of selecting tests as well as in drawing conclusions.

SKILL OF EVALUATOR
 Testing: requires technician-like skills.
 Assessment: typically requires an educated selection of tools of evaluation and skill in evaluation.

OUTCOME
 Testing: typically yields a test score.
 Assessment: entails a logical problem-solving approach to answer the referral question.
3 FORMS OF ASSESSMENT:
1. COLLABORATIVE PSYCHOLOGICAL ASSESSMENT – assessor and assessee work as partners from initial contact through final feedback
2. THERAPEUTIC PSYCHOLOGICAL ASSESSMENT – self-discovery and new understandings are encouraged throughout the assessment process
3. DYNAMIC PSYCHOLOGICAL ASSESSMENT – follows the model (1) evaluation, (2) intervention, (3) evaluation. Provides a means of evaluating how the assessee processes or benefits from some type of intervention during the course of evaluation.
Tools of Psychological Assessment
A. The Test (a measuring device or procedure)
1. psychological test: a device or procedure designed to measure
variables related to psychology (intelligence, personality,
aptitude, interests, attitudes, or values)
2. format: refers to the form, plan, structure, arrangement, and
layout of test items as well as to related considerations such as
time limits.
a) also refers to the form in which a test is administered (pen and paper, computer, etc.); computers can generate scenarios
b) term is also used to denote the form or structure of
other evaluative tools and processes, such as the
guidelines for creating a portfolio work sample
3. Ways That tests differ from one another:
a) administrative procedures
(1) some test administrations require an active, knowledgeable administrator
(a) some test administration
involves demonstration of
tasks
(b) usually one-on-one
(c) trained observation of
assessee’s performance
(2) some test administrators don’t even have to be present
(a) usually administered to larger
groups
(b) test takers complete tasks
independently
b) Scoring and interpretation procedures
(1) score: a code or summary statement,
usually (but not necessarily) numerical in
nature, that reflects an evaluation of
performance on a test, task, interview, or
some other sample of behavior
(2) scoring: process of assigning such
evaluative codes/ statements to
performance on tests, tasks, interviews,
or other behavior samples.
(3) different types of score:
(a) cut score: reference point,
usually numerical, derived by
judgement and used to divide
a set of data into two or more
classifications.
(i) sometimes reached without any formal method, simply by “eyeballing”: e.g., teachers who decide what is passing and what is failing (see the short example after this section)
(4) who scores it
(a) self-scored by testtaker
(b) computer
(c) trained examiner
c) psychometric soundness/technical quality
(1) psychometrics: the science of psychological measurement.
(a) psychometric soundness refers to how consistently and how accurately a psychological test measures what it purports to measure.
(2) utility: refers to the usefulness or
practical value that a test or other tool of
assessment has for a particular purpose.
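A minimal Python illustration of applying a cut score, as mentioned under scoring above (the scores and the passing threshold of 65 are invented):

scores = [52, 61, 65, 78, 94]
CUT_SCORE = 65   # reference point dividing the data into two classifications
print([(s, "pass" if s >= CUT_SCORE else "fail") for s in scores])
# [(52, 'fail'), (61, 'fail'), (65, 'pass'), (78, 'pass'), (94, 'pass')]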
B. The Interview: method of gathering information through direct
communication involving reciprocal exchange
1. interviewer in face-to-face is taking note of
a) verbal language
b) nonverbal language
(1) body language movements
(2) facial expressions in response to
interviewer
(3) the extent of eye contact
(4) apparent willingness to cooperate
c) how they are dressed
(1) neat vs sloppy vs inappropriate
2. interviewer over the phone taking note of
a) changes in the interviewee’s voice pitch
b) long pauses
c) signs of emotion in response
3. ways that interviews differ:
a) length, purpose, and nature
in order to help make diagnostic, treatment, selection, and other decisions
4. panel interview: an interview conducted with one interviewee by more than one interviewer
C. The Portfolio
1. files of work products: paper, canvas, film, video, audio, etc.
2. samples of one’s abilities and accomplishments
D. Case History Data: records, transcripts, and other accounts in written, pictorial, or other form that preserve archival information, official and informal accounts, and other data and items relevant to an assessee
1. sheds light on an individual’s past and current adjustment as well as on events and circumstances that may have contributed to any changes in adjustment
2. provides information about neuropsychological functioning prior to the occurrence of a trauma or other event that results in a deficit
3. insight into current academic and behavioral standing
4. useful in making judgments for future class placements
5. case history study: a report or illustrative account concerning a person or an event that was compiled on the basis of case history data
a) might shed light on how one individual’s personality and a particular set of environmental conditions combined to produce a successful world leader
b) groupthink: work on this social psychological phenomenon contains rich case history material on collective decision making that did not always result in the best decisions
E. Behavioral Observation: monitoring the actions of others or oneself by visual or electronic means while recording quantitative and/or qualitative information regarding those actions
1. often used as a diagnostic aid in various settings: inpatient facilities, behavioral research laboratories, classrooms
2. naturalistic observation: behavioral observation that takes place in a naturally occurring setting (as opposed to a research laboratory) for the purpose of evaluation and information-gathering
3. in practice, tends to be used most frequently by researchers in settings such as classrooms, clinics, prisons, etc.
F. Role-Play Tests
1. role play: acting an improvised or partially improvised part in a simulated situation
2. role-play test: a tool of assessment wherein assessees are directed to act as if they were in a particular situation; assessees are then evaluated with regard to their expressed thoughts, behaviors, abilities, etc.
G. Computers as Tools
1. local processing: on-site computerized scoring, interpretation, or other conversion of raw test data; contrast with CP and teleprocessing
2. central processing: computerized scoring, interpretation, or other conversion of raw test data that is physically transported from the same or other test sites; contrast with LP and teleprocessing
3. teleprocessing: computerized scoring, interpretation, or other conversion of raw test data sent over telephone lines by modem from a test site to a central location for computer processing; contrast with CP and LP
4. simple score report: a type of scoring report that provides only a listing of scores
5. extended scoring report: a type of scoring report that provides a listing of scores AND statistical data
6. interpretive report: a formal or official computer-generated account of test performance presented in both numeric and narrative form and including an explanation of the findings
a) the three varieties of interpretive report are (1) descriptive, (2) screening, and (3) consultative
b) some contain relatively little interpretation and simply call attention to certain high, low, or unusual scores that need to be focused on
c) consultative report: a type of interpretive report designed to provide expert and detailed analysis of test data that mimics the work of an expert consultant
d) integrative report: a form of interpretive report of psychological assessment, usually computer-generated, in which data from behavioral, medical, administrative, and/or other sources are integrated
7. CAPA: computer-assisted psychological assessment (assistance to the test user, not the test taker)
a) enables test developers to create psychometrically sound tests using complex mathematical procedures and calculations
b) enables test users to construct tailor-made tests with built-in scoring and interpretive capabilities
c) Pros:
(1) test administrators have greater access to potential test users because of the global reach of the internet
(2) scoring and interpretation of test data tend to be quicker than for paper-and-pencil tests
(3) costs associated with internet testing tend to be lower than costs associated with paper-and-pencil tests
(4) the internet facilitates the testing of otherwise isolated populations, as well as people with disabilities for whom getting to a test center might prove a hardship
(5) greener: conserves paper, shipping materials, etc.
d) Cons:
(1) test client integrity
(a) refers to the verification of the identity of the test taker when a test is administered online
(b) also refers to the sometimes varying interests of the test taker vs. those of the test administrator; the test taker might have access to notes, aids, internet resources, etc.
(c) internet testing is only testing, not assessment
8. CAT: computerized adaptive testing: an interactive, computer-administered test-taking process wherein items presented to the test taker are based in part on the test taker’s performance on previous items
a) EX: on a computerized test of academic abilities, the computer might be programmed to switch from testing math skills to English skills after three consecutive failures on math items, as in the sketch below
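A minimal Python sketch of the adaptive branching rule in the example above; the item list, answer function, and three-failure threshold are illustrative assumptions, not an actual CAT algorithm:

def administer(items, answer_item, max_consecutive_failures=3):
    # Present items in order; track consecutive failures.
    failures = 0
    for item in items:
        correct = answer_item(item)          # True/False from the testtaker
        failures = 0 if correct else failures + 1
        if failures >= max_consecutive_failures:
            return "switch domains"          # e.g., from math to English skills
    return "domain completed"

math_items = ["2+2", "3*7", "12/4", "5-9", "8*8"]
print(administer(math_items, lambda item: False))   # -> "switch domains"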
H. Other Tools
1. DVD: how would you respond to the events that take place in the video?
a) sexual harassment in the workplace
b) responding to various types of emergencies
c) diagnosis/treatment plan for clients on videotape
2. thermometers, biofeedback, etc.

How Are Assessments Conducted?
 protocol: the form, sheet, or booklet on which a testtaker’s responses are entered.
o The term might also be used to refer to a description of a set of test- or assessment-related procedures, as in the sentence “the examiner dutifully followed the complete protocol for the stress interview.”
 rapport: the working relationship between the examiner and the examinee.

ASSESSMENT OF PEOPLE WITH DISABILITIES
 Define who requires alternate assessment, how such assessments are to be conducted, and how meaningful inferences are to be drawn from the data derived from such assessments.
 Accommodation – adaptation of a test, procedure, or situation, or the substitution of one test for another, to make the assessment more suitable for an assessee with exceptional needs.
o Ex.: translate a test into Braille and administer it in that form.
 Alternate assessment – an evaluative or diagnostic procedure or process that varies from the usual, customary, or standardized way a measurement is derived, either by virtue of some special accommodation made to the assessee or by means of alternative methods.
 Consider these four variables in deciding which of many different types of accommodation should be employed:
o The capabilities of the assessee
o The purpose of the assessment
o The meaning attached to test scores
o The capabilities of the assessor

TYPES OF SETTINGS
 EDUCATIONAL SETTING
o achievement test: evaluation of accomplishments or the degree of learning that has taken place, usually with regard to an academic area
o diagnosis: a description or conclusion reached on the basis of evidence and opinion, through a process of distinguishing the nature of something and ruling out alternative conclusions
o diagnostic test: a tool used to make a diagnosis, usually to identify areas of deficit to be targeted for intervention
o informal evaluation: a typically nonsystematic, relatively brief, and “off the record” assessment leading to the formation of an opinion or attitude, conducted by any person in any way for any reason, in an unofficial context and not subject to the same ethics or standards as evaluation by a professional
 CLINICAL SETTING
o these tools are used to help screen for or diagnose behavior problems
o group testing is used primarily for screening: identifying those individuals who require further diagnostic evaluation
 COUNSELING SETTING
o schools, prisons, and governmental or privately owned institutions
o ultimate objective: the improvement of the assessee in terms of adjustment, productivity, or some related variable
 GERIATRIC SETTING
o quality of life: in psychological assessment, an evaluation of variables such as perceived stress, loneliness, sources of satisfaction, personal values, quality of living conditions, and quality of friendships and other social support
 BUSINESS AND MILITARY SETTINGS
 GOVERNMENTAL AND ORGANIZATIONAL CREDENTIALING

TEST DEVELOPER
 They are the ones who create tests.
 They conceive, prepare, and develop tests. They also find a way to disseminate their tests, by publishing them either commercially or through professional publications such as books or periodicals.

TEST USER
 They select or decide to take a specific test off the shelf and use it for some purpose. They may also participate in other roles, e.g., as examiners or scorers.

TEST TAKER
 Anyone who is the subject of an assessment.
 Test takers may vary on a continuum with respect to numerous variables, including:
o The amount of anxiety they experience and the degree to which that test anxiety might affect the results
o The extent to which they understand and agree with the rationale of the assessment
o Their capacity and willingness to cooperate
o The amount of physical pain or emotional distress they are experiencing
o The amount of physical discomfort
o The extent to which they are alert and wide awake
o The extent to which they are predisposed to agreeing or disagreeing when presented with stimuli
o The extent to which they have received prior coaching
o Their motivation to portray themselves in a good light
 Psychological autopsy – reconstruction of a deceased individual’s psychological profile on the basis of archival records, artifacts, and interviews previously conducted with the deceased assessee.

REFERENCE SOURCES
 TEST CATALOGUES – contain brief descriptions of the test
 TEST MANUALS – detailed information
 REFERENCE VOLUMES – one-stop shopping; provide detailed information for each test listed, including test publisher, author, purpose, intended test population, and test administration time
 JOURNAL ARTICLES – contain reviews of the test
 ONLINE DATABASES – most widely used bibliographic databases

TYPES OF TESTS
 INDIVIDUAL TEST – given to only one person at a time
 GROUP TEST – administered to more than one person at a time by a single examiner
 ABILITY TESTS:
o ACHIEVEMENT TESTS – refer to previous learning (ex. spelling)
o APTITUDE/PROGNOSTIC TESTS – refer to the potential for learning or acquiring a specific skill
o INTELLIGENCE TESTS – refer to a person’s general potential to solve problems
 PERSONALITY TESTS: refer to overt and covert dispositions
o OBJECTIVE/STRUCTURED TESTS – usually self-report; require the subject to choose between two or more alternative responses
o PROJECTIVE/UNSTRUCTURED TESTS – the testtaker responds to ambiguous stimuli and is assumed to project his or her own needs and motivations (see the projective test entry in Chapter 2)
o INTEREST TESTS –
CHAPTER 2: HISTORICAL, CULTURAL AND LEGAL/ETHICAL CONSIDERATIONS
A HISTORICAL PERSPECTIVE
ANTIQUITY TO THE 19TH CENTURY
 Tests and testing programs first came into being in China.
 Testing was instituted as a means of selecting which of many applicants would obtain government jobs (civil service).
 Job applicants were tested on proficiency in endeavors such as music, archery, knowledge, and skill.
GRECO-ROMAN WRITINGS (AND THE MIDDLE AGES)
 The world was seen in terms of good and evil.
 A deficiency in some bodily fluid was believed to be a factor influencing personality.
 Hippocrates and Galen.
RENAISSANCE
 Christian von Wolff – anticipated psychology as a science and psychological measurement as a specialty within that science.
CHARLES DARWIN AND INDIVIDUAL DIFFERENCES
 Tests were designed to measure individual differences in ability and personality among people.
 “Origin of Species”: chance variation in species would be selected or rejected by nature according to adaptivity and survival value – “survival of the fittest.”
FRANCIS GALTON
 Explored and quantified individual differences between people.
 Classified people “according to their natural gifts.”
 Displayed the first anthropometric laboratory.
KARL PEARSON
 Developed the product-moment correlation technique.
 His work can be traced directly from Galton.
WILHELM MAX WUNDT
 First experimental psychology laboratory, at the University of Leipzig.
 Focused more on how people were similar, not different from each other.
JAMES MCKEEN CATTELL
 Individual differences in reaction time.
 Coined the term “mental test.”
CHARLES SPEARMAN
 Originated the concept of test reliability and built the mathematical framework for the statistical technique of factor analysis.
VICTOR HENRI
 Frenchman who collaborated with Binet on papers suggesting how mental tests could be used to measure higher mental processes.
EMIL KRAEPELIN
 Early experimenter with the word association technique as a formal test.
LIGHTNER WITMER
 “Little-known founder of clinical psychology.”
 Founded the first psychological clinic in the U.S.
PSYCHE CATTELL
 Daughter of James McKeen Cattell.
 Cattell Infant Intelligence Scale (CIIS) & The Measurement of Intelligence in Infants and Young Children.
RAYMOND CATTELL
 Believed in the lexical approach to defining personality, which examines human languages for descriptors of personality dimensions.
20TH CENTURY
 Birth of the first formal tests of intelligence.
 Testing shifted to be of more understandable relevance/meaning.
A. THE MEASUREMENT OF INTELLIGENCE
o Binet created the first intelligence test, to identify mentally retarded schoolchildren in Paris (individually administered).
o The Binet-Simon Test has been revised again and again.
o Group intelligence tests emerged with the need to screen the intellect of WWI recruits.
o David Wechsler – designed a test to measure adult intelligence.
 For him, intelligence is the global capacity of the individual to act purposefully, to think rationally, and to deal effectively with his environment.
 Wechsler-Bellevue Intelligence Scale  Wechsler Adult Intelligence Scale – revised several times and extended the age range of testtakers from young children through senior adulthood.
B. THE MEASUREMENT OF PERSONALITY
o The field of psychology was criticized as being too test oriented.
o Clinical psychology was synonymous with mental testing.
o ROBERT WOODWORTH – developed a measure of adjustment and emotional stability that could be administered quickly and efficiently to groups of recruits.
 To disguise the true purpose of the test, the questionnaire was labeled the Personal Data Sheet.
 He later called it the Woodworth Psychoneurotic Inventory – the first widely used self-report test of personality.
o Self-report tests:
 Advantage: respondents are the people best qualified to report on themselves.
 Disadvantages: poor insight into self; one might honestly believe something about oneself that isn’t true; unwillingness to report seemingly negative qualities.
o Projective test: the individual is assumed to project onto some ambiguous stimulus (inkblot, photo, etc.) his or her own unique needs, fears, hopes, and motivations.
 Ex.: Rorschach inkblot.
C. THE ACADEMIC AND APPLIED TRADITIONS

Culture and Assessment
Culture: “the socially transmitted behavior patterns, beliefs, and products of work of a particular population, community, or group of people.”

Evolving Interest in Culture-Related Issues
Goddard tested immigrants and found most to be feebleminded.
- invalid: overestimated mental deficiency, even in native English speakers
This led to the nature-nurture debate about what intelligence tests actually measure.
Researchers needed to “isolate” the cultural variable.
Culture-specific tests: tests designed for use with people from one culture, but not from another.
- minorities still scored abnormally low
- ex.: loaf of bread vs. tortillas
Today tests undergo many steps to ensure they are suitable for a given nation.
- testtakers’ reactions are taken into account

Some Issues Regarding Culture and Assessment
 Verbal Communication
o Examiner and examinee must speak the same language.
o Especially tricky with infrequently used vocabulary or unusual idioms.
o A translator may lose nuances of translation or give unintentional hints toward the more desirable answer.
o Also requires understanding of the culture.
 Nonverbal Communication and Behavior
o Differs between cultures.
o Ex.: the meaning of not making eye contact.
o Body movement could even have a physical cause.
o Psychoanalysis: Freud’s theory of personality and psychological treatment, which holds that symbolic significance is assigned to many nonverbal acts.
o Timed tests may disadvantage cultures not obsessed with speed.
o Lack of speaking could be reverence for elders.
 Standards of Evaluation
o Acceptable roles for women differ across cultures.
o “Judgments as to who might be the best employee, manager, or leader may differ as a function of culture, as might judgments regarding intelligence, wisdom, courage, and other psychological variables.”
o One must ask, “How appropriate are the norms or other standards that will be used to make this evaluation?”

Tests and Group Membership
 Ex.: a requirement of being 5’4” to be a police officer excludes groups with characteristically shorter stature.
 Ex.: claims that a traditional Jewish lifestyle was not well suited to corporate America.
 Affirmative action: voluntary and mandatory efforts to combat discrimination and promote equal opportunity in education and employment for all.
 Psychology, tests, and public policy.

Legal and Ethical Considerations
Code of professional ethics: defines the standard of care expected of members of a given profession.

The Concerns of the Public
 Beginning in World War I, fear that tests were only testing the ability to take tests.
 Legislation
o Minimum competency testing programs: formal testing programs designed to be used in decisions regarding various aspects of students’ educations.
o Truth-in-testing legislation: state laws that provide testtakers with a means of learning the criteria by which they are being judged.
 Litigation
o The Daubert ruling made federal judges the gatekeepers in determining what expert testimony is admitted.
o This overrode the Frye standard, which only admitted scientific testimony that had won general acceptance in the scientific community.

The Concerns of the Profession
 Test-user qualifications
o Who should be allowed to use psychological tests?
o Level A: tests or aids that can adequately be administered, scored, and interpreted with the aid of the manual and a general orientation to the kind of institution or organization in which one is working.
o Level B: tests or aids that require some technical knowledge of test construction and use and of supporting psychological and educational fields.
o Level C: tests and aids requiring substantial understanding of testing and supporting psychological fields, with experience.
 Testing people with disabilities
o Difficulty in transforming the test into a form that can be taken by the testtaker.
o Difficulty in transforming responses so they are scorable.
o Difficulty in meaningfully interpreting the test data.
 Computerized test administration, scoring, and interpretation
o simple, convenient
o easily copied, duplicated
o insufficient research comparing it to pencil-and-paper versions
o value of computer interpretation is questionable
o unprofessional, unregulated “psychological testing” online

The Rights of Testtakers
 The right of informed consent
o Right to know why they are being evaluated, how the test data will be used, and what information will be released to whom.
o Consent may be obtained from a parent or legal representative.
o Must be in written form, stating: the general purpose of the testing; the specific reason it is being undertaken; the general type of instruments to be administered.
o Revealing this information before the test can sometimes contaminate the results.
o Deception is used only if absolutely necessary.
o Don’t use deception if it will cause emotional distress.
o Fully debrief participants.
 The right to be informed of test findings
o Formerly, test administrators were told to give testtakers only positive information and to tell them as little as possible about the nature of their performance on a particular test, so that the examinee would leave the test session feeling pleased and satisfied.
o Now, realistic information is required.
o Testtakers also have the right to know what recommendations are being made as a consequence of the test data.
 The right to privacy and confidentiality
o Privacy right: “recognizes the freedom of the individual to pick and choose for himself the time, circumstances, and particularly the extent to which he wishes to share or withhold from others his attitudes, beliefs, behaviors, and opinions.”
o Privileged information: information protected by law from being disclosed in legal proceedings; protects clients from disclosure in judicial proceedings. The privilege belongs to the client, not the psychologist.
o Confidentiality: concerns matters of communication outside the courtroom.
o Safekeeping of test data: it is not a good policy to maintain all records in perpetuity.
 The right to the least stigmatizing label
o The Standards advise that the least stigmatizing labels should always be assigned when reporting test results.
CHAPTER 3: A STATISTICS REFRESHER
Why We Need Statistics
 Statistics are important for purposes of education.
o Numbers provide convenient summaries and allow us to evaluate some observations relative to others.
 We use statistics to make inferences, which are logical deductions about events that cannot be observed directly.
o Detective work of gathering and displaying clues – exploratory data analysis.
o Then confirmatory data analysis.
 Descriptive statistics are methods used to provide a concise description of a collection of quantitative information.
 Inferential statistics are methods used to make inferences from observations of a small group of people, known as a sample, to a larger group of individuals, known as a population.

SCALES OF MEASUREMENT
 MEASUREMENT – the act of assigning numbers or symbols to characteristics of things according to rules. The rules serve as a guideline for representing the magnitude. It always involves error.
 SCALE – a set of numbers whose properties model empirical properties of the objects to which the numbers are assigned.
 CONTINUOUS SCALE – interval/ratio; used to measure a continuous variable. Always involves error.
 DISCRETE SCALE – nominal/ordinal; used to measure a discrete variable (ex. female or male).
 ERROR – the collective influence of all factors on a test score beyond those the test is intended to measure.

PROPERTIES OF SCALES
Magnitude, equal intervals, and an absolute 0.

Magnitude
- The property of “moreness.”
- A scale has the property of magnitude if we can say that a particular instance of the attribute represents more, less, or equal amounts of the given quantity than does another instance.

Equal Intervals
- A scale has the property of equal intervals if the difference between two points at any place on the scale has the same meaning as the difference between two other points that differ by the same number of scale units.
- A psychological test rarely has the property of equal intervals.
- When a scale has the property of equal intervals, the relationship between the measured units and some outcome can be described by a straight line or a linear equation in the form Y = a + bX.
o Shows that an increase in equal units on a given scale reflects equal increases in the meaningful correlates of units.

Absolute 0
- An absolute 0 is obtained when nothing of the property being measured exists.
- This is extremely difficult or impossible for many psychological qualities.

NOMINAL SCALE
 Simplest form of measurement.
 Classification or categorization.
 Arithmetic operations cannot meaningfully be performed with nominal data; the only permissible operation is counting.
 Ex.: male or female.
 Also includes test items (ex. yes/no responses).

ORDINAL SCALE
 Classifies in some kind of ranking order.
 Individuals are compared to others and assigned a rank.
 Implies nothing about how much greater one ranking is than another.
 Numbers/ranks do not indicate units of measurement.
 No absolute zero point.
 Binet believed that data derived from intelligence tests are ordinal in nature.

INTERVAL SCALE
 In addition to the features of nominal and ordinal scales, contains equal intervals between numbers.
 No absolute zero point.
 Can take an average.

RATIO SCALE
 In addition to all the properties of nominal, ordinal, and interval measurement, a ratio scale has a true zero point.
 Equal intervals between numbers.
 Ex.: measuring the amount of pressure a hand can exert.
 A true zero doesn’t mean someone will receive a score of 0, but it means that 0 has meaning.

NOTE: Permissible Operations
- Level of measurement is important because it defines which mathematical operations we can apply to numerical data.
- For nominal data, each observation can be placed in only one mutually exclusive category.
- Ordinal measurements can be manipulated using arithmetic, although the results are hard to interpret in terms of amounts of the underlying property.
- With interval data, one can apply any arithmetic operation to the differences between scores.
o These cannot be used to make statements about ratios.

DESCRIBING DATA
 Distribution: a set of scores arrayed for recording or study.
 Raw score: a straightforward, unmodified accounting of performance, usually numerical.
Frequency Distributions
 Frequency distribution: all scores listed alongside the number of times each score occurred.
 Grouped frequency distribution: test-score intervals (class intervals) replace the actual test scores.
o The highest and lowest class intervals are the upper and lower limits of the distribution.
 Histogram: a graph with vertical lines drawn at the true limits of each test score (or class interval), forming TOUCHING rectangles, with the midpoint in the center of each bar.
 Bar graph: the rectangles DON’T touch.
 Frequency polygon: data illustrated with a continuous line connecting the points where test scores or class intervals meet frequencies.
 A single test score means more if one relates it to other test scores.
 A distribution of scores summarizes the scores for a group of individuals.
 A frequency distribution displays scores on a variable or a measure to reflect how frequently each value was obtained.
o One defines all the possible scores and determines how many people obtained each of those scores.
 Income is an example of a variable that has a positive skew.
 Whenever you draw a frequency distribution or a frequency polygon, you must decide on the width of the class interval.
 The class interval (e.g., for inches of rainfall) is the unit on the horizontal axis.
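A short Python sketch of the simple and grouped frequency distributions just described (the scores and the class-interval width of 5 are invented):

from collections import Counter

scores = [44, 47, 47, 52, 55, 55, 55, 61, 63, 68]

# Simple frequency distribution: each score alongside its frequency
print(sorted(Counter(scores).items()))

# Grouped frequency distribution: class intervals of width 5 replace raw scores
width = 5
grouped = Counter((s // width) * width for s in scores)
for lower in sorted(grouped):
    print(f"{lower}-{lower + width - 1}: {grouped[lower]}")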
Measures of Central Tendency
 Measure of central tendency: a statistic that indicates the average or midmost score between the extreme scores in a distribution.
 The Arithmetic Mean
o “X bar” (X̄)
o The sum of observations divided by the number of observations: X̄ = ΣX/n.
o Used for interval or ratio data when distributions are relatively normal.
 The Median
o The middle score.
o Used for ordinal, interval, and ratio data.
o Especially useful when few scores fall at the extremes.
 The Mode
o The most frequently occurring score.
o Bimodal distribution: two scores both have the highest frequency.
o The only measure of central tendency that can be used with nominal data.
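A worked example of the three measures above, using Python’s statistics module (the data are invented):

import statistics

scores = [10, 12, 12, 14, 15, 17, 25]
print(statistics.mean(scores))    # X-bar = sum(X)/n = 105/7 = 15.0
print(statistics.median(scores))  # middle score = 14
print(statistics.mode(scores))    # most frequently occurring score = 12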
Measures of Variability
 Variability: an indication of how scores in a distribution are scattered or dispersed.
 The Range
o The difference between the highest and lowest scores.
o A quick but gross description of the spread of scores.
 The Interquartile and Semi-Interquartile Range
o The distribution is split up by 3 quartiles, making 4 quarters, each representing 25% of the scores.
o Q2 = the median.
o Interquartile range: a measure of variability equal to the difference between Q3 and Q1.
o Semi-interquartile range: the interquartile range divided by 2.
 Quartiles and Deciles
o Quartiles are points that divide the frequency distribution into equal fourths.
o The first quartile is the 25th percentile; the second quartile is the median, or 50th percentile; the third quartile is the 75th percentile.
o The interquartile range is bounded by the range of scores that represents the middle 50% of the distribution.
o Deciles are similar but use points that mark 10% rather than 25% intervals.
o Stanine system: converts any set of scores into a transformed scale that ranges from 1 to 9.
 The Average Deviation
o Deviation score: x = X − mean.
o Average deviation = (sum of the absolute values of all deviation scores) / total number of scores.
o Tells us on average how far scores are from the mean.
 The Standard Deviation
o Similar to the average deviation, but to overcome the +/− sign problem, each deviation is squared.
o Standard deviation: a measure of variability equal to the square root of the average squared deviation about the mean.
o It is the square root of the variance.
o Variance: the mean of the squares of the differences between the scores in a distribution and their mean.
 Found by squaring and summing all the deviation scores and then dividing by the total number of scores.
o s = sample standard deviation; σ (sigma) = population standard deviation.
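A sketch computing these measures of variability for an invented set of scores (population formulas, dividing by n, are used here):

import statistics

scores = [4, 6, 8, 10, 12]
n = len(scores)
mean = sum(scores) / n                                   # 8.0

rng = max(scores) - min(scores)                          # range = 12 - 4 = 8
avg_dev = sum(abs(x - mean) for x in scores) / n         # (4+2+0+2+4)/5 = 2.4
variance = sum((x - mean) ** 2 for x in scores) / n      # 40/5 = 8.0
sd = variance ** 0.5                                     # sqrt(8) ~ 2.83
print(rng, avg_dev, variance, sd)
print(statistics.pstdev(scores))                         # same population SD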
Skewness
 Skewness: the nature and extent to which symmetry is absent.
 POSITIVE SKEW – ex.: the test was too hard.
 NEGATIVE SKEW – ex.: the test was too easy.
 Skew can be gauged by examining the relative distances of the quartiles from the median.

Kurtosis
 The steepness of a distribution.
 platykurtic: relatively flat.
 leptokurtic: relatively peaked.
 mesokurtic: somewhere in the middle.

The Normal Curve
Normal curve: a bell-shaped, smooth, mathematically defined curve, highest at the center; both sides taper as the curve approaches the x-axis asymptotically.
- Symmetrical; thus the mean, median, and mode are the same.

Area under the Normal Curve
Tails and body.

Standard Scores
Standard score: a raw score that has been converted from one scale to another scale, where the latter has an arbitrarily set mean and standard deviation.
- Used for comparison.

Z-score
 The conversion of a raw score into a number indicating how many standard deviation units the raw score is below or above the mean of the distribution.
 The difference between a particular raw score and the mean, divided by the standard deviation.
 Used to compare test scores from different scales.

T-score
 A standard score system composed of a scale that ranges from 5 standard deviations below the mean to 5 standard deviations above the mean (mean = 50, standard deviation = 10).
 No negatives.

Other Standard Scores
 SAT
 GRE
 Linear transformation: when a standard score retains a direct numerical relationship to the original raw score.
 Nonlinear transformation: required when data are not normally distributed, yet comparisons with normal distributions need to be made.
o Normalized standard scores
 When scores don’t fall on a normal distribution.
 “Normalizing a distribution involves ‘stretching’ the skewed curve into the shape of a normal curve and creating a corresponding scale of standard scores, a scale called a normalized standard score scale.”
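A minimal sketch of the z-score conversion above, plus the T-score transformation (assuming the usual T = 50 + 10z convention, i.e., mean 50 and SD 10):

def z_score(raw, mean, sd):
    # standard deviation units above (+) or below (-) the mean
    return (raw - mean) / sd

def t_score(raw, mean, sd):
    # T scale: mean 50, SD 10, so typical scores avoid negative values
    return 50 + 10 * z_score(raw, mean, sd)

print(z_score(65, mean=50, sd=10))   # 1.5
print(t_score(35, mean=50, sd=10))   # 35.0: z = -1.5 maps to T = 35, no minus sign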
CHAPTER 4: OF TESTS AND TESTING

Some Assumptions About Psychological Testing and Assessment
Assumption 1: Psychological Traits and States Exist
o
Trait: any distinguishable, relatively enduring way in which one
individual varies from another
o
States: distinguish one person from another but are relatively
less enduring
 The trait term that an observer applies, as well as the strength or magnitude of the trait presumed present, is based on observing a sample of behavior.
o
Trait and state definitions also refer to individual variation; comparisons are made with respect to the hypothetical average person.
o
Samples of behavior:

Direct observation

Analysis of self-report statements

Paper-and-pencil test answers
o
Psychological trait  covers wide range of possible
characteristics; ex:

Intelligence

Specific intellectual abilities

Cognitive style

Psychopathology
o
Controversy regarding how psychological traits exist
 Psychological traits exist only as constructs: an informed, scientific concept developed or constructed to describe or explain behavior.
 We can’t see, hear, or touch constructs; we infer their existence from overt behavior, which refers to an observable action or the product of an observable action, including test- or assessment-related responses.
o
Traits not expected to be manifested in behavior 100% of the
time

Seems to be rank-order stability in personality
traits relatively high correlations between trait
scores at different time points
o
Whether and to what degree a trait manifests itself is
dependent on the strength and nature of the situation
Assumption 2: Psychological Traits and States Can Be Quantified and
Measured
o
After it is acknowledged that psychological traits and states do exist, the specific traits and states to be measured need to be defined.

What types of behaviors are assumed to be
indicative of trait?

Test developer has to provide test users with a clear
operational definition of the construct under study
o
After being defined, test developer considers types of item
content that would provide insight into it

Ex: behaviors that are indicative of a particular trait
o
Should all questions be weighted the same?

Weighting the comparative value of a test’s items
comes about as the result of a complex interplay
among many factors:

Technical considerations

The way a construct has been defined (for
particular test)

Value society (and test developer) attach
to behaviors evaluated
o
Need to find appropriate ways to score the test and interpret
results

Cumulative scoring: test score is presumed to
represent the strength of the targeted ability or trait
or state

The more the testtaker responds in a
particular direction (as keyed by test
manual) the higher the testtaker is
presumed to possess the targeted trait or
ability
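A minimal Python sketch of cumulative scoring as just described; the five-item answer key and the responses are invented:

# Keyed (trait-indicating) response for each of five hypothetical true/false items
key       = [True, False, True, True, False]
responses = [True, False, False, True, False]

# One point per response in the keyed direction; a higher total means more of
# the targeted trait or ability is presumed present
score = sum(r == k for r, k in zip(responses, key))
print(score)   # 4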
Assumption 3: Test-Related Behavior Predicts Non-Test-Related Behavior
o
The objective of a test is to provide some indication of some aspects of the examinee’s behavior; tasks on some tests mimic the actual behaviors that the test user is attempting to understand.
o
Obtained behavior is usually used to predict future behavior
o
Could also be used to postdict behavior to aid in the
understanding of behavior that has already taken place
o
Tools of assessment, such as a diary, or case history data, might
be of great value in such an evaluation
Assumption 4: Tests and Other Measurement Techniques Have Strengths
and Weaknesses
o
Competent test users understand a lot about the tests they use

How it was developed

Circumstances under which it is appropriate to
administer the test

How test should be administered and to whom

How results should be interpreted
o
Understand and appreciate the limitations of the tests they use
Assumption 5: Various Sources of Error Are Part of the Assessment Process
o
Everyday error= misstates and miscalculations
o
Assessment error= a long-standing assumption that factors
other than what a test attempts to measure will influence
performance on a test
o
Error variance: component of a test score attributable to
sources other than the trait or ability measured

Assessees themselves are sources of error variance
o
Classical test theory (CTT)/ True score theory: assumption is
made that each testtaker has a true score on a test that would
be obtained but for the action of measurement error
Assumption 6: Testing and Assessment Can Be Conducted in a Fair and
Unbiased Manner
o
Court challenges to various tests and testing programs have
sensitized test developers and users to the societal demand for
fair tests used in a fair manner

Publishers strive to develop instruments that are fair
when used in strict accordance with guidelines in the
test manual
o
Fairness related problems/questions:

Culture is different from people whom the test was
intended for

Politics
Assumption 7: Testing and Assessment Benefit Society
o
Many critical decisions are based on testing and assessment
procedures
WHAT’S A “GOOD TEST”?
Criteria
o
Clear instruction for administration, scoring, and interpretation
Reliability
o
A “good test” or measuring tool is reliable

Involves consistency: the precision with which the
test measures and the extent to which error is
present in measurements

Unreliable measurement needs to be avoided
Validity
o
A test is considered valid if it does indeed measure what it purports to measure
o
If there is controversy over the definition of a construct then the
validity is sure to be criticized as well
o
Questions regarding validity focus on the items that collectively
make up the test

Adequately sample range of areas to measure
construct

Individual items contribute to or take away from
test’s validity
o
Validity may also be questioned on grounds related to the
interpretation of test results
Other Considerations
o
“Good test” one that trained examiners can administer, score
and interpret with minimum difficulty

Useful

Yields actionable results that will ultimately benefit
individual testtakers or society at large
NORMS
- Purpose of a test: to compare the performance of a testtaker with the performance of other testtakers (requires adequate norms: normative data).
o Normative data provide a standard with which the results measured can be compared.
- Norm-referenced testing and assessment: a method of evaluation and a way of deriving meaning from test scores by evaluating an individual testtaker’s score and comparing it to the scores of a group of testtakers.
- The meaning of an individual score is relative to other scores on the same test.
- Norms (scholarly context): usual, average, normal, standard, expected, or typical.
- Norms (psychometric context): the test performance data of a particular group of testtakers that are designed for use as a reference when evaluating or interpreting individual test scores.
- Normative sample: a group of people whose performance on a particular test is analyzed for reference in evaluating the performance of individual testtakers.
o Yields a distribution of scores.
- Norming: the process of deriving norms; the term may be modified to describe a particular type of norm derivation.
o Race norming: the controversial practice of norming on the basis of race or ethnic background.
- Norming a test can be very expensive; user norms/program norms consist of descriptive statistics based on a group of testtakers in a given period of time, rather than norms obtained by formal sampling methods.

Sampling to Develop Norms
- Standardization: the process of administering a test to a representative sample of testtakers for the purpose of establishing norms.
o A test is standardized when it has clear, specified procedures.
- Sampling
o The developer targets a defined group as the population the test is designed for.
 All members have at least one common, observable characteristic.
o To obtain a distribution of scores: administer the test to everyone in the targeted population, or administer the test to a sample of the population.
 Sample: a portion of the universe of people deemed to be representative of the whole population.
 Sampling: the process of selecting the portion of the universe deemed to be representative of the whole.
o Subgroups within a defined population may differ with respect to some characteristics, and it is sometimes essential to have these differences proportionately represented in the sample.
 Stratified sampling: the sample reflects the statistics of the whole population; helps prevent sampling bias and ultimately aids in the interpretation of findings. A rough contrast with simple random sampling is sketched below.
 Purposive sampling: arbitrarily selecting a sample believed to be representative of the population.
 Incidental/convenience sampling: a sample that is convenient or available for use; may be very exclusive (contain exclusionary criteria).
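A rough Python sketch contrasting simple random sampling with stratified sampling (the 60/40 population split is invented for illustration):

import random

# Hypothetical population: 60% in stratum "A", 40% in stratum "B"
population = [("A", i) for i in range(600)] + [("B", i) for i in range(400)]

simple = random.sample(population, 100)            # strata may drift by chance
stratified = (random.sample([p for p in population if p[0] == "A"], 60) +
              random.sample([p for p in population if p[0] == "B"], 40))
print(sum(1 for g, _ in stratified if g == "A"))   # always 60: mirrors the population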
Developing Norms for a Standardized Test
o Establish a standard set of instructions and conditions under which the test is given; this makes the scores of the normative sample more comparable with the scores of future testtakers.
o Once all data are collected and analyzed, the test developer summarizes the data using descriptive statistics (measures of central tendency and variability).
 The test developer needs to provide a precise description of the standardization sample itself.
 Descriptions of normative samples vary widely in detail.

TYPES OF STANDARD ERROR:
o STANDARD ERROR OF MEASUREMENT – estimates the extent to which an observed score deviates from a true score.
o STANDARD ERROR OF ESTIMATE – in regression, an estimate of the degree of error involved in predicting the value of one variable from another.
o STANDARD ERROR OF THE MEAN – a measure of sampling error.
o STANDARD ERROR OF THE DIFFERENCE – estimates how large a difference between two scores should be before the difference is considered statistically significant.

Tracking
- Comparisons are usually with people of the same age.
- Children at the same age level tend to go through different growth patterns.
- Pediatricians must know the child’s percentile within a given age group.
- The tendency to stay at about the same level relative to one’s peers is known as tracking (i.e., height and weight).
- Diets may alter this “track.”
- Faults: some believe there is an analogy between the rates of physical growth and the rates of intellectual growth.
o Some say that children learn at different rates.
o This system discriminates against some children.

TYPES OF NORMS
o Classifications of norms, ex: age, grade, national, local, percentile, etc.
o PERCENTILES
 Median = 2nd quartile: the point at or below which 50% of the scores fall and above which the remaining 50% fall.
 One might wish to divide a distribution of scores into deciles (instead of quartiles): 10 equal parts.
 The Xth percentile is equal to the score at or below which X% of scores fall.
 Percentile: an expression of the percentage of people whose score on a test or measure falls below a particular raw score (see the sketch after this list of norm types).
 Percentage correct: refers to the distribution of raw scores (the number of items answered correctly) multiplied by 100 and divided by the total number of items; *not the same as percentile*.
 Percentile is a converted score that refers to a percentage of testtakers.
 Percentiles are easily calculated and a popular way of organizing test-related data.
 With percentiles on a normal distribution, real differences between raw scores may be minimized near the ends of the distribution and exaggerated in the middle (this worsens with highly skewed data).
o AGE NORMS
 Age-equivalent scores/age norms: indicate the average performance of different samples of testtakers who were at various ages at the time the test was administered.
 Age norm tables exist for physical characteristics.
 “Mental” age vs. physical age (the need to identify a mental age).
o GRADE NORMS
 Grade norms: designed to indicate the average test performance of testtakers in a given school grade.
 Developed by administering the test to representative samples of children over a range of consecutive grades.
 The mean or median score for children at each grade level is calculated.
 Grade norms have great intuitive appeal.
 They do not provide info as to the content or type of items that a student could or could not answer correctly.
 Developmental norms (ex: grade norms and age norms): a term applied broadly to norms developed on the basis of any trait, ability, skill, or other characteristic that is presumed to develop, deteriorate, or otherwise be affected by chronological age, school grade, or stage of life.
o NATIONAL NORMS
 National norms: derived from a normative sample that was nationally representative of the population at the time the norming study was conducted.
o NATIONAL ANCHOR NORMS
 Many different tests purport to measure the same human characteristics or abilities.
 National anchor norms: equivalency tables for scores on tests that purport to measure the same thing.
 Could provide the tool for comparisons; provide stability to test scores by anchoring them to other test scores.
 Begin with the computation of percentile norms for each test to be compared.
 Equipercentile method: the equivalency of scores on different tests is calculated with reference to corresponding percentile scores.
o SUBGROUP NORMS
 A normative sample can be segmented by any criteria initially used in selecting subjects for the sample.
 Subgroup norms: the result of such segmentation; more narrowly defined.
o LOCAL NORMS
 Local norms: provide normative information with respect to the local population’s performance on some test.
 Typically developed by test users themselves.
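A short Python sketch of the percentile idea defined above, and of how it differs from percentage correct (the norm-group scores are invented):

def percentile_rank(raw, norm_scores):
    # percentage of the norm group scoring below the given raw score
    below = sum(1 for s in norm_scores if s < raw)
    return 100 * below / len(norm_scores)

norm_group = [35, 42, 48, 50, 55, 58, 63, 70, 74, 81]
print(percentile_rank(63, norm_group))   # 60.0 -> 60th percentile

# Percentage correct depends only on the testtaker's own answers,
# e.g., 63 items right out of 90:
print(100 * 63 / 90)                     # 70.0 percent correct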
Fixed Reference Group Scoring Systems
o Norms provide a context for interpreting the meaning of a test score.
o Fixed reference group scoring system: the distribution of scores obtained on the test from one group of testtakers (the fixed reference group) is used as the basis for the calculation of test scores for future administrations of the test.
 Ex: the SAT.

NORM-REFERENCED VERSUS CRITERION-REFERENCED EVALUATION
- One way to derive meaning from a test score is to evaluate the test score in relation to other scores on the same test (norm-referenced).
- Criterion-referenced: derive meaning from a test score by evaluating it on the basis of whether or not some criterion has been met.
o Criterion: a standard on which a judgment or decision may be based.
- Criterion-referenced testing and assessment: a method of evaluation and a way of deriving meaning from test scores by evaluating an individual’s score with reference to a set standard (ex: to drive, one must pass a driving test).
o Derives from the values and standards of an individual or organization.
o Also called domain- or content-referenced testing and assessment.
o Critique: if followed strictly, important info about an individual’s performance relative to others can potentially be lost.

Culture and Inference
- Culture is a factor in test administration, scoring, and interpretation.
- The test user should research in advance the test’s available norms to check how appropriate they are for the targeted testtaker population.
o It is helpful to know about the culture of the testtaker.
CORRELATION
 The degree and direction of correspondence between two things.
 Correlation coefficient (r) – expresses a linear relationship between two continuous variables.
o A numerical index that tells us the extent to which X and Y are “co-related.”
 Positive correlation: high scores on Y are associated with high scores on X, and low scores on Y correspond to low scores on X.
 Negative correlation: higher scores on Y are associated with lower scores on X, and vice versa.
 No correlation: the variables are not related.
 Ranges from -1 to +1.
 Correlation does not imply causation (i.e., weight, height, intelligence).

PEARSON r
 Pearson product-moment correlation coefficient.
 Devised by Karl Pearson.
 Used when the relationship between the two variables is linear and both variables are continuous.
 Coefficient of determination (r2) – an indication of how much variance is shared by the X and the Y variables.

SPEARMAN RHO
 Rank-order correlation coefficient.
 Developed by Charles Spearman.
 Used when the sample size is small and when both sets of measurements are in ordinal (ranking) form.

BISERIAL CORRELATION
 Expresses the relationship between a continuous variable and an artificial dichotomous variable.
o If the dichotomous variable had been true, then we would use the point-biserial correlation.
o When both variables are dichotomous and at least one of the dichotomies is true, the association between them can be estimated using the phi coefficient.
o If both dichotomous variables are artificial, we might use a special correlation coefficient – the tetrachoric correlation.
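A Python sketch of Pearson r and Spearman rho on a small invented data set; rho is computed here as Pearson r applied to ranks, which matches the usual formula when there are no ties:

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman_rho(x, y):
    rank = lambda v: [sorted(v).index(e) + 1 for e in v]   # no-ties ranking
    return pearson_r(rank(x), rank(y))

hours = [1, 2, 3, 4, 5]
score = [52, 55, 61, 70, 99]
print(round(pearson_r(hours, score), 2))   # 0.91: strong positive linear relation
print(spearman_rho(hours, score))          # 1.0: the rank orders agree perfectly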
REGRESSION
 The analysis of relationships among variables for the purpose of understanding how one variable may predict another.
 SIMPLE REGRESSION: one IV (X) and one DV (Y).
 Regression line: defined as the best-fitting straight line through a set of points in a scatter diagram.
o Found by using the principle of least squares, which minimizes the squared deviations around the regression line.
 Primary use: to predict one score or variable from another.
 Standard error of estimate: the higher the correlation between X and Y, the greater the accuracy of the prediction and the smaller the standard error of estimate.
 MULTIPLE REGRESSION: the use of more than one score to predict Y.
 Regression coefficient (b): the slope of the regression line.
o The ratio of the sum of squares for the covariance to the sum of squares for X.
o The sum of squares is defined as the sum of the squared deviations around the mean.
o Covariance is used to express how much two measures covary, or vary together.
 The slope describes how much change is expected in Y each time X increases by one unit.
 The intercept (a) is the value of Y when X is 0.
o The point at which the regression line crosses the Y axis.

THE BEST-FITTING LINE
 The difference between the observed and predicted score (Y − Y’) is called the residual.
 The best-fitting line is most appropriately found by squaring each residual.
 The best-fitting line is obtained by keeping these squared residuals as small as possible (the principle of least squares), as in the sketch below.
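A Python sketch of simple least-squares regression using the definitions above (slope b = covariance sum of squares over the sum of squares for X; the data are invented):

def least_squares(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    ss_xy = sum((a - mx) * (b - my) for a, b in zip(x, y))   # covariance SS
    ss_x = sum((a - mx) ** 2 for a in x)                     # SS for X
    b = ss_xy / ss_x          # slope: expected change in Y per unit increase in X
    a = my - b * mx           # intercept: value of Y when X is 0
    return a, b

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 6]
a, b = least_squares(x, y)
print(round(a, 1), b)             # 1.8 0.8
print(round(a + b * 3, 1))        # predicted Y' at X = 3: 1.8 + 0.8*3 = 4.2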
Pearson product moment correlation coefficient is a ratio used to
External influence is the third variable
Restricted Range
determine the degree of variation in one variable that can be
estimated from knowledge about variation in the other variable
Correlation and regression use variability on one variable to explain
variability on a second variable
Testing the Statistical Significance of a Correlation Coefficient
Begin with the null hypothesis that there is no relationship between
Restricted range problem: correlation requires variability; if the
variability is restricted, then significant correlations are difficult to
variables
Null hypothesis rejected is there is evidence that the association
find
Mulvariate Analysis
between two variables is significantly different from 0
t distribution is not a single distribution, but a family of distributions,
Multivariate analysis considers the relationship among combinations
of three of more variables
each with its own degrees of freedom
Degrees of freedom are defined as the sample size minus 2, or N-2
General Approach
Linear combination of variables is a weighted composite of the
Two-tailed test
original variables
Y’ = a+b1X1 + … bkXk
How to Interpret a Regression Plot
Regression plots are pictures that show the relationship between
variables
Common use of correlation is to determine the criterion validity
evidence for a test, or the relationship between a test score and
some well-defined criterion
Middle level of enjoyableness because it is the one observed most
frequently – normative because it uses info gained from
representative groups
Using the test as a predictor is not as good as perfect prediction, but
it is still better than using the normative info
A regression line such as in 3.9 shows that the test score tells us
nothing about the criterion beyond the normative info

TERMS AND ISSUES IN THE USE OF CORRELATION
Residual
Difference between the predicted and the observed values is called
the residual
o
Y-Y’
Important property of residual is that the sum of the residuals always
equals 0
Sum of the squared residuals is the smallest value according to the
principle of least squares
Standard Error of Estimate
Standard deviation of the residuals is the standard error of estimate
A measure of the accuracy of prediction
Prediction is most accurate when the standard error of estimate is
relatively small
Coefficient of Determination
Correlation coefficient squared is known as the coefficient of
determination
Tells us the proportion of the total variation in scores on Y that we
know as a function of information about X
Coefficient of Alienation
Coefficient of alienation is a measure of nonassociation between two variables
Equal to √(1 − r²), where r² is the coefficient of determination
High value means there is a high degree of nonassociation between the 2 variables
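A minimal Python sketch computing the standard error of estimate, coefficient of determination, and coefficient of alienation from hypothetical observed and predicted scores:

import numpy as np

Y_obs = np.array([3.0, 7.0, 8.0, 12.0, 15.0])    # observed criterion scores
Y_pred = np.array([3.8, 6.4, 9.0, 11.6, 14.2])   # predicted scores Y'

see = np.std(Y_obs - Y_pred, ddof=2)     # standard error of estimate (N - 2 df)
r = np.corrcoef(Y_obs, Y_pred)[0, 1]
determination = r ** 2                   # proportion of Y variance known from X
alienation = np.sqrt(1 - r ** 2)         # degree of nonassociation
print(see, determination, alienation)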
Shrinkage
Tendency to overestimate the relationship, particularly if the sample
of subjects is small
Shrinkage is the amount of decrease observed when a regression
equation is created for one population and then applied to another
Cross Validation
Use the regression equation to predict performance in a group of subjects other than the ones on whom the equation was originally derived
The standard error of estimate is then obtained for the relationship between the values predicted by the equation and the values actually observed – this process is called cross validation
The Correlation-Causation Problem
Experiments are required to determine whether manipulation of one
variable causes changes in another variable
A correlation alone does not prove causality, although it might lead to
other research that is designed to establish the causal relationships
between variables
Third Variable Explanation
A third variable, i.e., poor social adjustment, may cause both TV viewing and aggression; this external influence is the third variable
CHAPTER 5: RELIABILITY
RELIABILITY
Dependability and consistency
Error implies that there will always be some inaccuracy in our measurements
Tests that are relatively free of measurement error are deemed to be reliable
Reliability estimates in the range of .70 and .80 are good enough for most purposes in basic research
Reliability coefficient: an index that indicates the ratio between the true score variance on a test and the total variance
HISTORY OF RELIABILITY:
o Charles Spearman (1904): The Proof and Measurement of Association between Two Things
o Then Thorndike
o Item response theory has taken advantage of computer technology to advance psychological measurement significantly
o Based on Spearman's ideas: X = T + E → CLASSICAL TEST THEORY
o Assumes that each person has a true score that would be obtained if there were no errors in measurement
o Difference between the true score and the observed score results from measurement error
o Assumption here is that errors of measurement are random
o Basic sampling theory tells us that the distribution of random errors is bell-shaped
 The center of the distribution should represent the true score, and the dispersion around the mean of the distribution should display the distribution of sampling errors
o Classical test theory assumes that the true score for an individual will not change with repeated applications of the same test
Variance: standard deviation squared. It is useful because
it can be broken into components:
o
True variance: variance from true differences  are
assumed to be stable
o
Error variance: random irrelevant sources
Standard error of measurement: we assume that the distribution of
random errors will be the same for all people, classical test theory
uses the standard deviation of errors as the basic measure of error
o
Standard error of measurement tells us, on the average,
how much a score varies from the true score
o
Standard deviation of the observed score and the
reliability of the test are used to estimate the standard
error of measurement
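A one-line illustration of that estimate, with hypothetical values:

import math

sd_observed = 15.0    # standard deviation of observed scores (hypothetical)
reliability = 0.90    # reliability coefficient of the test (hypothetical)
sem = sd_observed * math.sqrt(1 - reliability)   # standard error of measurement
print(sem)            # 15 * sqrt(.10) ≈ 4.74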
Reliability: proportion of the total variance attributed to true
variance.
o
the greater portion of total variance attributed to true
variance, the more reliable the test
Measurement error: refers to, collectively, all of the factors associated
with the process of measuring some variable, other than the variable
being measured
o
Random error: a source of error in measuring a targeted
variable caused by unpredictable fluctuations and
inconsistencies of other variables in the measurement
process

This source of error fluctuates from one testing
situation to another with no discernible pattern
that would systematically raise or lower scores
o
Systematic Error:

A source of error in measuring a variable that is
typically constant or proportionate to what is
presumed to be true value of the variable being
measured

Error is predictable and fixable

Does not affect score consistency
SOURCES OF ERROR VARIANCE
TEST CONSTRUCTION
o Item sampling or content sampling – refers to variation among items within a test as well as to variation among items between tests
 The extent to which a testtaker's score is affected by the content sampled on a test and by the way the content is sampled (that is, the way in which the item is constructed) is a source of error variance
TEST ADMINISTRATION
o May influence the testtaker's attention or motivation
o Environment variables, testtaker variables, examiner variables, level of professionalism
TEST SCORING AND INTERPRETATION
o Computer scoring and a growing reliance on objective, computer-scorable items have virtually eliminated error variance caused by scorer differences
o However, other tools of assessment still require scoring by trained personnel
o If subjectivity is involved in scoring, then the scorer can be a source of error variance
o Despite rigorous scoring criteria set forth in many of the better-known tests of intelligence, examiners occasionally still are confronted by situations where an examinee's response lies in a gray area
TEST-RETEST RELIABILITY
Also known as time-sampling reliability
Correlating pairs of scores from the same group on two different administrations of the same test
Measure something that is relatively stable over time
Sources of error variance:
o Passage of time: the longer the time that passes, the greater the likelihood that the reliability coefficient will be lower
o Coefficient of stability: the estimate of test-retest reliability obtained when the interval between testings is greater than 6 months
Consider the possibility of a carryover effect: occurs when the first testing session influences scores from the second session
If something affects all the test takers equally, then the results are uniformly affected and no net error occurs
Practice on the first test is a common source of carryover effects
Practice can also affect tests of manual dexterity
Time interval between testing sessions must be selected and
evaluated carefully
Poor test-retest correlations do not always mean that a test is unreliable – they may suggest that the characteristic under study has changed
PARALLEL-FORM OR ALTERNATE FORMS RELIABILITY
compares two equivalent forms of a test that measure the same
attribute
The two forms should be equivalently constructed – same format, etc.
When two forms of the test are available, one can compare performance on one form versus the other – equivalent-forms or parallel-forms reliability
Coefficient of equivalence: the degree of the relationship between various forms of a test, evaluated by means of an alternate-forms or parallel-forms reliability coefficient
Parallel forms: for each form of the test, the means and variances of observed test scores are equal
Alternate forms: different versions of a test that have been constructed so as to be parallel
Disadvantages: (1) two test administrations with the same group are required; (2) test scores may be affected by factors such as motivation, etc.; (3) developing a new version of a test is itself a burden
INTERNAL CONSISTENCY
How well does each item measure the content/construct under consideration?
How consistent are the items with one another?
Used when tests are administered once
If all items on a test measure the same construct, then it has a good
internal consistency
Split-half reliability, KR20, Cronbach Alpha
SPLIT-HALF RELIABILITY
Correlating two pairs of scores obtained from equivalent halves of a
single test administered once.
This is useful when it is impractical to assess reliability with two tests
or to administer test twice
Results of one half of the test are then compared with the results of
the other
Rules in splitting a test into halves:
o
Do not divide test in the middle because it would lower
the reliability
o
Different amounts of anxiety and differences in item
difficulty shall also be considered
o
Randomly assign items to one or the other half of the test
o
use the odd-even system: where one subscore is obtained
for the odd-numbered items in the test and another for
the even-numbered items
To correct for half-length, apply the Spearman-Brown formula, which
allows you to estimate what the correlation between the two halves
would have been if each half had been the length of the whole test
o
Use this if the test user wishes to shorten a test
o
Used to determine the number of items needed to attain a
desired level of reliability
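A minimal Python sketch of odd-even split-half reliability with the Spearman-Brown correction, on a hypothetical 0/1 item matrix:

import numpy as np

# Hypothetical responses: rows = testtakers, columns = items (1 = correct)
scores = np.array([
    [1, 1, 0, 1, 1, 0, 1, 1],
    [1, 0, 0, 1, 0, 0, 1, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 0, 1, 0, 1, 0, 0, 1],
    [1, 1, 0, 1, 1, 1, 0, 1],
])
odd = scores[:, 0::2].sum(axis=1)      # subscore on odd-numbered items
even = scores[:, 1::2].sum(axis=1)     # subscore on even-numbered items
r_half = np.corrcoef(odd, even)[0, 1]  # correlation between the two halves
r_full = 2 * r_half / (1 + r_half)     # Spearman-Brown full-length estimate
print(r_half, r_full)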
Reliability increases as the test length increases
o Test takers with the same score on a homogeneous test probably have similar abilities in the area tested
o Test takers with the same score on a heterogeneous test may have quite different abilities
o However, homogeneous testing is often an insufficient tool for measuring multifaceted psychological variables such as intelligence or personality
Measures of Inter-Scorer Reliability
In some types of tests under some conditions, the score may be more a
function of the scorer than of anything else
Inter-scorer reliability: the degree of agreement or consistency between two or more scorers (or judges or raters) with regard to a particular measure
Coefficient of inter-scorer reliability: coefficient of correlation to
determine the degree of consistency among scorers in the scoring of a test
Kappa statistic is the best method for assessing the level of agreement
among several observers
o
Indicates the actual agreement as a proportion of the potential
agreement following the correction for chance agreement
o
Cohen’s Kappa – 2 raters
o
Fleiss’ Kappa – 3 or more raters
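A minimal Python sketch of Cohen's kappa for two raters, with hypothetical 0/1 ratings:

import numpy as np

rater1 = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 1])   # hypothetical ratings
rater2 = np.array([1, 0, 1, 0, 0, 1, 0, 1, 1, 1])

p_obs = np.mean(rater1 == rater2)            # actual agreement
p1, p2 = rater1.mean(), rater2.mean()
p_chance = p1 * p2 + (1 - p1) * (1 - p2)     # agreement expected by chance
kappa = (p_obs - p_chance) / (1 - p_chance)  # agreement corrected for chance
print(kappa)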
HOMOGENEITY VS. HETEROGENEITY OF TEST ITEMS
Homogeneous items have a high degree of reliability
KUDER-RICHARDSON FORMULAS OR KR20/KR21
Kuder-Richardson technique simultaneously considers all possible
ways of splitting the items
The formula for calculating the reliability of a test in which the items
are dichotomous, scored 0 or 1, is the Kuder-Richardson 20 (see
p.114)
KR-21 was introduced as a simplification – it uses an approximation of the sum of the pq products, based on the mean test score
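A minimal Python sketch of KR-20 on hypothetical dichotomous item responses:

import numpy as np

X = np.array([            # rows = testtakers, columns = items scored 0/1
    [1, 1, 0, 1, 1],
    [1, 0, 0, 1, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0],
    [1, 1, 0, 1, 1],
])
k = X.shape[1]                      # number of items
p = X.mean(axis=0)                  # proportion passing each item
q = 1 - p
var_total = X.sum(axis=1).var()     # variance of total scores
kr20 = (k / (k - 1)) * (1 - np.sum(p * q) / var_total)
print(kr20)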
DYNAMIC VS. STATIC CHARACTERISTICS
Dynamic: trait, state, ability presumed to be ever-changing as a function of
situational and cognitive experiences
Static: trait, state, ability relatively unchanging
CRONBACH ALPHA
Cronbach developed a formula that estimates the internal
consistency of tests in which the items are not scored as 0 or 1 – a
more general reliability estimate, which he called coefficient alpha
o Uses the sum of the individual item variances
o
Most general method of finding estimates of reliability
through internal consistency
Domain sampling: define a domain that represents a single trait or
characteristic, and each item is an individual sample of this general
characteristic
Factor analysis deals with the situation in which a test apparently
measures several different characteristics
o
Good for the process of test construction
Most widely used as a measure of reliability because it requires only
one administration of the test
Ranges from 0 to 1 “bigger is always better”
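A minimal Python sketch of coefficient alpha for items on any scale (hypothetical Likert ratings):

import numpy as np

def cronbach_alpha(items):
    # rows = testtakers, columns = item scores (not restricted to 0/1)
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)      # individual item variances
    total_var = items.sum(axis=1).var(ddof=1)  # variance of total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

ratings = [[4, 5, 3, 4], [2, 3, 2, 3], [5, 5, 4, 5], [3, 4, 3, 3]]
print(cronbach_alpha(ratings))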
Other Methods of Estimating Internal Consistencies
Inter-item consistency: refers to the degree of correlation among all
the items on a scale
o
A measure of inter-item consistency is calculated from a
single administration of a single form of a test
o
An index of inter-item consistency, in turn, is useful in
assessing the homogeneity of the test
o
Tests are said to be homogenous if they contain items that
measure a single trait
o
Definition: the degree to which a test measures a single
factor
o
Heterogeneity: degree to which a test measures different
factors
o
o Ex: homogeneous = a test that assesses knowledge only of television repair skills vs. a general electronics repair test (heterogeneous)
o
The more homogenous a test is, the more inter-item
consistency it can be expected to have
o
Test homogeneity is desirable because it allows relatively
straightforward test-score interpretation
SPEED TESTS VS. POWER TESTS
Speed test: homogeneous items that are uniformly easy; the time limit is so short that few testtakers complete all items
Power test: a long enough time limit, but some items are so difficult that few testtakers earn a perfect score
RESTRICTION OR INFLATION OF RANGE
If it is restricted, reliability tends to be lower.
If it is inflated, reliability tends to be higher.
CRITERION-REFERENCED TESTS
Provide an indication of where a testtaker stands with respect to some
variable or criterion.
Tends to contain material that has been mastered in hierarchical fashion.
Scores here tend to be interpreted in pass-fail terms.
Measure of reliability depends on the variability of the test scores: how
different the scores are from one another.
The Domain Sampling Model
This model considers the problems created by using a limited number
of items to represent a larger and more complicated construct
Our task in reliability analysis is to estimate how much error we would
make by using the score from the shorter test as an estimate of your
true ability
Conceptualizes reliability as the ratio of the variance of the observed
score on the shorter test and the variance of the long-run true score
Reliability can be estimated from the correlation of the observed test
score with the true score
Item Response Theory
Classical test theory requires that exactly the same test items be administered to each person – a significant limitation
Item response theory (IRT) is newer – computer is used to focus on
the range of item difficulty that helps assess an individual’s ability
level
o
More reliable estimate of ability is obtained using a
shorter test with fewer items
o
Takes a lot of items and effort
Generalizability theory
Based on the idea that a person's test scores vary from testing to testing because of variables in the testing situation
Instead of conceiving of all variability in a person's scores as error, Cronbach
encouraged test developers and researchers to describe the details of the
particular test situation or universe leading to a specific test score
This universe is described in terms of its facets: which include things like
the number of items in the test, the amount of training the test scorers
have had, and the purpose of the test administration
According to generalizability theory, given the exact same conditions of all
the facets in the universe, the exact same test score should be obtained
Universe score: the test score obtained; it is analogous to the true score in the true score model
Cronbach suggested that tests be developed with the aid of a
generalizability study followed by a decision study
Generalizability study: examines how generalizable scores from a
particular test are if the test is administered in different situations
How much of an impact different facets of the universe have on the test
score
Ex: is the test score affected by group as opposed to individual
administration
Coefficients of generalizability: the influence of particular facets on the test score is represented by these coefficients, which are similar to reliability coefficients in the true score model
Decision study: developers examine the usefulness of test scores in helping the test user make decisions
The decision study is designed to tell the test user how test scores should
be used and how dependable those scores are as a basis for decisions,
depending on the context of their use
What to Do About Low Reliability
Two common approaches are to increase the length of the test and to
throw out items that run down the reliability
Another procedure is to estimate what the true correlation would
have been if the test did not have measurement error
Increase the Number of Items
The larger the sample, the more likely that the test will represent the
true characteristic
o
This could entail a long and costly process however
Prophecy formula: the Spearman-Brown formula, used to estimate how many items must be added to attain a desired level of reliability
Factor and Item Analysis
Reliability of a test depends on the extent to which all of the items
measure one common characteristic
Factor analysis
o
Tests are most reliable if they are unidimensional: one
factor should account for considerably more of the
variance than any other factor
Or examine the correlation between each item and the total score for
the test
o
Called discriminability analysis: when the correlation
between the performance on a single item and the total
test score is low, the item is probably measuring
something different from the other items on the test
Correction for Attenuation
Potential correlations are attenuated, or diminished, by measurement
error
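A one-line sketch of the correction for attenuation, assuming the standard formula r_true = r_xy / sqrt(r_xx * r_yy):

import math

r_xy, r_xx, r_yy = 0.40, 0.80, 0.70           # hypothetical validity and reliabilities
r_corrected = r_xy / math.sqrt(r_xx * r_yy)   # estimated error-free correlation
print(r_corrected)                            # ≈ 0.53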
CHAPTER 6: VALIDITY
The Concept of Validity
Validity: as applied to a test, is a judgment or estimate of how well a test measures what it purports to measure in a particular context
o Judgment based on evidence about the appropriateness of inferences drawn from test scores
o Validity of a test must be shown from time to time to account for culture and advancement
Inference: a logical result or deduction
Validation: the process of gathering and evaluating evidence about validity; tests and test scores are better described as having "acceptable" or "weak" validity than as simply valid or invalid
o Test user and testtaker both have roles in the validation of a test
o Test users may conduct their own validation studies: may yield insights regarding a particular population of testtakers as compared to the norming sample (in the manual)
o Local validation studies: absolutely necessary when the test user plans to alter in some way the format, instructions, language, or content of the test
Types of Validity (Trinitarian view) *not mutually exclusive; all contribute to a unified picture of a test's validity (critique: the approach is fragmented and incomplete)
o Content validity: measure of validity based on an evaluation of the subjects, topics, or content covered by the items in the test
o Criterion-related validity: measure of validity obtained by evaluating the relationship of scores obtained on the test to scores on other tests or measures
o Construct validity: measure of validity arrived at by executing a comprehensive analysis of (an umbrella validity: every other variety of validity falls under it):
 how scores on the test relate to other test scores and measures
 how scores on the test can be understood within some theoretical framework for understanding the construct that the test was designed to measure
Strategies: ways of approaching the process of test validation
o Content validation strategies
o Criterion-related validation strategies
o Construct validation strategies
Face Validity
o Face validity: relates more to what a test appears to measure to the person being tested than to what the test actually measures
o Judgment concerning how relevant the test items appear to be – usually made by the testtaker, not the test user
o Lack of face validity = lack of confidence in the perceived effectiveness of the test, which decreases the testtaker's motivation/cooperation (*the test may still be useful)
Content Validity
o Content validity: a judgment of how adequately a test samples behavior representative of the universe of behavior that the test was designed to sample
 Ideally, test developers have a clear vision of the construct being measured; that clarity is reflected in the content validity of the test
o Test blueprint: the structure of the evaluation; a plan regarding the types of information to be covered by the items, the number of items tapping each area of coverage, the organization of the items in the test, etc.
 Behavior observation is a technique frequently used in test blueprinting
o Culture and the relativity of content validity
 Tests tend to be thought of as either valid or invalid
 What constitutes historical fact depends to some extent on who is writing the history
 Culture relativity
 Politics (what is politically correct)
o The quantification of content validity
 Important in employment settings → tests used to hire and promote
 One method gauges agreement among raters or judges regarding how essential a particular item is (C.H. Lawshe): "Is the skill or knowledge measured by this item… (a) essential, (b) useful but not essential, or (c) not necessary …to the performance of the job?"
 Content validity ratio (CVR): CVR = (ne − N/2) / (N/2)
o ne → number of panelists stating "essential"
o N → total number of panelists
 CVR is calculated for each item
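A minimal Python sketch of Lawshe's CVR for one item, with hypothetical panel counts:

def content_validity_ratio(n_essential, n_panelists):
    # CVR = (ne - N/2) / (N/2), computed separately for each item
    half = n_panelists / 2
    return (n_essential - half) / half

print(content_validity_ratio(8, 10))   # 8 of 10 say "essential" -> 0.6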
Criterion-Related Validity
Criterion-related validity: a judgment of how adequately a test score can be used to infer an individual's most probable standing on some measure of interest (the measure of interest being the criterion)
2 types:
o Concurrent validity: index of the degree to which a test score is related to some criterion measure obtained at the same time (concurrently)
o Predictive validity: index of the degree to which a test score predicts some criterion measure
What Is a Criterion?
o Criterion: a standard on which a judgment or decision may be based; the standard against which a test or a test score is evaluated
o Characteristics of a criterion:
 Relevant → pertinent or applicable to the matter at hand
 Valid (for the purpose for which it is being used)
 Uncontaminated → criterion contamination: a term applied to a criterion measure that has been based, at least in part, on predictor measures
Concurrent Validity
o Test scores are obtained at about the same time as the criterion measures; measures of the relationship between the test scores and the criterion provide evidence of concurrent validity
o Indicates the extent to which test scores may be used to estimate an individual's present standing on a criterion
o Once the validity of the inference from test scores is established, the test provides a faster, less expensive way to offer a diagnosis or a classification decision
o Concurrent validity of a test can be explored with respect to another test
 Prior research must have satisfactorily demonstrated the 1st test's validity
 1st test = the validating criterion
Predictive Validity
o Test scores may be obtained at one time and the criterion measures obtained at a future time, usually after some intervening event has taken place
 Intervening event → training, experience, therapy, medication, etc.
 Measures of the relationship between the test scores and a criterion measure obtained at a future time provide an indication of the predictive validity of the test (how accurately scores on the test predict some criterion measure)
o Ex: SAT test score and freshman GPA
o Judgments of criterion-related validity are based on 2 types of statistical evidence:
 The validity coefficient
 Validity coefficient: a correlation coefficient that provides a measure of the relationship between test scores and scores on the criterion measure
 Ex: the Pearson correlation coefficient (r) used to determine the validity between 2 measures
 Affected by restriction or inflation of range
 Is the range of scores employed appropriate to the objective of the correlational analysis?
 No rules regarding the validity coefficient (how high or low it should/could be for the test to be valid)
 Incremental validity
o More than one predictor may be used
o Incremental validity: the degree to which an additional predictor explains something about the criterion measure that is not explained by predictors already in use

Expectancy data

Expectancy data: provides info that can
be used in evaluating the criterion-related
validity of a test

Score obtained on the test → likelihood that the testtaker will score within some interval of scores on a criterion measure ("passing", "acceptable", etc.)

Expectancy table: shows the percentage
of people within specified test-score
intervals who subsequently were placed
in various categories of the criterion
o
May be created from
scatterplot
o
Shows relationships

Expectancy chart: graphic representation
of an expectancy table
o
The higher the initial rating,
the greater the probability of
job/academic success

Taylor-Russell Tables – provide an estimate of the extent to which inclusion of a particular test in the selection system will actually improve selection
 Selection ratio – relationship between the number of people to be hired and the number of people available to be hired
 Base rate – percentage of people hired under the existing system for a particular position who are considered successful
 Relationship between the predictor and the criterion must be linear
 Naylor-Shine Tables – use the difference between the means of the selected and unselected groups to derive an index of what the test is adding to already established procedures
o
Decision theory and Test utility

Base rate – extent to which a particular trait,
behavior, characteristic or attribute exists in the
population

Hit rate – defined as the proportion of people a test accurately identifies as possessing or exhibiting a particular trait
 Miss rate – the proportion of people the test fails to identify accurately as having or not having a particular attribute
 False positive (Type I error) – the test predicts that the testtaker possesses the attribute when he or she actually does not. Ex: scored above the cutoff score and was hired, but failed at the job
 False negative (Type II error) – the test predicts that the testtaker does not possess the attribute when he or she actually does. Ex: scored below the cutoff score and was not hired, but could have been successful in the job
Construct Validity
o
Construct validity: judgment about the appropriateness of
inferences drawn from test scores regarding individual standings
on a variable called a construct
 Construct: an informed, scientific idea developed or hypothesized to describe or explain behavior

Ex: intelligence, depression, motivation,
personality, etc.

Unobservable, presupposed (underlying)
traits that a test developer invokes to
describe test behavior/criterion
performance

Viewed as unifying concept for all validity evidence
Evidence of Construct Validity

Various techniques of construct validation that
provide evidence:
 Test is homogeneous → measures a single construct

Test scores increase/decrease as function
of age, passage of time, or experimental
manipulation (theoretically predicted)
 Test scores obtained after some event or the passage of time differ from pretest scores (theoretically predicted)

Test scores obtained by people from
distinct groups vary (theoretically
predicted)

Test scores correlate with scores on other
tests (theoretically predicted)

Evidence of homogeneity

Homogeneity: refers to how uniform a
test is in measuring a single concept
 Evidence: correlations between subtest scores and the total test score

Item-analysis procedures have been used
in quest for test homogeneity

Desirable but not necessary

Contributes no info about how construct
being measured relates to other
constructs

Evidence of changes with age

If test purports to measure a construct
that changes over time then the test
scores, too, should show progressive
changes to be considered valid
measurement of construct

Does not in itself provide info about how
construct relates to other constructs

Evidence of pretest-posttest changes

Can be evidence of construct validity

Some more typical intervening
experiences responsible for changes in
test scores are:
o
Formal education
o
Therapy/medication
o
Any life experience

Evidence from distinct groups/method of contrasted
groups

Method of contrasted groups: one way of
providing evidence for the validity of a
test is to demonstrate that scores on the
test vary in a predictable way as a
function of membership in some group
 Rationale → if a test is a valid measure of a particular construct, groups of people presumed to differ with respect to that construct should have correspondingly different test scores
 Convergent evidence
 Evidence for the construct validity of a particular test may converge from a number of sources, such as tests or measures designed to assess the same/similar construct
 Convergent evidence: scores on a test undergoing construct validation correlate highly in the predicted direction with scores on older, more established, already validated tests designed to measure the same/similar construct
 Discriminant evidence
 Discriminant evidence: a validity coefficient showing little relationship between test scores and/or other variables with which scores on the test being construct-validated should not theoretically be correlated
 Provides evidence of construct validity
 Multitrait-multimethod matrix: "two or more traits," "two or more methods"; the matrix/table that results from correlating variables (traits) within and between methods
 Factor analysis
 Factor analysis: shorthand term for a class of mathematical procedures designed to identify factors or specific variables that are typically attributes, characteristics, or dimensions on which people may differ
 Frequently used as a data reduction method in which several sets of scores and the correlations between them are analyzed
 Confirmatory factor analysis: researchers test the degree to which a hypothetical model fits the actual data
o Factor loading: conveys information about the extent to which the factor determines the test score or scores
o Complex procedures
Validity, Bias, and Fairness
o Test Bias
 Bias: a factor inherent in a test that systematically prevents accurate, impartial measurement
 Technical means exist to identify and remedy bias (mathematically)
 Bias implies systematic variation
 Rating error
 Rating: a numerical or verbal judgment (or both) that places a person or an attribute along a continuum identified by a scale of numerical or word descriptions, known as a rating scale
 Rating error: a judgment resulting from the intentional or unintentional misuse of a rating scale
 Leniency error/generosity error: error in rating that arises from the tendency on the part of the rater to be lenient in scoring, marking, and/or grading
 Severity error: rater exhibits a general and systematic reluctance to give ratings at either the positive or negative extreme
 One way to overcome restriction-of-range rating errors is to use rankings: a procedure that requires the rater to measure individuals against one another instead of against an absolute scale
 Rater is forced to select 1st, 2nd, 3rd, etc.
 Halo effect: the fact that, for some raters, some ratees can do no wrong
 Tendency to give a particular ratee a higher rating than he or she objectively deserves
 Criterion data may be influenced by the rater's knowledge of ratee race, gender, etc.
o Test Fairness
 Issues of fairness tend to be more difficult and involve values
 Fairness: the extent to which a test is used in an impartial, just, and equitable way
 Sources of misunderstanding:
 Discrimination
 Group not included in the standardization sample
 Performance differences between identified groups
Relationship Between Reliability and Validity
A test should not correlate more highly with any other variable than it correlates with itself
A modest correlation between the true scores on two traits may be missed if the test for each of the traits is not highly reliable
We can have reliability without validity
o But it is impossible to demonstrate that an unreliable test is valid
CHAPTER 7: UTILITY
Utility: usefulness or practical value of testing to improve efficiency
Factors That Affect a Test's Utility
 Psychometric Soundness
o Reliability and validity of a test
o Gives us the practical value of both the scores (reliability and validity)
o They tell us whether decisions are cost-effective
o A valid test is not always a useful test
 especially if testtakers do not follow test directions
 Costs
o Economic and noneconomic
o Ex.) using a less expensive and therefore less stringent application process for airline personnel
 Benefits
o Profits, gains, advantages
o Ex.) a more stringent hiring policy → more productive employees
o Ex.) maintaining a successful academic environment at a university
Utility Analysis
What Is Utility Analysis?
- A family of techniques that entail a cost-benefit analysis designed to yield information relevant to a decision about the usefulness and/or practical value of a tool of assessment
Utility Analysis: An Illustration
What's the company's goal?
 Limit the cost of selection
o Don't use the FERT
 Ensure that qualified candidates are not rejected
o Set a cut score that yields the lowest false negative rate
 Ensure that all candidates selected will prove to be qualified
o Set a cut score that yields the lowest false positive rate
 Ensure, to the extent possible, that qualified candidates will be selected and unqualified candidates will be rejected
o False positives are no better or worse than false negatives
o Highest hit rate and lowest miss rate
How Is a Utility Analysis Conducted?
- The objective dictates what sort of information will be required as well as the specific methods to be used
 Expectancy Data
o An expectancy table provides an indication of the likelihood that a testtaker will score within some interval of scores on a criterion measure
o Used to weigh costs vs. benefits
 Brogden-Cronbach-Gleser formula
o Utility gain: estimate of the benefit of using a particular test or selection method
o Most simply, benefits − costs
o Productivity gain: estimated increase in work output
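A hedged sketch of a Brogden-Cronbach-Gleser-style utility gain, assuming the common form gain = N·T·r_xy·SDy·Z̄m minus the cost of testing; all values are hypothetical:

n_hired = 10              # number of applicants selected
tenure = 2.0              # average expected years on the job (T)
validity = 0.45           # criterion-related validity of the test (r_xy)
sd_dollars = 9000.0       # SD of job performance in dollar terms (SDy)
z_mean_hired = 1.0        # mean standardized test score of those hired
n_tested = 60             # total applicants tested
cost_per_test = 50.0

utility_gain = (n_hired * tenure * validity * sd_dollars * z_mean_hired
                - n_tested * cost_per_test)
print(utility_gain)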
Some Practical Considerations
 The Pool of Job Applicants
o There is rarely a limitless supply of potential employees
o Dependent on many factors, including the economic environment
o We assume that top-scoring individuals will accept the job, but those individuals are more likely to be the ones being offered higher positions elsewhere
 The Complexity of the Job
o It is questionable whether the same utility analysis methods can be used to measure the eligibility of jobs of varying complexity
 The Cut Score in Use
o Relative cut score: may be defined as a reference point based on norm-related considerations rather than on the relationship of test scores to a criterion
 Also called a norm-referenced cut score
 Ex.) the top 10% of test scores get A's
o Fixed cut score: set with reference to a judgment concerning a minimum level of proficiency required to be included in a particular classification
 Also called absolute cut scores
o Multiple cut scores: using two or more cut scores with reference to one predictor for the purpose of categorizing testtakers
 Ex.) having cut scores that mark an A, B, C, etc., all measuring the same predictor
o Multiple hurdles: for success, requires an individual to complete many tasks, with elimination at each level
 Ex.) written application → group interview → personal interview, etc.
o Compensatory model of selection: the assumption is made that high scores on one attribute can compensate for low scores on another attribute
Methods for Setting Cut Scores
The Angoff Method
- Judgments of experts are averaged
The Known Groups Method
- Collection of data on the predictor of interest from groups known to possess, and not to possess, a trait, attribute, or ability
- Cut score is based on where the test best discriminates between the two groups' performance
IRT-Based Methods
- Based on the testtaker's performance across all items on a test
- Some portion of the test items must be answered correctly
- Item-mapping method: determining the difficulty level reflected by the cut score
- Bookmark method: test items are listed, one per page, in ascending level of difficulty. An expert places a bookmark to mark the divide that separates testtakers who have acquired minimal knowledge, skills, or abilities from those who have not
- Problems include the training of experts, possible floor and ceiling effects, and the optimal length of item booklets
Other Methods
- Discriminant analysis: a family of statistical techniques used to shed light on the relationship between certain variables and two or more naturally occurring groups
- Ex.) the relationship between scores on tests and people judged to be successful or unsuccessful at a job
CHAPTER 8: TEST DEVELOPMENT
STEPS:
1. TEST CONCEPTUALIZATION
2. TEST CONSTRUCTION
3. TEST TRYOUT
4. ITEM ANALYSIS
5. TEST REVISION
TEST CONCEPTUALIZATION
- A test may begin with a thought or stimulus that could be almost anything
- An emerging social phenomenon or pattern of behavior might serve as the stimulus for the development of a new test
- Norm-referenced: a good item is one for which high scorers on the test respond correctly and low scorers respond to that same item incorrectly
- Criterion-referenced: high scorers on the test get a particular item right whereas low scorers on the test get that same item wrong
- Pilot work: pilot study or pilot research; to know whether some items should be included in the final form of the instrument
o The test developer typically attempts to determine how best to measure a targeted construct
TEST CONSTRUCTION
- Scaling: the process of setting rules for assigning numbers in measurement
- L.L. Thurstone: credited for being at the forefront of efforts to develop methodologically sound scaling methods
TYPES OF SCALES:
- Nominal, ordinal, interval, or ratio
- Age-based scale
- Grade-based scale
- Stanine scale (raw score converted to 1-9)
- Unidimensional vs. multidimensional
o Unidimensional: measuring one construct
o Multidimensional: measuring more than one construct
- Comparative vs. categorical
o Comparative scaling: entails judgments of a stimulus in comparison with every other stimulus on the scale
o Categorical scaling: stimuli are placed into one of two or more alternative categories that differ quantitatively with respect to some continuum
- Rating scale: a grouping of words, statements, or symbols on which judgments of the strength of a particular trait, attitude, or emotion are indicated by the testtaker
- Summative scale: the final score is obtained by summing the ratings across all the items
- Likert scale: each item presents the testtaker with five alternative responses, usually on an agree-disagree or approve-disapprove continuum
- Method of paired comparisons: the testtaker is presented with two stimuli and asked to compare them
- Guttman scale (scalogram analysis): items range from sequentially weaker to stronger expressions of an attitude, belief, or feeling; a testtaker who agrees with the stronger statement is assumed to also agree with the milder statements
- Equal-appearing intervals (Thurstone): direct estimation, because there is no need to transform the testtaker's responses to another scale
WRITING ITEMS
- 3 questions of the test developer:
o What range of content should the items cover?
o Which of the many different types of item formats should be employed?
o How many items should be written in total and for each content area covered?
- Item pool: the reservoir from which items will or will not be drawn for the final version of the test (should contain about double the number of items the final version will have)
- Item format: variables such as the form, plan, structure, arrangement, and layout of individual test items
o 2 types:
o 1.) Selected-response format: testtaker selects a response from a set of alternative responses
 includes multiple-choice, true-false, and matching
o 2.) Constructed-response format: testtaker supplies or creates the correct answer
 includes completion item, short answer, and essay
- Writing items for computer administration
o Item bank: a relatively large and easily accessible collection of test questions
o Computerized Adaptive Testing (CAT): an interactive, computer-administered testtaking process wherein items presented to the testtaker are based in part on the testtaker's performance on previous items
o Floor effect: the diminished utility of an assessment tool for distinguishing testtakers at the low end of the ability, trait, or other attribute being measured
o Ceiling effect: the diminished utility of an assessment tool for distinguishing testtakers at the high end of the ability, trait, or other attribute being measured
o Item branching: the ability of the computer to tailor the content and order of presentation of test items on the basis of responses to previous items
SCORING ITEMS
- Cumulative scoring: testtakers earn cumulative credit with regard to a particular construct
- Class/category scoring: testtaker responses earn credit toward placement in a particular class or category with other testtakers whose pattern of responses is presumably similar in some way
- Ipsative scoring: comparing a testtaker's score on one scale within a test to another scale within that same test
o ex.) "John's need for achievement is higher than his need for affiliation"
ITEM WRITING (KAPLAN BOOK)
- Personality and intelligence tests require different sorts of responses
- Guidelines for item writing:
o Define clearly what you want to measure
o Generate an item pool
o Avoid exceptionally long items
o Keep the level of reading difficulty appropriate for those who will complete the scale
o Avoid "double-barreled" items that convey two or more ideas at the same time
o Consider mixing positively and negatively worded items
- Must be sensitive to ethnic and cultural differences
- Items that retain their reliability are more likely to focus on skills, while those that lose reliability focus on more abstract concepts
Item Formats
- The simplest test uses the dichotomous format
The Dichotomous Format
- The dichotomous format offers two alternatives for each item
o i.e., a true-false examination
- Advantages:
o Simplicity
o True-false items require absolute judgment
- Disadvantages:
o True-false items encourage students to memorize material
o "Truth" often comes in shades of gray
o The mere chance of getting any item correct is 50%
- A yes-no format is used on personality tests
The Polytomous Format
- Multiple-choice = polytomous
- The polytomous format resembles the dichotomous format except that each item has more than two alternatives
o Multiple-choice exams
- Advantage:
o Takes little time for test takers to respond to a particular item because they do not have to write
- Incorrect choices are called distractors
- Disadvantages:
o How many distractors should a test have? → usually 3 or 4
o Poor distractors can hurt the reliability/validity of the test
o Three-alternative multiple-choice items may be better than five-alternative items because they retain the psychometric value but take less time to develop and administer
o Scoring of MC exams? → simple guessing will produce some correct answers
o Correcting for this, though, the expected score from guessing is 0 – as getting a question wrong loses you a point
- Guessing can be good if you can narrow down a couple of answers
- Students are more likely to guess when they anticipate a lower grade on a test than when they are more confident
- Guessing threshold describes the chances that a low-ability test taker will obtain each score
- True-false and MC tests are common in educational and achievement tests
- Likert format, category scale, and the Q-sort are used for personality-attitude tests
Likert Format
- Likert format: requires that a respondent indicate the degree of agreement with a particular attitudinal question
o Strongly disagree … Strongly agree
o For measurements of attitude
- Used to create Likert scales: such scales require assessment of item discriminability
- Familiar and easy – likely to remain popular in personality and attitude tests
Category Format
- Category format: uses more choices than the Likert format, e.g., a 10-point rating scale
- Disadvantage: responses to items on 10-point scales are affected by the groupings of the people or things being rated
- People change their ratings depending on context
o This problem can be avoided if the endpoints of the scale are clearly defined and the subjects are frequently reminded of the definitions of the endpoints
- Optimal number of points is 7?
o The number depends on the fineness of the discrimination that subjects are willing to make
o When people are highly involved with some issue, they will tend to respond best to a greater number of categories
- Increasing the number of response categories may not increase reliability and validity
- Visual analogue scale: respondent is given a 100-millimeter line and asked to place a mark between two well-defined endpoints
o Used, e.g., to measure self-rated health
Checklists and Q-Sorts
- Adjective Checklist: subject receives a long list of adjectives and indicates whether each one is characteristic of himself or herself
o Requires subjects either to endorse such adjectives or not, thus allowing only two choices for each item
- Q-Sort: increases the number of categories
o Used to describe oneself or to provide ratings of others
Other Possibilities
- Forced-choice and Likert formats are clearly the most popular in contemporary tests and measures
- Checklists have fallen out of favor because they are more prone to error than are formats that require responses to every item
- Frequent advice is to not use "all of the above" as a response option
TEST TRYOUT
- What is a good item?
o Reliable and valid
o Helps to discriminate testtakers
ITEM ANALYSIS
- The Item-Difficulty Index
o Obtained by calculating the proportion of the total number of testtakers who answered the item correctly ("p")
o Higher p = easier item
o "Difficulty" can be replaced with "endorsement" in non-achievement tests
o The midpoint representing the optimal difficulty is obtained by summing the chance-success proportion and 1.00 and then dividing the sum by 2
- The Item-Reliability Index
o Gives an indication of the internal consistency of a test
o Equal to the product of the item-score standard deviation (s) and the correlation (r) between the item score and the total test score
o Factor analysis and inter-item consistency
 Factor analysis determines whether items on a test appear to be measuring the same thing
- The Item-Validity Index
o Statistic designed to provide an indication of the degree to which a test is measuring what it purports to measure
o Requires the item-score standard deviation and the correlation between the item score and the criterion score
- The Item-Discrimination Index ("d")
o Measures how adequately an item separates or discriminates between high scorers and low scorers
o Compares performance on a particular item with performance in the upper and lower regions of a distribution of continuous test scores
o A higher d means a greater number of high scorers answered the item correctly
o A negative d means low-scoring examinees are more likely to answer the item correctly than high-scoring examinees
o Analysis of item alternatives (distractors) can follow
- Item-Characteristic Curves
o Graphic representation of item difficulty and discrimination
- Other Considerations in Item Analysis
o Guessing
 Usually in some direction
 Depends on an individual's ability to take risks
o Item fairness
o Bias
o Speed tests
 The last items will appear to be more difficult because not everyone got to them
Qualitative Item Analysis
 Qualitative methods: techniques of data generation and analysis that rely primarily on verbal rather than mathematical or statistical procedures
 Qualitative item analysis: various nonstatistical procedures designed to explore how individual test items work
o Through means like interviews and group discussions
 "Think aloud" test administration
o An approach to cognitive assessment that entails respondents vocalizing thoughts as they occur
o Used to shed light on the testtaker's thought processes during the administration of a test
 Expert panels
o Sensitivity review: study of test items in which they are examined for fairness to all prospective testtakers as well as for the presence of offensive language, stereotypes, or situations
ITEM ANALYSIS (KAPLAN BASED)
The Extreme Group Method
- Compares people who have done well with those who have done poorly on a test
- The difference between the proportions answering the item correctly in the two groups is called the discrimination index
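A minimal Python sketch of the item-difficulty index (p) and an extreme-group discrimination index (d) described above; data are hypothetical, and thirds are used instead of the conventional upper/lower 27% because the sample is tiny:

import numpy as np

item = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 0])              # 0/1 responses to one item
total = np.array([38, 35, 12, 30, 15, 40, 28, 10, 33, 18])   # total test scores

p = item.mean()                      # proportion answering the item correctly
order = np.argsort(total)
n = len(item) // 3
d = item[order[-n:]].mean() - item[order[:n]].mean()   # high-group minus low-group
print(p, d)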
The Point Biserial Method
- Find the correlation between performance on the item and performance on the total test
- Correlation between a dichotomous variable and a continuous variable is called a point biserial correlation
- On tests with only a few items, using this is problematic because performance on the item contributes to the total test score
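A minimal sketch: because the point-biserial is just a Pearson correlation with one dichotomous variable, np.corrcoef suffices (hypothetical data):

import numpy as np

item = np.array([1, 0, 1, 1, 0, 1, 0, 1])                           # 0/1 item responses
total = np.array([42.0, 20.0, 35.0, 38.0, 22.0, 45.0, 18.0, 30.0])  # total scores
r_pb = np.corrcoef(item, total)[0, 1]   # point-biserial correlation
print(r_pb)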
Pictures of Item Characteristics
- A valuable way to learn about items is to graph their characteristics, which you can do with the item characteristic curve
- Prepare a graph for each individual test item
o The total test score is used as an estimate of the amount of a "trait" possessed by individuals
- The relationship between performance on the item and performance on the test gives some info about how well the item is tapping the info we want
Drawing the Item Characteristic Curve
- To draw this, we need to define discrete categories of test performance
- If the test has been given to many people, we might choose to make each test score a single category
- A gradual positive slope of the line demonstrates that the proportion of people who pass the item gradually increases as test scores increase
o This means that the item successfully discriminates at all levels of test performance
- Ranges in which the curve changes suggest that the item is sensitive, while flat ranges suggest areas of low sensitivity
- Item analysis breaks the general rule that increasing the number of items makes a test more reliable
- When bad items are eliminated, the effects of chance responding can be eliminated and the test can become more efficient, reliable, and valid
Item Response Theory
- According to classical test theory, a score is derived from the sum of an individual's responses to various items, which are sampled from a larger domain that represents a specific trait or ability
- Newer approaches consider the chances of getting particular items right or wrong – item response theory (IRT) – and make extensive use of item analysis
o With IRT, each item on a test has its own item characteristic curve that describes the probability of getting each particular item right or wrong given the ability level of each test taker
o Testers can make an ability judgment without subjecting the test taker to all of the test items
- Technical advantage: builds on traditional models of item analysis and can provide info on item functioning, the value of specific items, and the reliability of a scale
- The two dimensions used are difficulty and discriminability
- The most attractive advantage is that one can easily adapt IRT tests for computer administration
o The computer can rapidly identify the specific items that are required to assess a particular ability level
- "Peaked conventional" tests concentrate items at one level of difficulty
- "Rectangular conventional" tests require that items be selected to create a wide range in level of difficulty
o Problem: only a few items of the test are appropriate for individuals at each ability level; many test takers spend much of their time responding to items either considerably below their ability level or too difficult to solve
- IRT addresses traditional problems in test construction well
- IRT can identify respondents with unusual response patterns and offer insights into the cognitive processes of the test taker
- May also reduce the biases against people who are slow in completing test problems
External Criteria
- Item analysis has been persistently plagued by researchers' continued dependence on internal criteria, or total test score, for evaluating items
Linking Uncommon Measures
- One challenge in test applications is how to determine linkages between two different measures
Items for Criterion-Referenced Tests
- Traditional use of tests requires that we determine how well someone has done on a test by comparing the person's performance to that of others
- Criterion-referenced tests compare performance with some clearly defined criterion for learning
o Popular approach in individualized instruction programs
o Regarded as diagnostic instruments
- The first step in developing these tests involves clearly specifying the objectives by writing clear and precise statements about what the learning program is attempting to achieve
- To evaluate the items, one should give the test to two groups of students – one that has been exposed to the learning unit and one that has not
- The bottom of the V in the resulting frequency distribution is the antimode – the least frequent score
- This point divides those who have been exposed to the unit from those who have not been exposed and is usually taken as the cutting score or point, or what marks the point of decision
- When people get scores higher than the antimode, we assume that they have met the objective of the test
Limitations of Item Analysis
- Main problem: though statistical methods for item analysis tell the test constructor which items do a good job of separating students, they do not help the students learn
- Although the data are available to give the child feedback on the "bug" in their thinking, nothing in the testing procedure initiates this guidance
TEST REVISION
Test Revision in the Life Cycle of an Existing Test
 Tests get old and need revision
 Questions arise over the equivalence of the two versions
 Cross-validation and co-validation
o Cross-validation: revalidation of a test on a sample of testtakers other than those on whom test performance was originally found to be a valid predictor of some criterion
o Validity shrinkage: the decrease in item validities that inevitably occurs after cross-validation of findings
o Co-validation: a test validation process conducted on two or more tests using the same sample of testtakers
o Co-norming: when co-validation is used in conjunction with the creation of norms or the revision of existing norms
o Quality assurance during test revision
 Test givers must have some degree of qualification, training, and testing
 Anchor protocol: a test protocol scored by a highly authoritative scorer that is designed as a model for scoring and a mechanism for resolving scoring discrepancies
 Scoring drift: a discrepancy between the scoring in an anchor protocol and the scoring of another protocol
The Use of IRT in Building and Revising Tests
 Evaluating the properties of existing tests and guiding test revision
 Determining measurement equivalence across testtaker populations
o Differential item functioning (DIF): the phenomenon wherein an item functions differently in one group of testtakers as compared to another group of testtakers known to have the same level of the underlying trait
 Developing item banks
o Items from other instruments → item pool → scrutiny → preliminary item bank → psychometric testing → item bank
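A minimal Python sketch of a two-parameter logistic item characteristic curve; the parameters are hypothetical:

import numpy as np

def icc_2pl(theta, a, b):
    # Probability of a correct response given ability theta,
    # item discrimination a, and item difficulty b
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

abilities = np.linspace(-3, 3, 7)
print(icc_2pl(abilities, a=1.2, b=0.0))   # probabilities rise with ability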
CHAPTER 9: INTELLIGENCE AND ITS MEASUREMENT
What is Intelligence?
- Intelligence: a multifaceted capacity that manifests itself in different ways across the lifespan. It usually includes the abilities to:
  o Acquire and apply knowledge
  o Reason logically
  o Plan effectively
  o Infer perceptively
  o Make judgments and solve problems
  o Grasp and visualize concepts
  o Pay attention
  o Be intuitive
  o Find the right words and thoughts with facility
  o Cope with, adjust to, and make the most of new situations
Intelligence Defined: Views of the Lay Public
- Both social and academic
Intelligence Defined: Views of Scholars and Test Professionals
- Francis Galton
  o First to publish on the heritability of intelligence
  o Believed the most intelligent persons were those with the best sensory abilities
- Alfred Binet
  o Made tests of intelligence, but didn't define it
  o Components of intelligence: reasoning, judgment, memory, abstraction
  o Added that any definition is complex; intelligence requires the interaction of these components
  o Argued that when one solves a particular problem, the abilities used cannot be separated because they interact to produce the solution
- David Wechsler
  o Conceptualized intelligence as an "aggregate" or "global" capacity
  o Stressed the complexity of intelligence
  o Believed the best way to measure this global ability was by measuring aspects of several "qualitatively differentiable" abilities
- Jean Piaget
  o Studied children
  o Believed the order of maturation to be unchangeable
  o With age comes an increase in schemata: organized actions or mental structures that, when applied to the world, lead to knowing or understanding
  o Learning occurs through assimilation (actively organizing new information so that it fits in with what is already perceived and thought) and accommodation (changing what is already perceived or thought so that it fits with new information)
  o Stages: Sensorimotor (0-2), Preoperational (2-6), Concrete Operational (7-12), Formal Operational (12 and older)
- All of these views share interactionism: a complex concept by which heredity and environment are presumed to interact and influence the development of one's intelligence
- Factor-analytic theories: the focus is squarely on identifying the ability or abilities deemed to constitute intelligence
- Information-processing theories: the focus is on identifying the specific mental processes that constitute intelligence
Factor-Analytic Theories of Intelligence
- Charles Spearman: pioneered new techniques to measure intercorrelations between tests
  o Posited the existence of a general intellectual ability factor (g) that is tapped by all other mental abilities
  o g represents the portion of the variance that all intelligence tests have in common; the remaining portions of the variance are accounted for either by specific components (s) or by error components (e)
  o The greater a test's g loading, the better the test was thought to predict overall intelligence
  o Group factors: neither as general as g nor as specific as s (e.g., linguistic, mechanical, and arithmetical abilities)
- Guilford: multiple-factor model of intelligence
  o Explained mental activities by deemphasizing any reference to g
- Thurstone: conceived of intelligence as being composed of 7 primary abilities
- Gardner: developed the theory of multiple intelligences
  o Logical-mathematical, bodily-kinesthetic, linguistic, musical, spatial, interpersonal, and intrapersonal
  o There is a question over whether emotional intelligence exists
- Raymond Cattell: fluid vs. crystallized intelligence
  o Crystallized intelligence: acquired skills and knowledge and their retrieval; the retrieval of information and application of general knowledge
  o Fluid intelligence: nonverbal, relatively culture-free, and independent of specific instruction
- Horn: added more factors to the original seven
  o Vulnerable abilities: decline with age and tend not to return to preinjury levels following brain damage
  o Maintained abilities: tend not to decline with age and may return to preinjury levels following brain damage
- Carroll
  o Three-stratum theory of cognitive abilities: layered, like geology
  o Hierarchical model: all of the abilities listed in a stratum are subsumed by or incorporated in the strata above
  o Those in the first stratum are narrow abilities
- CHC model (Cattell-Horn-Carroll)
  o Some overlap with the source models, some differences
  o Doesn't use g
  o Has broader abilities than Carroll's theory
- McGrew: integrated the Cattell-Horn and Carroll models
- McGrew and Flanagan: the McGrew-Flanagan CHC model
  o Features 10 broad-stratum abilities
  o 70 narrow-stratum abilities
  o Makes no provision for a general intellectual ability factor (g)
  o g was omitted because it has little practical relevance to cross-battery assessment and interpretation
The Information-Processing View
- Aleksandr Luria
  o Concerned with how (not what) information is processed
  o Simultaneous/parallel processing: information is integrated all at once
  o Successive/sequential processing: each bit of information is processed individually
- PASS model (Planning, Attention, Simultaneous, Successive): a model of assessing intelligence
- Sternberg: "The essence of intelligence is that it provides a means to govern ourselves so that our thoughts and actions are organized, coherent, and responsive to both our internally driven needs and to the needs of the environment"
Measuring Intelligence
Types of Tasks Used in Intelligence Tests
- Infants: sensorimotor tasks, plus interviews with parents
- Older children: verbal and performance abilities
- Mental age: an index that refers to the chronological age equivalent of one's test performance
- Adults: retention of general information, quantitative reasoning, expressive language and memory, and social judgment
Theory in Intelligence Test Development and Interpretation
- Wechsler made a dichotomous test (Performance and Verbal) but advocated a multifaceted definition
- Thorndike: intelligence = social, concrete, abstract
- Putting theories into tests is extremely hard
Intelligence: Some Issues
Nature vs. Nurture
- Currently believed to be a mix of the two
- Preformationism: all structures, including intelligence, are present at birth and can't be improved upon
- This led to predeterminism: the doctrine that one's abilities are predetermined by genetic inheritance and that no learning or intervention can enhance them
- Interactionist view: people inherit a certain intellectual potential
  o There's a limit set by genetics (e.g., one can't ever have x-ray vision)
The Stability of Intelligence
- Fairly stable throughout one's adult life
- Cognitive abilities nonetheless seem to decline with age
The Construct Validity of Tests of Intelligence
- Establishing construct validity requires a unified understanding of what intelligence is
- Very difficult: Spearman says it's one thing, Guilford says it's many
- Thorndike's approach is a sort of compromise
  o Look for one central factor with three additional factors representing social, concrete, and abstract intelligences
Other Issues
- Flynn effect: IQ scores seem to rise every year, but the rise is not coupled with a rise in "true intelligence"
- Personality
  o High IQ associated with: need for achievement, competitiveness, curiosity, confidence, emotional stability, etc.
  o Low IQ associated with: passivity, dependence, maladjustment
  o Temperament (the term used to describe infants)
- Gender
  o Men usually outscore women on visual spatialization tasks and overall intelligence scores
  o Women tend to outscore men on language-skill tasks
  o But the differences can be bridged
- Family Environment
  o Divorce can have negative effects
  o Influence begins with "maternal effects" in the womb
- Culture
  o Provides specific models for thinking, acting, and feeling
  o It is assumed that if cultural factors can be controlled, then differences between cultural groups will be lessened
  o It is also assumed that culture can be removed by relying exclusively on nonverbal tasks
    - Such tasks tend not to be very good at predicting success in various academic and business settings
  o Culture loading: the extent to which a test incorporates the vocabulary, concepts, traditions, knowledge, and feelings associated with a particular culture
  o No test can be completely culture-free
  o Culture-fair intelligence test: a test or assessment process designed to minimize the influence of culture on various aspects of the evaluation procedure
  o Another approach called for culture-specific intelligence tests
    - Ex.) The BITCH measured streetwiseness
    - It lacked predictive validity and useful, practical information
CHAPTER 10: TESTS OF INTELLIGENCE
The Stanford-Binet Intelligence Scales
- First test to have detailed administration and scoring instructions
- First American test to measure IQ
- First to use alternate items (an item that can be used in place of another)
- Lacked minority-group representation in its early standardization samples
- Ratio IQ = (mental age / chronological age) x 100
- Deviation IQ (test composite): performance of one individual compared to the performance of others of the same age; has a mean of 100 and a standard deviation of 16
- Age scale: items grouped by age
- Point scale: items organized by category
The Stanford-Binet Intelligence Scales: Fifth Edition
- Measures fluid intelligence, crystallized knowledge, quantitative knowledge, visual processing, and short-term (working) memory
- Utilizes adaptive testing: testing individually tailored to the testtaker, to ensure that items are neither too difficult (frustrating) nor too easy (false hope)
- The examiner establishes rapport with the testtaker, then administers a routing test to direct (route) the examinee to the test items most likely to be at an optimal level of difficulty
- Teaching items: show the testtaker what is expected and how to do it
  o Can be used for qualitative assessment, but not for scoring
- Subtests for the verbal and nonverbal batteries share the same names but involve different tasks
- Floor: lowest level of items on a subtest
- Ceiling: highest-level item of a subtest
- Basal level: base-level criterion that must be met for testing on the subtest to continue
- The ceiling level is met when the testtaker fails a certain number of items in a row; the test is discontinued there
- Scores: raw -> standard -> composite
- Extra-test behavior: behavioral observations made during testing
The Wechsler Tests
- Commonality between all versions: all yield deviation IQs with a mean of 100 and a standard deviation of 15
Wechsler Adult Intelligence Scale - Fourth Edition (WAIS-IV)
- Core subtests: administered to obtain a composite score
- Supplemental/optional subtests: provide additional clinical information or extend the number of abilities or processes sampled
- Yields four index scores: a Verbal Comprehension Index, a Working Memory Index, a Perceptual Reasoning Index, and a Processing Speed Index
The Wechsler Intelligence Scale for Children - Fourth Edition (WISC-IV)
- Process score: an index designed to help understand how testtakers process various kinds of information
- Often compared with the SB5
The Wechsler Preschool and Primary Scale of Intelligence - Third Edition (WPPSI-III)
- A scale for children under 6
- First major intelligence test that adequately sampled the total population of the United States
- Subtests labeled core, supplemental, or optional
Wechsler, Binet, and the Short Form
- Short form: a test that has been abbreviated in length to reduce the time needed to administer, score, and interpret it
- Should be used with caution, and only for screening
- Provides only estimates
- Reducing the number of items usually reduces reliability and thus validity
- Ex.) Wechsler Abbreviated Scale of Intelligence
The Wechsler Test in Perspective
- Factor analysis
  o Exploratory factor analysis: summarizing data when we are not sure how many factors are present in our data
  o Confirmatory factor analysis: used to test a highly specific factor hypothesis
Other Measures of Intelligence
Tests Designed for Individual Administration
- Kaufman Adolescent and Adult Intelligence Test
- Kaufman Brief Intelligence Test
- Kaufman Assessment Battery for Children
  o Moved away from information processing and toward a distinction between sequential and simultaneous processing
Tests Designed for Group Administration
- Group Testing in the Military
  o WWI created the need for the government to test intelligence as a means of differentiating recruits who were "unfit" from those of "exceptionally superior ability"
  o Army Alpha Test: given to army recruits who could read; included general information questions, analogies, and scrambled sentences to reassemble
  o Army Beta Test: given to foreign-born or illiterate recruits; included mazes, coding, and picture completion
  o After the war, the Alpha and Beta tests were used rampantly, and oftentimes misused
  o Screening tool: an instrument or procedure used to identify a particular trait or constellation of traits
  o ASVAB (Armed Services Vocational Aptitude Battery): administered to prospective recruits and to high school students looking for career guidance
    - 5 career areas: clerical, electronics, mechanical, skill-technical, and combat operations
- Group Testing in Schools
  o Useful in developing a child's profile, but cannot be the sole indicator
  o Administered to groups of 10-15, starting in kindergarten
  o Also called traditional group testing, because more modern forms can utilize computers; those are more aptly called individual testing
Measures of Specific Intellectual Abilities
- Widely used intelligence tests sample only some of the many factors contributing to intelligence
- Ex.) Creativity
  o Commonly thought to be composed of originality, fluency, flexibility, and elaboration
  o If the focus is too heavily on whether an answer is correct, there is no allowance for creativity
  o Achievement tests require convergent thinking: a deductive reasoning process that entails recall and consideration of facts as well as a series of logical judgments to narrow down solutions and eventually arrive at one solution
  o Divergent thinking: a reasoning process in which thought is free to move in many different directions, making several solutions possible
    - Tasks: word associations, uses of a rubber band, etc.
    - Test-retest reliability for some of these tests is near unacceptable
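The two IQ formulas quoted above translate directly into code. This is a minimal sketch: the example ages and scores are invented, and the SD of 16 follows the early Stanford-Binet convention noted above (the Wechsler tests use 15).

```python
def ratio_iq(mental_age: float, chronological_age: float) -> float:
    """Early Stanford-Binet ratio IQ: (MA / CA) x 100."""
    return mental_age / chronological_age * 100

def deviation_iq(raw: float, age_mean: float, age_sd: float,
                 mean: float = 100, sd: float = 16) -> float:
    """Deviation IQ: locate a raw score within the testtaker's own age
    group (as a z-score), then rescale to the test's mean and SD."""
    z = (raw - age_mean) / age_sd
    return mean + sd * z

# Hypothetical testtaker: mental age 10 at chronological age 8
print(ratio_iq(mental_age=10, chronological_age=8))   # 125.0
# Hypothetical raw score of 34 in an age group with mean 28, SD 4
print(deviation_iq(raw=34, age_mean=28, age_sd=4))    # 124.0
```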
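The basal/ceiling procedure described for the SB5 can also be sketched as a loop over items ordered by difficulty. The pass pattern and the two-in-a-row/four-in-a-row rules below are invented for illustration; the actual discontinue rules vary by subtest and are given in the test manual.

```python
def administer(items_passed: list, basal_run: int = 2, ceiling_fails: int = 4):
    """Walk items ordered by difficulty: the basal level is met after a run
    of consecutive passes; the ceiling level is met after a run of
    consecutive failures, at which point testing is discontinued."""
    consecutive_pass = consecutive_fail = 0
    basal_met = False
    for i, passed in enumerate(items_passed):
        if passed:
            consecutive_pass += 1
            consecutive_fail = 0
            if consecutive_pass >= basal_run:
                basal_met = True
        else:
            consecutive_fail += 1
            consecutive_pass = 0
            if consecutive_fail >= ceiling_fails:
                return basal_met, i  # ceiling reached; discontinue here
    return basal_met, len(items_passed) - 1

# True/False = pass/fail on successively harder items (hypothetical run)
print(administer([True, True, True, False, True, False, False, False, False]))
```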
CHAPTER 11: OTHER INDIVIDUAL TESTS OF ABILITY IN EDUCATION AND SPECIAL EDUCATION
Alternative Individual Ability Tests Compared with the Binet and Wechsler Scales
- None of these alternatives is clearly superior from a psychometric standpoint
- Some are less stable, and most are more limited in their documented validity
- They compare poorly to the Binet and Wechsler scales on all accounts
- They don't rely on a verbal response as much as the Binet and Wechsler do: many use only pointing or Yes/No responses, and thus do not depend on the complex integration of visual and motor functioning
- They contain a performance scale or subscale
- Their specificity often limits the range of functions or abilities that they can measure
- Because they are designed for special populations, some alternatives can be administered totally without verbal instructions
Specific Individual Ability Tests
- The earliest individual tests were typically designed for specific purposes or populations
- One of the first, the Seguin Form Board Test (1800s), produced only a single score
  o Used primarily to evaluate mentally retarded adults; emphasized speed and performance
- Later, the Healy-Fernald Test was developed as an exclusively nonverbal test for adolescent delinquents
- Knox developed a battery of performance tests for non-English-speaking adult immigrants to the US; it was administered without language, and speed was not emphasized
- These early individual tests were designed for specific populations, produced a single score, and had nonverbal performance scales
- They could be administered without verbal instructions and used with children as well as adults
Infant Scales
- Where mental retardation or developmental delays are suspected, these tests can supplement observation, genetic testing, and other medical procedures
Brazelton Neonatal Assessment Scale (BNAS)
- Individual test for infants between 3 days and 4 weeks of age
- Purportedly provides an index of a newborn's competence
- Favorable reviews and a considerable research base
- Wide use as a research tool and as a diagnostic tool for special purposes
- A commonly used scale for the assessment of neonates
- Drawbacks:
  o No norms are available
  o More research is needed concerning the meaning and implication of scores
  o Poorly documented predictive and construct validity
  o Test-retest reliability leaves much to be desired
Gesell Developmental Schedules (GDS)
- An infant intelligence measure
- Used as a research tool by those interested in assessing infant intellectual development after exposure to mercury, in diagnosing abnormal brain formation in utero, and in assessing infants with autism
- For children 2.3 months to 6.3 years
- Obtains normative data concerning various stages in maturation
- An individual's developmental quotient (DQ) is determined according to a test score, which is evaluated by assessing the presence or absence of behavior associated with maturation
- Provides an intelligence quotient like that of the Binet:
  o (developmental quotient / chronological age) x 100
- But it falls short of acceptable psychometric standards
- Standardization sample is not representative of the population
- No reliability or validity documentation
- Does appear to help uncover subtle deficits in infants
Bayley Scales of Infant and Toddler Development - Third Edition (BSID-III)
- Bases assessments on normative maturational developmental data
- Designed for infants between 1 and 42 months
- Assesses development across 5 domains: cognitive, language, motor, socioemotional, and adaptive
- Motor scale: assumes that later mental functions depend on motor development
- Excellent standardization and generally positive reviews
- Strong internal consistency, though more validity studies are needed
- Widely used in research on children with Down syndrome, pervasive developmental disorders, cerebral palsy, language impairment, etc.
- The most psychometrically sound test of its kind
- Its predictive validity, however, remains an open question
Cattell Infant Intelligence Scale (CIIS)
- Based on normative developmental data
- A downward extension of the Stanford-Binet scale for 2- to 30-month-olds
- Similar to the Gesell scale; rarely used today
- Sample is primarily based on children of parents from the lower and middle classes and therefore does not represent the general population
- Unchanged for 60 years
- Psychometrically unsatisfactory
Major Tests for Young Children
McCarthy Scales of Children's Abilities (MSCA)
- Measure ability in children between 2 and 8 years
- Present a carefully constructed individual test of human ability
- Meager validity evidence
- Produce a pattern of scores as well as a variety of composite scores
- General cognitive index (GCI): a standard score with a mean of 100 and a standard deviation of 16
  o The index reflects how well the child has integrated prior learning experiences and adapted them to the demands of the scales
- Relatively good psychometric properties; reliability coefficients in the low .90s
- Used in research studies; a good assessment tool, though its validity is debatable
Kaufman Assessment Battery for Children - Second Edition (KABC-II)
- Individual ability test for children between 3 and 18 years
- 18 subtests in 5 global scales: sequential processing, simultaneous processing, learning, planning, and knowledge
- Intended for psychological, clinical, minority-group, preschool, and neuropsychological assessment as well as research
- The sequential-simultaneous distinction:
  o Sequential processing refers to a child's ability to solve problems by mentally arranging input in sequential or serial order
  o Simultaneous processing refers to a child's ability to synthesize information from mental wholes in order to solve a problem
- Includes a nonverbal measure of ability as well
- Well constructed and psychometrically sound
- Not much evidence of (good) validity
- Poorer predictive validity for school achievement, but smaller differences between whites and minorities
- The test suffers from a noncorrespondence between its definition and its measurement of intelligence
General Individual Ability Tests for Handicapped and Special Populations
Columbia Mental Maturity Scale - Third Edition (CMMS)
- Purports to evaluate ability in normal and variously handicapped children from 3 to 12 years
- Requires neither a verbal response nor fine motor skills
- Requires the subject to discriminate similarities and differences by indicating which drawing does not belong on a 6-by-9-inch card containing 3 to 5 drawings
- Multiple choice
- Standardization sample is impressive
- Vulnerable to random error
- A reliable instrument that is useful in assessing ability in many people with sensory, physical, or language handicaps
- A good screening device
Peabody Picture Vocabulary Test - Fourth Edition (PPVT-IV)
- For ages 2 to 90 years
- A multiple-choice test that requires the subject to indicate Yes/No in some manner
- Instructions are administered aloud (so it is not suitable for the deaf)
- Purports to measure hearing or receptive vocabulary, presumably providing a nonverbal estimate of verbal intelligence
- Can be done in 15 minutes and requires no reading ability
- Good reliability and validity
- Should never be used as a substitute for a Wechsler or Binet IQ
- Works as an important component in a test battery or as a screening device
- Easy to administer and useful for a variety of groups
- BUT: it tends to underestimate IQ scores, and the problems inherent in the multiple-choice format remain
Leiter International Performance Scale - Revised (LIPS-R)
- Strictly a performance scale
- Aims at providing a nonverbal alternative to the Stanford-Binet scale for 2- to 18-year-olds
- For research and clinical settings, where it is still widely utilized to assess the intellectual function of children with pervasive developmental disorders
- Purports to provide a nonverbal measure of general intelligence by sampling a wide variety of functions, from memory to nonverbal reasoning
- Can be used with the deaf and the language-disabled
- Untimed
- Good validity
Porteus Maze Test (PMT)
- A popular but poorly standardized nonverbal performance measure of intelligence
- Individual ability test
- Consists of 12 maze problems
- Administered without verbal instruction, and thus used for a variety of special populations
- Needs restandardization
Testing Learning Disabilities
- The major concept is that a child average in intelligence may fail in school because of a specific deficit or disability that prevents learning
- Federal law entitles every eligible child with a disability to a free appropriate public education, and it emphasizes special education and related services designed to meet the child's unique needs and prepare him or her for further education, employment, and independent living
- To qualify, a child must have a disability and educational performance affected by it
- Educators today can find other ways to determine when a child needs extra help
- A process called Response to Intervention (RTI): the premise is that early intervening services can prevent academic failure for many students with learning difficulties
- Signs of a learning problem:
  o Disorganization
  o Careless effort
  o Forgetfulness
  o Refusal to do schoolwork or homework
  o Slow performance
  o Poor attention
  o Moodiness
Illinois Test of Psycholinguistic Abilities (ITPA-3)
- Assumes that failure to respond correctly to a stimulus can result not only from a defective output system but also from a defective input or information-processing system
  o Stage 1: information must first be received by the senses before it can be analyzed
  o Stage 2: information is analyzed or processed
  o Stage 3: with the processed information, the individual must make a response
- Theorizes that the child may be impaired in one or more specific sensory modalities
- 12 subtests that measure an individual's ability to receive visual, auditory, or tactile input independently of processing and output factors
- Purports to help isolate the specific site of a learning disability
- For children 2 to 10 years
- Early versions were hard to administer and had no reliability or validity documentation; with revisions, the ITPA-3 is a psychometrically sound measure of children's psycholinguistic abilities
Woodcock-Johnson III
- Evaluates learning disabilities
- Designed as a broad-range, individually administered test to be used in educational settings
- Assesses general intellectual ability, specific cognitive abilities, scholastic aptitude, oral language, and achievement
- Based on the CHC three-stratum theory of intelligence
- Compares a child's score on cognitive ability with his or her score on achievement, which can reveal possible learning problems
- Relatively good psychometric properties
- For learning disability tests, three conclusions seem warranted:
  o 1. Test constructors appear to be responding to the same criticisms that led to changes in the Binet and Wechsler scales and ultimately to the development of the KABC
  o 2. Much more empirical and theoretical research is needed
  o 3. Users of learning disability tests should take great pains to understand the weaknesses of these procedures and not overinterpret results
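The ability-achievement comparison described for the WJ III is, at its core, a discrepancy check between two standard scores. This is a minimal sketch only: the 1.5-SD rule of thumb is an invented threshold, since actual eligibility criteria vary by jurisdiction and by test.

```python
def discrepancy_flag(ability_ss: float, achievement_ss: float,
                     sd: float = 15, threshold_sd: float = 1.5) -> bool:
    """Flag when achievement falls well below measured ability, with both
    expressed as standard scores (mean 100). Threshold is hypothetical."""
    return (ability_ss - achievement_ss) >= threshold_sd * sd

print(discrepancy_flag(ability_ss=110, achievement_ss=85))  # True  (25 >= 22.5)
print(discrepancy_flag(ability_ss=100, achievement_ss=95))  # False (5 < 22.5)
```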
Visiographic Tests
- Require a subject to copy various designs
Benton Visual Retention Test - Fifth Edition (BVRT-V)
- Tests for brain damage are based on the concept of psychological deficit, in which a poor performance on a specific task is related to or caused by some underlying deficit
- Assumes that brain damage easily impairs visual memory ability
- For individuals 8 years and older
- Consists of geometric designs briefly presented and then removed
- A computerized version has been developed
Bender Visual Motor Gestalt Test (BVMGT)
- Consists of 9 geometric figures that the subject is simply asked to copy
- By age 9, any child of normal intelligence can copy the figures with only one or two errors
- Errors occur for people whose mental age is less than 9 and for those with brain damage, nonverbal learning disabilities, or emotional problems
- Questionable reliability
Memory-for-Designs (MFD) Test
- A drawing test that involves perceptual-motor coordination
- Used for people 8 to 60 years
- Good split-half reliability
- Needs more validity documentation
- All of these visiographic tests are criticized for their limitations in reliability and validity documentation, but they serve well as screening devices
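Since split-half reliability comes up repeatedly in these evaluations, here is a minimal sketch of how it is computed: correlate odd- and even-item half scores, then apply the Spearman-Brown correction to estimate full-length reliability. The item data below are invented.

```python
from statistics import correlation  # requires Python 3.10+

# Hypothetical item scores (rows = testtakers, columns = items)
data = [
    [1, 0, 1, 1, 0, 1],
    [1, 1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0, 0],
    [1, 0, 0, 1, 0, 1],
    [0, 1, 0, 0, 1, 0],
]

odd  = [sum(row[0::2]) for row in data]   # items 1, 3, 5
even = [sum(row[1::2]) for row in data]   # items 2, 4, 6

r_half = correlation(odd, even)
r_full = 2 * r_half / (1 + r_half)        # Spearman-Brown correction
print(f"half-test r = {r_half:.2f}, corrected full-test r = {r_full:.2f}")
```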
Creativity: Torrance Tests of Creative Thinking (TTCT)
- The measurement of creativity is underdeveloped in psychological testing
- Creativity: the ability to be original, to combine known facts in new ways, or to find new relationships between known facts
- Evaluating creativity is a possible alternative to IQ
- Creativity tests are in the early stages of development
- The Torrance tests separately measure aspects of creative thinking such as fluency, originality, and flexibility
- Do not meet the Binet and Wechsler scales in terms of standardization, reliability, or validity
- An unbiased indicator of giftedness
- Inconsistent tests, but the available data reflect the tests' merit and fine potential
Individual Achievement Tests: Wide Range Achievement Test-4 (WRAT-4)
- Achievement tests measure what the person has actually acquired or done with his or her potential
- Discrepancies between IQ and achievement have traditionally been the main defining feature of a learning disability
- Most achievement tests are group tests
- The WRAT-4 purportedly permits an estimate of grade-level functioning in word reading, spelling, math computation, and sentence comprehension
- Used for children 5 years and older
- Easy to administer
- Problems:
  o Inaccuracy in evaluating grade-level reading ability
  o Not proven to be psychometrically sound
CHAPTER 12: STANDARDIZED TESTS IN EDUCATION, CIVIL SERVICE, AND THE MILITARY
- When justifying the use of group standardized tests, test users often have problems defining what exactly they are trying to predict, i.e., what the test criterion is
Comparison of Group and Individual Ability Tests
- Individual tests require a single examiner for a single subject
  o The examiner provides instructions
  o The subject responds, and the examiner records the responses
  o The examiner evaluates the responses
  o The examiner takes responsibility for eliciting a maximum performance
  o Scoring requires considerable skill
- Group tests: many subjects are tested at a time
  o Subjects record their own responses
  o Subjects are not praised for responding
  o Low scores on group tests are often difficult to interpret
  o No safeguards
- Those who use the results of group tests must assume that the subject was cooperative and motivated
Advantages of Individual Tests
- Provide information beyond the test score
- Allow the examiner to observe behavior in a standard setting
- Allow individualized interpretation of test scores
Advantages of Group Tests
- Are cost-efficient
- Minimize professional time for administration and scoring
- Require less examiner skill and training
- Have more objective and more reliable scoring procedures
- Have especially broad application
Overview of Group Tests
Characteristics of Group Tests
- Characterized as paper-and-pencil or booklet-and-pencil tests because the only materials needed are a printed booklet of test items, a test manual, a scoring key, an answer sheet, and a pencil
- Computerized group testing is becoming more popular
- Most group tests are multiple choice, though some are free response
- Group tests outnumber individual tests
  o One major difference among them is whether a test is primarily verbal, primarily nonverbal, or a combination
- Group test scores can be converted to a variety of units
Selecting Group Tests
- The test user need never settle for anything but well-documented and psychometrically sound tests
Using Group Tests
- The best group tests are as reliable and well standardized as the best individual tests
  o However, validity data for some group tests are weak, meager, or contradictory
Use Results with Caution
- Never consider scores in isolation or as absolutes
- Be careful using tests for prediction
- Avoid overinterpreting test scores
Be Especially Suspicious of Low Scores
- Test users must assume that subjects understand the purpose of testing, want to succeed, and are equally rested and free of stress
Consider Wide Discrepancies a Warning Signal
- They may reflect emotional problems or severe stress
When in Doubt, Refer
- With low scores, discrepancies, etc., refer the subject to a trained professional for individual testing
Group Tests in the Schools: Kindergarten Through 12th Grade
- The purpose of these tests is to measure educational achievement in schoolchildren
Achievement Tests versus Aptitude Tests
- Achievement tests attempt to assess what a person has learned following a specific course of instruction
  o Evaluate the product of a course of training
  o Validity is determined primarily by content-related evidence
- Aptitude tests attempt to evaluate a student's potential for learning rather than how much a student has already learned
  o Evaluate the effects of unknown and uncontrolled experiences
  o Validity is judged primarily on the test's ability to predict future performance
- Intelligence tests measure general ability
- These three kinds of tests are highly interrelated
Group Achievement Tests
- The Stanford Achievement Test is one of the oldest of the standardized achievement tests widely used in school systems
  o Well normed and criterion-referenced, with psychometric documentation
- Another is the Metropolitan Achievement Test, which measures achievement in reading by evaluating vocabulary, word recognition, and reading comprehension
- Both of these are reliable and normed on large samples
Group Tests of Mental Abilities (Intelligence)
Kuhlmann-Anderson Test (KAT) - 8th Edition
- The KAT is a group intelligence test with 8 separate levels covering kindergarten through 12th grade
- Items are primarily nonverbal at the lower levels, requiring minimal reading and language ability
- Suited to young children and those who might be handicapped in following verbal procedures
- Scores can be expressed as verbal, quantitative, and total scores
- Scores at other levels can be expressed as percentile bands: like a confidence interval, a band provides the range of percentiles that most likely represents a subject's true score
- Good construction, standardization, and other excellent psychometric qualities
- Good validity and reliability
- Its potential for use with, and adaptation for, non-English-speaking individuals or even other countries needs to be explored
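The percentile-band idea above can be sketched as follows: convert the obtained score plus or minus one standard error of measurement into a range of percentile ranks. The sketch assumes normally distributed scores, and the mean, SD, and SEM values are invented.

```python
from statistics import NormalDist

def percentile_band(score: float, mean: float, sd: float, sem: float):
    """Convert score +/- 1 SEM into a band of percentile ranks,
    assuming normally distributed scores."""
    dist = NormalDist(mean, sd)
    lo, hi = dist.cdf(score - sem), dist.cdf(score + sem)
    return round(100 * lo), round(100 * hi)

# Hypothetical test: mean 100, SD 16, SEM 4
print(percentile_band(score=108, mean=100, sd=16, sem=4))  # roughly (60, 77)
```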
Henmon-Nelson Test (H-NT)
- A test of mental abilities
- 2 sets of norms are available: one based on raw score distributions by age, the other on raw score distributions by grade
- Reliabilities in the .90s
- Helps predict future academic success quickly
- Does NOT consider multiple intelligences
Cognitive Abilities Test (COGAT)
- Good reliability
- Provides three separate scores: verbal, quantitative, and nonverbal
- Item selection is superior to the H-NT's in terms of representing minority, culturally diverse, and economically disadvantaged children
- Can be adapted for use outside the US
- Designed to avoid cultural bias, yet standard age scores averaged some 15 points lower for African American students on the verbal and quantitative batteries
- Each of the subtests requires 32-34 minutes of actual working time, which the manual recommends spreading out over 2-3 days
Summary of K-12 Group Tests
- All are sound, viable instruments
College Entrance Tests
- Major examples: the SAT Reasoning Test, the Cooperative School and College Ability Tests, and the American College Test
SAT Reasoning Test
- The most widely used college entrance test; used by more than 1000 private and public institutions
- Renorming of the SAT did not alter the standing of test takers relative to one another in terms of percentile rank
- The new scoring scale (out of 2400) is likely to reduce interpretation errors, as interpreters can no longer rely on comparisons with older versions
- 45 minutes longer than before, at 3 hours and 45 minutes to administer, which may disadvantage students with disabilities such as ADD
- The verbal section is now called "critical reading," with a focus on reading comprehension
- The math section eliminated many of the basic grammar-school math questions
- Weakness: poor predictive power regarding the grades of students who score in the middle ranges
- There is little doubt that the SAT predicts first-year college GPA
  o But African Americans and Latinos tend to obtain lower scores on average
  o Women score lower on the SAT but higher in GPA
Cooperative School and College Ability Tests
- Falling out of favor
- Developed in 1955 and not updated since
- Purports to measure school-learned abilities as well as an individual's potential to undertake additional schooling
- Psychometric documentation is not strong
- Little empirical data support its major assumption: that previous success in acquiring school-learned abilities can predict future success in acquiring such abilities
American College Test (ACT)
- Updated in 2005; particularly useful for non-native speakers of English
- Produces specific content scores and a composite
- Makes use of the Iowa Test of Educational Development scale
- Compares with the SAT in terms of predicting college GPA, alone or in conjunction with high-school GPA
- Internal consistency coefficients are not as strong as the SAT's
Graduate and Professional School Entrance Tests
Graduate Record Examination Aptitude Test (GRE)
- Purports to measure general scholastic ability
- Most frequently used in conjunction with GPA, letters of recommendation, and other academic factors
- The general section yields verbal and quantitative scores; a third section evaluates analytical reasoning, now in essay format
- Contains an advanced section that measures achievement in at least 20 majors
- Scoring: originally standardized to a mean of 500 and an SD of 100; the revised test uses a new 130-170 scoring scale
- The normative sample is relatively small
- Psychometric adequacy is less than that of the SAT, in terms of both validity and reliability
- Predictive validity is not great: the GRE overpredicts the achievement of younger students while underpredicting the performance of older students
- Many schools have developed their own norms and psychometric documentation and can use the GRE to predict success in their programs
- By looking at a GRE score in conjunction with GPA, graduate success can be predicted with greater accuracy than without the GRE
- Graduate schools also frequently complain that grades no longer predict scholastic ability well because of grade inflation: the phenomenon of rising average college grades despite declines in average SAT scores
  o Grade inflation has led to a corresponding restriction in the range of grades
- As the validity of grades and letters of recommendation becomes more questionable, reliance on test scores increases
- There has been a definite overall decline in verbal scores, while quantitative and analytical scores are gradually rising
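A minimal least-squares sketch of the claim above that GRE plus GPA predicts graduate success better than GPA alone. All numbers are invented; a real local-norming study of the kind the notes describe would use institutional data and far larger samples.

```python
import numpy as np

# Hypothetical records: undergrad GPA, GRE score, and first-year grad GPA
gpa  = np.array([3.2, 3.8, 3.5, 3.0, 3.9, 3.4])
gre  = np.array([310, 325, 315, 305, 330, 320])
grad = np.array([3.1, 3.7, 3.4, 3.0, 3.8, 3.5])

# Fit grad GPA on an intercept plus both predictors via least squares
X = np.column_stack([np.ones_like(gpa), gpa, gre])
coef, *_ = np.linalg.lstsq(X, grad, rcond=None)
pred = X @ coef

r2 = 1 - ((grad - pred) ** 2).sum() / ((grad - grad.mean()) ** 2).sum()
print("coefficients:", coef.round(3))
print("R^2 with GRE + GPA:", round(r2, 3))
```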
Miller Analogies Test (MAT)
- Designed to measure scholastic aptitude for graduate studies
- Strictly verbal; 60 minutes
- Knowledge of specific content and a wide vocabulary are very useful
- The most important factors appear to be the ability to see relationships and a knowledge of the various ways analogies can be formed
- Psychometric adequacy is reasonable
- Does not predict research ability, creativity, and other factors important to graduate school
The Law School Admission Test (LSAT)
- LSAT problems require almost no specific knowledge
- Extreme time pressure
- Three types of problems: reading comprehension, logical reasoning (about half the test), and analytical reasoning
- The weight given to the LSAT score is openly published for each school approved by the American Bar Association
- Entrance into schools is based on a weighted sum of LSAT score and GPA
- Psychometrically sound, with reliability coefficients in the .90s
- Predicts first-year GPA in law school
- Content validity is exceptional
- Possible bias against minority group members, as well as women
Nonverbal Group Ability Tests
Raven Progressive Matrices (RPM)
- The RPM is one of the best-known and most popular nonverbal group tests
- Suitable anytime one needs an estimate of an individual's general intelligence
- Given to groups or individuals, from 5-year-olds to adults
- Used throughout the modern world
- Uses matrices; nonverbal, administered with or without a time limit
- Research supports the RPM as a measure of general intelligence, or Spearman's g
- Appears to minimize the effects of language and culture
- Tends to cut in half the selection bias that occurs with the Binet or Wechsler
Goodenough-Harris Drawing Test (G-HDT)
- A nonverbal intelligence test, administered to groups or individuals
- Quick, easy, and inexpensive
- The subject is instructed to draw a picture of a whole man and to do the best job possible; details earn points
- One can determine mental ages by comparing scores with those of the normative sample
- Raw scores can be converted to standard scores with a mean of 100 and an SD of 15
- Used extensively in test batteries
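The raw-to-standard-score conversion just mentioned is an ordinary linear (z-score) rescaling. A minimal sketch, with invented normative values:

```python
def to_standard(raw: float, norm_mean: float, norm_sd: float,
                new_mean: float = 100, new_sd: float = 15) -> float:
    """Rescale a raw score to a standard score via its z-score
    in the normative sample."""
    z = (raw - norm_mean) / norm_sd
    return new_mean + new_sd * z

# Hypothetical norms: raw drawing scores with mean 30, SD 6 for this age group
print(to_standard(raw=39, norm_mean=30, norm_sd=6))  # 122.5
```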
The Culture Fair Intelligence Test
- Designed to provide an estimate of intelligence relatively free of cultural and language influences
- A paper-and-pencil procedure that covers three age groups
- Two parallel forms are available
- An acceptable measure of fluid intelligence
Standardized Tests Used in the US Civil Service System
- General Aptitude Test Battery (GATB): a reading ability test that purportedly measures aptitude for a variety of occupations
  o Used to make employment decisions in government agencies
  o Attempts to measure a wide range of aptitudes, from general intelligence to manual dexterity
- Controversial because it used within-group norming prior to the passage of the Civil Rights Act of 1991
- Today, any kind of score adjustment through within-group norming in employment practices is strictly forbidden by law
Standardized Tests in the US Military: The Armed Services Vocational Aptitude Battery (ASVAB)
- The ASVAB is administered to more than 1.3 million people a year
- Designed for students in grades 11 and 12 and in postsecondary schools
- Yields scores used in both educational and military settings
- Results can help identify students who potentially qualify for entry into the military and can recommend assignment to various military occupational training programs
- Great psychometric qualities; reliability coefficients are excellent
- Through the computerized format, subjects can be tested adaptively, meaning that the questions given each person can be based on his or her unique ability
  o This cuts testing time in half
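A minimal sketch of the adaptive logic described above: after each response the ability estimate moves up or down, and the next question is the unused item whose difficulty is closest to the current estimate. The item difficulties, responses, and halving step-size rule are invented for illustration; operational CATs such as the computerized ASVAB use IRT-based scoring instead.

```python
def next_item(theta: float, difficulties: list, used: set) -> int:
    """Pick the unused item whose difficulty is closest to theta."""
    candidates = [i for i in range(len(difficulties)) if i not in used]
    return min(candidates, key=lambda i: abs(difficulties[i] - theta))

difficulties = [-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0]  # hypothetical item bank
theta, used, step = 0.0, set(), 1.0
for response in (True, True, False):      # hypothetical right/right/wrong
    item = next_item(theta, difficulties, used)
    used.add(item)
    theta += step if response else -step  # crude update, not real IRT scoring
    step /= 2                             # shrink steps as the estimate settles
    print(f"item {item} (b={difficulties[item]}), new theta = {theta:+.2f}")
```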