
Psychological Assessment Reviewer

PSYCH ASSESSMENT
TOPIC 1: OVERVIEW OF PSYCHOLOGICAL TESTING AND
ASSESSMENT
=PSYCHOLOGICAL TESTS=
• Are objective and standardized measures of a sample of
human behavior (Anastasi & Urbina, 1997).
• These are instruments with three defining characteristics:
o It is a sample of human behavior.
o The sample is obtained under standardized conditions.
o There are established rules for scoring or for obtaining
quantitative information from the behavior sample.
*PSYCHOLOGICAL MEASUREMENT- The process of
assigning numbers (i.e., test scores) to persons in such a
way that the numbers reflect some attribute of the person.
=GENERAL TYPES OF PSYCHOLOGICAL TESTS=
These tests are categorized according to the manner of
administration, purpose, and nature.
*Administration- Individual; Group
*Item Format- Objective; Projective
*Response Format- Verbal; Performance
*Domain Measured- Cognitive; Affective
=TYPES OF TESTS=
*STANDARDIZED TESTS- Are those instruments that have
prescribed directions for administration, scoring, and
interpretation. • Examples: MBTI, MMPI, SB-5, WAIS
*GROUP TESTS- Are those that can be administered to a
group, usually via a paper-and-pencil method; some can
also be administered individually. • Examples: Achievement
Tests, RPM, MBTI
*SPEED TESTS- Are administered under a prescribed time
limit, usually a short period that is not enough for an
individual to finish answering the entire test. • The level
of difficulty is the same for all items. • Example: the SRA
Verbal Test
*POWER TESTS- Are those that measure competencies and
abilities. • The time limit prescribed is usually enough for
one to accomplish the entire test. • Example is the
Differential Aptitude Test
*VERBAL TESTS- Are instruments that involve words to
measure a particular domain. • Examples: admission tests
for many educational institutions.
*NONVERBAL TESTS- Are instruments that do not use
words; rather, they use geometric drawings or patterns. •
Example: RPM.
*COGNITIVE TESTS- Are those that measure thinking skills.
• Examples are the broad range of intelligence and
achievement tests.
*AFFECTIVE TESTS- Are those that measure personality,
interests, values, etc. • Examples: Life Satisfaction Scale,
16PF, MBTI
=TESTING OF HUMAN ABILITY=
*NON-STANDARDIZED TESTS (Informal Tests)- Are
exemplified by teacher-made tests either for formative or
summative evaluation of student performance. • Examples:
Prelim Exam, Quizzes
*NORM-REFERENCED TESTS- Instruments whose score
interpretation is based on the performance of a particular
group. • For example, Ravens Progressive Matrices (RPM)
which has several norm groups which serve as a
comparison group for the interpretation of scores.
*CRITERION-REFERENCED TESTS- These are measures
whose criteria for passing or failing have been decided
beforehand. • For example, a passing score of 75%.
*INDIVIDUAL TESTS- Are instruments that are administered
one-on-one, face-to-face. • Examples: WAIS, SB-5, Bender-Gestalt
*TESTS FOR SPECIAL POPULATIONS- Developed specifically
for use with persons who cannot be properly or adequately
examined with traditional instruments, such as the
individual scales.
• Usually employ performance or nonverbal tasks.
=PERSONALITY TESTS=
These are instruments that are used for the measurement
of emotional, motivational, interpersonal, and attitudinal
characteristics.
*Approaches to the Development of Personality
Assessment
• Empirical Criterion Keying
• Factor Analysis
• Personality Theories
=GENERAL PROCESS OF ASSESSMENT=
=REASONS FOR USING TESTS=
*PROJECTIVE TECHNIQUES- Are relatively unstructured
tasks that permit an almost unlimited variety of possible
responses; a disguised assessment procedure.
Examples:
• Rorschach Inkblot Test
• Thematic Apperception Test
• Sentence Completion Test
• Drawing Test
*PSYCHOLOGICAL TESTING - The process of measuring
psychology-related variables by means of devices or
procedures designed to obtain a sample of behavior.
*PSYCHOLOGICAL ASSESSMENT- The gathering and
integration of psychology-related data for the purpose of
making a psychological evaluation that is accomplished
through the use of tools such as tests, interviews, case
studies, behavioral observation, and specially designed
apparatuses and measurement procedures.
=THE TOOLS OF PSYCHOLOGICAL ASSESSMENT=
*THE TEST- A test is defined simply as a measuring device
or procedure.
*PSYCHOLOGICAL TEST- Refers to a device or procedure
designed to measure variables related to psychology:
• Intelligence
• Personality
• Aptitude
• Interests
• Attitudes
• Values
= DIFFERENCES IN PSYCHOLOGICAL TESTS AND OTHER
TOOLS=
=DIFFERENT APPROACHES TO ASSESSMENT=
*COLLABORATIVE PSYCHOLOGICAL ASSESSMENT- The
assessor and the assessee work as partners from initial
contact through final feedback
*THERAPEUTIC PSYCHOLOGICAL ASSESSMENT-
Therapeutic self-discovery and new understandings are
encouraged throughout the assessment process
*DYNAMIC ASSESSMENT-The interactive approach to
psychological assessment that usually follows a model of:
evaluation> intervention> evaluation. Interactive, changing,
and varying nature of assessment
*THE INTERVIEW- A method of gathering information
through direct communication involving reciprocal
exchange.
Differences: purpose, length, and nature.
Uses: diagnosis, treatment, selection, decisions.
* THE PORTFOLIO- Contains samples of one's abilities and
accomplishments which can be used for evaluation.
*THE CASE HISTORY DATA- Are the records, transcripts,
and other accounts in written, pictorial, or other form that
preserve archival information, official and informal
accounts, and other data and items relevant to the assessee.
*Case Study or Case History- a report or illustrative
account concerning a person or an event that was compiled
on the basis of case history data.
*BEHAVIORAL OBSERVATION- Monitoring the actions of
others or oneself by visual or electronic means while
recording quantitative or qualitative information regarding
the actions.
Aids the development of therapeutic intervention which is
extremely useful in institutional settings such as schools,
hospitals, prisons, and group homes.
*THE ROLE-PLAY TESTS- Acting an improvised or partially
improvised part in a simulated situation. Assessees are
directed to act as if they were in a particular situation.
Evaluation: expressed thoughts, behaviors, abilities, and
other related variables.
Can be used as both a tool of assessment and a measure of
outcome.
*THE COMPUTERS AS TOOLS- Can serve as test
administrators (online or offline) and as highly efficient test
scorers.
*Interpretive Reports- distinguished by the inclusion of
numerical or narrative interpretive statements in the
report.
*Consultative Reports- written in language appropriate
for communication between assessment professionals and
may provide expert opinion concerning data analysis.
*Integrative Report- integrates previously collected data
into the test report.
=PARTICIPANTS IN THE TESTING PROCESS AND THEIR
ROLES=
*Test authors and developers- Conceive, prepare, and
develop tests. Also find a way to disseminate their tests.
*Test publishers- Publish, market, and sell tests, thus
controlling their distribution.
*Test reviewers- Prepare evaluative critiques of tests based
on technical and practical merits.
* Test users- Select or decide which specific test/s will be
used for some purposes. May also act as examiners or
scorers.
*Test sponsors- Institutional boards or agencies who
contract test developers or publishers for various testing
services.
*Test administrators or examiner- Administer the test
either to one individual at a time or to groups.
*Test takers- Take the test by choice or necessity.
*Test scorers- Tally the raw scores and transform them into
test scores through objective or mechanical scoring or through
the application of evaluative judgment.
*Test score interpreters- Interpret test results to
consumers such as; individual test takers or their relatives,
other professionals, or organizations of various kinds.
=SETTINGS WHERE ASSESSMENTS ARE CONDUCTED=
*EDUCATIONAL SETTINGS- Helps to identify children who
may have special needs. • Diagnostic tests and/or
achievement tests
*CLINICAL SETTINGS - For screening and or diagnosing
behavioral problems. • May be intelligence, personality,
neuropsychological tests, or other specialized instruments
depending on the presenting or suspected problem area.
*COUNSELING SETTINGS- Aims to improve the assessee's
adjustment, productivity, or some related variables. • May
be personality, interest, attitude, and values tests.
*GERIATRIC SETTINGS- Quality of life assessment which
measures variables related to perceived stress, loneliness,
sources of satisfaction, personal values, quality of living
conditions, and quality of friendships and social support.
*BUSINESS & MILITARY SETTINGS- Decision making about
the careers of the personnel.
*GOVERNMENTAL & ORGANIZATIONAL CREDENTIALING- Licensing or certifying exams.
*ACADEMIC RESEARCH SETTINGS- Sound knowledge of
measurement principles and assessment tools is required
prior to research publication.
=SOME TERMS TO REMEMBER=
*PROTOCOL- Typically refers to the form or sheet or
booklet on which a test taker's responses are entered. •
May also be used to refer to a description of a set of test-or
assessment-related procedures.
*RAPPORT- The working relationship between the
examiner and the examinees.
*ACCOMMODATION- The adaptation of a test, procedure,
or situation, or the substitution of one test for another, to
make the assessment more suitable for an assessee
with an exceptional need.
*ALTERNATE ASSESSMENT - An evaluative or diagnostic
procedure or process that varies from the usual, customary,
or standardized way a measurement is derived. •
Alternative methods designed to measure the same
variables.
= TEST USER QUALIFICATION LEVELS=
• ONLINE DATABASES- maintained by APA; PsycINFO,
ClinPSYC, PsycARTICLES, etc.
=A BRIEF HISTORY OF PSYCHOLOGICAL TESTING=
*Early 20th Century France- The roots of contemporary
psychological testing and assessment.
*1905- Alfred Binet and a colleague published a test to help
place Paris schoolchildren in classes.
*1917 World War I- The military needed a way to screen
large numbers of recruits quickly for intellectual and
emotional problems.
*World War II- The military depended even more on
psychological tests to screen recruits for the service.
*Post-war- More and more tests purporting to measure an
ever-widening array of psychological variables were
developed and used.
=PROMINENT FIGURES IN THE HISTORY OF
PSYCHOMETRICS=
=INDIVIDUAL DIFFERENCES=
In spite of our similarities, no two humans are exactly the
same.
*CHARLES DARWIN- Believed that some of the individual
differences are more adaptive than others.
• Individual differences, over time, lead to more complex,
intelligent organisms.
= SOURCES OF INFORMATION ABOUT TESTS=
• TEST CATALOGUES- usually contain only a brief
description of the test and seldom contain detailed
technical information.
• TEST MANUALS- detailed information concerning the
development of a particular test and technical information
relating to it.
• REFERENCE VOLUMES- periodically updated which
provides detailed information for each test listed; Mental
Measurements Yearbook.
• JOURNAL ARTICLES- contain reviews of the test, updated
or independent studies of its psychometric soundness, or
examples of how the instrument was used in either
research or applied contexts.
*FRANCIS GALTON- Cousin of Charles Darwin
• He was an applied Darwinist. He claimed that some
people possessed characteristics that made them more fit
than others.
• Wrote Hereditary Genius (1869).
• Set up an anthropometric laboratory at the International
Exposition of 1884.
• Noted that persons with mental retardation also tend to
have diminished ability to discriminate among heat, cold,
and pain.
*Charles Spearman- Had been trying to prove Galton’s
hypothesis concerning the link between intelligence and
visual acuity.
• Expanded the use of correlational methods pioneered by
Galton and Karl Pearson, and provided the conceptual
foundation for factor analysis, a technique for reducing a
large number of variables to a smaller set of factors that
would become central to the advancement of testing and
trait theory.
• Devised a theory of intelligence that emphasized a
general intelligence factor (g) present in all intellectual
activities.
*KARL PEARSON-Famous student of Galton.
• Continued Galton’s early work with statistical regression.
• Invented the formula for the coefficient of correlation;
Pearson’s r.
*JAMES MCKEEN CATTELL- The first person to use the
term mental test.
• Wrote a dissertation on reaction time based upon Galton's
work.
• Tried to link various measures of simple discriminative,
perceptive, and associative power to independent
estimates of intellectual level, such as school grades.
=EARLY EXPERIMENTAL PSYCHOLOGISTS=
• In the early 19th century, scientists were generally interested
in identifying common aspects rather than individual
differences.
• Differences between individuals were considered a
source of error, which rendered human measurement
inexact.
*JOHANN FRIEDRICH HERBART- Proposed mathematical
models of the mind.
• The founder of Pedagogy as an academic discipline.
*ERNST HEINRICH WEBER- Proposed the concepts of
sensory thresholds and Just Noticeable Differences (JND).
*GUSTAV THEODOR FECHNER- Worked on the mathematical
measurement of sensory thresholds of experience.
• Founder of Psychophysics, and one of the founders of
Experimental Psychology.
• The Weber-Fechner Law was the first to relate sensation
and stimulus. It states that the strength of a sensation
grows as the logarithm of the stimulus intensity.
• Was considered by some as the founder of
Psychometrics.
*GUY MONTROSE WHIPPLE- Was influenced by Fechner
and was a student of Titchener.
• Pioneered human ability testing.
• Conducted seminars that changed the field of
psychological testing (Carnegie Institute, 1918).
• Because of his criticisms, APA issued its first standards for
professional psychological testing.
• His seminars led to the construction of the Carnegie Interest
Inventory and later the Strong Vocational Interest Blank.
*LOUIS LEON THURSTONE- Was a large contributor to
factor analysis and attended Whipple’s seminars.
• His approach to measurement was called the Law of
Comparative Judgment.
=INTEREST IN MENTAL DEFICIENCY=
*JEAN ETIENNE ESQUIROL- A French physician and the
favorite student of Philippe Pinel, the founder of
Psychiatry.
• Was responsible for the manuscript on mental
retardation which differentiated between insanity and
mental retardation.
*EDOUARD SEGUIN- A French physician who pioneered
the training of mentally retarded persons.
• Rejected the notion of incurable mental retardation (MR).
• 1837: he opened the first school devoted to teaching
children with MR.
• 1866: he conducted experiments on the physiological
training of MR, involving the sense/muscle training that is
used to this day and that led to nonverbal tests of intelligence
(Seguin Form Board Test).
*EMIL KRAEPELIN- Devised a series of examinations for
evaluating emotionally impaired individuals.
=INTELLIGENCE TESTING=
*ALFRED BINET- Appointed by the French government to
develop a test that would help place Paris schoolchildren
who failed to respond to normal schooling in special classes.
• Devised the first intelligence test: the Binet-Simon scale
of 1905.
• The scale has standardized administration and used a
standardization sample.
*LEWIS MADISON TERMAN- Translated the Binet-Simon
Scales into English for use in the US; in 1916 the translation
was published as the Stanford-Binet Intelligence Scale.
• The SB scale became more psychometrically sound, and the
term IQ was introduced.
• IQ = (Mental Age / Chronological Age) × 100
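• For example, a child with a mental age of 12 and a
chronological age of 10 has IQ = (12 / 10) × 100 = 120.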
*ROBERT YERKES- President of the APA who was
commissioned by the US Army to develop structured tests
of human abilities.
• WWI created the need for large-scale, group-administered
ability tests for the army.
• Army Alpha – verbal; administered to literate soldiers.
• Army Beta – nonverbal; administered to illiterate soldiers.
*DAVID WECHSLER- Subscales on his tests were adopted
from the army scales.
• Produced several scores of intellectual ability rather than
Binet's single score.
• Evolved into the Wechsler series of intelligence tests (WAIS,
WISC, etc.)
*THEMATIC APPERCEPTION TEST (TAT)- Was developed in
1935 and composed of ambiguous pictures that were
considerably more structured than the Rorschach.
• Subjects are shown pictures and asked to tell a story
including:
o What has led up to the event shown;
o What is happening at the moment;
o What the characters are feeling and thinking; and
o What the outcome of the story was.
=PERSONALITY TESTING=
=PERSONALITY TESTING: Second Structured Test=
These tests were intended to measure personality traits.
*MINNESOTA MULTIPHASIC PERSONALITY INVENTORY
(MMPI, 1943)- Tests like the Woodworth made too many
assumptions.
• The meaning of the test response could only be
determined by empirical research.
• MMPI-2 and MMPI-A are most widely used.
*TRAITS- are relatively enduring dispositions (tendencies to
act, think, or feel in a certain manner in any given
circumstance).
*1920s- The rise of personality testing
*1930s- The fall of personality testing
*1940s- The slow rise of personality testing
=PERSONALITY TESTING: First Structured Test=
*WOODWORTH PERSONAL DATA SHEET- The first
objective personality test, meant to assist in psychiatric
interviews.
• Was developed during WWI.
• Designed to screen out soldiers unfit for duty.
• Mistakenly assumed that subjects' responses could be taken
at face value.
*RAYMOND B. CATTELL: The 16 PF- The test was based on
factor analysis – a method for finding the minimum number
of dimensions or factors for explaining the largest number
of variables.
• J. P. Guilford was the first to apply the factor analytic
approach to test construction.
=THE RISE OF MODERN PSYCHOLOGICAL TESTING=
*1900s- Everything necessary for the rise of the first truly
modern and successful psychological test was in place.
*1904- Alfred Binet was appointed to devise a method of
evaluating children who could not profit from regular
classes and would require special education.
*1905- Binet and Theodore Simon published the first useful
instrument to measure general cognitive abilities or global
intelligence.
*1908- Binet revised, expanded, and refined his first scale.
*1911- The birth of the IQ. William Stern (1911) proposed
the computation for IQ based on the Binet-Simon scale
(IQ = (Mental Age / Chronological Age) × 100).
=PERSONALITY TESTING: Slow Rise – Projective
Techniques=
*HERMAN RORSCHACH: The Inkblot Test- Pioneered
projective assessment using his inkblot test in 1921.
• Symmetric colored and black & white inkblots.
• Initially met with great suspicion; the first serious study was
made in 1932.
• Introduced to the US by David Levy.
*1916- Lewis Terman translated the Binet-Simon scales to
English and published it as the Stanford-Binet Intelligence
Scale.
*1917 World War I- Robert Yerkes, APA President,
developed group tests of intelligence for the US Army.
Pioneered the first group testing: Army Alpha and Army
Beta.
*1918- Arthur Otis devised multiple-choice items that
could be scored objectively and rapidly. Published the Group
Intelligence Scale, which served as a model for the Army Alpha.
*1919- E. L. Thorndike produced an intelligence test for high
school graduates.
=PROMINENT FIGURES IN THE MODERN PSYCHOLOGICAL
TESTING=
•Alfred Binet
•Theodore Simon
•Lewis Terman
•Robert Yerkes
•Arthur Otis
TOPIC 2: STATISTICS REVIEW
*MEASUREMENT- The act of assigning numbers or symbols
to characteristics of things (people, events, etc.) according
to rules.
*SCALE- A set of numbers (or other symbols) whose
properties model empirical properties of the objects to
which the numbers are assigned.
=CATEGORIES OF SCALES=
*DISCRETE- Values that are distinct and separate; they can
be counted. • For example, if subjects were to be
categorized as either female or male, the categorization
scale would be said to be discrete because it would not be
meaningful to categorize a subject as anything other than
female or male. • Examples: Gender, Types of House, Color
*CONTINUOUS- Exists when it is theoretically possible to
divide any of the values of the scale. • The values may take
on any value within a finite or infinite interval. • Examples:
Temperature, Height, Weight
*ERROR- Refers to the collective influence of all the
factors on a test score or measurement beyond those
specifically measured by the test or measurement.
• It is very much an element of all measurement, and it is
an element for which any theory of measurement must
surely account.
• Always present in measurement that follows a
continuous scale.
=SCALES OF MEASUREMENT=
*NOMINAL SCALES- Known as the simplest form of
measurement.
• Involve classification or categorization based on one or
more distinguishing characteristics, where all things
measured must be placed into mutually exclusive and
exhaustive categories.
• Examples: DSM-5 diagnoses, gender of patients, colors
*ORDINAL SCALES- Also permit classification; in
addition, rank ordering on some characteristic is
permissible.
• Imply nothing about how much greater one ranking is
than another; the numbers do not indicate units of
measurement.
• No absolute zero point.
• Examples: fastest reader, size of waistline, job positions
*INTERVAL SCALES- Permit both categorization and
ranking; in addition, they contain equal intervals between
numbers, so each unit on the scale is exactly equal to any
other unit on the scale.
• No absolute zero point; however, it is possible to average
a set of measurements and obtain a meaningful result.
• For example, the difference between IQs of 80 and 100 is
thought to be similar to that between IQs of 100 and 120. If an
individual achieved an IQ of 0, it would not be an indication of
zero intelligence or a total absence of it.
• Examples: temperature, time, IQ scales, psychological
scales
*RATIO SCALES- Contain all the properties of
nominal, ordinal, and interval scales, and have a true zero
point; negative values are not possible.
• A score of zero means the complete absence of the
attribute being measured.
• Examples: exam score, neurological exam (i.e. hand grip),
heart rate
*DESCRIPTIVE STATISTICS- Used to describe or summarize
a set of information that has been collected.
=DESCRIBING DATA=
*DISTRIBUTION- A set of test scores arrayed for recording
or study.
*RAW SCORE- Is a straightforward, unmodified accounting
of performance that is usually numerical. • May reflect a
simple tally, such as the number of items responded to
correctly on an achievement test.
*AVERAGE DEVIATION- Another tool that can be used to
describe the amount of variability in a distribution.
• Rarely used, perhaps because the deletion of algebraic signs
renders it a useless measure for purposes of any further
operations.
*FREQUENCY DISTRIBUTION- All scores are listed alongside
the number of times each score occurred. • Scores might
be listed in tabular or graphical form.
*STANDARD DEVIATION (SD)- A measure of variability that
is equal to the square root of the average squared
deviations about the mean. • The square root of the
variance. • A low SD indicates that the values are close to
the mean, while a high SD indicates that the values are
dispersed over a wider range.
*MEASURES OF CENTRAL TENDENCY- Statistics that
indicate the average or midmost score between the
extreme scores in a distribution.
*SKEWNESS- Refers to the absence of symmetry.
• An indication of how the measurements in a distribution
are distributed.
*MEAN- The most common measure of central
tendency. It takes into account the numerical value of
every score.
*MEDIAN- The middlemost score in the distribution.
Determined by arranging the scores in either ascending or
descending order.
*MODE- The most frequently occurring score in a
distribution of scores
=MEASURES OF VARIABILITY=
*Variability- is an indication of how scores in a distribution
are scattered or dispersed.
*KURTOSIS- The steepness of the distribution at its
center. • Describes how heavy or light the tails are.
• PLATYKURTIC- relatively flat, gently curved
• MESOKURTIC- moderately curved, somewhere in the
middle
• LEPTOKURTIC- relatively peaked
*RANGE- The simplest measure of variability. • It is the
difference between the highest and the lowest score.
*Interquartile Range- A measure of variability equal to
the difference between Q3 and Q1.
*Semi-Interquartile Range - Equal to the interquartile
range divided by two
*THE NORMAL CURVE- Is a bell-shaped, smooth,
mathematically defined curve that is highest at its center.
• It is perfectly symmetrical with no skewness.
• The majority of test takers cluster at the middle of the
distribution; very few test takers are at the extremes.
• Mean = Median = Mode
• Q1 and Q3 are equidistant from Q2 (the median).
• Approximately 95% of all scores occur within +/- 2 SD
of the mean.
*THE STANDARD SCORES- These are raw scores
that have been converted from one scale to another scale,
where the latter scale has some arbitrarily set mean and
standard deviation.
• They also provide a context for comparing scores on two
different tests by converting scores from the two tests into
z scores.
=TYPES OF STANDARD SCORES=
*z Scores- Known as the golden scores.
• Results from the conversion of a raw score into a number
indicating how many SD units the raw score is below or
above the mean of the distribution.
• Mean = 0; SD = 1
• Zero plus or minus one scale (0 +/- 1)
• Scores can be positive or negative.
*t Scores- Fifty plus or minus ten scale (50 +/- 10)
• Mean = 50; SD = 10
• Devised by W. A. McCall (1922, 1939) and named the T score
in honor of his professor E. L. Thorndike.
• Composed of a scale that ranges from 5 SD below the
mean to 5 SD above the mean.
• None of the scores are negative.
*Stanine- Takes the whole numbers from 1 to 9 without
decimals, each representing a range of performance that is
half an SD in width.
• Mean = 5; SD = 2
• Used by the US Air Force Assessment
=AREAS UNDER THE NORMAL CURVE=
• 50% of the scores occur above the mean and 50% of the
scores occur below the mean.
• Approximately 34% of all scores occur between the mean
and 1 SD above the mean.
• Approximately 34% of all scores occur between the mean
and 1 SD below the mean.
• Approximately 68% of all scores occur within +/- 1 SD
of the mean.
*Deviation IQ- Used for interpreting IQ scores • Mean =
100; SD = 15
*STEN- Standard ten • Mean = 5.5; SD = 2
*Graduate Record Exam (GRE) or Scholastic Aptitude Test
(SAT)- Used for admission to graduate school and college.
• Mean = 500; SD = 100
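All of these scales are linear transformations of the z score:
convert the raw score to z, then rescale to the target mean and
SD. A minimal Python sketch (the raw score of 65 and the
50 +/- 10 source scale are made-up illustration values):

```python
def z_score(raw, mean, sd):
    # How many SD units the raw score lies above or below the mean
    return (raw - mean) / sd

def rescale(z, new_mean, new_sd):
    # Convert a z score to any other standard-score scale
    return new_mean + z * new_sd

z = z_score(65, mean=50, sd=10)  # z = 1.5
print(rescale(z, 50, 10))    # T score       -> 65.0
print(rescale(z, 100, 15))   # Deviation IQ  -> 122.5
print(rescale(z, 500, 100))  # GRE/SAT scale -> 650.0
print(rescale(z, 5, 2))      # Stanine scale -> 8.0 (then rounded to a whole number)
```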
=RELATIONSHIP BETWEEN STANDARD SCORES=
* CORRELATION AND INFERENCE- Correlation coefficient is
a number that provides an index of the relationship
between two things.
*CORRELATIONAL STATISTICS- Are statistical
tools for testing the relationships or associations between
variables.
• A statistical tool of choice when the relationship between
variables is linear and when the variables being correlated
are continuous.
• COVARIANCE- how much two scores vary together.
• CORRELATION COEFFICIENT- a mathematical index
that describes the direction and magnitude of a
relationship; always ranges from -1.00 to +1.00 only.
*PEARSON PRODUCT MOMENT CORRELATION-
Determines the degree of variation in one variable that can
be estimated from knowledge about variation in the other
variable.
• Correlates two variables in interval or ratio scale format.
• Devised by Karl Pearson
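To illustrate how Pearson's r is computed from paired
interval/ratio data, here is a minimal Python sketch; the hours
and scores data are hypothetical:

```python
import statistics

def pearson_r(x, y):
    # Pearson product-moment correlation for two paired, continuous variables
    mx, my = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

hours = [1, 2, 3, 4, 5]        # hypothetical study hours
scores = [55, 60, 70, 75, 85]  # hypothetical exam scores
print(round(pearson_r(hours, scores), 2))  # 0.99 -> strong positive correlation
```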
=THREE TYPES OF CORRELATIONS=
*SPEARMAN RHO CORRELATION- A method of correlation
for finding the association between two sets of ranks; thus,
the two variables must be in ordinal scale.
• Frequently used when the sample size is small (fewer
than 30 pairs of measurements).
• Also called rank-order correlation coefficient or
rank-difference correlation.
• Devised by Charles Spearman
*BISERIAL CORRELATION- Expresses the relationship
between a continuous variable and an artificial
dichotomous variable.
• For example, the relationship between passing or failing
the bar exam (artificial dichotomous variable) and general
weighted average (GWA) in law school (continuous variable)
*POINT BISERIAL CORRELATION- Correlates one
continuous and one true dichotomous data.
• For example, score in the test (continuous or interval) and
correctness in an item within the test (true dichotomous).
*TRUE DICHOTOMY- There are only two possible
categories that are formed naturally.
• For example: Gender (M/F)
*ARTIFICIAL DICHOTOMY- Reflect an underlying
continuous scale forced into a dichotomy; there are other
possibilities in a certain category.
• For example: Exam score (Pass or Fail)
*PHI COEFFICIENT- Correlates two dichotomous variables; at
least one should be a true dichotomy.
• For example, gender and passing or failing the
2018 Physician Licensure Exam.
*TETRACHORIC COEFFICIENT- Correlates two dichotomous
variables; both are artificial dichotomies.
• For example, passing or failing a test and being highly
anxious or not.
=ISSUES IN THE USE OF CORRELATION=
*RESIDUAL- difference between the predicted and the
observed values.
*STANDARD ERROR OF ESTIMATE- standard deviation of
the residual; measure of accuracy and prediction.
*SHRINKAGE- the amount of decrease observed when a
regression equation is created for one population and then
applied to another.
*COEFFICIENT OF DETERMINATION (r²)- tells the
proportion of the total variation in scores on Y that we
know as a function of information about X. It also indicates
the percentage of variance shared by two variables; the
effect of one variable on the other.
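• For example, if r = .70 between study hours and exam scores,
then r² = .49: about 49% of the variation in exam scores is
accounted for by study hours.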
*COEFFICIENT OF ALIENATION- measures the non-association between two variables.
*RESTRICTED RANGE- significant relationships are difficult
to find if the variability is restricted.
=Essential Facts About Correlation=
• The degree of relationship between two variables is
indicated by the number in the coefficient, whereas the
direction of the relationship is indicated by the sign.
• Correlation, even if high, does not imply causation.
• High correlations allow us to make predictions
*REGRESSION- Defined broadly as the analysis of
relationships among variables for the purpose of
understanding how one variable may predict another
through the use of linear regression.
• Predictor (X) –serves as the IV; causes changes to the
other variable.
• Predicted (Y) –serves as the DV; result of the change as
the value of predictor changes.
• Represented by the formula: Y = a + bX
• INTERCEPT (a)- the point at which the regression line
crosses the Y axis
• REGRESSION COEFFICIENT (b)- the slope of the regression
line
• REGRESSION LINE- best fitting straight line through a set
of points in a scatter plot
• STANDARD ERROR OF ESTIMATE- measures the accuracy
of prediction
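A minimal Python sketch of fitting Y = a + bX by least squares;
the data pairs and variable names are hypothetical
illustrations:

```python
import statistics

def linear_regression(x, y):
    # Least-squares estimates for Y = a + bX (one predictor)
    mx, my = statistics.mean(x), statistics.mean(y)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
        sum((xi - mx) ** 2 for xi in x)  # regression coefficient (slope)
    a = my - b * mx                      # intercept: where the line crosses the Y axis
    return a, b

x = [1, 2, 3, 4, 5]       # predictor (X, the IV)
y = [55, 60, 70, 75, 85]  # predicted (Y, the DV)
a, b = linear_regression(x, y)
print(f"Y = {a:.1f} + {b:.1f}X")  # Y = 46.5 + 7.5X
print(a + b * 6)                  # predicted Y when X = 6 -> 91.5
```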
*MULTIPLE REGRESSION ANALYSIS- A type of multivariate
(three or more variables) analysis which finds the linear
combination of the predictors that provides the best
prediction.
• Statistical technique for predicting one variable from a
series of predictors.
• Considers the intercorrelations among all the variables involved.
• Applicable only when all data are continuous.
*STANDARDIZED REGRESSION COEFFICIENTS- Also known
as beta weights • Tell how much each variable in a given
list of variables predicts a single variable.
*FACTOR ANALYSIS- Is used to study the interrelationships
among a set of variables without reference to a criterion.
• Factors–these are the variables; also called principal
components.
• Factor Loading –the correlation between original items
and the factors.
*META-ANALYSIS- The family of techniques used to
statistically combine information across studies to produce
single estimates of the data under study.
ADVANTAGES:
• Can be replicated
• Conclusions tend to be more reliable and precise than
conclusions from single studies
• More focus on effect size than statistical significance
alone
• Promotes evidence-based practice – a professional
practice that is based on clinical research findings.
•Effect Size – the estimate of strength of relationship or
size of differences. Typically expressed as a correlation
coefficient.
=PARAMETRIC VS NON PARAMETRIC TESTS=
*PARAMETRIC- Assumptions are made for the population
• Homogenous data; normally distributed samples
• Mean and SD
• Randomly selected samples
*NON-PARAMETRIC- Assumptions are made for the
samples only
• Heterogeneous data; skewed distribution.
• Ordinals and categories
• Highly purposive sampling
=NON PARAMETRIC TESTS=
*MANN-WHITNEY U TEST
o Counterpart of t-test for independent samples
o Ordinal data
o Assumption of heterogeneous group
*WILCOXON SIGNED RANK TEST
o Counterpart of t-test for dependent samples
o Ordinal data
o Assumption of heterogeneous data
*KRUSKAL WALLIS H TEST
o Counterpart for One-Way ANOVA
o Ordinal data
o Assumption of heterogeneous group
*FRIEDMAN TEST
o Counterpart of the one-way repeated-measures ANOVA
(dependent samples)
o Ordinal data
o Assumption of heterogeneous data
TOPIC 3: ESSENTIALS OF TEST SCORE
INTERPRETATION (Of Test and Testing)
=ASSUMPTIONS ABOUT PSYCHOLOGICAL TESTING AND
MEASUREMENT=
*Assumption 1- Psychological Traits and States Exist
*TRAIT- Any distinguishable, relatively enduring way in
which one individual varies from another
•Psychological traits exist only as a construct – an informed,
scientific concept developed or constructed to describe or
explain behavior.
*STATES- Distinguish one person from another but are
relatively less enduring
*Assumption 2- Psychological Traits and States
Can Be Quantified and Measured
• Traits and states must be clearly defined to be measured
accurately.
• Test developers and researchers, like other people in
general, have many different ways of looking at and
defining the same phenomenon.
• Once defined, test developer considers the types of item
content that would provide insight into it.
*Cumulative scoring- the assumption that the higher the
test taker's score is, the higher the test taker stands on the
targeted ability or trait
*Assumption 3- Test-Related Behavior Predicts Non-Test-Related Behavior
• The tasks in some tests mimic the actual behaviors that
the test user is attempting to understand.
• The obtained sample of behavior is typically used to make
predictions about future behavior.
*Assumption 4- Tests and Other Measurement Techniques
Have Strengths and Weaknesses
• Competent test users understand how a test was
developed, the circumstances under which it is appropriate
to administer the test, how to administer the test and to
whom, and how the test results should be interpreted.
• Competent test users understand and appreciate the
limitations of the tests they use.
*Assumption 5- Various Sources of Error Are Part of the
Assessment Process
*Error- Traditionally refers to something other than what is
expected; it is a component of the measurement process.
• There is a long-standing assumption that factors other
than what a test attempts to measure will influence
performance on the test.
*Error Variance- The component of a test score
attributable to sources other than the trait or ability being
measured.
*Classical Test Theory- the assumption is made that each
test taker has a true score on a test that would be obtained
but for the action of measurement error.
*Assumption 6- Testing and Assessment Can Be Conducted
in a Fair and Unbiased Manner
• One source of fairness-related problems is the test user
who attempts to use a particular test with people whose
background and experience differ from those of the people
for whom the test was intended.
• The problem is more political than psychometric.
*Assumption 7- Testing and Assessment Benefit Society
Without tests, there would be…
• Subjective personnel hiring processes
• Children with special needs assigned to certain classes by
the gut feel of teachers and school administrators
• Great difficulty diagnosing educational difficulties
• No instruments to diagnose neuropsychological
impairments
• No practical way for the military to screen thousands of
recruits
=WHAT IS A GOOD TEST?=
Psychometric Soundness
*Reliability- consistency in measurement • The
precision with which the test measures and the extent to
which error is present in measurement • A perfectly reliable
measuring tool consistently measures in the same way
*Validity- when a test measures what it purports to
measure • An intelligence test is a valid test if it indeed
measures intelligence; the same goes for personality tests
and other psychological tests • Questions on a test's
validity may focus on the items that collectively make up
the test.
*Norms- These are the test performance data of a particular
group of test takers that are designed for use as a reference
when evaluating or interpreting individual test scores
• Obtained by administering the test to a sample of people
and obtaining the distribution of scores for that group
*Normative Sample- Group of people whose performance
on a particular test is analyzed for reference in evaluating
the performance of an individual test taker
*Norming- The process of deriving norms
• May be modified to describe a particular type of norm
derivation
=SAMPLING TO DEVELOP NORMS=
*Sample- A representative portion of the whole population
• It could be as small as one person, though samples that
approach the size of the population reduce the possible
sources of error due to insufficient sample size
*Sampling- The process of selecting the portion of the
universe deemed to be representative of the whole
population
=Developing Norms for a Standardized Test=
•The test developer administers the test according to the
standard set of instructions that will be used with the test
•The test developer describes the recommended setting for
giving the test
*Standardization- the process of administering a test to a
representative sample of test takers for the purpose of
establishing norms
• A test is said to be standardized when it has clearly
specified procedures for administration and scoring,
typically including normative data
•The test developer will summarize the data using
descriptive statistics, including measures of central
tendency and variability
=TYPES OF STANDARD ERROR=
*Standard Error of Measurement (SEM)- a statistic to
estimate the extent to which an observed score deviates
from a true score
*Standard Error of Estimate (SEE)- In regression, it is an
estimate of the degree of error involved in predicting the
value of one variable from another
*Standard Error of the Mean (SEM)- a measure of sampling
error
*Standard Error of the Difference (SED)- a statistic used to
estimate how large a difference between two scores should
be before the difference is considered statistically
significant
=EVALUATION OF NORMS=
• The test developer provides a precise description of the
standardization sample itself (the exact number of people
sampled and their demographics).
*Norm-referenced Test- a score is interpreted by
comparing it with the scores obtained by others on the
same test • A method of evaluation and a way of deriving
meaning from test scores by evaluating an individual's
score with reference to a set of standards
*Criterion-referenced Test- uses a specified content
domain rather than a specified population of people. The
score is interpreted against a predetermined standard
rather than the performance of a standardization group
• Also known as content-referenced or domain-referenced
Criterion: the cut score predetermined by the test developer
=TYPES OF NORMS (NORM-REFERENCED TESTING)=
*Developmental Norms- Norms developed on the basis of any
trait, ability, skill, or other characteristic that is presumed
to develop, deteriorate, or otherwise be affected by
chronological age, school grade, or stage of life
*Age Norms- average performance of different
samples of test takers who were at various ages at the time
the test was administered
*Grade Norms- designed to indicate the average
test performance of test takers in a given school grade
*Ordinal Scales- designed to identify the stage
reached by the child in the development of specific
behavior functions
*Within-Group Norms- the individual's performance is
evaluated in terms of the performance of the most nearly
comparable standardization group
*Percentiles- an expression of the percentage of
people whose score on a test or measure falls below a
particular score •it indicates the individual’s relative
position in the standardization sample
(Percentile rank: your position in the entire rank.
Example: Kyla placed in the 95th percentile.
Interpretation: Kyla is in the 95th percentile rank,
which means that she scored higher than 95% of the
population who also took the test.)
*Standard scores- are derived scores which use as
their unit the SD of the population upon which the test was
standardized
*Deviation IQ- a standard score on an intelligence
test with a mean of 100 and an SD that approximates the
SD of the SB IQ distribution
*National Norms- Norms derived from a large-scale,
nationally representative sample
*Subgroup Norms- segmented by any of the
criteria initially used in selecting subjects for the sample
*Local Norms- provide normative information with
respect to the local population's performance on some
test
*TRACKING – The tendency to stay at about the same level
relative to one's peers.
• Staying at the same level of a characteristic as compared
to the norms.
• This concept is applied to children when parents want to
know if the child is growing normally.
TOPIC 4: RELIABILITY
*Reliability- refers to the dependability or consistency of
measurement • The proportion of the total variance
attributed to true variance
*Reliability Coefficient- an index of reliability. A proportion
that indicates the ratio between the true score variance on
a test and the total variance
If we use X to represent an observed score, T to represent a
true score, and E to represent error, then the fact that an
observed score equals the true score plus error may be
expressed as follows: X = T + E
=Concepts in Reliability=
*Variance (σ2) - useful in describing sources of test score
variability. The standard deviation squared. This statistic is
useful because it can be broken into components.
*True Variance- variance from true differences
*Error variance- variance from irrelevant, random sources
*Measurement error- refers to, collectively, all of the
factors associated with the process of measuring some
variable, other than the variable being measured
• If σ² represents the total variance, σ²_t the true variance,
and σ²_e the error variance, then the relationship of the
variances can be expressed as:
σ² = σ²_t + σ²_e
• In this equation, the total variance in an observed
distribution of test scores (σ²) equals the sum of the true
variance (σ²_t) plus the error variance (σ²_e).
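• Worked example (illustrative figures): if the total variance of
a set of test scores is 50 and the error variance is 10, the true
variance is 40, and the reliability is 40 / 50 = .80.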
Types of Measurement Error
*Random error- is a source of error in measuring a
targeted variable caused by unpredictable fluctuations and
inconsistencies of other variables in the measurement
process.
*Systematic error- refers to a source of error in
measuring a variable that is typically constant or
proportionate to what is presumed to be the true value of
the variable being measured.
=Theory of Reliability=
Puts sampling error and correlation together in the context
of measurement
*Test Score Theory: Classical Test Theory (CTT)- known as
the true score model of measurement
•X = T + E where X is the observed score, T is the true score,
and E is the error
•Standard Error of Measurement- the standard deviation
of errors because of the assumption that the distribution of
random error will be the same for all people
•Widely used in assessment of reliability and has been
around for over 100 years
•Disadvantage: requires that exactly the same test items be
administered to each person
•Errors of measurement are random
•Each person has a true score that would be obtained if there
were no error in measurement
•The true score for an individual will not change with
repeated applications of the same test
*Item Response Theory (IRT)- provides a way to model the
probability that a person with X ability will be able to
perform at a level of Y
•Latent-Trait Theory- a synonym of IRT in academic literature,
because oftentimes the psychological or educational
construct being measured is physically unobservable and
the construct being measured may be a trait
•The computer is used to focus on the range of item
difficulty that helps assess on individual’s ability level
•For example, if the test taker gets several easy items
correct, the computer might quickly move to difficult items.
If the person gets several difficult items wrong, the
computer moves back to the area of item difficulty where
the person gets some items right and some wrong
•Requires a bank of items that have been systematically
evaluated for level of difficulty
•The more test takers who answer an item correctly, the
easier the item is
•The fewer test takers who answer an item correctly, the
more difficult the item is
*Domain Sampling Model- The greater the number of
items, the higher the reliability
•It considers the problems created by using a limited
number of items to represent a larger and more
complicated construct
•Conceptualizes reliability as the ratio of the variance of
the observed score on the test and the variance of the
long-run true score
=Sources of Error Variance=
*Test Construction- the item sampling or content sampling
is one of the sources of variance during test construction
due to the variation among items within a test as well as
variation among items between tests
*Test Administration- Sources of error variance that occur
during test administration may influence the test taker’s
attention or motivation.
•The test taker's reactions to those influences are the source
of one kind of error variance; for instance, the test
environment
•Test taker variables such as emotional problems, physical
discomfort, lack of sleep, effects of drugs or medication,
etc.
•Examiner-related variables- such as physical appearance
and demeanor, or the mere presence or absence of an
examiner
*Test Scoring and Interpretation- Not all tests can be
scored using a computer such as the tests administered
individually
• A test may employ objective-type items amenable to
computer scoring, yet a technical glitch may contaminate the
data
•If subjectivity is involved in scoring, the scorer or rater can
be the source of error variance
=Models/ Estimates of Reliability=
1. Test-retest Reliability- used to evaluate the error
associated with administering a test at two different times;
obtained by correlating the scores of the two administrations
•Should be used when measuring traits or characteristics
that do not change over time- static attributes
•Also known as coefficient of stability
•Scores on the second administration are usually higher
than they were on the first; the changes are not constant
across the group.
•If a test manual reports a test-retest correlation, always pay
attention to the interval between the two testing sessions
•Poor test-retest correlations do not always mean that the
test is unreliable; they could mean that the characteristic
under study has changed over time
*Carryover Effect- occurs when the first testing
session influences scores from the second session
*Practice Effect- a type of carryover effect wherein skills
are improved with practice
2. Parallel-and-Alternate-Forms Method- coefficient of
equivalence
•Accounts for item-sampling error
•Also uses correlation
•Disadvantages:
•Burdensome to develop two forms of the same
test
•Practical constraints include retesting of the same
group of individuals
•Involves creating a secondary form that has the same
questions and the same difficulty but a different
presentation (as in board exams)
*Parallel-Forms Reliability- compares two
equivalent forms of a test that measure the same attribute
•the two forms use different items; however, the rules
used to select items of a particular difficulty level are the
same
•theoretically, scores obtained on parallel forms correlate
equally with the true score or with other measures
•An estimate of the extent to which item sampling and other
errors have affected scores on versions of the same test
when, for each form of the test, the means and variances of
observed test scores are equal
*Alternate-Forms Reliability- different versions of
a test that have been constructed to be parallel
•Designed to be equivalent with respect to variables such as
content and level of difficulty
•Refers to the extent to which these different forms of the
same test have been affected by item sampling error or other
error; the same idea as parallel forms, but simply a different
version
3. Internal Consistency- estimates how consistent the test
items are with one another
•the degree of correlation among all the items on a scale
=Types of measures of internal consistency=
*Split-Half Reliability- a test is given and divided into
halves that are scored separately; the results of one half of
the test are then compared with the results of the other;
only one administration is required
•Equalizes the difficulty of the test
•A useful measure of reliability when it is impractical or
undesirable to use two tests or to administer a test twice;
the odd-even system
•Three steps: Divide the test into equivalent halves •Calculate
the Pearson's r between scores on the two halves of the test
•Adjust the half-test reliability using the Spearman-Brown
formula; corrected r = 2r / (1 + r)
*Spearman-Brown Formula- estimates what the
correlation would have been if each half had been the
length of the whole test •It increases the estimate of
reliability •Can be used to estimate the effect of
shortening or lengthening a test on its reliability
•Can also be used to determine the number of items
needed to attain a desired level of reliability; the
new items must be equivalent in content and difficulty so
that the longer test still measures what the original test
measured
(It gives the correlation as if each half test were the same
length as the whole test.
It can also tell how many test items to remove for the
reliability coefficient to fall within an acceptable range for a
newly developed test.
It addresses the weakness of the split-half method.)
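A minimal Python sketch of the general Spearman-Brown
formula; the .70 half-test correlation is an illustrative value:

```python
def spearman_brown(r, n=2.0):
    # General Spearman-Brown formula: reliability of a test lengthened n times
    # r: reliability of the original (e.g., half) test; n = 2 doubles the length
    return (n * r) / (1 + (n - 1) * r)

# Example: the two halves of a test correlate at .70
print(round(spearman_brown(0.70), 2))  # 0.82 -> estimated full-test reliability
```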
*Kuder-Richardson Formulas:
*KR-20- used for a test in which items are
dichotomous or can be scored right or wrong (there is a
correct answer); homogeneous, with items of unequal
difficulty •The 20th formula developed in the series, born of
G. Frederic Kuder and M. W. Richardson's dissatisfaction
with the split-half methods
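•For reference, the formula is KR-20 = [k / (k − 1)] × [1 −
(Σpq) / σ²], where k is the number of items, p is the
proportion of test takers answering an item correctly,
q = 1 − p, and σ² is the variance of the total scores.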
*KR-21- for items that have equal difficulty or whose
average difficulty level is 50%
•Cannot be applied to personality tests
*Coefficient Alpha (α)- Can be thought of as the mean of all
possible split-half correlations, corrected by the Spearman-Brown
formula; thus providing the lowest estimate of
reliability that one can expect.
• Can be used when the two halves of a test have unequal
variances.
• Appropriate for tests that have nondichotomous (no
correct and incorrect) items; personality tests.
• The most preferred statistic for obtaining an estimate of
internal consistency reliability.
• Best used in personality tests.
• Widely used measure of reliability, in part because it
requires only one administration of the test.
• Values typically range from 0 to +1 only.
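A minimal Python sketch of coefficient alpha on a small
made-up item matrix (rows = test takers, columns = items);
the data are purely illustrative:

```python
import numpy as np

def cronbach_alpha(items):
    # alpha = (k / (k - 1)) * (1 - sum of item variances / variance of total scores)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars / total_var)

data = np.array([[4, 5, 4, 4],
                 [2, 2, 3, 2],
                 [3, 4, 3, 3],
                 [5, 5, 5, 4],
                 [1, 2, 2, 1]])
print(round(cronbach_alpha(data), 2))  # ~0.98 on this toy data
```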
*COEFFICIENT OMEGA- The extent to which all
items measure the same underlying trait.
•Overcomes a weakness of coefficient alpha, whose
coefficient can be greater than zero even though the items
are not assessing the same trait.
*Average Proportional Distance (APD)-
• Evaluates the internal consistency of a test by focusing on
the degree of difference that exists between item scores.
• General rule: an obtained value of .2 or lower indicates
excellent internal consistency; values from .25 to .2 are in
the acceptable range.
• A relatively new measure developed by Sturman et al.
4. INTER-SCORER (RATER) RELIABILITY- The degree of
agreement or consistency between two or more scorers
(raters or judges) with regard to a particular measure.
• Often used to code nonverbal behavior.
• Coefficient of Inter-Scorer Reliability- the degree of
consistency among scorers.
*Kappa Statistics- the best method to assess the level of
agreement among several observers.
*Cohen's kappa- for a small number of raters or
scorers (2 raters)
*Fleiss' kappa- measures the agreement among a
fixed number of raters (more than 2)
(Note: what is being measured is the reliability of the raters'
scoring of a group's performance, not the test itself; i.e., how
far the raters agree on the performance being rated.)
*Kappa- indicates the actual agreement as a proportion of
the potential agreement following a correction for chance
agreement.
POINTS TO REMEMBER:
• Indices of reliability provide an index that is a
characteristic of a particular group of test scores, not the
test itself (Caruso, 2000; Yin & Fan, 2000). • Measures of
reliability are estimates, and estimates are subject to error.
• The precise amount of error inherent in a reliability
estimate will vary with various factors, such as the sample of
test takers from which the data were drawn.
• A reliability index published in a test manual might be
impressive, but remember that the reported reliability was
achieved with a particular group of test takers.
HOW RELIABLE IS RELIABLE?
• Usually depends on the use of the test.
• Reliability estimates in the range of .70 to .80 are good
enough for most purposes of basic research.
• Estimates of .95 may not be very useful because they may
suggest that all of the items are essentially measuring the
same thing and the measure could easily be shortened.
=SOLVING LOW RELIABILITY=
*Increase the Number of Test Items
• the larger the sample, the more likely that the test will
represent the true characteristic.
• In domain-sampling model, the reliability of a test
increases, as the number of items increases.
• Can use the general Spearman-Brown formula to achieve
a desired level of reliability
*Factor Analysis
• Unidimensionality makes the test more reliable; thus, one
factor should account for considerably more of the variance
than any other factor.
• Remove items that do not load on the one factor being
measured.
*Item Analysis
• Correlation between each item and the total score for the
test, often called discriminability analysis.
• When the correlation between performance on a single
item and the total test score is low, the item is probably
measuring something different from the other items on the
test.
*Correction for Attenuation
• Used to estimate the correlation between variables
when the test is deemed affected by error.
• The estimation of what the correlation between tests
would have been if there had been no error in
measurement.
=NATURE OF THE TEST=
*Homogeneous or Heterogeneous Test Items
• HOMOGENEOUS: items that are functionally uniform
throughout. The test is designed to measure one factor,
such as one ability or one trait, which yields a high degree of
internal consistency.
• HETEROGENEOUS: more than one factor is being
measured; thus, internal consistency might be low relative
to a more appropriate estimate of test-retest reliability.
*Dynamic or Static Characteristic, Ability, Trait
• DYNAMIC- a trait, state, or ability presumed to be
ever-changing as a function of situational and cognitive
experiences.
• STATIC- a characteristic, ability, or trait that is stable;
hence, test-retest and alternate-forms methods would be
appropriate
*Range of Test Scores Is or Is Not Restricted
• Important to consider when interpreting a coefficient of
reliability.
• If the variance of either variable in a correlational analysis
is restricted by the sampling procedure used, then the
resulting correlation coefficient tends to be lower.
• If the variance is inflated by the sampling procedure, then
the resulting correlation coefficient is higher.
*Speed Test or Power Test
• SPEED TESTS- generally contain items of uniform level of
difficulty (typically low) so that, when given generous time
limits, the test taker should be able to complete all the items
correctly (administered with a time limit)
• POWER TESTS- the time limit is long enough to attempt
all items, but some test items are so difficult that no test
taker is able to attain a perfect score (effectively no time limit)
*Test Is or Is Not Criterion-Referenced
• Designed to provide an indication of where a test taker
stands with respect to some variable or criterion
• Contains material that has to be mastered in a
hierarchical fashion. For example, tracing a letter pattern
before attempting to master writing
• Scores on this measure tend to be interpreted in pass-fail
terms, and any scrutiny of performance on individual
items tends to be for diagnostic and remedial purposes.
=OTHER CONCEPTS TO REMEMBER=
*STANDARD ERROR OF MEASUREMENT (SEM)
• A tool used to estimate or infer the extent to which a test
score deviates from a true score.
• Provides an estimate of the amount of error inherent in an
observed score or measurement.
• An index of the extent to which one individual's scores vary
over tests presumed to be parallel.
• Confidence Interval- the range or band of test scores that
is likely to contain the true score.
*STANDARD ERROR OF DIFFERENCE (SED)
• Determines how large the difference between two scores
should be before it is considered statistically significant.
• Larger than the SEM for either score alone because it is
affected by measurement error in both scores.
• If two scores each contain error such that in each case the
true score could be higher or lower, then we would want the
two scores to be further apart before we conclude that
there is a significant difference between them.
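The SEM, the confidence interval, and the SED can be illustrated together. The sketch below (not from the reviewer; the scale values are hypothetical) uses the standard formulas SEM = SD·√(1 − r) and SED = √(SEM1² + SEM2²).

```python
from math import sqrt

def sem(sd: float, reliability: float) -> float:
    """Standard error of measurement: SD * sqrt(1 - reliability)."""
    return sd * sqrt(1 - reliability)

def confidence_interval(observed: float, sem_value: float, z: float = 1.96):
    """Band of scores likely (here, 95%) to contain the true score."""
    return observed - z * sem_value, observed + z * sem_value

def sed(sem_1: float, sem_2: float) -> float:
    """Standard error of the difference between two scores."""
    return sqrt(sem_1 ** 2 + sem_2 ** 2)

# Example: an IQ-type scale (SD = 15) with reliability .91.
e = sem(15, 0.91)                   # 4.5
print(confidence_interval(110, e))  # roughly (101.2, 118.8)
print(sed(e, e))                    # ~6.36, larger than either SEM alone
```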
*Don't forget:
• The relationship between the SEM and the reliability of a
test is inverse; the higher the reliability of a test (or
individual subtest within a test), the lower the SEM.
• In accordance with the True Score Model, an obtained
test score represents one point in the theoretical
distribution of scores the test taker could have obtained.
TOPIC 5: VALIDITY
*VALIDITY- A judgment or estimate of how well a test
measures what it purports to measure in a particular context.
*Validation: the process of gathering and evaluating evidence
about validity.
*LOCAL VALIDATION STUDIES- necessary when the test user
plans to alter in some way the format, instructions, language,
or content of the test.
• Also necessary if a test user sought to use a test with a
population of test takers that differed in some significant way
from the population on which the test was standardized.
(Know the context: no one has the right to translate a
psychological test and administer it to a certain group of
people in the language they understand without the
translation undergoing local validation.)
*CONSTRUCT UNDERREPRESENTATION- Describes
the failure to capture important components of a construct.
• For example, an English proficiency exam that did not cover
the test takers' knowledge of the parts of speech.
*CONSTRUCT-IRRELEVANT VARIANCE- Occurs when
scores are influenced by factors irrelevant to the construct.
• For example, test anxiety, physical condition, etc.
*Quantification of Content Validity (Lawshe, 1975)
• Essential (Accepted)
• Useful but not essential (Revise)
• Not necessary (Rejected)
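Lawshe's quantification is usually computed as the content validity ratio, CVR = (ne − N/2) / (N/2), where ne is the number of panelists rating an item "essential" and N is the total number of panelists. Below is a minimal sketch (the function name and counts are hypothetical):

```python
def content_validity_ratio(n_essential: int, n_panelists: int) -> float:
    """Lawshe's CVR = (ne - N/2) / (N/2), ranging from -1 to +1."""
    half = n_panelists / 2
    return (n_essential - half) / half

# Example: 8 of 10 subject matter experts rate an item "essential".
print(content_validity_ratio(8, 10))  # 0.6
```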
=ASPECTS/ TRINITARIAN MODELS OF VALIDITY=
* Face Validity- A judgment concerning how relevant test
items appear to be.
• Encourages the test taker to take the test seriously.
• Even if a test lacks face validity, it may still be relevant and
useful.
(E.g., a mathematics test that contains subject-verb
agreement items lacks face validity.)
* CONTENT-RELATED VALIDITY- Adequacy of representation
of the conceptual domain the test is designed to cover.
• It describes a judgment of how adequately a test samples
behavior representative of the universe of behavior that the
test was designed to sample.
• Requires subject matter experts to ensure all items are valid.
(How sufficiently the domain that should be measured is
represented; all discussed material is covered in the test.)
*CRITERION-RELATED VALIDITY- A judgment of how
adequately a test score can be used to infer an individual's
most probable standing on some measure of interest - the
measure of interest being the criterion.
*Criterion- the standard against which the test or test
score is compared or evaluated. It can be a test score,
behavior, amount of time, rating, psychiatric diagnosis,
training cost, index of absenteeism, etc.
• For example, a test might be used to predict which engaged
couples will have successful marriages and which ones will get
annulled. Marital success is the criterion, but it cannot be
known at the time the couple takes the premarital test.
=Characteristics of a Criterion=
*RELEVANT- must be pertinent or applicable to the matter at
hand. (The criterion you have is the criterion needed.)
*VALID- should be adequate for the purpose for which it is
being used. If one test is being used as the criterion to
validate a second test, then evidence should exist that the
first test is valid.
*UNCONTAMINATED- the criterion is not affected by other
criterion measures; otherwise, criterion contamination
occurs.
=Considerations in Predictive Validity=
*BASE RATE- the extent to which a particular trait, behavior,
characteristic, or attribute exists in the population (existence
of the trait).
*HIT RATE- the proportion of people a test accurately
identifies as possessing or exhibiting a particular trait,
behavior, characteristic, or attribute (accurate prediction of
the presence of the trait or behavior).
*MISS RATE- the proportion of people the test fails to identify
as having, or not having, a particular trait, characteristic, or
attribute; an inaccurate prediction.
*False Positive- a miss in which the test taker was predicted
to have the attribute being measured when in fact he or she
did not; akala meron pero wala.
*False Negative- a miss in which the test taker was not
predicted to have the attribute being measured when in fact
he or she did; akala wala pero meron.
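The rates above can be made concrete with a small worked example. Below is a minimal sketch (not from the reviewer; the counts are hypothetical, and note that some texts define the hit rate more narrowly than the "all accurate predictions" version used here):

```python
# Hypothetical screening-test outcomes checked against the actual criterion.
true_pos, false_pos = 40, 10    # test said "has the trait"
false_neg, true_neg = 15, 135   # test said "does not have the trait"
n = true_pos + false_pos + false_neg + true_neg  # 200 people

base_rate = (true_pos + false_neg) / n  # how common the trait actually is
hit_rate = (true_pos + true_neg) / n    # accurate identifications of both kinds
false_positive_rate = false_pos / n     # akala meron pero wala
false_negative_rate = false_neg / n     # akala wala pero meron

print(base_rate, hit_rate, false_positive_rate, false_negative_rate)
# 0.275 0.875 0.05 0.075
```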
=Types of Criterion Validity=
*CONCURRENT VALIDITY- Comes from assessments of the
simultaneous relationship between the test and the criterion.
Indicates the extent to which test scores may be used to
estimate an individual's present standing on a criterion.
• For example, work samples, or scores on the Beck
Depression Inventory (BDI) and BDI-II.
*PREDICTIVE VALIDITY- How accurately scores on the test
predict some criterion measure.
• For example, scores on a College Admission Test (CAT) and
the GPA of freshmen provide evidence of the predictive
validity of such an admission test.
• NMAT predicts GPA in Med School; PhiLSAT predicts GPA in
Law School.
*Incremental Validity- the value added by the test or by the
criterion.
• The degree to which an additional predictor explains
something about the criterion measure that is not explained
by predictors already in use.
• The predictive power of the test to see or discover
something other than what it is intended to measure.
(What more the test can offer, the extra mile, beyond what
it is already offering.)
* Validity Coefficient- The relationship between a test score
and a score on the criterion measure. It is the computed
correlation coefficient.
• There are no hard-and-fast rules about how large the
validity coefficient must be.
• Coefficients of .60 or larger are rare; .30 to .40 are
commonly considered high.
(The validity coefficient is used in criterion-related validity.)
=EVALUATING VALIDITY COEFFICIENT=
• Look for Changes in the Cause of Relationships
• Identify if the Criterion is Valid and Reliable
o Criterion validity studies would mean nothing if the criterion
is not valid or reliable
• Review the Subject Population in the Validity Study
o The validity study might have been done on a population
that does not represent the group to which inferences will be
made.
• Be Sure the Sample Size was Adequate
o A good validity study will present evidence for
cross-validation; hence, the sample size should be large
enough.
o A cross-validation study assesses how well the test actually
forecasts performance for an independent group of subjects.
• Never Confuse the Criterion with the Predictor
o The criterion is the standard being measured or the desired
outcome, while the predictor is a variable that affects the
criterion.
• Check for Restricted Range on Both Predictor and Criterion
o Restricted range happens if all scores fall very close
together.
• Review Evidence of Validity Generalization
o The findings obtained in one situation may be applied to
other situations.
•Consider Differential Prediction
o Predictive relationships may not be the same for all
demographic groups.
=Other Concepts=
*EXPECTANCY DATA- provide information that can be used in
evaluating the criterion-related validity of a test.
*EXPECTANCY TABLE- shows the percentage of people within
specified test-score intervals who subsequently were placed
in various categories of the criterion (e.g., passed or failed).
*CONSTRUCT- a scientific idea developed or hypothesized to
describe or explain behavior; something built by mental
synthesis. Examples are intelligence, motivation, job
satisfaction, self-esteem, etc.
*CONSTRUCT VALIDITY- a judgment about the
appropriateness of inferences drawn from test scores
regarding an individual's standing on a variable.
• Established through a series of activities in which a
researcher simultaneously defines some construct and
develops the instrumentation to measure it.
• Viewed as the unifying concept for all validity evidence.
*CONSTRUCT VALIDATION- assembling evidence about what
the test means; done by showing the relationship between a
test and other tests and measures.
=Main Types of Construct Validity Evidence=
*CONVERGENT VALIDITY- When a measure correlates well
with other tests (standardized, published, etc.) that are
designed to measure a similar construct.
• Yields a moderate to high correlation coefficient wherein
the scores on the constructed test are correlated with the
scores on an established test that measures the same
construct.
• Can be obtained by:
o Showing that a test measures the same thing as the other
tests used for the same purpose; and
o Demonstrating specific relationships that can be expected
if the test is really doing its job.
*DISCRIMINANT (Divergent) VALIDITY- A proof that the test
measures something unique.
•Indicates that the measure does not represent a construct
other than the one for which it was devised.
• Correlating a test with another measure that has little to no
relationship to it; hence, the correlation coefficient should be
low to zero.
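As a rough illustration of how convergent and discriminant evidence might be checked, here is a minimal sketch; the scale names and score values are entirely hypothetical.

```python
import numpy as np

# Hypothetical scores: a new anxiety scale, an established anxiety scale,
# and an unrelated measure (e.g., mechanical aptitude).
new_scale = np.array([12, 18, 25, 9, 30, 22, 15, 27])
established = np.array([14, 20, 27, 10, 31, 21, 13, 29])
unrelated = np.array([46, 52, 44, 47, 50, 41, 55, 43])

print(np.corrcoef(new_scale, established)[0, 1])  # high -> convergent evidence
print(np.corrcoef(new_scale, unrelated)[0, 1])    # low -> discriminant evidence
```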
=Other Evidences of Construct Validity =
*Evidence of Homogeneity- how uniform a test is in
measuring a single concept. (Homogeneity- the test items
all measure the same characteristic.)
*Evidence of Changes with Age- some constructs are
expected to change over time.
*Evidence of Pretest-Posttest Changes- result of intervening
experiences.
*Evidence from Distinct Groups- or contrasted groups - score
on a test vary in a predictable way as a function of
membership in some group.
*FACTOR ANALYSIS- A method of finding the minimum
number of dimensions or factors for explaining the largest
number of variables.
*EXPLORATORY FACTOR ANALYSIS: entails
estimating, or extracting, factors; deciding how many factors
to retain; and rotating factors to an interpretable orientation.
*CONFIRMATORY FACTOR ANALYSIS: assesses the degree to
which a hypothetical model - which includes factors - fits the
actual data.
= Other Issues in Validity=
*RATING ERROR- a judgment resulting from the intentional or
unintentional misuse of a rating scale.
*Leniency or Generosity Error- the rater's tendency to be
lenient in scoring, marking, or grading (mataas yung binigay
na score).
*Central Tendency Error- the rater's reluctance to give
scores at either the positive or negative extremes (the scorer
has difficulty giving extreme scores).
*Severity Error- the rater gives low scores regardless of
the performance (even if the rater is relying on the rubric
given).
*Halo Effect- the rater gives high scores due to his or her
failure to discriminate among conceptually distinct and
independent aspects of behavior (nadadala ka ng positive
attributes: hindi maganda yung gawa pero dahil kilala mo siya
or mabait siya).
TOPIC 6: TEST DEVELOPMENT
*Test development- is an umbrella term for all that goes
into the process of creating a test.
The process of developing a test occurs in five stages:
1. test conceptualization;
2. test construction;
3. test tryout;
4. item analysis;
5. test revision.
Once the idea for a test is conceived (test
conceptualization), test construction begins.
*Test construction- is a stage in the process of test
development that entails writing test items (or re-writing or
revising existing items), as well as formatting items, setting
scoring rules, and otherwise designing and building a test
Once a preliminary form of the test has been developed, it is
administered to a representative sample of test takers
under conditions that simulate the conditions that the final
version of the test will be administered under (test tryout)
*Item analysis- statistical procedures employed to assist in
making judgments about which items are good as they are,
which items need to be revised, and which items should be
discarded.
*Test revision- refers to action taken to modify a test’s
content or format for the purpose of improving the test’s
effectiveness as a tool of measurement.
=Relationship Between Reliability and Validity=
A test can be reliable even if it is not valid; but a test cannot
be valid if it is not reliable.
=Test Conceptualization=
*Asexuality- may be defined as a sexual orientation
characterized by a long-term lack of interest in a sexual
relationship with anyone or anything. (An example of an
emerging construct that could prompt the conceptualization
of a new test.)
Norm-referenced tests compare individual performance with
the performance of a group.
Criterion-referenced assessments measure how well a
student has mastered a specific learning goal (or objective).
*Pilot work, pilot study, and pilot research- refer, in general,
to the preliminary research surrounding the creation of a
prototype of the test. Test items may be pilot studied (or
piloted) to evaluate whether they should be included in the
final form of the instrument.
=Test construction=
*Scaling- may be defined as the process of setting rules for
assigning numbers in measurement. Stated another way,
scaling is the process by which a measuring device is designed
and calibrated and by which numbers (or other indices)—
scale values—are assigned to different amounts of the trait,
attribute, or characteristic being measured.
*Age-based scale- used if the testtaker’s test performance
as a function of age is of critical interest.
*Grade-based scale- used if the testtaker’s test
performance as a function of grade is of critical interest.
*Stanine scale- used if all raw scores on the test are to be
transformed into scores that can range from 1 to 9.
*A scale might be described in still other ways. For
example, it may be categorized as unidimensional as opposed
to multidimensional. It may be categorized as comparative
as opposed to categorical.
=Scaling methods=
1. Rating scale- a grouping of words, statements, or symbols
on which judgments of the strength of a particular trait,
attitude, or emotion are indicated by the testtaker. Rating
scales can be used to record judgments of oneself, others,
experiences, or objects, and they can take several forms.
*Summative scale- the final test score is obtained by
summing the ratings across all the items.
*Likert scale- one type of summative rating scale. It is
used extensively in psychology, usually to scale attitudes.
Likert scales are relatively easy to construct. Each item
presents the testtaker with five alternative responses
(sometimes seven), usually on an agree–disagree or approve–
disapprove continuum.
Rating scales differ in the number of dimensions underlying
the ratings being made:
•Unidimensional- meaning that only one dimension is
presumed to underlie the ratings.
•Multidimensional- meaning that more than one
dimension is thought to guide the testtaker’s responses.
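As a small illustration of the summative (Likert) scoring described above, here is a minimal sketch; the items, ratings, and the reverse-keyed item are hypothetical, and reverse-keyed items on a 1-5 scale are rescored as 6 minus the rating before summing.

```python
responses = {"item1": 4, "item2": 2, "item3": 5, "item4": 1}
reverse_keyed = {"item4"}  # e.g., an oppositely worded item

# Summative scoring: reverse-score keyed items, then sum across all items.
total = sum(
    (6 - rating) if item in reverse_keyed else rating
    for item, rating in responses.items()
)
print(total)  # 4 + 2 + 5 + (6 - 1) = 16
```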
*Method of paired comparisons- another scaling method
that produces ordinal data. Testtakers are presented with
pairs of stimuli (two photographs, two objects, two
statements), which they are asked to compare.
2. Sorting Scale- stimuli such as printed cards, drawings,
photographs, or other objects are typically presented to
testtakers for evaluation
*Comparative scaling- one method of sorting; it entails
judgments of a stimulus in comparison with every other
stimulus on the scale. Testtakers would be asked to sort the
cards from most justifiable to least justifiable. Comparative
scaling could also be accomplished by providing testtakers
with a list of 30 items on a sheet of paper and asking them to
rank the justifiability of the items from 1 to 30.
*Categorical scaling- Stimuli are placed into one of
two or more alternative categories that differ quantitatively
with respect to some continuum. Testtakers might be given
30 index cards, on each of which is printed one of the 30
items. Testtakers would be asked to sort the cards into three
piles: those behaviors that are never justified, those that are
sometimes justified, and those that are always justified.
3. Guttman scale- Items on it range sequentially from weaker
to stronger expressions of the attitude, belief, or feeling being
measured. A feature of Guttman scales is that all respondents
who agree with the stronger statements of the attitude will
also agree with milder statements.
•If this were a perfect Guttman scale, then all respondents
who agree with “a” (the most extreme position) should also
agree with “b,” “c,” and “d.” All respondents who disagree
with “a” but agree with “b” should also agree with “c” and
“d,” and so forth
* scalogram analysis- an item-analysis procedure and
approach to test development that involves a graphic
mapping of a testtaker’s responses
* Thurstone’s equal appearing intervals method- is
one scaling method used to obtain data that are presumed to
be interval in nature.
•It is an example of a scaling method of the direct estimation
variety.
*In contrast to scaling methods that involve indirect
estimation, there is no need to transform the testtaker’s
responses into some other scale.
=Writing Items=
*Item pool- is the reservoir or well from which items will or
will not be drawn for the final version of the test
Multiply the number of items required in the pool for one
form of the test by the number of forms planned, and you
have the total number of items needed for the initial item
pool. (For example, if 50 items are needed for each of two
planned forms, the initial pool should contain at least 100
items.)
* Item format- Variables such as the form, plan, structure,
arrangement, and layout of individual test items
*selected-response format- require testtakers to
select a response from a set of alternative responses
*constructed-response format- require testtakers to
supply or to create the correct answer, not merely to select it
=Types of selected-response item formats=
*Multiple-choice format- has three elements: (1) a stem, (2) a
correct alternative or option, and (3) several incorrect
alternatives or options variously referred to as distractors or
foils.
*Matching item- the testtaker is presented with two columns:
premises on the left and responses on the right
*Binary-choice item- A multiple-choice item that contains
only two possible responses. Perhaps the most familiar
binary-choice item is the true–false item
=Three types of constructed-response items=
*Completion item- requires the examinee to provide a word
or phrase that completes a sentence, as in the following
example:
The standard deviation is generally considered the most
useful measure of __________.
(A completion item is also referred to as a short-answer item.)
*Short-answer item- It is desirable for completion or
short-answer items to be written clearly enough that the
testtaker can respond succinctly—that is, with a short answer.
*Essay item- as a test item that requires the testtaker to
respond to a question by writing a composition, typically one
that demonstrates recall of facts, understanding, analysis,
and/or interpretation.
An essay item is useful when the test developer wants the
examinee to demonstrate a depth of knowledge about a
single topic
=Writing items for computer administration=
*Item bank- is a relatively large and easily accessible
collection of test questions. Instructors who regularly teach a
particular course sometimes create their own item bank of
questions that they have found to be useful on examinations
* computerized adaptive testing (CAT)- refers to an
interactive, computer administered test-taking process
wherein items presented to the testtaker are based in part on
the testtaker’s performance on previous items. CAT tends to
reduce floor effects and ceiling effects.
*Floor effect- refers to the diminished utility of an
assessment tool for distinguishing testtakers at the low end of
the ability, trait, or other attribute being measured.
*Ceiling effect- refers to the diminished utility of an
assessment tool for distinguishing testtakers at the high end
of the ability, trait, or other attribute being measured.
*Item branching- The ability of the computer to tailor
the content and order of presentation of test items on the
basis of responses to previous items
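A toy sketch of item branching (not an actual CAT algorithm; the item pool, the difficulty values, and the crude ability-update rule are all made up for illustration): the next item presented is the unused item whose difficulty is closest to the current ability estimate.

```python
pool = {"i1": -1.0, "i2": -0.5, "i3": 0.0, "i4": 0.5, "i5": 1.0}  # difficulties
ability = 0.0  # provisional estimate of the testtaker's standing

def next_item(pool, used, ability):
    """Pick the unused item whose difficulty is nearest the ability estimate."""
    candidates = {k: v for k, v in pool.items() if k not in used}
    return min(candidates, key=lambda k: abs(candidates[k] - ability))

used = set()
for correct in [True, True, False]:  # simulated responses
    item = next_item(pool, used, ability)
    used.add(item)
    ability += 0.5 if correct else -0.5  # crude step update, not real IRT
    print(item, ability)  # i3 0.5 -> i4 1.0 -> i5 0.5
```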
=Scoring Items=
*Class scoring (also referred to as category scoring)-
testtaker responses earn credit toward placement in a
particular class or category with other testtakers whose
pattern of responses is presumably similar in some way.
This approach is used by some diagnostic systems wherein
individuals must exhibit a certain number of symptoms to
qualify for a specific diagnosis.
*Ipsative scoring- is comparing a testtaker’s score on one
scale within a test to another scale within that same test.
=Test Tryout=
•The test should be tried out on people who are similar in
critical respects to the people for whom the test was
designed.
•Example, if a test is designed to aid in decisions regarding
the selection of corporate employees with management
potential at a certain level, it would be appropriate to try out
the test on corporate employees at the targeted level
•An informal rule of thumb is that there should be no fewer
than 5 subjects, and preferably as many as 10, for each item
on the test (e.g., 200 to 400 subjects for a 40-item test). In
general, the more subjects in the tryout the better.
=What Is a Good Item?=
* Pseudobulbar affect (PBA)- is a neurological disorder
characterized by frequent and involuntary outbursts of
laughing or crying that may or may not be appropriate to the
situation
=Item Analysis=
1. The Item-Difficulty Index- calculating the proportion of the
total number of testtakers who answered the item correctly.
•A lowercase italic “p” (p) is used to denote item difficulty,
and a subscript refers to the item number (so p1 is read
“item-difficulty index for item 1”).
•Note that the larger the item-difficulty index, the easier the
item. Because p refers to the percent of people passing an
item, the higher the p for an item, the easier the item.
•The statistic referred to as an item-difficulty index in the
context of achievement testing may be an item-endorsement
index in other contexts, such as personality testing
2. Item-reliability index- provides an indication of the internal
consistency of a test
•The higher this index, the greater the test’s internal
consistency
•This index is equal to the product of the item-score standard
deviation (s) and the correlation (r) between the item score
and the total test score.
*Factor analysis and inter-item consistency- a
statistical tool useful in determining whether items on a test
appear to be measuring the same thing(s).
3. item-validity index- is a statistic designed to provide an
indication of the degree to which a test is measuring what it
purports to measure.
•The higher the item-validity index, the greater the test’s
criterion-related validity.
•The item-validity index can be calculated once the following
two statistics are known:
■ the item-score standard deviation
■ the correlation between the item score and the criterion
score
The item-score standard deviation of item 1 (denoted
by the symbol s1) can be calculated using the index of the
item’s difficulty (p1) in the following formula:
s1 = √(p1(1 − p1))
4. item-discrimination index- is a measure of item
discrimination, symbolized by a lowercase italic “d” (d). This
estimate of item discrimination, in essence, compares
performance on a particular item with performance in the
upper and lower regions of a distribution of continuous test
scores.
•Item-discrimination index is a measure of the difference
between the proportion of high scorers answering an item
correctly and the proportion of low scorers answering the
item correctly
•The higher the value of d, the greater the number of high
scorers answering the item correctly.
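Putting the indices above together, here is a minimal sketch using hypothetical 0/1 item scores; it computes each item's difficulty p, the item-score standard deviation s = √(p(1 − p)), and a simple discrimination index d based on upper and lower halves of the total-score distribution.

```python
import numpy as np

# Hypothetical 0/1 scored responses: rows = testtakers, columns = items.
scores = np.array([
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [0, 0, 0],
    [1, 1, 1],
    [1, 0, 1],
])

p = scores.mean(axis=0)       # item-difficulty index; higher p = easier item
s = np.sqrt(p * (1 - p))      # item-score standard deviation from p

# Discrimination index d: proportion correct among high total scorers
# minus proportion correct among low total scorers.
order = np.argsort(scores.sum(axis=1))
low, high = order[:3], order[3:]
d = scores[high].mean(axis=0) - scores[low].mean(axis=0)

print(p, s, d)
```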
*Item-characteristic curve- is a graphic representation of
item difficulty and discrimination.
=Other Considerations in Item Analysis=
*Guessing- a problem that has eluded any universally
acceptable solution. Any correction for guessing must
address three issues:
1. A correction for guessing must recognize that,
when a respondent guesses at an answer on an achievement
test, the guess is not typically made on a totally random basis.
It is more reasonable to assume that the testtaker’s guess is
based on some knowledge of the subject matter and the
ability to rule out one or more of the distractor alternatives.
2. A correction for guessing must also deal with the
problem of omitted items. Sometimes, instead of guessing,
the testtaker will simply omit a response to an item.
3. Just as some people may be luckier than others in
front of a Las Vegas slot machine, so some testtakers may be
luckier than others in guessing the choices that are keyed
correct. Any correction for guessing may seriously
underestimate or overestimate the effects of guessing for
lucky and unlucky testtakers.
*Item fairness- refers to the degree, if any, to which a test
item is biased.
*Biased test item- is an item that favors one particular group
of examinees in relation to another when differences in
group ability are controlled.
*Speed Tests- The closer an item is to the end of the test, the
more difficult it may appear to be. This is because testtakers
simply may not get to items near the end of the test before
time runs out.
•How can items on a speed test be analyzed? Perhaps the
most obvious solution is to restrict the item analysis of items
on a speed test only to the items completed by the testtaker.
However, this solution is not recommended, for at least three
reasons:
(1) item analyses of the later items would be based on
a progressively smaller number of testtakers, yielding
progressively less reliable results;
(2) if the more knowledgeable examinees reach the
later items, then part of the analysis is based on all testtakers
and part is based on a selected sample; and
(3) because the more knowledgeable testtakers are
more likely to score correctly, their performance will make
items occurring toward the end of the test appear to be
easier than they are.
=Qualitative Item Analysis=
*Qualitative methods- are techniques of data generation and
analysis that rely primarily on verbal rather than
mathematical or statistical procedures, involving exploration
of the issues through verbal means such as interviews and
group discussions conducted with testtakers and other
relevant parties. Encouraging testtakers—on a group or
individual basis—to discuss aspects of their test-taking
experience is, in essence, eliciting or generating “data”
(words).
*Qualitative item analysis- is a general term for various
nonstatistical procedures designed to explore how individual
test items work. The analysis compares individual test items
to each other and to the test as a whole.
*“Think aloud” test administration- a qualitative research
tool designed to shed light on the testtaker’s thought
processes during the administration of a test. On a one-to-one
basis with an examiner, examinees are asked to take a
test, thinking aloud as they respond to each item.
*Expert panels- may also provide qualitative analyses of test
items.
*Sensitivity review- is a study of test items, typically
conducted during the test development process, in which
items are examined for fairness to all prospective testtakers
and for the presence of offensive language, stereotypes, or
situations.
*Cross-validation- refers to the revalidation of a test on a
sample of testtakers other than those on whom test
performance was originally found to be a valid predictor of
some criterion. A key step in the development of all tests—
brand-new or revised editions.
*Validity shrinkage- the decrease in item validities that
inevitably occurs after cross-validation of findings.
*Co-validation- may be defined as a test validation process
conducted on two or more tests using the same sample of
testtakers
*Co-norming- co-validation, when used in conjunction with
the creation of norms or the revision of existing norms.
*Anchor protocol- another mechanism for ensuring
consistency in scoring; a test protocol scored by a highly
authoritative scorer that is designed as a model for scoring
and a mechanism for resolving scoring discrepancies.
*Scoring drift- a discrepancy between the scoring in an
anchor protocol and the scoring of another protocol.
=The Use of IRT in Building and Revising Tests=
Three of the many possible applications of IRT in building and
revising tests include
(1) evaluating existing tests for the purpose of mapping test
revisions,
(2) determining measurement equivalence across testtaker
populations, and
(3) developing item banks
*item-characteristic curves (ICCs)- provide information about
the relationship between the performance of individual items
and the presumed underlying ability (or trait) level in the
testtaker
•Using IRT, test developers evaluate individual item
performance with reference to item-characteristic curves
(ICCs)
*Differential item functioning (DIF)- a phenomenon, wherein
an item functions differently in one group of testtakers as
compared to another group of testtakers known to have the
same (or similar) level of the underlying trait
*DIF analysis- test developers scrutinize group-by-group
item response curves, looking for what are termed DIF
items.
•It has even been used to explore differential item
functioning as a function of different patterns of guessing on
the part of members of different groups
*DIF items- are those items that respondents from
different groups at the same level of the underlying trait have
different probabilities of endorsing as a function of their
group membership.
•It has been used to evaluate measurement equivalence in
item content across groups that vary by culture, gender, and
age.
=Developing item banks=
•The final item bank will consist of a large set of items all
measuring a single domain (or, a single trait or ability). A test
developer may then use the banked items to create one or
more tests with a fixed number of items. For example, a
teacher may create two different versions of a math test in
order to minimize efforts by testtakers to cheat.
•When used within a CAT environment, a testtaker’s
response to an item may automatically trigger which item is
presented to the testtaker next. The software has been
programmed to present the item next that will be most
informative with regard to the testtaker’s standing on the
construct being measured. This programming is actually
based on near-instantaneous construction and analysis of
IRT information curves. The process continues until the
testing is terminated.
WRITING AND EVALUATING TEST ITEMS
Item Writing
1. Define clearly what you want to measure. To do this, use
substantive theory as a guide and try to make items as
specific as possible.
2. Generate an item pool. Theoretically, all items are
randomly chosen from a universe of item content. In
practice, however, care in selecting and developing items is
valuable. Avoid redundant items. In the initial phases, you
may want to write three or four items for each one that will
eventually be used on the test or scale.
3. Avoid exceptionally long items. Long items are often
confusing or misleading.
4. Keep the level of reading difficulty appropriate for those
who will complete the scale.
5. Avoid “double-barreled” items that convey two or more
ideas at the same time. For example, consider an item that
asks the respondent to agree or disagree with the statement,
“I vote Democratic because I support social programs.” There
are two different statements with which the person could
agree: “I vote Democratic” and “I support social programs.”
6. Consider mixing positively and negatively worded items.
Sometimes, respondents develop the “acquiescence
response set.” This means that the respondents will tend to
agree with most items. To avoid this bias, you can include
items that are worded in the opposite direction. For
example, in asking about depression, the CES-D (see Chapter
2) uses mostly negatively worded items (such as “I felt
depressed”). However, the CES-D also includes items worded
in the opposite direction (“I felt hopeful about the future”).
=Item Formats=
*Dichotomous format- offers two alternatives for each item.
Usually a point is given for the selection of one of the
alternatives.
The most common example of this format is the true-false
examination
*Polytomous format- (sometimes called polychotomous)
resembles the dichotomous format except that each item
has more than two alternatives; the multiple-choice format
is the most common example.
*Distractors- the incorrect choices in a multiple-choice item.
*Likert format- so called because it was used as part of
Likert’s (1932) method of attitude scale construction. A scale
using the Likert format consists of items such as “I am afraid
of heights”. Instead of asking for a yes-no reply, five
alternatives are offered: strongly disagree, disagree, neutral,
agree, and strongly agree. In some applications, six options
are used to avoid allowing the respondent to be neutral. The
six responses might be strongly disagree, moderately
disagree, mildly disagree, mildly agree, moderately agree,
and strongly agree.
*Category format- a technique that is similar to the Likert
format but that uses an even greater number of choices,
such as a scale from 1 to 10, with 1 as the lowest and 10 as
the highest.
*Visual analogue scale- popular for measuring self-rated
health. Using this method, the respondent is given a
100-millimeter line and asked to place a mark between two
well-defined endpoints.
*Adjective checklist- one format common in personality
measurement. The subject receives a long list of adjectives
and indicates whether each one is characteristic of himself or
herself. It can be used for describing either oneself or
someone else.
•The adjective checklist requires subjects either to endorse
such adjectives or not, thus allowing only two choices for
each item. A similar technique known as the Q-sort increases
the number of categories.
*Q-sort- can be used to describe oneself or to provide ratings
of others. A subject is given statements and asked to sort
them into nine piles. If a statement really hits home, the
subject places it in pile 9; statements that are not at all
descriptive are placed in pile 1.
=ITEM ANALYSIS=
*Item analysis- a general term for a set of methods used to
evaluate test items, is one of the most important aspects of
test construction.
*Item difficulty- the proportion of people who get a particular
item correct. (For example, if 84% of the people taking a
particular test get item 24 correct, then the difficulty level for
that item is .84.)
*Item discriminability- determines whether the people who
have done well on particular items have also done well on
the whole test (examine the relationship between
performance on particular items and performance on the
whole test)
=WAYS TO EVALUATE DISCRIMINABILITY=
1. The Extreme Group Method- compares people who have
done well with those who have done poorly on a test. For
example, you might find the students with test scores in the
top third and those in the bottom third of the class. You
would find the proportions of people in each group who got
each item correct.
•Discrimination index- the difference between the
proportions of people in each group who got the item
correct. For each item, subtract the proportion of correct
responses for the low group from the proportion of correct
responses for the high group. This gives the item
discrimination index (di).
•The closer the value of the index is to 1.0, the better the
item.
2. The Point Biserial Method- finding the correlation between
performance on the item and performance on the total test.
•The correlation between a dichotomous (two-category)
variable and a continuous variable.
•The point biserial correlation (rpbis) between an item and
the total test score is evaluated in much the same way as the
extreme group discriminability index. If this value is negative
or low, then the item should be eliminated from the test.
•If 90% of test takers get an item correct, then there is too
little variability in performance for there to be a substantial
correlation with the total test score. Similarly, if items are so
hard that they are answered correctly by 10% or fewer of the
test takers, then there is too little room to show a
correlation between the items and the total test score.
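A minimal sketch of the point biserial method using hypothetical data; the point biserial correlation is simply the Pearson correlation between a dichotomous (0/1) item and the continuous total score.

```python
import numpy as np

# Hypothetical data: one dichotomous item (0/1) and total test scores.
item = np.array([1, 0, 1, 1, 0, 1, 0, 1])
total = np.array([88, 54, 72, 90, 60, 81, 49, 77])

# Pearson correlation of a 0/1 variable with a continuous one = rpbis.
r_pbis = np.corrcoef(item, total)[0, 1]
print(r_pbis)  # positive and sizable -> the item discriminates well
```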
=Pictures of Item Characteristics=
*Item characteristic curve- a valuable way to learn about
items is to graph their characteristics.
•The total test score is plotted on the horizontal (X) axis and
the proportion of examinees who get the item correct is
plotted on the vertical (Y) axis.
•Drawing the item characteristic curve: using fewer class
intervals allows the curve to take on a smoother appearance.
•The total test score is used as an estimate of the amount of
a “trait” possessed by individuals.
•The relationship between performance on the item and
performance on the test gives some information about how
well the item is tapping the information we want
*Item Response Theory (IRT)- each item on a test has its own
item characteristic curve that describes the probability of
getting each particular item right or wrong given the ability
level of each test taker.
•It builds on traditional models of item analysis and can
provide information on item functioning, the value of specific
items, and the reliability of a scale.
•The computer can rapidly identify the specific items that are
required to assess a particular ability level.
=Items for Criterion-Referenced Tests=
•To evaluate the items in a criterion-referenced test, one
should give the test to two groups of students—one that has
been exposed to the learning unit and one that has not.
•The frequency polygon looks like a V. The scores on the left
side of the V are probably those from students who have not
experienced the unit; scores on the right represent those
who have been exposed to the unit.
•The bottom of the V is the antimode, or the least frequent
score. This point divides those who have been exposed to the
unit from those who have not and is usually taken as the
cutting score or point, or what marks the point of decision.
•When people get scores higher than the antimode, we
assume that they have met the objective of the test. When
they get lower scores, we assume they have not.
TOPIC 7: TEST ADMINISTRATION
=The Relationship Between Examiner and Test Taker=
•Both the behavior of the examiner and his or her
relationship to the test taker can affect test scores.
•Subtle cues given by the test administrator can affect the
level of performance expected by the examiner.
=The Race of the Tester=
•The examiner’s race has not been found to affect the results
of IQ tests, because the procedures for properly administering
an IQ test are so specific. Anyone who gives the test should do
so according to a strict procedure; in other words, well-trained
African American and white test administrators should act
almost identically.
=Stereotype Threat=
•For people who come from groups haunted by negative
stereotypes, there may be a second level of threat.
•Stereotype threat depletes working memory.
•Another explanation for the effects of stereotype threat is
“self-handicapping.” Test takers, when faced with the
expectation that they may not perform well, might reduce
their level of effort.
•This threat could be avoided by simply moving the questions
about age, race, and sex from the beginning of the test to the
end.
=Subject Variables=
•A final variable that may be a serious source of error is the
state of the subject.
•Motivation and anxiety can greatly affect test scores.
=Computer-Assisted Test Administration=
Here are some of the advantages that computers offer:
•excellence of standardization,
•individually tailored sequential administration,
• precision of timing responses,
•release of human testers for other duties,
•patience (test taker not rushed), and
•control of bias
=Language of Test Taker=
•The amount of linguistic demand can put non-English
speakers at a disadvantage. Even for tests that do not
require verbal responses, it is important to consider the
extent to which test instructions assume that the test taker
understands English.
=Training of Test Administrators=
•Different assessment procedures require different levels of
training.
*Expectancy Effects- often called Rosenthal effects.
•Standardized test administration procedures are necessary
for valid results. Extensive research in social psychology has
clearly demonstrated that situational factors can affect
scores on mental and behavioral tasks. These effects,
however, can be subtle and may not be observed in all
studies. For example, stereotype threat can have significantly
detrimental effects on test takers. Similarly, the examiner’s
rapport and expectancies may influence scores on some but
not all occasions. Direct reinforcement of specific responses
does have an acknowledged impact and therefore should not
be given in most testing situations. In response to these
problems, several remedies have been suggested. These
include standardization of the test instructions. The threat of
the testing situation might be reduced by maneuvers as
simple as asking demographic information at the end rather
than at the beginning of the test.
•Interest has increased in computer-assisted test
administration because it may reduce examiner bias.
Computers can administer and score most tests with great
precision and with minimum bias. This mode of test
administration has become the norm for many types of tests.
•The state of the subject also affects test scores. For
example, some students suffer from debilitating test
anxiety, which seriously interferes with performance.
TOPIC 8: INTERVIEWING TECHNIQUES
*Interview techniques- a web search will reveal a plethora of
sites offering advice on how to answer job interview
questions.
*Interview- involves the interaction of two or more people.
It is often the only or most important source of data. The
interview remains one of the most prevalent selection
devices for employment and is the chief method of collecting
data in clinical psychiatry.
•All interviews involve mutual interaction whereby the
participants are interdependent—that is, they influence each
other.
•A good interview is thematic; it does not jump from one
unrelated topic to another as it might if the interviewer asked
a series of set questions.
*Structured interview- the interviewer asks a specific set of
questions.
*Structured standardized interview- these questions are
printed.
*Unstructured interview- there are no specific questions or
guidelines for the interviewer to follow.
*Directive- the personnel officer directs, guides,
and controls the course of the interview.
*Nondirective- the clinical psychologist lets the client
determine the direction of the interview.
*Employment interview or selection interview- designed to
elicit information pertaining to a client’s qualifications and
capabilities for particular employment duties.
=Similarities Between an Interview and a Test=
•Method for gathering data
•Used to make predictions
•Evaluated in terms of reliability
•Evaluated in terms of validity
•Group or individual
•Structured or unstructured
*Social facilitation- we tend to act like the models around us
(e.g., when professional actors responded with anger to highly
trained, experienced interviewers, the interviewers became
angry themselves and showed anger toward the actors).
=Principles of Effective Interviewing=
*The Proper Attitudes- Good interviewing is actually more a
matter of attitude than skill. Attitudes related to good
interviewing skills include warmth, genuineness, acceptance,
understanding, openness, honesty, and fairness.
•To appear effective and establish rapport, the interviewer
must display the proper attitudes.
*Responses to Avoid- If the goal is to elicit as much
information as possible or to receive a good rating from the
interviewee, then interviewers should avoid certain
responses, including judgmental or evaluative statements,
probing statements, hostility, and false reassurance.
*Judgmental or evaluative statements- evaluating
the thoughts, feelings, or actions of another.
*Probing statements- demand more information than the
interviewee wishes to provide voluntarily. The most common
way to phrase a probing statement is to ask a question that
begins with “Why?”, which tends to place others on the
defensive.
•In probing we may induce the interviewee to reveal
something that he or she is not yet ready to reveal. The
interviewee will probably feel anxious and thus not well
disposed to revealing additional information.
•BUT, probes are sometimes appropriate and necessary. With
children or individuals with mental retardation, for instance,
one often needs to ask questions to elicit meaningful
information and get beyond a superficial interchange. Avoid
“Why?” statements and replace them with “Tell me” or
“How?” statements.
*Effective Responses- keep the interaction flowing. The
interview is a two-way process; one person speaks first, then
the other, and so on. The interviewer listens with interest by
maintaining face-to-face contact.
*Interpersonal influence- the degree to which one
person can influence another; it is related to interpersonal
attraction.
*Interpersonal attraction- the degree to which
people share a feeling of understanding, mutual respect,
similarity, and the like.
*Open-ended question- requires the interviewee to
produce something spontaneously, in contrast to a
closed-ended question, which can be answered with a brief,
specific response. Examples of open-ended questions
include, “Tell me a little bit about yourself,” “Tell me about
what interests you,” and “What is it that brings you here to
see me?”
*Closed-ended question- requires the interviewee to
recall something specific. It brings the interview to a dead
halt, thus violating the principle of keeping the interaction
flowing. (E.g., “Do you like sports?,” “Are you married?,” and
“How old are you?”)
*Responses to Keep the Interaction Flowing- use a
transitional phrase such as “Yes,” “And,” or “I see.” These
phrases imply that the interviewee should continue on the
same topic.
*Verbatim playback- the interviewer simply repeats
the interviewee’s last response. (E.g., in her interview with the
clinical psychologist, Maria stated, “I majored in history and
social studies.” A verbatim playback would be, “You majored
in history and social studies.”)
*Paraphrasing and restatement responses- repeat the
interviewee’s response using different words; interchangeable
with the interviewee’s response.
*Summarizing and clarification statements- pull
together the meaning of several responses and go just beyond
the interviewee’s response.
*Clarification response- clarifies the interviewee’s response.
*Empathy and understanding- communicates understanding.
•One good way to accomplish this involves what we call
understanding statements. To establish a positive
atmosphere, interviewers begin with an open-ended
question followed by understanding statements that capture
the meaning and feeling of the interviewee’s communication.
=Measuring Understanding=
*Level-one responses- bear little or no relationship to the
interviewee’s response. The two people are really talking
only to themselves.
*Level-two responses- communicate a superficial
awareness of the meaning of a statement. The individual
who makes a level-two response never quite goes beyond his
or her own limited perspective. Level-two responses impede
the flow of communication.
*Level-three responses- interchangeable with the
interviewee’s statement. The minimum level of responding
that can help the interviewee. Paraphrasing, verbatim
playback, clarification statements, and restatements are all
examples of level-three responses.
*Level-four responses- the interviewer adds “noticeably” to
the interviewee’s response.
*Level-five responses- the interviewer adds “significantly” to
it.
•Level-four and level-five responses not only provide
accurate empathy but also go beyond the statement given.
*Active listening- the foundation of good interviewing skills
for many different types of interviews.
*Mental status examination- an important tool in psychiatric
and neurological examinations, used primarily to diagnose
psychosis, brain damage, and other major mental health
problems. Its purpose is to evaluate a person suspected of
having neurological or emotional problems in terms of
variables known to be related to these problems.
=Developing Interviewing Skills=
*first step- in doing so is to become familiar with research
and theory on the interview in order to understand the
principles and underlying variables in the interview.
*second step- in learning such skills is supervised practice.
Experience truly is the best teacher. No amount of book
learning can compare with having one’s taped interview
analyzed by an expert.
*third step- one must make a conscious effort to apply the
principles involved in good interviewing, such as guidelines
for keeping the interaction flowing.
•The initial phase of learning any new skill seems to involve
attending to a hundred things at once—an impossible task.
=Sources of Error in the Interview=
*Interview Validity- limited by the extreme difficulty we have
in making accurate, logical observations and judgments.
*Halo effect- the tendency to judge specific traits on the
basis of a general impression. Interviewers form an
impression of the interviewee within the first minute or so
and spend the rest of the interview trying to confirm that
impression.
*General standoutishness- people tend to judge on
the basis of one outstanding characteristic; the tendency of
interviewers to make unwarranted inferences from personal
appearance.
*Interview Reliability- the critical questions about reliability
have centered on inter-interviewer agreement- agreement
between two or more interviewers
•The research suggests that a highly structured interview in
which specific questions are asked in a specific order can
produce highly stable results
•The internal consistency reliability for scores on highly
structured interviews was .79 where the interviewer was
gathering information about the interviewee’s experience,
.90 where interviewees responded to hypothetical dilemmas
they may experience on the job, and .86 where the
interviewer was gathering information about the
interviewees’ past behavior
TOPIC 9: THEORIES OF INTELLIGENCE AND ITS
MEASUREMENTS (SB-5, WAIS-IV)
*Intelligence- as a multifaceted capacity that manifests itself
in different ways across the life span. In general, intelligence
includes the abilities to:
•acquire and apply knowledge
•reason logically
•plan effectively
•infer perceptively
•make sound judgments and solve problems
•grasp and visualize concepts
•pay attention
•be intuitive
•find the right words and thoughts with facility
•cope with, adjust to, and make the most of new situations
=Perspectives on Intelligence=
*Interactionism- refers to the complex concept by which
heredity and environment are presumed to interact and
influence the development of one’s intelligence.
*Louis L. Thurstone- conceived of intelligence as composed
of what he termed primary mental abilities (PMAs).
•Thurstone developed and published the Primary Mental
Abilities test, which consisted of separate tests, each
designed to measure one PMA: verbal meaning, perceptual
speed, reasoning, number facility, rote memory, word
fluency, and spatial relations.
*factor-analytic theories- the focus is squarely on identifying
the ability or groups of abilities deemed to constitute
intelligence.
*information-processing theories- the focus is on
identifying the specific mental processes that constitute
intelligence.
*Factor-analytic theories of intelligence- theorists have used
factor analysis to study correlations between tests
measuring varied abilities presumed to reflect the underlying
attribute of intelligence.
*two-factor theory of intelligence- Spearman
formalized these observations into an influential theory of
general intelligence that postulated the existence of a
general intellectual ability factor (denoted by an italic
lowercase g) that is partially tapped by all other mental
abilities.
•The g represents the portion of the variance that all
intelligence tests have in common, with the remaining
portions of the variance being accounted for either by
specific components (s) or by error components (e).
•Tests that exhibited high positive correlations with other
intelligence tests were thought to be highly saturated with g,
whereas tests with low or moderate correlations with other
intelligence tests were viewed as possible measures of
specific factors (such as visual or motor ability).
•The greater the magnitude of g in a test of intelligence, the
better the test was thought to predict overall intelligence.
*group factors- an intermediate class of factors common to
a group of activities but not to all; neither as general as g
nor as specific as s. Examples of these broad group factors
include linguistic, mechanical, and arithmetical abilities.
=Multiple-factor models of intelligence=
*Guilford- sought to explain mental activities by
deemphasizing, if not eliminating, any reference to g. (In
Guilford’s perspective, there is no single underlying
intelligence for the different test items to reflect; this means
that there would be no basis for a large common factor.)
*Thurstone- initially conceived of intelligence as being
composed of seven “primary abilities.”
*Gardner- developed a theory of multiple (seven, actually)
intelligences:
(1) logical-mathematical,
(2) bodily-kinesthetic,
(3) linguistic,
(4) musical,
(5) spatial,
(6) interpersonal- the ability to understand other people:
what motivates them, how they work, how to work
cooperatively with them. Successful salespeople, politicians,
teachers, clinicians, and religious leaders are all likely to be
individuals with high degrees of interpersonal intelligence;
and
(7) intrapersonal- a correlative ability, turned inward. It is
the capacity to form an accurate, veridical model of oneself
and to be able to use that model to operate effectively in life.
*Cattell- the existence of two major types of cognitive
abilities: crystallized intelligence and fluid intelligence
*crystallized intelligence (symbolized Gc)- include
acquired skills and knowledge that are dependent on
exposure to a particular culture as well as on formal and
informal education (vocabulary, for example). Retrieval of
information and application of general knowledge are
conceived of as elements of crystallized intelligence.
*fluid intelligence (symbolized Gf )- are nonverbal,
relatively culture-free, and independent of specific
instruction (such as memory for digits).
*Horn- proposed the addition of several factors:
visual processing (Gv),
auditory processing (Ga),
quantitative processing (Gq),
speed of processing (Gs),
facility with reading and writing (Grw),
short-term memory (Gsm), and
long-term storage and retrieval (Glr)
*vulnerable abilities- such as (Gv), in that they
decline with age and tend not to return to preinjury levels
following brain damage
*maintained abilities- such as (Gq), they tend not to
decline with age and may return to preinjury levels following
brain damage.
*three-stratum theory of cognitive abilities- the top stratum
in Carroll’s model is g, or general intelligence.
•The second stratum is composed of eight abilities and
processes:
fluid intelligence (Gf ),
crystallized intelligence (Gc),
general memory and learning (Y),
broad visual perception (V),
broad auditory perception (U),
broad retrieval capacity (R),
broad cognitive speediness (S), and
processing/decision speed (T)
*Hierarchical model- meaning that all of the abilities listed in
a stratum are subsumed by or incorporated in the strata
above. (The three-stratum theory uses this structure.)
*McGrew-Flanagan CHC model- features ten “broad-stratum”
abilities and over seventy “narrow-stratum”
abilities, with each broad-stratum ability subsuming two or
more narrow-stratum abilities.
•g was not employed in their CHC model because it lacked
utility in psychoeducational evaluations.
•The ten broad-stratum abilities, with their “code names” in
parentheses, are labeled as follows:
fluid intelligence (Gf),
crystallized intelligence (Gc),
quantitative knowledge (Gq),
reading/writing ability (Grw),
short-term memory (Gsm),
visual processing (Gv),
auditory processing (Ga),
long-term storage and retrieval (Glr),
processing speed (Gs), and
decision/reaction time or speed (Gt).
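•To picture the "subsuming" relation concretely, here is a toy
Python sketch; the narrow-stratum entries are an illustrative
sample only, not the full CHC catalog:

    # Toy data structure: each broad-stratum ability subsumes two
    # or more narrow-stratum abilities (sample entries only).
    chc_model = {
        "Gf": ["induction", "general sequential reasoning"],
        "Gc": ["lexical knowledge", "general verbal information"],
        "Gsm": ["memory span", "working memory capacity"],
    }
    for broad, narrow in chc_model.items():
        print(broad, "subsumes:", ", ".join(narrow))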
*cross-battery assessment- assessment that employs tests
from different test batteries and entails interpretation of
data from specified subtests to provide a comprehensive
assessment.
*Thorndike- intelligence can be conceived in terms of three
clusters of ability: social intelligence (dealing with people),
concrete intelligence (dealing with objects), and abstract
intelligence (dealing with verbal and mathematical symbols).
•Thorndike also incorporated a general mental ability factor
(g) into the theory, defining it as the total number of
modifiable neural connections or “bonds” available in the
brain.
(A test built on this theory would look for one central factor
reflecting g along with three additional factors representing
social, concrete, and abstract intelligences; testtakers'
responses to specific items would reflect in part a general
intelligence but also the different types of intelligence:
social, concrete, and abstract.)
*Information-processing view- by Russian neuropsychologist
Aleksandr Luria. It focuses on the mechanisms by which
information is processed—how information is processed,
rather than what is processed.
•Two basic types of information-processing styles,
simultaneous and successive, have been distinguished
*simultaneous (or parallel) processing- information is
integrated all at one time (may be described as
“synthesized.” Information is integrated and synthesized at
once and as a whole.)
*successive (or sequential) processing- each bit of
information is individually processed in sequence (logical and
analytic in nature; piece by piece and one piece after the
other, information is arranged and rearranged so that it
makes sense.)
*PASS model of intellectual functioning- is an acronym for
planning, attention, simultaneous, and successive. In this
model,
*planning- refers to strategy development for
problem solving;
*attention (also referred to as arousal)- refers to
receptivity to information; and
*simultaneous and successive- refer to the type of
information processing employed.
=Measuring Intelligence=
*Some Tasks Used to Measure Intelligence- in infancy,
intellectual assessment consists primarily of measuring
sensorimotor development (e.g., measurement of nonverbal
motor responses such as turning over, lifting the head,
sitting up, following a moving object with the eyes, imitating
gestures, and reaching for a group of objects).
*Some Tests Used to Measure Intelligence- as reference
volumes such as Tests in Print attest, many different
intelligence tests exist.
*Stanford-Binet Intelligence Scales: Fifth Edition (SB5)- was
designed for administration to assessees as young as 2 and
as old as 85 (or older). The test yields a number of composite
scores, including a Full Scale IQ derived from the
administration of ten subtests.
•Subtest scores all have a mean of 10 and a standard
deviation of 3.
•All composite scores have a mean set at 100 and a standard
deviation of 15.
(based on the Cattell-Horn-Carroll (CHC) theory of
intellectual abilities.)
*alternate item- an item to be substituted for a
regular item under specified conditions (such as the situation
in which the examiner failed to properly administer the
regular item).
*test composite- formerly described as a deviation IQ score,
may be defined as a test score or index derived from the
combination of, and/or a mathematical transformation of,
one or more subtest scores.
•For the reliability of the SB5 Full Scale IQ with the norming
sample, an internal consistency reliability formula designed
for the sum of multiple tests was employed.
•The calculated coefficients for the SB5 Full Scale IQ were
consistently high (.97 to .98) across age groups, as was the
reliability for the Abbreviated Battery IQ (average of .91).
•Test-retest reliability coefficients reported in the manual
were also high. The test-retest interval was only 5 to 8
days, shorter by some 20 to 25 days than the interval
employed on other, comparable tests.
•Inter-scorer reliability coefficients reported in the SB5
Technical Manual ranged from .74 to .97 with an overall
median of .90.
•Items showing especially poor inter-scorer agreement
had been deleted during the test development process.
*routing test- may be defined as a task used to direct or
route the examinee to a particular level of questions. A
purpose of the routing test, then, is to direct an examinee
to test items that have a high probability of being at an
optimal level of difficulty. It contains
*teaching items- which are designed to illustrate
the task required and assure the examiner that the
examinee understands
*floor- refers to the lowest level of the items on a subtest
*ceiling- highest-level item of the subtest
*basal level- which is used to describe a subtest with
reference to a specific testtaker’s performance
*ceiling level- is said to have been reached and testing is
discontinued if and when examinees fail a certain number
of items in a row
*adaptive testing- or testing individually tailored to the
testtaker, might entail beginning a subtest with a question
in the middle range of difficulty. If the testtaker responds
correctly to the item, an item of greater difficulty is posed
next. If the testtaker responds incorrectly to the item, an
item of lesser difficulty is posed. Computerized adaptive
testing is in essence designed “to mimic automatically what
a wise examiner would do”
(Other terms used to refer to adaptive testing include
tailored testing, sequential testing, branched testing, and
response-contingent testing)
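•A minimal Python sketch of the branching logic just described
(all names and parameters are hypothetical; real adaptive tests
use calibrated item banks, not this toy rule):

    import random

    def adaptive_test(items, answer, stop_after_misses=3, max_items=20):
        idx = len(items) // 2               # begin in the middle range of difficulty
        misses_in_a_row = 0
        administered = []
        while 0 <= idx < len(items) and len(administered) < max_items:
            correct = answer(items[idx])
            administered.append((items[idx], correct))
            if correct:
                misses_in_a_row = 0
                idx += 1                    # pose an item of greater difficulty
            else:
                misses_in_a_row += 1
                if misses_in_a_row == stop_after_misses:
                    break                   # ceiling level reached: discontinue
                idx -= 1                    # pose an item of lesser difficulty
        return administered

    # Simulated examinee whose chance of success falls with difficulty
    rng = random.Random(0)
    items = list(range(10))                 # difficulties 0 (easiest) to 9
    print(adaptive_test(items, lambda d: rng.random() < 1 - d / 10))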
• Three other advantages of beginning an intelligence test
or subtest at an optimal level of difficulty are that
(1) it allows the test user to collect the maximum amount
of information in the minimum amount of time,
(2) it facilitates rapport, and
(3) it minimizes the potential for examinee fatigue from
being administered too many items.
*cutoff boundaries- composite scores are interpreted against
cutoff boundaries with corresponding nominal categories (the
source presents these in a table not reproduced in these notes).
*Wechsler tests- the first of these, the Wechsler-Bellevue
(W-B), measured something comparable to what other
intelligence tests measured. Still, the test suffered from
some problems:
(1) The standardization sample was rather restricted;
(2) some subtests lacked sufficient inter-item reliability;
(3) some of the subtests were made up of items that were
too easy; and
(4) the scoring criteria for certain items were too
ambiguous.
*Wechsler Adult Intelligence Scale (WAIS)- WAIS was
organized into Verbal and Performance scales. Scoring
yielded a Verbal IQ, a Performance IQ, and a Full Scale IQ.
*WAIS-IV- made up of subtests that are designated either
as core or supplemental.
*core subtest- is one that is administered to obtain
a composite score.
*supplemental subtest (sometimes referred to as
an optional subtest)- is used for purposes such as providing
additional clinical information or extending the number of
abilities or processes sampled
•supplemental subtest might be substituted for a core
subtest if:
■ the examiner incorrectly administered a core subtest
■ the assessee had been inappropriately exposed to the
subtest items prior to their administration
■ the assessee evidenced a physical limitation that affected
the assessee’s ability to effectively respond to the items of
a particular subtest
•changes in the WAIS-IV as compared to the previous
edition
■ enlargement of the images in the Picture Completion,
Symbol Search, and Coding subtests
■ the recommended nonadministration of certain
supplemental tests that tap short-term memory, hand-eye
coordination, and/or motor speed for testtakers above the
age of 69 (this to reduce testing time and to minimize
testtaker frustration)
■ an average reduction in overall test administration time
from 80 to 67 minutes (accomplished primarily by
shortening the number of items the testtaker must fail
before a subtest is discontinued)
=Group tests of intelligence or school ability test=
•Group intelligence tests in the schools are used in special
forms as early as the kindergarten level. The tests are
administered to groups of 10 to 15 children, each of whom
receives a test booklet that includes printed pictures and
diagrams.
*Short forms of intelligence tests- a short form is a test that
has been abbreviated in length, typically to reduce the time
needed for test administration, scoring, and interpretation
*Watkins - concluded that short forms may be used for
screening purposes only, not to make placement or
educational decisions
*Silverstein- provided an incisive review of the history of
short forms, focusing on four issues:
(1) how to abbreviate the original test;
(2) how to select subjects;
(3) how to estimate scores on the original test; and
(4) the criteria to apply when comparing the short form
with the original
*Ryan and Ward- advised that anytime a short form is
used, the score should be reported on the official record
with the
abbreviation “Est” next to it, indicating that the reported
value is only an estimate.
*Wechsler Abbreviated Scale of Intelligence (WASI)- was
designed to answer the need for a short instrument to
screen intellectual ability in testtakers from 6 to 89 years of
age. The test comes in a two-subtest form (consisting of
Vocabulary and Block Design) that takes about 15 minutes
to administer and a four-subtest form that takes about 30
minutes to administer.
•The four subtests (Vocabulary, Block Design, Similarities,
and Matrix Reasoning) are WISC- and WAIS-type subtests
that had high correlations with Full Scale IQ on those tests
and are thought to tap a wide range of cognitive abilities.
•yields measures of Verbal IQ, Performance IQ, and Full
Scale IQ. Consistent with many other intelligence tests, the
Full Scale IQ was set at 100 with a standard deviation of 15.
*WASI-II- a revision of the WASI
(The California Test of Mental Maturity, the Kuhlmann-Anderson
Intelligence Tests, the Henmon-Nelson Tests of Mental
Ability, and the Cognitive Abilities Test are group
intelligence tests available for use in school settings)
*Otis-Lennon School Ability Test, formerly the Otis Mental
Ability Test- the test is designed to measure abstract
thinking and reasoning ability and to assist in school
evaluation and placement decision-making.
•This nationally standardized test yields Verbal and
Nonverbal score indexes as well as an overall School Ability
Index (SAI)
*Army Alpha test- This test would be administered to
Army recruits who could read. It contained tasks such as
general information questions, analogies, and scrambled
sentences to reassemble
*Army Beta test- designed for administration to foreign-born recruits with poor knowledge of English or to illiterate
recruits (defined as “someone who could not read a
newspaper or write a letter home”). It contained tasks
such as mazes, coding, and picture completion (wherein the
examinee’s task was to draw in the missing element of the
picture)
*screening tool- as an instrument or procedure used to
identify a particular trait or constellation of traits at a gross
or imprecise level.
*Officer Qualifying Test- a 115-item multiple-choice test
used by the U.S. Navy as an admissions test to Officer
Candidate School
*Airman Qualifying Exam- a 200-item multiple-choice test
given to all U.S. Air Force volunteers
*Armed Services Vocational Aptitude Battery (ASVAB)- The
ASVAB is administered to prospective new recruits in all the
armed services. It is also made available to high-school
students and other young adults (a multiple aptitude test).
*Armed Forces Qualification Test (AFQT)- a measure of
general ability derived from the ASVAB and used in the
selection of recruits. In addition to the AFQT score, ten
aptitude areas are also tapped on the ASVAB, including
general technical, general mechanics, electrical,
motor-mechanics, science, combat operations, and
skill-technical.
(The AFQT score is based on a set of 100 selected items
included in the subtests of Arithmetic Reasoning, Numerical
Operations, Word Knowledge, and Paragraph Comprehension)
• group tests are useful screening tools when large
numbers of examinees must be evaluated either
simultaneously or within a limited time frame
=Other measures of intellectual abilities=
*cognitive style- is a psychological dimension that
characterizes the consistency with which one acquires and
processes information
Four terms common to many measures of creativity are:
*Originality- refers to the ability to produce something that
is innovative or nonobvious. It may be something abstract
like an idea or something tangible and visible like artwork
or a poem
*Fluency- refers to the ease with which responses are
reproduced and is usually measured by the total number of
responses produced. For example, an item in a test of word
fluency might be In the next thirty seconds, name as many
words as you can that begin with the letter w.
*Flexibility- refers to the variety of ideas presented and the
ability to shift from one approach to another
*Elaboration- refers to the richness of detail in a verbal
explanation or pictorial display.
*Guilford- drew a distinction between the intellectual
processes of convergent and divergent thinking
*Convergent thinking- is a deductive reasoning
process that entails recall and consideration of facts as well
as a series of logical judgments to narrow down solutions
and eventually arrive at one solution. (thought process
required in most achievement test)
*Divergent thinking- is a reasoning process in
which thought is free to move in many different directions,
making several solutions possible. Divergent thinking
requires flexibility of thought, originality, and imagination
•Guilford described several tasks designed to measure
creativity, such as Consequences ("Imagine what would
happen if . . .") and Unusual Uses (e.g., "Name as many uses
as you can think of for a rubber band").
*Structure-of-Intellect Abilities- are verbally oriented tasks
(such as Word Fluency) and nonverbally oriented tasks
(such as Sketches).
*Remote Associates Test (RAT)- presents the testtaker
with three words; the task is to find a fourth word
associated with the other three (Mednick)
*Torrance Tests of Creative Thinking- consist of word-based, picture-based, and sound-based test materials.
*psychoeducational batteries- test packages designed to
assess not only intelligence but also related abilities in
educational settings.
=Issues in the Assessment of Intelligence=
•Items on a test of intelligence tend to reflect the culture of
the society where the test is employed.
*culture-free intelligence test- was designed to separate
“natural intelligence from instruction” by “disregarding,
insofar as possible, the degree of instruction which the
subject possesses”
*Culture loading- may be defined as the extent to which a
test incorporates the vocabulary, concepts, traditions,
knowledge, and feelings associated with a particular culture
(the culture loading of a test tends to involve more of a
subjective, qualitative, nonnumerical judgment.)
*culture-fair intelligence test- as a test or assessment
process designed to minimize the influence of culture with
regard to various aspects of the evaluation procedures,
such as administration instructions, item content,
responses required of testtakers, and interpretations made
from the resulting data.
(Culture-fair tests have been found to lack the hallmark of
traditional tests of intelligence: predictive validity)
*culture-specific intelligence test- traditional intelligence
tests developed for members of a particular cultural group
or subculture, such tests were thought to be able to yield a
more valid measure of mental development
*Black Intelligence Test of Cultural Homogeneity- a
culture-specific intelligence test developed expressly for
use with African-Americans; a 100-item multiple-choice test
(the test was measuring a variable that could be
characterized as streetwiseness, “street smarts” or “street
efficacy”) (BITCH lacked predictive validity and provided
little useful, practical information)
*Flynn effect- a shorthand reference to the progressive rise
in intelligence test scores that is expected to occur on a
normed test of intelligence from the date when the test was
first normed (a less obvious source of systematic bias in
scores on intelligence tests)
*intelligence inflation- measured intelligence seems to rise
on average, year by year, starting with the year for which
the test is normed. The rise in measured IQ is not
accompanied by any academic dividend and so is not
thought to be due to any actual rise in “true intelligence.”
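•As rough arithmetic (the drift rate below is a commonly cited
estimate, roughly three points per decade, not a figure from
these notes):

    # Hypothetical arithmetic: if measured IQ drifts upward about
    # 0.3 points per year after a test is normed (assumed rate),
    # a test normed 20 years ago would overstate IQ by ~6 points.
    years_since_norming = 20
    drift_per_year = 0.3
    print(years_since_norming * drift_per_year)   # -> 6.0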
=INTELLIGENCE AND BINET SCALES=
*Alfred Binet- intelligence as “the tendency to take and
maintain a definite direction; the capacity to make
adaptations for the purpose of attaining a desired end, and
the power of autocriticism”
*Spearman- intelligence as the ability to educe either
relations or correlates
*Freeman- intelligence is “adjustment or adaptation of the
individual to his total environment,” “the ability to learn,”
and “the ability to carry on abstract thinking”
*Das- defined intelligence as “the ability to plan and
structure one’s behavior with an end in view”
*H. Gardner (1983) defined intelligence in terms of the
ability “to resolve genuine problems or difficulties as they
are encountered”
*Sternberg- defined intelligence in terms of “mental
activities involved in purposive adaptation to, shaping of,
and selection of real-world environments relevant to one’s
life”
*Anderson- intelligence is two-dimensional and based on
individual differences in information-processing speed and
executive functioning influenced largely by inhibitory
processes
*T. R. Taylor- identified three independent research
traditions that have been employed to study the nature of
human intelligence: the psychometric, the information
processing, and the cognitive approaches
*psychometric approach- examines the elemental
structure of a test. We examine the properties of a test
through an evaluation of its correlates and underlying
dimensions (oldest approach)
*information-processing approach- we examine
the processes that underlie how we learn and solve
problems
*cognitive tradition- focuses on how humans adapt
to real-world demands
*Binet test- a test that examines one's ability to define
words and identify numerical sequences certainly does not
meet the standards of all or even most definitions of
intelligence.
•properly used intelligence tests provide an objective
standard of competence and potential (Greisinger, 2003).
Critics charge that intelligence tests are biased
(McDermott, Watkins, & Rhoad, 2014), not only against
certain racial and economic groups (Jones, 2003), but also
that they are used by those in power to maintain the status
quo (Gould, 1981).
=Binet’s Principles of Test Construction=
*Binet- defined intelligence as the capacity (1) to find and
maintain a definite direction or purpose, (2) to make
necessary adaptations—that is, strategy adjustments—to
achieve that purpose, and (3) to engage in self-criticism so
that necessary adjustments in strategy can be made.
*Principle 1: Age Differentiation- refers to the simple fact
that one can differentiate older children from younger
children by the former’s greater capabilities. For example,
whereas most 9-year-olds can tell that a quarter is worth
more than a dime, a dime is worth more than a nickel, and
so on, most 4-year-olds cannot
*mental age- equivalent age capability
•one could determine the equivalent age capabilities of a
child independent of his or her chronological age. If a
6-year-old completed tasks that were appropriate for the
average 9-year-old, then the 6-year-old had demonstrated
capabilities equivalent to those of the average 9-year-old,
or a mental age of 9.
•A particular 5-year-old child might be able to complete
tasks that the average 8-year-old could complete. On the
other hand, another 5-year-old might not be capable of
completing even those tasks that the average 3-year-old
could complete.
*Principle 2: General Mental Ability- the test would measure
only the total product of the various separate and distinct
elements of intelligence.
•This freed Binet from the burden of identifying each element
or independent aspect of intelligence. He also was freed
from finding the relation of each element to the whole.
•He could judge the value of any particular task in terms of
its correlation with the combined result (total score) of all
other tasks.
•Tasks with low correlations could be eliminated, and tasks
with high correlations retained.
=Spearman’s Model of General Mental Ability=
*Spearman’s theory- intelligence consists of one general
factor (g) plus a large number of specific factors
•half of the variance in a set of diverse mental ability tests
is represented in the g factor
=Implications of General Mental Intelligence (g)=
•implies that a person’s intelligence can best be
represented by a single score, g, that presumably reflects
the shared variance underlying performance on a diverse
set of tests.
•True performance on any given individual task can be
attributed to g as well as to some specific or unique
variance
•However, if the set of tasks is large and broad enough, the
role of any given task can be reduced to a minimum
*Spearman's model of intelligence- According to the
model, intelligence can be viewed in terms of one general
underlying factor (g) and a large number of specific factors
(S1, S2, …, Sn). Thus, intelligence can be viewed in terms of
g (general mental ability) and S (specific factors).
Spearman's theory was consistent with Binet's approach to
constructing the first intelligence test.
•The general mental ability is what Spearman referred to as
psychometric g (or simply g).
*positive manifold- when a set of diverse ability tests are
administered to large unbiased samples of the population,
almost all of the correlations are positive. All tests, no
matter how diverse, are influenced by g.
*factor analysis- a statistical technique; a method for
reducing a set of variables or scores to a smaller number of
hypothetical variables called factors
•one can determine how much variance a set of tests or
scores has in common
•The g in a factor analysis of any set of mental ability tasks
can be represented in the first unrotated factor in a
principal components analysis (see the sketch below)
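•A minimal sketch of that last point (Python with numpy, an
assumed dependency; the subtest data are simulated so that a
common factor exists):

    import numpy as np

    # scores: rows = examinees, columns = diverse ability subtests
    rng = np.random.default_rng(0)
    g_true = rng.normal(size=500)                       # simulated common factor
    scores = np.column_stack([
        0.7 * g_true + rng.normal(scale=0.7, size=500)  # each subtest loads on g
        for _ in range(6)
    ])

    R = np.corrcoef(scores, rowvar=False)               # positive manifold expected
    eigvals, eigvecs = np.linalg.eigh(R)                # eigendecomposition of R
    order = np.argsort(eigvals)[::-1]
    first_pc_share = eigvals[order][0] / eigvals.sum()  # variance share of "g"
    loadings = eigvecs[:, order[0]]                     # first unrotated component
    print(round(first_pc_share, 2), loadings.round(2))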
*The gf-gc Theory of Intelligence- human intelligence can
best be conceptualized in terms of multiple intelligences
rather than a single score; there are two basic types of
intelligence:
*Fluid intelligence (f)- can best be thought of as those
abilities that allow us to reason, think, and acquire new
knowledge
*Crystallized intelligence (c)- by contrast, represents the
knowledge and understanding that we have acquired
•the theory thus distinguishes the abilities that allow us to
learn and acquire information (fluid) from the actual
learning that has occurred (crystallized)
=The Early Binet Scales=
*1905 Binet-Simon scale- the first major measure of
human intelligence. It was an individual intelligence test
consisting of 30 items presented in an increasing order of
difficulty.
•The collection of 30 tasks of increasing difficulty in the
Binet-Simon scale provided the first major measure of
human intelligence
three levels of intellectual deficiency:
*Idiot- described the most severe form of intellectual
impairment
*imbecile- moderate levels of impairment
*moron- the mildest level of impairment
•Binet believed that the ability to follow simple directions
and imitate simple gestures (item 6 on the 1905 scale) was
the upper limit of adult idiots.
•The ability to identify parts of the body or simple objects
(item 8) would rule out the most severe intellectual
impairment in an adult.
•The upper limit for adult imbeciles was item 16, which
required the subject to state the differences between two
common objects such as wood and glass.
(The 1905 Binet-Simon scale lacked an adequate measuring
unit to express results; it also lacked adequate normative
data and evidence to support its validity)
*The 1908 Scale- Binet and Simon retained the principle of
age differentiation
•Introduced two major concepts: the age scale format and
the concept of mental age.
*age scale- which means items were grouped according to
age level rather than simply one set of items of increasing
difficulty, as in the 1905 scale
•The age scale provided a model for innumerable tests still
used in educational and clinical settings
•Binet attempted to solve the problem of expressing the
results in adequate units. A subject's mental age was based
on his or her performance compared with the average
performance of individuals in a specific chronological age
group.
(The scale produced only one score, almost exclusively
related to verbal, language, and reading ability)
=Terman's Stanford-Binet Intelligence Scale=
*The 1916 Stanford-Binet Intelligence Scale- Terman relied
heavily on Binet's earlier work. The principles of age
differentiation, general mental ability, and the age scale
were retained. The mental age concept also was retained.
*intelligence quotient (IQ)- used a subject's mental age in
conjunction with his or her chronological age to obtain a
ratio score. This ratio score presumably reflected the
subject's rate of mental development.
•The first step is to determine the subject's actual or
chronological age. To obtain this, we need only know his or
her birthday.
•The second step: the subject's mental age is determined by
his or her score on the scale. Finally, to obtain the IQ, the
mental age (MA) is divided by the chronological age (CA)
and the result multiplied by 100 to eliminate fractions:
IQ = (MA/CA) × 100.
•When MA is less than CA, the IQ is below 100, and the
subject was said to have slower-than-average mental
development.
•When MA exceeded CA, the subject was said to have
faster-than-average mental development.
(The 1916 scale had a maximum possible mental age of
19.5 years; that is, if every group of items was passed, this
score would result. Given this limitation, anyone older than
19.5 would have an IQ of less than 100 even if all items
were passed. Because back in 1916 people believed that
mental age ceased to improve after 16 years of age, 16 was
used as the maximum chronological age.)
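•The ratio IQ computation can be illustrated in a few lines of
Python (the ages below are chosen arbitrarily):

    def ratio_iq(mental_age, chronological_age):
        # IQ = (MA / CA) x 100, per the 1916 formulation
        return round(mental_age / chronological_age * 100)

    print(ratio_iq(9, 6))   # 6-year-old with a mental age of 9 -> 150
    print(ratio_iq(3, 5))   # 5-year-old with a mental age of 3 -> 60
    print(ratio_iq(7, 7))   # MA equal to CA -> 100, average development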
*The 1937 Scale- extended the age range down to the
2-year-old level. Adding new tasks, developers increased the
maximum possible mental age to 22 years, 10 months.
•several performance items, which required the subject to
do things such as copy designs, were added to decrease the
scale's emphasis on verbal skills.
•the most important improvement in the 1937 version was
the inclusion of an alternate equivalent form: Forms L and
M were designed to be equivalent in terms of both difficulty
and content.
•instructions for scoring and test administration were
improved, and IQ tables were extended from age 16 to 18.
Perhaps most important, the problem of differential
variation in IQs was solved by the deviation IQ concept.
*deviation IQ- was simply a standard score with a
mean of 100 and a standard deviation of 16 (today the
standard deviation is set at 15)
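•A minimal sketch of the deviation IQ idea (Python; the
raw-score statistics are hypothetical, and real tests use
normed tables rather than this linear shortcut):

    def deviation_iq(raw, age_group_mean, age_group_sd, sd=16):
        # Standard score: where the testtaker stands within his or
        # her own age group, rescaled to a mean of 100.
        z = (raw - age_group_mean) / age_group_sd
        return round(100 + sd * z)

    print(deviation_iq(34, 30, 4))          # one SD above the mean -> 116
    print(deviation_iq(34, 30, 4, sd=15))   # today's SD of 15 -> 115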
=The Modern Binet Scale=
*Model for the Fourth and Fifth Editions of the Binet
Scale- these versions incorporate the gf-gc theory of
intelligence. They are based on a hierarchical model.
•At the top of the hierarchy is g (general intelligence),
which reflects the common variability of all tasks. At the
next level are three group factors.
*Crystallized abilities- reflect learning: the realization of
original potential through experience. Crystallized ability
has two subcategories: verbal reasoning and nonverbal
reasoning.
*Fluid-analytic abilities- represent original potential, or the
basic capabilities that a person uses to acquire crystallized
abilities
*Short-term memory- refers to one's memory during short
intervals: the amount of information one can retain briefly
after a single, short presentation
*Thurstone's Multidimensional Model- intelligence could
best be conceptualized as comprising independent factors,
or "primary mental abilities."
=Problems With the 1937 Scale=
•reliability coefficients were higher for older subjects than
for younger ones. Thus, results for the latter were not as
stable as those for the former.
*The 1960 Stanford-Binet Revision and Deviation IQ
(SB-LM)- tried to create a single instrument by selecting the
best from the two forms of the 1937 scale
•instead of viewing all specific abilities as being powered
by a g factor, some groups of abilities were seen as
independent.
=Characteristics of the 1986 Revision=
*Modern 2003 fifth edition- provided a standardized
hierarchical model with five factors
*The age range touted by the fifth edition spans from 2 to
85+ years of age.
(The purpose of the routing tests is to estimate the
examinee’s level of ability in order to guide the
examination process by estimating the level of ability at
which to begin testing for any given subject)
*start point- estimated level of ability
*basal- The level at which a minimum criterion number of
correct responses is obtained
*ceiling- which is a certain number of incorrect responses
that indicate the items are too difficult.
*Scaled scores- have a mean of 10 and a standard
deviation of 3.
*standard score- with a mean of 100 and a standard
deviation of 15 is computed for nonverbal IQ, verbal IQ,
full-scale IQ, and each of the five factors: fluid reasoning,
knowledge, quantitative reasoning, visual-spatial processing,
and working memory
(Nonverbal and verbal IQ scores are based on summing the
five nonverbal and five verbal subtests.
The full-scale IQ is based on all 10.
The standard scores for each of the five factors are based
on summing the nonverbal and corresponding verbal
subtest for each respective factor)
=The Wechsler Intelligence Scales=
•Wechsler's scales departed from the early Binet tests in
two respects:
(1) Wechsler's use of the point scale concept rather than
an age scale used in the early Binet tests and
(2) Wechsler's inclusion of a nonverbal performance scale.
•each of the 15 specific subtests was grouped into one of four
content areas or factors
•the nonverbal and verbal scales are equally weighted. The
test examination process begins with one of two “routing
measures” (subtests): one nonverbal, one verbal
*Routing tests- are organized in a point scale, which means
that each contains items of similar content and of
increasing difficulty. For example, the verbal routing test
consists of a set of vocabulary items of increasing difficulty.
• the examiner then goes to an age scale-based subtest at
the appropriate level for the examinee. In that way, items
that are too easy are skipped to save time and provide for
a more efficient examination.
*point scale- credits or points are assigned to each item. An
individual receives a specific amount of credit for each item
passed.
•allowed Wechsler to devise a test that permitted an
analysis of the individual’s ability in a variety of content
areas (e.g., judgment, vocabulary, and range of general
knowledge)
=The Performance Scale Concept=
•Wechsler included an entire scale that provided a measure
of nonverbal intelligence: a performance scale
*performance scale- consisted of tasks that require a
subject to do something (e.g., copy symbols or point to a
missing detail) rather than merely answer questions
(measure of nonverbal intelligence)
*verbal scale- provided a measure of verbal intelligence
*Similarities Subtest- consists of paired items of increasing
difficulty. The subject must identify the similarity between
the items in each pair
*Arithmetic Subtest- contains approximately 15 relatively
simple problems in increasing order of difficulty. The ninth
most difficult item is as easy as this: “A person with $28.00
spends $.50. How much does he have left?”
*Digit Span Subtest- requires the subject to repeat digits,
given at the rate of one per second, forward and backward
=Scales, Subtests, and Indexes=
* Wechsler- defined intelligence as the capacity to act
purposefully and to adapt to the environment. In his
words, intelligence is “the aggregate or global capacity of
the individual to act purposefully, to think rationally and to
deal effectively with his environment”
•Wechsler believed that intelligence comprised specific
elements that one could individually define and measure;
however, these elements were interrelated—that is, not
entirely independent
•implies that intelligence comprises several specific
interrelated functions or elements and that general
intelligence results from the interplay of these elements
*index- is created where two or more subtests are related
to a basic underlying skill
•On the WAIS-IV, the subtests are sorted into four
indexes: (1) verbal comprehension, (2) perceptual
reasoning, (3) working memory, and (4) processing speed.
*Information Subtest- items appear in order of increasing
difficulty. Item 6 asks something like, “Name two people
who have been generals in the U.S. Army” or “How many
members are there in the U.S. Congress?” Like all Wechsler
subtests, the information subtest involves both intellective
and nonintellective components, including the abilities to
comprehend instructions, follow directions, and provide a
response.
*Comprehension Subtest- measures judgment in everyday
practical situations, or common sense. It has three types of
questions.
•The first asks the subject what should be done in a given
situation, as in, “What should you do if you find an injured
person lying in the street?”
•The second type of question asks the subject to provide a
logical explanation for some rule or phenomenon, as in,
“Why do we bury the dead?” The third type asks the
subject to define proverbs such as, “A journey of 1000
miles begins with the first step.”
*Letter–Number Sequencing Subtest- used as a
supplement for additional information about the person’s
intellectual functioning. It is made up of items in which the
individual is asked to reorder lists of numbers and letters.
For example, Z, 3, B, 1, 2, A, would be reordered as 1, 2, 3,
A, B, Z. This subtest is related to working memory and
attention
*Digit Symbol–Coding Subtest- requires the subject to copy
symbols. It measures such factors as ability to learn an
unfamiliar task, visual-motor dexterity, degree of
persistence, and speed of performance
*Vocabulary Subtest- ability to define words is not only
one of the best single measures of intelligence but also the
most stable. Vocabulary tests appear on nearly every
individual test that involves verbal intelligence.
*Block Design Subtest- provides an excellent measure of
nonverbal concept formation, abstract thinking, and
neurocognitive impairment. The subject must arrange the
blocks to reproduce increasingly difficult designs.
•This subtest requires the subject to reason, analyze
spatial relationships, and integrate visual and motor
functions. The input information (i.e., pictures of designs) is
visual, but the response (output) is motor.
*Matrix Reasoning Subtest- a good measure of
information-processing and abstract-reasoning skills. It is a
core subtest in the perceptual reasoning index, included in
an effort to enhance the assessment of fluid intelligence,
which involves our ability to reason.
•In the matrix reasoning subtest, the subject is presented
with nonverbal, figural stimuli. The task is to identify a
pattern or relationship between the stimuli.
*WAIS-IV- follows a hierarchical model with general
intelligence (FSIQ) at the top. The index scores form the
next level, with the subtests providing the base.
*Symbol Search Subtest- It was added in recognition of the
role of speed of information processing in intelligence.
•the subject is shown two target geometric figures. The
task is then to search from among a set of five additional
search figures and determine whether the target appears
in the search group.
•the faster a subject performs this task, the faster his or her
information-processing speed will be.
=From Raw Scores to Scaled and Index Scale Scores=
•Each subtest produces a raw score—that is, a total
number of points—and has a different maximum total
•to compare scores on individual subtests, raw scores can
be converted to standard or scaled scores with a mean of
10 and a standard deviation of 3.
•Each of the four index scores was normalized to have a
mean of 100 and a standard deviation of 15
•The four composite index scales are then derived by
summing the core subtest scores
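•The linear logic behind these metrics can be sketched as
follows (Python; illustrative only, since the actual WAIS-IV
conversions come from normed lookup tables, as the
inferential-norming note below indicates):

    def scaled_score(raw, age_group_mean, age_group_sd):
        # Subtest metric: mean 10, standard deviation 3
        z = (raw - age_group_mean) / age_group_sd
        return round(10 + 3 * z)

    def index_score(sum_of_scaled, norm_mean, norm_sd):
        # Index metric: mean 100, standard deviation 15
        z = (sum_of_scaled - norm_mean) / norm_sd
        return round(100 + 15 * z)

    print(scaled_score(52, 45, 7))   # one SD above the mean -> 13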
*inferential norming- used in deriving a subtest scaled
score for the WAIS-IV. A variety of statistical indexes or
“moments,” such as means and standard deviations, were
calculated for each of the 13 age groups of the stratified
normative sample
*Working memory- refers to the information that we
actively hold in our minds, in contrast to our stored
knowledge, or long-term memory (one of the most
important innovations on the modern WAIS)
*FSIQ- represents a measure of general intelligence and
follows the same principles as the index scores.
•It is obtained by summing the age-corrected scaled scores
of all four index composites. Again, a deviation IQ with a
mean of 100 and a standard deviation of 15 is obtained.
=Psychometric Properties of the Wechsler Adult Scale=
*Standardization- the sample consisted of a stratified
sample of 2200 adults divided into 13 age groups from 16
years, 0 months through 90 years, 11 months, as well as 13
specialty groups
*Reliability- When the split-half method is used for all
subtests except speeded tests (digit symbol–coding and
symbol search), the typical average coefficients across age
levels are .98 for the FSIQ, .96 for the verbal
comprehension index, .95 for the perceptual reasoning
index, .94 for the working memory index, and .90 for the
processing speed index
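•A minimal sketch of the split-half method with the
Spearman-Brown step-up (Python with numpy, an assumed
dependency; the item matrix is fabricated for illustration):

    import numpy as np

    def split_half_reliability(item_scores):
        # item_scores: rows = examinees, columns = item scores
        odd = item_scores[:, 0::2].sum(axis=1)    # half 1: odd items
        even = item_scores[:, 1::2].sum(axis=1)   # half 2: even items
        r_half = np.corrcoef(odd, even)[0, 1]     # half-test correlation
        return 2 * r_half / (1 + r_half)          # Spearman-Brown correction

    rng = np.random.default_rng(0)
    ability = rng.normal(size=(300, 1))
    items = (ability + rng.normal(size=(300, 20)) > 0).astype(float)
    print(round(split_half_reliability(items), 2))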
*Validity- The validity of the WAIS-IV rests heavily on its
correlation with earlier versions of the test. However, the
Wechsler tests are considered among the most valid in the
world today for measuring IQ.
*Wechsler adult scale- is extensively used as a measure of
adult intelligence. This scale is well constructed and its
primary measures (the four index components and full-scale IQ, or FSIQ) are highly reliable.
48
=Downward Extensions of the WAIS-IV: The WISC-V and
the WPPSI-IV=
*WISC-V- measures intelligence from ages 6 through 16
years, 11 months; a 21st-century test.
•can be administered and scored using two coordinated
iPads, one for the examiner and one for the subject being
tested. According to the test authors, administration is
faster and more efficient. The scores can then be forwarded
for interpretation and even report generation to a
Web-based platform called "Q-global scoring and reporting."
•The test relies heavily on speed of response, based on
findings that faster responding is associated with higher
ability for most tasks.
•At the top of the hierarchy is FSIQ. Next are the five
indexes or primary scores.
*The five indexes- are called (1) verbal comprehension,
(2) visual spatial, (3) fluid reasoning (the ability to abstract),
(4) working memory, and (5) processing speed.
•Each index is associated with at least two subtest scores.
To enhance assessment, there are five ancillary scales,
each based on two or more subtests. These are called
quantitative reasoning, auditory working memory,
nonverbal, general ability, and cognitive processing.
•Finally, there are three “complementary” scales, called
naming speed, symbol translation, and storage and
retrieval.
(To provide insight into academic performance, various
groups were targeted and carefully studied. These included
groups with various specific learning disabilities,
attention-deficit/hyperactivity disorder (ADHD), traumatic
brain injury, and autism spectrum disorders)
*WPPSI- a scale for children 4 to 6 years of age. It is based
on the familiar hierarchical model. General mental ability,
or g, is at the top and reflected in the full-scale IQ.
•Then, there are three group factors, represented by index
or primary scores: (1) verbal comprehension, (2) visual
spatial, and (3) working memory.
•Finally, each of the indexes is composed of two or more
subtest scores.
*WPPSI-IV- measures intelligence in children from 2.5 to 7
years, 7 months. It is more flexible than its predecessors
and gives the test user the option of using more or fewer
subtests depending on how complete an evaluation is
needed and how young the child is
(compatible with measures of adaptive functioning and
achievement)