Psychological Testing

Fields of Study: Algebra, Data Analysis and Probability, Representations
Testing is used for many different purposes within psychology, among them to evaluate
intelligence, diagnose psychiatric illness, and identify aptitudes and interests. Although the
results of testing are rarely used as the sole criterion to make a diagnosis or other decision about
an individual, they are often used in conjunction with information gained from other sources, such
as interviews and observations of behavior. There are many types of psychological tests, but most
share the goal of expressing an essentially unobservable quality, such as intelligence or anxiety, in
terms of numbers. The numbers themselves are not meant to be taken literally: no one seriously
believes that a person’s intelligence is equivalent to their IQ score, for instance. Instead, the
numbers are useful tools that help evaluate a person’s situation: for instance, how does the
intellectual development of one particular child relate to that of other children of the same age? Of
course, the results of psychological testing should be evaluated with the social context of the
individual in mind and with full respect for human diversity.
Psychometrics is a field of study which applies mathematical and statistical principles to devise
new psychological tests and evaluate the properties of current tests. The two most common
approaches to psychometrics today are classical test theory and item response theory (IRT).
Classical test theory is the older of the two approaches, and the calculations required can be
performed with a pencil and paper, although today computer software is often used. Classical test
theory assumes that all measurements are imperfect and thus contain error: the goal is to evaluate
the amount of error in a measurement and develop ways to minimize it. Any observed
measurement (for instance, a child’s score on an intelligence test) is made up of two components:
true score and error. This may be written as an equation:
X = T + E
where X is the observed score, T is the true score (the score representing the child’s true
intelligence), and E is the error component (resulting from imperfect testing). Classical test theory
assumes that error is random and thus will sometimes be positive (resulting in a higher
observed score than true score) and sometimes negative (resulting in a lower observed score than
true score), so that over an infinite number of testing occasions the mean of the observed scores
will equal the true score. Although normally a test is administered only once to a given
individual, this is a useful model that facilitates evaluation of the reliability and validity of
different tests.
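The idea that an observed score X is the sum of a true score T and a random error E can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions: the true score of 100, the error standard deviation of 5, and the choice of normally distributed error are all hypothetical (classical test theory requires only that the error be random with a mean of zero), and the function name is invented for this example.

```python
import random
import statistics


def simulate_observed_scores(true_score, error_sd, n_occasions, seed=0):
    """Return n_occasions observed scores, each equal to true score plus random error.

    Assumption for illustration: the error is drawn from a normal
    distribution with mean 0 and standard deviation error_sd.
    """
    rng = random.Random(seed)
    return [true_score + rng.gauss(0.0, error_sd) for _ in range(n_occasions)]


# Hypothetical values: a true score of 100 measured with an error SD of 5.
true_score = 100.0
for n in (1, 10, 1000, 100000):
    scores = simulate_observed_scores(true_score, error_sd=5.0, n_occasions=n)
    # The mean of the observed scores should drift toward the true score as n grows.
    print(n, round(statistics.fmean(scores), 2))
```

With a single testing occasion the observed score may miss the true score by several points, while the mean over many simulated occasions should lie very close to it.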
Item response theory (IRT) is a different approach to psychological testing, which assumes that
observed performance on any given test item can be explained by a latent (unobservable) trait or
ability, so that individuals may be evaluated in terms of the amount of that trait they possess and
items may be evaluated in terms of the amount of the trait required to answer them positively.
For an item on an intelligence test (intelligence being the latent trait), persons with higher
intelligence should be more likely to answer the question correctly. The same principle applies to
IRT-based tests evaluating other psychological characteristics: for instance, if an item in a
psychological screening test is meant to diagnose depression, a person with more depressive
symptoms should be more likely to answer it positively. IRT is a mathematically complex
method of analysis that depends on the use of specialized computer software, but it has become a
popular means to evaluate psychological tests as computers have become more affordable.
Although the mathematical models of IRT differ from those of classical test theory, the goals are
the same: to devise tests that measure characteristics of individuals with a minimum of error.
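The relationship between a latent trait and the chance of a positive response can be sketched with a logistic item response function. The text above does not name a specific model, so the two-parameter logistic form used here is an assumption chosen for illustration, and the names theta (the person's trait level), difficulty, and discrimination, like the numeric values, are hypothetical.

```python
import math


def prob_positive(theta, difficulty, discrimination=1.0):
    """Probability of a positive (for example, correct) response to one item.

    Assumed model (not taken from the article): a two-parameter logistic
    curve, P = 1 / (1 + exp(-discrimination * (theta - difficulty))).
    """
    return 1.0 / (1.0 + math.exp(-discrimination * (theta - difficulty)))


# Hypothetical trait levels and item difficulties on an arbitrary scale.
for theta in (-2.0, 0.0, 2.0):
    easy = prob_positive(theta, difficulty=-1.0)
    hard = prob_positive(theta, difficulty=1.0)
    print(f"theta={theta:+.1f}  easy item: {easy:.2f}  hard item: {hard:.2f}")
```

Under this assumed model, a person with more of the trait is always more likely to respond positively, and an item that requires more of the trait yields a lower probability for the same person.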
Reliability and Validity
Reliability refers to the consistency of a test score: if a test is reliable, it will yield consistent
results over time and regardless of irrelevant conditions such as the person administering
the test. Internal consistency is considered an aspect of reliability: it means that all the items in a
test measure the same thing. Temporal reliability is also called test-retest reliability because it is
typically evaluated by having groups of individuals take the same test on several occasions and
seeing how their scores compare: some differences are expected due to the random nature of the
error component, but there should be a strong relationship between the observed scores of
individuals on multiple occasions.
Inter-rater reliability refers to the consistency of a test or scale regardless of who administers it.
For instance, psychiatric conditions are often evaluated by having an observer rate an
individual’s behavior using a scale, and the results for different observers evaluating the same
individual at the same time should be similar: three psychologists using a scale to
evaluate the same child for hyperactivity should reach similar conclusions. Temporal and
inter-rater reliability are typically evaluated by correlating test results on different occasions
(temporal) or the scores returned by different raters (inter-rater).
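Correlating two sets of scores, as described above, can be sketched with the Pearson correlation coefficient. The text does not say which correlation coefficient is used, so Pearson's r is an assumption here, and the scores for six people on two occasions are invented purely for illustration. The same function could be applied to ratings from two raters.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences of scores."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical scores for the same six people tested on two occasions.
first_occasion = [98, 105, 110, 87, 120, 101]
second_occasion = [101, 103, 112, 90, 117, 99]

# A value near 1 indicates a strong relationship between the two occasions.
print(round(pearson_r(first_occasion, second_occasion), 3))
```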
Internal consistency can be measured in several ways. The split-half method involves having a
group of individuals take a test, then splitting the items into two groups (for instance, odd-numbered
items in one group and even-numbered items in the other) and calculating the correlation between the
total scores of the two groups. Cronbach’s alpha (coefficient alpha) is a refinement of the
split-half method: it is the mean of all possible split-half coefficients.
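Both measures of internal consistency can be sketched for a small table of item responses. The data below (five people answering four items) are invented for illustration, and the function names are hypothetical. The split-half value is the plain correlation between the odd-item and even-item totals, as described above. Coefficient alpha is computed with the commonly used variance-based formula, alpha = k/(k-1) * (1 - sum of item variances / variance of total scores), rather than by averaging split-half coefficients directly; that choice is an assumption of this sketch.

```python
import math
import statistics


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences of scores."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def split_half_r(responses):
    """Correlate each person's odd-item total with their even-item total."""
    odd_totals = [sum(row[0::2]) for row in responses]   # items 1, 3, 5, ...
    even_totals = [sum(row[1::2]) for row in responses]  # items 2, 4, 6, ...
    return pearson_r(odd_totals, even_totals)


def cronbach_alpha(responses):
    """Coefficient alpha from item variances and the variance of total scores."""
    k = len(responses[0])  # number of items
    item_vars = [statistics.variance(col) for col in zip(*responses)]
    total_var = statistics.variance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


# Hypothetical data: each row is one person, each column one item (rated 1 to 5).
responses = [
    [4, 5, 4, 5],
    [2, 3, 2, 2],
    [3, 3, 4, 3],
    [5, 4, 5, 5],
    [1, 2, 1, 2],
]

print("split-half r:", round(split_half_r(responses), 3))
print("alpha:", round(cronbach_alpha(responses), 3))
```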
Validity refers to whether a test measures what it claims to be measuring. Three types of validity
are typically discussed: content, predictive, and construct. Content validity refers to whether the
test includes a reasonable sample of the subject or quality (for instance, mathematical aptitude or
quality of life) it is intended to measure; it is usually established by having a panel of experts
evaluate the test in relation to its purpose. Predictive validity means that test scores correlate
highly with measures of related outcomes in the future: for instance, scores on a test of mechanical
aptitude should correlate with a new hire’s success working as an auto repairman. Construct validity
refers to a pattern of correlations predicted by the theory behind the quantity being measured:
the scores on a test should correlate highly with scores on other tests that measure similar
qualities and less highly with those that measure different qualities.
SEE ALSO: Diagnostic Testing; Educational Testing; Intelligence Quotients.
Embretson, Susan E. and Steven P. Reise. Item Response Theory for Psychologists. Mahwah, NJ:
Erlbaum, 2000.
Furr, R. Michael and Verne R. Bacharach. Psychometrics: An Introduction. Thousand Oaks, CA:
Sage Publications, 2007.
Gopaul-McNicol, Sharon-Ann and Eleanor Armour-Thomas. Assessment and Culture:
Psychological Tests with Minority Populations. Burlington, MA: Elsevier, 2001.
Kline, Paul. The Handbook of Psychological Testing. New York: Routledge, 2000.
Wood, James M., Howard N. Garb, and M. Teresa Nezworski. “Psychometrics: Better
Measurement Makes Better Clinicians,” in The Great Ideas of Clinical Science: 17
Principles That Every Mental Health Professional Should Understand, eds. Scott O.
Lilienfeld and William T. O’Donohue. New York: Routledge, 2007.
Sarah Boslaugh, Ph.D.
Washington University School of Medicine