Dr. Dave Flora's Talk on Psychometric Theory

Crash Course in
Psychometric Theory
David B. Flora
SP Area Brownbag
February 8, 2010

Research in social and personality psychology is
about abstract concepts of theoretical importance,
called “constructs.”

Examples include “prejudice,” “self-esteem,”
“introversion,” “forgiveness,” and on and on…

The success of a research study depends on how
well constructs of interest are measured.

The field of “Test Theory” or “Psychometrics” is
concerned with the theory and accompanying
research methods for the measurement of
psychological constructs.

Psychometric theory evolved from the
tradition of intelligence, or “mental ability”,
testing.

Spearman (1904) invented factor analysis to
aid in the measurement of intelligence.

The psychophysics tradition is also
foundational to psychometric theory, as per
Thurstone’s (1928) law of comparative
judgment for scaling of social stimuli.

A test question is a stimulus; the answer to
the question is a behavioural response to the
stimulus.
Classical True Score Model
xi = ti + ei
xi is the observed value for person i from an
operationalization of a construct (e.g., a test
score).
ti is that person’s true score on the construct.
ei is measurement error.
The variable t is a latent variable:
An unobservable variable that is measured by
the observable variable x.
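
As a concrete illustration (not from the talk; the variance values are arbitrary assumptions), a brief Python simulation shows the model and the resulting variance decomposition Var(x) = Var(t) + Var(e):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000  # hypothetical sample of persons

# Assumed values (illustrative only): Var(t) = 9, Var(e) = 4
t = rng.normal(loc=50, scale=3, size=n)  # true scores
e = rng.normal(loc=0, scale=2, size=n)   # measurement error
x = t + e                                # observed scores

# Observed variance decomposes as Var(x) = Var(t) + Var(e)
print(x.var(), t.var() + e.var())        # both ~13
```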

Lord & Novick’s (1968) preferred definition of the
true score (paraphrased):
For a given person, there is a “propensity”
distribution of possible outcomes of a
measurement that reflects the operation of
processes such as momentary fluctuations in
memory and attention or in strength of an
attitude. The person’s true score is the mean of
this propensity distribution.
Lord, F.M., & Novick, M.R. (1968). Statistical theories of mental test scores.
Validity
xi = ti + ei
or
ti = xi − ei

Validity denotes the scientific utility of the scores, x,
obtained with a measuring instrument (i.e., a test).

But there is more to it than just the size of ei.

Validity is mostly concerned with whether x
measures the t that we want it to…

Note that validity is a property of the scores
obtained from a test, not the test itself.
Nunnally & Bernstein (1994), Psychometric Theory
(3rd ed.), p. 84:
“Validation always requires empirical
investigations, with the nature of the measure
and form of validity dictating the needed form of
[empirical] evidence.”
“Validation usually is a matter of degree rather
than an all-or-none property, and validation is an
unending process.”
“Strictly speaking, one validates the use to which a
measuring instrument is put rather than the
instrument itself. Tests are often valid for one
purpose but not another.”
You may have heard of…

Internal validity
External validity
Face validity
Content validity
Construct validity
Criterion validity
Predictive validity
Postdictive validity
Concurrent validity
Factorial validity
Convergent validity
Discriminant validity
Incremental validity
Ecological validity
Standards

Standards for Educational and Psychological
Testing (1966; 1974; 1985; 1999) has been developed
jointly by AERA, APA, and NCME.

The Standards view validity as a unitary
concept.

Rather than there being separate types of
validity, there are three main types of validity
evidence.
1. Content-related evidence
2. Construct-related evidence
3. Criterion-related evidence
Content-related validity evidence

Content validity refers to the extent to which a
set of items (or stimuli) adequately reflects a
content domain.

E.g., selection of vocabulary words for Grade 6
vocabulary test from the domain of all words
taught to 6th graders.

Evidence is based on theoretical judgment.

Is this the same as face validity?
- E.g., a self-report judgment of overall health
Construct-related validity evidence

Cronbach, L.J., & Meehl, P.E. (1955). Construct validity
in psychological tests.

Mainly concerned with associations between test scores
and other variables that are dictated by theory.

Multi-trait multi-method correlation matrix (Campbell &
Fiske, 1959):
Is the test strongly correlated with other measures of the
same construct? (convergent validity)
Is the test less strongly correlated with measures of
different constructs than with measures of the same
construct? (discriminant validity)
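
As an illustrative sketch (the traits, methods, and error levels are invented for demonstration), simulated data show the expected MTMM pattern, with the convergent correlation clearly exceeding the discriminant one:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 10_000
trait1, trait2 = rng.normal(0, 1, (2, n))  # two distinct constructs

# Two methods per trait (e.g., self-report and peer report), plus noise
t1_self = trait1 + rng.normal(0, 0.8, n)
t1_peer = trait1 + rng.normal(0, 0.8, n)
t2_self = trait2 + rng.normal(0, 0.8, n)

convergent = np.corrcoef(t1_self, t1_peer)[0, 1]    # same trait, different methods
discriminant = np.corrcoef(t1_self, t2_self)[0, 1]  # different traits, same method
print(convergent, discriminant)  # convergent should clearly exceed discriminant
```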
Floyd & Widaman (1995), p. 287:

“Construct validity is supported if the factor
structure of the [instrument] is consistent with
the constructs the instrument purports to
measure.”

“If the factor analysis fails to detect underlying
constructs [i.e., factors] that explain sufficient
variance in the [items] or if the constructs
detected are inconsistent with expectations, the
construct validity of the scale is compromised.”
Floyd, F. J., & Widaman, K. F. (1995). Factor analysis in the development and
refinement of clinical assessment instruments. Psychological Assessment,
7, 286-299.
Criterion-related validity evidence

Evidence is based on empirical association with
some important “gold standard” criterion.

Encompasses predictive and concurrent validity.

Difficult to distinguish from construct validity
- Theoretical reason for association is critical for
construct validity, less important for criterion
validity.

E.g., relationship between a stress measure and
physical health?
Do we really need your new scale?
Does it have incremental validity?
“Incremental validity is defined as the degree to
which a measure explains or predicts a
phenomenon of interest, relative to other
measures. Incremental validity can be evaluated
on several dimensions, such as sensitivity to
change, diagnostic efficacy, content validity,
treatment design and outcome, and convergent
validity.”
Haynes, S. N., & Lench, H. (2003). Incremental validity of new clinical
assessment measures. Psychological Assessment, 15, 456-466.
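
A hedged sketch of one common way incremental validity is examined in practice, via the gain in R² when the new scale joins a regression that already contains an established measure (the data-generating model here is invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 5_000

# Hypothetical data: an established measure, a new scale, and an outcome
old = rng.normal(0, 1, n)
new = 0.5 * old + rng.normal(0, 1, n)          # new scale overlaps with old
y = 0.6 * old + 0.3 * new + rng.normal(0, 1, n)

def r_squared(X, y):
    X = np.column_stack([np.ones(len(y)), X])  # add intercept
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - resid.var() / y.var()

r2_old = r_squared(old, y)
r2_both = r_squared(np.column_stack([old, new]), y)
print(r2_both - r2_old)  # incremental R^2 attributable to the new scale
```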
Reliability

Reliability is necessary, but not sufficient, for
construct validity.

Lack of reliability (i.e., measurement error)
introduces bias in analyses and reduces
statistical power.

What exactly is reliability?
xi = ti + ei
Reliability = Var(ti) / Var(xi)
Reliability is the proportion of true score
variance to total observed variance.

Since we can’t directly observe Var(ti) , we must
turn to other methods for estimating reliability…

Parallel-forms reliability
Split-half reliability
Internal consistency reliability (coefficient alpha)
Test-retest reliability
Inter-rater reliability

Each is an estimate of the proportion of true score
variability to total variability.
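
For instance, under the classical model the correlation between two parallel measurements of the same true score recovers this variance ratio; a minimal simulation sketch (variances assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
t = rng.normal(0, 3, n)               # true scores, Var(t) = 9
x1 = t + rng.normal(0, 2, n)          # form / occasion 1, Var(e) = 4
x2 = t + rng.normal(0, 2, n)          # parallel form / occasion 2

true_reliability = 9 / (9 + 4)        # Var(t) / Var(x) ≈ 0.69
estimate = np.corrcoef(x1, x2)[0, 1]  # parallel-forms (or retest) correlation
print(true_reliability, estimate)     # both ≈ 0.69
```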
Coefficient alpha (α)

Original formula actually given by Guttman
(1945), not Cronbach (1951)!

An average of all inter-item correlations,
weighted by the number of items, k:

α = k r̄ / (1 + (k − 1) r̄)

where r̄ is the average inter-item correlation.

The expected correlation of one test with an
alternate form containing the same number of
items.
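
As a minimal sketch (simulated congeneric items; the sample size and item model are arbitrary), both the standardized formula above and the usual covariance-based formula can be computed directly:

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 100_000, 5
f = rng.normal(0, 1, n)                          # common factor
items = np.array([f + rng.normal(0, 1, n) for _ in range(k)]).T

# Standardized alpha from the average inter-item correlation r-bar
R = np.corrcoef(items, rowvar=False)
r_bar = R[np.triu_indices(k, 1)].mean()
alpha_std = k * r_bar / (1 + (k - 1) * r_bar)

# Raw alpha from item covariances (Cronbach's usual formula)
C = np.cov(items, rowvar=False)
alpha_raw = k / (k - 1) * (1 - np.trace(C) / C.sum())
print(alpha_std, alpha_raw)  # both ≈ 0.83 here
```

The two versions agree closely here because the simulated items have equal variances.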
Coefficient alpha (α)

The more items, the larger α.

A high α does NOT imply unidimensionality (i.e.,
that items all measure a single factor).

α is a lower-bound estimate of true reliability…
How does factor analysis fit in?
“Common factor model” for a “congeneric” set of
items measuring a single construct:

xij = λj fi + uij

xij is person i’s score on the jth item of a multi-item test.
fi is the common factor score on the factor, or
latent variable, for person i.
λj is the factor loading of test item j.
uij is the unique factor score for item j for person i.
It represents a mixture of systematic influence
and random error influence on item x:

uij = sij + eij

If we define tij = λj fi and assume that the systematic
unique influence is negligible, so that uij ≈ eij…

…then the common factor model gives the Classical
True Score model for scores on item j:

xij = λj fi + uij
xij = tij + eij

Coefficient α will underestimate reliability to the
extent that the factor loadings, λj, vary across items.

More accurate reliability estimates can be calculated
using the factor loadings.
- Perspective shifts from internal consistency to a
latent variable relationship
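
One common loading-based estimate is McDonald’s omega; the sketch below (with hypothetical loadings, not values from the talk) shows α falling slightly below ω when loadings vary:

```python
import numpy as np

# Hypothetical standardized loadings that vary across items
lam = np.array([0.8, 0.7, 0.6, 0.5, 0.4])
uniq = 1 - lam**2                      # unique variances (standardized items)

# McDonald's omega: reliability of the unit-weighted sum score
omega = lam.sum()**2 / (lam.sum()**2 + uniq.sum())

# Standardized alpha from the implied inter-item correlations
k = len(lam)
R = np.outer(lam, lam)
r_bar = R[np.triu_indices(k, 1)].mean()
alpha = k * r_bar / (1 + (k - 1) * r_bar)

print(omega, alpha)                    # ≈ 0.74 vs ≈ 0.73: alpha is the lower bound
```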
Tangential things you should know…

Principal components analysis (PCA) is NOT factor
analysis. When you run a PCA, you are NOT estimating
the common factor model.

Situations where PCA is appropriate are quite rare in
social and personality psychology.

The Pearson product-moment correlation is often NOT
adequate for describing the relationships among
item-level categorical variables!

When factor analyzing items, we should usually use
something other than product-moment correlations.

One approach is to analyze polychoric correlations.
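
To see why, a small simulation sketch (latent correlation and threshold chosen arbitrarily): dichotomizing bivariate-normal item responses attenuates the Pearson correlation relative to the latent correlation that a polychoric correlation is designed to recover:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200_000

# Two latent item responses correlated 0.6 (bivariate normal)
cov = [[1.0, 0.6], [0.6, 1.0]]
z = rng.multivariate_normal([0, 0], cov, size=n)

# Observed binary items: 1 if the latent response exceeds a threshold
y = (z > 0.5).astype(int)

latent_r = np.corrcoef(z[:, 0], z[:, 1])[0, 1]    # ≈ 0.60
observed_r = np.corrcoef(y[:, 0], y[:, 1])[0, 1]  # phi coefficient, noticeably lower
print(latent_r, observed_r)
# A polychoric correlation estimates the latent 0.6 from the binary table.
```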
Modern Psychometric Theory

Another approach that properly models item-level
variables as categorical is Item Response
Theory (IRT).

IRT represents a collection of models for relating
individual items within a test or scale to the
latent variable(s) they measure.

IRT leads to test scores with smaller
measurement error than traditional item sums or
means.
IRT

The properties of each item are summarized
with an item characteristic curve (ICC).

The slope of the curve indicates item
discrimination, i.e., the strength of relationship
between the item and the latent construct.

The horizontal location of the curve indicates
item difficulty or severity.
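
As a concrete sketch, one standard ICC form is the two-parameter logistic (2PL) model, P(θ) = 1 / (1 + exp(−a(θ − b))); the parameter values below are arbitrary:

```python
import numpy as np

def icc_2pl(theta, a, b):
    """Two-parameter logistic ICC: P(endorse | theta).
    a = discrimination (slope), b = difficulty (location)."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

theta = np.linspace(-3, 3, 7)
print(icc_2pl(theta, a=1.5, b=0.0))  # steep item centered at theta = 0
print(icc_2pl(theta, a=0.5, b=1.0))  # flatter item, higher difficulty
```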
[Figure: Item characteristic curves (ICCs) for four binary
items with equal discrimination but varying “difficulty.”
X-axis, “theta,” represents the latent trait or construct.
Y-axis represents the probability of a positive item response.]

[Figure: Item characteristic curves (ICCs) for four binary
items with varying discrimination and varying difficulty.
Items 1 and 2 have stronger discrimination than 3 and 4.
Item 1 has the lowest difficulty, item 4 the highest.]
A “test information function” shows the precision of
measurement as a function of latent trait level.
IRT scores

Scale scores constructed using IRT
- take into account item discrimination, whereas
simple sum (or mean) scores assume all items
measure the construct equally well
- have a proper interval scale of measurement,
whereas simple sum scores are typically ordinal,
strictly speaking
- have measurement error that varies across the
range of the construct, whereas simple sum
scores assume a single reliability value for the
whole range
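
One common way such scores are computed is expected a posteriori (EAP) estimation; a minimal sketch assuming a 2PL model and a standard normal prior (item parameters invented for illustration):

```python
import numpy as np

def eap_score(responses, a, b, n_quad=61):
    """Expected a posteriori (EAP) trait estimate for binary responses
    under a 2PL model with a standard normal prior on theta."""
    theta = np.linspace(-4, 4, n_quad)  # quadrature grid
    p = 1 / (1 + np.exp(-a[:, None] * (theta - b[:, None])))
    # Likelihood of the observed response pattern at each grid point
    like = np.prod(np.where(np.array(responses)[:, None] == 1, p, 1 - p), axis=0)
    post = like * np.exp(-theta**2 / 2)  # likelihood times normal prior
    post /= post.sum()
    return (theta * post).sum()          # posterior mean

a = np.array([1.5, 1.5, 0.8, 0.8])
b = np.array([-1.0, 0.0, 1.0, 2.0])
print(eap_score([1, 1, 0, 0], a, b))  # moderate trait level
print(eap_score([1, 1, 1, 1], a, b))  # higher trait level
```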
The big picture

IRT was often presented as an alternative approach to
test theory at odds with classical test theory (CTT).

Current perspective is that CTT and IRT complement
and enhance each other.
- For example, the mathematical link between IRT and
factor analysis is now well understood.

A well validated test will still produce scores with
measurement error.

Ideas from CTT, IRT, and structural equation modeling
can be implemented to produce powerful results that
account for measurement error, thus modeling
relationships among the constructs themselves rather
than the operational variables.