Designing a New Scale/Questionnaire:
Optimal Psychometric Practice
Alex M. Wood, PhD
Senior Lecturer in Psychology
School of Psychological Sciences, University of Manchester, England.
Aims:
• To describe the steps used in the construction of a new questionnaire/scale
• Psychometrics is the branch of psychology concerned with the measurement of
individual differences. It is also used in many other fields (e.g., economics and medicine).
• Accurate measurement of individual differences is vital for the scientific credibility
of the discipline and its research.
Essential reading (all essential, but perhaps start with Worthington):
Clark, L. A., & Watson, D. (1995). Constructing validity: Basic issues in objective scale
development. Psychological Assessment, 7, 309-319.
Hunsley, J., & Meyer, G. J. (2003). The incremental validity of psychological testing and
assessment: Conceptual, methodological, and statistical issues. Psychological Assessment, 15,
446-455.
Smith, G. T., Fischer, S., & Fister, S. M. (2003). Incremental validity principles in test
construction. Psychological Assessment, 15, 467-477.
Smith, G. T., & McCarthy, D. M. (1995). Methodological considerations in the refinement of
clinical assessment instruments. Psychological Assessment, 7, 300-308.
Worthington, R. L., & Whittaker, T. A. (2006). Scale development research: A content analysis
and recommendations for best practices. Counseling Psychologist, 34, 806-838.
Book: Coaley, K. (2010). An introduction to psychological assessment and
psychometrics. London: Sage Publishing.
Psychometric development involves:
1. Developing a clear rationale for the need to develop the scale
2. Developing a clear definition of the construct and a representative item
pool
3. Identifying the scale’s structure and selecting items based on factor
analysis
4. Testing the structure with confirmatory factor analysis
5. Testing internal consistency (reliability)
6. Testing temporal stability (reliability)
7. Showing face validity (validity)
8. Testing criterion validity (validity)
9. Testing predictive validity (validity)
10. Testing discriminant validity (validity)
11. Testing incremental validity (validity)
Step 1: Develop a strong rationale for the need for the scale
•50 years of personality psychology has developed scales for pretty
much everything; your scale is probably not needed!
•There is an increasing view in the field that there are enough scales, and that
people should use them rather than develop more! (Gratitude example)
•It may not be immediately clear that there is a scale to suit a need, but a
full search will show that one probably exists, perhaps under a weird
name, from a strange theoretical position, or in a paper that does not
cite/is not cited by other papers
•People who develop new scales normally claim that they measure a
new trait or conception, hence much of psychology is in a mess, with
a huge number of almost certainly synonymous traits and no theory to
integrate them.
•Represent the entire continuum of the construct (not just the positive
or negative aspects).
•Question your motives for making a new scale. The best work in
psychology integrates existing perspectives and research – this is
where real progress is made.
Step 2: Develop a representative item pool (~100 items, at 10 to 30 per expected factor)
•This forms the operational definition of the construct, so it is essential to
represent the universe of the construct
•Could be developed through qualitative research (sleep,
genetic counselling examples)
•Could use pre-existing exhaustive lists of potential items
(Lexical Hypothesis example)
•Could include items from other scales (dodgy – what item
pools did they use?)
•Could be designed to map onto a pre-existing theoretical
conception (dodgy – would bias results, both by overdefining
and missing out parts of the construct)
•Could simply choose items that make sense to the
researchers (unacceptable)
Step 3: Perform factor analysis on the item pool, and select items
•Used to find the number of underlying factors
•Used to select the most representative items
•All decisions are critical here, once you’ve chosen your items there is no turning
back!
•Participants should be representative of who is going to use the scale in the
future. Ideally, have multiple groups (e.g., community, clinical), conduct all
analyses below separately for each group, and base decisions on a balance of the
findings between the samples (which should be largely consistent anyway).
•Should be exploratory factor analysis (maximum likelihood with oblique [oblimin]
rotation)
•Should use parallel analysis to determine the number of factors
•Extract the correct number of factors. Be careful of artefactual factors defined
solely by positively or solely by negatively worded items.
•Choose the highest-loading items on each factor
•Decide on how many items to have per factor. Difficult decision, shorter scales
are more widely used but perhaps don’t fully represent the construct. Could base
on Cronbach’s alpha, but issues of “bloated specifics”
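As an illustrative aside (not from the lecture), the parallel-analysis decision above can be sketched in Python. Everything here is invented for demonstration: the function name, the simulated two-factor item data, and the number of simulations; NumPy is assumed to be available. The idea is to retain only those eigenvalues of the observed correlation matrix that exceed the mean eigenvalues from random data of the same shape.

```python
# Minimal sketch of parallel analysis (Horn's method) for choosing
# the number of factors; all names and data are illustrative only.
import numpy as np

def parallel_analysis(data, n_sims=100, seed=0):
    """Count eigenvalues of the observed correlation matrix that exceed
    the mean eigenvalues from random normal data of the same shape."""
    rng = np.random.default_rng(seed)
    n, p = data.shape
    obs_eig = np.sort(np.linalg.eigvalsh(np.corrcoef(data, rowvar=False)))[::-1]
    rand_eigs = np.empty((n_sims, p))
    for i in range(n_sims):
        sim = rng.standard_normal((n, p))
        rand_eigs[i] = np.sort(np.linalg.eigvalsh(np.corrcoef(sim, rowvar=False)))[::-1]
    return int(np.sum(obs_eig > rand_eigs.mean(axis=0)))

# Fabricated item pool: three items load on each of two latent factors,
# so parallel analysis should suggest retaining two factors.
rng = np.random.default_rng(42)
f1, f2 = rng.standard_normal((2, 500))
items = np.column_stack([f1 + 0.5 * rng.standard_normal(500) for _ in range(3)] +
                        [f2 + 0.5 * rng.standard_normal(500) for _ in range(3)])
print(parallel_analysis(items))
```

In practice one would use an established routine rather than this sketch, but the comparison against random-data eigenvalues is the core of the method.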
Step 4: Perform confirmatory factor analysis
•Differs from exploratory factor analysis in that it tests the plausibility of a
particular factor structure
•In many senses weaker – the fit of other factor structures may be equally
valid
•You should (a) test the expected factor structure, (b) compare it to other
factor structures, (c) perform multi-group comparisons
The next steps test reliability and validity
• Reliability refers to how consistent the scores on the test are
over time or across equivalent versions of the test
– Reliability refers to how well a test measures true and
systematic variation in a subject rather than error, bias or
random variation.
• Validity refers to how well the test measures what it is
supposed to. This requires independent criteria on which to
base the validity of the test score
Step 5: Test internal consistency
Asks whether the items are highly inter-correlated
•Important, as the relationship between the scale and other variables will be
attenuated as a function of the unreliability of the scale due to low-correlating
items
Historically used split-half reliability
•Split the test to get equivalent halves
– Odd and even items
– Resultant reliability is for only half the test
– The longer the test, the more reliable it will be
•Now use Cronbach’s alpha (made practical by better computing)
- Effectively the average of the split-half reliabilities over every possible
split of the items (with some adjustment). Gives a value between 0 and 1. Less than .60
very poor, .60 to .70 poor, .70 to .80 good, more than .80 excellent
•Correlations between the scale and external variables can be corrected for
alpha (this can be theoretically important (gratitude and appreciation example), but
can be dodgy, as it excuses a poor scale)
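The alpha calculation above is simple enough to sketch from scratch. This is a minimal, self-contained illustration; the four respondents and three items are fabricated, and a real analysis would use a full sample and standard software.

```python
# Minimal sketch of Cronbach's alpha: k/(k-1) * (1 - sum of item
# variances / variance of total scores). Data are invented.
from statistics import variance

def cronbach_alpha(items):
    """items: one list of scores per item, all over the same respondents."""
    k = len(items)
    total_scores = [sum(person) for person in zip(*items)]
    item_var_sum = sum(variance(item) for item in items)
    return (k / (k - 1)) * (1 - item_var_sum / variance(total_scores))

# Three items scored by four respondents (columns = respondents)
item_scores = [[2, 3, 4, 5],
               [1, 3, 3, 5],
               [2, 2, 5, 4]]
print(round(cronbach_alpha(item_scores), 3))  # ~.89: "excellent" by the rule of thumb
```

Note that highly inter-correlated items inflate alpha, which is exactly the "bloated specifics" worry from Step 3: near-duplicate items buy reliability at the cost of construct coverage.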
Step 6: Test for temporal stability (test-retest reliability)
•If the construct has trait like properties then it should be stable over time,
BUT tests should also be sensitive to real change
•Show that the test is stable over time for most people (use all
methods).
•Method 1. Give the same test to the same people at two time points.
The interval should be short enough to preclude genuine change, but long
enough for participants not to simply remember their answers.
•Ideally, use two different time intervals with different groups (e.g., 2
weeks and 4 weeks, or 4 weeks and 3 months)
•Method 2. Show that the mean of the group does not significantly
change over time (e.g., that everyone’s score does not drift upwards on
re-taking). Be careful of issues of power.
•Method 3. Show that the scale DOES change when expected to. Can
be demonstrated longitudinally (e.g., therapy), or experimentally (e.g.,
social desirability scales).
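Methods 1 and 2 above reduce to simple statistics: a correlation between the two time points, and a check that the group mean has not drifted. A minimal sketch, with fabricated scores for six respondents:

```python
# Test-retest sketch: Method 1 is the time1-time2 correlation,
# Method 2 checks the group mean is stable. Data are invented.
from math import sqrt

def pearson_r(x, y):
    """Plain Pearson correlation between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

time1 = [10, 14, 9, 20, 16, 12]   # scores at first administration
time2 = [11, 13, 10, 19, 17, 11]  # same people, two weeks later

print(round(pearson_r(time1, time2), 3))          # Method 1: high r = stable ranking
mean_change = sum(time2) / len(time2) - sum(time1) / len(time1)
print(round(mean_change, 3))                      # Method 2: mean drift near zero
```

With real data the sample would need to be far larger, and (per the slide) the analysis repeated at two different intervals.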
Step 7: Show face validity
•Refers to what the test appears to measure
•Affects the acceptability of the test and its effectiveness in practice.
•Should have already been shown through the item pool creation.
•Might perform further qualitative work with the final items (genetic
counselling example)
•Can’t be substituted for objective validity
Step 8: Show criterion validity
•Refers to the scale correlating with what it is meant to correlate with
•If there are existing scales, there should be high correlations (but then why are
you making the new scale? Shorter? Better? – in which case correlations
may be lower)
•What is a high correlation? Conventionally, scales measuring the
same construct should correlate higher than r = .80. Cohen (1988,
1992) defines correlation effect sizes as .10 small, .30 medium, .50 large. These
are rules of thumb and can’t apply to every situation. Unique, causal, multiply
determined, or objective relationships will typically be smaller.
•Should correlate with theoretically related constructs.
•Peer-correlations. Self-report should correlate with peer-ratings. But what
is a high correlation? There are issues of trait visibility, a huge literature on
judgements of others (David Funder and others), and halo bias.
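The correction for attenuation mentioned in Step 5 is often applied when judging these criterion correlations: an observed correlation is divided by the square root of the product of the two scales' reliabilities. A one-line sketch with invented numbers:

```python
# Correction for attenuation: r_true = r_observed / sqrt(alpha_x * alpha_y).
# The reliabilities and observed correlation below are invented.
from math import sqrt

def disattenuate(r_observed, alpha_x, alpha_y):
    """Estimate the correlation between the underlying constructs,
    removing the downward bias from measurement unreliability."""
    return r_observed / sqrt(alpha_x * alpha_y)

# An observed r of .50 between scales with alphas of .80 and .70
print(round(disattenuate(0.50, 0.80, 0.70), 3))  # -> noticeably larger than .50
```

This illustrates the "dodgy" side too: the weaker the scales, the larger the corrected estimate, so the correction can flatter a poor instrument.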
Step 9: Show predictive validity
•Similar to criterion validity, but differs in that predictive validity concerns
predicting a future outcome
• Could be a behaviour. For example, extroversion should predict how
much a person talks in a small group task, and sensation seeking should
predict sensation-seeking behaviour. Again there are issues of what counts
as a high correlation (Epstein's work on behavioural prediction).
•Could be a longitudinal change in functioning (Emmons’ work on goal
striving)
Step 10: Show discriminant [sic] validity
•Show that the scale is NOT correlated with what it is not theoretically
intended to correlate with.
•Issues of power.
•Issues of whether the other measures are actually any good.
•Examples
•Social desirability, but what do the scales actually measure?
•Mood inductions. But should the scale be affected? Is the induction any good
(show that it changes other measures, but not yours)? Is this actually
evidence of lack of sensitivity to change (must be used in combination
with convergent validity)?
•Often determined theoretically (e.g., PANAS).
Step 11: Show incremental validity
•You must show that your scale can predict some variables after
controlling for other existing scales (re-demonstrate criterion or predictive
validity after controlling for other scales)
•Very important, given the duplication of scales in psychology research
•Critical that you choose the right scales to control (huge potential for
bias)
•Commonly use the Big Five (but a poor choice, due to its hierarchical organization).
•Best to conduct a tough test where you select the most similar other
scales, but where theory suggests that what you’re measuring may provide
additional prediction (and watch your results disappear…)
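Incremental validity is usually tested as a hierarchical regression: fit the outcome on the existing scale(s), add the new scale, and report the change in R². The sketch below fabricates all data (an established scale, a new scale that partly overlaps with it, and an outcome) and assumes NumPy is available; in practice one would also report a significance test for the R² change.

```python
# Hierarchical-regression sketch of incremental validity: does the new
# scale raise R-squared once the existing scale is controlled for?
# All data are simulated for illustration.
import numpy as np

def r_squared(X, y):
    """R^2 from an ordinary least squares fit with an intercept."""
    X = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))

rng = np.random.default_rng(1)
existing = rng.standard_normal(200)                           # established scale
new_scale = 0.6 * existing + 0.8 * rng.standard_normal(200)   # overlaps with it
outcome = existing + 0.5 * new_scale + rng.standard_normal(200)

r2_old = r_squared(existing[:, None], outcome)                      # step 1
r2_both = r_squared(np.column_stack([existing, new_scale]), outcome)  # step 2
print(round(r2_both - r2_old, 3))  # delta R^2: the new scale's unique contribution
```

If the two scales were truly synonymous, the delta R² would hover near zero, which is exactly the "watch your results disappear" outcome the slide warns about.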
Summary
Psychometric development involves:
1. Developing a clear rationale for the need to develop the scale
2. Developing a clear definition of the construct and a representative item
pool
3. Identifying the scale’s structure and selecting items based on factor
analysis
4. Testing the structure with confirmatory factor analysis
5. Testing internal consistency (reliability)
6. Testing temporal stability (reliability)
7. Showing face validity (validity)
8. Testing criterion validity (validity)
9. Testing predictive validity (validity)
10. Testing discriminant validity (validity)
11. Testing incremental validity (validity)