Senior Lecturer in Psychology
School of Psychological Sciences, University of Manchester, England.
Aims:
• To describe the steps used in the construction of a new questionnaire/scale
• Psychometrics is the branch of psychology concerned with the measurement of individual differences. It is also used in many other fields (e.g., economics and medicine).
• Accurate measurement of individual differences is vital for the scientific credibility of the discipline and its research.
Essential reading (all essential, but perhaps start with Worthington):
Clark, L. A., & Watson, D. (1995). Constructing validity: Basic issues in objective scale development. Psychological Assessment, 7, 309-319.
Hunsley, J., & Meyer, G. J. (2003). The incremental validity of psychological testing and assessment: Conceptual, methodological, and statistical issues. Psychological Assessment, 15, 446-455.
Smith, G. T., Fischer, S., & Fister, S. M. (2003). Incremental validity principles in test construction. Psychological Assessment, 15, 467-477.
Smith, G. T., & McCarthy, D. M. (1995). Methodological considerations in the refinement of clinical assessment instruments. Psychological Assessment, 7, 300-308.
Worthington, R. L., & Whittaker, T. A. (2006). Scale development research: A content analysis and recommendations for best practices. Counseling Psychologist, 34, 806-838.
Book: Coaley, K. (2010). An introduction to psychological assessment and psychometrics. London: Sage.
Psychometric development involves:
1. Developing a clear rationale for the need to develop the scale
2. Developing a clear definition of the construct and a representative item pool
3. Identifying the scale’s structure and selecting items based on factor analysis
4. Testing the structure with confirmatory factor analysis
5. Testing internal consistency (reliability)
6. Testing temporal stability (reliability)
7. Showing face validity (validity)
8. Testing criterion validity (validity)
9. Testing predictive validity (validity)
10. Testing discriminant validity (validity)
11. Testing incremental validity (validity)
Step 1: Develop a strong rationale for the need for the scale
•50 years of personality psychology have produced scales for pretty much everything, so your scale is probably not needed!
•There is an increasing view in the field that there are enough scales, and that people should use them rather than develop more! (Gratitude example)
•It may not be immediately clear that a scale exists to suit your need, but a full search will show that one probably exists, perhaps under an unusual name, from an unfamiliar theoretical position, or not citing/cited by related papers
•People who develop new scales normally pretend that they measure a new trait or conception; hence most of psychology is in a mess, with a huge number of almost certainly synonymous traits and no theory to integrate them.
•Represent the entire continuum of the construct (not just the positive or negative aspects).
•Question your motives for making a new scale. The best work in psychology integrates existing perspectives and research – this is where real progress is made.
Step 2: Develop a representative item pool (~100 items, 10-30 per expected factor)
•The item pool is the operational definition of the construct; it is essential that it represents the full universe of the construct
•Could be developed through qualitative research (sleep, genetic counselling examples)
•Could use pre-existing exhaustive lists of potential items (Lexical Hypothesis example)
•Could include items from other scales (dodgy – what item pools did they use?)
•Could be designed to map onto a pre-existing theoretical conception (dodgy – would bias results, both by over-defining the construct and by missing parts of it)
•Could simply choose items that make sense to the researchers (unacceptable)
Step 3: Perform factor analysis on the item pool, and select items
•Used both to find the number of underlying factors and to select the most representative items
•All decisions are critical here, once you’ve chosen your items there is no turning back!
•Participants should be representative of who will use the scale in the future. Ideally, have multiple groups (e.g., community, clinical), conduct all the analyses below separately for each group, and base decisions on a balance of the findings across the samples (which should be largely consistent anyway).
•Should be exploratory factor analysis (maximum likelihood with oblique [oblimin] rotation)
•Should use parallel analysis to determine the number of factors
•Extract the correct number of factors. Be careful of factors defined solely by positively or negatively worded items.
•Choose the highest-loading items on each factor
•Decide how many items to have per factor. This is a difficult decision: shorter scales are more widely used but perhaps don’t fully represent the construct. The decision could be based on Cronbach’s alpha, but beware “bloated specifics” (near-duplicate items that inflate alpha)
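The parallel-analysis decision rule can be sketched in a few lines of NumPy: keep factors whose observed eigenvalues exceed those obtained from random data of the same shape. This is a minimal illustration with simulated (hypothetical) item data, not the lecture's own analysis:

```python
import numpy as np

def parallel_analysis(data, n_iter=100, percentile=95, seed=0):
    """Horn's parallel analysis: retain factors whose observed eigenvalues
    exceed the chosen percentile of eigenvalues from random data."""
    rng = np.random.default_rng(seed)
    n, p = data.shape
    # Eigenvalues of the observed correlation matrix, largest first
    obs = np.linalg.eigvalsh(np.corrcoef(data, rowvar=False))[::-1]
    rand = np.empty((n_iter, p))
    for i in range(n_iter):
        noise = rng.standard_normal((n, p))
        rand[i] = np.linalg.eigvalsh(np.corrcoef(noise, rowvar=False))[::-1]
    threshold = np.percentile(rand, percentile, axis=0)
    keep = obs > threshold
    # Count factors up to the first eigenvalue that fails the test
    return p if keep.all() else int(np.argmax(~keep))

# Hypothetical demonstration: six items driven by a single latent factor
rng = np.random.default_rng(42)
latent = rng.standard_normal((500, 1))
items = 0.8 * latent + 0.6 * rng.standard_normal((500, 6))
n_factors = parallel_analysis(items)  # suggests one factor for these data
```

In practice a dedicated package (e.g., the `factor_analyzer` library) would handle the factor extraction itself; the sketch above only covers the decision about how many factors to extract.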
Step 4: Perform confirmatory factor analysis
•Differs from exploratory factor analysis in that it tests the plausibility of a particular factor structure
•In many senses weaker – the fit of other factor structures may be equally good
•You should (a) test the expected factor structure, (b) compare it to other factor structures, (c) perform multi-group comparisons
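The comparison in (b) is typically done with a chi-square difference test between nested models. A minimal sketch with made-up fit statistics (the chi-square and df values below are hypothetical; the closed-form p-value is used only because the df difference here happens to be 2):

```python
import math

# Hypothetical fit statistics for two nested CFA models of the same items:
# a 1-factor model versus the hypothesised 3-factor model.
chisq_1f, df_1f = 132.4, 54   # 1-factor model (more constrained)
chisq_3f, df_3f = 122.4, 52   # 3-factor model (less constrained)

delta_chisq = chisq_1f - chisq_3f   # 10.0
delta_df = df_1f - df_3f            # 2

# For 2 degrees of freedom, the chi-square survival function has the
# closed form exp(-x / 2); in general use a chi-square distribution.
p_value = math.exp(-delta_chisq / 2)
# A small p-value means the less constrained (3-factor) model fits
# significantly better than the 1-factor model.
```

A real analysis would obtain these statistics from SEM software and also report approximate fit indices (e.g., CFI, RMSEA) alongside the chi-square comparison.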
• Reliability refers to how consistent the scores on a test are over time or across equivalent versions of the test
– Reliability refers to how well a test measures true and systematic variation in a subject rather than error, bias, or random variation
• Validity refers to how well the test measures what it is supposed to measure. This requires independent criteria against which to validate the test score
Step 5: Test internal consistency
•Asks whether the items are highly inter-correlated
•Important because the relationship between the scale and other variables will be attenuated as a function of the scale’s unreliability due to low-correlating items
Historically, split-half reliability was used
•Split the test to create two equivalent halves
– e.g., odd- and even-numbered items
– The resulting reliability is for only half the test, so it is stepped up with the Spearman-Brown formula
– The longer the test, the more reliable it will be
•Now use Cronbach’s alpha (thanks to better computing)
- Effectively the average of every possible split-half reliability (with some adjustment). Gives a value between 0 and 1: less than .60 very poor, .60 to .70 poor, .70 to .80 good, more than .80 excellent
•Correlations between the scale and external variables can be corrected for attenuation using alpha (this can be theoretically important (gratitude and appreciation example), but can be dodgy as it excuses a poor scale)
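The three quantities above (split-half reliability stepped up by Spearman-Brown, Cronbach's alpha, and the correction for attenuation) can be computed in a few lines of NumPy. The response matrix below is hypothetical, purely for illustration:

```python
import numpy as np

def spearman_brown(r_half):
    """Step a split-half correlation up to full-length reliability."""
    return 2 * r_half / (1 + r_half)

def cronbach_alpha(items):
    """Cronbach's alpha for an (n_respondents, k_items) score matrix."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars / total_var)

def disattenuate(r_xy, alpha_x, alpha_y):
    """Correct an observed correlation for unreliability in both scales."""
    return r_xy / np.sqrt(alpha_x * alpha_y)

# Hypothetical responses: four people answering a three-item scale
scores = np.array([[1, 2, 1],
                   [2, 3, 2],
                   [3, 4, 3],
                   [4, 5, 5]])
alpha = cronbach_alpha(scores)
```

For example, an observed r = .40 between two scales with alphas of .64 and 1.00 disattenuates to r = .50, which illustrates how the correction can flatter a scale with low internal consistency.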
Step 6: Test for temporal stability (test-retest reliability)
•If the construct has trait-like properties then it should be stable over time, BUT tests should also be sensitive to real change
•Show that the test is stable over time for most people (use all methods).
•Method 1. Give the same test to the same people at two time points. The intervals should be short enough to preclude genuine change, but long enough that participants do not simply remember their answers.
•Ideally, use two different time intervals with different groups (e.g., 2 weeks and 4 weeks, or 4 weeks and 3 months)
•Method 2. Show that the group means do not significantly change over time (e.g., that everyone’s score does not go up on reflection). Be careful of issues of power.
•Method 3. Show that the scale DOES change when expected to. Can be demonstrated longitudinally (e.g., therapy), or experimentally (e.g., social desirability scales).
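Methods 1 and 2 can be sketched with NumPy alone. The time-1 and time-2 scores below are hypothetical, and in practice the t statistic would be compared against its critical value for n - 1 degrees of freedom:

```python
import numpy as np

# Hypothetical scores for eight respondents at two time points
t1 = np.array([12., 15., 9., 20., 17., 11., 14., 18.])
t2 = np.array([13., 14., 10., 19., 18., 12., 13., 17.])

# Method 1: test-retest correlation (stability of the rank ordering)
r = np.corrcoef(t1, t2)[0, 1]

# Method 2: paired t statistic on the means (no systematic drift expected)
d = t1 - t2
t_stat = d.mean() / (d.std(ddof=1) / np.sqrt(len(d)))
```

Here the high correlation shows the rank ordering is stable, while the near-zero t statistic shows there is no systematic shift in the group mean.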
Step 7: Show face validity
•Refers to what the test appears to measure
•Affects the acceptability of the test and whether it will be effective in practice
•Should have already been shown through the item pool creation.
•Might perform further qualitative work with the final items (genetic counselling example)
•Cannot substitute for objective validity
Step 8: Show criterion validity
•Refers to the scale correlating with what it is meant to
•If there are existing scales, there should be high correlations (but why are you making the new scale? Shorter? Better? – in which case correlations may be lower)
•What is a high correlation? Conventionally, scales measuring the same construct should correlate at above r = .80. Cohen (1988, 1992) defines correlation effect sizes as .10 small, .30 medium, and .50 large. These are rules of thumb and cannot apply to every situation; unique, causal, multiply determined, or objective relationships are always smaller.
•Should correlate with theoretically related constructs.
•Peer correlations: self-reports should correlate with peer ratings. But what is a high correlation? There are issues of trait visibility, a huge literature on judgements of others (David Funder and others), and halo bias.
Step 9: Show predictive validity
•Similar to criterion validity, but differs in that predictive validity predicts a future outcome
• Could be a behaviour. For example, extroversion should predict how much a person talks in a small-group task, and sensation seeking should predict sensation-seeking behaviour. Again there are issues of what counts as a high correlation (Epstein’s work on behavioural prediction).
•Could be a longitudinal change in functioning (Emmons work on goal striving)
Step 10: Show discriminant validity
•Show that the scale is NOT correlated with what it is not theoretically intended to.
•Issues of power.
•Issues of whether the other measures are actually any good.
•Examples
•Social desirability, but what do the scales actually measure?
•Mood inductions. But should the scale be affected? Is the induction any good (show that it changes other measures, but not yours)? Is this actually evidence of a lack of sensitivity to change? (Must be used in combination with convergent validity.)
•Often determined theoretically (e.g., PANAS).
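The power issue can be made concrete by putting a confidence interval around a "null" correlation using the Fisher z transformation. The r and n values below are hypothetical:

```python
import numpy as np

def r_confidence_interval(r, n, z_crit=1.96):
    """95% confidence interval for a correlation via Fisher's z transform."""
    z = np.arctanh(r)
    se = 1 / np.sqrt(n - 3)
    return float(np.tanh(z - z_crit * se)), float(np.tanh(z + z_crit * se))

# A "non-significant" r = .05 from n = 100 is compatible with anything
# from a small negative to a small-to-moderate positive correlation,
# so it is weak evidence of discriminant validity on its own.
lo, hi = r_confidence_interval(0.05, 100)
```

A wide interval straddling zero shows why a null result needs a large sample before it counts as evidence that two constructs are genuinely unrelated.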
Step 11: Show incremental validity
•You must show that your scale can predict some variables after controlling for other existing scales (re-demonstrate criterion or predictive validity after controlling for other scales)
•Very important, given the duplication of scales in psychology research
•Critical that you choose the right scales to control (huge potential for bias)
•Commonly use the Big Five (but crap, due to its hierarchical organization)
•Best to conduct a tough test where you select the most similar other scales, but where theory suggests that what you’re measuring may provide additional prediction (and watch your results disappear…)
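The incremental-validity test is a hierarchical regression: fit the outcome on the control scales, add the new scale, and examine the change in R-squared. A minimal NumPy sketch with simulated (hypothetical) data:

```python
import numpy as np

def r_squared(X, y):
    """R-squared from an OLS regression with an intercept."""
    X = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - resid.var() / y.var()

# Hypothetical data: a new scale that overlaps with one control scale
# but still carries unique predictive variance
rng = np.random.default_rng(1)
n = 300
controls = rng.standard_normal((n, 5))                     # e.g. Big Five scores
new_scale = 0.5 * controls[:, 0] + rng.standard_normal(n)  # partly redundant
outcome = (controls @ np.array([0.3, 0.1, 0.0, 0.0, 0.0])
           + 0.4 * new_scale + rng.standard_normal(n))

r2_step1 = r_squared(controls, outcome)                    # controls only
r2_step2 = r_squared(np.column_stack([controls, new_scale]), outcome)
delta_r2 = r2_step2 - r2_step1   # the new scale's incremental validity
```

In a real analysis the change in R-squared would be tested with an F test, and the choice of control scales is exactly the judgement call warned about above.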
Summary
Psychometric development involves:
1. Developing a clear rationale for the need to develop the scale
2. Developing a clear definition of the construct and a representative item pool
3. Identifying the scale’s structure and selecting items based on factor analysis
4. Testing the structure with confirmatory factor analysis
5. Testing internal consistency (reliability)
6. Testing temporal stability (reliability)
7. Showing face validity (validity)
8. Testing criterion validity (validity)
9. Testing predictive validity (validity)
10. Testing discriminant validity (validity)
11. Testing incremental validity (validity)