PSY 5130 Lecture 3 - Scale construction

Psychology 5130 Lecture 3
Summated Scale Construction
Psychological Scales
Scale: An instrument, usually a questionnaire, designed to measure a person's position on some
dimension.
Can be responded to by oneself or by others – yielding what is called self-report or other-report.
Can assess ability, or attitudes, or personality
Just a few of the many important dimensions are
The Big 5 dimensions of Extroversion, Agreeableness, Conscientiousness, Stability, and Openness,
Narcissism, Cognitive ability, Job Satisfaction, Organizational commitment, Positive Affectivity,
Negative Affectivity, Intent to leave, Job Embeddedness, Depression, Integrity, Self-esteem, . . .
Do we have enough psychological scales?
Here’s what one author has said . . .
Focus here will be on Summated scales to measure personality related constructs.
Historical Perspective
Three types
Guttman
Thurstone
Likert (pronounced Likkert)
Now, most scales are Likert.
P513 Lecture 3: Scale Construction - 1
3/16/2016
Likert Scale / Summated Scale
A set of statements regarding the construct being measured is presented to respondents
Respondents indicate the extent of agreement with each statement, on a response scale consisting of
anywhere from 2 to 11 alternatives.
Each response is assigned a numeric value.
The respondent's position on the dimension being measured, that is, the respondent's score, is the sum
or mean of the values of responses to the statements related to that dimension.
An example questionnaire from which 5 scale scores are extracted follows on the next two pages.
It’s the Sample 50-item Big Five questionnaire taken from the web site of the International Personality
Item Pool (IPIP) (http://ipip.ori.org/ipip/). The items on the web site have been modified so that each is
a complete sentence.
For example, item 1 on the web site is “Am the life of the party.” Here it is “I am the life of the party.”
Even-numbered items have been shaded. I have no evidence that such shading is beneficial. The IPIP
web site recommends a 5-point response scale. I prefer a 7-point response scale. If you need a 50-item
Big Five questionnaire, you may copy and use what follows.
Items are:
E: 1,6,11,16,21,26,31,36,41,46
A: 2,7,12,17,22,27,32,37,42,47
C: 3,8,13,18,23,28,33,38,43,48
S: 4,9,14,19,24,29,34,39,44,49
O: 5,10,15,20,25,30,35,40,45,50
Questionnaire
ID__________________________
Circle the number that represents how accurately the statement describes you.
7 = Completely Accurate
6 = Very Accurate
5 = Probably Accurate
4 = Sometimes Accurate, Sometimes Inaccurate
3 = Probably Inaccurate
2 = Very Inaccurate
1 = Completely Inaccurate
1. I am the life of the party.  1  2  3  4  5  6  7
2. I feel little concern for others.  1  2  3  4  5  6  7
3. I am always prepared.  1  2  3  4  5  6  7
4. I get stressed out easily.  1  2  3  4  5  6  7
5. I have a rich vocabulary.  1  2  3  4  5  6  7
6. I don't talk a lot.  1  2  3  4  5  6  7
7. I am interested in people.  1  2  3  4  5  6  7
8. I leave my belongings around.  1  2  3  4  5  6  7
9. I am relaxed most of the time.  1  2  3  4  5  6  7
10. I have difficulty understanding abstract ideas.  1  2  3  4  5  6  7
11. I feel comfortable around people.  1  2  3  4  5  6  7
12. I insult people.  1  2  3  4  5  6  7
13. I pay attention to details.  1  2  3  4  5  6  7
14. I worry about things.  1  2  3  4  5  6  7
15. I have a vivid imagination.  1  2  3  4  5  6  7
16. I keep in the background.  1  2  3  4  5  6  7
17. I sympathize with others' feelings.  1  2  3  4  5  6  7
18. I make a mess of things.  1  2  3  4  5  6  7
19. I seldom feel blue.  1  2  3  4  5  6  7
20. I am not interested in abstract ideas.  1  2  3  4  5  6  7
21. I start conversations.  1  2  3  4  5  6  7
22. I am not interested in other people's problems.  1  2  3  4  5  6  7
23. I get chores done right away.  1  2  3  4  5  6  7
24. I am easily disturbed.  1  2  3  4  5  6  7
25. I have excellent ideas.  1  2  3  4  5  6  7
26. I have little to say.  1  2  3  4  5  6  7
27. I have a soft heart.  1  2  3  4  5  6  7
28. I often forget to put things back in their proper place.  1  2  3  4  5  6  7
29. I get upset easily.  1  2  3  4  5  6  7
30. I do not have a good imagination.  1  2  3  4  5  6  7
31. I talk to a lot of different people at parties.  1  2  3  4  5  6  7
32. I am not really interested in others.  1  2  3  4  5  6  7
33. I like order.  1  2  3  4  5  6  7
34. I change my mood a lot.  1  2  3  4  5  6  7
35. I am quick to understand things.  1  2  3  4  5  6  7
36. I don't like to draw attention to myself.  1  2  3  4  5  6  7
37. I take time out for others.  1  2  3  4  5  6  7
38. I shirk my duties.  1  2  3  4  5  6  7
39. I have frequent mood swings.  1  2  3  4  5  6  7
40. I use difficult words.  1  2  3  4  5  6  7
41. I don't mind being the center of attention.  1  2  3  4  5  6  7
42. I feel others' emotions.  1  2  3  4  5  6  7
43. I follow a schedule.  1  2  3  4  5  6  7
44. I get irritated easily.  1  2  3  4  5  6  7
45. I spend time reflecting on things.  1  2  3  4  5  6  7
46. I am quiet around strangers.  1  2  3  4  5  6  7
47. I make people feel at ease.  1  2  3  4  5  6  7
48. I am exacting in my work.  1  2  3  4  5  6  7
49. I often feel blue.  1  2  3  4  5  6  7
50. I am full of ideas.  1  2  3  4  5  6  7
P513 Lecture 3: Scale Construction - 4
3/16/2016
Example - Overall Job Satisfaction Scale from Michelle Hinton Watson's Thesis
The items in this scale are presented as questions. In other instances, they are presented as statements.
If presented as statements, the responses would represent amount of agreement.
If you need an overall job satisfaction scale, you may use this.
For each statement please put a check ( ) in the space showing how you feel about the following aspects
of your job. This time, indicate how satisfied you are with the following things about your job.
(1) Very Dissatisfied (VD)
(2) Moderately Dissatisfied (MD)
(3) Slightly Dissatisfied (SD)
(4) Neither Satisfied nor Dissatisfied (N)
(5) Slightly Satisfied (SS)
(6) Moderately Satisfied (MS)
(7) Very Satisfied (VS)

Overall Satisfaction                                              VD  MD  SD   N  SS  MS  VS
27. How satisfied do you feel with your chances for
    getting ahead in this organization?                          ( ) ( ) ( ) ( ) ( ) ( ) ( )
32. All in all, how satisfied are you with the persons
    in your work group?                                          ( ) ( ) ( ) ( ) ( ) ( ) ( )
35. All in all, how satisfied are you with your supervisor?      ( ) ( ) ( ) ( ) ( ) ( ) ( )
37. All in all, how satisfied are you with this
    organization, compared to most others?                       ( ) ( ) ( ) ( ) ( ) ( ) ( )
43. How satisfied do you feel with the progress you have
    made in this organization up to now?                         ( ) ( ) ( ) ( ) ( ) ( ) ( )
45. Considering your skills and the effort you put into
    the work, how satisfied are you with your pay?               ( ) ( ) ( ) ( ) ( ) ( ) ( )
50. All in all, how satisfied are you with your job?             ( ) ( ) ( ) ( ) ( ) ( ) ( )
Why do we have multi-item scales?
1. Precision.
Consider the first item in the above scale. Suppose the true amounts of satisfaction of several
respondents are represented by the positions of the top arrows:

[Figure: item 27, "All in all, how satisfied are you with your job?", shown with the 7-point
VD MD SD N SS MS VS response scale (values 1-7); arrows above the scale mark several respondents'
true positions, most of them falling within the MS category.]
The response labels represent points on a continuum. They're like 1-foot marks on a scale of height.
So, each respondent above except the rightmost one would respond MS, which would be assigned the
value 6. But the rightmost respondent, whose actual position on the dimension is close to that of
his/her nearest neighbor, would pick VS, creating a considerable "response" distance from that neighbor.
Since each respondent can pick only one of the response CATEGORIES, any response made may miss
the respondent's true amount of satisfaction by about 7 percent on a 7-point scale, or by about
10 percent on a 5-point scale.
Note the wide range of actual feelings which would be represented by a 6 above.
Consider that two persons very close in their actual feelings about the job could get scores which were
7% apart. E.g., a person whose actual feeling is 6.55 would check 7, but a person whose actual feeling
was 6.45 would check 6. The difference of 1 in scores would be much greater than the difference of .1 in
actual feeling. See the red arrows above.
[Figure: true positions 6.45 and 6.55 falling on opposite sides of the boundary between response
categories 6 and 7.]
This situation is analogous to one that most students have strong feelings about – the use of 5 grades to
represent performance in a course. We all remember those instances in which we missed the next higher
grade mark by a 10th of a point. The use of a single item with just a few response categories is
analogous.
Solution: Use multiple items. While each one may miss its mark considerably, some of the misses will
be positive and some will be negative, cancelling each other out, so the average of the responses will
be very close to the respondent's true position on the continuum.
Conclusion: Having multiple items and computing scale scores by summing responses to the
multiple items increases accuracy of identification of the respondent’s true position on a
dimension.
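The cancelling-out argument can be checked with a small simulation, sketched here in Python rather than SPSS. The noise magnitude (0.8 response-scale units) and the number of simulated respondents are arbitrary assumptions chosen for illustration only.

```python
import random
import statistics

def mean_abs_error(n_items, n_respondents=2000, seed=0):
    """Average gap between a respondent's true position on a 1-7 continuum
    and the scale score built from n_items category responses."""
    rng = random.Random(seed)
    errors = []
    for _ in range(n_respondents):
        true_pos = rng.uniform(1.0, 7.0)
        # each response: true position + item-specific noise, forced into 1..7
        responses = [min(7, max(1, round(true_pos + rng.gauss(0, 0.8))))
                     for _ in range(n_items)]
        errors.append(abs(statistics.mean(responses) - true_pos))
    return statistics.mean(errors)

one_item = mean_abs_error(n_items=1)
ten_items = mean_abs_error(n_items=10)
print(one_item > ten_items)   # True: the 10-item scale lands closer to the truth
```

The 10-item average consistently tracks the true position more closely than any single categorical response can.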
2. Reliability.
Since a single categorical item response involves only a gross approximation to the actual (True) feeling,
on repeated measurement, a person giving only one response might get a very different score (6 vs. 7,
for example) on a single item. This reduces reliability. Reducing reliability reduces estimated validity.
Reducing estimated validity reduces your chances of getting published.
Conclusion: Summing or averaging responses to multiple items results in a measure that is
inherently more reliable.
3. Ability to use internal consistency to assess Reliability.
It is possible to assess the reliability of multiple items scales in a single administration of the scale to a
group by computing coefficient alpha. That is not possible with a single-item scale.
Conclusion: Using multiple items and basing the scale score on the sum or mean greatly facilitates
our ability to estimate the reliability of the scale score.
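Coefficient alpha itself is simple to compute from the item variances and the variance of the total score. A minimal Python sketch; the four-person, three-item data are made up for illustration:

```python
import statistics

def cronbach_alpha(items):
    """items: one list of responses per item (same respondents in each).
    alpha = k/(k-1) * (1 - sum of item variances / variance of total score)."""
    k = len(items)
    totals = [sum(resp) for resp in zip(*items)]          # each person's scale sum
    sum_item_var = sum(statistics.variance(col) for col in items)
    return k / (k - 1) * (1 - sum_item_var / statistics.variance(totals))

# four respondents answering a three-item scale (hypothetical data)
q1 = [5, 4, 2, 1]
q2 = [4, 4, 3, 1]
q3 = [5, 3, 2, 2]
print(round(cronbach_alpha([q1, q2, q3]), 3))   # 0.931
```

Note this single-administration estimate is exactly what a single-item scale cannot provide: with one item there are no inter-item covariances to work with.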
4. Insulation from the effects of idiosyncratic items.
Sometimes, a respondent will have a unique reaction to the wording of a single item. This reaction may
be based on the respondent’s history or understanding of that item. If that item is the only item in the
scale, then the respondent’s position on the dimension will be greatly distorted by that reaction.
Conclusion: Including multiple items and using the sum or mean of responses to them diminishes
the influence of any one idiosyncratic item.
Come on!!! What’s not to like about using multiple items???
1. Test length.
2. Overestimation of reliability by alpha from using too-similar items.
Issues in development of Likert/Summated scales
1. Do you have to ask for agreement?
The original idea was to assess agreement. But now, other ordered characteristics are used. E.g., level
of satisfaction, strength of feeling, accuracy with which a statement describes you, etc.
2. How many response categories should be employed?
2-11. Seven or more is preferable. Spector on p. 21 recommends 5-9. There are 3 reasons to use 7 or
more.
a. If your study will involve level shifts between conditions, you should allow plenty of room to shift,
which means using 7- or 9-point scales.
b. If you plan to use confirmatory factor analysis of structural equation models on your data, seven or
more response options per item is preferred for reasons associated with factor analysis.
c. We’ve obtained results suggesting that inconsistency of responding is relatively independent of
response level when 7-point scales are used.
Below are correlations of inconsistency with response level for several datasets, 5-point scales on
the left and 7-point scales on the right. (In the scatterplots these values summarize, the X-axis is
individual persons' scores and the Y-axis is the standard deviation of the items making up each score.)

5-point response scales                7-point response scales
Nhung Honest          r = -.35        Incentive Honest      r = -.03
Nhung Faking          r = -.43        Incentive Faking      r = +.08
Vikus                 r = -.49        FOR Study Gen         r = -.04
Bias Study IPIP       r = -.32        FOR Study FOR         r = +.08
Bias Study NEO-FFI    r = -.34        Worthy Thesis         r = -.18
3. Should there be a neutral category?
I am not familiar with a clear-cut, strong argument either way. I prefer one.
If you analyze the data using Confirmatory Factor Analysis or Structural Equation Modeling, it doesn’t
matter.
My guess (and it’s just a guess) is that you’ll get a few more failures to respond without one, from
people who just can’t make up their minds.
And variability of responses might be slightly smaller with one, from those same people responding in
the middle.
But, I’m not aware of a meta-analysis on this issue.
4. What numeric values should be assigned to each response possibility for analyses based on sums or
means?
Although at one time there were arguments for differentially scaling the various response alternatives,
now almost everyone who analyzes the data in the traditional way uses successive, equally spaced
integers. They need not be successive, but everyone uses successive integers, as opposed to, for
example, every other integer.
For example

Strongly Disagree = 1, Disagree = 2, Neutral = 3, Agree = 4, Strongly Agree = 5

Or

Strongly Disagree = 1, Moderately Disagree = 2, Disagree = 3, Neutral = 4, Agree = 5,
Moderately Agree = 6, Strongly Agree = 7
Newer Confirmatory Factor Analysis and Structural Equation Modeling based analyses assuming the
data are “Ordered Categorical” require simply that the responses categories be ordered. No numeric
assignment is required.
5. If the analyses are based on sums or means, which integers should be used?
Answer: Any set will do. They should be successive integers.
1 to 5 or 1 to 7
0 to 4 or 0 to 6
-2 to +2 or -3 to +3.
6. Does agreement have to be high or low numbers?
Yes, the God of statistics will strike you down if you make small numbers indicate more of ANY
construct. Being a golfer will not save you. In fact, after the Tiger Woods debacle, it will make the
strike from above even worse.
I strongly prefer assigning numbers so that a bigger response value represents more of the construct as
it is named. I’m sure it’s what the God of Statistics intended.
7. What about including negatively worded items, perhaps better labeled "opposite idea" items?
Negatively worded items may be included, although there is no guarantee that responses to negatively
worded items will be the actual negation of what the responses to a positively worded counterpart would
have been.
I like my supervisor vs.
I dislike my supervisor.
Responses to these two items should be perfectly negatively correlated, but often they are not.
Some studies have found that items with negative wording are responded to similarly to other
negatively worded items, regardless of content or dimension, presumably due to the negativity of the
item, regardless of the main content of the items. We have found this in seven datasets.
Data on existence of “Wording” biases – From Biderman, Nguyen, Cunningham, & Ghorbani, 2011.
We administered the IPIP and NEO-FFI Big Five Questionnaires to 200+ students and correlated
corresponding factors from the two questionnaires.
Data showing convergent validity of IPIP and NEO Big Five domain factor scores:
  "Purified" IPIP Extraversion with "purified" NEO Extraversion              .624
  "Purified" IPIP Agreeableness with "purified" NEO Agreeableness            .492
  "Purified" IPIP Conscientiousness with "purified" NEO Conscientiousness    .688
  "Purified" IPIP Stability with "purified" NEO Stability                    .664
  "Purified" IPIP Openness with "purified" NEO Openness                      .525

Data showing how much response tendencies correlate with each other:
  IPIP Positive Wording Bias with NEO Positive Wording Bias                  .739
  IPIP Negative Wording Bias with NEO Negative Wording Bias                  .752
These data, along with other data, suggest that
1) There are consistent individual differences in tendencies to respond to items based on whether the
item is positively-worded or negatively-worded.
2) The individual differences in the tendency to respond in a specific fashion to positively-worded
items are separate and different from the tendency to respond in a specific fashion to negatively-worded
items.
Moreover, the above data suggest that the bias tendencies are as reliable from one questionnaire to the
other as are the Big Five characteristics.
Recommendation:
Best: Design and analyze your questionnaire so that it permits estimation of the bias
tendencies. Nobody does this now, because such wording-related response tendencies have
only recently been discovered.
Expedient: Have an equal number of positively-worded and negatively-worded items and
average across wordings to cancel out differences in response tendencies associated with
wording.
8. If negatively worded items are included, how should they be scored?
Typically, negatively worded items are reverse-scored and then they’re treated as if they had been
positively worded.
Example for items with 5 categories and values 1, 2, 3, 4, and 5.

Original:  1  2  3  4  5
Reversed:  5  4  3  2  1

Suppose Q1 = "I like my job."
Suppose Q7 = "I don't like to come to work in the morning," a negatively-worded item for job satisfaction.

Data matrix:

Person   Q1   Q7   Q7R   Scale as sum   Scale as mean
  1       5    2    4      9 = 5 + 4        4.5
  2       4    1    5      9 = 4 + 5        4.5
  3       1    5    1      2 = 1 + 1        1
  4       2    4    2      4 = 2 + 2        2
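The reverse-scoring and summing in the data matrix above can be sketched in a few lines of Python (in SPSS you would use RECODE, as shown later in these notes):

```python
def reverse(response, n_categories=5):
    """Reverse-score a response on a 1..n_categories scale: 1<->5, 2<->4, etc."""
    return n_categories + 1 - response

q1 = [5, 4, 1, 2]                  # "I like my job"
q7 = [2, 1, 5, 4]                  # negatively worded, so reverse it first
q7r = [reverse(r) for r in q7]     # [4, 5, 1, 2]

sums = [a + b for a, b in zip(q1, q7r)]    # [9, 9, 2, 4]
means = [s / 2 for s in sums]              # [4.5, 4.5, 1.0, 2.0]
print(q7r, sums, means)
```

The output reproduces the Q7R, sum, and mean columns of the worked example.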
9. Should the scale score be the sum or the mean of items?
If there are no missing values, the sum and the mean will be perfectly correlated – they’re
mathematically equivalent, so you can use either.
The mean is more easily related to the questionnaire items if they all have the same response format.
SPSS’s RELIABILITY procedure computes only the sum.
If there are missing values, use the mean of available items or use imputation techniques to be
described next to impute missing values, after which it won’t matter whether you use the mean or sum.
10. What about missing responses?
There are several possibilities
a. Listwise deletion. Generally not preferred, but if only a couple of cases out of 200 are missing, use it.

Person   Q1   Q2   Q3
  1       5    4    5
  2       2    _    3
  3       3    2    4
  4       1    2    1

After listwise deletion (person 2 is dropped entirely):

Person   Q1   Q2   Q3
  1       5    4    5
  3       3    2    4
  4       1    2    1
Problem: Can decimate the dataset. You may be left with only highly conscientious, agreeable
participants, because that kind of participant is the only kind that will respond to all the items.
b. Item mean substitution. Substitute the mean of other persons' responses to the same item for the
missing item. Not recommended.

Person   Q1   Q2   Q3
  1       5    4    5
  2       2    _    3
  3       3    2    4
  4       1    2    1

Substituted (mean of 4, 2, & 2 -- the other persons' responses to Q2):

Person   Q1   Q2    Q3
  1       5    4     5
  2       2   2.7    3
  3       3    2     4
  4       1    2     1

or, with the substituted mean rounded to the nearest integer, Q2 for person 2 becomes 3.
c. Person scale mean substitution. Substitute the mean of the other items from the same scale that
the person responded to for the missing item. Assume Q1, Q2, and Q3 are three items forming a scale.
Not recommended.

Person   Q1   Q2   Q3      Substituted:   Q1   Q2    Q3
  1       5    4    5                      5    4     5
  2       2    _    3                      2   2.5    3    <- mean of 2 & 3 substituted
  3       3    2    4                      3    2     4
  4       1    2    1                      1    2     1
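Option c, person scale mean substitution, can be sketched as follows (Python, for illustration; SPSS's imputation procedures, option d, are the more defensible route):

```python
def person_mean_substitute(row):
    """Replace each missing response (None) with the mean of the person's
    other responses to items on the same scale."""
    present = [r for r in row if r is not None]
    fill = sum(present) / len(present)
    return [fill if r is None else r for r in row]

# person 2 from the example above: Q2 is missing
print(person_mean_substitute([2, None, 3]))   # [2, 2.5, 3]
```

Rows with no missing responses pass through unchanged.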
d. Use a more sophisticated imputation technique. Several are available in SPSS.
I have often used SPSS’s imputation techniques.
---------------------------------
e. The conventional wisdom is changing on issues of missing values. Many modern statistical techniques
are designed to work with all available data. These techniques do not include REGRESSION and GLM.
11. Writing the items. Spector, p. 23...
a. Each item should involve only one idea.
(Bad: The death penalty should be abolished because it's against religious law.)
b. Avoid colloquialisms and jargon.
(I am the life of the party. I shirk my duties.)
c. Consider the reading level of the respondent.
d. Avoid using "not" to create negatively worded items.
Good: Communication in my organization is poor.
Bad: Communication in my organization is not good.
e. Avoid items that might trigger emotional responses in certain samples.
Self presentation biases when the same method is used to assess two
or more constructs
Suppose two independent constructs are being measured using summated rating scales.
Suppose each construct was measured with a two-item scale using a 6-valued response format consisting
of the values 1 through 6.
Suppose 16 persons participated, giving the following matrix of responses.
          Construct 1     Construct 2
Person    Q1    Q2        Q3    Q4
   1       2     2         2     2
   2       2     2         3     3
   3       2     2         4     4
   4       2     2         5     5
   5       3     3         2     2
   6       3     3         3     3
   7       3     3         4     4
   8       3     3         5     5
   9       4     4         2     2
  10       4     4         3     3
  11       4     4         4     4
  12       4     4         5     5
  13       5     5         2     2
  14       5     5         3     3
  15       5     5         4     4
  16       5     5         5     5
For these hypothetical data, Q1 and Q2 are perfectly correlated, as are Q3 and Q4. Obviously, items
within the same scale are not perfectly correlated in real life.
But Q1+Q2 are uncorrelated with Q3+Q4. The constructs are independent.
Syntax to create construct scale scores:

compute C1=mean(Q1,Q2).
compute C2=mean(Q3,Q4).
correlate c1 with c2.

True correlation between the constructs, C1 and C2:

Correlations: C1 with C2
  Pearson Correlation    .000
  Sig. (2-tailed)       1.000
  N                        16
Now, suppose that the odd-numbered participants were people who preferred the low end of
whatever response scale they were filling out, while the even numbered participants were people who
preferred the high end of whatever scale they were filling out. Obviously, our participants don’t
separate into odd-even groups like this, but they do separate. There ARE people who prefer the high
end of the response scale and there ARE people who prefer the low end.
For example, many personality items have valence – agreeing indicates you’re “good”; disagreeing
indicates you’re “not good”. We think that those who are feeling good about themselves will tend to
choose the agreement end of the response scale and those who are not feeling so good about themselves
will tend to choose the disagreement end.
Assume that the response tendency results in a bias of 1 response value down in the case of those who
tend to disagree or up in the case of those who tend to agree.
The resulting data matrix would look like the following. The left half shows the biased responses;
the right half shows the true values they came from.

         Biased responses               True values
         Construct 1   Construct 2      Construct 1   Construct 2
Person   Q1    Q2      Q3    Q4         Q1    Q2      Q3    Q4
   1      1     1       1     1          2     2       2     2
   2      3     3       4     4          2     2       3     3
   3      1     1       3     3          2     2       4     4
   4      3     3       6     6          2     2       5     5
   5      2     2       1     1          3     3       2     2
   6      4     4       4     4          3     3       3     3
   7      2     2       3     3          3     3       4     4
   8      4     4       6     6          3     3       5     5
   9      3     3       1     1          4     4       2     2
  10      5     5       4     4          4     4       3     3
  11      3     3       3     3          4     4       4     4
  12      5     5       6     6          4     4       5     5
  13      4     4       1     1          5     5       2     2
  14      6     6       4     4          5     5       3     3
  15      4     4       3     3          5     5       4     4
  16      6     6       6     6          5     5       5     5
Now the correlation between Q1+Q2 with Q3+Q4 is .555, a value that is statistically significant.
The point of this is that differences in participants' response tendencies (e.g., the tendency of some to
use only the upper part of a response scale while others use the lower part of the scale) can result in
positive correlations between constructs that are in fact, uncorrelated.
This is the method bias problem that plagues the use of summated rating scales.
Syntax to create the biased construct scale scores:

compute biasedC1=mean(biasedq1,biasedq2).
compute biasedC2=mean(biasedq3,biasedq4).
correlate biasedc1 with biasedc2.

Each observed response is modeled as X = T + B + e (true score plus bias plus error).

Correlations: biasedC1 with biasedC2
  Pearson Correlation    .555
  Sig. (2-tailed)        .026
  N                        16
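The whole demonstration -- two independent constructs made to correlate .555 by a one-category response bias -- can be reproduced outside SPSS. A Python sketch using the data matrices above:

```python
import statistics

def corr(x, y):
    """Pearson correlation, written out so no external library is needed."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

# true positions on the two independent constructs (Q1 = Q2 and Q3 = Q4,
# so one item per construct carries all the information)
true_q1 = [2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5]
true_q3 = [2, 3, 4, 5, 2, 3, 4, 5, 2, 3, 4, 5, 2, 3, 4, 5]

# odd-numbered persons prefer the low end (shift 1 category down),
# even-numbered persons prefer the high end (shift 1 category up)
bias = [-1 if person % 2 == 1 else +1 for person in range(1, 17)]
biased_q1 = [t + b for t, b in zip(true_q1, bias)]
biased_q3 = [t + b for t, b in zip(true_q3, bias)]

print(round(corr(true_q1, true_q3), 3))      # 0.0   -> truly independent
print(round(corr(biased_q1, biased_q3), 3))  # 0.555 -> bias alone creates a correlation
```

A shared response tendency is, in effect, a common factor added to every item, and it shows up as a spurious correlation between the construct scores.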
The process of creating a summated scale.
1. Define/conceptualize the construct to be measured.
2. See if someone else has already created a scale measuring that construct. If so, and if it appears OK,
don’t re-invent the wheel. Places to look:
Faculty.
Buros Institute.
IPIP web site.
Google.
3. If you must develop your own, begin by generating items.
4. Have a sample of SMEs rate the extent to which each item represents the construct. Keep only the
best.
5. Administer items to a pilot sample from the population of interest.
6. Perform item analysis of the responses of the pilot sample.
a. Assess reliability.
b. Identify bad items, those that reduce reliability, and eliminate them.
c. Assess dimensionality using exploratory factor analysis.
7. Perform a validation study assessing convergent and discriminant validity using the population of
interest, perhaps using the pilot sample.
a. Administer other similar scales.
b. Administer other discriminating scales.
8. Administer to a sample from the population of interest along with the other scales that are part of
your research project.
Assess the theoretical relationships of interest to you.
9. Publish the scale and get rich.
Example of processing items of a scale in SPSS
This example is taken from an independent study project conducted by Lyndsay Wrensen examining
factors related to faking of the Big 5 personality inventory. She administered the IPIP Big 5 inventory
twice – once under instructions to respond honestly and again (counterbalanced) under instructions to
respond as if seeking a customer service job.
The data here are the honest condition responses to the Extroversion scale. Participants read each item
and indicate how accurately it described them using 1=Very inaccurate to 5=Very accurate. Some of the
items were negatively worded. We now would use a 7-point response scale. This project was done
almost 10 years ago.
Extroversion item responses before reverse-scoring the negatively worded items.
1. Reverse-score the negatively-worded items.
SPSS Syntax to reverse-score negatively worded items.
recode he2 he4 he6 he8 he10 (1=5)(2=4)(3=3)(4=2)(5=1) into he2r he4r he6r he8r he10r.
execute.
Or you can do the reverse scoring manually or using pull-down menus.
However you do it, put the reverse-scored values in columns that are different from the originals.
Extroversion item responses after reverse-scoring the negatively worded items.
2. Deal with missing data. (Not illustrated here.)
For example, use SPSS’s imputation features. Set up a time with me, and I’ll walk you through the
process.
3. Perform reliability analyses.
Analyze -> Scale -> Reliability Analysis
Note that the reverse-scored items are the ones that are included in the RELIABILITY analysis.
Reliability

Warnings
The covariance matrix is calculated and used in the analysis.

Case Processing Summary
                        N        %
Cases   Valid         179     90.4
        Excluded(a)    19      9.6
        Total         198    100.0
a. Listwise deletion based on all variables in the procedure.

Reliability Statistics
Cronbach's Alpha   Cronbach's Alpha Based on Standardized Items   N of Items
     .859                           .860                              10

Item Statistics
         Mean    Std. Deviation     N
he1      3.13        1.122         179
he3      3.97         .908         179
he5      3.72        1.093         179
he7      3.34        1.277         179
he9      3.41        1.216         179
he2r     3.56        1.254         179
he4r     3.27        1.136         179
he6r     3.79        1.110         179
he8r     2.74        1.224         179
he10r    2.70        1.285         179

Inter-Item Correlation Matrix
         he1     he3     he5     he7     he9    he2r    he4r    he6r    he8r   he10r
he1    1.000    .334    .245    .518    .542    .367    .435    .211    .336    .296
he3     .334   1.000    .473    .372    .331    .354    .367    .368    .170    .383
he5     .245    .473   1.000    .553    .329    .421    .407    .488    .242    .519
he7     .518    .372    .553   1.000    .427    .391    .404    .446    .198    .375
he9     .542    .331    .329    .427   1.000    .382    .496    .258    .461    .295
he2r    .367    .354    .421    .391    .382   1.000    .550    .572    .331    .448
he4r    .435    .367    .407    .404    .496    .550   1.000    .375    .371    .450
he6r    .211    .368    .488    .446    .258    .572    .375   1.000    .241    .417
he8r    .336    .170    .242    .198    .461    .331    .371    .241   1.000    .210
he10r   .296    .383    .519    .375    .295    .448    .450    .417    .210   1.000
The covariance matrix is calculated and used in the analysis.

Summary Item Statistics
                           Mean   Minimum   Maximum   Range   Max/Min   Variance   N of Items
Item Means                3.363     2.698     3.972   1.274     1.472       .180       10
Item Variances            1.363      .825     1.650    .825     2.000       .065       10
Inter-Item Correlations    .381      .170      .572    .402     3.363       .010       10
The covariance matrix is calculated and used in the analysis.

Item-Total Statistics
        Scale Mean if   Scale Variance if   Corrected Item-     Squared Multiple   Cronbach's Alpha
        Item Deleted      Item Deleted      Total Correlation     Correlation      if Item Deleted
he1        30.50            50.162               .547                 .456              .848
he3        29.66            52.496               .516                 .312              .851
he5        29.92            49.504               .612                 .504              .842
he7        30.29            47.724               .610                 .501              .842
he9        30.22            48.691               .586                 .449              .844
he2r       30.07            47.501               .638                 .489              .839
he4r       30.36            48.546               .649                 .455              .839
he6r       29.84            50.080               .560                 .443              .847
he8r       30.89            51.309               .417                 .273              .859
he10r      30.93            48.501               .557                 .375              .847
New Directions in Measurement of Psychological Constructs
1) Measurement Using Factor Scores from factor analyses
Factor scores are measures obtained from factor analyses of items.
Factor scores are computed by differentially weighting each item according to its contribution to the
measurement of the dimension. Items which are not highly correlated with the dimension are given little
weight. Those which are highly correlated with the dimension are given more weight.
Note that summated scale scores are computed by equally weighting each item that is thought to
be relevant.
The loadings of the items on the factor are used to determine the weights.
Advantages of factor scores
Factor score methods are thought by some to be more refined than the equal weighting that is
part of traditional summated scores because each item is weighted according to how much it represents
the factor.
They probably better capture the dimension of interest. They’re probably more highly correlated
with the dimension than the simple sum of items.
They can be computed taking into account other factors that might influence the items, thus may
be uncontaminated by the other factors.
Disadvantages of factor scores
They are harder to compute, requiring a factor analysis program.
The weights will differ from sample to sample so your weighting scheme based on your sample
will differ from my weighting scheme based on my sample.
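The difference between equal weighting and loading-based weighting can be illustrated with a toy example. Note this is a deliberate simplification: actual factor scores (e.g., regression-method scores) are computed from the full loading and uniqueness structure, not by simply normalizing the loadings as done here. The loadings and responses below are made up.

```python
# hypothetical loadings of four items on one factor, and one person's responses
loadings = [0.8, 0.7, 0.6, 0.3]
responses = [5, 4, 4, 2]

# summated-scale score: every item counts equally
unit_weighted = sum(responses) / len(responses)                # 3.75

# loading-proportional weights: the weakly related item (0.3) counts least
weights = [l / sum(loadings) for l in loadings]
loading_weighted = sum(w * r for w, r in zip(weights, responses))

print(unit_weighted, round(loading_weighted, 3))
```

The loading-weighted composite leans toward the items that best represent the factor; the unit-weighted score treats the weak item as if it were as informative as the strong ones.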
2) Taking self-presentation tendencies into account
The examples given above in which the effects of method bias (self-presentation tendency) on
correlations were shown illustrate this.
I believe that in the future, scores on scales, such as those of the Big Five, will be computed as
factor scores, with the effects of self-presentation tendencies removed. Right now, that’s not happening,
except in our own research.
3) Using the whole buffalo: Measuring other aspects of responses to questionnaires, such as
inconsistency
Virtually all scales represent the mean of responses to items representing a dimension.
So, a Conscientiousness score is the mean of the responses of a person to the Conscientiousness items in
a questionnaire.
What about the variability of a person’s responses to the C items?
We’ve been exploring the relationships of Inconsistency of responding, as measured by the standard
deviation of persons’ responses to items from the same dimension.
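As a concrete illustration, here are two hypothetical respondents with nearly identical Conscientiousness scale scores (the mean) but very different Inconsistency scores (the standard deviation of their own item responses):

```python
import statistics

# one person's responses to ten Conscientiousness items (hypothetical data)
consistent_person   = [4, 4, 5, 4, 4, 5, 4, 4, 5, 4]
inconsistent_person = [1, 7, 2, 6, 1, 7, 3, 6, 2, 7]

level_a = statistics.mean(consistent_person)      # the usual scale score: 4.3
level_b = statistics.mean(inconsistent_person)    # nearly the same: 4.2

incon_a = statistics.stdev(consistent_person)     # the Inconsistency measure
incon_b = statistics.stdev(inconsistent_person)   # much larger for person b

print(level_a, level_b, incon_a < incon_b)
```

A scale score alone cannot distinguish these two people; the within-person standard deviation can.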
Recent data (Reddock, Biderman, & Nguyen, 2011).
Overall UGPA was the criterion. Conscientiousness and Variability were predictors.
Results with Conscientiousness scale scores from the FOR condition:

[Path diagram: FORCon -> UGPA, standardized coefficient .250; Inconsistency -> UGPA, standardized
coefficient -.219; R = .315.]

Both standardized coefficients are significant at p < .01.
These data suggest that Inconsistency of Responding may be a valid predictor of certain types of
performance.
References
Reddock, C. M., Biderman, M. D., & Nguyen, N. T. (2011). The relationship of reliability and validity
of personality tests to frame-of-reference instructions and within-person inconsistency.
International Journal of Selection and Assessment, 19, 119-131.
For example: If you give a Big 5 questionnaire to a group of respondents, you can measure the
following 11 attributes
Extraversion
Agreeableness
Conscientiousness
Stability
Openness
General Affect
Positive Wording Bias
Negative Wording Bias
Inconsistency
Extreme Response Tendency
Acquiescent Response Tendency
4) Techniques based on Item Response Theory (IRT).
There is a burgeoning literature on test construction based on item response theory techniques that
threaten to displace Likert/summated-scale techniques. Although at the present time
the vast majority of personality tests are based on summated responses, this will change in the coming
years.
IRT based tests make use of items that have been scored, much as we now score people, so that each
item has a “scale” value, just like each person has a scale score.
Items are selected so that they represent a range of scale values, and a person’s score on a test
constructed from such items is based on which items the person endorsed or got correct, not simply on
how many the person endorsed or got correct.
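A minimal sketch of the core IRT idea, using the one-parameter (Rasch) model: each item has a scale value b, and the probability that a person at position theta endorses the item depends on the distance between theta and b. The item scale values below are hypothetical.

```python
import math

def rasch_p(theta, b):
    """Rasch model: probability that a person at position theta endorses
    (or passes) an item with scale value b."""
    return 1 / (1 + math.exp(-(theta - b)))

# items scaled from "easy to endorse" to "hard to endorse"
item_values = [-2.0, -1.0, 0.0, 1.0, 2.0]

# WHICH items a person endorses carries the information: endorsing the
# b = 2.0 item says much more about theta than endorsing the b = -2.0 item
for b in item_values:
    print(b, round(rasch_p(theta=0.5, b=b), 2))
```

When theta equals the item's scale value, the endorsement probability is exactly .5, which is what it means for person and item to share a common metric.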