Guidelines in Scale Development


Step 1: Determine clearly what it is you want to measure
 You must think clearly about the construct being measured and not overlook the
importance of being well grounded in the substantive theories related to the
phenomenon being measured.
 Even if there is no available theory you must specify a tentative theoretical model
that will serve as a guide. This should at minimum include a well formulated
definition of the construct.
 Scales can be either very general or very specific, and specificity can vary along a
number of dimensions including content domains, setting, and population. It is
vital that you determine the level of specificity before you begin to write items.
Furthermore, it is vital that the level of specificity is grounded in theory.
 This step should have been completed in your first assignment, although it is
possible that you are still thinking about these issues.

Step 2: Generate an Item Pool
 These items should be selected or created with the specific measurement goal in
mind. Theoretically a good set of items is chosen randomly from a universe of
items relating to the construct of interest.
 Redundancy is a good thing at this stage IF the redundancy pertains to the
construct, not to incidental aspects of the items. For example, pairing an item
such as “In my opinion, pet lovers are kind” with “In my estimation, pet lovers
are kind” is somewhat ridiculous. However, replacing the second item with “I
think that people who like pets are good people” would achieve redundancy
without being trivially so.
 When developing a scale you want considerably more items in the initial pool
than you plan on including in the final scale. The more items you have in your
pool the pickier you can be when choosing items for your final scale.
 The following guidelines can help you in ensuring that your items are technically
sound.
1. Avoid exceptionally lengthy items, which tend to increase complexity and
decrease clarity. However, do not shorten an item if doing so sacrifices
its meaning.
2. Write items at an appropriate reading level. A rule of thumb here is to
write items at a fifth- to seventh-grade reading level. Your text gives a
description of how to assess the reading level of a typical sentence.
3. Avoid items that convey more than one idea, in which case endorsing the
item might refer to either or both ideas.
4. Avoid ambiguities. Consider the items “Rapists should not be allowed to
seek the advice of lawyers because they are the scum of the earth” or
“How far did your mother go in school?”
5. Be wary of using both negatively and positively worded items. Although
it is common to include both types of items to avoid agreement bias,
negatively worded items often do not function properly. This is
especially true if there is a double negative, either in the item itself or in
the combination of the item and the response options.
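As a rough check on point 2 above, a readability index such as the Flesch-Kincaid grade level can be computed for draft items. The sketch below is illustrative only: the vowel-group syllable counter is a crude assumption, and dedicated readability tools are more careful.

```python
import re

def count_syllables(word):
    # Naive heuristic: count vowel groups; this is NOT a real syllabifier.
    groups = re.findall(r"[aeiouy]+", word.lower())
    n = len(groups)
    if word.lower().endswith("e") and n > 1:
        n -= 1  # discount a silent final 'e'
    return max(n, 1)

def fk_grade_level(text):
    """Flesch-Kincaid grade level:
    0.39*(words/sentences) + 11.8*(syllables/words) - 15.59"""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words) - 15.59)

print(round(fk_grade_level("I sometimes choose to eat chocolate ice cream."), 1))
```

Because the syllable counter is approximate, treat the result as a ballpark figure rather than an exact grade level.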

Step 3: Determine the format for measurement
 Numerous question formats exist, and the format should be determined
simultaneously with the development of the items; item format and item
generation go hand in hand.
 In general, scales made up of items that are summed to form a scale score are
most widely used and most consistent with the theoretical framework we have
been developing. However, other formats are possible, including the following:
1. Thurstonian Scaling: In this type of scale agree/disagree items are
developed that are differentially responsive to specific levels of the
attribute, which is typically an attitude of some sort. A simple example,
using items that tap one's fondness for chocolate ice cream, would be the
following:
I love chocolate ice cream
Chocolate ice cream is the best flavor in the world.
I sometimes choose to eat chocolate ice cream.
Chocolate ice cream is definitely not my favorite flavor.
I detest chocolate ice cream.
Developing a true Thurstone scale is extremely difficult.
2. Guttman Scaling: In this type of scale agree/disagree items are developed
that tap progressively higher levels of an attribute. Theoretically, at some
critical point a respondent should disagree with the remaining items
because the amount of the attribute expressed by the item exceeds the
amount possessed by the respondent. A simple example measuring degree of
alcohol consumption would be the following:
I never consume alcoholic beverages.
I occasionally consume alcoholic beverages.
I typically consume alcoholic beverages at least once a week.
I typically consume alcoholic beverages more than once a week.
I typically consume one alcoholic beverage each day.
I typically consume more than one alcoholic beverage each day.
I typically consume more than three alcoholic beverages each day.
I typically consume more than five alcoholic beverages each day.
These types of scales work well in some situations but tend to do poorly
when the phenomenon of interest is not objective or concrete.
3. Scales with Equally Weighted Items: These types of scales tend to work
best when the items are more or less equally related to the underlying
construct.
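The cumulative pattern that defines a Guttman scale can be checked empirically. The sketch below is an illustration, not a full scalogram analysis: it counts deviations from the ideal agree-then-disagree pattern and computes the coefficient of reproducibility, for which values above roughly .90 are conventionally taken to support scalability.

```python
def guttman_errors(responses):
    """Count deviations from the ideal Guttman pattern for one
    respondent, items ordered easiest -> hardest, 1 = agree.
    The ideal pattern is a run of 1s followed by a run of 0s;
    the best-fitting cut point minimizes the error count."""
    n = len(responses)
    best = n
    for cut in range(n + 1):
        # errors = disagreements before the cut plus agreements after it
        errors = responses[:cut].count(0) + responses[cut:].count(1)
        best = min(best, errors)
    return best

def reproducibility(data):
    """Coefficient of reproducibility: 1 - total errors / total responses."""
    total = sum(guttman_errors(r) for r in data)
    return 1 - total / (len(data) * len(data[0]))

data = [
    [1, 1, 1, 0, 0],  # perfect Guttman pattern
    [1, 1, 0, 0, 0],  # perfect
    [1, 0, 1, 0, 0],  # one deviation from the cumulative pattern
]
print(reproducibility(data))  # 1 error out of 15 responses -> about 0.93
```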
 Specific Types of Response Options:

Likert Scale: This is one of the most common item formats and is
commonly used when measuring opinions, beliefs, and attitudes. In this
type of scale a declarative statement is presented followed by response
options that indicate varying degrees of agreement or endorsement of the
statement. The difference in agreement between any adjacent response
options should be approximately equal. A good Likert scale
should state the opinion, attitude, belief, or other construct under study in
clear terms. It is not necessary or appropriate that items span the range of
weak to strong assertions of the construct because the response options
provide levels of differentiation. Be careful of writing items so mild
that almost everyone would agree with them.
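When Likert items are summed into a scale score, any negatively worded items (see Step 2) must be reverse-coded first. A minimal sketch, assuming items coded 1 through 5:

```python
def score_likert(responses, reverse_items, n_points=5):
    """Sum Likert responses (coded 1..n_points) into a scale score,
    reverse-coding negatively worded items first.
    `reverse_items` holds the zero-based indices of reversed items."""
    total = 0
    for i, r in enumerate(responses):
        if not 1 <= r <= n_points:
            raise ValueError(f"response {r} outside 1..{n_points}")
        total += (n_points + 1 - r) if i in reverse_items else r
    return total

# Item at index 2 is negatively worded, so its "1" counts as a 5.
print(score_likert([4, 5, 1, 3], reverse_items={2}))  # 4+5+5+3 = 17
```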

Semantic Differential: This type of item is used in reference to a stimulus.
Each item consists of adjective pairs that represent opposite ends of the
spectrum such as the following:
Statistics
Easy       ____ ____ ____ ____ ____ ____ ____   Hard
Exciting   ____ ____ ____ ____ ____ ____ ____   Boring
Important  ____ ____ ____ ____ ____ ____ ____   Trivial
Useful     ____ ____ ____ ____ ____ ____ ____   Useless
Typically seven or nine points are given. The respondent places a mark on
one of the lines to indicate the point on the continuum that characterizes
the stimulus. If sets of items are chosen that tap the same underlying
construct then, similar to Likert scales, such response formats are
theoretically compatible with the measurement models presented earlier.

Visual Analog: This type of item is similar to the semantic differential
however the respondents are provided with a continuous line for which
they must place a mark representing where they fall on whatever is being
measured. Although this type of response format is seemingly capable of
fine differentiation, it is important to consider that such fine gradations
may not be meaningful to respondents.

Binary Options: This type of scale gives subjects a choice between two
options (e.g. yes/no, true/false, agree/disagree). A major shortcoming of
these types of scales is that they can have only minimal variability,
because each item contributes such a small amount to the overall score.
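This shortcoming is easy to see numerically: a binary item's variance is p(1 - p), which peaks at 0.25 when p = 0.5, while a multi-point item can vary far more. A toy illustration with made-up responses:

```python
def variance(xs):
    # Population variance of a list of scores.
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

# A binary item can vary at most p*(1-p) = 0.25 (at p = 0.5),
# while a 5-point item over the same respondents can vary much more.
binary = [0, 1, 0, 1, 1, 0, 1, 0]   # p = 0.5 -> variance 0.25
likert = [1, 5, 2, 4, 5, 1, 4, 2]   # hypothetical 5-point responses
print(variance(binary), variance(likert))  # 0.25 vs 2.5
```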
 Issues Related to Response Options:


A desirable quality of a scale is variability. A measure cannot co-vary if it
cannot vary. One way to increase variability is to have a lot of items.
Another option is to increase the number of response options.

However, it is important to keep in mind respondents' ability to
discriminate between response options. When too many response options
are provided there will be more variability, but it will be “noise” rather
than true differences in the latent trait.

Sometimes the ability to discriminate between response options is
dependent on the specific wording (e.g. several, few, many) or the
physical arrangement of the response options. It is helpful to place
response options so that they represent an obvious continuum.

It is important to use terms that are easy to differentiate. Using both
“somewhat” and “not very” in the same set of response options will make it
difficult for respondents to differentiate between the two categories, so
responses will not mean the same thing for everyone.

Using a continuous type of scale should be thought through very carefully.
Consider the visual analog scale described previously. How precise will
measurement be in this case?

Using a neutral category should not be automatic. It depends on the type
of question, type of response options, and the purpose. An even number
of response options forces one to make a choice, while an odd number of
response options allows one to avoid making a choice.
Step 4: Have the initial item pool reviewed
 Consider having experts in the field (i.e. colleagues who have worked extensively
with the construct you are developing an instrument to measure) review your item
pool to determine if you have adequately defined the latent trait. You can
accomplish this by asking others to rate how relevant they think each item is to
what you are trying to measure.
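One common way to summarize such relevance ratings is an item-level content validity index: the proportion of experts who rate the item as relevant. The 4-point relevance scale and the cutoff of 3 below are conventions assumed for illustration, not part of these notes.

```python
def item_cvi(ratings, relevant_cutoff=3):
    """Item-level content validity index: the proportion of expert
    ratings at or above the cutoff on a 1-4 relevance scale."""
    return sum(r >= relevant_cutoff for r in ratings) / len(ratings)

# Five hypothetical experts rate one item's relevance from 1 to 4.
print(item_cvi([4, 3, 4, 2, 4]))  # 4 of 5 rated >= 3 -> 0.8
```

Items with a low index are candidates for revision or removal, though (as noted below) the final decision remains yours.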
 Reviewers should also evaluate the clarity and conciseness of items. Ambiguous
or unclear items can reflect factors extraneous to the latent trait.
 Reviewers can also comment on additional ways you might measure your
construct that you may have overlooked.
 Ultimately, it is your decision as to whether or not to incorporate the advice given
by reviewers.

Step 5: Consider the inclusion of validation items
 This refers to including items that attempt to ascertain whether respondents are
answering for reasons other than those you intend, such as social desirability.
 It might also be important to include items that help you to assess the construct
validity of your scale. More on this in future classes.

Step 6: Administer items to a development sample
 This will allow you to determine statistically the quality of your items.
 Of course the sample size needs to be large, but the obvious question is “how
large?” There is no general consensus on this issue, and the answer depends in
part on the number of items and the number of scales.
 The sample size should be large enough so that the item level statistics obtained
from piloting the measure are representative of the target population. Both
overall sample size and the ratio of respondents to items must be considered. A
commonly cited rule of thumb is 5 respondents per item; your book suggests
having at least 300 respondents in your pilot sample.
 If too few respondents are used, the patterns of covariance among the items
will not be stable. In the extreme case, when the ratio of subjects to items is low
and the sample size is small, the correlations among items can be driven
primarily by chance.
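This instability can be demonstrated with a small simulation. The sketch below uses made-up data: two item scores sharing a latent trait, with a true correlation near .5. Sample correlations based on 10 respondents spread far more widely around the true value than those based on 300.

```python
import random
import statistics

def pearson(xs, ys):
    # Pearson product-moment correlation, computed from scratch.
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def sample_r(n, rng):
    # Two item scores sharing a latent trait (true correlation about .5).
    xs, ys = [], []
    for _ in range(n):
        t = rng.gauss(0, 1)            # latent trait level
        xs.append(t + rng.gauss(0, 1)) # item 1 = trait + noise
        ys.append(t + rng.gauss(0, 1)) # item 2 = trait + noise
    return pearson(xs, ys)

rng = random.Random(0)
small = [sample_r(10, rng) for _ in range(200)]   # tiny pilot samples
large = [sample_r(300, rng) for _ in range(200)]  # adequate pilot samples
# Spread of the observed inter-item correlation across replications:
print(round(statistics.pstdev(small), 2), round(statistics.pstdev(large), 2))
```

The correlations from the tiny samples scatter several times more widely, which is exactly the instability described above.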
 It is also important that the sample used for piloting your measure be
representative of the population for which the scale is intended, in terms of the
variability in the level of the attribute present in your sample. If your sample is
quantitatively different from the target population, it may represent a narrower
range of the construct than would be expected in the population, which can bias
the resulting item-level statistics. Although inter-item correlations may still be
representative of the population, item-level variability will not be.
 A pilot sample may also differ from the target population in qualitative ways,
such that the relationships among items, subscales, and/or constructs differ from
those in the target population. One way this can occur is when the meaning given to items
on your measure is atypical of the meaning that would be given to these items in
the larger population. This can severely influence the results obtained during the
development process.