Guidelines in Scale Development

Step 1: Determine clearly what it is you want to measure

You must think clearly about the construct being measured and not overlook the importance of being well grounded in the substantive theories related to the phenomenon being measured. Even if there is no available theory, you must specify a tentative theoretical model that will serve as a guide. This should at minimum include a well-formulated definition of the construct. Scales can be either very general or very specific, and specificity can vary along a number of dimensions, including content domains, setting, and population. It is vital that you determine the level of specificity before you begin to write items, and that the level of specificity is grounded in theory. This step should have been completed in your first assignment, although it is possible that you are still thinking about these issues.

Step 2: Generate an Item Pool

These items should be selected or created with the specific measurement goal in mind. Theoretically, a good set of items is chosen randomly from a universe of items relating to the construct of interest. Redundancy is a good thing at this stage IF the redundancy pertains to the construct, not to incidental aspects of the items. For example, pairing an item such as "In my opinion, pet lovers are kind" with "In my estimation, pet lovers are kind" is somewhat ridiculous. However, if the second item were replaced by "I think that people who like pets are good people," the pair would be redundant but not trivially so. When developing a scale you want considerably more items in the initial pool than you plan on including in the final scale: the more items you have in your pool, the pickier you can be when choosing items for your final scale. The following guidelines can help you ensure that your items are technically sound.

1. Avoid exceptionally lengthy items, which tend to increase complexity and decrease clarity.
However, do not decrease the length of an item if doing so sacrifices its meaning.

2. Write items at an appropriate reading level. A rule of thumb here is to write items at between a fifth and seventh grade reading level. Your text gives a description of how to assess the reading level of a typical sentence.

3. Avoid items that convey more than one idea, in which endorsing the item might refer to either or both ideas.

4. Avoid ambiguities. Consider the items "Rapists should not be allowed to seek the advice of lawyers because they are the scum of the earth" or "How far did your mother go in school?"

5. Be wary of using both negatively and positively worded items. Although it is common to include both types of items to avoid agreement bias, negatively worded items often do not function properly. This is especially true if there is a double negative, either in the item itself or between the item and the response options.

Step 3: Determine the format for measurement

Numerous question formats exist, and the format must be decided upon simultaneously with item generation. In general, scales made up of items that are summed to form a scale score are most widely used and most consistent with the theoretical framework we have been developing. However, other formats, such as the following, are possible:

1. Thurstone Scaling: In this type of scale, agree/disagree items are developed that are differentially responsive to specific levels of the attribute, which is typically an attitude of some sort. A simple example, using items that tap one's fondness for chocolate ice cream, would be the following:

I love chocolate ice cream.
Chocolate ice cream is the best flavor in the world.
I sometimes choose to eat chocolate ice cream.
Chocolate ice cream is definitely not my favorite flavor.
I detest chocolate ice cream.
Developing a true Thurstone scale is extremely difficult.

2. Guttman Scaling: In this type of scale, agree/disagree items are developed that tap progressively higher levels of an attribute. Theoretically, at some critical point a respondent should disagree with the remaining items, because the amount of the attribute expressed by the item exceeds the amount possessed by the respondent. A simple example measuring degree of alcohol consumption would be the following:

I never consume alcoholic beverages.
I occasionally consume alcoholic beverages.
I typically consume alcoholic beverages at least once a week.
I typically consume alcoholic beverages more than once a week.
I typically consume one alcoholic beverage each day.
I typically consume more than one alcoholic beverage each day.
I typically consume more than three alcoholic beverages each day.
I typically consume more than five alcoholic beverages each day.

These types of scales work well in some situations but tend to do poorly when the phenomenon of interest is not objective or concrete.

3. Scales with Equally Weighted Items: These types of scales tend to work best when items are more or less equally related to the underlying construct.

Specific Types of Response Options:

Likert Scale: This is one of the most common item formats and is commonly used when measuring opinions, beliefs, and attitudes. In this type of scale, a declarative statement is presented, followed by response options that indicate varying degrees of agreement with or endorsement of the statement. The difference in agreement should be approximately equivalent between any adjacent response options. A good Likert item should state the opinion, attitude, belief, or other construct under study in clear terms. It is not necessary or appropriate that items span the range from weak to strong assertions of the construct, because the response options provide the levels of differentiation. Be careful of writing items that are so mild that almost everyone would agree with them.
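The summated Likert scoring described above can be sketched in a few lines. This is a minimal illustration, not part of the original guidelines: the item names, the 5-point response range, and the reverse-keyed item are all hypothetical.

```python
# Minimal sketch of summated (Likert) scale scoring with reverse-keyed items.
# Item names and the 5-point response range are illustrative assumptions.

def score_likert(responses, reverse_keyed=(), n_points=5):
    """Sum item responses, reflecting reverse-keyed items first.

    responses: dict mapping item name -> response (1..n_points)
    reverse_keyed: names of items whose wording opposes the construct
    """
    total = 0
    for item, value in responses.items():
        if not 1 <= value <= n_points:
            raise ValueError(f"{item}: response {value} outside 1..{n_points}")
        if item in reverse_keyed:
            value = (n_points + 1) - value  # reflect: 1<->5, 2<->4, ...
        total += value
    return total

# A respondent who strongly endorses the construct should score high
# even though hypothetical item "q3" is negatively worded.
answers = {"q1": 5, "q2": 4, "q3": 1}  # q3: "I detest chocolate ice cream"
print(score_likert(answers, reverse_keyed={"q3"}))  # 5 + 4 + (6 - 1) = 14
```

Reflecting reverse-keyed items before summing is what makes positively and negatively worded items contribute to the scale score in the same direction.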
Semantic Differential: This type of item is used in reference to a stimulus. Each item consists of adjective pairs that represent opposite ends of a continuum, such as the following:

Statistics
Easy      ____ ____ ____ ____ ____ ____ ____ Hard
Exciting  ____ ____ ____ ____ ____ ____ ____ Boring
Important ____ ____ ____ ____ ____ ____ ____ Trivial
Useful    ____ ____ ____ ____ ____ ____ ____ Useless

Typically seven or nine points are given. The respondent places a mark on one of the lines to indicate the point on the continuum that characterizes the stimulus. If sets of items are chosen that tap the same underlying construct then, similar to Likert scales, such response formats are theoretically compatible with the measurement models presented earlier.

Visual Analog: This type of item is similar to the semantic differential; however, respondents are provided with a continuous line on which they must place a mark representing where they fall on whatever is being measured. Although this type of response format is seemingly capable of fine differentiation, it is important to consider that such fine gradations may not be meaningful to respondents.

Binary Options: This type of scale gives subjects a choice between two options (e.g., yes/no, true/false, agree/disagree). A major shortcoming of these types of scales is that they can have only minimal variability, because each item contributes such a small amount to the overall score.

Issues Related to Response Options:

A desirable quality of a scale is variability: a measure cannot co-vary if it cannot vary. One way to increase variability is to have many items. Another option is to increase the number of response options. However, it is important to keep in mind respondents' ability to discriminate between response options. When too many response options are provided there will be more variability, but it will be "noise" rather than true differences in the latent trait.
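The point about binary items having minimal variability can be made concrete with a quick check: a 0/1 item's variance is p(1 - p), which can never exceed 0.25, while a multi-point item can vary far more. The response values below are made up purely for illustration.

```python
# Illustrative sketch: a binary item caps its variance at p(1-p) <= 0.25,
# while a k-point item can vary much more, so a scale of binary items
# needs many more items to achieve comparable score variability.
# The responses below are hypothetical.

from statistics import pvariance

binary_item = [0, 1, 1, 0, 1, 0, 1, 1]   # yes/no responses coded 0/1
seven_point = [1, 7, 4, 2, 6, 3, 5, 7]   # responses on a 7-point format

print(pvariance(binary_item))  # never more than 0.25 for any binary item
print(pvariance(seven_point))  # can be many times larger
```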
Sometimes the ability to discriminate between response options depends on the specific wording (e.g., several, few, many) or on the physical arrangement of the response options. It is helpful to arrange response options so that they represent an obvious continuum, and to use terms that are easy to differentiate. Using the terms "somewhat" and "not very" together will make it difficult for respondents to differentiate between the two categories, so responses will not mean the same thing for everyone. Using a continuous type of scale should be thought through very carefully: consider the visual analog scale described previously. How precise will measurement be in that case? Including a neutral category should not be automatic; it depends on the type of question, the type of response options, and the purpose. An even number of response options forces respondents to make a choice, while an odd number allows them to avoid making one.

Step 4: Have the initial item pool reviewed

Consider having experts in the field (i.e., colleagues who have worked extensively with the construct you are developing an instrument to measure) review your item pool to determine whether you have adequately defined the latent trait. You can accomplish this by asking others to rate how relevant they think each item is to what you are trying to measure. Reviewers should also evaluate the clarity and conciseness of items; ambiguous or unclear items can reflect factors extraneous to the latent trait. Reviewers can also comment on additional ways you might measure your construct that you may have overlooked. Ultimately, it is your decision whether or not to incorporate the advice given by reviewers.

Step 5: Consider the inclusion of validation items

This refers to including items that attempt to ascertain whether respondents are answering items for reasons other than those you intend, such as social desirability.
It might also be important to include items that help you to assess the construct validity of your scale. More on this in future classes.

Step 6: Administer items to a development sample

This will allow you to determine statistically the quality of your items. Of course the sample needs to be large, but the obvious question is "how large?" There is no general consensus on this issue, and the answer depends in part on the number of items and the number of scales. The sample should be large enough that the item-level statistics obtained from piloting the measure are representative of the target population. Both the overall sample size and the ratio of respondents to items must be considered. I have read somewhere that having 5 respondents per item is sufficient. Your book suggests having at least 300 respondents in your pilot sample. If too few respondents are used, the patterns of covariance among the items will not be stable. In the extreme case, when the ratio of subjects to items is low and the sample size is small, the correlations among items can be influenced primarily by chance. It is also important that the sample used for piloting your measure be representative of the population for which the scale is intended, in terms of the variability in the level of the attribute present in your sample. If your sample is quantitatively different from the target population, it may consist of respondents who represent a narrower range of the construct than would be expected in the population. This can bias the resulting item-level statistics: although inter-item correlations may be representative of the population, item-level variability will not be. A pilot sample may also differ from the target population in qualitative ways, such that the relationships among items, subscales, and/or constructs differ from those in the target population.
One way this can occur is when the meaning respondents give to items on your measure is atypical of the meaning that would be given to those items in the larger population. This can severely influence the results obtained during the development process.
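The instability of inter-item correlations in small pilot samples can be demonstrated with a short simulation. This is a hypothetical illustration, not from the original notes: two items are generated to share a latent trait with a true inter-item correlation of 0.5 (an assumed value), and the sample-to-sample spread of the estimated correlation is compared for small pilot samples versus the 300-respondent samples the book suggests.

```python
# Hypothetical simulation of the sample-size point: with few respondents,
# inter-item correlations bounce around their true value; with many,
# they stabilize. The true correlation of 0.5 is an assumption.

import random
from statistics import mean, stdev

def pearson(x, y):
    """Pearson correlation of two equal-length lists."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

def simulated_r(n, rho=0.5, rng=random):
    """Sample correlation between two items sharing a latent trait."""
    x, y = [], []
    for _ in range(n):
        t = rng.gauss(0, 1)  # latent trait score for one respondent
        x.append(rho ** 0.5 * t + (1 - rho) ** 0.5 * rng.gauss(0, 1))
        y.append(rho ** 0.5 * t + (1 - rho) ** 0.5 * rng.gauss(0, 1))
    return pearson(x, y)

random.seed(1)
small = [simulated_r(20) for _ in range(200)]   # tiny pilot samples
large = [simulated_r(300) for _ in range(200)]  # book's suggested n
# The spread of the estimates shrinks markedly at the larger sample size.
print(round(stdev(small), 2), round(stdev(large), 2))
```

Under this construction each item has unit variance and the items correlate at rho, so the simulation isolates sampling error: the small-sample estimates scatter widely around 0.5, which is exactly the "correlations influenced primarily by chance" problem described above.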