Summated rating scale construction (Spector) – A summary

Chapter 1: Introduction

Characteristics of a summated rating scale:
1. It contains multiple items (the responses to the items are summed)
2. Each individual item measures something that has an underlying, quantitative measurement continuum (e.g. an attitude)
3. Each item has no right answer
4. Each item is a statement, on which respondents give ratings

Why use multiple-item scales?
Reliability: single items do not produce responses that are consistent over time. Multiple items improve reliability by allowing random errors of measurement to average out, minimizing the impact of any one item.
Precision: single items are imprecise because they restrict measurement to only two levels.
Scope: many measured characteristics are broad in scope and not easily assessed with only one question. A variety of questions enlarges the scope of what is measured.

Reliability: does the scale measure something consistently?
- Test-retest reliability: the scale yields consistent measurement over time.
- Internal consistency reliability: multiple items designed to measure the same construct intercorrelate with one another.
Validity: does the scale measure its intended construct?

Steps of scale construction:
1. Define the construct: what is the construct of interest? A clear and precise definition of what the scale is intended to measure.
2. Design the scale: write the initial item pool.
3. Pilot test: a small number of respondents are asked to critique the scale; based on their feedback, the scale is revised.
4. Administration and item analysis: a sample completes the scale; coefficient alpha is calculated and reliability is established initially.
5. Validate and norm the scale.

Chapter 2: Theory of summated rating scales

Classical test theory (CTT): Observed score = True score + Random error, or O = T + E. The larger the error component, the lower the reliability. When multiple items are combined into an estimate of the true score, the errors tend to average out, leaving a more accurate and consistent (reliable) measurement from time to time (a small simulation of this follows below).
Extension of CTT: O = T + E + B, where B is bias. An example of bias is social desirability.
Response sets: tendencies of people to respond to items systematically (e.g. the acquiescence response set: the tendency to agree with all items regardless of content).
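To make the error-averaging argument concrete, here is a minimal simulation sketch in Python (all numbers are hypothetical, not from the book): every respondent is given the same true score, each item adds independent random error, and the spread of the resulting scale scores shrinks as items are added.

    import numpy as np

    # O = T + E: each observed item score is a true score plus random error.
    rng = np.random.default_rng(1)
    n_respondents, true_score = 1000, 50.0

    for n_items in (1, 5, 20):
        errors = rng.normal(0, 10, size=(n_respondents, n_items))
        scores = (true_score + errors).mean(axis=1)   # scale score per respondent
        print(n_items, "items -> SD around the true score:", round(scores.std(), 2))

    # The SD shrinks by roughly 1/sqrt(n_items) (about 10, 4.5, 2.2):
    # combining items lets the random errors average out.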
Chapter 3: Defining the construct

A clear construct definition is necessary for developing good items and for deriving hypotheses for validation purposes.
Deductive development: first clearly define the construct; this construct definition then guides the subsequent scale development. Validation takes a confirmatory approach. Preferred for test construction.
Inductive development: items are administered to subjects and analyses are used to uncover the construct within the items. Validation takes an exploratory approach (interpreting results).

How to define the construct: literature review; already existing scales can be used to guide scale development.
Homogeneity and dimensionality of constructs: the content of complex constructs can only be adequately covered by a scale with multiple subscales.
Summated rating scale: multiple items are combined and analyzed together instead of separately.

Chapter 4: Designing the scale

Three parts in designing:
1. Number and nature of the response choices. Nature: agreement, evaluation, or frequency. Number: generally between 5 and 9.
2. Writing the item stems.
3. Instructions to respondents: giving directions, explaining the construct, or explaining the response choices (what counts as 'quite often' and what as 'seldom'?).

Unipolar scales: the scale varies from zero to a high positive value (e.g. frequency of occurrence).
Bipolar scales: both positive and negative values, with the zero point in the middle (e.g. attitudes).
When both positively and negatively worded items are used, reverse the scores of the negatively worded items: R = (H + L) – I, where H is the highest possible response, L is the lowest possible response, I is the response to the item, and R is the reversed score. For example, on a 1-to-6 scale a response of 2 becomes R = (6 + 1) – 2 = 5.

Rules for writing good items:
1. Each item should express one and only one idea.
2. Use both positively and negatively worded items. In this way, bias from response tendencies (e.g. acquiescence: the tendency to agree or disagree with all items) can be reduced. Acquiescence can produce extremely high or low scores, because such respondents will tend to agree (or disagree) with all items if the items are all worded in the same direction.
3. Avoid colloquialisms, expressions, and jargon.
4. Consider the reading level of the respondents (and thus also the complexity of the items).
5. Avoid the use of negatives to reverse the wording of an item. People can misread (miss) negatives and score the item on the wrong side of the scale.

Chapter 5: Conducting the item analysis

Goal: producing a tentative version of the scale, suitable for validation. Administer the scale to a sample of respondents (100-200) for the item analysis.

Item analysis
Purpose: to find items that form an internally consistent scale and to eliminate those items that do not. Internal consistency is a measurable property of items implying that they measure the same underlying construct: do the items intercorrelate?
Item-remainder coefficient: how well does an individual item relate to the other items? It is the correlation of each item with the sum of the remaining items. Again: reverse the negatively worded items first! The items with the highest item-remainder coefficients are the ones that are retained. Items can be selected by taking the x items with the largest coefficients, or by demanding a minimum-sized coefficient. The more items, the lower the coefficients can be and still yield an internally consistent scale.

Coefficient alpha (Cronbach): measures the internal consistency reliability of a scale. It is a function of the number of items and the magnitude of their intercorrelations; alpha can therefore be raised by increasing the number of items or by raising their intercorrelations. Even items with low intercorrelations can produce a relatively high coefficient alpha if there are enough items. If all items represent an underlying construct, then the intercorrelations among them represent the opposite of error; an item that does not correlate with the others is expected to be comprised of error. Error can be averaged out if there are many items.
Nunnally: coefficient alpha should be at least 0.70 for a scale to demonstrate internal consistency. Alpha is not a correlation! Alpha should always be positive if it is calculated correctly. Coefficient alpha compares the variance of the total scale score (the sum of the items) with the variances of the individual items: alpha = [k / (k – 1)] x [1 – (sum of the item variances) / (variance of the total score)], where k is the number of items. When the items are completely uncorrelated, the variance of the total scale equals the sum of the item variances, and alpha is zero.
Use both coefficient alpha and the item-remainder coefficients when choosing items for a scale (see table 5.1, page 33); a computational sketch follows below.
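To make these calculations concrete, here is a minimal Python sketch (the response data are made up and the variable names are mine; this is an illustration, not the book's procedure verbatim) of the item-remainder coefficients and coefficient alpha:

    import numpy as np

    # Hypothetical responses: 8 respondents x 4 items on a 1-6 scale, with
    # negatively worded items assumed already reversed via R = (H + L) - I.
    X = np.array([
        [5, 4, 6, 5],
        [2, 3, 2, 1],
        [4, 4, 5, 4],
        [6, 5, 6, 6],
        [3, 2, 3, 3],
        [5, 5, 4, 5],
        [1, 2, 1, 2],
        [4, 3, 4, 4],
    ])
    k = X.shape[1]

    # Item-remainder coefficient: each item correlated with the sum of the others.
    for i in range(k):
        remainder = X.sum(axis=1) - X[:, i]
        r = np.corrcoef(X[:, i], remainder)[0, 1]
        print(f"item {i + 1}: item-remainder r = {r:.2f}")

    # Coefficient alpha: k/(k-1) * (1 - sum of item variances / total-score variance).
    alpha = (k / (k - 1)) * (1 - X.var(axis=0, ddof=1).sum() / X.sum(axis=1).var(ddof=1))
    print(f"coefficient alpha = {alpha:.2f}")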
When an item-remainder coefficient is negative, a possible scoring error was made (for example, a negatively worded item that was not reversed), and the previous steps should be re-examined: the item might be poorly written, it might be inappropriate for the respondents, or the conceptualization of the construct might have weaknesses.

External criteria for item selection
Selecting or deleting items based on their relations with criteria external to the scale.
Bias: items are deleted when they relate to a bias variable (the variable of interest here). The scale is administered to a sample while the bias variable is measured on the same people; each item is correlated with the bias variable, and only items with small or no relations to the bias variable are selected.
Social desirability (SD): people tend to agree with favorable items about themselves and disagree with unfavorable items. Each item is correlated with the scores on an SD measure; items with a significant correlation are deleted, so that the scale can be developed without SD bias.

After the item selection, a tentative version of the scale is ready. The item analysis needs to be repeated on a second sample, to further establish reliability and validity.

When the scale needs more work
When an acceptable level of internal consistency is not achieved, the cause might lie in the items or in the construct definition. See the rules for writing good items; the construct might also contain too many elements, or it might have been defined too broadly or vaguely.

Spearman-Brown formula: estimates the number of additional items needed to achieve a given level of internal consistency (coefficient alpha). It is based on the assumption that added or deleted items are of the same quality as the initial or retained items. The formula works in both directions: it can also calculate the coefficient alpha that will be reached when the number of items is increased or decreased by a certain factor m: predicted alpha = (m x alpha) / [1 + (m – 1) x alpha]. When using it to shorten a scale, take care that coefficient alpha does not become too low.
Example (table 5.2, page 38): an initial item pool begins with 15 items; after the item analysis, 5 items are left with a coefficient alpha of 0.60. The table indicates that the number of items must be doubled (to 10) to achieve 0.75 and tripled (to 15) to achieve 0.82. Assuming that the newly added items are of the same quality, and that again only a third of the items survive the analysis, you need to write another 15 items (5 retained) to reach 0.75, or another 30 (10 retained) to reach 0.82. A sketch of the formula follows below.
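The prophecy formula itself is compact; this sketch (the function name is mine) reproduces the table 5.2 values quoted above:

    def spearman_brown(alpha, m):
        # Predicted coefficient alpha when the number of items changes by
        # factor m, assuming the new items equal the old ones in quality.
        return (m * alpha) / (1 + (m - 1) * alpha)

    print(round(spearman_brown(0.60, 2), 2))  # 0.75: doubling 5 items to 10
    print(round(spearman_brown(0.60, 3), 2))  # 0.82: tripling 5 items to 15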
Multidimensional scales
Many constructs are broad and contain multiple dimensions. Developing such a scale is much the same as developing a unidimensional scale, except that subscales are used. It is best when each item is part of only one subscale. During construct development it is important to determine where the constructs overlap and where they are distinct. Where scales share item content, one should be careful in interpreting their intercorrelations.

Chapter 6: Validation

If a scale is internally consistent, it is reliable: it certainly measures something. But does it measure the intended construct? Hypotheses are developed about the causes, effects, and correlates of the construct, and the scale is used to test these hypotheses. Empirical support for the hypotheses implies validity of the scale; when a sufficient amount of supporting data has been collected, the scale is (tentatively) declared construct valid. Empirical validation evidence supports theoretical predictions about how the construct of interest relates to other constructs, and it demonstrates the potential utility of the construct. Validation takes place after the item analysis has been conducted and the items have been chosen.

Techniques for studying validity
Three approaches to establishing validity:
1. Criterion-related validity: testing hypotheses about how the scale will relate to other variables.
2. Discriminant and convergent validity: investigating the comparative strengths or patterns of relations among several variables.
3. Factor analysis: exploring the dimensionality of a scale.

Criterion-related validity
Scores on the scale of interest are compared with scores on other variables: the criteria. It begins with the generation of hypotheses about relations between the construct of interest and other constructs, drawn either from existing theory or from one's own theoretical work; the scale is validated against these hypotheses.
1. Concurrent validity: all data are collected simultaneously, from a sample of respondents, on the scale of interest and on criteria hypothesized to relate to it. The hypotheses are that the scale of interest will correlate with one or more criteria; findings that the scale significantly relates to the hypothesized variables are taken as support for validity. Often the scale of interest is embedded in a questionnaire that contains measures of several variables.
2. Predictive validity: the same as concurrent validity, except that the data for the scale of interest are collected before the criterion variables. Unlike concurrent validity, predictive validity demonstrates how well a scale can predict a future variable, for example predicting which respondents will quit school from their attitudes or personality.
3. Known-groups validity: based on hypotheses that certain groups of respondents will score higher on the scale than others. The criterion is categorical instead of continuous: means on the scale of interest are compared among respondents at each level of the categorical variable, and the hypotheses specify which groups will score higher. E.g. for a scale of job complexity, it is hypothesized that corporate executives will score higher than data-entry clerks; if they do not, something is wrong and the scale is not valid.

Two critical features of all criterion-related validity studies:
1. The underlying theory from which the hypotheses are drawn must be solid. If the expected relations fail to appear, you cannot easily discover whether something was wrong with the theory or with the scale.
2. There must be a good measurement of the criterion in order to conduct a good validity test of the scale. If you don't find the expected relations with the criteria, this might well be caused by criterion invalidity rather than scale invalidity.

Convergent and discriminant validity
Convergent validity: different measures of the same construct should relate strongly to each other. It is indicated by comparing scores on a scale with an alternative measure of the same construct; the two measures should correlate strongly.
Discriminant validity: measures of different constructs should relate only modestly to each other.

Multitrait-multimethod matrix (MTMM, page 51)
Simultaneously explores convergent and discriminant validity. At least two constructs are measured, and each has been measured with at least two separate methods. The correlations between subscales measuring the same trait across different methods are the convergent validities. Convergent validity is indicated by these validity diagonal values, which should be statistically significant and relatively large in magnitude; for each subscale, the convergent validity should be larger than the other values in its row or column. So, two measures of the same construct should correlate more strongly with each other than they do with any other construct. A hypothetical example matrix follows below.
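As a hypothetical illustration (all correlations invented): two traits, A and B, each measured by self-report and by peer report. Reliabilities sit in parentheses on the diagonal; the starred values are the convergent validities.

                 A-self   B-self   A-peer   B-peer
    A-self       (.85)
    B-self        .30     (.82)
    A-peer        .60*     .15     (.80)
    B-peer        .20      .55*    .28     (.78)

Here the convergent validities (.60 and .55) are larger than every other value in their rows and columns, which supports both convergent and discriminant validity.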
Use of factor analysis for scale validation
Unidimensional scales: factor analysis can be used to explore possible subdimensions within the group of selected items.
Multidimensional scales: factor analysis can be used to verify that the items empirically form the intended subscales.

Basic aspects of factor analysis
Factor analysis reduces a number of items to a smaller number of underlying groups of items, called factors. These factors can be indicators of separate constructs, or of different aspects of a single, rather heterogeneous construct. Factors are derived by analyzing the patterns of covariation (or correlation) among items: groups of items that interrelate more strongly with each other than with other items tend to form factors. The results are a function of the items entered: subscales with more items tend to produce stronger factors; subscales with few items tend to produce weak factors. Items that intercorrelate relatively highly are assumed to reflect the same construct (convergent validity); items that intercorrelate relatively weakly are assumed to reflect different constructs.

Exploratory factor analysis
Useful for determining the number of separate components that might exist within a group of items; the goal is to explore the dimensionality of the scale itself. Two questions must be addressed with a factor analysis:
1. the number of factors that represent the items
2. the interpretation of the factors

Steps in the factor analysis
Principal components are derived, with one factor (component) derived for each item analyzed. Each of these initial factors is associated with an eigenvalue, which reflects the relative proportion of variance accounted for by that factor. The sum of all eigenvalues equals the number of items (since there is one eigenvalue for each item). If all items correlated perfectly, they would produce a single factor with an eigenvalue equal to the number of items; the other eigenvalues would then equal zero.
After the number of factors has been determined, an orthogonal rotation procedure is applied to the factors. This results in a loading matrix that indicates how strongly each item relates to each factor. The loading matrix contains factor loadings, which are the correlations of each original variable with each factor: every variable (row of the matrix) has a loading for every factor (column of the matrix). A variable is said to 'load' on a factor if the factor loading is larger than about 0.30-0.35 (i.e., the correlation between item and factor is at least about .30).

Dooley, page 91: factor analysis identifies how many different constructs (or factors) are being measured by a test's items and the extent to which each item in the test is related to (loads on) each factor. Factor analysis uses the correlations among all items of a test to identify groups of items that correlate more highly among themselves than with items outside the group; each group of items defines a common factor. A small simulation of the eigenvalue logic follows below.
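A minimal simulation sketch (hypothetical data, numpy only, not a full factor analysis): six items written for two subscales should yield two large eigenvalues, and the eigenvalues of the item correlation matrix sum to the number of items.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 200
    f1 = rng.normal(size=n)  # latent factor behind items 1-3
    f2 = rng.normal(size=n)  # latent factor behind items 4-6
    X = np.column_stack(
        [f1 + rng.normal(scale=0.7, size=n) for _ in range(3)]
        + [f2 + rng.normal(scale=0.7, size=n) for _ in range(3)]
    )

    R = np.corrcoef(X, rowvar=False)            # 6 x 6 item correlation matrix
    eig = np.sort(np.linalg.eigvalsh(R))[::-1]  # eigenvalues, descending
    # Two eigenvalues well above 1 (the two factors); the sum equals 6.
    print(eig.round(2), round(eig.sum(), 2))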
Confirmatory factor analysis (CFA)
Allows the testing of a hypothesized structure: the items of the subscales are tested to see whether they support the intended subscale structure. With exploratory factor analysis, the best-fitting factor structure is derived from the data; with confirmatory factor analysis, the structure is hypothesized in advance and the data are fit to this structure.
Conducting the CFA: specify in advance the number of factors, the factor(s) on which each item will load, and whether or not the factors are intercorrelated. The loadings are again presented in a similar loading matrix. Each element is either fixed at zero (so the item does not load on that factor) or 'freed' so that its loading can be estimated by the analysis. Another matrix represents the correlations among the factors. The CFA yields estimates of the factor loadings and the interfactor correlations.
A CFA that fits well indicates that the subscale structure may explain the data, but it does not mean that it actually does. Support for the subscales of an instrument is not very strong evidence that they reflect their intended constructs; additional evidence is needed.

Validation strategy
Test validation requires a strategy of collecting as many different types of evidence as possible. The evidence becomes strong if it includes convergent validities based on very different operationalizations (measurements). The accumulation of validation evidence shows that a scale is working as expected. At this point you can say that the test demonstrates construct validity (supported, but never proven: new theory can always bring different findings that make the initial explanation incorrect).

Chapter 7: Reliability and norms

Reliability
Internal consistency reliability is an indicator of how well the individual items of a scale measure an underlying construct; coefficient alpha is the measure most often used. It is good practice to replicate the coefficient alpha in subsequent samples: estimates of reliability across different samples expand the generalizability of the scale to more subject groups.
Test-retest reliability reflects measurement consistency over time: how well does a scale correlate with itself across repeated administrations to the same respondents? The longer the time period, the lower the test-retest reliability is expected to be. Calculation: administer the test twice, identifying each respondent by a code; match the two administrations and calculate the correlation coefficient between them.

Norms
In order to interpret the meaning of scores, you need to know something about the distribution of scores in various populations. The normative approach uses the distribution of scores as the frame of reference to interpret the meaning of a score: an individual's score is compared to the distribution of all scores. Most scales are developed and normed on limited populations. To compile norms, collect data with the instrument on as many respondents as possible; reliability and validity studies provide data that can be added to the norms of the instrument. Calculate the mean and standard deviation across all respondents, and examine the shape of the distribution. Also study subpopulations: the availability of normative data on different groups (e.g. males and females, racial groups) increases the likelihood that meaningful comparisons can be made.
If you want to determine overall norms for a population, make sure that the sample is representative of the population of interest. A worked example of using norms follows below.
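For instance (the norm values here are hypothetical), a raw score can be interpreted against the normative distribution by converting it to a standard score:

    from statistics import NormalDist

    norm_mean, norm_sd = 36, 6          # hypothetical norms for the scale
    z = (42 - norm_mean) / norm_sd      # a respondent scoring 42 -> z = 1.0
    pct = NormalDist().cdf(z) * 100     # percentile, assuming a normal distribution
    print(f"z = {z:.1f}, about the {pct:.0f}th percentile")  # ~84th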