Scale Design (Giles, 2002, chapter 8)

Psychometric assessment includes a range of tests, scales, and questionnaires.
Psychometric tests (e.g., competence tests)
Questionnaires (e.g., NZES)
Scales (e.g., attitudes)
  Thurstone (and Chave, 1929)
  Semantic differential (Osgood, Suci, & Tannenbaum, 1957)
  Likert

Creating an item pool
An item pool is the initial set of items that you will begin testing, with the aim of reducing them to a usable subset.
The items in the scale should be logically related, so as to reduce error.
One first step is to develop a test specification that summarises the intended nature and scope of the scale.
E.g., Attitudes to hunting:
- Morality of hunting
- Safety of hunting
- Relationship with nature
- General affective orientation
- Power and masculinity

Writing items
Things to avoid:
Complex or compound items
  Items with several clauses, allowing for ambiguous responses.
  ‘The royal family are a drain on the national economy and are a relic of a bygone age’
  ‘Should there be a reform of our justice system placing greater emphasis on the needs of victims, providing restitution and compensation for them and imposing minimum sentences and hard labour for all serious violent offences?’
Use of jargon
  Avoid technical terms, or terms which might have different meanings in different contexts/disciplines.
Incomplete/ambiguous items
  ‘If I found that my teenage son or daughter had smoked cannabis, I would be horrified’
  ‘The treaty is an important part of New Zealand history’
Statements of ‘fact’
  ‘Crime in British cities has increased since the last war’
  People may feel they are not in a position to comment.
Unidirectional phrasing

Piloting
Administer the scale items to a pilot sample.
The sample should be at least as big as the number of items plus one.

Item analysis
Facility Index (if the measure is a competence-based test)
The FI is the proportion of participants who get an item ‘correct’.
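As a rough illustration of the Facility Index, the sketch below computes the proportion of ‘correct’ responses per item. The data, the 0/1 scoring, and the function name `facility_index` are assumptions made for the example; they do not come from Giles (2002).

```python
def facility_index(item_scores):
    """Proportion of respondents who got the item 'correct' (scores coded 0 or 1)."""
    if not item_scores:
        raise ValueError("need at least one response")
    return sum(item_scores) / len(item_scores)


# Hypothetical pilot data: one row per respondent, one column per item (1 = correct).
responses = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 0],
    [1, 1, 1],
]

# Transpose so that each tuple holds every respondent's score on one item.
items = list(zip(*responses))
indices = [facility_index(item) for item in items]
print(indices)
```

An index near 1 means almost everyone got the item right and an index near 0 means almost no one did; in either case the item tells you little about differences between respondents.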
Discrimination Index
A DI is a measure of how well an item discriminates between high and low scale scorers.
- Identify the top and bottom 25% of OVERALL SCALE SCORES.
- Calculate the total summed responses to each item for each of the top and bottom 25%.
- Divide the total for the upper quartile by the total for the lower quartile.
- This is your DI; larger numbers indicate better discrimination.
You can also conduct t-tests of each item comparing the upper and lower quartiles.

Item-total correlations
The item-total correlation is the correlation between scores on an individual item and the scale total scores across all items.
High correlations indicate that an item is contributing to the ability of the scale to discriminate between people.
Giles (2002) recommends using items with correlations ≥ 0.20.

Example: Attitudes to Hunting

DI     I-T Corr   Item
2.80   .75        Hunting is a great way to get back to nature
1.34   .40        There’s no skill in hunting animals with guns
2.56   .74        Animals hunt each other so it’s fine for humans to hunt animals too
1.87   .68        Hunting animals is a cowardly pastime
1.32   .17        Hunting is one way to show our power over nature
2.06   .71        Hunting animals is unethical
1.32   .43        Hunting animals for food is part of the natural order of things
1.89   .71        Hunting is a dangerous pastime and should not be allowed
1.92   .63        People shouldn’t knock hunting till they’ve tried it
2.47   .73        Just because other animals prey on each other doesn’t mean we should hunt them too
2.17   .70        Hunting is a fun activity
2.47   .75        If a person respects nature, they shouldn’t go hunting animals
1.94   .48        Hunting is a perfectly safe activity
1.62   .54        We are not superior to animals so it’s not okay to hunt them
2.69   .72        Hunting for sport is a perfectly moral pastime
2.37   .71        Hunting animals is unfair because animals can’t shoot back
2.32   .58        Hunting is an activity for the strong of body and character
2.47   .80        Hunting animals is stupid
1.71   .31        People who don’t like hunting are wimps
2.66   .80        Hunting animals is cruel and should not be allowed

To check that the items form an internally consistent scale, check Cronbach’s alpha.
You do this using: Analyse > ‘Scale’ > ‘Reliability Analysis’.
Insert the variables you want alpha for, then: ‘Statistics’ > tick all of the options under ‘Descriptives’.
Overall, the full scale has a Cronbach’s alpha of .9158.
Without ‘power over nature’ the scale has an alpha of .9241.

Scale Reliability
•Internal reliability:
–Cronbach’s alpha; Kuder-Richardson 20; split-half; etc.
–You want to know that all of your items are highly intercorrelated. If not, then you have noise or heterogeneity in your measure. You may think that you’re measuring depression, let’s say, but you’re really measuring depression and anxiety.

Criteria for Cronbach’s Alpha
•The minimum acceptable level is .70. You would like to have the alpha be in the .80s, and you’re ecstatic if it goes into the .90s.
•Another example. Paul Jose and a colleague have written a new scale to measure parental facilitation of literacy and numeracy in preschool children (PFLNS), and sought to compare it with a pre-existing measure (Home Literacy Environment; HLE).

Some critical facts
•HLE: composed of 9 items, and the purported Cronbach’s alpha is .74. Has been shown by the authors to predict reading scores on two commonly used tests of reading: PPVT and PIAT-R.
•PFLNS: composed of 42 items, and no idea what the reliability would be. No validation yet.
•Research plan: Collect data from 200 parents on the HLE and PFLNS in Chicago and Wellington, and individually test these children (4 and 5 years old) on the TEMA and TERA.
•By so doing, we could examine the internal reliability, test-retest reliability, and validity data in one fell swoop. Let’s see how it turned out, starting with alpha for the HLE.
HLE:

Item       Scale Mean        Scale Variance    Corrected Item-     Alpha
           if Item Deleted   if Item Deleted   Total Correlation   if Item Deleted
TELLY       9.8764           5.3247            -.0525              .5345
CHECKS     11.1437           4.9534             .0673              .5124
NEWSPAP    10.8764           4.0394             .3439              .4127
MAGADULT   10.5833           3.7250             .3360              .4117
MAGCHILD   11.2730           4.6025             .1802              .4781
MOTHREAD    9.9770           4.3222             .3020              .4343
FATHREAD   10.0201           4.2157             .2905              .4362
CHILREAD    9.9195           4.6794             .2367              .4608
NUMBOOKS    9.7557           5.0497             .2037              .4776

Reliability Coefficients
N of Cases = 348.0    N of Items = 9    Alpha = .4955

PFLNS
•Cronbach’s alpha = .866 for 42 items.
•Inescapable conclusion: the PFLNS is internally reliable and the HLE is definitely not. It doesn’t usually turn out to be quite so clean.
•Something to remember: the more items you have (if they are similar), the higher your alpha. A 9-item scale must be very coherent in order to have a good alpha. We have 42 items, and they have 9.

Reliability
•Okay, we’ve demonstrated good internal reliability; are we done yet?
•No, because we don’t know if the scales have good reliability over time, usually called “test-retest reliability”. One simply correlates individuals’ scores from two administrations separated by a relatively short period of time (a few weeks to a month).
•What are they for the HLE and PFLNS? Answer: We don’t know yet. We have the data, but have not entered them yet. We could guess that the PFLNS would be better, again because of the larger number of items in it.

Reliability over time
•Why is this important? Because you want to know that whatever you’re measuring is relatively stable over time.
•But is that true for all measures? In the case of parental practices, the answer is yes. But in the case of rapidly changing variables, such as mood, you would not expect stability over time. So think about this before you gather the data and check it.

Validity
•There are four kinds of validity. Let’s review them.
–Face validity: do the items look like they measure what they’re supposed to measure?
–Convergent validity: does the measure correlate with similar measures and fail to correlate with dissimilar measures?
–Criterion validity: does the measure predict something that it is supposed to predict?
–Construct validity: the degree to which the measure accurately measures the hypothetical construct it is designed to measure.

Validity of the PFLNS
•So what kind of validity should we consider?
–Face validity: we created items that measured the degree to which parents did educationally enriching activities.
–Convergent validity: does our scale correlate with the HLE? We could have included a measure of anxiety, or something unrelated too.
–Criterion validity: does the scale predict scores from standardized tests of literacy and numeracy? This is the most important goal.
–Construct validity: does the scale predict the hypothetical construct of “parental behaviours that facilitate academic skills”? This would be the long-term goal of a number of data collections.

Face Validity
•HLE:
–Approximately how many books does your child own?
–How many hours of television does your child watch each week?
•PFLNS:
–Use maths in home routines, e.g., measuring ingredients for cooking.
–Do alphabet workbooks or worksheets.

Convergent validity
•Correlation between the HLE and PFLNS: r(322) = .245, p < .001.
•Correlation between the HLE and the PFLNS-Reading sub-score: r(322) = .259, p < .001.
•Correlation between the HLE and the PFLNS-Maths sub-score: r(322) = .190, p < .001.

Criterion Validity
•Correlation of the HLE with:
–Reading: b = .017, R² = .001, p = .82.
–Mathematics: b = .047, R² = .002, p = .35.
•Correlation of the PFLNS with:
–Reading: b = .238, R² = .06, p = .001.
–Mathematics: b = .158, R² = .03, p = .003.
•Conclusion? The PFLNS seems to do a better job of predicting maths and literacy scores than the HLE.

Construct Validity
•This is not easily demonstrated.
One needs to have the results from a variety of studies, all of which show that the new scale is a good predictor/correlate of related constructs.
•Other than the HLE, no pre-existing measure of parental behaviours exists with which we can correlate our new measure. One really needs to have 3-5 other measures to “triangulate” in on the hypothetical construct.
•One cannot measure a hypothetical construct directly, but one can use structural equation modeling to represent it as a latent variable indicated by several observed measures.
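The convergent and criterion checks reported earlier come down to correlating scale totals with other scores, and the same calculation is the building block for correlating a new scale with several related measures. The sketch below is only an illustration: the scores are invented, the variable names are ours, and it reports the correlation in the r(df) form used above, where df = N - 2.

```python
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


# Hypothetical scale totals and reading scores for six children (not real data).
scale_total = [10, 12, 15, 18, 20, 25]
reading = [3, 5, 4, 7, 8, 9]

r = pearson_r(scale_total, reading)
df = len(scale_total) - 2     # degrees of freedom reported in r(df)
r_squared = r ** 2            # with a single predictor, R squared equals r squared

print(f"r({df}) = {r:.3f}, R^2 = {r_squared:.3f}")
```

With one predictor, the squared correlation is the proportion of variance in the criterion accounted for by the scale, which is how the R² values in the criterion validity results can be read.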