PSYCHOLOGICAL ASSESSMENT
Dr. Münevver Başman
munevver.kaya@marmara.edu.tr
munevver.rock@gmail.com

Test Development
The process of developing a test occurs in five stages:
• The idea for a test is conceived (test conceptualization).
• Items for the test are drafted (test construction).
• The first draft of the test is tried out on a group of sample testtakers (test tryout).
• The data from the tryout are collected, and the testtakers' performance on the test as a whole and on each item is analyzed. Statistical procedures (item analysis) are employed to assist in making judgments about which items are good as they are, which items need to be revised, and which items should be discarded.
• On the basis of the item analysis and related considerations, a revision or second draft of the test is created (test revision). This revised version is tried out on a new sample of testtakers, the results are analyzed, and the test is further revised if necessary, and so the cycle continues.

Test Conceptualization
• To define the construct that the test will measure, the literature about the construct should be reviewed.
• Available literature on existing tests designed to measure the construct should also be reviewed.
• A structured interview can be developed to explore the construct; it can also involve open-ended interviews.
• For psychological tests designed to be used in clinical settings, clinicians, patients, patients' family members, clinical staff, and others may be interviewed for insights that could assist in item writing.
• A sample of testtakers can be asked to write a composition about the construct.

Test Construction
• A rating scale can be defined as a grouping of words, statements, or symbols on which judgments of the strength of a particular trait, attitude, or emotion are indicated by the testtaker.
• A scale is termed a summative scale when the final test score is obtained by summing the ratings across all the items.
• One type of summative rating scale, the Likert scale, is used extensively in psychology, usually to scale attitudes (a minimal scoring sketch follows this section).

Deciding the Number of Categories
• The original Likert-type scale includes 5 response categories, from "strongly agree" to "strongly disagree". Later, Likert scales with two, three, four, six, and seven response categories were also used.
• Even versus odd number of categories: with an even number there is no neutral midpoint, so testtakers are forced to choose a positive or a negative category, and those with moderate views tend to leave the item blank. It is therefore better to have an odd number of response categories.
• Information is lost as the number of categories falls below 5, while the differences between categories become indistinguishable as the number rises above 5.
• For young children, respondents with low levels of education, and adults with limited language skills, "smiley" faces have been used as response categories in social-psychological research.
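To make the summative scoring rule concrete, here is a minimal Python sketch of Likert-type scale scoring. The respondents, the item names, and the choice of reverse-keyed item are hypothetical; only the rule itself (the scale score is the sum of the keyed ratings on a 5-category scale) comes from the text.

```python
# Minimal sketch: scoring a summative (Likert-type) scale.
# All respondents and item names below are hypothetical.

# Ratings on a 5-category scale (1 = totally disagree ... 5 = totally agree).
responses = [
    {"item1": 5, "item2": 4, "item3": 2},
    {"item1": 3, "item2": 3, "item3": 3},
    {"item1": 1, "item2": 2, "item3": 5},
]

# Negatively worded items are reverse-keyed before summing, so that a high
# scale score always indicates a high level of the construct.
REVERSE_KEYED = {"item3"}   # hypothetical: item3 is negatively worded
N_CATEGORIES = 5

def item_score(item: str, rating: int) -> int:
    """Reverse-key negative items: 1 <-> 5, 2 <-> 4, 3 stays 3."""
    return (N_CATEGORIES + 1) - rating if item in REVERSE_KEYED else rating

def scale_score(response: dict) -> int:
    """The scale score is the sum of the (keyed) item scores."""
    return sum(item_score(item, r) for item, r in response.items())

for i, resp in enumerate(responses, start=1):
    print(f"Respondent {i}: scale score = {scale_score(resp)}")
```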
Writing Items
• Items are written using the information obtained from the literature review and the composition exercise.
• Do not use statements that refer to the past rather than to the present.
• Avoid statements that are factual or could be interpreted as factual.
• Make sure no statement can be interpreted in more than one way.
• Do not use filler statements that are irrelevant to the psychological construct being investigated.
• Avoid universals such as all, always, none, and never, because they often introduce ambiguity.
• Words such as only, just, merely, and others of a similar evaluative nature should be used with care and moderation.
• Be sure to include statements that cover the entire range of the construct of interest.
• Keep the wording of the statements simple, clear, and direct.
• Keep the statements short (rarely exceeding 20 words).
• Be sure each statement contains only one complete thought.
• Write statements as simple sentences rather than compound or complex sentences unless this cannot be avoided.
• Avoid words that may not be understood by those who will be given the completed scale; make the reading level appropriate to the intended audience.
• Do not use double negatives.
• The numbers of positively and negatively worded statements should be approximately the same.
• Do not include both a positive and a negative version of the same statement in the tryout scale.
• Check the items for spelling errors and awkward expressions.
• Present positively and negatively worded items in mixed order.
• Label the response categories both numerically and verbally (numeric categories alone may cause confusion).
• The instructions should be easy to understand and as short as possible; they should state the aim of the scale, the number of items in the scale, the response categories, and the estimated response time.
• The identity of the respondents should be kept confidential, and this should also be stated in the instructions.
• Submit the draft scale to expert opinion.

Test Tryout
• The scale may be administered in printed or online form. Pay attention to layout features such as letter size and line spacing, make sure the items are clearly distinguishable from each other, and check the print quality or screen layout.
• The first draft of the test is tried out on a group of sample testtakers.
• The sample to which the scale is administered should represent the target population, because item statistics depend on the sample.
• The number of testtakers should be at least five times the number of items.

Item Analysis
• The main purpose of analyzing the data obtained from the test tryout is to develop a valid and reliable scale that includes the items with the best psychometric properties.
• The steps are: score the responses, compute item scores and total scores, carry out the item analysis, and select the items.

Scoring responses
• Each respondent's answer to each item is scored; positively and negatively worded items are keyed in opposite directions:

Category                  Positive item   Negative item
Totally disagree                1               5
Almost totally disagree         2               4
Sometimes agree                 3               3
Almost totally agree            4               2
Totally agree                   5               1

• The scale score of each respondent is the sum of the item scores.
• Negatively worded items are reverse scored, so that high scale scores always indicate a high level of the construct.

Item analysis examines:
• Item discrimination
• Item difficulty (for maximum performance tests)
• Item reliability
• Item validity

Item Discrimination
Two approaches are used (a sketch implementing both follows the evaluation table below):
• Correlation-based item analysis
• Difference between the means of the upper and lower groups

Correlation-based item analysis
• Item-total correlation: the correlation between each item and the scale score.
• Corrected item-total correlation: the correlation between each item and a scale score that excludes that item.
• The Pearson correlation (raw-score formula):

$$r = \frac{n\sum XY - \left(\sum X\right)\left(\sum Y\right)}{\sqrt{\left[n\sum X^{2} - \left(\sum X\right)^{2}\right]\left[n\sum Y^{2} - \left(\sum Y\right)^{2}\right]}}$$

Difference between means of upper and lower groups
• Respondents are ranked from the highest to the lowest total scale score.
• The top 27% of respondents form the upper group and the bottom 27% form the lower group.
• Is there a significant difference between the mean scores of the upper group and the lower group? For a discriminating item, the upper group mean should exceed the lower group mean.

Index of item discrimination   Evaluation of the item
0.40 and above                 Excellent.
0.30 - 0.39                    Good item; it can still be improved.
0.20 - 0.29                    The item should be improved.
0.19 and below                 Weak; it must be removed from the test.
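The following sketch illustrates both discrimination approaches on a small hypothetical response matrix: the raw-score Pearson formula above is used for the item-total and corrected item-total correlations, and respondents are then split into upper and lower 27% groups by total score. All data are invented, and the significance test of the group difference (for example, an independent-samples t-test) is omitted for brevity.

```python
# Minimal sketch: correlation-based item analysis and the upper-lower
# (27%) group comparison. Rows are respondents, columns are items,
# entries are keyed Likert scores (1-5); all values are hypothetical.
from math import sqrt

scores = [
    [5, 4, 4, 5],
    [4, 4, 3, 4],
    [3, 2, 4, 3],
    [2, 2, 1, 2],
    [1, 2, 2, 1],
    [4, 5, 3, 4],
    [2, 1, 2, 3],
    [5, 5, 4, 4],
]

def pearson_r(x, y):
    """Raw-score Pearson formula, as given in the text."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sx2 = sum(a * a for a in x)
    sy2 = sum(b * b for b in y)
    return (n * sxy - sx * sy) / sqrt((n * sx2 - sx**2) * (n * sy2 - sy**2))

totals = [sum(row) for row in scores]
n_items = len(scores[0])

for j in range(n_items):
    item = [row[j] for row in scores]
    # Corrected item-total correlation: the item is removed from the
    # total so it does not inflate its own correlation.
    rest = [t - i for t, i in zip(totals, item)]
    print(f"item {j + 1}: item-total r = {pearson_r(item, totals):.2f}, "
          f"corrected r = {pearson_r(item, rest):.2f}")

# Upper-lower comparison: rank respondents by total score and contrast
# the item means of the top 27% and the bottom 27%.
order = sorted(range(len(totals)), key=lambda i: totals[i], reverse=True)
k = max(1, round(0.27 * len(totals)))   # size of each group
upper, lower = order[:k], order[-k:]
for j in range(n_items):
    mean_upper = sum(scores[i][j] for i in upper) / k
    mean_lower = sum(scores[i][j] for i in lower) / k
    print(f"item {j + 1}: upper mean = {mean_upper:.2f}, "
          f"lower mean = {mean_lower:.2f}")
```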
Item Difficulty
• Item difficulty is computed from the upper and lower groups:

$$p_j = \frac{D_u + D_l}{2N}$$

where $D_u$ and $D_l$ are the numbers of correct answers in the upper and lower groups, and $N$ is the size of each group.

Worked example (Item 18; B* is the correct option):

Item 18        A    B*    C    D    E   Missing   Total
Upper (27%)    8    29    8    3    5      1        54
Lower (27%)   11     9   13   10    6      5        54
Total         19    38   21   13   11      6       108

p = (29 + 9) / 108 = 0.35

Item Discrimination
• The discrimination index based on the same groups is

$$r_{jx} = \frac{D_u - D_l}{N}$$

For Item 18: r_jx = (29 - 9) / 54 = 0.37

Another worked example, with 100 students (upper and lower groups of 27 each):

                            Item 1 correct   Item 2 correct
Upper group (27%, n = 27)         25               20
Lower group (27%, n = 27)         15               15

Item 1: p = (25 + 15) / 54 = 0.74,  r_jx = (25 - 15) / 27 = 0.37
Item 2: p = (20 + 15) / 54 = 0.65,  r_jx = (20 - 15) / 27 = 0.19

(A short sketch at the end of this handout reproduces these calculations.)

Item standard deviation and variance
• These give information about the differentiation of the item scores:

$$s_j^{2} = p_j\,(1 - p_j), \qquad s_j = \sqrt{p_j\,(1 - p_j)}$$

• If the difficulty index of an item is .60, the item variance is .60 × .40 = .24.

Item reliability
• The reliability of each item is directly proportional to the discrimination and the standard deviation of the item: $r_x = r_{jx}\, s_x$.
• The standard deviation of an item reaches its highest value when the item difficulty is 0.50. Therefore, an item difficulty at or near 0.50 increases the reliability of the item.

Item validity
• For typical performance tests: factor analysis.
• For maximum performance tests: the correlation between the score on item 1 and the score on the criterion measure (denoted by the symbol $r_{1c}$) is multiplied by item 1's item-score standard deviation ($s_1$), and the product $r_{1c} s_1$ is an index of the item's validity.

Revision
• On the basis of the item analysis and related considerations, a revision or second draft of the test is created. This revised version is tried out on a new sample of testtakers, the results are analyzed, and the test is further revised if necessary, and so the cycle continues.

TEST RELIABILITY
• Test-retest
• Parallel or alternate test forms
• Single-administration methods

TEST VALIDITY
• Face validity
• Content validity
• Criterion-related validity
• Construct validity
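As a closing illustration, here is a small sketch that reproduces the item difficulty, discrimination, variance, standard deviation, and item-reliability calculations from the worked examples above. The function name is hypothetical; the formulas are the ones given in the text.

```python
# Minimal sketch: item statistics from upper/lower (27%) group counts.
from math import sqrt

def item_stats(correct_upper, correct_lower, group_size):
    """Difficulty p = (Du + Dl) / (2N), discrimination r_jx = (Du - Dl) / N,
    variance p(1 - p), standard deviation sqrt(p(1 - p)),
    and the item reliability index r_jx * sd."""
    p = (correct_upper + correct_lower) / (2 * group_size)
    r_jx = (correct_upper - correct_lower) / group_size
    variance = p * (1 - p)
    sd = sqrt(variance)
    return {"p": round(p, 2), "r_jx": round(r_jx, 2),
            "variance": round(variance, 2), "sd": round(sd, 2),
            "reliability": round(r_jx * sd, 2)}

# Item 18: Du = 29, Dl = 9, N = 54  ->  p = 0.35, r_jx = 0.37
print(item_stats(29, 9, 54))

# 100-student example, Item 1: Du = 25, Dl = 15, N = 27
print(item_stats(25, 15, 27))

# 100-student example, Item 2: Du = 20, Dl = 15, N = 27
print(item_stats(20, 15, 27))
```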