UNIT IV ITEM ANALYSIS IN TEST DEVELOPMENT
CHAP 14: ITEM ANALYSIS
CHAP 15: INTRODUCTION TO ITEM RESPONSE THEORY
CHAP 16: DETECTING ITEM BIAS

CHAPTER 14 ITEM ANALYSIS
*The goal of test construction is to create a test of minimum length with good reliability and validity.
*Item analysis is the computation and examination of any statistical property of an item response distribution.
*Item analysis is the process we go through when constructing a new test, or subtests, from a pool of items so that the result has good reliability and validity.

*Categories of Item Parameters
*Item parameters (indices) fall into 3 categories:
1. Indices that describe the distribution of responses to a single item (e.g., the mean and variance of the item responses).
2. Indices that describe the degree of relationship between the response to the item and some criterion of interest.
Ex. The relationship between the questions (items) and the criterion of interest (i.e., depression) in Factor Analysis.
3. Indices that are a function of both, meaning the item variance/mean and the relationship to a criterion of interest.
Ex. First, find the variance/mean of your items; then calculate the relationship between the item variances and the criterion of interest (i.e., depression) for two groups.

ITEM DIFFICULTIES (P)
Computing item difficulties is one of the 7 steps in Item Analysis. We use item difficulties to select the best items.
P = f/N, where f = number of examinees who answered the item correctly and N = total number of participants (see your midterm item analysis and Chap 5).
The higher the P value, the easier the item.

*Steps in Item Analysis
In a typical item analysis the test developer takes 7 steps (they are similar to the process of test construction in Chapter 4).
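Ex. Before going through the steps, here is a minimal Python sketch of the item difficulty index P = f/N described above. The response matrix is made-up data, and the function names and the selection cutoffs (0.3 and 0.9) are hypothetical choices for illustration only, not values from the textbook.

```python
# Hypothetical 0/1 scored responses: each row is one examinee,
# each column is one item (1 = correct, 0 = incorrect). Made-up data.
responses = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [1, 0, 0, 1],
]


def item_difficulties(matrix):
    """Return P = f/N for every item (column) of a 0/1 response matrix."""
    n_examinees = len(matrix)          # N
    n_items = len(matrix[0])
    return [
        sum(row[j] for row in matrix) / n_examinees   # f / N
        for j in range(n_items)
    ]


def select_items(p_values, low, high):
    """Keep the indices of items whose difficulty is neither very low nor very high."""
    return [j for j, p in enumerate(p_values) if low <= p <= high]


p_values = item_difficulties(responses)

# Cutoffs are assumptions for this sketch; a real plan for item
# selection would set them to suit the purpose of the test.
selected = select_items(p_values, low=0.3, high=0.9)

print(p_values)   # item 0 is the easiest (everyone correct), item 2 the hardest
print(selected)
```

With this made-up data, item 0 has P = 1.0 (too easy to tell examinees apart) and item 2 has P = 0.2 (very hard), so only items 1 and 3 fall inside the hypothetical cutoffs.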

FYI: PROCESS OF TEST CONSTRUCTION (CHAP 4)
1. Identifying the purposes of test score use
2. Identifying behaviors to represent the construct
3. Preparing test specifications (i.e., Bloom's Taxonomy)
4. Item construction
5. Item review
6. Preliminary item tryouts
7. Field test
8. Statistical analysis
9. Reliability and validity
10. Guidelines

7 STEPS IN ITEM ANALYSIS (P)
1. Describe what properties of the test scores are of greatest importance.
Ex. When I select questions for your midterm/final exam, I look for similarity to the questions on qualifying/comprehensive or EPPP exams.
2. Identify the item parameters (e.g., mean, variance) most relevant to these properties.
3. Administer the items to a sample of examinees representative of those for whom the test is intended.
Ex. An IQ test for children or a depression test for adults.
4. Estimate for each item the parameters identified in step 2 (e.g., variance).
5. Establish a plan for item selection.
Ex. Using item difficulties (P), as in Item Analysis, to select the items.
6. Select the final subset of items, or use the data (the items in your Item Analysis) for test revision.
Ex. Take out all questions with very high or very low item difficulties.
7. Conduct a cross-validation (validity) study.
Ex. Use SPSS to compare the results of 2 tests or 2 classes (e.g., this year's class and last year's class), i.e., Confirmatory Factor Analysis.

UNIT V TEST SCORING AND INTERPRETATION
CHAP 17: CORRECTING FOR GUESSING AND OTHER SCORING METHODS
CHAP 18: SETTING STANDARDS
CHAP 19: NORMS AND STANDARD SCORES
CHAP 20: EQUATING SCORES FROM DIFFERENT TESTS

CHAPTER 19 NORMS AND STANDARD SCORES
1895 *Alfred Binet (1910): Ratio IQ = ratio of MA/CA
1912 *In Germany, Wilhelm Stern proposed the following formula, which standardized the ratio: IQ = [Mental age / Chronological age] × 100
This formula works fairly well for children but not for adults.
*The abbreviation "IQ" was coined by the psychologist William Stern for the German term Intelligenz-quotient.
1916 *Lewis Terman of Stanford University publishes the Stanford-Binet Intelligence Test. He used the standardized version: Ratio IQ = [Mental age / Chronological age] × 100
*Deviation IQ = uses norms to estimate the IQ.
We use norms when we want to compare an examinee's score (raw score) on a test to the distribution of scores (scaled or standard scores) for a sample from a well-defined population.
Ex. When we want to estimate the IQ of a 20-year-old person, we compare their raw score on the subtests of an IQ test with the scores of people their age, which is "their norm" (standard score). This technique tells us where they stand among people of their age.

*9 BASIC STEPS IN CONDUCTING A NORMING STUDY (P. 432)
1. Identify the population of interest.
Ex. Students, employees of a company, inmates, patients, etc.
2. Identify the most critical statistics that will be computed for the sample data.
Ex. Standard deviation σ, variance σ², M, SS, p
3. Decide on the tolerable amount of sampling error, that is, the discrepancy between the sample statistic (M) and the population parameter (µ): M − µ.
The Central Limit Theorem describes 3 characteristics of the distribution of sample means:
1. Central tendency (M = µ)
2. The shape of the distribution (normal)
3. Variability, or the standard error of the mean (σm)
4. Devise a procedure for drawing a sample from the population of interest. There are 4 types of probability sampling:
I. Simple Random Sampling: Give everyone in the population an equal chance of being selected. Ex. Draw names from a hat.
II. Systematic Sampling (N/n): Select every Kth name on the list, where K = N/n. Ex.
CAU population N = 1500 and your sample size n = 150: N/n = 1500/150 = 10, so select every 10th student.
III. Stratified Sampling: "Strata" means different layers. We use stratified sampling when we want to compare 2 different groups (e.g., male and female CAU doctoral students). First we randomly select males, then we randomly select females.
IV. Cluster Sampling: We use cluster sampling when the population consists of units rather than individuals, such as classes. Ex. Miami Dade School District: if we want to conduct research with the Miami Dade 2nd graders (1000 2nd-grade classes), we randomly select about 10 of these 1000 2nd-grade classes to be in our sample, then conduct the research.
5. Estimate the minimum sample size (n) required to hold the sampling error within the specified limits. There are different statistical procedures for estimating n. n should be ≥ 30 (law of large numbers).
1. n = (σ/d)², where d = effect size, d = (M − µ)/σ
2. n = (σ/σm)², where σm = σ/√n is the standard error of the mean for a population (Ex. z-score)
Sm = S/√n is the estimated standard error of the mean for a sample (Ex. t-distribution)

THE EFFECT SIZE
Ex. t-test for two independent samples

6. Draw the sample and collect the data.
7. Compute the values of the group statistics of interest and their standard errors: Sm = S/√n or σm = σ/√n. This is the standard error of the mean, which estimates the discrepancy between M and µ, also known as sampling error.
8. Identify the types of normative scores that will be needed, and prepare the normative score conversion table (see the next 2 slides).
9. Prepare written documentation of the normative scores.
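
Ex. A minimal Python sketch of three calculations from the norming steps above: the standard error of the mean (σm = σ/√n), the minimum sample size from formula 2 (n = (σ/σm)²), and systematic sampling (every Kth name, K = N/n). The function names, the values σ = 15 and tolerable σm = 1.5, and the random starting position are assumptions made for illustration; the slides only say to select every Kth name.

```python
import math
import random


def standard_error(sd, n):
    """Standard error of the mean: sigma_m = sigma / sqrt(n).

    Pass a sample standard deviation S instead of sigma to get S_m = S / sqrt(n).
    """
    return sd / math.sqrt(n)


def min_sample_size(sigma, tolerable_se):
    """Minimum n that holds the standard error within the tolerable limit.

    Implements n = (sigma / sigma_m)^2, rounded up to a whole examinee.
    """
    return math.ceil((sigma / tolerable_se) ** 2)


def systematic_sample(population, n, seed=None):
    """Select every Kth member of the list, with K = N // n.

    The random start within the first K members is an assumption of this
    sketch; it gives every member a chance of being selected.
    """
    k = len(population) // n
    start = random.Random(seed).randrange(k)
    return population[start::k][:n]


# Hypothetical values: sigma = 15 and a tolerable standard error of 1.5.
n_needed = min_sample_size(sigma=15, tolerable_se=1.5)
print(n_needed)                     # 100
print(standard_error(15, n_needed)) # 1.5

# The CAU example: N = 1500 students, n = 150, so K = 10.
students = list(range(1, 1501))     # stand-in ID numbers, not real data
sample = systematic_sample(students, 150, seed=0)
print(len(sample))                  # 150
```

Note that the two formulas are the same relationship solved in opposite directions: given n, `standard_error` returns σm, and given a tolerable σm, `min_sample_size` returns n.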

NORMS AND STANDARD SCORES
Types of Normative Scores
Raw Score: score on a subtest or a test.
Scaled Score: normative score for a specific age.

*NORMATIVE SCORES: Wechsler ("Wex-ler")

*Usefulness of Scaled Scores
Scaled scores are useful for two purposes:
1. Scaled scores relate the examinee's performance to the percentile rank scores of the norm group and their grade level.
2. In evaluation and research, the mean scaled score is a better estimate of average group performance than the mean raw score.
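
Ex. A minimal Python sketch of how a raw score can be converted to normative scores by comparing it with a norm group. The norm-group scores, the norm mean (30) and standard deviation (5) are made-up values, and the function names are hypothetical. The target scales are common conventions brought in as assumptions, not taken from the slides: Wechsler subtest scaled scores use mean 10 and SD 3, and the deviation IQ uses mean 100 and SD 15. The percentile rank uses one common definition (percent below plus half of the percent tied).

```python
def z_score(raw, norm_mean, norm_sd):
    """Distance of a raw score from the norm-group mean, in SD units."""
    return (raw - norm_mean) / norm_sd


def standard_score(raw, norm_mean, norm_sd, new_mean, new_sd):
    """Re-express a raw score on a scale with a chosen mean and SD."""
    return new_mean + new_sd * z_score(raw, norm_mean, norm_sd)


def percentile_rank(raw, norm_scores):
    """Percent of the norm group scoring below raw, counting ties as half."""
    below = sum(1 for s in norm_scores if s < raw)
    equal = sum(1 for s in norm_scores if s == raw)
    return 100 * (below + 0.5 * equal) / len(norm_scores)


# Made-up raw scores for a hypothetical norm group of the examinee's age.
norm_scores = [20, 25, 25, 30, 30, 30, 30, 35, 35, 40]
norm_mean, norm_sd = 30, 5   # assumed norm-group parameters for this sketch

raw = 35
print(standard_score(raw, norm_mean, norm_sd, 10, 3))     # 13.0 (scaled score)
print(standard_score(raw, norm_mean, norm_sd, 100, 15))   # 115.0 (deviation IQ)
print(percentile_rank(raw, norm_scores))                  # 80.0
```

A raw score of 35 means little on its own; the scaled score of 13, the deviation IQ of 115, and the percentile rank of 80 all say the same thing in different units, namely where the examinee stands among people of their age.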