COURSE SYLLABUS

Instructional Science 752: Test Theory
Fall 2005

Instructor: Dr. Richard Sudweeks, email: richard_sudweeks@byu.edu
Office: 150-M McKay Bldg.; Telephone: 422-7078
Office Hours: 9:00-11:50 a.m. and 1:00-2:00 p.m., Monday and Wednesday
Class Meeting Schedule: 4:00-5:20 p.m., Monday & Wednesday, 359 MCKB

Required Textbooks:
Embretson, S.E. & Reise, S.P. (2000). Item response theory for psychologists. Mahwah, NJ: Erlbaum.
Shavelson, R.J. & Webb, N.M. (1991). Generalizability theory: A primer. Newbury Park, CA: Sage.

Course Packets:
Course Materials for IPT 752
Supplementary Readings for IPT 752

Course Rationale:
Test theory is a discipline that focuses on explaining how reliable and valid inferences about unobservable characteristics of persons can be made from their responses to the items on a test. The discipline is also concerned with studying and resolving practical problems and issues associated with the development and use of tests and other assessment procedures. Modern test theory is a composite of classical true-score theory, item response theory, and generalizability theory. This composite theory has three important uses or applications:
1. It provides criteria for evaluating the adequacy of the process used in developing tests, and for judging the resulting instruments and procedures as a basis for deciding how and where they need to be improved.
2. It provides a set of procedures and a rationale for solving practical measurement problems such as estimating reliability and validity, conducting item analyses, detecting potentially biased items, equating test scores, and determining appropriate scoring procedures.
3. It provides a rationale for identifying the limitations of the scores obtained from a particular test and the cautions that should be kept in mind when interpreting the results.

Course Objectives:
The purpose of this course is to help students understand modern test theory and to critically examine current practices and problems in educational and psychological measurement from the perspective of this theory. Hence, the course is designed to help students become proficient in applying modern test theory to evaluate and improve existing assessment instruments and procedures, to solve practical measurement problems, and to interpret test scores in a responsible way. In addition to this main goal, the course is designed to help students be able to:
1. Distinguish between the classical and modern approaches to test theory, including the assumptions on which each is based and the implications of each for constructing, scoring, and interpreting tests.
2. Describe the relative advantages and disadvantages associated with classical true-score theory, and the contributions and limitations of generalizability theory and item response theory.
3. Use classical methods plus item response theory and generalizability theory to solve practical problems in estimating the reliability of test scores, conducting item analyses, detecting item bias, etc.
4. Understand the current controversies and problems in educational and psychological measurement and identify lines of inquiry for further research that are likely to be fruitful.
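As a compact point of reference for the classical true-score theory named in the Course Rationale above (standard notation, not tied to any one course text): the classical model treats an observed score X as a true score T plus random error E, and defines reliability as the proportion of observed-score variance attributable to true scores:

$$X = T + E, \qquad \rho_{XX'} = \frac{\sigma_T^2}{\sigma_X^2} = \frac{\sigma_T^2}{\sigma_T^2 + \sigma_E^2}.$$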
Software:
Students are expected to develop proficiency in using four computer programs:

1. EXCEL (or some other program with both spreadsheet and graphing capabilities): This program will be used frequently to provide demonstrations and examples of concepts taught in this course. Some homework assignments are facilitated by using spreadsheet software. Students who know how to use a spreadsheet can readily create and examine examples of their own. An instructional module developed by Robert Miller will be provided to help students become proficient in using EXCEL. This module includes explanations, demonstrations, examples, and learning activities.

2. SPSS (for use in conducting item analyses, reliability studies, and factor analyses).

3. GENOVA (for use in conducting both G studies and D studies using generalizability theory): Both PC and Macintosh versions of this program, developed by Joe Crick and Robert Brennan, are available from the authors. GENOVA handles only balanced designs; a newer version called urGENOVA handles both balanced and unbalanced designs. The assignments in this class will include balanced designs only, so GENOVA is sufficient. Copies of GENOVA are available on some computers in the IP&T graduate student lab in 150 MCKB. The manual is on reserve at the Lee Library, but directions for preparing control files to complete the homework assignments are included in the Course Packet.

4. WINSTEPS (for Rasch analyses using the dichotomous, rating scale, or partial-credit models): This program, developed by Michael Linacre, can handle data sets with up to 10,000 items and 1,000,000 persons. A small-scale student version called MINISTEP may be downloaded free from www.winsteps.com/ministep.htm. This reduced version is limited to a maximum of 25 items and 100 cases, but that will be sufficient for the assignments in this class (see the sketch following this list).
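As a small illustration of the model that WINSTEPS estimates (a minimal Python sketch with made-up item difficulties, not a substitute for the program itself), the dichotomous Rasch model gives the probability of a correct response as a logistic function of the difference between person ability and item difficulty:

```python
import numpy as np

def rasch_prob(theta, b):
    """Dichotomous Rasch model: P(X = 1) = exp(theta - b) / (1 + exp(theta - b))."""
    return 1.0 / (1.0 + np.exp(-(theta - b)))

# Hypothetical item difficulties (in logits) and one examinee's ability
b = np.array([-1.5, -0.5, 0.0, 0.8, 2.0])
theta = 0.5

p = rasch_prob(theta, b)
print("P(correct) for each item:", np.round(p, 3))
print("Expected number-correct score:", round(float(p.sum()), 2))
```

Programs such as WINSTEPS or MINISTEP estimate the theta and b values from actual response data; the sketch only shows how fitted parameters translate into response probabilities.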
Course Requirements:
1. Become proficient in using each of the four computer programs described above.
2. Complete all assigned homework exercises.
3. Successfully complete the four examinations: three interim exams and a final exam.
4. Locate two published journal articles of interest to you, each focusing on the use of item response theory or generalizability theory. For each article, prepare a written summary and critique in which you:
   a. State the problem addressed by the study, including any research questions or hypotheses.
   b. Describe how the method (IRT or G-theory) was used in the study to address its purpose.
   c. Critique the use of the method in the context of the study:
      (1) Was the method appropriate in this context, or would another method be more appropriate?
      (2) To what extent are the researcher's conclusions supported by the data?
5. Complete a data analysis project involving the use of IRT or G-theory.

Testing and Grading Policy:
Course grades will be determined by performance on the four exams (50%), the two journal article critiques (15% each = 30%), and the homework exercises (20%). Students are expected to complete the reading assignments prior to and in preparation for discussion in class. Homework exercises should also be completed in preparation for review and discussion in class. The examinations will include problems similar to those encountered in the homework, so students who understand the homework exercises should perform well on the examinations.

Course Schedule:

8/28   Course overview and introduction to test theory. Supplementary: Ghiselli et al., ch. 1
8/30   Basic statistics in educational and psychological measurement. Supplementary: Miller (2000)
9/8    Test scores as composites. Text: Crocker & Algina, ch. 5. Supplementary: Ghiselli et al., ch. 7
9/13   Test dimensionality. Text: Netemeyer et al., chs. 1-2. Supplementary: Tate (2002)
9/15   Classical test theory. Supplementary: Traub (1997; SRP)
9/20   Methods of estimating reliability. Text: Netemeyer et al., ch. 3. Supplementary: Traub & Rowley (1991; SRP)
9/22   Coefficient alpha. Supplementary: Cortina (1993); Miller (1995); Schmitt (1996)
9/27   Estimating true scores. Supplementary: Harvill (1991; SRP)
9/29   Reliability generalization. Supplementary: Vacha-Haase (1998); Thompson & Vacha-Haase (2000)
10/4   Classical item analysis. Text: Crocker & Algina, ch. 14
10/4-10/11   FIRST EXAM
10/6   Introduction to Rasch scaling and item response theory. Text: Embretson & Reise, chs. 1-3. Supplementary: Henard (2000)
10/11  Binary & polytomous IRT models. Text: Embretson & Reise, chs. 4-5. Supplementary: Harris (1989)
10/13  Scale meaning and properties. Text: Embretson & Reise, ch. 6. Supplementary: Hambleton & Jones (1993)
10/18  Measuring persons and calibrating items. Text: Embretson & Reise, chs. 7-8
10/20  Assessing model fit. Text: Embretson & Reise, ch. 9. Supplementary: Smith (2000)
10/25  Using IRT software. Supplementary: Linacre & Wright (2000)
10/27  Typical applications of Rasch procedures. Supplementary: Schulman & Wolfe (2000)
10/27-11/3   SECOND EXAM
11/1   Introduction to generalizability theory. Text: Shavelson & Webb, ch. 1. Supplementary: Hoyt & Melby (1999); Strube (2000); Brennan (1992)
11/3   Analysis of variance and variance component estimation. Text: Shavelson & Webb, ch. 2
11/8   G studies with crossed facets. Text: Shavelson & Webb, ch. 3
11/10  G studies with nested facets. Text: Shavelson & Webb, ch. 4
11/15  D studies based on the original G-study design. Text: Shavelson & Webb, chs. 6-7
11/17  D studies based on alternative G-study designs. Text: Shavelson & Webb, chs. 8-9
11/22  Using generalizability studies. Supplementary: Webb et al. (1988); Sudweeks et al. (2004)
11/27  G studies & D studies with fixed facets. Text: Shavelson & Webb, ch. 5. Supplementary: Strube (2000)
11/27-12/4   THIRD EXAM
11/29  Introduction to validity. Text: Netemeyer et al., chs. 4-6. Supplementary: Clark & Watson (1995)
12/1   Additional validity issues. Supplementary: Bryant (2000); Benson (1998)
12/6   Factor analytic studies of construct validity. Text: Brace, Kemp & Snelgar, ch. 11. Supplementary: Thompson & Daniel (1996)
12/14  FINAL EXAM (Tuesday, 7:00-10:00 a.m.)
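To make the G-theory sessions above more concrete, the following Python sketch (hypothetical data; GENOVA automates and generalizes this kind of computation) estimates the variance components for the simplest balanced G-study design, persons crossed with items (p x i), and then uses them in a D study to compute a generalizability coefficient:

```python
import numpy as np

# Hypothetical scores: 5 persons x 4 items, fully crossed, balanced
X = np.array([
    [7, 6, 7, 5],
    [9, 8, 9, 8],
    [4, 5, 3, 4],
    [6, 6, 5, 6],
    [8, 7, 8, 7],
], dtype=float)

n_p, n_i = X.shape
grand = X.mean()

# ANOVA sums of squares for the two-way crossed design
ss_p = n_i * np.sum((X.mean(axis=1) - grand) ** 2)     # persons
ss_i = n_p * np.sum((X.mean(axis=0) - grand) ** 2)     # items
ss_res = np.sum((X - grand) ** 2) - ss_p - ss_i        # p x i interaction + error

ms_p = ss_p / (n_p - 1)
ms_i = ss_i / (n_i - 1)
ms_res = ss_res / ((n_p - 1) * (n_i - 1))

# Solve the expected-mean-square equations for the G-study variance components
var_res = ms_res                 # sigma^2(pi,e)
var_p = (ms_p - ms_res) / n_i    # sigma^2(p)
var_i = (ms_i - ms_res) / n_p    # sigma^2(i)

# D study: generalizability coefficient for relative decisions with n_i items
g_coef = var_p / (var_p + var_res / n_i)

print(f"var(p) = {var_p:.3f}, var(i) = {var_i:.3f}, var(pi,e) = {var_res:.3f}")
print(f"Generalizability coefficient ({n_i} items): {g_coef:.3f}")
```

Changing the divisor in the last step to a different number of items is the essence of a D study; GENOVA reports the same components and coefficients for far more complex designs.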
Supplementary Readings Packet:

Brennan, R.L. (1992). Generalizability theory. Educational Measurement: Issues and Practice, 11(4), 27-34.
Hambleton, R.K. & Jones, R.W. (1993). Comparison of classical test theory and item response theory and their applications to test development. Educational Measurement: Issues and Practice, 12(3), 38-47.
Harris, D. (1989). Comparison of the 1-, 2-, and 3-parameter IRT models. Educational Measurement: Issues and Practice, 8(1), 35-41.
Harvill, L.M. (1991). Standard error of measurement. Educational Measurement: Issues and Practice, 10(2), 33-41.
Traub, R.E. & Rowley, G.L. (1991). Understanding reliability. Educational Measurement: Issues and Practice, 10(1), 37-45.

References for Supplementary Readings:
Complete bibliographic information for each of the Supplementary Readings listed in the Course Schedule is given below. Copies of the books referenced are available on two-hour reserve at the Reserve Desk in the Lee Library. Complete copies of the journal articles that are not copyrighted are included in the Supplementary Readings packet. The other periodical articles are available either on Electronic Reserve or in the Periodicals Room at the Lee Library.

Benson, J. (1998). Developing a strong program of construct validation: A test anxiety example. Educational Measurement: Issues and Practice, 17(1), 10-17 & 22.
Crocker, L. & Algina, J. (1986). Introduction to classical and modern test theory. New York: Holt, Rinehart & Winston. [Reserve Desk at Lee Library]
Henard, D.H. (2000). Item response theory. In L.G. Grimm & P.R. Yarnold (Eds.), Reading and understanding more multivariate statistics (pp. 67-97). Washington, DC: American Psychological Association. [Reserve Desk at Lee Library]
Hoyt, W.T. & Melby, J.N. (1999). Dependability of measurement in counseling psychology: An introduction to generalizability theory. Counseling Psychologist, 27, 325-351. [Electronic Reserve]
Linacre, J.M. & Wright, B.D. (2000). A user's guide to WINSTEPS. Chicago: MESA Press. [Reserve Desk at Lee Library]
McKinley, R.L. (1989). An introduction to item response theory. Measurement and Evaluation in Counseling and Development, 22, 37-57. [Electronic Reserve]
Schulman, J.A. & Wolfe, E.W. (2000). Development of a nutrition self-efficacy scale for prospective physicians. Journal of Applied Measurement, 1, 107-130.
Smith, R.M. (2000). Fit analysis in latent trait measurement models. Journal of Applied Measurement, 1, 199-218.
Strube, M.J. (2000). Reliability and generalizability theory. In L.G. Grimm & P.R. Yarnold (Eds.), Reading and understanding more multivariate statistics (pp. 23-66). Washington, DC: American Psychological Association. [Reserve Desk at Lee Library]
Tate, R. (2002). Test dimensionality. In G. Tindal & T.M. Haladyna (Eds.), Large-scale assessment programs for all students: Validity, technical adequacy, and implementation (pp. 181-211). Mahwah, NJ: Erlbaum.
Thompson, B. & Vacha-Haase, T. (2000). Psychometrics is datametrics: The test is not reliable. Educational and Psychological Measurement, 60, 174-195. [Periodicals Room at Lee Library]
Traub, R.E. (1997). Classical test theory in historical perspective. Educational Measurement: Issues and Practice, 16(4), 8-14. [Electronic Reserve]
Vacha-Haase, T. (1998). Reliability generalization: Exploring variance in measurement error affecting score reliability across studies. Educational and Psychological Measurement, 58, 6-20. [Electronic Reserve]
Webb, N.M., Rowley, G.L. & Shavelson, R.J. (1988). Using generalizability theory in counseling and development. Measurement and Evaluation in Counseling and Development, 21, 81-90. [Electronic Reserve]

Additional Helpful References:
The books listed below are additional sources that are useful supplements for this course. Copies of most are available at the Lee Library Reserve Desk.

Allen, M.J. & Yen, W.M. (1979). Introduction to measurement theory. Monterey, CA: Brooks/Cole.
Bond, T.G. & Fox, C.M. (2001). Applying the Rasch model: Fundamental measurement in the human sciences. Mahwah, NJ: Erlbaum.
Brace, N., Kemp, R., & Snelgar, R. (2003). SPSS for psychologists (2nd ed.). Mahwah, NJ: Erlbaum.
Brennan, R.L. (1992). Elements of generalizability theory (Rev. ed.). Iowa City, IA: American College Testing Program.
Brennan, R.L. (2001). Generalizability theory. New York: Springer-Verlag.
Crick, J.E. & Brennan, R.L. (1983). Manual for GENOVA: A generalized analysis of variance system (ACT Technical Bulletin No. 43). Iowa City, IA: American College Testing Program.
Fischer, G.H. & Molenaar, I.W. (Eds.) (1995). Rasch models: Foundations, recent developments, and applications. New York: Springer.
Ghiselli, E.E., Campbell, J.P. & Zedeck, S. (1981). Measurement theory for the behavioral sciences. San Francisco: W.H. Freeman.
Gulliksen, H. (1950). Theory of mental tests. New York: John Wiley.
Hambleton, R.K., Swaminathan, H., & Rogers, H.J. (1991). Fundamentals of item response theory. Newbury Park, CA: Sage Publications.
Linn, R.L. (Ed.) (1989). Educational measurement (3rd ed.). New York: Macmillan.
McDonald, R.P. (1999). Test theory: A unified treatment. Mahwah, NJ: Erlbaum.
Nunnally, J.C. & Bernstein, I.H. (1994). Psychometric theory (3rd ed.). New York: McGraw-Hill.
Rust, J. & Golombok, S. (1999). Modern psychometrics: The science of psychological assessment (2nd ed.). London: Routledge.
Smith, E.V., Jr. & Smith, R.M. (2004). Introduction to Rasch measurement. Maple Grove, MN: JAM Press.
Thompson, B. (2002). Score reliability: Contemporary thinking on reliability issues. Thousand Oaks, CA: Sage.
Thorndike, R.L. (1982). Applied psychometrics. Boston: Houghton Mifflin.
van der Linden, W.J. & Hambleton, R.K. (Eds.) (1997). Handbook of modern item response theory. New York: Springer.

RELATED PERIODICALS:
Students are also encouraged to become familiar with the following journals, which often include articles and research reports related to issues treated in this course. Copies of these journals are available in the Periodicals Room at the Lee Library.

Applied Measurement in Education
Applied Psychological Measurement
Educational and Psychological Measurement
Educational Measurement: Issues and Practice
Journal of Educational and Behavioral Statistics
Journal of Educational Measurement
Measurement and Evaluation in Counseling and Development