UNIVERSITI TEKNOLOGI MALAYSIA

BORANG PENGESAHAN STATUS TESIS

JUDUL: STATISTICAL APPROACH ON GRADING: MIXTURE MODELING
SESI PENGAJIAN: 2005/2006

Saya ZAIRUL NOR DEANA BINTI MD DESA mengaku membenarkan tesis (PSM/Sarjana/Doktor Falsafah)* ini disimpan di Perpustakaan Universiti Teknologi Malaysia dengan syarat-syarat kegunaan seperti berikut:

1. Tesis adalah hak milik Universiti Teknologi Malaysia.
2. Perpustakaan Universiti Teknologi Malaysia dibenarkan membuat salinan untuk tujuan pengajian sahaja.
3. Perpustakaan dibenarkan membuat salinan tesis ini sebagai bahan penukaran antara institusi pengajian tinggi.
4. ** Sila tandakan (√):
   [ ] SULIT (Mengandungi maklumat yang berdarjah keselamatan atau kepentingan Malaysia seperti yang termaktub di dalam AKTA RAHSIA RASMI 1972)
   [ ] TERHAD (Mengandungi maklumat TERHAD yang telah ditentukan oleh organisasi/badan di mana penyelidikan dijalankan)
   [√] TIDAK TERHAD

Disahkan oleh

(TANDATANGAN PENULIS)
Alamat Tetap: NO. 114 TAMAN ORKID, FASA II, SG. LAYAR, 08000 SG. PETANI, KEDAH DARUL AMAN
Tarikh: APRIL 2006

(TANDATANGAN PENYELIA)
Dr. ISMAIL B. MOHAMAD
Tarikh: APRIL 2006

CATATAN:
* Potong yang tidak berkenaan.
** Jika tesis ini SULIT atau TERHAD, sila lampirkan surat daripada pihak berkuasa/organisasi berkenaan dengan menyatakan sekali sebab dan tempoh tesis ini perlu dikelaskan sebagai SULIT atau TERHAD.
♦ Tesis dimaksudkan sebagai tesis bagi Ijazah Doktor Falsafah dan Sarjana secara penyelidikan, atau disertasi bagi pengajian secara kerja kursus dan penyelidikan, atau Laporan Projek Sarjana Muda (PSM).

"I hereby declare that I have read this dissertation and in my opinion this dissertation is sufficient in terms of scope and quality for the award of the degree of Master of Science (Mathematics)."

Signature: …………………………………..
Supervisor: Dr. Ismail B. Mohamad
Date: 14th April 2006

STATISTICAL APPROACH ON GRADING: MIXTURE MODELING

ZAIRUL NOR DEANA BINTI MD DESA

A dissertation submitted in partial fulfillment of the requirements for the award of the degree of Master of Science (Mathematics)

Faculty of Science
Universiti Teknologi Malaysia

APRIL 2006

DECLARATION

I declare that this thesis entitled "Statistical Approach on Grading: Mixture Modeling" is the result of my own research except as cited in the references. The thesis has not been accepted for any degree and is not concurrently submitted in candidature of any other degree.

Signature: …………………………………..
Name: Zairul Nor Deana Binti Md Desa
Date: 14th April 2006

DEDICATION

Especially for my beloved parents,
Mak, Tok and Tokwan, who taught me to trust myself and love all things great and small,
My siblings ~ Abg Am, Dila, Fatin and Aida,
Zaha,
& all teachers, lecturers and friends

ACKNOWLEDGEMENT

First of all, I would like to thank Allah S.W.T., the Lord Almighty, for the health and perseverance needed to complete this thesis. I would like to express my appreciation to my supervisor, Dr. Ismail B. Mohamad, for his meticulous and painstaking review and his tireless effort in reading an earlier draft of this thesis. In addition, I thank the chairperson, Dr. Arifah Bahar, and both internal examiners, Dr. Zarina Mohd Khalid and Tn. Hj. Hanafiah Mat Zin, for their helpful detailed comments that improved the final report. My sincere appreciation also extends to the lecturers and my fellow postgraduate colleagues of the Department of Mathematics at Universiti Teknologi Malaysia who provided me with valuable input. Finally, I gratefully acknowledge the support of my family and the anonymous referees for their patience and understanding.
ABSTRACT

The purpose of this study is to compare the results obtained from three methods of assigning letter grades to students' achievement. The conventional and most popular method of assigning grades is the Straight Scale method. Statistical approaches using the Standard Deviation and conditional Bayesian methods are also considered for assigning the grades. In the conditional Bayesian model, we assume the data to follow a Normal Mixture distribution in which the grades are distinctly separated by the parameters of the mixture: its means and proportions. The problem lies in estimating the posterior density of these parameters, which is analytically intractable. A solution to this problem is the Markov Chain Monte Carlo method, namely the Gibbs sampler algorithm. The Gibbs sampler is applied using the WinBUGS programming package. The Straight Scale, Standard Deviation and conditional Bayesian methods are applied to the examination raw scores of 560 students. The performance of these methods is compared using the Neutral Class Loss, the Lenient Class Loss and the Coefficient of Determination. The results show that the conditional Bayesian method outperforms the conventional methods of assigning grades.

ABSTRAK

Tujuan kajian ini adalah untuk membandingkan keputusan yang didapati daripada tiga kaedah memberi gred kepada pencapaian pelajar. Kaedah konvensional yang popular adalah kaedah Skala Tegak. Pendekatan statistik yang menggunakan kaedah Sisihan Piawai dan kaedah Bayesian bersyarat dipertimbangkan untuk memberi gred. Dalam model Bayesian, dianggapkan bahawa data adalah mengikut taburan Normal Tergabung di mana setiap gred dipisahkan secara berasingan oleh parameter-parameter, iaitu min-min dan kadar bandingan taburan Normal Tergabung. Masalah yang timbul adalah sukar untuk menganggarkan ketumpatan posterior bagi parameter-parameter tersebut secara analitik. Satu penyelesaian bagi masalah ini adalah dengan menggunakan kaedah Markov Chain Monte Carlo iaitu melalui algoritma persampelan Gibbs. Algoritma persampelan Gibbs diaplikasikan dengan menggunakan pekej perisian pengaturcaraan WinBUGS. Kaedah Skala Tegak, kaedah Sisihan Piawai dan kaedah Bayesian bersyarat dijalankan terhadap markah mentah peperiksaan daripada 560 orang pelajar. Pencapaian ketiga-tiga kaedah dibandingkan melalui nilai Kehilangan Kelas Neutral, Kehilangan Kelas Tidak Tegas dan Pekali Penentuan. Keputusan yang diperolehi menunjukkan bahawa kaedah Bayesian Bersyarat memberikan pencapaian yang lebih baik berbanding kaedah Skala Tegak dan kaedah Sisihan Piawai.
TABLE OF CONTENTS

CHAPTER  TITLE  PAGE

Cover
Declaration  ii
Dedication  iii
Acknowledgement  iv
Abstract  v
Abstrak  vi
Table of Contents  vii
List of Tables  x
List of Figures  xi
List of Appendixes  xiii
Nomenclature  xiv

1  RESEARCH FRAMEWORK  1
1.1  Introduction  1
1.2  Statement of the Problem  2
1.3  Research Objectives  3
1.4  Scope of the Study  3
1.5  Significance of the Study  4
1.6  Research Layout  5

2  REVIEW OF GRADING PLAN AND GRADING METHODS  7
2.1  Introduction  7
2.2  Grading Philosophies  10
2.3  Definition and Designation of Measurement  12
2.3.1  Levels of Measurement  16
2.3.2  Norm-Referenced Versus Criterion-Referenced Measurement  18
2.4  Weighting Grading Components  20

3  GRADING ON CURVES AND BAYESIAN GRADING  24
3.1  Introduction  24
3.2  Grading On Curves  25
3.2.1  Linearly Transformation Scores  25
3.2.2  Model Set Up for Grading on Curves  26
3.2.3  Standard Deviation Method  27
3.3  Bayesian Grading  31
3.3.1  Distribution-Gap  32
3.3.2  Why Bayesian Inference?  33
3.3.3  Preliminary View of Bayes' Theorem  35
3.3.4  Bayes' Theorem  37
3.3.5  Model Set Up for Bayesian Grading  41
3.3.6  Bayesian Methods for Mixtures  41
3.3.7  Mixture of Normal (Gaussian) Distribution  45
3.3.8  Prior Distribution  48
3.3.9  Posterior Distribution  54
3.4  Interval Estimation  61

4  NUMERICAL IMPLEMENTATION OF THE BAYESIAN GRADING  64
4.1  Introduction to Markov Chain Monte Carlo Methods  65
4.2  Gibbs Sampling  66
4.3  Introduction to WinBUGS Computer Program  69
4.4  Model Description  69
4.5  Setting the Priors and Initial Values  74
4.5.1  Setting the Prior  74
4.5.2  Initial Values  77
4.6  Label Switching in MCMC  77
4.7  Sampling Results  78
4.7.1.1  Case 1: Small Class  79
4.7.1.2  Convergence Diagnostics  83
4.7.2.1  Case 2: Large Class  87
4.7.2.2  Convergence Diagnostics  93
4.8  Discussion  96
4.9  Loss Function and Leniency Factor  101
4.10  Performance Measures  105

5  CONCLUSION AND SUGGESTION  108
5.1  Conclusion  108
5.2  Suggestions  110

REFERENCES  112
Appendix A – F  116-147

LIST OF TABLES

TABLE NO.  TITLE  PAGE
2.1  Comparison of Norm-Referenced and Criterion-Referenced  19
2.2  Rubrics for Descriptive Scale  19
3.1  Grading on Curve Scales for the Scores between Which a Certain Letter Grade is Assigned, the Mean is "set" at C+  30
4.1  Optimal Estimates of Component Means for Case 1  81
4.2  Minimum and Maximum Score for Each Letter Grade, Percent of Students and Probability of Raw Score Receiving that Grade for GB: Case 1  82
4.3  Straight Scale and Standard Deviation Methods: Case 1  82
4.4  Optimal Estimates of Component Means for Case 2  89
4.5  Minimum and Maximum Score for Each Letter Grade, Percent of Students and Probability of Raw Score Receiving that Grade for GB: Case 2  90
4.6  Straight Scale and Standard Deviation Methods: Case 2  90
4.7  Posterior for 95% Credible Interval of Component Means and its Ratio  98
4.8  Leniency Factor and Loss Function Constant  104
4.9  Cumulative Probability for GB; Case 1  105
4.10  Performance of GB, Straight Scale and Standard Deviation Methods: Case 1  107

LIST OF FIGURES

FIGURE NO.  TITLE  PAGE
1.1  A Functional Mapping of Letter Grades  14
1.2  A Partition on Letter Grades  15
3.1  Plot of the Raw Scores and Corresponding Transformed Scores  26
3.2  Relationship among Different Types of Transformation Scores in a Normal Distribution; µ = 60, σ = 10  30
3.3  Hierarchical Representation of a Mixture  44
3.4  Normal Mixture Model Outlined on Each Letter Grade  45
4.1  Graphical Model for Bayesian Grading  73
4.2  Kernel-Density Plots of Posterior Marginal Distribution of Mean for Grade B+  85
4.3  Monitoring Plots for Traces Diagnostics of Mean: (a) Grade D and (b) Grade B+  86
4.4  Gelman-Rubin Convergence Diagnostics of Mean; (a) Grade D and (b) Grade B+  86
4.5  Quantiles Diagnostics of Mean; (a) Grade D and (b) Grade B+  87
4.6  Autocorrelations Diagnostics of Mean; (a) Grade D and (b) Grade B+  87
4.7  Kernel-Density Plots of Posterior Marginal Distribution of Mean for Grade B  92
4.8  Monitoring Plots for Traces Diagnostics of Mean: (a) Grade B and (b) Grade A  94
4.9  Gelman-Rubin Convergence Diagnostics of Mean; (a) Grade B and (b) Grade A  94
4.10  Quantiles Diagnostics of Mean; (a) Grade B and (b) Grade A  95
4.11  Autocorrelations Diagnostics of Mean; (a) Grade B and (b) Grade A  95
4.12  Cumulative Distribution Plots for Straight Scale (dotted line) and GB Method; (a) Case 1 and (b) Case 2  99
4.13  Density Plots with Histogram for Case 1  100
4.14  Density Plots and Histogram for Case 2  100

LIST OF APPENDIXES

APPENDIX  TITLE  PAGE
A1  Normal Distribution Table  116
A2  Grading via Standard Deviation Method for Selected Means and Standard Deviation  117
B  The Probability of Set Function and Mixture Model  119
C  Weighting Grades Component  123
D  Some Useful Integrals - The Gamma, Inverse Gamma and Related Integrals  124
E  WinBUGS for Bayesian Grading  125
F  Bayes, Metropolis and David A. Frisbie  147

NOMENCLATURE

GC - Grading on Curves
GB - Conditional Bayesian Grading
MCG - Multi-Curve Grading
MCMC - Markov Chain Monte Carlo
G - Grade Sample Space
N - Number of Students in a Class
ng - Number of Students for Grade g
B - Burn-In Period
T - Number of Iterations
h{θ|x} - Conditional Probability Density of the Prior
L{x|θ} - Conditional Likelihood Function of the Raw Score
p(·|x) - Conditional Distribution of the Conjugate Prior or Posterior Density
π(θ) - Prior Distribution
m(x) - Marginal Density of the Raw Score
p(xi) - Probability Distribution of the Raw Score
πg - Component Probability of Component g
θ - Parameter of Interest (Conjugate Prior)
Θ - Vector of Parameters of Interest
N(·,·) - Normal Distribution
IG(·,·) - Inverse Gamma Distribution
Di(·) - Dirichlet Distribution
C(·) - Categorical Distribution
R - Ratio in the Gelman-Rubin Statistic
R2 - Coefficient of Determination
C(yi, ŷi) - Loss Function
CC - Class Loss
LF - Leniency Factor

CHAPTER 1

RESEARCH FRAMEWORK

1.1 Introduction

At the end of a course, educators intend to convey the level of achievement of each student in their classes by assigning grades. Students, university administrators and prospective employers use these grades to make a multitude of different decisions. Grades cause a great deal of stress for students; this is a fact of educational life. Grades reflect personal philosophy and human psychology, as well as the effort to measure intellectual progress with standardized, objective criteria. There are many ways to assign students' grades, all of which seem to have their advantages and disadvantages.
The educators or graders are the most proficient persons to form a personal grading plan, because such a plan incorporates the personal values, beliefs and attitudes of a particular educator. For that reason, the philosophy of grading underlying a grading plan must be shaped and influenced by current research evidence, prevailing lore, reasoned judgement and matters of practicality. However, a more professional approach should be developed that can be applied at any grade level and in any subject matter area where letter grades are assigned to students at the end of a reporting period.

1.2 Statement of the Problem

Most approaches to grading require additional effort and varying degrees of mathematical expertise. The educator has to assign a score which maps meaningfully to a letter grade, such as A, B- or C, for each student. There is no standard answer to questions like: What should an "A" grade mean? What percent of students in my class should receive a "C"? University or faculty regulations encourage a uniform grading policy so that grades of A, B, C, D and E have the same meaning, independent of the faculty or university awarding the grade. Other campus units usually know the grading standard of a faculty or university. For example, a "B" in a required course given by Faculty X might indicate that the student has the ability to develop most of the skills referred to as prerequisites for later learning. A "B" in a required course given by Faculty Y might indicate that the student is not a qualified candidate for graduate school in the related fields. Nevertheless, the faculty and the educator may be using different grading standards. The course structure may seem to require a grading plan which differs from faculty guidelines, or the educator and the faculty may hold different ideas about the function of grading. Therefore a satisfactory grading plan must be worked out in order to meet the objectives of measurement and evaluation in education.

Since both philosophies and instructional approaches change as the curriculum changes, educators need to adjust their grading plans accordingly. In this study, we do not compare faculty regulations on their grading methods; rather, we attempt to differentiate each letter grade based on the overall raw score of the student from the beginning to the end of a semester. A statistically based method is used in this research which takes into account the grading philosophy with respect to the conditions of measurement and evaluation of students' achievement. The students' final grades are intended to have a norm-referenced meaning. By definition, a norm-referenced grade does not tell what a student can do; there is no content basis other than the name of the subject area associated with the grade. Furthermore, the distinctions and relationships among several grading methods, conventional and futuristic, are discussed carefully.

1.3 Research Objectives

The objectives of this study are to understand the grading philosophy, grading policies and grading methods, and to explore the appropriate grading methods. The philosophy and policy are viewed as educational principles, while the grading methods are driven by statistical procedures. The primary objectives are to develop mathematical models of the grading system for both conventional and future approaches, and finally to carry out the programming and statistical analysis of the Bayesian Grading method of assigning grades. Data from past years' records are used in this study.
1.4 Scope of the Study

Assigning a grade to a student can be done in various ways. At present, most instructors assign grades conventionally from the raw scores of the tests given in class and the final examination raw scores at the end of a semester. The grades may also be assigned based on the instructor's "feel", developed through experience with the students. To avoid "unfair" judgment of student performance, a new, statistically based approach to grading is fitted to the conventional grading plan. This method provides scientific evidence for assigning grades, as compared to using only the instructor's personal feel. A model called the Bayesian Grading (GB) method is developed to assign the grades. Bayesian inference based on decision making is an important tool to classify each letter grade into its particular class or component. The Gibbs sampler is used to estimate the optimal class for each grade when the students' raw scores are assumed to be normal and to form a bell-shaped distribution. An adjustment to the raw scores which takes into account the instructor's leniency factor allows the educator to vary the leniency of his or her evaluation. Based on this information, we calculate the probability that each student's raw score corresponds to each possible letter grade. The grader's (or instructor's) degree of leniency is used to specify the educator's loss function, which is then used to assign the most optimal letter grades. These categories of grading are built upon an earlier understanding of student raw scores, and they combine the raw scores with the current data in a way that updates the degree of belief (subjective probability) of the educator. Under this principle, each student's raw score is assumed to be independent of the other students'.

1.5 Significance of the Study

In this research, the Bayesian Grading (GB) method of assigning letter grades to students based on their whole-semester raw scores is described. GB categorizes the marks into several different classes, and each class is assigned a different letter grade. The method takes into account the educator's degree of leniency in categorizing the raw scores into several classes.

This instructional statistical design is intended to help prospective, intermediate and beginning educators to sort out the issues involved in formulating their grading plans, and to help experienced educators to re-examine the fairness and defensibility of their current grading practices. It can also be applied at any level of school, college or university.

1.6 Research Layout

Chapter I is intended to introduce the basic terminology and the framework of the study. Chapter II reviews the literature on basic grading policies and grading plans for the conventional grading methods and for a futuristic grading method that will be used throughout the dissertation. Chapter III presents more specific grading methods, namely grading based on curves, and introduces the basic Bayesian grading method, including a discussion on finding the probability distribution of letter grades. Bayesian inference for setting the prior and estimating the posterior probability is worked out theoretically, with proofs included for the reader's understanding. In Chapter IV, we discuss in detail the estimation of the model parameters, which are drawn from the mixture models using the Gibbs sampler. In addition, an estimation of the letter grades which takes into account the instructor's loss function is shown to find the optimal letter grades.
The simulation is developed using WinBUGS (a recently developed software package: the MS Windows operating system version of Bayesian Analysis Using Gibbs Sampling). This is flexible software for Bayesian analysis of complex statistical models using Markov chain Monte Carlo (MCMC) methods. The URL address to download the free version of the software is www.mrcbsu.cam.ac.uk/bugs/.

The significance of the results in real life will be judged by several selected instructors using real raw score data. Furthermore, the results will be compared between the conventional grading methods and the Bayesian Grading method. Finally, Chapter V includes the conclusion and suggestions for further research on grading methods.

CHAPTER 2

REVIEW OF GRADING PLAN AND GRADING METHODS

2.1 Introduction

Grading via mathematical models in education has been a hotly debated topic for many decades. Prior to objective tests, marking and grading were usually synonymous, and the infallibility of the educator's judgment was rarely questioned. A grade should demonstrate a student's performance during the academic session and describe the achievement, ability and aptitude of the student in a particular course, that is, what a student knows rather than how well he or she has performed relative to a reference group. Walvoord and Anderson (1998) believed that grading includes tailoring assignments to the learning goals of the course, establishing criteria, helping students acquire the knowledge they need, assessing student learning over time, raising student motivation, feeding results back so students can learn from their mistakes, and using results to plan future teaching methods. Nevertheless, the problem of using grades to describe student achievement has been persistently troublesome at all levels of education [Ebel & Frisbie, 1991]. Grading is frequently the subject of educational controversy because the grading process is difficult, different philosophies call for different grading systems, and the task of grading is sometimes unpleasant. Figlio and Lucas (2003) showed that instructors at different levels are likely to assign different grade distributions to their classes. However, the development of grading models is in place to attract our attention to creating a structured educational grading system.

There is no single, universally agreed upon grade assignment method. Various grading methods are applied in practice in schools, colleges and universities. Weighting grading components and combining them to obtain a final grade is the most common grading practice. Generally, grades are based on the grades of graded components such as mid term tests, quizzes, projects, assignments, studio projects and the final examination. Educators often wish to weight some components more heavily than others. For example, a set of quiz scores might be valued at the same weight as each of two or three hour-exam grades. The variability of the scores (the standard deviation) is the key to proper weighting. A practical solution to combining several weighted components is first to transform the raw scores to standardized scores: z or McCall T scores [Robert, 1998; Ebel & Frisbie, 1991; Martuza, 1977; Merle, 1968]. This grading method, called "grading on the curve" or "grading on the normal curve", became popular during the 1920's and 1930's.
Grading on the curve is the simplest method: one determines in advance what percentage of the class will get A's (say the top 10% get an A), what percentage will get B's, and so on [Stanley and Hopkins, 1972]. Even though it is simple to apply, it has serious drawbacks. The fixed percentages are nearly always determined arbitrarily. In addition, the use of the normal curve to model achievement in a single classroom is generally inappropriate, except in large required courses at the college or university level [Frisbie and Waltman, 1992]. Grading on the curve is efficient from an educator's point of view; therein lies the only merit of the method.

A relative method called the Standard Deviation method implicitly assumes that the data come from a single population. It is the most computationally complicated but is also the fairest in producing grades objectively. It uses the standard deviation, which tells how much, on average, the scores of the n students differ from their class average; it is a number that describes the dispersion, variability or spread of scores around the average score. Unlike grading on the curve, this method requires no fixed percentages in advance.

Alex (2003) studied a related method called the Multi-Curves Grading method (MCG), which is built upon the Distribution-Gap grading method. Alex assumed that the raw scores are realizations from a Normal Mixture, with each component of the mixture corresponding to a different letter grade. Estimation procedures are used to compute the probability that each student's raw score corresponds to each possible letter grade.

In moving from scores to grades, educators can grade on an absolute scale (say 90 to 100 is an A). Given that students only care about their relative rank, which kind of grading is better? Work by Pradeep and John (2005) has shown that if the students are disparate, then absolute grading is always better than grading on a curve; that is, when the students are disparate it is always better to grade according to an absolute scale. Grading on a curve is defined by the number nA of students getting A, the number nB getting B, and so on. The grades are obtained by ranking the students' exam scores and assigning the top nA scores the grade A. If k > nA students tie with the top score, then all must get A, the number of B's is diminished by the excess A's, and so on.

In this study, we are interested in converting scores to grades. Three methods are considered: Grading on Curves (GC), with some adjustment using the weighting procedure over several raw scores to find the optimal grading scheme, and the Straight Scale method (SS), which are then compared to the Bayesian Grading method (GB). Many researchers have studied the conventional GC broadly, and here we make some modifications to GC which take the statistical point of view into account. GB is the new approach introduced by Alex (2003), which is the focus of this study; Alex named the method the Multi-Curves Grading method (MCG). The selected Straight Scale, Standard Deviation and conditional Bayesian Grading methods are all norm-referenced, which is the normative type of grading. Each of the methods has its pros and cons. The Straight Scale method, sometimes called the Fixed Scale method, has the benefit of being easy to calculate. It is also easy for students to understand, is generally applied consistently, and might reduce competition between students. Unfortunately, the Straight Scale method has serious drawbacks.
The method ignores the actual raw scores: the percent score ranges for each letter grade are fixed for all grading components. For example, the fact that 91% is needed for an A places severe and unnecessary restrictions on instructors when they develop each assessment tool. A fixed percentage is arbitrary and thus neither defensible nor meaningful. Why shouldn't the cutoff for an A be 85, 86 or 87 instead? Why shouldn't the A cutoff be 80-100% for a certain test, 91-100% for another and 70-100% for a certain simulation exercise? Is there any reason why the same numerical standard must be applied to every grading component when those standards are arbitrary and devoid of any absolute meaning? What sound rationale can be given for any particular cutoff? Some instructors find themselves in a bind when the highest score obtained on an exam is only 72%. Was the examination much too difficult? Did students study too little? Was instruction relatively ineffective? Oftentimes, instructors decide to "adjust" scores so that 72% is equated to 100%. In addition, this method can allow all students to receive the same grade and thus fail to provide the information needed to screen students in competitive circumstances.

The purposes of this chapter are (a) to explain the grading philosophies and grading policies encountered in educational psychology, (b) to define the topic of measurement in general and (c) to briefly describe several conventional grading methods of grade assignment. A more detailed description of GC and GB is given in Chapter III.

2.2 Grading Philosophies

Grades reflect our personal philosophy and human psychology, as well as the effort to measure people's intellectual progress with standardized, objective criteria. From the educator's point of view, whatever your personal philosophy about grades, their importance to your students means that you must make a constant effort to be fair and reasonable in maintaining grading standards. A philosophy of grading is broadly about how the educator grades the students and what to expect of the students' overall performance. The educator should inform the students at the beginning of the semester which criteria and method will be employed in assigning grades. Lawrence (2005) outlined several grading philosophies as follows:

Philosophy I: Grades are indicators of relative knowledge and skill; that is, a student's performance can and should be compared to the performance of other students in that course. The standard to be used for the grade is the mean or average score of the class on a test, paper or project. The grade distribution could be objectively set by determining the percentage of A's, B's, C's, D's and E's that will be awarded. Outliers (really high or really low) can be awarded grades as seems fit.

Philosophy II: Grades are based on preset expectations or criteria. In theory, every student in the course could get an A if each of them met the preset expectations. The grades are usually expressed as the percentage of success achieved (90% and above is an A, 80-90% is a B, 70-80% is a C, 60-70% is a D and below 60% is an E); these letter grade ranges are subject to change depending on the institution's grading policy. Pluses and minuses can be worked into these ranges.

Philosophy III: Students come into the course with an A, and it is theirs to lose through poor performance, absence, late papers and so on. With this philosophy the teacher takes away points rather than adding them.
Philosophy IV: Grades are subjective assessments of how a student is performing according to his or her potential. Students who plan to major in a subject should be graded harder than a student just taking the course out of general interest. Therefore, the standard set depends upon student variables and shouldn't be set in stone.

The grading system is closely attached to the educator's own philosophy, since educators come to know their students through teaching. In line with this, the factors that will influence the educator's evaluation must be considered in advance. For example, some educators weigh content more heavily than style. It has been suggested that lower (or higher) evaluations should be used as a tool to motivate students. Some educators negotiate with students about the 'methods' of evaluation, while others do not. Because personal preference is so much a part of the grading and evaluation of students, a thoughtful examination of one's own personal philosophy concerning these issues will be very useful.

In assigning marks to a student through a mid term test, project or examination, that is, in transforming performance into numbers or letter grades, educators should know the procedure for measuring student performance. In the next section, we discuss the definition of measurement and issues related to it. This knowledge is of significant importance in developing an educator's skill in grade assignment.

2.3 Definition and Designation of Measurement

The focus of measurement is to quantify a characteristic of a phenomenon. Measurement is generally a process of collecting information. The results are quantifiable in terms of mathematical symbols such as time, distance, amount or number of tasks performed correctly. In grading, for example, numbers are assigned to permit statistical analysis of the resulting data. The most important aspect of grading is the specification of rules for transforming numbers into letter grades. The rules for assigning numbers and letter grades must be standardized and applied uniformly.

Definition 2.1 [Martuza, (1977)] Measurement is the process of assigning numerals to objects, events, or people using a rule. Ideally the rule assigns the numerals to represent the amounts of a specific attribute possessed by the objects, events, or people being measured.

From this definition, we define measurement precisely as the grading process of assigning a raw score and a letter grade to a student. Thus we need the following mathematical formalism to understand the issues in the grading process. As illustrated in Figure 1.1, in mathematical terms measurement is a functional mapping from the set of objects {si} to the set of real numbers {xi} of raw scores; the raw score set {xi} consists of every possible outcome in raw scores of the random test components. The objects are the students themselves, denoted si, where s is shorthand notation for student and i refers to which student is being described. For simplicity, i may be taken as the ID of each student and may take any integer value (i, N ∈ ℕ) from 1 up to the finite number of students N. The symbol xi (xi ∈ ℝ) denotes the standardized raw score corresponding to student i. The method to compute the weighted and standardized raw scores will be explained in the following sections. The students and their standardized raw scores are ranked in descending order: s1 > s2 > ... > sN and x1 > x2 > ... > xN.
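As a small illustration of this mapping, the following minimal Python sketch (used here for illustration only; the student IDs and standardized scores are made up and are not data from this study) ranks the objects in descending order of their standardized raw scores before any letter grades are attached:

# Hypothetical example: student IDs mapped to standardized raw scores x_i.
# The names and numbers are illustrative only, not data from this study.
scores = {"s1": 1.25, "s2": 0.40, "s3": -0.15, "s4": -1.10}

# Rank the students in descending order of score, as in x1 > x2 > ... > xN.
ranking = sorted(scores.items(), key=lambda item: item[1], reverse=True)

for rank, (student, x) in enumerate(ranking, start=1):
    print(f"rank {rank}: student {student} with standardized score {x:+.2f}")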
Our task is to convert the standardized raw scores into fair and meaningful letter grades. In other words, the purpose of this study is to define the probability set function of the raw scores. A probability set function of the raw scores tells us how probability is distributed over the various subsets of raw scores in a sample space G. The properties of the probability set function are defined statistically in Appendix B, Definition B1.

Figure 1.1: A Functional Mapping of Letter Grades (the objects, i.e. students, are mapped to weighted and standardized raw scores, which are in turn mapped to the letter grades A to E)

In addition, a measure on grades is a set function, which is an assignment of a number µ(g) to each set g in a certain class [Ash, (1972)]. If G is a set whose points correspond to the possible outcomes of a random experiment, certain subsets of G will be called "events" and assigned a probability. Intuitively, g is an event if the question "Does w (say 85) belong to g?" has a definite yes or no answer after the experiment is performed and the outcome corresponds to the point 85 ∈ G. We denote by G the sample space of grades g1 = E, g2 = D, g3 = D+, ..., g11 = A; {gL ∈ G}, where the subscript L = 1, 2, ..., 11 denotes the eleven components of letter grades in the grading policy. We describe the eleven grade components as the set {A, A-, B+, B, B-, C+, C, C-, D+, D, E}, which matches the set of grade point averages {4.0, 3.7, 3.3, 3.0, 2.7, 2.3, 2.0, 1.7, 1.3, 1.0, 0.0}. Guidelines for giving meaningful letter grades corresponding to the grade point averages were explained by Frisbie and Waltman (1992).

Figure 1.2 exhibits the partition of the letter grades. The spaces between A and B, C, D and E show the borderline gaps between grades, which add to the complexity of deciding the exact maximum and minimum score for each grade. The curves illustrate that the raw scores do not lie on a straight line, which shows the variety of raw scores for different students within the same letter grade.

Figure 1.2: A Partition on Letter Grades

Prior to the task of letter grade assignment, we begin by highlighting the levels of measurement, also known as the scale types of measurement. Most references address four hierarchical levels: Nominal, Ordinal, Interval and Ratio measurement.

2.3.1 Levels of Measurement

The level of measurement is important in determining the type of statistical analysis that can be conducted. The four possible levels of measurement are as follows [Martuza, (1977)]:

A. Nominal

Nominal measurement is the simplest type of scale. Observations are placed into a category according to a defined property. The numbers are used for labeling purposes only, to classify the object or event into a category. The categories are exhaustive (everyone should have a category to end up in) and mutually exclusive (a person should end up in only one category). The appropriate descriptive statistics are frequencies and percentages, and the appropriate inferential statistics are non-parametric, such as the Chi-Square test. Examples of nominal data are social security numbers or the numbers on the backs of football players. Another example is grouping people into categories based upon sex (Male or Female); the Female group might be assigned the number "0" and the Male group "1".

B. Ordinal

The numbers are used to show a relative ranking. The higher the number, the higher the significance of the event might be.
The variables used consist of a mutually exclusive and exhaustive set of orderable categories, for example:
a) 1 = Highly Competent, 2 = Competent, 3 = Average, 4 = Needs attention/improvement, 5 = Poor
b) Rank ordering people in a classroom according to height and assigning the shortest person the number "1", the next shortest person the number "2" and so on.

However, ordinal measurement does not tell us how much greater one level of an attribute is than another. For instance, we do not know if being completely independent is twice as good as when mechanical assistance is present. The appropriate descriptive statistics for ordinal data are percentiles and rank order, while the appropriate inferential statistics are non-parametric, such as the Median or Mean test.

C. Interval

With interval measurement, we know the interval between points. The numbers show a mutually exclusive and exhaustive set of equally spaced, ordered categories between events. For example, consider temperature readings: the Celsius temperatures in Kuala Lumpur in three particular months were as follows: January 25-32; June 24-36; December 25-30. Each degree change of temperature reflects an equal amount of difference. However, the zero point is arbitrary; we cannot conclude that a temperature of 30 degrees is twice as hot as a temperature of 15 degrees. The relevant statistical measures are the relational statistics (correlations) and, in inferential statistics, the parametric methods: t tests, Analysis of Variance (ANOVA), Analysis of Covariance (ANCOVA), Multivariate Analysis of Variance (MANOVA), and regression.

D. Ratio

Ratio measurement describes interval data that have an absolute zero. For example, suppose the test scores, in percentages, of three students are as follows: Diana 90%, Iskandar 60% and Swee Lee 30%. These numerals indicate that Diana has a higher score than Iskandar and Swee Lee; or we can say that Iskandar has a higher mark than Swee Lee and a lower one than Diana. Does this mean that the ability difference between Swee Lee and Iskandar is less than that between Iskandar and Diana? Hence ratio statements like "Iskandar is twice as bright as Swee Lee" are meaningless. In terms of appropriate statistical measures, no distinction is made between interval and ratio data.

2.3.2 Norm-Referenced Versus Criterion-Referenced Measurement

The two basic types of grading are normative (comparative or relative standard) and criterion (mastery or absolute standard). A relative comparison is being made if the educator is evaluating relative to the performance of others in the class or in a well-defined norm group. We assume that the students in the class represent a distribution of intelligence that will result in a similar distribution of learning. Descriptive statistics are often associated with normative grading, and the terms "curving the scores" or "grading on curves" are used. The well-known bell curve is centered on the mean (or median) score, and the distribution is often indexed by the standard deviation of the scores. Most standardized tests are scored in a normative fashion and hence are so-called Norm-Referenced. On the other hand, whenever a decision about a student's content-based status, or about his or her achievement with respect to an explicit instructional objective, is made by comparing his or her test score to some preset standard or criterion for success, the test score is said to be given a Criterion-Referenced, as opposed to a norm-referenced, interpretation.
More details about these grading types can be found in Martuza (1977), Frisbie and Waltman (1992) and Lawrence (2005). Table 2.1 briefly compares Norm-Referenced and Criterion-Referenced grading, and as an extension Table 2.2 presents descriptors of grade-level performance using a rubric. A more generic type of indicator could be in the form of a rubric, which is a descriptive scale with values attached, as shown in the table. The rubric is an authentic assessment tool which is particularly useful in assessing criteria that are complex and subjective. Designed rubrics can be found in many books in the educational and psychology fields, such as Stanley and Hopkins (1972), Martuza (1977), Ebel and Frisbie (1991) and Lawrence (2005).

Table 2.1: Comparison of Norm-Referenced and Criterion-Referenced*

Norm-Referenced:
- Compares the performance of individuals against the reference group
- Spreads out the grade distribution
- Content dependent
- Encourages competition
- Grades affected by outliers
- Does not motivate students to improve

Criterion-Referenced:
- Compares the performance of individuals against preset criteria
- Grades may be clustered at the high or low ends
- Course objective dependent
- Encourages collaboration
- Grades not affected by how other individuals perform
- Can be used diagnostically to indicate strengths and weaknesses

* Subject to change according to department/faculty grading policy or grading plan

Table 2.2: Rubrics for Descriptive Scale*

Grade A (80-100 points): Entire major and minor goals achieved; high level of skill development; pluses for work submitted on time and carefully proofread; minuses for work submitted late or resubmitted; exceptional preparation for later learning.
Grade B (65-79 points): Entire major and most minor goals achieved; advanced development of most skills; pluses for work submitted on time and carefully proofread; minuses for work submitted late or resubmitted; has prerequisites for later learning.
Grade C (50-64 points): Most major and minor goals achieved; demonstrates ability to use basic skills; pluses for all work carefully proofread; lacks a few prerequisites for later learning.
Grade D (40-49 points): Sufficient goals achieved to warrant a passing effort; some important skills not attained; deficient in many of the prerequisites for later learning.
Grade E (0-39 points): Few goals achieved; most essential skills cannot be demonstrated; lacks most prerequisites needed for later learning.

* Subject to change according to department/faculty grading policy or grading plan

The decision to use either an absolute or a relative grading standard is the most fundamental decision an educator must make with regard to performance assessment. When the absolute standard is chosen, all methods and tools of evaluation must be designed to yield content-referenced interpretations. Absolute standards for grading must be established for each component which contributes to the course grade: tests, papers, quizzes, presentations, projects and other assignments. If the decision is to use a relative standard, all measures must be geared to provide norm-referenced interpretations as explained in this section. Obviously, in both cases criterion-referenced decisions need to be made as long as several grading symbols are available [Ebel & Frisbie, 1991], as in the rubric descriptions stated above. Although a clear majority of institutions presently use letter grading with relative standards, percent grading is by no means dead. The percent grades are then converted to letter grades, for example as in the short sketch below.
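The following minimal Python sketch (illustration only) assigns letter grades on a fixed percent scale using the point ranges of Table 2.2 (80-100 for A, 65-79 for B, 50-64 for C, 40-49 for D and 0-39 for E); the cutoffs and the sample scores are assumptions taken from that rubric and would change with the institution's grading policy:

# Fixed (absolute) score ranges taken from Table 2.2; a real course would
# substitute its own institutional cutoffs.
CUTOFFS = [(80, "A"), (65, "B"), (50, "C"), (40, "D"), (0, "E")]

def percent_to_letter(score):
    """Return the letter grade whose predetermined interval contains the score."""
    for lower_bound, letter in CUTOFFS:
        if score >= lower_bound:
            return letter
    return "E"

# Hypothetical percent scores, for illustration only.
for s in (91, 72, 55, 38):
    print(s, percent_to_letter(s))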
Some instructors voice a preference for absolute grading over relative grading for philosophical reasons but find the task of establishing standards overbearing or, in some cases, too arbitrary. This type of decision is very subjective and varies with instructor judgement.

2.4 Weighting Grading Components

Consider the problem an instructor faces in motivating students to study for two tests, say when the instructor gives a midterm and a final. The problem is that if a student does very well or very badly on the midterm, he (or his rivals) will feel less incentive to work for the final if he or they are unable to change their rank. One way to solve this problem is to weight the final more than the midterm, and to grade even more coarsely on the midterm than on the final.

In the introduction to this chapter we looked at weighting as a grading method. In this section, we follow Stanley and Hopkins (1972), Ebel and Frisbie (1991), and Frisbie and Waltman (1992) for the weighting procedure. When educators determine a course grade by combining grading components, just as in deciding which components to use, each component carries more or less weight in determining the final score. For example, the instructor may decide to give the final examination the highest weight (say 3.0) compared to the mid term test (say 1.5) and the project assignment (say 1.0), since the final examination covers the entire course content of the semester, whereas the mid term test covers only weeks 1 to 8 of the semester. The component scores must be pooled to form a composite score for the final summary. To obtain grades of maximum validity, educators must give each component the proper weight, not too much and not too little, according to how important each component score or grade is in describing achievement and performance at the end of the grading period.

The standard deviation of its scores approximates the weight of a component quite well. If one set of scores is twice as variable as another, the first set is likely to carry about twice the weight of the second in the total. The weight of one component in a composite thus depends on the variability of the test scores. For example, Table 2.3 shows the scores on the Course Work components (Mid Term Test and Assignment) and the Final Examination for three students (Diana, Iskandar and Swee Lee); the first section displays their scores along with their totals on the three components. Each of them made the highest score on one test, the middle score on a second and the lowest on the third. Note, however, for future reference, that the ranks of their total scores on the three tests are the same as their ranks on the Assignment (Test Z). The second section of the table gives the total points for each grading component along with descriptive statistics: the mean and the standard deviation of the scores on the three tests. Test X has the highest number of total points, Test Y has the highest mean score and Test Z has the scores with the greatest variability.

Table 2.3: Weighting the Test Scores

Student scores (rank on each component in parentheses):
                Final Exam (X)   Mid Term (Y)   Assignment (Z)   Total
Diana               55 (1)           79 (2)           8 (3)       142 (3)
Iskandar            50 (2)           75 (3)          25 (1)       150 (1)
Swee Lee            44 (3)           84 (1)          16 (2)       144 (2)

Test characteristics:
Total Points        100.0            90.0            25.0         215.0
Mean Score           49.7            79.3            16.3         145.3
Standard Deviation    5.51            4.51            8.50          4.16

Weighted scores (multipliers x1.54, x1.89 and x1.00 respectively):
Diana               85              149               8           242
Iskandar            77              141              25           244
Swee Lee            68              158              16           242

The question about weighting was: on which test was it most important to do well?
What factors determine the weights? On which test was the payoff for ranking first the highest, and the penalty for ranking last the heaviest? Clearly Test Z, the test with the greatest variability of scores. Which test ranked the students in the same order as their final ranking based on total scores? Again, Test Z. Thus, the influence of one component on a composite depends not on its total points or mean score but on its score variability [Ebel & Frisbie, 1991]. Furthermore, the importance of the test component, its uniqueness with respect to the course objectives and the accuracy of the scores obtained from the component are the keys in weighting the component [Frisbie and Waltman, 1992].

Now, if the instructor intends to give equal weights, this can be achieved by weighting the scores to make the standard deviations equal, as shown in the last section of Table 2.3. Since the Final Examination is the tougher test, its scores are multiplied by 1.54 to change their standard deviation from 5.51 to 8.50, the same as on Test Z. With equal standard deviations the tests carry equal weight, giving the students approximately the same average rank on the tests and the same total scores. Even if in practice instructors never bother to do this, being aware of the principle should help in scoring the various assignments or projects that are subjectively evaluated. Moreover, when the whole possible range of scores is used, score variability is closely related to the extent of the available score scale. This means that scores on a 50-point mid term test are likely to carry about five times the weight of scores on a 10-point assignment project, provided that the scores extend across the whole range in both cases. The most efficient means of proper weighting involves the computation of standard scores, such as the T-score, for each grading component. Then each component is represented on a score scale that yields the same standard deviation (10 for T-scores) for each measure. The details of the T-score are given in Chapter III. The weighting computation of Table 2.3 is summarized in the sketch below.
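The following is a minimal Python sketch of that computation (for illustration only; the scores are those of Table 2.3): each component is rescaled so that all components share the largest standard deviation, which reproduces the x1.54, x1.89 and x1.00 multipliers and the weighted composites of the table.

import statistics

# Component scores from Table 2.3 (Final Exam X, Mid Term Y, Assignment Z)
# for Diana, Iskandar and Swee Lee, in that order.
scores = {
    "X": [55, 50, 44],
    "Y": [79, 75, 84],
    "Z": [8, 25, 16],
}

# Sample standard deviation of each component.
sd = {name: statistics.stdev(vals) for name, vals in scores.items()}

# Rescale each component so that every component has the largest SD (8.50 here),
# i.e. so that each component carries (approximately) equal weight in the composite.
target = max(sd.values())
weights = {name: target / s for name, s in sd.items()}

composites = [
    round(sum(weights[name] * scores[name][i] for name in scores))
    for i in range(3)
]
print({name: round(w, 2) for name, w in weights.items()})   # {'X': 1.54, 'Y': 1.89, 'Z': 1.0}
print(composites)                                           # [242, 244, 242]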
CHAPTER 3

GRADING ON CURVES AND BAYESIAN GRADING

3.1 Introduction

As an introduction to grading, let ξ denote the assignment of letter grades to the N students of a particular course. There is a grading map ξ: xi → G which ranks the students according to G(x) when the scores obtained are xi ∈ ℝ. Each rank corresponds to a grade. The map ξ depends precisely on the scores, and not on names, effort, extra credit or student personality. A higher score implies a higher letter grade. In this chapter we focus on two particular ways of generating ξ: Grading on Curves and Bayesian Grading.

3.2 Grading On Curves

A well-known and the simplest variety of relative grading standard is called "grading on the curve" (GC). The 'curve' approximates the standard 'bell' shape, usually referred to as the asymptotically normal distribution curve or some symmetric variant of it. Its use is based upon two assumptions: (a) the variable being rated is normally distributed on a continuous scale and (b) the categories cover known intervals on the continuum [Merle, 1968]. We assume the students' abilities and accomplishments are normally distributed, via the raw score distribution, which is often used to describe the achievements of individuals in a large heterogeneous group. This method of assigning grades based on group comparison is complicated by the need to establish arbitrary quotas for each grade category. What percent should obtain A's, B's, C's, D's or E's? Once these quotas are fixed, grades are assigned without regard to the actual level of performance. Quota-setting strategies vary from one instructor to another and from department to department, and seldom carry a defensible rationale. While some instructors defend the normal or bell-shaped curve as an appropriate model for setting quotas, using the normal curve is as arbitrary as using any other curve.

In the next sections, we have the following goals: (a) to formalize the treatment of transformations, (b) to describe and illustrate some of the more commonly encountered transformed score scales (the z-score and T-score), and (c) to show the relationships that exist between the z-score and the T-score.

3.2.1 Linearly Transformation Scores

A transformation is a rule (or set of rules) for converting scores from one scale (the observed score, i.e. the raw score x) to a new set of scores (a standardized score, i.e. the deviation score). Transformations of scores can be classified as either linear or nonlinear. A linear transformation converts the scores from one scale to another in such a way that the shape of the distribution is not changed, whereas a nonlinear transformation converts the scores from one scale to another in such a way that the shape of the distribution is altered [Martuza, 1977]. One way to find out whether a transformation is linear or not is to plot the raw scores against the transformed scores, as shown in Figure 3.1. If all the points in the plot fall exactly on a straight line, as in Figure 3.1(a), the transformation is certainly linear; if the points do not fall on a straight line, as in Figure 3.1(b), the transformation is nonlinear. Generally, a linear transformation preserves the shape of a test score distribution, whereas a nonlinear transformation always alters the shape of the test score distribution.

Figure 3.1: Plot of the Raw Scores and Corresponding Transformed Scores; (a) a linear transformation, (b) a nonlinear transformation

3.2.2 Model Set Up for Grading on Curves

Using group comparison for grading is appropriate when the class size is sufficiently large (N ≥ 30) to provide a reference group representative of the students typically enrolled in the course. We assume that the students' scores are independent of each other, and we shall be concerned here with the most widely used statistical value: the standard deviation.

3.2.3 Standard Deviation Method

When a distribution of raw scores is normal or approximately so, standard scores reveal a great deal of information. Standard scores are recommended because they allow us to measure performance on each grading component with an identical or standard yardstick. To transform a set of student raw scores to standard scores, we need only estimate the mean and standard deviation of the set and divide the deviation of each score from the mean by the standard deviation, i.e.,

z = (xi − µ) / σ     (3.1)

Table A1 in Appendix A shows the probabilities for ranges of z. In short, a standard z-score indicates how many standard deviations the corresponding raw score of a particular student lies from the mean of the reference group. The mean and standard deviation of any set of standard scores are 0 and 1, respectively.
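A minimal Python sketch of the standardization in equation (3.1), together with the T transformation introduced in the next paragraphs, is given below (for illustration only; the raw scores are made up and are not data from this study):

import statistics

# Illustrative raw scores for one grading component (not data from this study).
raw = [72, 65, 58, 80, 61, 55, 69]

mu = statistics.mean(raw)
sigma = statistics.stdev(raw)     # sample SD; pstdev() could be used if the class is the population

# Equation (3.1): z indicates how many standard deviations a score lies from the mean.
z_scores = [(x - mu) / sigma for x in raw]

# Rescaling to eliminate decimals and negative signs, e.g. the McCall T score T = 10z + 50.
t_scores = [10 * z + 50 for z in z_scores]

print([round(z, 2) for z in z_scores])
print([round(t, 1) for t in t_scores])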
If an eleven-component letter grade scheme A, A-, B+, B, B-, C+, C, C-, D+, D, and E with equal intervals is employed, and if the practical limits of the z scale are considered to be -2.50 to +2.50 or -2.0 to +2.0, each interval will extend over 0.4545 or 0.3636 standard deviation units respectively [Merle, 1968]. The interval endpoints are called the grade cutoff points, and they form equal intervals on the score scale. But the fact remains that the interval for A's, B's or E's is set arbitrarily, depending on the instructor's philosophy. Once the instructor has fixed the standard deviation according to the final score performance, the percentages of students getting A's, B's and the other letter grades are figured symmetrically. Under the two assumptions of Section 3.2, the distribution of the eleven letter grade components would follow the proportions shown in Figure 3.2.

Furthermore, the transformation of raw scores to standard scores may result in decimal numbers, some of which are negative. In order to eliminate decimals and negative signs, standard scores are frequently multiplied by a constant and added to another constant. The general form of all such transformations is

T = z·σnew + µnew     (3.2)

in which z, σnew, µnew and T represent the normalized z score and the standard deviation, mean and transformed score of the new normal distribution, respectively. A widely used scheme is the one in which the standard scores are multiplied by 10 and added to 50 [Merle, 1968; Martuza, 1977; Spencer, 1983; Lawrence, 1995; Alex, 2003]. The converted standard scores in this form are usually denoted as T. When the raw scores are normally distributed, their equivalent T scores are identical to the well-known McCall T scores, T = 10z + 50; other examples are the Stanine scores, Stanine = 2z + 5, the standard scores used to report the results of the Iowa Test of Educational Development (ITED), ITED = 5z + 15, and the College Entrance Examination Board (CEEB) scores, CEEB = 100z + 500.

For example, consider raw scores with mean 60 and standard deviation 10. This adjustment is made since we have stated that the average for students should be 60 instead of the 50 of the McCall T scores, which means that more than half of the students in a particular course should get a letter grade of C+ or above. The C+ is chosen instead of the C- because we use eleven letter grades and not the typical five. Therefore fifty percent of the students will always have a score greater than the mean (for example, with the mean set at 60). See Table 3.1 and Figure 3.2.

In the Straight Scale method, if the raw score falls within a certain predetermined interval, the letter grade corresponding to that interval is assigned. The method is easy to apply and is frequently used in most university grading schemes. The intervals are created before any actual raw scores are realized; in other words, this method does not even consider the actual raw scores in making the cutoffs between letter grades. The Straight Scale also makes little sense if the test that produced the raw scores is either too difficult or too easy, and often it is impossible to know the difficulty of a test until after the raw scores are seen [Alex, 2003].

The procedure for the standard deviation grading method is as follows:
i) Build a frequency distribution of the total scores by listing all obtainable scores and the number of students receiving each. Calculate the mean, median and standard deviation. Note that each grading variable should be weighted before the standard scores are combined.
ii) If the mean and median are similar in value, use the mean for further computations; otherwise use the median. If the median is chosen, add 0.1818 of the standard deviation to the median and subtract the same value from the median. These are the cutoff points for the range of C+'s.
iii) Add 0.3636 of the standard deviation to the upper cutoff of the C+'s to find the B-/B cutoff, and subtract the same value from the lower cutoff to find the C/C- cutoff, continuing outward in the same way for the remaining grades. See Figure 3.2 for an illustration and Table 3.1 for the letter grade cutoffs when the median is chosen to be equal to 60.

In step (iii), if a different mean (or median) and standard deviation are used, simply ignore the third column of Table 3.1 and instead distribute the cutoff scores based on the estimates of the mean (or median) and standard deviation; examples of cutoff scores corresponding to different means and standard deviations are given in Appendix A. Finally, review borderline cases by using the number of assignments completed, the quality of the assignments, or some other relevant achievement data to decide whether any borderline grades should be raised or lowered. A minimal sketch of the procedure is given below.
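The following Python sketch is for illustration only; the raw scores are made up, the mean/median similarity threshold is an arbitrary choice, and the grade bands follow the 0.1818σ and 0.3636σ cutoffs of Figure 3.2 and Table 3.1.

import statistics

GRADES_UP   = ["B-", "B", "B+", "A-", "A"]   # bands above the C+ band, lowest first
GRADES_DOWN = ["C", "C-", "D+", "D", "E"]    # bands below the C+ band, highest first

def sd_method_grade(score, center, sigma):
    """Assign a letter grade from the distance of a score to the class center."""
    z = (score - center) / sigma
    if -0.1818 <= z < 0.1818:                # the C+ band straddles the center
        return "C+"
    if z >= 0.1818:
        band = int((z - 0.1818) // 0.3636)   # number of 0.3636-sigma bands above C+
        return GRADES_UP[min(band, len(GRADES_UP) - 1)]
    band = int((-z - 0.1818) // 0.3636)      # number of 0.3636-sigma bands below C+
    return GRADES_DOWN[min(band, len(GRADES_DOWN) - 1)]

# Step (i): frequency distribution summarized by mean, median and standard deviation.
scores = [72, 65, 58, 80, 61, 55, 69, 77, 49, 63]
mean, median = statistics.mean(scores), statistics.median(scores)
sigma = statistics.stdev(scores)

# Step (ii): use the mean if mean and median are similar, otherwise the median
# (the similarity threshold of 0.5 sigma is an arbitrary illustrative choice).
center = mean if abs(mean - median) < 0.5 * sigma else median

# Step (iii): cutoffs every 0.3636 sigma outward from the C+ band.
for s in sorted(scores, reverse=True):
    print(s, sd_method_grade(s, center, sigma))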
ii) If the mean and median are similar in value, use the mean for further computations; otherwise use the median. If the median is chosen, add 0.1818 of a standard deviation to the median and subtract the same amount from the median. These are the cutoff points for the range of C+'s.

iii) Add 0.3636 of a standard deviation to the upper cutoff of the C+'s to find the B-/B cutoff, and subtract the same amount from the lower cutoff to find the C/C- cutoff, continuing outward in the same way. See Figure 3.2 for an illustration and Table 3.1 for the letter grade cutoffs when the median is taken to be 60.

In step (iii), if a different mean (or median) and standard deviation are used, simply ignore the third column of Table 3.1 and distribute the cutoff scores according to the estimated mean (or median) and standard deviation; examples of cutoff scores corresponding to different means and standard deviations are given in Appendix A. Finally, review borderline cases by using the number of assignments completed, the quality of the assignments, or some other relevant achievement data to decide whether any borderline grades should be raised or lowered.

Figure 3.2: Relationship among Different Types of Transformation Scores in a Normal Distribution; $\mu = 60$, $\sigma = 10$. The grade boundaries fall at z-scores $-1.64, -1.27, -0.91, -0.54, -0.18, 0.18, 0.54, 0.91, 1.27, 1.64$ (T-scores 43.6, 47.2, 50.9, 54.6, 58.2, 61.8, 65.5, 69.1, 72.7, 76.4), separating, from lowest to highest, the grades E, D, D+, C-, C, C+, B-, B, B+, A- and A.

Table 3.1: Grading on Curve Scales for the Scores between Which a Certain Letter Grade is Assigned; the Mean is "set" at C+

Letter   Straight Scale     Grading on Curve                    Percentage     Cumulative
Grade    From     To        From              To                of Students    Percentage
A        85       100       µ + 1.6362σ       100               5.05%          5.05%
A-       80       84        µ + 1.2726σ       µ + 1.6362σ       5.15%          10.2%
B+       75       79        µ + 0.9090σ       µ + 1.2726σ       7.94%          18.14%
B        70       74        µ + 0.5454σ       µ + 0.9090σ       11%            29.12%
B-       65       69        µ + 0.1818σ       µ + 0.5454σ       13.7%          42.86%
C+       60       64        µ − 0.1818σ       µ + 0.1818σ       14.23%         57.07%
C        55       59        µ − 0.5454σ       µ − 0.1818σ       13.7%          70.3%
C-       50       54        µ − 0.9090σ       µ − 0.5454σ       11%            81.3%
D+       45       49        µ − 1.2726σ       µ − 0.9090σ       7.94%          89.24%
D        40       44        µ − 1.6362σ       µ − 1.2726σ       5.15%          94.39%
E        0        39        0                 µ − 1.6362σ       5.05%          99.44–100%

The advantage of this method is that it automatically adjusts the letter grades to the difficulty of the test that produced the raw scores. For instance, if a test is made more difficult, the mean (or median) of the raw scores decreases and the cutoffs move down with it, so the distribution of letter grades remains essentially the same. The mean or median could also be set at B- or B rather than C+. If the instructor has some notion of what the grade distribution should look like, some trial and error may be needed to decide how many standard deviations from the composite average each grade cutoff should be placed.

However, the method has serious drawbacks. The fixed number of standard deviations for each letter grade cutoff is nearly always determined arbitrarily. In addition, the use of the normal curve to model achievement in a single classroom is generally inappropriate, except in large required courses at the college or university level. Nevertheless, when a relative grading method is desired, the standard deviation method is attractive despite its computational requirements.

3.3 Bayesian Grading

Bayesian grading is a contemporary method of assigning letter grades that was developed by Alex Strashny in 2003, where it is called the Multi-Curve Grading (MCG) method. In this study we refer to the method as Bayesian Grading (GB).
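Before turning to the details of GB, the Standard Deviation method of Section 3.2.3 can be summarized in a short sketch. The code below is only an illustration of the cutoffs of Table 3.1; the raw scores, and the choice of the mean rather than the median as the centre, are hypothetical.

```python
import numpy as np

# A small sketch of the standard deviation (grading on curve) method of
# Table 3.1: cutoffs are placed at fixed multiples of the standard deviation
# around the class mean (or median). The scores below are hypothetical.
grades = ["E", "D", "D+", "C-", "C", "C+", "B-", "B", "B+", "A-", "A"]
# Upper z-cutoff of each grade except A (A has no upper cutoff).
z_cuts = [-1.6362, -1.2726, -0.9090, -0.5454, -0.1818,
           0.1818,  0.5454,  0.9090,  1.2726,  1.6362]

def assign_grades(scores, center=None, spread=None):
    scores = np.asarray(scores, dtype=float)
    center = scores.mean() if center is None else center
    spread = scores.std(ddof=0) if spread is None else spread
    cutoffs = [center + z * spread for z in z_cuts]
    # searchsorted counts how many cutoffs each score exceeds, which
    # indexes directly into the ordered grade list.
    return [grades[i] for i in np.searchsorted(cutoffs, scores)]

scores = [35, 48, 55, 60, 63, 71, 82, 95]   # hypothetical raw scores
print(assign_grades(scores))
```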
In general, GB applies Bayesian inference, through a Bayesian network, to classify a class of students into several distinct subgroups, each of which corresponds to a possible letter grade. The method builds on the Distribution-Gap grading method for finding the grade cutoffs [Ebel and Frisbie, 1991; Frisbie and Waltman, 1992; Alex, 2003].

3.3.1 Distribution-Gap

The Distribution-Gap method is another relative grading variation. The composite scores of the students are ranked from high to low in the form of a frequency distribution. The frequency distribution is then examined carefully for gaps, that is, short intervals in the consecutive score range in which no student scored. A horizontal line is drawn at the top of the first gap, which gives the cutoff for the A's, and a second gap is then sought. This process continues until all possible letter grade ranges (A–E) have been identified. The cutoffs are therefore assigned only after looking at the composite scores. For example, if the highest composite scores in a class were 241, 238, 235, 227, 226, …, then the instructor might use the gap between 235 and 227 to separate the A and A- grades. The gap between 241 and 238 is too small and might produce too few A grades; the gap between 241 and 235 might be large enough, but 235 seems closer to 238 than to 227.

The major weakness of this technique is its dependence on chance to form the gaps. The size and location of the gaps may depend as much on random measurement error as on actual achievement differences among students. If scores from an equivalent set of measures could be obtained from the same group, the smaller gaps might appear in different places, or the larger gaps might turn out to be somewhat smaller. For example, Farah's 227 might have been 233 (had there been less error in her score), and Johan's 235 might have been 230. Under those circumstances the A/B gap would be less obvious, and many final grade decisions would have to be made by reviewing borderline cases. Measurement errors from different measures do not necessarily cancel each other out, as they are expected to do on repeated measurement with the same instrument.

The major advantage, or attraction, of the distribution-gap method is that, when grades are assigned and the gaps are wide enough, few students appear to be right on the borderline of receiving a higher grade, which helps the instructor avoid disputes with students about near misses. Consequently, instructors receive fewer student complaints and fewer requests to re-examine or re-mark papers in search of the extra credit that would, for example, change a B+ grade to an A-. The method also requires that each component be assigned a different letter grade. When the gaps are narrow, however, too much emphasis is placed on borderline information that the instructor had already decided was not relevant or accurate enough to be included among the grading components that form the composite scores. Only occasionally will the distribution-gap method yield results comparable to those obtained with more dependable and defensible methods [Frisbie and Waltman, 1992]. In practice, the distribution-gap method is hard to apply, because the gaps used to make the cutoffs are chosen subjectively.

3.3.2 Why Bayesian Inference?

In the Bayesian approach to statistics, an attempt is made to utilize all available information in order to reduce the amount of uncertainty present in an inferential or decision-making problem such as assigning grades.
As new information is obtained, it is combined with the previously available information (the raw scores) to form the basis for statistical procedures. The formal mechanism used to combine the new information with the previously available information is known as Bayes' Theorem [Robert, 1972]; this explains why the term "Bayesian" is used to describe this general approach to grading. It combines earlier understanding with currently measured data in a way that updates the degree of belief (subjective probability) of the instructors about their students' performance. The earlier understanding and experience is called the "prior belief", and the new belief that results from updating the prior belief is called the "posterior belief" [Press, 2003]. This inferential updating process is what gives Bayesian inference its name.

Prior belief: the belief or understanding held prior to observing the current set of data, available either from an experiment or from other sources; for example, the average of the composite scores of student performance.

Posterior belief: the belief held after having observed the current data and having examined those data in the light of how well they conform to the instructor's prior notions; for example, the revised average of the final scores and the corresponding interval.

Bayes' theorem involves the use of probabilities, which is only natural, since probability can be thought of as the mathematical language of uncertainty. At any given point — in this study, given the raw scores of the students at the end of the instructional period — the instructor's state of information about some uncertain score value can be represented by a set of probabilities. When new information is obtained which takes fairness and meaningful letter grades into account, these probabilities, which map the scores to letter grades, are revised so that they represent all of the available information.

The principal approaches to inference that guide modern data analysis are the frequentist, Bayesian and likelihoodist approaches. We now describe each in turn [Carlin and Louis, 2000]:

Frequentist: evaluates procedures by imagining repeated sampling from a particular model, which defines the probability distribution of the observed data conditional on unknown parameters. The properties of the procedures are studied for fixed values of the unknown parameters; a good procedure performs well over a broad range of parameter values. Frequentist procedures are also known as classical methods. For an example, see Section 3.2: in the Standard Deviation method the instructor fixes the z-scores corresponding to the letter grades.

Bayesian: requires a sampling model together with a prior distribution on all unknown quantities (parameters) in the model. The prior and the likelihood are used to compute the conditional distribution of the unknowns given the observed data (the posterior distribution), from which all statistical inferences arise. The Bayesian evaluates procedures by repeatedly drawing the unknowns from the posterior distribution for a given data set.

Likelihoodist: the likelihoodist, or "Fisherian", approach develops a sampling model but, like the frequentist, no prior distribution. Inferences are restricted to procedures that use the data only as reported by the likelihood, as a Bayesian would.

Assigning grades through statistical (Bayesian) principles plays an important role in bringing mathematical expertise into grading policy formulation.
Currently, such statistical analyses of grading can be performed with the help of the software package WinBUGS. The details of the Bayesian network and of the parameter estimation used in assigning grades are developed in Chapter IV.

3.3.3 Preliminary View of Bayes' Theorem

According to the Bayesian view, all quantities are of two kinds: (a) those known to the instructor making the inference and (b) those unknown; the former are described by their known values, while uncertainty about the latter is described by probability distributions. Here we present Bayes' theorem compactly. Consider a random variable $X$ whose probability distribution depends on the symbol $\theta$, where $\theta$ is an element of a well-defined set $\Omega$. If $\theta$ is the mean (or median) of the raw scores, $\Omega$ may be the real line ($\Omega \subseteq \Re$). In the usual statistical model, with a random variable $X$ having possible distributions indexed by a parameter $\theta$, the raw score $x$ becomes a realized value of the random variable, known to the instructor, and the purpose is to make inferences concerning the unknown parameter. In the Bayesian approach, therefore, the instructor wishes to calculate the probability distribution of $\theta$ given $X = x$. In order to make probability statements about $\theta$ given $X$, we must begin with a model providing a joint probability distribution for $\theta$ and $X$. The most basic Bayesian model thus consists of two stages: a likelihood specification $X \mid \theta \sim f(x \mid \theta)$ and a prior specification $\theta \sim \pi(\theta)$. Note that we have partitioned the letter grades into eleven components, each of which has its own unknown parameter (mean or median). To simplify, let us introduce a random variable $\Theta$ that has a probability distribution over the set $\Omega$, and regard $\theta$ as a possible value of $\Theta$. In addition, $X$ and $\theta$ can be vectors of raw scores and of means (or medians) of the letter grade components, respectively. In symbols,

$$\Theta = \{\theta_1, \theta_2, \cdots, \theta_k\} \qquad (3.2)$$

$$X = \{x_1, x_2, \cdots, x_n\} \qquad (3.3)$$

where the raw scores form a vector of $n$ observations whose probability distribution $f(x \mid \theta)$ depends on the values of $k \in \mathbb{N}$ unknown parameters $\Theta$, so that the pdf of the vector $X$ depends on the vector $\Theta$ in a known way. We assume $\pi$ is known and are concerned with making inferences about an unknown $\theta$ which is continuous, so Bayes' theorem takes the form appropriate for continuous $\theta$:

$$f(\theta \mid x_1, x_2, \cdots, x_n) = \frac{f(x_1, x_2, \cdots, x_n \mid \theta)\,\pi(\theta)}{\int f(x_1, x_2, \cdots, x_n \mid \theta)\,\pi(\theta)\,d\theta} \qquad (3.4)$$

where $f(\theta \mid \cdot)$ denotes the probability density (pdf) of the unknown $\theta$ subsequent to observing the raw scores $\{x_1, x_2, \cdots, x_n\}$ that bear on $\theta$, $f(x_i \mid \cdot)$ denotes the likelihood function (joint conditional distribution) of the raw scores, and $\pi(\theta)$ denotes the probability density of $\theta$ prior to observing any raw scores. In this form the theorem is still just a statement of conditional probability, as in Section 3.3.4. Equation (3.4) gives the posterior distribution of $\theta$ directly. We simplify Eq. (3.4) as

$$h(\theta \mid x) = \frac{L(x \mid \theta)\,\pi(\theta)}{m(x)} \qquad (3.5)$$

where $m(x) = \int f(x \mid \theta)\,\pi(\theta)\,d\theta$ is the marginal density of the raw scores $x$, $f(x \mid \theta)$ denotes the joint pdf of the data, $L(x \mid \theta)$ denotes the likelihood function of the raw scores, and $h(\cdot)$ denotes the posterior pdf. Eq. (3.5) is a special case of Bayes' theorem in Eq. (3.4) [Carlin and Louis, 2000]. We can write the joint conditional pdf of $X$ given $\Theta = \theta$ as $L(x \mid \theta) = \prod_{i=1}^{n} f(x_i \mid \theta) = f(x_1 \mid \theta)\, f(x_2 \mid \theta) \cdots f(x_n \mid \theta)$.
3.3.4 Bayes' Theorem

Bayes' theorem is simply a statement of conditional probability. A general form of Bayes' theorem for events is as follows. Consider a set $A_1, A_2, \cdots, A_k$ of mutually exclusive and exhaustive events, and let $B$ and $A_j$ be the events of special interest [see Hogg and Craig, 1978, and Hogg et al., 2005, for a proof]. Bayes' theorem provides a rule for finding the conditional probability of $A_j$ given $B$ in terms of the conditional probability of $B$ given $A_j$; because of this reversal it is sometimes called a theorem about "inverse probability". Press (2003) states Bayes' theorem for events, using the law of total probability, as

$$P(A_j \mid B) = \frac{P(B \mid A_j)\,P(A_j)}{\sum_{i=1}^{k} P(B \mid A_i)\,P(A_i)} \qquad (3.6)$$

for $P(B) \neq 0$ (see Appendix B, Definition B1). The denominator follows because the $A_i$ are mutually exclusive and exhaustive, so that $B = \bigcup_{i=1}^{k}(B \cap A_i)$ and

$$P(B) = P(B \cap A_1) + P(B \cap A_2) + \cdots + P(B \cap A_k);$$

since $P(B \cap A_i) = P(A_i)\,P(B \mid A_i)$, we have

$$P(B) = P(A_1)P(B \mid A_1) + P(A_2)P(B \mid A_2) + \cdots + P(A_k)P(B \mid A_k) = \sum_{i=1}^{k} P(A_i)\,P(B \mid A_i). \qquad (3.7)$$

The interpretation of $P(A_j)$ in Eq. (3.6) is personal, since it is our personal prior probability of the event $A_j$: it expresses the degree of belief about event $A_j$ prior to having any information about the event $B$ that may bear on $A_j$. Correspondingly, $P(A_j \mid B)$ denotes the posterior probability of event $A_j$, the degree of belief about $A_j$ after having obtained the information about $B$. Using Eq. (3.7), we can write the posterior probability as

$$P(A_j \mid B) = \frac{P(B \cap A_j)}{P(B)} = \frac{P(A_j)\,P(B \mid A_j)}{\sum_{i=1}^{k} P(A_i)\,P(B \mid A_i)} \qquad (3.8)$$

which is the well-known Bayes' theorem. Bayes' theorem provides a method of determining exactly what the probabilities are that map the raw scores to the assigned letter grades.

Prior probabilities and posterior probabilities each have their own interpretation. Prior probabilities express the degree of belief the analyst has before observing any data that may bear on the problem. In policy analysis, for example, decisions are sometimes made without data, merely on the basis of informed judgment; in such cases prior probabilities are all that is available, a fact which accentuates the importance of formulating prior probabilities very carefully. In science, business, medicine and engineering, inferences and decisions about unknown quantities are generally made by learning from previous understanding of the essential theory and experience. If the sample size of the current data is large, the influence of the prior probabilities will usually disappear and the data will be left to 'speak for themselves'; however, if the sample size is small, the prior probabilities can weigh heavily against the small amount of observed data, and so can be extremely important. The posterior probabilities are the probabilities that result from the application of Bayes' theorem. The posterior probabilities of mutually exclusive and exhaustive events must sum to one for them to be bona fide (authentic or reliable) probabilities [Press, 2003].
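A small numerical illustration of Eqs. (3.6)–(3.8) may help fix ideas. The events, prior probabilities and likelihoods below are hypothetical; the sketch simply applies the law of total probability and then Bayes' theorem.

```python
# A minimal numerical illustration of Eqs. (3.6)-(3.8), with hypothetical
# numbers: A1..A3 are three mutually exclusive, exhaustive "grade group"
# events and B is an observed event (e.g. "score above 70").
prior = {"A1": 0.2, "A2": 0.5, "A3": 0.3}          # P(A_j), sums to 1
likelihood = {"A1": 0.9, "A2": 0.4, "A3": 0.05}    # P(B | A_j)

# Law of total probability, Eq. (3.7): P(B) = sum_j P(A_j) P(B | A_j)
p_b = sum(prior[a] * likelihood[a] for a in prior)

# Bayes' theorem, Eq. (3.8): P(A_j | B) = P(A_j) P(B | A_j) / P(B)
posterior = {a: prior[a] * likelihood[a] / p_b for a in prior}

print(p_b)        # 0.395
print(posterior)  # posterior probabilities, summing to one
```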
In words, we can write Bayes' theorem as

$$\text{posterior probability} = \begin{cases} \dfrac{(\text{prior probability})(\text{likelihood})}{\sum (\text{prior probability})(\text{likelihood})} & \text{for the discrete case} \\[2ex] \dfrac{(\text{prior probability})(\text{likelihood})}{\int (\text{prior probability})(\text{likelihood})} & \text{for the continuous case} \end{cases}$$

Bayes' theorem also tells us that the probability for $\Theta$ posterior to the data $X$ is proportional to the product of the distribution for $\Theta$ prior to the data and the likelihood for $\Theta$ given $X$ [Box and Tiao, 1973]. That is,

$$\text{posterior distribution} \propto \text{prior distribution} \times \text{likelihood}$$

(for a proof see Appendix B, Example B1). This form of the posterior distribution arises when the prior is a conjugate prior. Gelman et al. (1995) give the following definition of conjugacy of a prior distribution: if $F$ is a class of sampling distributions $f(x \mid \theta)$ and $\pi$ is a class of prior distributions for $\theta$, then the class $\pi$ is conjugate for $F$ if $f(\theta \mid x) \in \pi$ for all $f(\cdot \mid \theta) \in F$ and $f(\cdot) \in \pi$. The class of all densities is always conjugate, no matter what class of sampling distributions is used; the case of most interest is taking $\pi$ to be the set of densities having the same functional form as the likelihood. Carlin and Louis (2000) explain that any distribution which can be written in the exponential family of distributions has a conjugate prior. In addition, when the prior is conjugate it is not really necessary to determine the marginal pdf in Eq. (3.4) and Eq. (3.5) in order to find the posterior pdf $f(\theta \mid \cdot)$; this is verified in Appendix B, Example B1.

In this study, Bayes' theorem is used to find the probability of assigning a meaningful letter grade based on the students' raw scores. One way to interpret the theorem is that it provides a means for updating the instructor's degree of belief about the average raw score of the students in the light of new raw score information that bears on that average. The updating takes place from the instructor's original degree of belief, P{average raw score for each letter grade}, to the instructor's updated belief, P{average raw score for each letter grade | raw scores}. The theorem might also be thought of as a rule for learning from the students' previous raw scores.

3.3.5 Model Set Up for Bayesian Grading

Letter grades are assigned according to the probabilities of the particular raw scores. Given the eleven letter grade components defined in Section 2.3, we can partition the sample space of letter grades into eleven parts. Each part, or component, has its own mean (or median), so that in total there are about eleven parameter sets, and the distribution of the scores is a weighted sum of several component distributions. In this study we assume that the raw scores come from a Normal distribution; more precisely, the raw scores come from a Normal Mixture, since the scores are normally distributed within each component to which we tend to assign them. A feature of this distribution is that if random variables are independent and normally distributed, then any linear combination of them is also normally distributed [Robert, 1972]. Figure 3.4 shows a Normal Mixture with one component on each letter grade, where each component is clearly distinguished; the dotted line shows the letter grade sample space, which is normally distributed with mean $\mu$ and variance $\sigma^2$. All eleven letter grade components are potentially present, but this does not mean that all the letter grades will actually be assigned.
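The normal mixture assumption of this section can be illustrated with a short simulation sketch. The component means, standard deviations and proportions below are hypothetical (the means are placed roughly equidistantly, as in Section 4.5.1), and the class size of 560 matches the data set analysed in this study.

```python
import numpy as np

rng = np.random.default_rng(0)

# A sketch of the Normal Mixture assumption of Section 3.3.5: raw scores are
# drawn from G = 11 normal components, one per letter grade. The component
# means, standard deviations and proportions below are hypothetical.
G = 11
means = 9.0 * np.arange(1, G + 1)   # component means spread over [0, 100]
sds = np.full(G, 5.0)               # component standard deviations
props = np.full(G, 1.0 / G)         # mixing proportions, non-negative, sum to 1

def simulate_scores(n):
    # Draw the (latent) letter grade component first, then the score.
    comp = rng.choice(G, size=n, p=props)
    return rng.normal(means[comp], sds[comp]), comp

scores, comp = simulate_scores(560)  # 560 students, as in the data of this study
print(scores[:5], comp[:5])
```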
3.3.6 Bayesian Methods for Mixtures

Mixture models (finite mixture models) are typically used to model data in which each observation is assumed to have arisen from one of $k$ groups, each group being suitably modeled by a density from some distributional family. The density of each group is referred to as a component of the mixture, and it is weighted by the relative frequency of the group in the population. This method is applied in GB so that raw scores may be clustered together into groups for classification into the corresponding letter grades. A mixture model provides a convenient and flexible family of distributions for estimating or approximating distributions which are not well modeled by any standard parametric family, and it can be used as a parametric alternative to non-parametric density estimation. The advantages of the Bayesian approach to this model are that it selects the "best" model and also gives a coherent way of combining results over different models [Stephens, 2000].

In this study we consider a finite mixture model in which the raw score data $X = \{x_1, x_2, \cdots, x_n\}$ are assumed to be independent and identically distributed from a mixture distribution with $G$ components. Eq. (3.9) below is called the mixture density; the mixture proportions (component probabilities) are constrained to be non-negative and to sum to unity. Our interest is in finding the probability that a particular raw score belongs to a given component of the normal mixture. In the spirit of the law of total probability in Eq. (3.7), the raw scores are independently and identically distributed with density

$$p(x_i) = \sum_{g=1}^{G} \pi_g\, \phi(x_i \mid \mu_g, \sigma_g^2), \qquad i = 1, 2, \ldots, n \qquad (3.9)$$

where $x_i$ is the raw score of student $i$, $g$ indexes the $G = 11$ components of the mixture, and $\pi_g$ is the component probability of component $g$; writing $\pi = \{\pi_1, \pi_2, \cdots, \pi_G\}$, we require $\sum_{g}\pi_g = 1$ (from Definition B1, Appendix B), that is, the total area under all the curves must equal 1, and no $\pi_g$ can be negative [Robert, 1972]. Here $\phi(\cdot)$ is a parametric component density function (pdf), and $\mu_g$ and $\sigma_g^2$ are the mean and variance of component $g$, collected in the vectors $\mu = \{\mu_1, \mu_2, \cdots, \mu_G\}$ and $\sigma^2 = \{\sigma_1^2, \sigma_2^2, \cdots, \sigma_G^2\}$. We may also denote $\theta_1 = \{\pi_1, \mu_1, \sigma_1^2\}$, $\theta_2 = \{\pi_2, \mu_2, \sigma_2^2\}$, $\cdots$, $\theta_G = \{\pi_G, \mu_G, \sigma_G^2\}$, and collect these sets as $\Theta = \{\theta_1, \theta_2, \cdots, \theta_G\}$.

Eq. (3.9) is usually referred to as a mixture density with mixing probabilities $\pi_g$. It is a natural distribution to use for assigning letter grades if the observed students (the population) are more realistically thought of as a combination of several distinct letter grade groups. Mixture models are also attractive because they can accommodate an arbitrarily large range of model anomalies, such as multiple modes and heavy tails within each letter grade interval [Carlin and Louis, 2000]. Obviously, the more raw score observations there are, the more accurately the model is estimated; however, since the estimation procedure is Bayesian, the model can always be estimated, no matter how small the data set is.

To illustrate the parameters involved in a mixture model, we display a structure in which the basic observation units are classified into larger units. This is shown in Figure 3.3, which depicts the conditional independence relationships among the parameters involved. This structured illustration is an example of Hierarchically Structured Data [Press, 2003], which relates directly to the grade assigning problem.
It is natural to model such a problem hierarchically, with the observed outcomes modeled conditionally on certain parameters, known as hyperparameters. Hierarchical models are generally characterized by the expression of a marginal model $P(y)$, where $y$ represents the entire data vector, through a sequence of conditional models involving latent variables. Thus the mixture model for assigning letter grades is viewed hierarchically: the observed random variables of raw scores $x_i$ are modeled conditionally on the hyperparameter vector $\Theta$ [Gelman et al., 1995]. Figure 3.3 can also be related to the Functional Mapping of Letter Grades (see Figure 1.1 in Section 2.3) for the definition of the letter grades. The arrows point downwards to each variable from the conditioning variables (parents) of its prior model. Level 3 displays the rank-ordered raw score observations, Level 2 the sets of parameters $\theta_g$ for each letter grade, and Level 1 the overall mean and variance of the letter grade sample space.

Figure 3.3: Hierarchical Representation of a Mixture. Level 1 holds the overall parameters $(\mu, \sigma^2)$ and $\Theta$; Level 2 the component parameter sets $\theta_1, \theta_2, \cdots, \theta_G$; Level 3 the raw scores $x_{g1}, x_{g2}, \cdots$ within each component.

As shown in Figure 3.4, a very natural constraint is that the eleven letter grade components are ordered by their means. The mean for grade E is the lowest of all the letter grades, grade D has a mean higher than E and lower than D+, and so on; grade A has the highest ranking, with a short interval belonging to the A's. That is, $\mu_1 < \mu_2 < \cdots < \mu_G$, where $G = 11$ is the number of letter grade components. A principle of the distribution-gap method (see Section 3.3.1) is applied here, since there are often gaps in the students' scores: if the component means are relatively far from each other, then these gaps become visible, and the cutoffs are made at the gaps, which amounts to assigning a different letter grade to each component. Note that each letter grade component is associated with three parameters: the component probability, the component mean and the component variance, $\pi_g$, $\mu_g$ and $\sigma_g^2$ respectively. However, one component probability is determined once all the others are known, so the model contains $3G - 1$ free parameters [Alex, 2003]. Based on this model set-up for GB, the estimation of the probability that each raw score corresponds to each letter grade is explained in detail in Chapter IV. Instructor leniency factors are also taken into consideration, to counter over- or under-estimation when assigning the letter grades.

Figure 3.4: Normal Mixture Model Outlined on Each Letter Grade; separate normal curves distinguish the letter grades E, D, D+, C-, C, C+, B-, B, B+, A- and A, and the dotted curve shows the combined distribution of the raw scores.

3.3.7 Mixture of Normal (Gaussian) Distribution

We have assumed that the observed raw scores are drawn independently and identically from Normal distributions. In this subsection we describe the univariate mixture of $G$ Normal distributions. Let us restate the mixture model of Eq.
(3.9) as

$$p(x_i) = \sum_{g=1}^{G} \pi_g\, N(x_i \mid \mu_g, \sigma_g^2), \qquad i = 1, 2, \ldots, n$$

where $N(\cdot)$ denotes the Normal density function,

$$N(x_i \mid \mu_g, \sigma_g^2) = \frac{1}{\sqrt{2\pi}\,\sigma_g}\, \exp\!\left\{-\frac{1}{2}\left(\frac{x_i - \mu_g}{\sigma_g}\right)^2\right\}.$$

For a single Normal component, the likelihood of $n$ observations can equivalently be written as

$$p(x \mid \theta) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left\{-\frac{1}{2\sigma^2}(x_1 - \mu)^2\right\} \times \cdots \times \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left\{-\frac{1}{2\sigma^2}(x_n - \mu)^2\right\} \propto \sigma^{-n}\exp\!\left\{-\frac{1}{2\sigma^2}\sum_i (x_i - \mu)^2\right\}$$

and, letting $s = \sum_{i=1}^{n}(x_i - \mu)^2$, the Normal likelihood may be denoted

$$p(x \mid \theta) \propto \sigma^{-n}\exp\!\left\{-\frac{s}{2\sigma^2}\right\}.$$

Furthermore, by Eq. (3.9),

$$p(x_i) = \pi_1 \frac{1}{\sqrt{2\pi}\,\sigma_1} e^{-\frac{1}{2}\left(\frac{x_i-\mu_1}{\sigma_1}\right)^2} + \pi_2 \frac{1}{\sqrt{2\pi}\,\sigma_2} e^{-\frac{1}{2}\left(\frac{x_i-\mu_2}{\sigma_2}\right)^2} + \cdots + \pi_G \frac{1}{\sqrt{2\pi}\,\sigma_G} e^{-\frac{1}{2}\left(\frac{x_i-\mu_G}{\sigma_G}\right)^2}$$

for $i = 1, 2, \ldots, n$. We have introduced $\pi_g$ as the component probability (mixture proportion) of component $g$; it may also be viewed as an indicator-type quantity, taking the value $\pi_g$ if the $i$th raw score is drawn from the $g$th mixture component and 0 otherwise, with $0 \le \pi_g \le 1$ and $\sum_{g=1}^{G}\pi_g = 1$. We can then read Eq. (3.9) as saying that the probability that a particular raw score belongs to a component of the mixture is proportional to the ordinate of that component at the raw score. In other words, we may write

$$p(x_i \in g) \propto \pi_g\, \phi(x_i \mid \mu_g, \sigma_g^2) \qquad (3.10)$$

Using Eq. (3.9) or Eq. (3.10), we compute the probability that a raw score belongs to each of the components. Since our components correspond to the eleven letter grades, the probability that a raw score belongs to a particular letter grade is the same as the probability that it belongs to the corresponding component.

Let $n$ be the total number of raw scores and $x$ the $n \times 1$ vector of raw scores. Having assumed that the raw scores are iid with $x_i \mid \Theta \sim \sum_{g=1}^{G}\pi_g N(\mu_g, \sigma_g^2)$, the likelihood corresponding to Eq. (3.9) can be expressed as

$$p(x \mid G, \Theta) = \prod_{i=1}^{n} \sum_{g=1}^{G} \pi_g\, N(x_i \mid \mu_g, \sigma_g^2) \qquad (3.11)$$

that is,

$$p(x \mid G, \Theta) = \prod_{i=1}^{n}\left\{ \pi_1 \frac{1}{\sqrt{2\pi}\,\sigma_1} e^{-\frac{1}{2}\left(\frac{x_i-\mu_1}{\sigma_1}\right)^2} + \pi_2 \frac{1}{\sqrt{2\pi}\,\sigma_2} e^{-\frac{1}{2}\left(\frac{x_i-\mu_2}{\sigma_2}\right)^2} + \cdots + \pi_G \frac{1}{\sqrt{2\pi}\,\sigma_G} e^{-\frac{1}{2}\left(\frac{x_i-\mu_G}{\sigma_G}\right)^2} \right\}$$

where $\Theta = \{\pi, \mu, \sigma^2\}$ and $\pi$, $\mu$, $\sigma^2$ are $G \times 1$ vectors of hyperparameters. For the purposes of this study we take $G$ equal to eleven and estimate the other parameters as shown in Chapter IV. The joint, or posterior, distribution of these parameters can be written, as in Section 3.3.4, as

$$f(\Theta \mid G, x) \propto \text{likelihood} \times \text{prior distribution}, \qquad \text{i.e.} \qquad f(\Theta \mid G, x) \propto L(x \mid G, \Theta)\, h(\Theta),$$

which is equivalent to

$$f(\pi, \mu, \sigma^2 \mid G, x) \propto L(x \mid G, \pi, \mu, \sigma^2)\, h(\pi, \mu, \sigma^2)$$

where $h(\cdot)$ is the probability density of the prior on $\Theta = \{\pi, \mu, \sigma^2\}$. We choose conjugate priors, so that the posterior has the same distributional form as the likelihood. These conjugate priors are taken to be Normal, Inverse-Gamma and Dirichlet distributed respectively, denoted $\mu_g \sim N(\nu_g, \delta_g^2)$, $\sigma_g^2 \sim IG(\alpha_g, \beta_g)$ and $\pi \sim Di(\eta)$; these are the types of prior most commonly chosen for Normal observations.

3.3.8 Prior Distribution

Choosing the prior and its hyperparameters raises several issues. In some cases the instructor approximately knows what the mean of each component should be.
In many instances, however, $h(\cdot)$ is not known, which injects the problem of personal or subjective probability; yet the choice of $h(\cdot)$ affects the posterior pdf [Hogg and Craig, 1978]. In other instances the instructor has no prior information at all, and that lack of information should be respected so that the results can be checked by an objective analysis. Cornebise et al. (2005) give a rule for choosing the prior: one can use an empirical prior, with hyperparameters built up from the data, or a noninformative prior carrying no information at all. The latter is actually hard to achieve, because a purely noninformative prior can be improper and cause trouble. Moreover, the prior is often chosen within a closed-under-sampling, or conjugate, family, so that conditioning on the sample only changes the hyperparameters, not the family. For further reading on noninformative priors we recommend Carlin and Louis (2000) and Box and Tiao (1973).

Here we adopt the conjugate prior implementation of the posterior distribution. We employ the shorthand notation $f(x \mid \theta) = N(x \mid \mu, \sigma^2)$ to denote a Normal density with mean $\mu$ and variance $\sigma^2$. Before showing that $\mu \sim N(\nu, \delta^2)$, $\sigma^2 \sim IG(\alpha, \beta)$ and $\pi \sim Di(\eta)$ are appropriate conjugate priors, we note that one could instead insist on a noninformative prior for $\Theta$ and argue that all of the information in the posterior distribution is then generated from the data, so that the resulting inferences are completely objective rather than subjective. The main issue is how to select a prior which provides little information relative to what is expected to be provided by the intended observations.

Now consider the case of a single parameter, which can then be applied directly to the hyperparameter problem. Suppose a random sample of $n$ raw score observations comes from a Normal distribution $N(\mu, \sigma^2)$. The likelihood function of $\mu$ given the $n$ independent raw score observations is

$$L(\mu \mid x, \sigma) = \prod_{i=1}^{n} f(x_i \mid \mu) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{1}{2}\left(\frac{x_i-\mu}{\sigma}\right)^2} = \left[2\pi\sigma^2\right]^{-n/2} \exp\!\left\{-\frac{1}{2\sigma^2}\sum_i (x_i - \mu)^2\right\} \qquad (3.12)$$

Since $\sum_i (x_i - \mu)^2 = \sum_i (x_i - \bar{x})^2 + n(\mu - \bar{x})^2$, and given that $\sum_i (x_i - \bar{x})^2$ is a fixed constant and $\sigma^2$ is known, we simplify Eq. (3.12) as follows:

$$L(\mu \mid x, \sigma) = \left[2\pi\sigma^2\right]^{-n/2} \exp\!\left\{-\frac{1}{2\sigma^2}\left[\sum_i (x_i - \bar{x})^2 + n(\mu - \bar{x})^2\right]\right\} \propto \exp\!\left\{-\frac{n}{2\sigma^2}(\mu - \bar{x})^2\right\} = \exp\!\left\{-\frac{1}{2}\left(\frac{\mu - \bar{x}}{\sigma/\sqrt{n}}\right)^2\right\} \qquad (3.13)$$

In words, the likelihood of $\mu$ is proportional to a Normal distribution centered at $\bar{x}$ with standard deviation $\sigma/\sqrt{n}$. Likelihoods of the form of Eq. (3.13) are said to be data-translated, since they can be written as $L(\theta \mid x) = g(\theta - t(x))$, so that different data give rise to the same functional form for the likelihood [Lee, 1989]. We have therefore shown that the conjugate prior of $\mu_g$ is again Normal, $\mu_g \sim N(\nu_g, \delta_g^2)$, where $N(\nu_g, \delta_g^2)$ has mean $\nu_g$ and variance $\delta_g^2$ for each letter grade component. We write the distribution that motivates this conjugate prior for $\mu$ (or $\mu_g$) as

$$p(\mu \mid x) \propto \exp\!\left\{-\frac{1}{2}\left(\frac{\mu - \bar{x}}{\sigma/\sqrt{n}}\right)^2\right\}$$

so that the conjugate prior of $\mu_g$ is Normally distributed, $p(\mu_g) = N(\nu_g, \delta_g^2)$, with pdf

$$p(\mu_g) = \frac{1}{\sqrt{2\pi}\,\delta_g}\exp\!\left\{-\frac{1}{2}\left(\frac{\mu_g - \nu_g}{\delta_g}\right)^2\right\}.$$

Next, for the component variances we consider $L(\sigma \mid x, \mu)$, starting from the same likelihood as in Eq.
(3.12), with $\sigma^2$ now the quantity of interest given the $n$ raw score observations and a known $\mu$. Thus we have

$$L(\sigma \mid x, \mu) = \left[2\pi\sigma^2\right]^{-n/2}\exp\!\left\{-\frac{1}{2\sigma^2}\sum_i (x_i - \mu)^2\right\} \propto \sigma^{-n}\exp\!\left\{-\frac{n s^2}{2\sigma^2}\right\} \qquad (3.14)$$

where $s^2 = \sum_{i=1}^{n}(x_i - \mu)^2 / n$. Box and Tiao (1973) have shown that the likelihood curves of Eq. (3.14) in the original metric $\sigma$ are not data-translated, so the noninformative prior should not be taken to be locally uniform in $\sigma$. However, the corresponding likelihood curves in terms of $\log\sigma$ are exactly data-translated, and the noninformative prior should therefore be locally uniform in $\log\sigma$. In principle we might use any form of prior distribution for the variance $\sigma^2$; following this argument, we rewrite Eq. (3.14) as a log-likelihood,

$$\log L(\sigma \mid x, \mu) = -n\log\sigma - \frac{n}{2}\log 2\pi - \frac{n s^2}{2\sigma^2} \propto -n\log\sigma - \frac{n s^2}{2\sigma^2} \qquad (3.15)$$

(multiplying the likelihood by the constant $s^n$ leaves it unchanged and merely adds $n\log s$ to Eq. (3.15)). A prior that is locally uniform in $\log\sigma$ corresponds, on transforming back to the original metric, to

$$p(\sigma) \propto \frac{d\log\sigma}{d\sigma} = \frac{1}{\sigma} \qquad (3.16)$$

which gives the prior distribution of $\sigma^2$ as

$$p(\sigma^2) \propto \frac{1}{\sigma^2} \qquad (3.17)$$

Using Eq. (3.17) together with Eq. (3.15) and the appropriate normalizing constant, we obtain

$$p(\sigma^2 \mid x) = \frac{(ns^2/2)^{n/2}}{\Gamma(n/2)}\,\left[\sigma^2\right]^{-[(n/2)+1]}\exp\!\left(-\frac{n s^2}{2\sigma^2}\right) = k\left[\sigma^2\right]^{-[(n/2)+1]}\exp\!\left(-\frac{n s^2}{2\sigma^2}\right) \qquad (3.18)$$

where $k = (ns^2/2)^{n/2}/\Gamma(n/2)$ is the normalizing constant required to make the distribution integrate to unity. To verify this, let $\alpha = n/2$ and $\beta = ns^2/2$; the right-hand side of Eq. (3.18) is then

$$\frac{\beta^{\alpha}}{\Gamma(\alpha)}\left[\sigma^2\right]^{-[\alpha+1]}\exp\!\left(-\frac{\beta}{\sigma^2}\right)$$

and, applying the integral formula in Appendix D (formula (ii)),

$$\int_0^{\infty}\frac{\beta^{\alpha}}{\Gamma(\alpha)}\left[\sigma^2\right]^{-[\alpha+1]}\exp\!\left(-\frac{\beta}{\sigma^2}\right) d\sigma^2 = \frac{\beta^{\alpha}}{\Gamma(\alpha)}\,\Gamma(\alpha)\,\beta^{-\alpha} = 1. \;\blacksquare$$

Therefore the conjugate prior of $\sigma^2$ is of Inverse-Gamma type, with pdf

$$p(\sigma^2) = \frac{\beta^{\alpha}}{\Gamma(\alpha)}\left[\sigma^2\right]^{-[\alpha+1]}\exp\!\left(-\frac{\beta}{\sigma^2}\right).$$

Carlin and Louis (2000) state that if $X \sim IG(\alpha, \beta)$ then $E[X] = 1/\left[\beta(\alpha-1)\right]$ and $\mathrm{Var}[X] = 1/\left[\beta^2(\alpha-1)^2(\alpha-2)\right]$, provided $\alpha > 1$ and $\alpha > 2$ respectively (note that in their parametrization the IG density involves $\exp\{-1/(\beta x)\}$). They also call the Inverse-Gamma the reciprocal gamma, since $X = 1/Y$ where $Y \sim G(\alpha, \beta)$. The Inverse-Gamma is very commonly used in Bayesian statistics as the conjugate prior for a variance parameter $\sigma^2$ arising in a normal likelihood function. Choosing $\alpha$ and $\beta$ appropriately for such a prior can be aided by solving the IG mean and variance for $\mu \equiv E[X]$ and $\sigma^2 \equiv \mathrm{Var}[X]$. The result is

$$\alpha = \left(\frac{\mu}{\sigma}\right)^2 + 2 \qquad \text{and} \qquad \beta = \frac{1}{\mu\left[(\mu/\sigma)^2 + 1\right]}$$

Setting the prior mean and standard deviation both equal to $\mu$ (i.e. $\mu/\sigma = 1$) thus produces $\alpha = 3$ and $\beta = 1/(2\mu)$.
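These hyperparameter formulas are easy to check numerically. The sketch below assumes the Carlin and Louis parametrization stated above; the prior mean and standard deviation fed to it are illustrative.

```python
# A numerical check of the Inverse-Gamma hyperparameter formulas above, in the
# Carlin and Louis (2000) parametrization, where E[X] = 1/(beta*(alpha-1)) and
# Var[X] = 1/(beta^2*(alpha-1)^2*(alpha-2)).
def ig_hyperparameters(prior_mean, prior_sd):
    ratio2 = (prior_mean / prior_sd) ** 2
    alpha = ratio2 + 2
    beta = 1.0 / (prior_mean * (ratio2 + 1))
    return alpha, beta

# Setting the prior mean and standard deviation equal (mu/sigma = 1), e.g. to
# the value 29 used for the component variances in Section 4.5.1, reproduces
# alpha = 3 and beta = 1/(2*mu).
alpha, beta = ig_hyperparameters(29.0, 29.0)
print(alpha, beta, 1.0 / (2 * 29.0))
```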
Finally, we adopt the most widely used prior distribution for the component probabilities: a Dirichlet distribution with parameter vector $\eta$. The Dirichlet is a distribution on the set of all probability vectors. Typically the vector $\pi$ is assigned a Dirichlet prior with elements $\{\eta_1, \eta_2, \cdots, \eta_G\}$, a typical choice being $\eta_1 = \eta_2 = \cdots = \eta_G = 1$. To make all components a priori equally likely, all the elements of $\eta$ must be equal; we therefore say that $\pi$ has a symmetric Dirichlet distribution, with pdf

$$p(\pi \mid \eta) = \frac{\Gamma(\eta_0)}{\prod_{g=1}^{G}\Gamma(\eta_g)}\prod_{g=1}^{G}\pi_g^{\eta_g - 1} \qquad (3.19)$$

The Dirichlet is the multivariate generalization of the beta: for $G = 2$, $Di(\eta) = Beta(\eta_1, \eta_2)$. If $X \sim Di(\eta)$ with $X = (x_1, x_2, \ldots, x_G)'$, $0 \le x_g \le 1$, $\sum_{g=1}^{G}x_g = 1$, $\eta = (\eta_1, \eta_2, \ldots, \eta_G)'$ and $\eta_g \ge 0$, then $E[X_g] = \eta_g/\eta_0$, $\mathrm{Var}[X_g] = \eta_g(\eta_0 - \eta_g)/\left[\eta_0^2(\eta_0 + 1)\right]$ and $\mathrm{Cov}(X_g, X_h) = -\eta_g\eta_h/\left[\eta_0^2(\eta_0 + 1)\right]$, where $\eta_0 = \sum_{g=1}^{G}\eta_g$. We do not describe the Dirichlet distribution further here; more information can be found in Carlin and Louis (2000), pp. 51 and 327, Congdon (2003), p. 58, and Gelman et al. (1995), pp. 79 and 476.

3.3.9 Posterior Distribution

In Section 3.3.7 we wrote the posterior distribution as proportional to the product of the likelihood and the prior distribution, that is,

$$f(\pi, \mu, \sigma^2 \mid G, x) \propto L(x \mid G, \pi, \mu, \sigma^2)\, h(\pi, \mu, \sigma^2) \qquad (3.20)$$

To find the posterior of the component means we therefore substitute the prior distribution of $\mu_g$ and the likelihood function of the raw scores (which, as shown above, is proportional to a $N(\bar{x}, \sigma^2/n)$ density in $\mu$) into Eq. (3.20). Suppose a priori the parameter $\mu$ is distributed as

$$p(\mu) = \frac{1}{\sqrt{2\pi}\,\delta_g}\exp\!\left\{-\frac{1}{2}\left(\frac{\mu - \nu_g}{\delta_g}\right)^2\right\}$$

and that the likelihood of $\mu$ given the $n_g$ raw score observations in component $g$ is proportional to a Normal density,

$$L(\mu \mid x) \propto \exp\!\left\{-\frac{1}{2}\left(\frac{\mu - \bar{x}}{\sigma_g/\sqrt{n_g}}\right)^2\right\}, \qquad -\infty < \mu < \infty,$$

where $\bar{x} = \sum_{x_i \in g} x_i / n_g$ is a function of the observations $x$. Then, by Bayes' theorem, the posterior distribution of $\mu$ given the raw score data is

$$p(\mu \mid x) = \frac{p(\mu)\,L(\mu \mid x)}{\int p(\mu)\,L(\mu \mid x)\,d\mu} = \frac{f(\mu \mid x)}{\int f(\mu \mid x)\,d\mu}$$

where

$$f(\mu \mid x) = p(\mu)\,L(\mu \mid x) \propto \exp\!\left\{-\frac{1}{2}\left[\left(\frac{\mu - \nu_g}{\delta_g}\right)^2 + \left(\frac{\mu - \bar{x}}{\sigma_g/\sqrt{n_g}}\right)^2\right]\right\} \qquad (3.21)$$

Using the identity [Box and Tiao, 1973]

$$A(z - a)^2 + B(z - b)^2 = (A + B)(z - c)^2 + \frac{AB}{A + B}(a - b)^2, \qquad c = \frac{Aa + Bb}{A + B},$$

with $A = 1/\delta_g^2$, $B = n_g/\sigma_g^2$, $z = \mu$, $a = \nu_g$ and $b = \bar{x} = \sum_{x_i \in g} x_i / n_g$, the terms in the exponential become

$$\left(\frac{\mu - \nu_g}{\delta_g}\right)^2 + \left(\frac{\mu - \bar{x}}{\sigma_g/\sqrt{n_g}}\right)^2 = \left(\frac{1}{\delta_g^2} + \frac{n_g}{\sigma_g^2}\right)(\mu - \upsilon_g)^2 + d$$

where

$$\upsilon_g = \left(\frac{1}{\delta_g^2} + \frac{n_g}{\sigma_g^2}\right)^{-1}\left(\frac{\nu_g}{\delta_g^2} + \frac{\sum_{x_i \in g} x_i}{\sigma_g^2}\right)$$

and $d$ is a constant independent of $\mu$.
Thus,

$$f(\mu \mid x) = \exp\!\left\{-\frac{1}{2}\left[\left(\frac{1}{\delta_g^2} + \frac{n_g}{\sigma_g^2}\right)(\mu - \upsilon_g)^2 + d\right]\right\} = \exp\!\left\{-\frac{1}{2}\left(\frac{1}{\delta_g^2} + \frac{n_g}{\sigma_g^2}\right)(\mu - \upsilon_g)^2\right\}\exp\!\left(-\frac{d}{2}\right)$$

so that

$$\int_{-\infty}^{\infty} f(\mu \mid x)\, d\mu = \exp\!\left(-\frac{d}{2}\right)\int_{-\infty}^{\infty}\exp\!\left\{-\frac{1}{2}\left(\frac{1}{\delta_g^2} + \frac{n_g}{\sigma_g^2}\right)(\mu - \upsilon_g)^2\right\} d\mu = \sqrt{2\pi}\left(\frac{1}{\delta_g^2} + \frac{n_g}{\sigma_g^2}\right)^{-1/2}\exp\!\left(-\frac{d}{2}\right).$$

Substituting the results for $f(\mu \mid x)$ and $\int f(\mu \mid x)\,d\mu$, we obtain

$$p(\mu \mid x) = \frac{f(\mu \mid x)}{\int f(\mu \mid x)\,d\mu} = \frac{1}{\sqrt{2\pi}}\left(\frac{1}{\delta_g^2} + \frac{n_g}{\sigma_g^2}\right)^{1/2}\exp\!\left\{-\frac{1}{2}\left(\frac{1}{\delta_g^2} + \frac{n_g}{\sigma_g^2}\right)(\mu - \upsilon_g)^2\right\} \qquad (3.22)$$

Defining

$$V_g = \left[\frac{1}{\delta_g^2} + \frac{n_g}{\sigma_g^2}\right]^{-1} \qquad \text{and} \qquad M_g = \frac{\nu_g}{\delta_g^2} + \frac{\sum_{x_i \in g} x_i}{\sigma_g^2},$$

so that $\upsilon_g = V_g M_g$, Eq. (3.22) simplifies to

$$p(\mu_g \mid x, \sigma_g^2) = \frac{1}{\sqrt{2\pi V_g}}\exp\!\left\{-\frac{(\mu_g - V_g M_g)^2}{2 V_g}\right\} \qquad (3.23) \;\blacksquare$$

Applying this result, the full conditional distribution of $\mu_g$ is $\mu_g \mid \cdots \sim N(V_g M_g,\, V_g)$.

Now we find the posterior distribution of $\sigma_g^2$. Suppose a priori the parameter $\sigma_g^2$ is distributed as

$$p(\sigma_g^2) = \frac{\beta_g^{\alpha_g}}{\Gamma(\alpha_g)}\left[\sigma_g^2\right]^{-[\alpha_g+1]}\exp\!\left(-\frac{\beta_g}{\sigma_g^2}\right).$$

Using the conditions of Section 3.3.7, we have

$$p(\sigma_g^2 \mid x) \propto p(\sigma_g^2)\,L(\sigma_g^2 \mid x) = \frac{\beta_g^{\alpha_g}}{\Gamma(\alpha_g)}\left[\sigma_g^2\right]^{-[\alpha_g+1]}\sigma_g^{-n_g}\exp\!\left\{-\left[\frac{\beta_g}{\sigma_g^2} + \frac{1}{2}\sum_{x_i \in g}\left(\frac{x_i - \mu_g}{\sigma_g}\right)^2\right]\right\}$$

which simplifies to

$$p(\sigma_g^2 \mid x) \propto \left[\sigma_g^2\right]^{-\left[\alpha_g + (n_g/2) + 1\right]}\exp\!\left\{-\frac{\beta_g + \frac{1}{2}\sum_{x_i \in g}(x_i - \mu_g)^2}{\sigma_g^2}\right\}.$$

Using the appropriate normalizing constant so that the distribution integrates to unity, the full conditional distribution of $\sigma_g^2$ is

$$p(\sigma_g^2 \mid x, \mu_g) = \frac{\left[\beta_g + \frac{1}{2}\sum_{x_i \in g}(x_i - \mu_g)^2\right]^{\alpha_g + n_g/2}}{\Gamma\!\left(\alpha_g + n_g/2\right)}\left[\sigma_g^2\right]^{-\left[\alpha_g + (n_g/2) + 1\right]}\exp\!\left\{-\frac{\beta_g + \frac{1}{2}\sum_{x_i \in g}(x_i - \mu_g)^2}{\sigma_g^2}\right\} \qquad (3.24) \;\blacksquare$$

That is, in the shape–scale parametrization of the Gamma distribution,

$$1/\sigma_g^2 \mid \cdots \sim G\!\left(\alpha_g + n_g/2,\; \left[\beta_g + \tfrac{1}{2}\textstyle\sum_{x_i \in g}(x_i - \mu_g)^2\right]^{-1}\right)$$

or, equivalently, writing the prior in the Carlin and Louis parametrization $\sigma_g^2 \sim IG(\alpha_g, \beta_g)$ (in which the density involves $\exp\{-1/(\beta_g \sigma_g^2)\}$), the full conditional is

$$\sigma_g^2 \mid \cdots \sim IG\!\left(\alpha_g + n_g/2,\; \left[\beta_g^{-1} + \tfrac{1}{2}\textstyle\sum_{x_i \in g}(x_i - \mu_g)^2\right]^{-1}\right)$$

which is the form used in Algorithm 4.2 of Chapter IV.
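As a quick numerical illustration of the full conditionals in Eqs. (3.23) and (3.24), the following sketch computes the conditional posterior parameters for a single component; all scores, hyperparameters and current parameter values are hypothetical.

```python
import numpy as np

# A small numerical illustration of the full conditionals above for one
# component g. All values below (scores, hyperparameters, current values)
# are hypothetical.
x_g = np.array([58.0, 61.0, 63.0, 64.0])    # raw scores currently allocated to g
n_g = len(x_g)

nu_g, delta_g2 = 63.0, 400.0                # prior for mu_g: N(nu_g, delta_g2)
alpha_g, beta_g = 3.0, 1.0 / 58.0           # prior for sigma_g^2 (Carlin-Louis IG form)
mu_g, sigma_g2 = 62.0, 29.0                 # current values of mu_g and sigma_g^2

# Eq. (3.23): mu_g | ... ~ N(V_g * M_g, V_g)
V_g = 1.0 / (1.0 / delta_g2 + n_g / sigma_g2)
M_g = nu_g / delta_g2 + x_g.sum() / sigma_g2
print("mu_g full conditional mean and variance:", V_g * M_g, V_g)

# Eq. (3.24): sigma_g^2 | ... ~ IG(alpha_g + n_g/2, [1/beta_g + 0.5*sum sq]^-1)
shape = alpha_g + n_g / 2.0
scale = 1.0 / (1.0 / beta_g + 0.5 * np.sum((x_g - mu_g) ** 2))
print("sigma_g^2 full conditional IG shape and scale:", shape, scale)
```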
3.4 Interval Estimation

Credible or probability intervals (or credible sets) are usually used for Bayesian inference, so as not to confuse them with confidence intervals [Carlin and Rubin, 2000; Press, 2003; Hogg et al., 2005]. Suppose the posterior distribution for the parameter $\theta$ is given by $F(\theta \mid x_1, x_2, \cdots, x_n) \equiv F(\theta \mid \text{raw scores})$. Then, for some preassigned $\alpha$, we can find an interval $(a, b)$:

Definition 3.1 [Carlin and Rubin, 2000]: A $100(1-\alpha)\%$ credible set for $\Theta$ is a subset $C = (a, b)$ of $\Theta$ such that

$$1 - \alpha \le P\{a < \theta < b \mid x\} = \begin{cases} \int_C p(\theta \mid x)\, d\theta = F(b) - F(a) & \text{continuous case} \\[1ex] \sum_{C} p(\theta \mid x) & \text{discrete case} \end{cases} \qquad (3.25)$$

Definition 3.1 enables a direct probability statement about the likelihood of $\theta$ falling in $C$: the probability that $\theta$ lies in $C$ given the observed data $x$ is at least $(1-\alpha)$. The interval $(a, b)$ is called a credibility interval for $\theta$ at credibility level $\alpha$. As with confidence intervals, if we choose $\alpha = 0.05$ then we refer to such an interval as a 95 percent credibility interval; Press (2003) refers to $(a, b)$ as a 95 percent Bayesian confidence interval, to distinguish it from a frequentist confidence interval. We use "$\le$" rather than "$=$" in Definition 3.1 in order to accommodate discrete settings, where an interval with coverage probability exactly $(1-\alpha)$ may not be obtainable. In the continuous case "$=$" is often used to simplify the definition; it conveys the same meaning and gives exactly the right coverage, which minimizes the size of the interval and thus yields a more precise estimate.

More generally, the Highest Posterior Density (HPD) interval gives a technique for minimizing the interval and obtaining a precise estimate. Since $\alpha$ is preassigned, knowing the functional form of $F$ does not by itself specify the interval $(a, b)$: it is not uniquely defined, so which interval should be chosen? We choose the specific $(1-\alpha)$ interval that contains most of the posterior probability, that is, the smallest interval $(a, b)$ satisfying two properties [Press, 2003]:

Property I: $F(b) - F(a) = 0.95$.

Property II: if $p(\theta \mid x_1, x_2, \cdots, x_n)$ denotes the posterior density, then for $a < \theta < b$ the value of $p(\theta \mid x_1, x_2, \cdots, x_n)$ is greater than it is over any other interval for which Property I holds.

The posterior density at every point inside the HPD interval is greater than at every point outside the interval, and, for a given credibility level, the HPD interval is as small as possible. The HPD interval always exists and is unique, provided that for all intervals of content $(1-\alpha)$ the posterior density is not uniform in any interval of the space of $\theta$ [Box and Tiao, 1978, Section 2.8, pp. 122–125]. The first property is employed in the formal definition.

Definition 3.2 [Box and Tiao, 1978]: Let $p(\Theta \mid x_1, x_2, \cdots, x_n)$ be a posterior density function. A region $I$ in the parameter space of $\Theta$ is called an HPD interval of content $(1-\alpha)$ if: (i) $\Pr\{\Theta \in I \mid x\} = 1 - \alpha$; and (ii) for $\theta_1 \in I$ and $\theta_2 \notin I$, $p(\theta_1 \mid x) \ge p(\theta_2 \mid x)$.

Now let $\phi = f(\theta)$ define a one-to-one transformation of the parameters from $\theta$ to $\phi$. Any interval of content $(1-\alpha)$ in the space of $\Theta$ transforms into an interval of the same content in the space of $\phi$, and an HPD interval remains an HPD interval under the transformation if the transformation is linear.
As in the univariate case, when a noninformative prior is used, which is equivalent to assuming that some transformed set of parameters $\phi = f(\theta)$ is locally uniformly distributed, standardized HPD intervals calculated in terms of $\phi$ are available. In Chapter IV we present the Gibbs sampling approach used to estimate the posterior distribution, together with graphical representations from which the intervals for the parameter estimates can be read. All results and plots are computed with the WinBUGS programming package.

CHAPTER 4

NUMERICAL IMPLEMENTATION OF THE BAYESIAN GRADING

4.1 Introduction to Markov Chain Monte Carlo Methods

We have set up a model for assigning grades through conditional Bayesian inference. In this chapter we explain how to estimate the model given in Chapter III (Equations (3.9) and (3.10)), and we discuss a general simulation approach for numerically calculating the quantities that arise in the Bayesian prior–posterior analysis. In the letter grade assigning problem we are interested in finding the optimal mean values for each well-defined grade component; that is, we want to learn about the unknown parameters $\theta$ of the posterior density. We typically seek posterior expectations of functions of the parameters $(\mu, \sigma, \pi)$, for example

$$E\left[f(\mu, \sigma, \pi) \mid x\right] = \int f(\mu, \sigma, \pi)\, p(\mu, \sigma, \pi \mid x)\, d(\mu, \sigma, \pi),$$

or, writing $\theta = (\mu, \sigma, \pi)$ more compactly,

$$\gamma \equiv E\left[f(\theta) \mid x\right] = \int f(\theta)\, p(\theta \mid x)\, d\theta.$$

Then

$$\hat{\gamma} = \frac{1}{N}\sum_{i=1}^{N} f(\theta_i)$$

where $\theta$ is the vector-valued parameter, $x$ is the vector of raw scores and the $\theta_i$ are draws from the posterior; $\hat{\gamma}$ converges to $E[f(\theta) \mid x]$ with probability 1 as $N \to \infty$. Such integrals cannot be computed analytically when the dimension of the integration exceeds three or four. In such cases we can compute the integral by Monte Carlo sampling, or more precisely by what are called non-iterative Monte Carlo methods [Carlin and Louis, 2000].

The idea behind Markov chain (MC) methods is to set aside the immediate task at hand and ask how the posterior density $p(\theta \mid x)$ may be sampled. Press (2003) explains that if we had iid draws $\theta^{(1)}, \theta^{(2)}, \cdots, \theta^{(N)} \sim p(\theta \mid x)$ from the posterior density, then, provided the sample is large enough, we could estimate not just the above integral but also other features of the posterior density by taking those draws and forming the relevant sampling-based estimates. For example, the sample average of the drawn values would be our simulation-based estimate of the posterior mean. Under suitable strong laws of large numbers these estimates converge to the posterior quantities as the simulation size grows.

Furthermore, it is possible to sample complex, high-dimensional posterior densities by a set of methods called Markov Chain Monte Carlo (MCMC). The objectives of MCMC are to generate a sample from the joint posterior distribution and to estimate expectations of the parameters $\theta$. These methods involve the simulation of a suitably constructed Markov chain that converges to the target density of interest (the posterior density). In general, a Markov chain is defined by the property that the conditional density of $\theta^{(t)}$, conditioned on the entire preceding history of the chain, depends only on the previous value $\theta^{(t-1)}$. The underlying rationale of MCMC simulation is to construct a transition density that converges to the posterior density from any starting point $\theta^{0}$, in the sense that for any measurable set $S$ under $p$, $\Pr\!\left(\theta^{(t)} \in S \mid x, \theta^{0}\right)$ converges to $\int_S p(\theta \mid x)\, d\theta$ as $t \to \infty$.
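The sampling-based estimate $\hat{\gamma}$ can be illustrated in a few lines. In the sketch below the posterior draws are simulated directly from an assumed normal posterior purely for illustration; in MCMC they would come from the sampler, but the estimates of the posterior mean and of a 95% equal-tail credible interval (Section 3.4) are formed from the draws in exactly the same way.

```python
import numpy as np

rng = np.random.default_rng(1)

# A minimal sketch of the sampling-based estimate gamma-hat described above.
# For illustration only, the posterior of a component mean is taken to be
# N(62, 1.5^2) and sampled directly; in MCMC the draws would come from the
# sampler instead.
draws = rng.normal(62.0, 1.5, size=10_000)

post_mean = draws.mean()                             # gamma-hat with f(theta) = theta
ci_low, ci_high = np.percentile(draws, [2.5, 97.5])  # 95% equal-tail credible interval

print(post_mean, (ci_low, ci_high))
```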
In this study we do not describe Markov chains in detail, since the Markov chain itself is not the focus of the study; however, we use the concept of transition probabilities, and the Chapman–Kolmogorov formula is applied. For extensions of these topics the reader may refer to Walsh (2004), Gelman et al. (1995), Carlin and Louis (2000) and Press (2003).

One problem with applying Monte Carlo integration lies in obtaining samples from a complex probability distribution $p(x)$. This problem is solved by MCMC methods — it was originally solved by mathematical physicists who integrated very complex functions by random sampling — and the resulting most general MCMC approach is called the Metropolis–Hastings (M-H) algorithm, introduced by Metropolis et al. (1953) [Press, 2003]. A second technique for constructing MC samplers is the Gibbs sampling algorithm, introduced by Geman and Geman (1984), Tanner and Wong (1987) and Gelfand and Smith (1990) [Press, 2003]. The Gibbs sampling algorithm is in fact the best-known special case of the M-H algorithm.

4.2 Gibbs Sampling

In this study we do not discuss the M-H algorithm, since our objective is to solve the grading problem using Gibbs sampling, one of the simplest MCMC algorithms and a special case of the M-H algorithm. Gibbs sampling is also known as alternating conditional sampling [Gelman et al., 1995], and it is defined in terms of subvectors of $\theta$. Suppose the parameters $\theta$ derived from the raw scores have been divided into $g$ components or subvectors, $\Theta = \{\theta_1, \theta_2, \cdots, \theta_g\}$. The Markov chain is constructed by sampling from the set of full conditional densities

$$\left\{\, p(\theta_1 \mid x, \theta_2, \theta_3, \cdots, \theta_g);\; p(\theta_2 \mid x, \theta_1, \theta_3, \cdots, \theta_g);\; \cdots;\; p(\theta_g \mid x, \theta_1, \theta_2, \cdots, \theta_{g-1}) \,\right\}.$$
68 Note that, we must eliminate Θ( t ) for all t ≤ B, where B T is the “burn-in” period and since we are presumes that the limiting distribution has been reached. The remaining values of Θ( t ) are the simulated draws from the posterior distribution of Θ . Alternatively, we can iterate step 2 and 3 as follows: For t = 1, 2,..., B + T , Construct Θ( t ) as follows 2. { ~ p {θ Update θ 1t to θ 1t +1~ p θ 1t +1 x,θ t2,θ 3t , ⋅⋅⋅,θ t g Update θ t2 to θ t +1 2 }, } t +1 2 x,θ 1t ,θ 3t , ⋅⋅⋅,θ t g t +1 2 x,θ 1t ,θ t2, ⋅⋅⋅,θ t g −1 ⋅ ⋅ ⋅ Update θ t g to θ t +1 g { ~p θ } The complete updated vector is then labeled Θ(t +1) . Repeat the above steps 3. B + T times. Theorem 4.1 [Carlin and Louis (2000)] For the Gibbs sampling algorithm (as outline above), { (t ) } →{θ d } ~ p {θ } as t → ∞ (a) Θ( t ) = θ 1(t ),θ (2t ), ⋅⋅⋅, θ (b) The converges in theorem (a) is exponential in t using L 1 norm. g ,θ 2, ⋅⋅⋅,θ 1 g ,θ 2, ⋅⋅⋅,θ 1 g From this theorem, all we require to obtain samples from the joint distribution of {θ ,θ 2, ⋅⋅⋅,θ 1 g } is ability to sample from g corresponding full conditional distributions. In this study, the joint distribution of interest is the joint posterior distribution of the given raw scores, denotes as p {θ 1, θ 2, ⋅⋅⋅,θ g x} . 69 4.3 Introduction to WinBUGS Computer Program As mentioned in Chapter I and Chapter III, for Bayesian problem we will analyses the estimation of the posterior distribution via WinBUGS. Before that, we explain WinBUGS in general. WinBUGS is defined as the MS Windows operating system version of Bayesian Analysis Using Gibbs Sampling which is a versatile package that has been designed to carry out MCMC computations for a wide variety of Bayesian models [Press, (2003)]. The software can be found in the internet in which a free downloaded version might be run by clicking at the WinBUGS links. The program and manuals are available over the web at www.mrc-bsu.cam.ac.uk/bugs/. WinBUGS requires that the Bayesian model expressible as a directed graph. There is no standard that the users exactly make a drawing of the model in the form of directed graph. However based on understanding of the model, the instructor can handle the directed graph as they understand the model in assigning letter grades for their student. The directed graph is extremely helpful the instructor as the first step in doing the analysis. To practice, in drawing the directed graph we can used the menu on the WinBUGS window which permit the user to draw a directed graph. For example of WinBUGS program and directed graph see Volume I, Volume II and Volume III in Help menu of WinBUGS windows. 4.4 Model Description Now, we want to decide the unknown parameter for letter grades assigning problem. Let y i be equal to the component (i.e. particular letter grade) of mixture to which raw scores x i fit in. Therefore the unknown parameters in this study are Θ = { y, π , µ , σ 2 } . Enhance the unknown parameters with y is actually to make it easy in finding the conditional distribution needed by Gibbs sampling algorithm. By augmenting 70 the data x by “missing data” Y = { y1 , y2 , ⋅⋅⋅, yn } , in which each raw score observation xi is assumed to arise from a specific but unknown component yi of the mixture [Stephens, 2000]. In addition, this variable is not observed and thus named latent variable and indicates the original population of observation x i [Cornebise et al., 2005]. 
Therefore the model $p(x_i \in G) \propto \pi_g\,\phi(x_i \mid \mu_g, \sigma_g^2)$ can be written in terms of the latent variables, with $y_i$ assumed to be a realization of independent and identically distributed discrete random variables $Y_1, Y_2, \cdots, Y_n$ with probability mass function
$$p\big(y_i = g \mid \pi, \mu_g, \sigma_g^2\big) = \pi_g, \qquad (4.1)$$
for $i = 1, 2, \ldots, n$ and $g = 1, 2, \ldots, G$. Conditional on the $Y$'s, the raw scores $x_1, x_2, \ldots, x_n$ are assumed to be independent observations from the densities
$$p\big(x_i \mid y_i = g, \pi, \mu, \sigma^2\big) = p\big(x_i ; \mu_g, \sigma_g^2\big) = N\big(\mu_g, \sigma_g^2\big). \qquad (4.2)$$
Integrating out the latent variables $Y_1, Y_2, \cdots, Y_n$ then yields the model in Eq.(3.10) as mentioned above.

Let "$\mid \cdots$" mean conditioning on all other parameters in $\Theta$ and on the raw score data, with $G = 11$. The conditional distributions are
$$p(y_i = g \mid \cdots) \propto \pi_g\, N\big(x_i \mid \mu_g, \sigma_g^2\big),$$
and the posterior for $\pi$ is then given by a Dirichlet distribution with elements $\eta_g + n_g$, where $n_g$ is the number of sample members assigned to the $g$th letter grade [Congdon, 2003]. Thus we have $\pi \mid \cdots \sim \mathrm{Di}(\eta_g + n_g)$, with $n_g = \#\{i : y_i = g\}$; that is, $n_g$ is simply the number of raw scores allocated to group $g$ according to the parameter $y$. Step 2 of the Gibbs sampler in Algorithm 4.1 therefore becomes:

Algorithm 4.2 (Gibbs Sampling for the Normal Mixture)
2. For $t = 1, 2, \ldots, B+T$, construct $\Theta^{(t)}$ as follows:
$$\pi \mid \cdots \sim \mathrm{Di}\big(\eta_1 + n_1, \eta_2 + n_2, \cdots, \eta_g + n_g\big)$$
$$\mu_g \mid \cdots \sim N\big(V_g M_g, V_g\big)$$
$$\sigma_g^2 \mid \cdots \sim \mathrm{IG}\Bigg(\alpha_g + n_g/2,\ \Big[\beta_g^{-1} + \tfrac{1}{2}\sum_{x_i \in g}(x_i - \mu_g)^2\Big]^{-1}\Bigg)$$
where
$$V_g = \Bigg[\frac{1}{\delta_g^2} + \frac{n_g}{\sigma_g^2}\Bigg]^{-1}, \qquad M_g = \frac{\nu_g}{\delta_g^2} + \frac{n_g \bar{x}_g}{\sigma_g^2}, \qquad \bar{x}_g = \sum_{i : y_i = g} \frac{x_i}{n_g}.$$
The Gibbs sampling updates are performed in the order $\pi, \mu_g, \sigma_g^2$. Since we have chosen $\eta_1 = \eta_2 = \ldots = \eta_g = 2$, we may also write $\pi \mid \cdots \sim \mathrm{Di}(2 + n_1, 2 + n_2, \cdots, 2 + n_g)$.

Note that WinBUGS uses precision instead of variance to specify a normal distribution. We denote $\tau = 1/\sigma^2$, or $\sigma = 1/\sqrt{\tau}$. The model specification is shown in Appendix E. WinBUGS treats everything between the opening and closing brackets, {} or (), as a description of the model; the words MODEL, DATA and INITIAL VALUE are therefore not required, and we place them only as reminders explaining the information involved in the model. The model contains two chains: the first chain (denoted INITIAL VALUE1) is used to define the distribution of the raw scores, and the second chain (denoted INITIAL VALUE2) is specified to define the parameters involved in estimating those particular parameters.

WinBUGS requires that the Bayesian model be a directed graph. A directed graph consists of nodes connected by descending links. The directed graph for the model implementing the posterior distribution of Algorithm 4.2 is shown in Figure 4.1. Each parameter or variable in the model, including the hyperparameters of the normal distribution, is represented by a node in the graph. Each node has zero, one or more "parents" and zero, one or more "children". Constants are shown in rectangles; variable nodes, which depend functionally or stochastically on their parents, are shown in ovals. Note that the WinBUGS model specification language is used to specify the model, prior and likelihood. It is not a programming language: it does not specify a series of commands to be executed in sequence [Press, 2003]. The purpose of the WinBUGS model specification language is to "paint a word picture" of the directed graph.
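To make the conditional updates of Algorithm 4.2 concrete outside WinBUGS, the following Python sketch re-implements them for a generic raw-score vector. It is only an illustration, not the WinBUGS program of Appendix E: the function name, the numpy/scipy implementation and the default hyperparameter values (which follow the choices motivated in Section 4.5 below) are our own assumptions, and the label-switching issue of Section 4.6 is not handled here beyond the ordered initialisation.

import numpy as np
from scipy.stats import norm

def gibbs_normal_mixture(x, G=11, nu=None, delta2=400.0, alpha=3.0,
                         beta=1.0 / (2 * 29), eta=2.0, B=500, T=5000, seed=1):
    """Illustrative re-implementation of Algorithm 4.2 (assumes len(x) >= G).

    Priors: pi ~ Dirichlet(eta, ..., eta); mu_g ~ N(nu_g, delta2);
    sigma_g^2 ~ IG(alpha, beta) in the parameterisation used in the text,
    so the prior mean of sigma^2 is 1 / (beta * (alpha - 1)) = 29.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    if nu is None:
        nu = 9.0 * np.arange(1, G + 1)            # prior component means, roughly 9*g on [0, 100]
    # Crude initial values (Section 4.5.2): sort the scores into G equal groups.
    groups = np.array_split(np.sort(x), G)
    mu = np.array([g.mean() for g in groups])
    sig2 = np.full(G, np.mean([g.var() + 1e-6 for g in groups]))
    pi = np.full(G, 1.0 / G)
    keep_mu, keep_pi = [], []
    for t in range(B + T):
        # latent allocations: p(y_i = g | ...) proportional to pi_g * N(x_i | mu_g, sig2_g)
        dens = pi * norm.pdf(x[:, None], loc=mu, scale=np.sqrt(sig2)) + 1e-300
        probs = dens / dens.sum(axis=1, keepdims=True)
        y = np.array([rng.choice(G, p=p) for p in probs])
        ng = np.bincount(y, minlength=G)
        # mixture proportions: pi | ... ~ Dirichlet(eta + n_g)
        pi = rng.dirichlet(eta + ng)
        # component means: mu_g | ... ~ N(V_g * M_g, V_g)
        sum_x = np.bincount(y, weights=x, minlength=G)
        V = 1.0 / (1.0 / delta2 + ng / sig2)
        M = nu / delta2 + sum_x / sig2
        mu = rng.normal(V * M, np.sqrt(V))
        # component variances: sig2_g | ... ~ IG(alpha + n_g/2, [beta^-1 + 0.5*SS_g]^-1)
        ss = np.bincount(y, weights=(x - mu[y]) ** 2, minlength=G)
        shape = alpha + ng / 2.0
        rate = 1.0 / beta + 0.5 * ss               # rate of the gamma for the precision 1/sig2
        sig2 = 1.0 / rng.gamma(shape, 1.0 / rate)
        if t >= B:                                 # discard the burn-in draws
            keep_mu.append(mu.copy())
            keep_pi.append(pi.copy())
    return np.array(keep_mu), np.array(keep_pi)

The retained draws of mu and pi can then be summarised exactly as the WinBUGS node statistics are summarised in the tables of Section 4.7.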
Having specified the model as a full joint distribution on all quantities, whether parameters or observables, we wish to sample values of the unknown parameters from their conditional posterior distribution given those stochastic nodes that have been observed. The basic idea behind the Gibbs sampling algorithm is to successively sample from the conditional distribution of each node given all the others in the graph (these are known as full conditional distributions).

Figure 4.1: Graphical Model for Bayesian Grading (directed graph connecting the nodes m[g], v[g], alpha.a[g], alpha.b[g], beta.b[g], alpha.tau[g], phi[], P[1:11], G[i], mu[i], tau[i], sigma and y[i]).

4.5 Setting the Priors and Initial Values

4.5.1 Setting the Prior

In this section we explain how to set the prior parameters and their initial values. The prior parameters have to be set to something. They can be set to very uninformative values, but they still have to be set; whatever values we choose, someone can always ask "why these particular values? why not some other values?" The guideline is to choose priors that are uninformative and that make reasonable sense. In practice it does not matter exactly which values we give the prior parameters: as long as they are reasonably uninformative, the end result is the same, because the result is then driven by the data.

In this study we need two vectors that determine the prior for the component means: $\nu$ and $\delta^2$, where the $\nu_g$ are the prior means of the component means. Since the raw scores always lie in the interval [0, 100], we place the prior component means equidistantly on that interval; thus, for $G = 11$, $\nu_g \approx 9g$. The prior variance of the component means should be set to some high value, since the prior means $\nu_g$ are very uncertain relative to the true component means. We therefore set the prior standard deviation to 20, so the variance is 400. We could also set the variance to a higher value such as 500 or 600; the end result is the same.

For the component variances we follow the idea of Alex (2003): a priori, the standard deviation of each component is expected to be approximately 5. To be uninformative, the variance of this quantity is set to a relatively large value (here 4); again, if we do not like this number we may use another, and the answer will be essentially the same. Hogg and Craig (1978) remark that the terminology "mathematical expectation" has its origin in games of chance: the mathematical expectation of $u(X)$, where $X$ is a random variable, is defined whenever the corresponding integral (or sum) converges absolutely. In addition, we treat the component variance as having its own probability distribution. We use this to approximate the expected value of the component variances as follows. Setting, to be uninformative,
$$E[\sigma] = 5, \qquad \mathrm{Var}[\sigma] = 4,$$
and recalling from Hogg and Craig (1978) that $\mathrm{Var}[\sigma] = E[\sigma^2] - \{E[\sigma]\}^2$, we have
$$E[\sigma^2] = \mathrm{Var}[\sigma] + \{E[\sigma]\}^2 = 4 + 5^2 = 29,$$
which gives an expected value of 29 for the component variance. Now, referring to Section 3.3.8 in Chapter 3, we have shown that $\sigma^2 \sim \mathrm{IG}(\alpha, \beta)$. To make the prior distribution of the component variance reasonably vague, Carlin and Louis (2000), pg. 326, explain that we can set the prior mean and prior standard deviation both equal to $\mu$, where $\mu \equiv E[\sigma^2]$ and $\tau^2 \equiv \mathrm{Var}[\sigma^2]$, with $\tau = \mu$.
This means that
$$\alpha = \Big(\frac{\mu}{\tau}\Big)^2 + 2 = 1 + 2 = 3 \qquad \text{and} \qquad \beta = \frac{1}{\mu\big[(\mu/\tau)^2 + 1\big]} = \frac{1}{\mu[1+1]} = \frac{1}{2\mu}.$$
These parameters create a diffuse, uninformative inverse Gamma prior with mean $\mu$. For that reason we apply this idea in setting the prior of the component variances: for every letter grade component $g$ the variance follows an inverse Gamma distribution with parameters
$$\alpha = 3 \qquad \text{and} \qquad \beta = \frac{1}{2(29)} = 0.0172,$$
so that the standard deviation of the prior variance equals its mean, i.e. $E[\sigma^2] = 29$. In addition, note that an inverted Gamma density for the variance is equivalent to saying that $1/\sigma^2$ follows a Gamma distribution.

Finally, we need to set the prior for the component probabilities. We have chosen the component probability vector to be Dirichlet distributed with parameter $\eta$. To make all the components a priori equally likely, all the elements of $\eta$ must be equal and non-negative ($\eta \ge 0$). The elements of $\eta$ can be interpreted as the number of synthetic observations from each component of the mixture [Alex, 2003]. We allow for possibly two such synthetic observations from each component: the first might be a value in the lower tail and the second a value in the upper tail, the two borderline values for which we are uncertain which letter grade to assign. Therefore the elements of $\eta$ are chosen equal to 2, i.e. $\eta_1 = \eta_2 = \ldots = \eta_g = 2$.

Recall that the raw scores are drawn from the distribution in Eq.(3.9), where a Dirichlet process is adopted for the $\mu$'s and $\sigma$'s. The Dirichlet process specifies a baseline prior from which candidate values for $\mu_g$ and $\sigma_g$ are drawn. Suppose these parameters are unknown for each letter grade component, and that clustering in these values is expected. Then for similar letter grades of raw scores within a cluster, the same value $\theta$ of $\mu_g$ and $\sigma_g$ would be appropriate. Theoretically, the maximum number $G$ of clusters could be $n$ [Congdon, 2003]. In addition, the cluster indicator for raw score $i$ is chosen according to
$$G[i] \sim \mathrm{Categorical}(P).$$

4.5.2 Initial Values

The first step in the computational problem is to obtain crude estimates of the model parameters. Initial parameter estimates are easily obtained. In setting the initial parameter values $\Theta^{(0)}$, we first sort the data in ascending order and subdivide it into $G = 11$ groups of equal size [Raftery, 1996]. The lowest observations are in group one, the lowest observations not in group one are in group two, and so on. The initial estimate of $\mu_g$ is $\bar{x}_g$, the average of the observations in the $g$th group, for each $g = 1, 2, \ldots, G$, and the initial estimate of $\sigma_g^2$ is the average of the $G$ within-group sample variances $s_g^2$. Put simply, we set the initial values for the means and variances to the sample quantities of the corresponding groups, and we then crudely estimate $\nu$ and $\delta^2$ as the mean and variance of the $G$ estimated values. For the initial values of the component probabilities we set them all to the fair proportion $1/G$. Further explanation of setting initial values is given in Gelman et al. (1995), pg. 424-426.

4.6 Label Switching in MCMC

The Gibbs sampler described above works in general. In mixture models, however, there is also an issue called label switching in the MCMC output. This is mainly caused by the non-identifiability of the components under symmetric priors.
In other words, the problem is that it is impossible to identify which component of the mixture a draw was made from; as a result, the posterior densities for all components appear the same. If sampling takes place from an unconstrained prior with $G$ groups, then the parameter space has $G!$ subspaces corresponding to the different ways of labelling the states [Congdon, 2003]. In an MCMC run on an unconstrained prior there may be jumps between these subspaces. A basic solution to this issue is to impose identifiability constraints when they can be found [Alex, 2003]; a constraint may be imposed to ensure that components do not 'flip over' during estimation. For Bayesian mixtures, the invariance of the likelihood to permutations of the labels is not as easily solved as in the frequentist approach [Jasra, Holmes and Stephens, 2005]. Fortunately, in the grade assignment application one may specify that one mixture component is always greater than another: there is a very natural constraint, namely that the means are ordered, $\mu_1 < \mu_2 < \cdots < \mu_g$. Depending on the grading assignment problem, one sort of constraint may be more appropriate for a particular raw score data set, in which the mean of each subgroup is well identified; that is, the mean raw score of the E's is less than the mean raw score of the D's, the D+'s, and so on. See Appendix E for the marginal posterior density estimates of the means of the different letter grades. The symmetries in the posterior distribution are immediately seen, with the posterior means being the same for each component and the classification probabilities all being close to 0.09 initially. Note that the Gibbs sampler draws are post-processed to implement this constraint, that is, an ordering on one of the parameters.

4.7 Sampling Results

In this section we present two real-life sampling results, one observed from a small class and one from a large class of students. We have assumed that the final scores have been transformed to composite scores. In addition, we compare the letter grade assignment from GB (Bayesian grading) to the letter grades actually assigned by the instructors, so the reader can judge by visual inspection how well GB performs. In Chapter II we discussed the method of weighting raw scores that come from several sources (e.g. Test, Midterm Test, Project, Assignment, Studio or Lab Work), and we stressed that it is inappropriate to assign grades based on a combined score without transforming the scores into weighted composite scores: the combination of several raw scores would produce a contaminated normal score distribution. However, in this study the instructors actually assigned grades based on the combined raw score. For comparison purposes, we therefore also consider how well GB works on the combined score.

In this study we focus on the letter grade component means. The results will serve as a decision aid to the instructors in assigning letter grades and evaluating their students' performance in that particular semester. To check whether the sampling estimates converge to their expected values, we show a number of WinBUGS outputs in the subsequent sections.

4.7.1.1 Case 1: Small Class

The model and raw score data for Case 1 are given in Appendix E. We have a small class of 62 students who attended one course for a semester. The mean raw score is 75.9, the median is 74.5 and the standard deviation is 12.88.
Table 4.1 shows the WinBUGS output of the marginal moments and quantiles for the mean of each letter grade after sampling. The time for 150,000 samples was less than 50 s on a 3.0 GHz Pentium 4 computer. At least 500 burn-in updates followed by a further 75,500 updates gave the parameter estimates; in other words, we discard $\mu_g^{(t)}$ for all $t \le 500$, the burn-in period (or initial transient phase of the chain), when sampling the component means, and continue to 75,500 updates, which yields the optimal estimates of the letter grade means.

Table 4.1 reports the component mean of each letter grade in the small class of 62 students. 'node' is the column for the parameter we want to estimate, followed by columns for the mean and standard deviation of the corresponding node. Now focus on the 95% equal-tail credibility intervals between the 2.5% and 97.5% quantiles (refer to Chapter III, Section 3.4). We can see that the MC error for $\mu_1$ is far too large, meaning that for this particular class no student should be assigned grade E by the instructor. For $\mu_2$ (the mean for grade D), the lower bound of the interval with $\alpha = 0.05$ (i.e. $\alpha/2 = 0.025$) is 37.87. Therefore the instructor would decide to assign grade E if a student's raw score is less than 37. Similarly, grade D should be assigned for raw scores between 37 and 43, grade D+ for raw scores greater than 43 and less than 53, and so on. Furthermore, we obtain the probability that a raw score belongs to the corresponding grade: for example, a raw score of 96 will probably be assigned grade A with probability about 0.0743, a raw score of 70 will be assigned grade B- by the instructor with probability 0.1756, and so on, as shown in Table 4.1.

In addition, Table 4.2 gives the minimum and maximum score for each letter grade and the percentage of students receiving that grade. We see that 25.81 percent of the students are assigned grade B- and that more than half of the students are assigned the better, more meaningful grades. The letter grades assigned by the Straight Scale and Standard Deviation methods are shown in Table 4.3. These results show that the different methods assign different letter grades to the students. For example, the raw scores for grade B are between 74 and 82 under Bayesian grading, 70 to 74 under the Straight Scale, and 90 to 92 under grading by the Standard Deviation method. Moreover, the cumulative percentages show that more than 50% of the class is assigned grade B and above by the GB method, more than 60% by the Straight Scale, and more than 30% by the Standard Deviation method.

Table 4.1: Optimal Estimates of Component Means for Case 1
Node Mean Std.
Dev MC error 2.5% Median 97.5% Start Sample π1 π2 π3 π4 π5 π6 π7 π8 π9 π 10 π 11 0.0135 0.009429 2.57E-5 0.001647 0.01139 0.03707 501 150000 0.03374 0.01476 3.706E-5 0.01117 0.03166 0.06801 501 150000 0.03378 0.01476 4.052E-5 0.01118 0.03172 0.06816 501 150000 0.05401 0.01855 4.707E-5 0.02371 0.05197 0.09575 501 150000 0.05403 0.01846 4.61E-5 0.02393 0.05203 0.09541 501 150000 0.08111 0.02243 5.829E-5 0.04297 0.07916 0.13 501 150000 0.1756 0.03105 8.129E-5 0.1192 0.1741 0.2404 501 150000 0.1756 0.03114 7.68E-5 0.1189 0.1742 0.2407 501 150000 0.1893 0.03208 8.16E-5 0.1306 0.1879 0.256 501 150000 0.1149 0.02622 6.407E-5 0.06869 0.1132 0.1709 501 150000 0.07433 0.02148 5.709E-5 0.03788 0.07238 0.1213 501 150000 µ1 µ2 µ3 µ4 µ5 µ6 µ7 µ8 µ9 µ 10 µ 11 1.435E+6 3.2E+6 8863.0 -4.843E+6 1.43E+6 7.698E+6 501 150000 38.0 0.06298 1.609E-4 37.87 38.0 38.13 501 150000 45.0 0.05662 1.454E-4 44.89 45.0 45.11 501 150000 55.67 0.8745 0.005166 53.93 55.66 57.43 501 150000 60.0 0.02515 6.647E-5 59.95 60.0 60.05 501 150000 65.6 0.3317 9.094E-4 64.94 65.6 66.26 501 150000 69.5 0.1071 2.751E-4 69.29 69.5 69.71 501 150000 75.0 0.4676 0.001335 74.08 75.0 75.93 501 150000 84.0 0.5011 0.001446 83.01 84.0 84.99 501 150000 92.56 0.2583 6.781E-4 92.05 92.56 93.07 501 150000 95.33 0.1076 2.735E-4 95.12 95.33 95.55 501 150000 82 82 Table 4.2: Minimum and Maximum Score for Each Letter Grade, Percent of Students and Probability of Raw Score Receiving that Grade for GB: Case 1 Grade A AB+ B BC+ C CD+ D E GB From To 95 92 83 74 69 64 59 53 44 37 0 100 94 91 82 73 68 63 58 52 43 36 Number of Student 3 7 10 13 16 5 3 2 2 1 0 Percentage Cumulative Percentage Probability % % 4.84 11.29 16.13 20.97 25.81 8.06 4.84 3.23 3.23 1.61 0 4.84 16.13 32.26 53.23 79.03 87.1 91.94 95.16 98.39 100 100 0.0743 0.1149 0.1893 0.1756 0.1756 0.0811 0.054 0.054 0.0338 0.0337 0.0135 Table 4.3: Straight Scale and Standard Deviation Methods: Case 1 Letter Grades Score A AB+ B BC+ C CD+ D E 85-100 80-84 75-79 70-74 65-79 60-64 55-59 50-54 45-49 40-44 0-39 Standard Deviation Straight Scale Number of Cumulative Cumulative Number of Students Percentage Score Percentage Students % % 17 27.4 95.57-100.00 1 1.61 8 40.3 90.89-95.57 9 16.13 6 50.0 86.21-90.89 5 24.19 12 69.4 81.52-86.21 7 35.48 10 85.5 76.84-81.52 6 45.16 4 91.9 72.16-76.84 6 54.84 2 95.2 69.48-72.16 15 79.03 1 96.8 62.79-67.48 5 87.10 1 98.4 58.11-62.79 3 91.94 53.53-58.11 2 95.16 1 100.0 0.00-53.43 3 100.00 83 83 To examine the posterior density functions of the means see Figure 4.2. Figure 4.2, exhibits the sampling for mean distribution of grade B+. These types of plots are called the smothered Kernel-Density estimate for the component means. The smoother the curves pictured the better posterior distribution sampling plot for component means. The optimal posterior distribution for the remainder grades are enclosed in Appendix E. 4.7.1.2 Convergence Diagnostics In previous section we have mentioned that the sampling acquired 75,500 updates to converge an optimal solution. The issue is in which updates the solutions converge to the optimal? A Markov chain that approaches its stationary distribution slowly or exhibits high autocorrelation can produce an inaccurate picture of the posterior distribution. In WinBUGS there are three simple convergence diagnostics; autocorrelation functions, Gelman-Rubin and traces diagnostics. We would tests these tools in Sampling Monitor Tool (see Appendix E). In this small class example, now consider the convergence diagnostic of the mean grade D and B+. 
First, consider the trace diagnostics, also called the time series trace: a plot of the random variable(s) being generated against the number of iterations. We find that the multiple chains cover the same range and do not show any trend or long cycle. Figure 4.3 (a) and (b) demonstrate the stable posteriors for the mean of grade D ($\mu_2$) and grade B+ ($\mu_9$); we conclude that convergence is achieved very quickly. Next, Figure 4.4 (a) and (b) show the cumulative graphs of the Gelman-Rubin convergence statistic. The green trace shows the width of the central 80% interval between runs, the blue trace is the average width of the 80% interval within the individual runs, and we see that the ratio (the red trace) of the between and within chain widths rapidly approaches 1, i.e. $R = (\text{between}/\text{within}) \to 1$; this indicates convergence. We then examine the quantiles in a cumulative graph for the same grades: Figure 4.5 (a) and (b) show convergence, as the quantiles of the parallel chains rapidly coincide. Finally, we look at the autocorrelation functions of grades D and B+ (see Figure 4.6). Autocorrelation is not directly a convergence diagnostic, but a long-tailed autocorrelation graph suggests that the model is ill-conditioned and that the chain will converge slowly. In this case no long tails appear in the autocorrelation plots; by the same reasoning, the model is well conditioned and the chain converges rapidly.

Figure 4.2: Kernel-Density Plots of the Posterior Marginal Distribution of the Mean for Grade B+ (samples of 1,000; 10,000; 50,000; 100,000 and 150,000 draws).

Figure 4.3: Monitoring Plots for Trace Diagnostics of the Mean: (a) Grade D and (b) Grade B+.

As noted above, convergence is not an issue for GB, since all the convergence diagnostics show that the estimates of the component means converge very quickly. We can stop the iteration at any update that has converged to its optimal value, at which point the estimates satisfy the convergence diagnostics and the kernel density displays a smooth plot; once the updates present an optimal estimate, the following updates converge to the same value. See Appendix E for the convergence diagnostics of all letter grades.
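As a complement to the WinBUGS autocorrelation plots, the sketch below shows how the sample autocorrelation of a single chain of draws (the quantity plotted in Figure 4.6) can be computed directly in Python; the function name and the example call on hypothetical draws mu_draws are our own illustrative choices.

import numpy as np

def autocorrelation(chain, max_lag=40):
    """Sample autocorrelation of one MCMC chain at lags 0..max_lag."""
    chain = np.asarray(chain, dtype=float)
    chain = chain - chain.mean()
    n = len(chain)
    var = np.dot(chain, chain) / n
    acf = [np.dot(chain[:n - k], chain[k:]) / (n * var) for k in range(max_lag + 1)]
    return np.array(acf)

# A chain whose autocorrelation dies off quickly (a short-tailed ACF) is the
# behaviour described above for the grade means, e.g.:
# acf = autocorrelation(mu_draws[:, 1])   # hypothetical draws of mu_2 from the Gibbs sampler
# print(acf[:5])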
Figure 4.4: Gelman-Rubin Convergence Diagnostics of the Mean; (a) Grade D and (b) Grade B+.

Figure 4.5: Quantiles Diagnostics of the Mean; (a) Grade D and (b) Grade B+.

Figure 4.6: Autocorrelation Diagnostics of the Mean; (a) Grade D and (b) Grade B+.

4.7.2 Case 2: Large Class

The model and raw score data for Case 2 are given in Appendix E. We now consider a class of 498 students who attended one course for a semester. The mean is 71.53, the median is 73 and the standard deviation is 12.58. Table 4.4 shows the WinBUGS output of the marginal moments and quantiles for the mean of each letter grade after sampling. At least 500 burn-in updates followed by a further 75,500 updates gave the parameter estimates; in other words, we discard $\mu_g^{(t)}$ for all $t \le 500$, the burn-in period (or initial transient phase of the chain), and the sampling then continues to 75,500 updates, which yields the optimal estimates of the letter grade means for the large class. The time for 150,000 samples was less than 3 minutes.

Table 4.4 shows the optimal estimates of the component means and component probabilities for each letter grade. From Table 4.4 the instructor should assign grade A for raw scores between 91 and 100, grade A- for raw scores between 84 and 90, and so on. The corresponding grade intervals are decided from the credibility interval between the 2.5% and 97.5% quantiles, with $\alpha = 0.05$ (i.e. $\alpha/2 = 0.025$). As in Case 1, Table 4.6 shows the letter grades along with their score ranges for the Straight Scale and Standard Deviation methods. In addition, the probability that a student in this course will fail is about 0.03 if the raw score is less than 33. Unlike Case 1, there are students assigned to grade E in this class; for example, raw scores of 32 and 25 would both probably be assigned grade E in this course. Most of the students in this class are again expected to receive grade B-, as in the previous case; however, the raw scores for B- in this class are between 71 and 75 with probability 0.2593, while for Case 1 they are between 69 and 73 with probability 0.1756. This is not surprising, since we are examining the grades of a different class and a different course whose content certainly differs. Clearly, we have shown that a different course, a different class or a different number of students will have an impact on the instructor's grading plan.

Table 4.4: Optimal Estimates of Component Means for Case 2
Node Mean Std.
Dev MC error 2.5% Median 97.5% Start Sample π1 π2 π3 π4 π5 π6 π7 π8 π9 π 10 π 11 0.03145 0.00547 1.44E-5 0.02163 0.03115 0.043 501 150000 0.03927 0.006077 1.514E-5 0.02825 0.03896 0.05204 501 150000 0.0334 0.00563 1.509E-5 0.02323 0.03309 0.04527 501 150000 0.04322 0.006361 1.651E-5 0.03159 0.04292 0.05644 501 150000 0.05497 0.007151 1.911E-5 0.0418 0.05468 0.06973 501 150000 0.09038 0.009001 2.303E-5 0.07356 0.09012 0.1088 501 150000 0.2593 0.01372 3.658E-5 0.2327 0.2591 0.2866 501 150000 0.1945 0.01238 3.078E-5 0.1708 0.1943 0.2193 501 150000 0.1297 0.01053 2.718E-5 0.1098 0.1294 0.151 501 150000 0.08255 0.008611 2.247E-5 0.06646 0.08226 0.1002 501 150000 0.04125 0.006233 1.621E-5 0.02987 0.04096 0.05433 501 150000 µ1 µ2 µ3 µ4 µ5 µ6 µ7 µ8 µ9 µ 10 µ 11 33.73 0.5143 0.001466 32.72 33.73 34.74 501 150000 43.37 0.374 9.542E-4 42.63 43.37 44.11 501 150000 51.75 0.2213 5.475E-4 51.31 51.75 52.19 501 150000 59.29 0.2298 5.945E-4 58.83 59.29 59.74 501 150000 64.04 0.1606 4.145E-4 63.72 64.04 64.35 501 150000 67.44 0.1117 3.01E-4 67.23 67.44 67.66 501 150000 71.89 0.07132 1.775E-4 71.75 71.89 72.03 501 150000 76.48 0.08646 2.315E-4 76.31 76.48 76.65 501 150000 80.54 0.0997 2.724E-4 80.34 80.54 80.73 501 150000 84.32 0.151 3.633E-4 84.02 84.32 84.61 501 150000 92.55 0.5138 0.00135 91.54 92.55 93.56 501 150000 90 90 Table 4.5: Minimum and Maximum Score for Each Letter Grade, Percent of Students and Probability of Raw Score Receiving that Grade for GB: Case 2 Grade A AB+ B BC+ C CD+ D E GB From To 91 84 80 76 71 67 63 58 51 42 0 100 90 83 89 75 70 66 62 57 50 41 Number of Student 13 32 64 84 143 53 32 23 16 18 20 Percentage Cumulative Percentage Probability % % 2.6 6.4 12.9 16.9 28.7 10.6 6.4 4.6 3.2 3.6 4.0 2.6 9.0 21.9 38.8 67.5 78.1 84.5 89.2 92.4 96.0 100.0 0.04125 0.08255 0.1297 0.1945 0.2593 0.09038 0.05497 0.04322 0.0334 0.03927 0.03145 Table 4.6: Straight Scale and Standard Deviation Methods: Case 2 Letter Grades Score A AB+ B BC+ C CD+ D E 85-100 80-84 75-79 70-74 65-79 60-64 55-59 50-54 45-49 40-44 0-39 Straight Scale Number of Cumulative Students Percentage % 34 6.8 75 21.9 115 45.0 131 71.3 60 83.3 24 88.2 9 90.0 16 93.2 6 94.4 12 97.0 15 100.0 Standard Deviation Score Number of Students 93.46-100 89.01-93.46 84.44-89.01 79.86-84.44 75.29-79.86 70.71-75.29 66.14-70.71 61.56-66.14 56.99-61.56 52.54-56.99 0-52.54 9 6 19 60 99 112 84 32 23 9 45 Cumulative Percentage % 1.81 3.01 6.83 18.88 38.76 61.24 78.11 84.54 89.16 90.96 100.00 91 91 Now, we compare Table 4.2 and Table 4.5 to the grades assigned by instructor when they applying the Straight Scale and Standard Deviation method as shown in Table 4.3 and Table 4.6. The results indicates that the grading plan via GB, Straight Scale and Standard Deviation method vary to the grades interval and to the number of student getting the respective grade. Before we diagnose the convergence of the estimates values, we examine the posterior density functions of the means. See Figure 4.7. Figure 4.7 exhibits the sampling for mean distribution of grade B. It shows the smoother curves which gives better posterior distribution sampling plot for grade B. The optimal posterior distribution for the remainder grades of Case 2 enclosed in Appendix E. 
Figure 4.7: Kernel-Density Plots of the Posterior Marginal Distribution of the Mean for Grade B (samples of 1,000; 10,000; 50,000; 100,000 and 150,000 draws).

4.7.2.2 Convergence Diagnostics

As in Section 4.7.1.2, we now show the convergence diagnostics for the estimates of the component means in Case 2. The time for 150,000 samples was less than 2.5 minutes on a 3.0 GHz Pentium 4 computer. In this case, consider the convergence diagnostics of the means for grade A and grade B. The trace diagnostic plots show that the multiple chains do not display any trend or long cycle and that they cover the same range. Figure 4.8 (a) and (b) demonstrate that the plotted chains appear reasonably stable for the posterior mean of grade B ($\mu_8$) and grade A ($\mu_{11}$). We stop the sampler at this point, concluding that an acceptable degree of convergence has been obtained. Next, Figure 4.9 (a) and (b) display the cumulative graphs of the Gelman-Rubin convergence statistic: the ratio (the red trace) of the pooled (green trace) and within (blue trace) chain widths rapidly approaches 1, i.e. $R = (\text{pooled}/\text{within}) \to 1$, indicating convergence. The quantiles in the cumulative graphs for the same grades are shown in Figure 4.10 (a) and (b); the quantiles of the parallel chains rapidly coincide, which implies convergence. Finally, we look at the autocorrelation functions of grade A and grade B (see Figure 4.11). In this case no long tails are visible in the autocorrelation plots and the observed autocorrelations are low; by the same reasoning as before, the model is well conditioned and the chain converges rapidly.

Figure 4.8: Monitoring Plots for Trace Diagnostics of the Mean: (a) Grade B and (b) Grade A.

Figure 4.9: Gelman-Rubin Convergence Diagnostics of the Mean; (a) Grade B and (b) Grade A.

Figure 4.10: Quantiles Diagnostics of the Mean; (a) Grade B and (b) Grade A.

Figure 4.11: Autocorrelation Diagnostics of the Mean; (a) Grade B and (b) Grade A.

4.8 Discussion

In the first part of Case 1 and Case 2 we set $T = 75{,}500$ and at least $B = 500$. The iterations in the burn-in period are eliminated to reduce the effect of the starting distribution. More generally, we discard the first half of each sequence and focus attention on the second half of the iterations [Casella and George (1992)]. The Gibbs sampler generates a Markov chain of random variables which converges to the distribution of interest (the target distribution).
We assume that the posterior distributions of the simulated values $\theta^t$, for large enough $t$, are close to the target distribution $p(\theta \mid x)$. Another issue that sometimes arises is whether, once approximate convergence has been reached, one should use only every $t$th simulation draw, for some value such as 1,000 or 50,000, in order to have approximately independent draws from the target distribution. In Case 1 and Case 2 we did not find it useful to skip iterations, except when computer storage is a problem or the speed is too low. If the "effective" number of simulations is lower than the actual number of draws, the inefficiency is automatically reflected in the posterior intervals obtained from the simulation quantiles; see Table 4.1 and Table 4.4.

For each estimated component mean $\hat{\mu}_g$, we label the retained sequences of length $n$ as $\hat{\mu}_{ig}$ ($i = 1, 2, \ldots, n$; $g = 1, 2, \ldots, G$) and compute $B$ and $W$, the between- and within-sequence variances (see the Convergence Diagnostics sections), as follows:
$$B = \frac{n}{G-1}\sum_{g=1}^{G}\big(\bar{\mu}_{\cdot g} - \bar{\mu}_{\cdot\cdot}\big)^2 \qquad \text{and} \qquad W = \frac{1}{G}\sum_{g=1}^{G} s_g^2,$$
where
$$s_g^2 = \frac{1}{n-1}\sum_{i=1}^{n}\big(\mu_{ig} - \bar{\mu}_{\cdot g}\big)^2, \qquad \bar{\mu}_{\cdot g} = \frac{1}{n}\sum_{i=1}^{n}\mu_{ig}, \qquad \bar{\mu}_{\cdot\cdot} = \frac{1}{G}\sum_{g=1}^{G}\bar{\mu}_{\cdot g}$$
[Gelman et al., 1995]. The between-sequence variance $B$ contains a factor of $n$ because it is based on the variance of the within-sequence means $\bar{\mu}_{\cdot g}$, each of which is an average of $n$ values $\mu_{ig}$. If only one sequence is simulated, $B$ cannot be calculated. We therefore estimate the posterior variance of $\mu_g$ by a weighted average of $W$ and $B$, namely
$$\widehat{\mathrm{var}}^{+}(\mu \mid x) = \frac{n-1}{n}W + \frac{1}{n}B,$$
which overestimates the posterior variance when the starting distribution is appropriately overdispersed, but is unbiased under stationarity, that is, when the starting distribution equals the target distribution. Meanwhile, for any finite $n$ the "within" variance $W$ should be an underestimate of $\mathrm{var}(\mu \mid x)$; in the limit $T \to \infty$, the expectation of $W$ approaches $\mathrm{var}(\mu \mid x)$.

We monitor convergence of the iterative simulation by estimating the factor by which the scale of the current distribution of $\mu$ might be reduced if the simulations were continued to the limit $T \to \infty$. This potential scale reduction is estimated by
$$\hat{R} = \frac{\widehat{\mathrm{var}}^{+}(\mu \mid x)}{W} = \frac{\frac{n-1}{n}W + \frac{1}{n}B}{W},$$
which tends to 1 as $T \to \infty$. In practice, convergence is considered achieved when $\hat{R} < 1.2$ [Cornebise et al., 2005]. In a multiple-parameter case this diagnosis must be carried out for each parameter separately, convergence being attained when all parameters have converged to their target distribution.

For example, consider the intervals for the grades of Case 2. Table 4.7 shows the posterior quantiles from the second halves of the Gibbs sampler sequences. In this case 75,500 iterations were sufficient for approximate convergence: $\hat{R} < 1.1$ for all parameters.

Table 4.7: Posterior 95% Credible Intervals of the Component Means and the Ratio R-hat

Node   Mean    Std. dev.  2.5%    Median  97.5%   R-hat
mu1    33.73   0.5143     32.72   33.73   34.74   1.000062
mu2    43.37   0.374      42.63   43.37   44.11   1.000035
mu3    51.75   0.2213     51.31   51.75   52.19   1.000017
mu4    59.29   0.2298     58.83   59.29   59.74   1.000016
mu5    64.04   0.1606     63.72   64.04   64.35   1.00001
mu6    67.44   0.1117     67.23   67.44   67.66   1.000006
mu7    71.89   0.07132    71.75   71.89   72.03   1.000004
mu8    76.48   0.08646    76.31   76.48   76.65   1.000004
mu9    80.54   0.0997     80.34   80.54   80.73   1.000005
mu10   84.32   0.151      84.02   84.32   84.61   1.000007
mu11   92.55   0.5138     91.54   92.55   93.56   1.000022

If $\hat{R}$ is not near 1 for all of the estimates, then we need to continue the simulation updates.
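The following Python sketch computes the potential scale reduction exactly as written above (without the square root sometimes used elsewhere), for one scalar parameter monitored over G parallel sequences of n retained draws. The function name and array layout are our own illustrative choices.

import numpy as np

def potential_scale_reduction(chains):
    """Potential scale reduction R-hat for one scalar parameter.

    `chains` is an array of shape (G, n): G parallel sequences of n retained
    draws, e.g. the two WinBUGS chains for one component mean.
    """
    chains = np.asarray(chains, dtype=float)
    G, n = chains.shape
    seq_means = chains.mean(axis=1)                            # within-sequence means
    grand_mean = seq_means.mean()                              # overall mean
    B = n / (G - 1) * np.sum((seq_means - grand_mean) ** 2)    # between-sequence variance
    W = chains.var(axis=1, ddof=1).mean()                      # within-sequence variance
    var_plus = (n - 1) / n * W + B / n                         # weighted average var^+
    return var_plus / W                                        # close to 1 at convergence

# e.g. two chains of retained draws for mu_2, stacked row-wise:
# print(potential_scale_reduction(np.vstack([chain1_mu2, chain2_mu2])))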
Once $\hat{R}$ is near 1 for all parameters of interest, we simply pool the $G \times n$ samples from the second halves of the sequences and treat them as samples from the target distribution. Therefore, for Case 1 we are permitted to stop the iterations at $11 \times 62 = 682$ iterations, and for Case 2 at $11 \times 498 = 5478$ iterations, at which point the sampling results are approximately the same as the optimal solution. In the above cases, however, we aimed to obtain the smoothest density plots and therefore ran the chains longer before concluding that convergence was sufficient for the optimal estimates. If we instead take the ratio as the convergence diagnostic, then the sampling takes less than 5 s and 15 s respectively for the two cases.

Estimates of functions of the parameters are easily obtained. Suppose we seek an estimate of the distribution of $\gamma = \sigma/\mu$, the coefficient of variation. We simply define the transformed Monte Carlo samples $\gamma_i = \sigma_i/\mu_i$ for $i = 1, 2, \ldots, n$, and create kernel density estimates, as shown in Figure 4.2 and Figure 4.7, based on these values.

The method of monitoring convergence presented here has the key advantage of not requiring the user to examine time series graphs (called "trace" graphs in this study) of the simulated sequences. Inspection of such plots is a notoriously unreliable way of assessing convergence and is unwieldy when monitoring a large number of quantities of interest, as can arise in complicated hierarchical models; this is because the method here is based only on means and variances. The method is also most effective for quantities whose posterior distribution is approximately normal.

Figure 4.12 shows the plots of the grade cumulative distribution functions for Case 1 (a) and Case 2 (b): the dotted lines represent the cumulative distributions of the Straight Scale and Standard Deviation methods, and the smooth line is for the grades according to GB grading. Figure 4.13 and Figure 4.14 show the density plots for each letter grade along with the histogram for Case 1 and Case 2.

Figure 4.12: Cumulative Distribution Plots for the Straight Scale (dotted line) and GB Method; (a) Case 1 and (b) Case 2.
Figure 4.13: Density Plots with Histogram for Case 1.
Figure 4.14: Density Plots with Histogram for Case 2.

4.9 Loss Function and Leniency Factor

Making a decision in assigning letter grades to students incurs a 'loss' when the accurate letter grade is missed. The loss function describes the 'loss' that the instructor experiences if they assign a certain letter grade while the student deserves another letter grade [Alex, 2003]. For that reason, we need to minimize the expected loss in order to obtain the optimal letter grade, based on the probability distribution of the letter grades described in Chapter III and in the sections above. The loss function is the objective function generally used in Bayesian statistical analysis; it must be nonnegative [Berger, 1985; Press, 2003; Hogg et al. 2005]. Carlin and Louis (2000) present specific loss function forms corresponding to point estimation, interval estimation and hypothesis testing. The notation used is:

prior distribution: $P(\theta)$, $\theta \in \Theta$
sampling distribution: $p(x \mid \theta)$
allowable actions: $a \in A$
decision rule: $d \in D : X \to A$
loss function: $L(\theta, a)$

In estimation problems, the action to be taken is the choice of an estimator $\hat{\theta}$, so that the action is $a = \hat{\theta}$.
The loss function $L(\theta, a)$ computes the loss incurred when the true state of nature is $\theta$ and we take action $a$. The most often used loss functions are of quadratic form, referred to either as squared error loss (SEL) or weighted squared error loss (WSEL). Generally, we denote the quadratic loss function as
$$L(\theta, \hat{\theta}) = c\,(\theta - \hat{\theta})^2,$$
where $c$ is a constant. The quadratic loss function is symmetric, meaning that underestimates of $\theta$ are equally as consequential as overestimates. The Bayes estimator with respect to a quadratic loss function is the mean of the posterior distribution [Press, 2003].

In this study we assume that the instructor feels a different loss for overestimating and for underestimating the letter grade; it is therefore inappropriate to use a symmetric loss function. This is because the instructor might feel worse about assigning a grade that is too low than about assigning one that is too high, since a grade that is too low will be received adversely by the student. Suppose there is a raw score data set $X = (x_1, x_2, \ldots, x_n)$ and we wish to specify a Bayesian estimator $\hat{\theta}(X) \equiv \hat{\theta}$ depending on $X$. The Bayesian decision maker should minimize the expected loss with respect to the decision maker's posterior distribution. Since we do not consider the symmetric loss function, there is an alternative, the asymmetric loss function, which is piecewise linear:
$$L(\theta, \hat{\theta}) = \begin{cases} k_1\,(\hat{\theta} - \theta), & \theta - \hat{\theta} < 0, \\ k_2\,(\theta - \hat{\theta}), & \theta - \hat{\theta} \ge 0. \end{cases}$$
The constants $k_1 \ne k_2$ can be chosen to reflect the relative importance of underestimation and overestimation, and will usually differ. This form of asymmetric loss is also called the piecewise linear loss function, and the Bayes estimator is the $\dfrac{k_2}{k_1 + k_2}$ quantile of the posterior [Press, 2003]. In the special case where $k_1 = k_2 = k$, the Bayes estimator becomes the median of the posterior distribution; in this case the loss function is called the absolute error loss and equals $L(\theta, \hat{\theta}) = k\,|\theta - \hat{\theta}|$. These types of loss functions are quite often a useful approximation to the true loss.

Referring to the asymmetric loss and the absolute loss, we have decided to design the loss function for assigning letter grades as follows:
$$C(y_i, \hat{y}_i) = \begin{cases} c\,|\hat{y}_i - y_i|, & \hat{y}_i \le y_i, \\ |\hat{y}_i - y_i|, & \hat{y}_i > y_i, \end{cases} \qquad (4.3)$$
where $y$ is the numeric equivalent of the letter grade that the student truly deserves, $\hat{y}$ is the numeric equivalent of the letter grade that the instructor assigned, and $c$ is a positive constant reflecting the instructor's preference. This signifies that when $c = 1$ the instructor feels equally badly about underestimating and overestimating the grade; when $c > 1$ the instructor feels worse about underestimating than about overestimating; and conversely, when $0 < c < 1$ the instructor feels worse about overestimating than about underestimating.

Now let the Bayes estimator be the $q$th quantile with $q = \dfrac{c}{c+1}$; this is the optimal letter grade under this loss function. Since the distribution in Eq.(3.9) is not continuous, we choose the highest letter grade whose cumulative probability is less than $q$. If $q = 0.5$, the loss is symmetric and the optimal letter grade is the median [Alex, 2003]. For $q > 0.5$, the instructor feels a greater loss from underestimating, and therefore bumps the grade up.
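To illustrate how the $q$-quantile rule above turns the posterior component probabilities into a letter grade, the short Python sketch below applies it to the Case 2 probabilities that appear later in Table 4.9. The grade ordering, the function name and the default argument are our own illustrative choices, not part of the thesis's WinBUGS program.

# Highest letter grade whose cumulative probability stays below q = c/(c+1).
GRADES = ["E", "D", "D+", "C-", "C", "C+", "B-", "B", "B+", "A-", "A"]

def optimal_grade(probs, c=1.0):
    """Optimal letter grade under the loss of Eq.(4.3), given component probabilities."""
    q = c / (c + 1.0)                      # leniency factor (see Table 4.8)
    cumulative, best = 0.0, 0
    for g, p in enumerate(probs):
        cumulative += p                    # cumulative probability up to and including grade g
        if cumulative < q:
            best = g                       # grade g is still admissible under the rule
    return GRADES[best]

# Case 2 component probabilities from Table 4.9, ordered E ... A:
probs = [0.03145, 0.03927, 0.0334, 0.04322, 0.05497, 0.09038,
         0.2593, 0.1945, 0.1297, 0.08255, 0.04125]
print(optimal_grade(probs, c=1.0))   # neutral instructor, q = 0.5  -> C+
print(optimal_grade(probs, c=1.5))   # lenient instructor, q = 0.6  -> B-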
A value of $q > 0.5$ can be interpreted as a higher factor reflecting the instructor's decision. Alex (2003) calls this factor the Leniency Factor (LF); it indicates how lenient or strict the instructor is in assigning grades to the students. The LF can be classified as follows: if $0.0 \le q < 0.5$ the instructor is strict, and if $0.5 < q \le 1.0$ the instructor is lenient. In addition, Alex (2003) defines the instructor with $\mathrm{LF} = 0.5$ as a neutral instructor. The LFs of instructors are summarized in Table 4.8.

Table 4.8: Leniency Factor and Loss Function Constant

Leniency Factor   q = c/(c+1)         Loss weighted towards                      Constant c
Lenient           0.5 < q <= 1.0      Underestimate                              c > 1
Neutral           q = 0.5             Equally underestimate and overestimate     c = 1
Strict            0.0 <= q < 0.5      Overestimate                               0 < c < 1

How is the LF computed for an instructor? The LF is based on how the instructor feels about overestimating and underestimating the letter grade. In the LF formula, the constant $c$ in the loss function is the loss the instructor attaches to underestimating relative to the loss they attach to overestimating. For example, if the grader cares about underestimating the letter grade one and a half times as much as they care about overestimating it, then the LF is $q = 1.5/(1.5+1) = 0.6$: the instructor is lenient in assigning grades to the students.

Example 4.1: Consider again Case 2. Table 4.9 contains the probability and cumulative probability that a raw score belongs to each letter grade. We know that the probability vector is Dirichlet distributed, and since the grade distribution is not continuous, we choose the highest letter grade whose cumulative probability is less than $q$ [Press, 2003]. Now, if $q = 0.5$ ($\therefore c = 1$), that is, the instructor chooses to be in neutral mode, then the optimal letter grade is C+. On the other hand, if the instructor is lenient, with $q > 0.5$ ($q = 0.55$, $\therefore c = 1.22$), the optimal grade is B-. Assigning A- would involve a very high LF of 0.96 or above: $q = 0.96$ and $\therefore c = 24$.

Table 4.9: Cumulative Probability for GB; Case 2

Grade   From   To     Probability   Cumulative Probability
E       0      41     0.03145       0.03145
D       42     50     0.03927       0.07072
D+      51     57     0.0334        0.10412
C-      58     62     0.04322       0.14734
C       63     66     0.05497       0.20231
C+      67     70     0.09038       0.29269
B-      71     75     0.2593        0.55199
B       76     79     0.1945        0.74649
B+      80     83     0.1297        0.87619
A-      84     90     0.08255       0.95874
A       91     100    0.04125       0.99999

4.10 Performance Measures

In measuring performance, there are two measures for determining how well the grading methods perform. First, we can refer to the average of the loss defined in Eq.(4.3). We introduce the class loss (CC) as follows:
$$CC = \frac{1}{n}\sum_{i=1}^{n} C_i, \qquad (4.4)$$
where $n$ is the number of students in the class. We take the grades assigned by the instructor in the class as the "true" grades; a lower CC then means that a grading method assigns grades closer to those actually assigned by the instructor. For Case 1, see Appendix E, Tables E1, E2, E3, E4 and E5. We have computed CC as shown in the first two columns of Table 4.10. Note that in this example we compare the Straight Scale and GB assignments against the grades assigned by the instructor in the class.

Another method of evaluating the performance of a grading plan is the raw coefficient of determination. Before turning to this coefficient, first consider the coefficient of correlation, $r$, which is a statistical measure of how closely the data fit a line.
That is, it measures the strength and direction of the relationship between two variables. The range of the correlation coefficient is from -1 to +1. If there is a strong positive linear relationship between the variables, the value of $r$ will be close to +1; if there is a strong negative linear relationship, the value of $r$ will be close to -1. When there is no linear relationship between the variables, or only a weak one, the value of $r$ will be close to 0. The raw coefficient of determination is given by
$$R_r^2 = 1 - \frac{\sum_{i=1}^{n} e_i^2}{\sum_{i=1}^{n} y_i^2},$$
where $e_i = y_i - \hat{y}_i$ and $0 \le R_r^2 \le 1$. The raw coefficient of determination is a measure of the variation in the dependent variable that is explained by the regression line and the independent variable. Its value is usually expressed as a percentage, so it ranges from 0% to 100%: the closer the value is to 100%, the better the model represents the data. Here, a higher value indicates that a grading method gets closer to the grades actually assigned. For the raw coefficients of determination in Case 1, see Table 4.10; these values are computed in Appendix E (Table E1-continued, Table E3-continued and Table E5-continued).

Table 4.10: Performance of the GB, Straight Scale and Standard Deviation Methods: Case 1

Method               Neutral CC   Lenient CC   R_r^2 (%)
Straight Scale       0.7903       1.2677       98.98
Standard Deviation   1.4839       1.4839       93.71
GB                   0.1935       0.3097       99.66

From Table 4.10 we see that $R_r^2$ for GB is higher than for the Straight Scale and Standard Deviation methods. Therefore the GB method comes closer to the grades actually assigned by the instructor than the Straight Scale and Standard Deviation methods do. The difference between GB and the Straight Scale is small, since the gap in percentage is low, but GB and the Standard Deviation method differ clearly because of the large gap in their $R_r^2$ values. This conclusion is supported by the CC values: for both the lenient and the neutral settings, the CC of GB is lower than those of the Straight Scale and Standard Deviation methods.

CHAPTER 5

CONCLUSION AND SUGGESTION

5.1 Conclusion

This study describes a personal grading plan via a statistical approach. We have designed a statistical model to deal with the grading philosophy. Grades are the instructors' evaluations of the academic work and performance of their students, whether of the work they complete or of their performance in a laboratory, on stage or in a studio. Discussions of educational standards sometimes refer to instructors' grading plans, and at other times to the expectations instructors communicate to students. Since philosophies and instruction change as the curriculum changes, instructors need to be prepared to adjust their grading plans accordingly. Furthermore, instructors should check what the grade distributions in their department have been like at their course level. Normally, the written university policy is the norm against which the reasonableness of each instructor's grades will be judged.

Now consider grading on the curve, namely the Standard Deviation method, and the conditional Bayesian method. The Standard Deviation method takes into account the difficulty level of the examination, and the cutoff points are not tied to random error. When the instructor has some notion of what the grade distribution should be like, some trial and error might be needed to decide how many standard deviations each grade cutoff should be from the composite average.
In other words, the mean and the gaps are decided somewhat arbitrarily. When grading by standard deviations is desired, this is the most attractive approach, despite its computational requirements. However, a curve grades the students of a single class and is meaningless unless it is provided in relation to the group of students being scored against. We also do not recommend this method for classes with a non-normal score distribution.

The conditional Bayesian method is inspired by an existing method called the Distribution Gap method. This method allows students to be screened according to their performance relative to their peers and is useful in competitive circumstances where the feedback allows the students to compare their performance with that of their peers. Moreover, it requires no fixed percentages in advance. Essentially this method removes the subjectivity from the Distribution Gap method, making it more widely applicable. Conditional Bayesian grading reflects the common belief that a class is composed of several subgroups, each of which should be assigned a different grade. In this study we have shown that conditional Bayesian grading successfully separates the letter grades. In applying the conditional Bayesian method, the instructor needs to determine their own Leniency Factor; this is an intuitive measure that reflects how lenient the instructor wants to be in assigning letter grades. If the instructor is lenient, then the suggested leniency factor is around 0.6.

In this study we carried out a couple of experiments in which an experienced instructor assigned letter grades by judgmental grading. We then used the conditional Bayesian method, together with the Standard Deviation and Straight Scale methods, to assign letter grades based on the raw scores. This study provides evidence that Bayesian grading comes very close to what the instructor actually assigned. The students also benefit academically, and the instructors improve the quality of their grade assignment. Another advantage of conditional Bayesian grading is that an instructor using this method does not have to be experienced and does not have to spend a great deal of time going through all the raw scores, since all the work of assigning grades is done by the computer. The quality of this grading method is as good as the judgmental grading of an experienced instructor. Conditional Bayesian grading is easy to apply, as all the work is done by the computer. The difficulty in applying this method is that students may not easily understand it, since they are used to the Straight Scale method; students of statistics, measurement and evaluation classes may understand it.

5.2 Suggestions

In this study we give a detailed account of the development of grading methods and of conditional Bayesian grading through mixture modeling. In the conditional Bayesian method, a Gibbs sampler is often not worth programming from scratch (unless it can be quickly implemented, for example in WinBUGS), since the chance of it failing to converge is too high; the challenge ahead is the assurance of its convergence. A drawback of using a Bayesian mixture model is the difficulty of simulating the high-dimensional, variable-dimensional target measure which is characteristic of such problems (for example, a Bayesian grading model that also considers the instructor's attributes, more than one raw score per student, and student attributes such as class attendance, personality factors and other variables that can affect the students' raw scores). Those who are interested can treat this question as a multivariate mixture modeling problem.
The challenge is to include these attributes in the model to assign grades to the students. It is however, premature to conclude from this study that conditional Bayesian grading is unambiguously desirable. In addition, the current study helps us to better understand the effect of grading methods. Moreover, it may still be the case that our measure of instructor grading method is merely reflective of some other unmeasured instructor attribute. Before we can apply the conditional Bayesian method as a policy 111 outcome, it is important to understand the distributional consequences at all levels including the students, and understood by the policymakers in order to implement a policy for a grading standard. In general the Bayesian grading method is more appropriate and efficient. 112 REFERENCES Alex, S. (2003). A Method for Assigning Letter Grades: Multi-Curve Grading. Dept. of Economics, University of California - Irvine, 3151 SSPA, Irvine. Ash, R.B., (1972), Real Analysis and Probability. New York: Academic Press Inc. Berger, J.O. (1985), Statistical Decision Theory and Bayesian Analysis. 2nd Edition. New York; Springer-Verlag New York, Inc. Birnbaum, D.J., (2001). Grading System for Russian Fairy Tales. www.clover.slavic.pitt.edu/~tales/02-1/grading.html Box, G.E.P. and Tiao, G.C. (1973). Bayesian Inference in Statistical Analysis. Massachusetts: Addison-Wesley Publishing Company, Inc. Casella, G., and George, E. I., (1992). Explaining Gibbs Sampler. Vol. 46. No. 3. pg. 167-174.The American Statistical Assosiation. Congdon, P (2003). Applied Bayesian Modelling. West Sussex, England: John Wiley & Son Ltd. Cornebise, J., Maumy, M., and Philippe, G. A.(2005) Practical Implementation of the Gibbs Sampler for Mixture of Distribution: Application to the Determination of Specifications in Food Industry. www.stat.ucl.ac.be/~lambert/BiostatWorkshop2005/ slidesMaumy.pdf. 113 Cross, L.H., (1995). Grading Students. ED398239. ERIC Clearinghouse on Assessment and Evaluation, Washington DC. www.ericfacility.net/ericdigest. Ebel, R. L. and Frisbie, D. A. (1991). Essentials of Educational. 3th ed. Englewood Cliffs, NJ.: Prentice-Hall, Inc. Figlio, D.N., and Lucas, M.E. (2003). Do High Grading Standards Effect Student Performance?. Journal of Public Economics 88 (2004) 1815-1834. Frisbie, D.A and Waltman, K.K.(1992). Developing a personal grading plan. Educational Measurement: Issues and Practice. Iowa: National Council on Measurement in Education. Gelman, A., Carlin, J.B., Stern, H.S., and Rubin, D.B. (1995). Bayesian Data Analysis. London: Chapman & Hall. Glass, G.V., and Stanley, J.C.(1970). Statistical Methods in Education and Psychology. Englewood Cliffs, New Jersey: Prentice-Hall, Inc. Hogg, R.V. and Craig, A.T. (1978). Introduction to Mathematical Statistics. 4th ed. New York: MacMillan Publishing Co.,Inc. Hogg, R.V., McKean, J.W. and Craig, A.T. (2005). Introduction to Mathematical Statistics. 6th ed. New Jersey: Pearson Prentice Hall. Jasra, A., Holmes, C.C., and Stephens, D.A. (2005). Markov Chain Monte Carlo Methods and the Label Switching Problem in Bayesian Mixture Modeling. Vol. 20. No.1, 50-57. Statistical Sciece; Institute of Mathematical Statistics. Johnson, B. and Christensen, L.(2000). Educational Research: Chapter 5 -Quantitative and Qualitative Approaches. 2nd ed. Alabama ,Allyn and Bacon Inc. 114 Jones, P.N., (1991). Mixture Distributions Fitted to Instrument Counts. Rev. Sci. Instrum. 62(5); Australia: American Institute of Physics. Lawrence D.A (2005). 
A Guide to Teaching & Learning Practices: Chapter 13 - Grading. Tallahassee: Florida State University. Unpublished.

Lawrence, H. C. (1995). Grading Students. ED398239. ERIC Clearinghouse on Assessment and Evaluation, Washington, DC.

Lee, P. M. (1989). Bayesian Statistics. New York: Oxford University Press.

Martuza, V. R. (1977). Applying Norm-Referenced and Criterion-Referenced Measurement and Evaluation. Boston, Massachusetts: Allyn and Bacon Inc.

Merle, W. T. (1968). Statistics in Education and Psychology: A First Course. New York: The Macmillan Company.

Newman, K. (2005). Bayesian Inference. Lecture Notes. http://www.creem.stand.ac.uk/ken/mt4531.html

Peers, I. S. (1996). Statistical Analysis for Education and Psychology Researchers. London: Falmer Press.

Pradeep, D. and John, G. (2005). Grading Exams: 100, 99, ..., 1 or A, B, C? Incentives in Games of Status. 3-6.

Press, S. J. (2003). Subjective and Objective Bayesian Statistics: Principles, Models and Applications. New Jersey: John Wiley & Sons, Inc.

Raftery, A. E. (1996). Hypothesis Testing and Model Selection. In W. R. Gilks, S. Richardson and D. J. Spiegelhalter, eds., Markov Chain Monte Carlo in Practice. Chapman and Hall, 163-188.

Robert, L. W. (1972). An Introduction to Bayesian Inference and Decision. New York: Holt, Rinehart and Winston, Inc.

Robert, M. H. (1998). Assessment and Evaluation of Developmental Learning: Qualitative Individual Assessment and Evaluation Models. Westport: Greenwood Publishing Group Inc.

Spencer, C. (1983). Grading on the Curve. Antic, 1(6): 64.

Stanley, J. C. and Hopkins, K. D. (1972). Educational and Psychological Measurement and Evaluation. Englewood Cliffs, NJ: Prentice-Hall, Inc.

Stephens, M. (2000). Bayesian Analysis of Mixture Models with an Unknown Number of Components - An Alternative to Reversible Jump Methods. University of Oxford. The Annals of Statistics, Vol. 28, No. 1, 40-74.

Walsh, B. (2004). Markov Chain Monte Carlo. Lecture Notes for EEB 581.

Walvoord, B. E. and Anderson, V. J. (1998). Effective Grading: A Tool for Learning and Assessment. San Francisco: Jossey-Bass Publishers.
APPENDIX A1

Table A.1: Normal distribution table. The table lists φ(z), the area under the standard normal curve between 0 and z, for z = 0.00 to 3.99 in steps of 0.01 (for example, φ(1.00) = 0.3413, φ(1.96) = 0.4750 and φ(3.9) ≈ 0.5000). The full table of values is omitted here.

APPENDIX A2

Grading via the Standard Deviation Method for Selected Means and Standard Deviations. For standard deviations σ = 10 and σ = 8.5 and class means μ = 50, 51, 55, 59, 60, 63, 65 and 70, the table lists the cut-off scores T = μ + zσ at the equally spaced standardized points z = -1.9998, -1.6362, ..., 1.6362, 1.9998 (steps of 0.3636), together with the corresponding letter grades E, D, D+, C-, C, C+, B-, B, B+, A- and A. For example, with μ = 50 and σ = 10 the cut-offs run from 30.00 (grade E) up to 70.00 (grade A). The full listing is omitted here.
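Because the cut-offs in Appendix A2 are simple linear transformations of the standardized points, such a table is easy to regenerate. The short sketch below is added purely for illustration (it is not part of the original appendix); Python is used as a convenient notation, the eleven grade labels and the step of 0.3636 are read from the table above, and the function name cutoffs is introduced here only for this example.

# Sketch: cut-off scores for the Standard Deviation grading method,
# assuming T = mu + z*sigma at equally spaced standardized points.
grades = ["E", "D", "D+", "C-", "C", "C+", "B-", "B", "B+", "A-", "A"]
z_points = [round(-1.9998 + 0.3636 * k, 4) for k in range(12)]   # -1.9998, ..., 1.9998

def cutoffs(mu, sigma):
    """Return the raw-score cut-offs T = mu + z*sigma for each standardized point."""
    return [round(mu + sigma * z, 2) for z in z_points]

if __name__ == "__main__":
    for mu in (50, 60, 70):
        print(mu, cutoffs(mu, 10))   # e.g. mu = 50, sigma = 10 gives 30.00, ..., 70.00

With μ = 50 and σ = 10 the sketch reproduces the first column of the table (30.00 for grade E up to 70.00 for grade A); other means and standard deviations simply shift and stretch the same set of cut-offs.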
APPENDIX B

The Probability Set Function

Definition B1. If P(C) is defined for every subset C of the space Ω, and if
(a) P(C) ≥ 0,
(b) P(C1 ∪ C2 ∪ C3 ∪ ···) = P(C1) + P(C2) + P(C3) + ···, where the sets Ci, i = 1, 2, 3, ..., are such that no two have a point in common, that is, Ci ∩ Cj = ∅ for j ≠ i,
(c) P(Ω) = 1,
then P(C) is called the probability set function of the outcome of the random experiment. For each subset C of Ω, the number P(C) is called the probability that the outcome of the random experiment is an element of the set C, or the probability of the event C, or the probability measure of the set C.

A probability set function tells us how the probability is distributed over the various subsets C of a sample space Ω. In Definition B1(b), if the subsets are such that no two have an element in common, they are called mutually disjoint sets and the corresponding events C1, C2, C3, ... are said to be mutually exclusive events.
Moreover, if Ω = C1 ∪ C2 ∪ C3 ∪ ···, the mutually exclusive events are then characterized as being exhaustive, and the probability of their union is obviously equal to 1. The probability set function P and the random variable X together induce a probability that is sometimes denoted PX(A), where A is a set of values of X and C = {c ∈ Ω : X(c) ∈ A}; that is, Pr{X ∈ A} = PX(A) = P(C). The probability PX(A) is often called an induced probability (Hogg and Craig, 1978).

Mixture Model

Example B1. Consider the model

X_i | λ ~ iid Poisson(λ),    λ ~ Γ(α, β),  with α and β known.

A random sample is drawn from a Poisson distribution with mean λ, and the prior distribution of λ is a Γ(α, β) distribution. Write X′ = (X_1, X_2, ..., X_n). The joint conditional distribution of X, given λ, is derived as follows.

Poisson pdf:
f(x | λ) = e^{-λ} λ^x / x!,  x = 0, 1, 2, ...,  and zero elsewhere.

Likelihood function:
L(x | λ) = ∏_{i=1}^n f(x_i | λ) = f(x_1 | λ) f(x_2 | λ) ··· f(x_n | λ)
         = ∏_{i=1}^n e^{-λ} λ^{x_i} / x_i!
         = e^{-nλ} λ^{Σ x_i} / ∏_{i=1}^n x_i!.

Prior pdf:
π(λ) = λ^{α-1} e^{-λ/β} / (Γ(α) β^α),  0 < λ < ∞.

Hence the joint mixed continuous pdf [Eq. 3.4 or 3.5] is given by
L(x | λ) π(λ) = e^{-nλ} λ^{Σ x_i} · λ^{α-1} e^{-λ/β} / (∏_{i=1}^n x_i! · Γ(α) β^α),
provided 0 < λ < ∞ and x_i = 0, 1, 2, ..., i = 1, 2, ..., n, and equal to zero elsewhere.

The marginal distribution of the sample is then
m(x) = ∫_0^∞ λ^{Σ x_i + α - 1} e^{-λ(n + 1/β)} / (∏ x_i! · Γ(α) β^α) dλ.
With the substitution z = λ(n + 1/β) = λ(nβ + 1)/β, so that λ = zβ/(nβ + 1) and dλ = β/(nβ + 1) dz, and using Γ(α) = ∫_0^∞ z^{α-1} e^{-z} dz, this becomes
m(x) = Γ(Σ x_i + α) / [ ∏ x_i! · Γ(α) β^α · ((nβ + 1)/β)^{Σ x_i + α} ].

Therefore the posterior pdf of λ, given X = x, is
f(λ | x) = L(x | λ) π(λ) / m(x)
         = λ^{Σ x_i + α - 1} e^{-λ(nβ+1)/β} / [ Γ(Σ x_i + α) · (β/(nβ + 1))^{Σ x_i + α} ],   (**)
provided 0 < λ < ∞, x_i = 0, 1, 2, ..., i = 1, 2, ..., n, and equal to zero elsewhere. We can see that (**) is of the Gamma type with
α* = Σ_{i=1}^n x_i + α  and  β* = β / (nβ + 1).

When the joint mixed pdf is divided by the marginal distribution to obtain the posterior distribution, the denominator of (**) does not depend on λ; it depends only on the random variable x. We may therefore write the denominator as a constant depending on x, say c(x), so that the posterior can be rewritten as
f(λ | x) = c(x) λ^{Σ x_i + α - 1} e^{-λ(nβ+1)/β},  0 < λ < ∞,
where
c(x) = 1 / [ Γ(Σ x_i + α) · (β/(nβ + 1))^{Σ x_i + α} ].

In addition, the posterior distribution is proportional to L(x | λ) π(λ), that is,
f(λ | x) ∝ L(x | λ) π(λ).
Generally we write this as
f(θ | x) ∝ L(x | θ) π(θ),
where θ is the parameter of interest. In words,
posterior distribution ∝ likelihood × prior distribution.
Note that on the right-hand side all factors involving constants and x alone (not θ) can be dropped. For this example we simply write
f(λ | x) ∝ λ^{Σ x_i + α - 1} e^{-λ(nβ+1)/β},
or equivalently
f(λ | x) ∝ λ^{Σ x_i} e^{-nλ} · λ^{α-1} e^{-λ/β}. ■
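The conjugate updating in Example B1 is easy to check numerically. The following sketch is added here only as an illustration (it is not part of the original example); it assumes Python with numpy and scipy, and the hyperparameters and counts are arbitrary illustrative values, not data from this study. It compares the closed-form Gamma posterior with the normalized product of likelihood and prior on a grid.

# Sketch: numerical check of the Poisson-Gamma posterior in Example B1.
# The posterior should be Gamma(shape = sum(x) + alpha, scale = beta/(n*beta + 1)).
import numpy as np
from scipy import stats

alpha, beta = 2.0, 3.0           # illustrative prior hyperparameters (assumed values)
x = np.array([3, 5, 4, 2, 6])    # illustrative Poisson counts (assumed data)
n = len(x)

alpha_star = x.sum() + alpha
beta_star = beta / (n * beta + 1.0)

lam = np.linspace(0.01, 15.0, 2000)
# Unnormalized posterior: likelihood times prior, then normalize on the grid.
unnorm = lam ** x.sum() * np.exp(-n * lam) * lam ** (alpha - 1.0) * np.exp(-lam / beta)
unnorm /= unnorm.sum() * (lam[1] - lam[0])
closed = stats.gamma.pdf(lam, a=alpha_star, scale=beta_star)

print(np.max(np.abs(unnorm - closed)))   # should be very close to zero

The maximum pointwise difference between the two curves is essentially grid error, which confirms that the posterior is exactly the Gamma(α*, β*) distribution derived above.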
APPENDIX C

Example C1: Weighting Grade Components

Suppose a student took two tests, one project assignment and one final examination, and the instructor wants to weight the corresponding components at 20%, 20%, 10% and 50% respectively. First, the instructor constructs the probability distribution of the letter grade based on each component. These distributions are then combined using the weights: the first test's distribution multiplied by 20% is added to the second test's distribution multiplied by 20%, to the project's distribution multiplied by 10%, and to the final examination's distribution multiplied by 50%.

APPENDIX D

Some Useful Integrals

The Gamma, Inverse Gamma and Related Integrals. For α > 0 and β > 0:

(i)   ∫_0^∞ x^{α-1} e^{-βx} dx = Γ(α) β^{-α}
(ii)  ∫_0^∞ x^{-(α+1)} e^{-β/x} dx = Γ(α) β^{-α}
(iii) ∫_0^∞ x^{α-1} e^{-βx²} dx = (1/2) Γ(α/2) β^{-α/2}
(iv)  ∫_0^∞ x^{-(α+1)} e^{-βx^{-2}} dx = (1/2) Γ(α/2) β^{-α/2}

Generally, for α > 0, β > 0 and a > 0,
∫_0^∞ x^{α-1} e^{-βx^a} dx = ∫_0^∞ x^{-(α+1)} e^{-βx^{-a}} dx = (1/a) Γ(α/a) β^{-α/a},
where Γ(α) = ∫_0^∞ x^{α-1} e^{-x} dx.

APPENDIX E

WinBUGS for Bayesian Grading

Specification Tool: used to check that the model is syntactically correct, to enter the data, to compile the model, and to enter or generate the initial values.

Sample Monitor Tool: used to select the nodes to be monitored; WinBUGS saves a file of the values of each node generated by MCMC, from which the posterior distribution can be explored.

Update Tool: used to run the model by entering the desired number of iterations of the chain.

[Figure: complete WinBUGS session showing the model, tool windows and output windows.]

Case 1: Small Class

a) Model

MODEL {
   # likelihood: each raw score comes from the normal component of its grade G[i]
   for (i in 1:N) {
      Y[i] ~ dnorm(mu[i], tau[i])
      mu[i] <- mu.c[G[i]]
      tau[i] <- tau.c[G[i]]
      G[i] ~ dcat(P[])
   }
   # priors for the component precisions and means
   for (g in 1:M) {
      tau.c[g] ~ dgamma(alpha.b[g], beta.b[g])
      alpha.b[g] <- 3 + numgrade[g]/2
      beta.b[g] <- 1/(2*L[g])
      sigma.c[g] <- 1/sqrt(tau.c[g])
      mu.c[g] ~ dnorm(alpha.a[g], alpha.tau[g])
      alpha.a[g] <- (1/(1/v[g] + numgrade[g]*tau.c[g]))*(m[g]*v[g] + numgrade[g]*L[g]*tau.c[g])
      alpha.tau[g] <- (1/(numgrade[g]*tau.c[g] + 1/v[g]))
      m[g] ~ dnorm(q[g], 0.0025)
      q[g] <- 9*g
      v[g] <- 400
   }
   # mixture proportions
   P[1:11] ~ ddirch(phi[])
}

~ DATA (N = 62)
38, 45, 52, 57, 58, 60, 60, 60, 64, 65, 65, 67, 67, 69, 69, 69, 69, 69, 69, 70, 70, 70, 70, 70, 70, 72, 72, 72, 73, 74, 74, 75, 76, 76, 78, 79, 79, 81, 81, 81, 82, 82, 83, 83, 84, 85, 85, 87, 89, 89, 90, 91, 92, 92, 93, 93, 94, 94, 94, 95, 95, 96.
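For readers without WinBUGS, the same conditional structure can be explored with a hand-written Gibbs sampler. The sketch below is added for illustration only: it is a simplified Python version of the model above, not the program used in this study; it only loosely mirrors the WinBUGS hyperparameters (fixed prior centres 9g, prior variance 400, a simple Gamma prior on each precision), uses a small set of toy scores rather than the Case 1 data, and ignores the label-switching issue discussed in Chapter 5.

# Sketch of a Gibbs sampler for a normal mixture grading model (illustration only).
import numpy as np

rng = np.random.default_rng(0)
y = np.array([38.0, 45, 52, 57, 58, 60, 62, 65, 70, 72, 75, 80, 85, 90, 95])  # toy scores
M = 11                                   # number of letter-grade components
prior_mean = 9.0 * np.arange(1, M + 1)   # prior centres, as in q[g] <- 9*g
prior_var = 400.0                        # prior variance of each component mean
a0, b0 = 3.0, 1.0                        # Gamma prior (shape, rate) for the precisions
phi = np.ones(M)                         # Dirichlet prior for the mixture weights

mu = prior_mean.copy()
tau = np.ones(M)                         # component precisions
p = np.ones(M) / M                       # mixture proportions

for it in range(2000):
    # 1. sample the grade labels G[i] from their categorical full conditional
    logw = (np.log(p) + 0.5 * np.log(tau)
            - 0.5 * tau * (y[:, None] - mu[None, :]) ** 2)
    w = np.exp(logw - logw.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    G = np.array([rng.choice(M, p=w[i]) for i in range(len(y))])

    for g in range(M):
        yg = y[G == g]
        n_g = len(yg)
        # 2. sample mu[g] from its normal full conditional
        post_prec = 1.0 / prior_var + n_g * tau[g]
        post_mean = (prior_mean[g] / prior_var + tau[g] * yg.sum()) / post_prec
        mu[g] = rng.normal(post_mean, 1.0 / np.sqrt(post_prec))
        # 3. sample tau[g] from its gamma full conditional (numpy uses shape, scale)
        sse = ((yg - mu[g]) ** 2).sum()
        tau[g] = rng.gamma(a0 + n_g / 2.0, 1.0 / (b0 + sse / 2.0))

    # 4. sample the mixture proportions from the Dirichlet full conditional
    counts = np.bincount(G, minlength=M)
    p = rng.dirichlet(phi + counts)

print(np.round(mu, 1))   # component means at the final iteration

In practice one would retain the draws from many iterations after a burn-in period, exactly as WinBUGS does, and summarize them as posterior densities like those shown below.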
b) Posterior Marginal Density

Marginal posterior density estimates for the means of the different letter grades:

[Figure: posterior density estimates of mu.c[1] to mu.c[11], chains 1:2, 150,000 samples. The density for mu.c[1] is extremely diffuse (axis range of order 10^7), reflecting the absence of scores in the lowest grade category; the remaining densities are concentrated roughly at 38, 45, 55, 60, 65, 69, 75, 84, 92 and 95.]

c) Convergence Diagnostics

~ Trace
[Figure: history (trace) plots of the two chains for mu.c[1] to mu.c[11].]

~ Gelman-Rubin Statistics
[Figure: Gelman-Rubin plots for mu.c[1] to mu.c[11], chains 1:2, over approximately 60,000 iterations.]

~ Running Quantiles
[Figure: running quantiles for mu.c[1] to mu.c[11], chains 2:1.]

~ Autocorrelation
[Figure: autocorrelation plots for mu.c[1] to mu.c[11], chains 1:2, up to lag 40.]
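The Gelman-Rubin plots above compare between-chain and within-chain variability. A minimal sketch of the basic calculation is given below for illustration; it is the textbook form of the potential scale reduction factor, not WinBUGS's exact implementation, and the two simulated chains are invented for the example.

# Sketch: basic Gelman-Rubin potential scale reduction factor for parallel chains.
import numpy as np

def gelman_rubin(chains):
    """chains: 2-D array with one row per chain; columns are post-burn-in draws."""
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    chain_vars = chains.var(axis=1, ddof=1)
    W = chain_vars.mean()                     # within-chain variance
    B = n * chain_means.var(ddof=1)           # between-chain variance
    var_hat = (n - 1) / n * W + B / n         # pooled estimate of the posterior variance
    return np.sqrt(var_hat / W)               # values near 1 indicate convergence

rng = np.random.default_rng(1)
chains = rng.normal(70.0, 2.0, size=(2, 5000))   # two illustrative, well-mixed chains
print(gelman_rubin(chains))                      # should be close to 1

Values close to 1, as seen in the plots for mu.c[2] to mu.c[11] once the chains settle, indicate that the two chains are sampling the same distribution.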
Table E1: Measuring performance of class loss for Case 1 — GB (conditional Bayesian) method, c = 1. For each of the 62 raw scores the table lists the instructor's grade category (1 = E, ..., 11 = A), the category assigned by the GB method, and the class-loss contribution, where a disagreement of one category contributes c. The two gradings agree for most students; the total class loss is 12, giving CC = 12/62 = 0.1935.

Continued: Table E1, c = 1. The continuation lists the squared GB categories and the squared errors e² between the instructor's and the GB categories. The error sum of squares is 6 + 7 = 13 against a total sum of squares of 1221 + 2577 = 3798, so 1 − R = 0.0034 and the coefficient of determination is R = 0.9966.

Table E2: Measuring performance of class loss for Case 1 — GB method, c = 1.6. With the lenient weighting the same disagreements give a total class loss of 19.2, so CC = 19.2/62 = 0.3097.

Table E3: Measuring performance of class loss for Case 1 — Straight Scale method, c = 1. The Straight Scale grades disagree with the instructor's grades far more often; the total class loss is 49, giving CC = 49/62 = 0.7903.

Continued: Table E3, c = 1. The error sum of squares is 16 + 33 = 49 against a total sum of squares of 1478 + 3343 = 4821, so 1 − R = 0.0102 and R = 0.9898.
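The class-loss and coefficient-of-determination figures in Tables E1 to E6 are simple functions of the instructor's and the method's grade vectors. The sketch below is an illustration added here, not the thesis's own computation: it follows one plausible reading of the tables, in which c = 1 gives the neutral class loss and c = 1.6 the lenient class loss, and in which R = 1 − Σe²/Σg² with g the method's grade category; the function names and the short grade vectors are invented for the example.

# Sketch: class loss and coefficient of determination for a grading method,
# following the layout of Tables E1-E6 (instructor vs. method grade categories).
def class_loss(instructor, method, c=1.0):
    """Mean class loss CC: each category of disagreement contributes c."""
    losses = [c * abs(i - m) for i, m in zip(instructor, method)]
    return sum(losses) / len(losses)

def determination(instructor, method):
    """R = 1 - sum(e^2)/sum(g^2), with e the category error and g the method grade."""
    e2 = sum((i - m) ** 2 for i, m in zip(instructor, method))
    g2 = sum(m ** 2 for m in method)
    return 1.0 - e2 / g2

instructor = [2, 3, 4, 4, 4, 5, 5, 5]    # illustrative grade categories (1 = E, ..., 11 = A)
gb_method = [2, 3, 3, 4, 4, 5, 5, 5]
print(class_loss(instructor, gb_method, c=1.0))    # neutral class loss
print(class_loss(instructor, gb_method, c=1.6))    # lenient class loss
print(determination(instructor, gb_method))

Under this reading, a smaller class loss and an R value closer to 1 both indicate closer agreement with the instructor's judgmental grading, which is how Tables E1 to E6 rank the three methods.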
Table E4: Measuring performance of class loss for Case 1 — Straight Scale method, c = 1.6. With the lenient weighting the Straight Scale disagreements give a total class loss of 78.6, so CC = 78.6/62 = 1.2677.

Table E5: Measuring performance of class loss for Case 1 — Standard Deviation method, c = 1. The Standard Deviation grades differ from the instructor's grades by up to three categories for most students; the total class loss is 92, giving CC = 92/62 = 1.4839.
Continued: Table E5, c = 1. The error sum of squares is 141 + 39 = 180 against a total sum of squares of 601 + 2259 = 2860, so 1 − R = 0.0629 and R = 0.9371.

Table E6: Measuring performance of class loss for Case 1 — Standard Deviation method, c = 1.6. The table reports the same totals as for c = 1: a total class loss of 92 and CC = 92/62 = 1.4839.

Case 2: Large Class

a) Model

MODEL {
   for (i in 1:N) {
      Y[i] ~ dnorm(mu[i], tau[i])
      mu[i] <- mu.c[G[i]]
      tau[i] <- tau.c[G[i]]
      G[i] ~ dcat(P[])
   }
   for (g in 1:M) {
      tau.c[g] ~ dgamma(alpha.b[g], beta.b[g])
      alpha.b[g] <- 3 + numgrade[g]/2
      beta.b[g] <- 1/(2*L[g])
      sigma.c[g] <- 1/sqrt(tau.c[g])
      mu.c[g] ~ dnorm(alpha.a[g], alpha.tau[g])
      alpha.a[g] <- (1/(1/v[g] + numgrade[g]*tau.c[g]))*(m[g]*v[g] + numgrade[g]*L[g]*tau.c[g])
      alpha.tau[g] <- (1/(numgrade[g]*tau.c[g] + 1/v[g]))
      m[g] ~ dnorm(q[g], 0.0025)
      q[g] <- 9*g
      v[g] <- 400
   }
   P[1:11] ~ ddirch(phi[])
}

~ DATA (N = 498)
29,30,30,30,31,32,34,34,35,36,36,36,37,38,38,40,40,41,41,41,42,42,42,43,43,43,43,44,45,45,
46,47,48,48,50,50,50,50,51,51,51,52,52,52,52,53,53,53,54,54,56,57,57,57,58,58,58,59,59,60,
60,60,60,60,60,61,61,61,61,61,61,62,62,62,62,62,62,63,63,64,64,64,64,65,65,65,65,65,65,65,65,65,65,65,65,65,65,65,66,66,66,66,66,66,66,66,66,66,66,67,67,67,67,67,67,67,67,67,67,67,
67,67,68,68,68,68,68,68,68,68,68,68,68,69,69,69,69,69,69,69,69,69,69,70,70,70,70,70,70,70,70,70,70,70,70,70,70,70,70,70,70,70,70,71,71,71,71,71,71,71,71,71,71,71,71,71,71,71,71,71,71,
71,71,71,71,71,71,71,71,71,71,71,71,71,72,72,72,72,72,72,72,72,72,72,72,72,72,72,72,72,72,72,72,72,72,72,72,72,72,72,72,72,72,72,72,72,72,72,72,72,72,73,73,73,73,73,73,73,73,73,73,73,
73,73,73,73,73,73,73,73,73,73,73,73,73,73,73,73,73,73,73,73,73,73,73,73,74,74,74,74,74,74,74,74,74,74,75,75,75,75,75,75,75,75,75,75,75,75,75,75,75,75,75,75,75,75,75,75,75,75,75,75,
75,75,75,75,75,76,76,76,76,76,76,76,76,76,76,76,76,76,76,76,76,76,76,76,76,77,77,77,77,77,77,77,77,77,77,77,77,77,77,77,77,78,78,78,78,78,78,78,78,78,78,78,78,78,78,78,78,78,78,78,
78,78,78,78,78,78,78,78,78,78,78,79,79,79,79,79,79,79,79,79,79,79,79,79,79,79,79,79,80,80,80,80,80,80,80,80,80,80,80,80,80,80,80,81,81,81,81,81,81,81,81,81,81,81,81,81,81,82,82,
82,82,82,82,82,82,82,82,82,82,82,82,82,82,82,82,82,83,83,83,83,83,83,83,83,83,83,83,83,83,83,83,83,84,84,84,84,84,84,84,84,84,84,84,85,85,85,85,86,86,86,86,86,87,87,87,87,87,88,88,
88,89,89,90,90,91,91,91,92,94,94,95,95,96,96,98,98,98

b) Posterior Marginal Density

Marginal posterior density estimates for the means of the different letter grades:
[Figure: posterior density estimates of mu.c[1] to mu.c[11] for the large class, chains 1:2, 150,000 samples; the densities are concentrated roughly at 33, 43, 52, 59, 64, 67, 72, 76, 81, 84 and 92.]

c) Convergence Diagnostics

~ Trace
[Figure: history (trace) plots of the two chains for mu.c[1] to mu.c[11].]

~ Gelman-Rubin Statistics
[Figure: Gelman-Rubin plots for mu.c[1] to mu.c[11], chains 1:2, over approximately 60,000 iterations.]

~ Running Quantiles
[Figure: running quantiles for mu.c[1] to mu.c[11], chains 2:1.]

~ Autocorrelation
[Figure: autocorrelation plots for mu.c[1] to mu.c[11], chains 1:2, up to lag 40.]

APPENDIX F

Bayes Metropolis

David A. Frisbie