UNIVERSITI TEKNOLOGI MALAYSIA

BORANG PENGESAHAN STATUS TESIS♦

JUDUL: STATISTICAL APPROACH ON GRADING: MIXTURE MODELING

SESI PENGAJIAN: 2005/2006

Saya ZAIRUL NOR DEANA BINTI MD DESA mengaku membenarkan tesis (PSM/Sarjana/Doktor Falsafah)* ini disimpan di Perpustakaan Universiti Teknologi Malaysia dengan syarat-syarat kegunaan seperti berikut:

1. Tesis adalah hak milik Universiti Teknologi Malaysia.
2. Perpustakaan Universiti Teknologi Malaysia dibenarkan membuat salinan untuk tujuan pengajian sahaja.
3. Perpustakaan dibenarkan membuat salinan tesis ini sebagai bahan pertukaran antara institusi pengajian tinggi.
4. ** Sila tandakan (√)

   [   ] SULIT        (Mengandungi maklumat yang berdarjah keselamatan atau kepentingan
                       Malaysia seperti yang termaktub di dalam AKTA RAHSIA RASMI 1972)

   [   ] TERHAD       (Mengandungi maklumat TERHAD yang telah ditentukan oleh
                       organisasi/badan di mana penyelidikan dijalankan)

   [ √ ] TIDAK TERHAD

                                                      Disahkan oleh

(TANDATANGAN PENULIS)                                 (TANDATANGAN PENYELIA)

Alamat Tetap:                                         Dr. ISMAIL B. MOHAMAD
NO. 114 TAMAN ORKID, FASA II,
SG. LAYAR, 08000 SG. PETANI,
KEDAH DARUL AMAN

Tarikh: APRIL 2006                                    Tarikh: APRIL 2006

CATATAN: *  Potong yang tidak berkenaan.
         ** Jika tesis ini SULIT atau TERHAD, sila lampirkan surat daripada pihak
            berkuasa/organisasi berkenaan dengan menyatakan sekali sebab dan tempoh tesis ini perlu
            dikelaskan sebagai SULIT atau TERHAD.
         ♦  Tesis dimaksudkan sebagai tesis bagi Ijazah Doktor Falsafah dan Sarjana secara
            penyelidikan, atau disertasi bagi pengajian secara kerja kursus dan penyelidikan, atau
            Laporan Projek Sarjana Muda (PSM).
“I hereby declare that I have read this dissertation and in my
opinion this dissertation is sufficient in terms of scope and quality for the
award of the degree of Master of Science (Mathematics)”
Signature
: …………………………………..
Supervisor
: Dr. Ismail B. Mohamad
Date
: 14th April 2006
STATISTICAL APPROACH ON GRADING:
MIXTURE MODELING
ZAIRUL NOR DEANA BINTI MD DESA
A dissertation submitted in partial fulfillment of the
requirements for the award of the degree of
Master of Science (Mathematics)
Faculty of Science
Universiti Teknologi Malaysia
APRIL 2006
I declare that this thesis entitled “Statistical Approach on Grading: Mixture Modeling”
is the result of my own research except as cited in the references.
The thesis has not been accepted for any degree and is not concurrently submitted in
candidature of any other degree.
Signature
: …………………………………..
Name
: Zairul Nor Deana Binti Md Desa
Date
: 14th April 2006
Especially for my beloved parents
Mak, Tok and Tokwan
who taught me to trust myself and love all things great and small
My siblings ~ Abg Am, Dila, Fatin and Aida
Zaha
&
All teachers, lecturers and friends
ACKNOWLEDGEMENT
First of all, I would like to thank Allah S.W.T, the Lord Almighty, for the health
and perseverance needed to complete this thesis.
I would like to express my appreciation to my supervisor, Dr. Ismail B. Mohamad,
for his meticulous and painstaking review and his tireless effort in reading earlier
drafts of this thesis. In addition, I thank the chairperson, Dr. Arifah Bahar, and both
internal examiners, Dr. Zarina Mohd Khalid and Tn. Hj. Hanafiah Mat Zin, for their
helpful and detailed comments, which improved the final report. My sincere appreciation also
extends to the lecturers and my fellow postgraduate colleagues of the Department of
Mathematics at Universiti Teknologi Malaysia, who provided me with valuable input.
Finally, I gratefully acknowledge the support of my family and anonymous
referees for their patience and understanding.
ABSTRACT
The purpose of this study is to compare results obtained from three methods of
assigning letter grades to students’ achievement. The conventional and the most popular
method to assign grades is the Straight Scale method. Statistical approaches which use
the Standard Deviation and conditional Bayesian methods are considered to assign the
grades. In the conditional Bayesian model, we assume the data to follow the Normal
Mixture distribution where the grades are distinctively separated by the parameters:
means and proportions of the Normal Mixture distribution. The problem lies in
estimating the posterior density of the parameters which is analytically intractable. A
solution to this problem is using the Markov Chain Monte Carlo method namely Gibbs
sampler algorithm. The Gibbs sampler algorithm is applied using the WinBUGS
programming package. The Straight Scale, Standard Deviation and Conditional
Bayesian methods are applied to the examination raw scores of 560 students. The
performances of these methods are compared using the Neutral Class Loss, Lenient Class
Loss and the Coefficient of Determination. The results show that the Conditional Bayesian
method outperforms the conventional methods of assigning grades.
ABSTRAK
Tujuan kajian ini adalah untuk membandingkan keputusan yang didapati
daripada tiga kaedah memberi gred kepada pencapaian pelajar. Kaedah konvensional
yang paling popular ialah kaedah Skala Tegak. Pendekatan statistik yang menggunakan
kaedah Sisihan Piawai dan kaedah Bayesian Bersyarat dipertimbangkan untuk memberi
gred. Dalam model Bayesian Bersyarat, dianggapkan bahawa data mengikut taburan
Normal Tergabung di mana setiap gred dipisahkan secara berasingan oleh
parameter-parameter, iaitu min dan kadar bandingan taburan Normal Tergabung tersebut.
Masalah yang timbul ialah ketumpatan posterior bagi parameter-parameter tersebut
sukar dianggarkan secara analitik. Satu penyelesaian kepada masalah ini ialah
dengan menggunakan kaedah Markov Chain Monte Carlo, iaitu melalui algoritma
persampelan Gibbs. Algoritma persampelan Gibbs diaplikasikan dengan menggunakan
pekej perisian WinBUGS. Kaedah Skala Tegak, kaedah Sisihan Piawai
dan kaedah Bayesian Bersyarat dijalankan terhadap markah mentah peperiksaan
560 orang pelajar. Pencapaian ketiga-tiga kaedah dibandingkan melalui nilai Kehilangan
Kelas Neutral, Kehilangan Kelas Tidak Tegas dan Pekali Penentuan. Keputusan
yang diperolehi menunjukkan bahawa kaedah Bayesian Bersyarat memberikan
pencapaian yang lebih baik berbanding kaedah Skala Tegak dan kaedah
Sisihan Piawai.
TABLE OF CONTENTS

CHAPTER    TITLE                                                        PAGE

           Cover
           Declaration                                                  ii
           Dedication                                                   iii
           Acknowledgement                                              iv
           Abstract                                                     v
           Abstrak                                                      vi
           Table of Contents                                            vii
           List of Tables                                               x
           List of Figures                                              xi
           List of Appendixes                                           xiii
           Nomenclature                                                 xiv

1          RESEARCH FRAMEWORK                                           1
           1.1  Introduction                                            1
           1.2  Statement of the Problem                                2
           1.3  Research Objectives                                     3
           1.4  Scope of the Study                                      3
           1.5  Significance of the Study                               4
           1.6  Research Layout                                         5

2          REVIEW OF GRADING PLAN AND GRADING METHODS                   7
           2.1  Introduction                                            7
           2.2  Grading Philosophies                                    10
           2.3  Definition and Designation of Measurement               12
                2.3.1  Levels of Measurement                            16
                2.3.2  Norm-Referenced Versus Criterion-Referenced
                       Measurement                                      18
           2.4  Weighting Grading Components                            20

3          GRADING ON CURVES AND BAYESIAN GRADING                       24
           3.1  Introduction                                            24
           3.2  Grading On Curves                                       25
                3.2.1  Linearly Transformed Scores                      25
                3.2.2  Model Set Up for Grading on Curves               26
                3.2.3  Standard Deviation Method                        27
           3.3  Bayesian Grading                                        31
                3.3.1  Distribution-Gap                                 32
                3.3.2  Why Bayesian Inference?                          33
                3.3.3  Preliminary View of Bayes' Theorem               35
                3.3.4  Bayes' Theorem                                   37
                3.3.5  Model Set Up for Bayesian Grading                41
                3.3.6  Bayesian Methods for Mixtures                    41
                3.3.7  Mixture of Normal (Gaussian) Distribution        45
                3.3.8  Prior Distribution                               48
                3.3.9  Posterior Distribution                           54
           3.4  Interval Estimation                                     61

4          NUMERICAL IMPLEMENTATION OF THE BAYESIAN GRADING             64
           4.1  Introduction to Markov Chain Monte Carlo Methods        65
           4.2  Gibbs Sampling                                          66
           4.3  Introduction to WinBUGS Computer Program                69
           4.4  Model Description                                       69
           4.5  Setting the Priors and Initial Values                   74
                4.5.1  Setting the Prior                                74
                4.5.2  Initial Values                                   77
           4.6  Label Switching in MCMC                                 77
           4.7  Sampling Results                                        78
                4.7.1.1  Case 1: Small Class                            79
                4.7.1.2  Convergence Diagnostics                        83
                4.7.2.1  Case 2: Large Class                            87
                4.7.2.2  Convergence Diagnostics                        93
           4.8  Discussion                                              96
           4.9  Loss Function and Leniency Factor                       101
           4.10 Performance Measures                                    105

5          CONCLUSION AND SUGGESTION                                    108
           5.1  Conclusion                                              108
           5.2  Suggestions                                             110

           REFERENCES                                                   112
           Appendix A – F                                               116-147
LIST OF TABLES

TABLE NO.   TITLE                                                        PAGE

2.1         Comparison of Norm-Referenced and Criterion-Referenced       19
2.2         Rubrics for Descriptive Scale                                19
3.1         Grading on Curve Scales for the Scores between Which a
            Certain Letter Grade is Assigned, the Mean is "set" at C+    30
4.1         Optimal Estimates of Component Means for Case 1              81
4.2         Minimum and Maximum Score for Each Letter Grade,
            Percent of Students and Probability of Raw Score
            Receiving that Grade for GB: Case 1                          82
4.3         Straight Scale and Standard Deviation Methods: Case 1        82
4.4         Optimal Estimates of Component Means for Case 2              89
4.5         Minimum and Maximum Score for Each Letter Grade,
            Percent of Students and Probability of Raw Score
            Receiving that Grade for GB: Case 2                          90
4.6         Straight Scale and Standard Deviation Methods: Case 2        90
4.7         Posterior for 95% Credible Interval of Component Means
            and its Ratio                                                98
4.8         Leniency Factor and Loss Function Constant                   104
4.9         Cumulative Probability for GB; Case 1                        105
4.10        Performance of GB, Straight Scale and Standard Deviation
            Methods: Case 1                                              107
LIST OF FIGURES

FIGURE NO.  TITLE                                                        PAGE

1.1         A Functional Mapping of Letter Grades                        14
1.2         A Partition on Letter Grades                                 15
3.1         Plot of the Raw Scores and Corresponding Transformed
            Scores                                                       26
3.2         Relationship among Different Types of Transformation
            Scores in a Normal Distribution; µ = 60, σ = 10              30
3.3         Hierarchical Representation of a Mixture                     44
3.4         Normal Mixture Model Outlined on Each Letter Grade           45
4.1         Graphical Model for Bayesian Grading                         73
4.2         Kernel-Density Plots of Posterior Marginal Distribution
            of Mean for Grade B+                                         85
4.3         Monitoring Plots for Traces Diagnostics of Mean: (a)
            Grade D and (b) Grade B+                                     86
4.4         Gelman-Rubin Convergence Diagnostics of Mean; (a)
            Grade D and (b) Grade B+                                     86
4.5         Quantiles Diagnostics of Mean; (a) Grade D and (b)
            Grade B+                                                     87
4.6         Autocorrelations Diagnostics of Mean; (a) Grade D and
            (b) Grade B+                                                 87
4.7         Kernel-Density Plots of Posterior Marginal Distribution
            of Mean for Grade B                                          92
4.8         Monitoring Plots for Traces Diagnostics of Mean: (a)
            Grade B and (b) Grade A                                      94
4.9         Gelman-Rubin Convergence Diagnostics of Mean; (a)
            Grade B and (b) Grade A                                      94
4.10        Quantiles Diagnostics of Mean; (a) Grade B and (b)
            Grade A                                                      95
4.11        Autocorrelations Diagnostics of Mean; (a) Grade B and
            (b) Grade A                                                  95
4.12        Cumulative Distribution Plots for Straight Scale (dotted
            line) and GB Method; (a) Case 1 and (b) Case 2               99
4.13        Density Plots with Histogram for Case 1                      100
4.14        Density Plots and Histogram for Case 2                       100
LIST OF APPENDIXES

APPENDIX    TITLE                                                        PAGE

A1          Normal Distribution Table                                    116
A2          Grading via Standard Deviation Method for Selected Means
            and Standard Deviations                                      117
B           The Probability of Set Function and Mixture Model            119
C           Weighting Grades Component                                   123
D           Some Useful Integrals - The Gamma, Inverse Gamma and
            Related Integrals                                            124
E           WinBUGS for Bayesian Grading                                 125
F           Bayes, Metropolis and David A. Frisbie                       147
NOMENCLATURE

GC           -   Grading on Curves
GB           -   Conditional Bayesian Grading
MCG          -   Multi-Curve Grading
MCMC         -   Markov Chain Monte Carlo
G            -   Grade Sample Space
N            -   Number of Students in a Class
n_g          -   Number of Students for Grade g
B            -   Burn-In Period
T            -   Number of Iterations
h(θ | x)     -   Conditional Probability Density of Prior
L(x | θ)     -   Conditional Likelihood Function of Raw Score
p(⋅ | x)     -   Conditional Distribution of Conjugate Prior or Posterior Density
π(θ)         -   Prior Distribution
m(x)         -   Marginal Density of Raw Score
p(x_i)       -   The Probability Distribution of Raw Score
π_g          -   Component Probability of Component g
θ            -   Parameter of Interest (Conjugate Prior)
Θ            -   Vector of Parameters of Interest
N(⋅, ⋅)      -   Normal Distribution
IG(⋅, ⋅)     -   Inverse Gamma Distribution
Di(⋅)        -   Dirichlet Distribution
C(⋅)         -   Categorical Distribution
R            -   Ratio in Gelman-Rubin Statistics
R²           -   Coefficient of Determination
C(y_i, ŷ_i)  -   Loss Function
CC           -   Class Loss
LF           -   Leniency Factor
CHAPTER 1
RESEARCH FRAMEWORK
1.1  Introduction
At the end of a course, educators intend to convey the level of achievement of
each student in their classes by assigning grades. Students, university administrators and
prospective employers use these grades to make a multitude of different decisions.
Grades cause a great deal of stress for students; this is a fact of educational life. Grades
reflect personal philosophy and human psychology, as well as the effort to measure
intellectual progress with standardized, objective criteria.
There are many ways to assign students' grades, all of which seem to have their
advantages and disadvantages. Educators or graders are the most proficient persons
to form a personal grading plan, because such a plan incorporates the personal values, beliefs and
attitudes of a particular educator. For that reason, the philosophy of grading used in establishing
a grading plan must be shaped and influenced by current research evidence, prevailing
lore, reasoned judgement and matters of practicality. However, a more professional
approach should be developed that can be applied at any grade level and in any
subject matter area where letter grades are assigned to students at the end of a reporting
period.
1.2  Statement of the Problem
Most approaches to a grading plan require additional effort and varying degrees of
mathematical expertise. The educator has to assign a score which maps meaningfully to a
letter grade, such as A, B- or C, for each student. There is no standard answer to
questions like: What should an "A" grade mean? What percentage of students in my
class should receive a "C"? University or faculty regulations encourage a uniform
grading policy so that grades of A, B, C, D and E have the same meaning,
independent of the faculty or university awarding the grade. Other campus units usually
know the grading standard of a faculty or university.
For example, a "B" in a required course given by Faculty X might indicate that
the student has developed most of the skills referred to as
prerequisites for later learning. A "B" in a required course given by Faculty Y might
indicate that the student is not a qualified candidate for graduate school in the related
fields. Nevertheless, the faculty and the educator may be using different grading standards.
The course structure may seem to require a grading plan which differs from faculty
guidelines, or the educator and faculty may hold different ideas about the function of
grading. Therefore a satisfactory grading plan must be worked out in order to meet the
objectives of measurement and evaluation in education.
Since both philosophies and instructional approaches change as the curriculum
changes, educators need to adjust their grading plans accordingly. In this study, we do
not compare faculty regulations on grading methods; instead we attempt to
differentiate each letter grade based on the overall raw score of the student from the
beginning to the end of the semester. A statistically based method is
used in this research which takes into account the grading philosophy with respect to
conditions of measurement and evaluation of students' achievement. The students' final
grades are intended to have a norm-referenced meaning. By definition, a norm-referenced grade
does not tell what a student can do; there is no content basis other than the name of the
subject area associated with the grade. Furthermore, the distinctions and relationships among
several grading methods, conventional and futuristic, are discussed carefully.
1.3  Research Objectives
The objectives of this study are to understand grading philosophy, grading
policies and grading methods, and to explore appropriate grading methods. The
philosophy and policy are viewed as educational principles, while the grading methods
are driven by statistical procedures. The primary objectives are to develop
mathematical models of grading systems for both conventional and future approaches
and, finally, to carry out the statistical programming for the Bayesian
Grading method of assigning grades. Data from past years' records are used in this
study.
1.4  Scope of the Study
Assigning a grade to a student can be done in various ways. At present, most
instructors assign grades conventionally from the raw scores of the tests given in class
and the final examination at the end of a semester. The grades may also be
assigned based on the instructor's "feel", built up through experience with the
students. To avoid "unfair" judgment of student performance, a new, statistically based
grading method is fitted alongside the conventional grading plan.
This method provides scientific evidence for assigning a grade, as compared to relying only on
the instructor's personal feel.
A model called the Bayesian Grading (GB) method is developed to assign the
grades. Bayesian inference for decision making is an important tool for classifying
a letter grade into its particular class or component. The Gibbs sampler is used to
estimate the optimal class for each grade when the students' raw scores are assumed to
be normal and to form a bell-shaped distribution. An adjustment to the raw scores which takes
into account the instructor's leniency factor allows the educator to vary the leniency
of his or her evaluation. Based on this information, we calculate the probability that each
student's raw score corresponds to each possible letter grade. The grader's (or
instructor's) degree of leniency is used to specify the educator's loss function, which is then used
to assign the optimal letter grades.
These grading categories are built upon earlier understanding of the students' raw
scores, and they combine the raw scores with the current data in a way that updates the
degree of belief (subjective probability) of the educator. Under this principle, each student's
raw scores are assumed to be independent of those of the other students.
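As a small illustration of the classification step described above, the sketch below computes the probability that a raw score belongs to each letter-grade component of a normal mixture, P(g | x) ∝ π_g N(x; µ_g, σ_g²). The grade labels and parameter values here are hypothetical; in the study itself the means, standard deviations and proportions come from the Gibbs sampler output rather than being fixed by hand.

```python
import math

# Hypothetical mixture parameters (in practice, posterior estimates
# produced by the Gibbs sampler, not fixed values).
grades = ["C", "B", "A"]
means  = [55.0, 70.0, 85.0]   # component means
sds    = [5.0, 5.0, 5.0]      # component standard deviations
props  = [0.3, 0.5, 0.2]      # component proportions (sum to 1)

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def membership_probabilities(x):
    """P(grade g | raw score x), proportional to pi_g * N(x; mu_g, sigma_g^2)."""
    weights = [p * normal_pdf(x, m, s) for p, m, s in zip(props, means, sds)]
    total = sum(weights)
    return {g: w / total for g, w in zip(grades, weights)}

print(membership_probabilities(72.0))   # mostly "B", with some probability of "A"
```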
1.5  Significance of the Study
In this research, the Bayesian Grading (GB) method of assigning letter grades to
students based on their whole-semester raw scores is described. GB categorizes the
marks into several different classes, and each class is assigned a different letter grade.
The method takes into account the educator's degree of leniency in categorizing the raw
scores into the classes.

This instructional statistical design is intended to help prospective, intermediate and
beginning educators sort out the issues involved in formulating their grading plans, and
to help experienced educators re-examine the fairness and defensibility of their current
grading practices. It can also be applied at any level of school, college or university.
1.6  Research Layout
Chapter I is intended to introduce the basic terminology and framework of the
study. Chapter II reviews the literature on basic grading policies and grading plans for
conventional grading methods and for a futuristic grading method that will be used
throughout the dissertation.
Chapter III presents a more specific grading method, grading based
on the curve, and introduces the basic Bayesian grading method, including a
discussion of the probability distribution of letter grades. The Bayesian inference
involved in setting the prior and estimating the posterior is presented theoretically,
with proofs included for the reader's understanding.
In Chapter IV, we discuss in detail the estimation of the model parameters that are
drawn from the mixture model using the Gibbs sampler. In addition, an estimation of the
letter grades which takes into account the instructor's loss function is shown to find the
optimal letter grades. The simulation is developed using WinBUGS (a recently
developed software package: the MS Windows operating system version of Bayesian
Analysis Using Gibbs Sampling). This is flexible software for Bayesian analysis of
complex statistical models using Markov chain Monte Carlo (MCMC) methods.
The free version of the software can be downloaded from www.mrc-bsu.cam.ac.uk/bugs/.
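The dissertation runs the sampler in WinBUGS; purely as an illustration of the kind of alternating updates involved (not the exact model used later), the following sketch draws component labels and component means for a normal mixture with fixed variances and proportions. All numerical values are hypothetical.

```python
import math
import random

def gibbs_mixture_means(x, mu, sigma, props, mu0=60.0, var0=100.0, iters=1000):
    """Toy Gibbs sampler for a normal mixture with fixed variances and
    proportions: alternately sample the labels z and the component means mu."""
    K = len(mu)
    mu = list(mu)
    for _ in range(iters):
        # 1. Sample each label z_i with P(z_i = k) proportional to
        #    props_k * N(x_i; mu_k, sigma_k^2).
        z = []
        for xi in x:
            w = [props[k] * math.exp(-0.5 * ((xi - mu[k]) / sigma[k]) ** 2) / sigma[k]
                 for k in range(K)]
            u, acc, pick = random.uniform(0.0, sum(w)), 0.0, K - 1
            for k, wk in enumerate(w):
                acc += wk
                if acc >= u:
                    pick = k
                    break
            z.append(pick)
        # 2. Sample each mean from its conjugate Normal full conditional,
        #    with prior mu_k ~ N(mu0, var0).
        for k in range(K):
            members = [xi for xi, zi in zip(x, z) if zi == k]
            prec = 1.0 / var0 + len(members) / sigma[k] ** 2
            mean = (mu0 / var0 + sum(members) / sigma[k] ** 2) / prec
            mu[k] = random.gauss(mean, math.sqrt(1.0 / prec))
    return mu

# Hypothetical scores and starting values for a three-component (C/B/A) mixture.
scores = [42, 48, 55, 58, 61, 63, 66, 70, 74, 79, 83, 88]
print(gibbs_mixture_means(scores, mu=[50, 65, 80], sigma=[5, 5, 5],
                          props=[1 / 3, 1 / 3, 1 / 3], iters=500))
```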
The practical significance of the results is judged by several selected
instructors using real raw-score data. Furthermore, the results are compared
between the conventional grading methods and the Bayesian Grading method.
Finally, Chapter V includes the conclusion and suggestion for further research on
grading methods.
CHAPTER 2
REVIEW OF GRADING PLAN AND GRADING METHODS
2.1  Introduction
Grading via mathematical models in education has been a hotly debated topic for
many decades. Prior to objective tests, marking and grading were usually synonymous,
and the infallibility of the educator's judgment was rarely questioned. A grade should
demonstrate a student's performance during the academic session and describe the achievement,
ability and aptitude of the student in a particular course, and what a student knows rather
than how well he or she has performed relative to a reference group. Walvoord and
Anderson (1998) believed that grading includes tailoring assignments to the learning
goals of the course, establishing criteria, helping students acquire the knowledge they
need, assessing student learning over time, raising student motivation, feeding back results so
students can learn from their mistakes, and using results to plan future teaching methods.
Nevertheless, the problem of using grades to describe student achievement has been
persistently troublesome at all levels of education [Ebel & Frisbie, 1991]. Grading is
frequently the subject of educational controversy because the grading process is
difficult, different philosophies call for different grading systems, and the task of grading
is sometimes unpleasant. Figlio and Lucas (2003) showed that instructors at
different levels are likely to assign different grade distributions to their classes.
However, the development of grading models is in place to draw our attention to
creating a structured educational grading system. There is no single, universally
agreed-upon grade assignment method. Various grading methods are applied in practice
in schools, colleges and universities.
Weighting grading components and combining them to obtain a final grade is the
most common grading practice. Grades are typically based on the scores of
graded components such as mid-term tests, quizzes, projects, assignments, studio projects
and the final examination. Educators often wish to weight some components more heavily
than others. For example, the quiz scores might be valued at the same weight as each of two or
three hour-exam grades. The variability of the scores (the standard deviation) is
the key to proper weighting. A practical solution to combining several weighted
components is first to transform raw scores to standardized scores, z or McCall T scores
[Robert, 1998; Ebel & Frisbie, 1991; Martuza, 1977; Merle, 1968]. The grading method
called "grading on the curve" or "grading on the normal curve" became popular
during the 1920s and 1930s. Grading on the curve is the simplest method: determine
in advance what percentage of the class will get A's (say the top 10% get an A),
what percentage B's, and so on [Stanley and Hopkins, 1972]. Even though it is
simple to carry out, it has a serious drawback: the fixed percentages are nearly always
determined arbitrarily. In addition, the use of the normal curve to model achievement in a
single classroom is generally inappropriate, except in large required courses at the college
or university level [Frisbie and Waltman, 1992]. Grading on the curve is efficient from the
educator's point of view; therein lies the only merit of the method.
A relative method called the Standard Deviation method implicitly assumes the data
come from a single population; it is the most complicated computationally, but it is also
the fairest in producing grades objectively. It uses the standard deviation, which tells, on
average, how much the n students differ from their class average. It is a number that
describes the dispersion, variability or spread of the scores around the average score. Unlike
grading on the curve, this method requires no fixed percentages in advance.
Alex (2003) studied a related method called the Multi-Curves Grading
method (MCG), which is built upon the Distribution-Gap grading method. Alex
assumed that the raw scores are realizations from a Normal Mixture, with each
component of the mixture corresponding to a different letter grade. Estimation
procedures are used to compute the probability that each student's raw score
corresponds to each possible letter grade.
In moving from scores to grades, educators can grade on an absolute grading
scale (say 90 to 100 is an A). Given that the students only care about their relative rank,
which kind of grading is better? Work by Pradeep and John (2005) has shown that if
the students are disparate rather than identical, then absolute grading is always better than grading
on a curve; that is, when the students are disparate, it is always
better to grade according to an absolute scale. Grading on a curve is defined by the
number nA of students getting A, the number nB getting B, and so on. The grades are
obtained by ranking the students' exam scores and assigning the top nA scores the grade A. If
k > nA students tie with the top score, then all must get A, the number of B's is
diminished by the excess A's, and so on.
In this study, we are interested in converting scores to grades. Three methods are considered:
Grading on Curves (GC), with some adjustment using the weighting procedure over several
raw scores to find the optimal grading scheme, and the Straight Scale method (SS), which are
then compared with the Bayesian Grading method (GB). Many researchers have studied the
conventional GC broadly, and here we make some modifications to GC that take into
account the statistical point of view. GB is the new approach introduced by
Alex (2003), which is the focus of this study; Alex named it the Multi-Curves
Grading method (MCG).
The selected Straight Scale method, Standard Deviation method and
conditional Bayesian Grading method are all norm-referenced, that is, normative
types of grading. Each method has its pros and cons. The Straight Scale method,
sometimes called the Fixed Scale method, has the benefit of being easy to calculate. It is also easy
for students to understand, is generally accepted, is applied consistently, and might reduce
competition between students. Unfortunately, the Straight Scale method has serious
drawbacks. The model ignores the raw scores, and the percent-score ranges for each
letter grade are fixed for all grading components. For example, the fact that 91% is
needed for an A places severe and unnecessary restrictions on instructors when they
develop each assessment tool. A fixed percentage is arbitrary and thus neither defensible
nor meaningful. Why should the cutoff for an A not be 85, 86 or 87 instead? Why
shouldn't the A cutoff be 80-100% for a certain test, 91-100% for another and 70-100%
for a certain simulation exercise? Is there any reason why the same numerical standard
must be applied to every grading component when those standards are arbitrary and lack
any absolute meaning? What sound rationale can be given for any particular cutoff?
Some instructors find themselves in a bind when the highest score obtained on an
exam is only 72%. Was the examination much too difficult? Did students study too
little? Was instruction relatively ineffective? Oftentimes, instructors decide to "adjust"
scores so that 72% is equated to 100%. In addition, this method can also allow all
students to receive the same grade and thus fail to provide the information needed to screen
students in competitive circumstances.
The purposes of this chapter are (a) to explain the grading philosophies and
grading policies encountered in educational psychology, (b) to define
measurement in general and (c) to briefly describe several conventional grading
methods for assigning grades. A more detailed description of GC and GB is given
in Chapter III.
2.2  Grading Philosophies
Grades reflect our personal philosophy and human psychology, as well as the effort
to measure people's intellectual progress with standardized, objective criteria. From the
educator's point of view, whatever one's personal philosophy about grades, their
importance to the students means that a constant effort must be made to be fair and
reasonable in maintaining grading standards. A philosophy of grading is broadly about how
educators grade their students and what they expect of the students' overall
performance. The educator should inform the students at the beginning of the semester
which criteria and method are employed in assigning grades. Lawrence (2005) has
outlined several grading philosophies as follows:
Philosophy I: Grades are indicators of relative knowledge and skill; that is, a
student’s performance can and should be compared to the performance of other
students in that course. The standard to be used for the grade is the mean or
average score of the class on a test, paper or project. The grade distribution could
be objectively set by determining the percentage of A’s, B’s, C’s, D’s and E’s
that will be awarded. Outliers (really high or really low) can be awarded grades
as seems fit.
Philosophy II: Grades are based on preset expectations or criteria. In theory,
every student in the course could get an A if each of them met the preset
expectations. The grades are usually expressed as the percentage of success
achieved (90% and above is an A, 80-90% is a B, 70-80% is a C, 60-70% is a D
and below 60% is an E); these letter-grade ranges are subject to change depending
on the institution's grading policy. Pluses and minuses can be worked into these
ranges.
Philosophy III: Students come into the course with an A, and it is theirs to lose
through poor performance or absence, late papers etc. With this philosophy the
teacher takes away points rather than adding them.
Philosophy IV: Grades are subjective assessments of how a student is
performing according to his or her potential. Students who plan to major in a
subject should be graded harder than a student just taking the course out of general
interest. Therefore, the standard set depends upon student variables and shouldn't
be set in stone.
The grading system is closely attached to the educator's own philosophy, since
educators come to know their students through teaching. In line with this, the factors that will
influence the educator's evaluation must be considered in advance. For example, some educators
weigh content more heavily than style. It has been suggested that lower (or higher)
evaluations should be used as a tool to motivate students. Some educators negotiate with
students about the 'methods' of evaluation, while others do not. Since personal preference
is so much a part of the grading and evaluation of students, a thoughtful examination
of one's own personal philosophy concerning these issues is very useful.
In assigning marks to a student through a mid-term test, project or
examination, that is, in transforming their performance into numbers and
letter grades, educators should know the procedure for measuring student
performance. In the next section, we discuss the definition of measurement and
related issues. This knowledge is of significant importance in developing an
educator's skill in assigning grades.
2.3  Definition and Designation of Measurement
The focus of measurement is to quantify a characteristic of a phenomenon.
Measurement is generally a process of collecting information. The results are
quantifiable in terms of mathematical symbols such as time, distance, amount or number
of tasks performed correctly. In grading, for example, numbers are assigned to permit
statistical analysis of the resulting data. The most important aspect of grading is the
specification of rules for transforming numbers into letter grades. The rules for assigning
numbers and letter grades must be standardized and applied uniformly.

Definition 2.1 [Martuza, (1977)]
A measurement is the process of assigning numerals to objects, events, or people
using a rule. Ideally the rule assigns the numerals to represent the amounts of a
specific attribute possessed by the objects, events, or people being measured.

From this definition, we define measurement precisely as the grading process of
assigning a raw score and a letter grade to a student. We therefore need the
following mathematical formalism to understand the issues in the grading process.
The illustration in Figure 1.1 shows the grades in mathematical terms:
measurement is a functional mapping from the set of objects {s_i} to
the set of real numbers of the raw scores {x_i}; the raw score set {x_i} consists of every
possible outcome in raw scores of the random test components. The objects are the
students themselves, denoted s_i, where s is shorthand for student and i refers to
the student being described. For simplicity, i may be taken as the ID of each student
and may be any integer (i, N ∈ ℕ) from 1 up to the finite number of students N.
The symbol x_i (x_i ∈ ℝ) denotes the standardized raw score corresponding to student i.

The method used to compute weighted and standardized raw scores will be explained
in the next sections. The students and standardized raw scores are ranked in descending
order: s_1 > s_2 > ⋅⋅⋅ > s_N and x_1 > x_2 > ⋅⋅⋅ > x_N. Our task is to convert the standardized raw
scores into fair, meaningful letter grades. In other words, the purpose of this study is to
define the probability set function of the raw scores. A probability set function of the raw
scores tells us how the probability is distributed over various subsets of raw scores in a
sample space G. The properties of the probability set function are defined in
Appendix B, Definition B1.
[Figure 1.1 maps the objects (students s_1, s_2, ..., s_N) to their weighted and
standardized raw scores x_1, x_2, ..., x_N, and then to the letter grades A, B, C, D, E.]

Figure 1.1: A Functional Mapping of Letter Grades
In addition, a measure of grades is a set function, which is an assignment of a
number µ(g) to each set g in a certain class [Ash, (1972)]. If G is a set whose points
correspond to the possible outcomes of a random experiment, certain subsets of G will
be called "events" and assigned a probability. Intuitively, g is an event if the question
"Does w (say 85) belong to g?" has a definite yes or no answer after the experiment is
performed and the outcome corresponds to the point 85 ∈ G.

We denote by G the sample space of grades g_1 = E, g_2 = D, g_3 = D+, ..., g_11 = A;
{g_L ∈ G}, where the subscript L = 1, 2, ..., 11 indexes the eleven letter-grade components
in the grading policy. We describe the eleven grade components as the set {A, A-,
B+, B, B-, C+, C, C-, D+, D, E}, which matches the set of grade point averages {4.0, 3.7,
3.3, 3.0, 2.7, 2.3, 2.0, 1.7, 1.3, 1.0, 0.0}.
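For concreteness, this grade sample space and its grade-point values can be written as a simple lookup table; the sketch below merely restates the eleven-grade policy above in code form.

```python
# Grade sample space G in ascending order g_1 = E, ..., g_11 = A,
# paired with the grade point averages listed in the text.
GRADE_POINTS = {
    "E": 0.0, "D": 1.0, "D+": 1.3, "C-": 1.7, "C": 2.0, "C+": 2.3,
    "B-": 2.7, "B": 3.0, "B+": 3.3, "A-": 3.7, "A": 4.0,
}
GRADES_ASCENDING = list(GRADE_POINTS)   # ["E", "D", ..., "A"]
```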
Guidelines for giving meaningful letter grades corresponding to the grade
point averages are explained by Frisbie and Waltman (1992). Figure 1.2 exhibits the
partition of the letter grades. The spaces between A and B, C, D and E show the borderline
gaps within grades, which add to the difficulty of deciding the exact maximum and
minimum score for each grade. The curves illustrate that the raw scores do not lie on a
straight line, which shows the variety of raw scores for different students within the same
letter grade.
[Figure 1.2 shows the raw-score axis partitioned into the letter grades ..., C, B, A, ...,
with gaps at the borderlines between adjacent grades.]

Figure 1.2: A Partition on Letter Grades
Before turning to the task of assigning letter grades, we begin by highlighting the levels
of measurement, also known as the scale types of measurement. Most references
address four hierarchical levels: Nominal, Ordinal, Interval and Ratio
measurement.
2.3.1 Levels of Measurement
The level of measurement is important to determine the type of statistical
analysis that can be conducted. The four possible levels of measurement are as follows
[Martuza, (1977)]:
A. Nominal
Nominal measurement is the simplest type of scale. Observations are
placed into a category according to a defined property. The numbers are used
for labeling purposes only, to classify the object or event into a category. The
categories are exhaustive (everyone should have a category to end up in) and
mutually exclusive (a person should end up in only one category). The appropriate
descriptive statistics are frequencies and percentages, and the appropriate inferential
statistics are non-parametric, such as the Chi-Square test. Examples of nominal data are social
security numbers or the numbers on the backs of football players. Another
example is grouping people into categories based upon sex (Male or Female):
the Female group might be assigned the number "0" and the Male group "1".
B. Ordinal
The numbers are used to show a relative ranking. The higher the number,
the higher in significance the event might be. The variables used consist of a
mutually exclusive and exhaustive set of orderable categories given as
follows:
a) 1= Highly Competent , 2= Competent, 3=Average, 4=Need
attention/improvement, 5= Poor
b) Rank ordering people in a classroom according to height and
assigning the shortest person the number "1", the next shortest person
the number "2" and so on.
However, ordinal measurement does not tell us how much greater one
level of an attribute is than another. For instance, we do not know whether being
completely independent is twice as good as needing mechanical assistance.
The appropriate descriptive statistics for ordinal data are percentiles
and rank orders, while the appropriate inferential statistics are non-parametric,
such as the Median test.
C. Interval
With interval measurement we know the interval between points. The numbers form a
mutually exclusive and exhaustive set of equally spaced, ordered categories. For example,
consider temperature readings: the Celsius temperatures in Kuala
Lumpur in three particular months were as follows: January 25-32°C; June
24-36°C; December 25-30°C. Each degree change of temperature reflects an
equal amount of difference. However, the zero point is arbitrary, so
we cannot conclude that a temperature of 30°C is twice as hot as a 15°C
temperature. The relevant statistical measures are relational statistics such as
correlations and, for inferential statistics, parametric methods: t tests, Analysis of
Variance (ANOVA), Analysis of Covariance (ANCOVA), Multivariate
Analysis of Variance (MANOVA), and regression.
D. Ratio
Ratio measurement describes interval data that have an absolute
zero. For example, suppose the test scores, in percentages, of three students are
as follows: Diana - 90%, Iskandar - 60% and Swee Lee - 30%. These
numerals indicate that Diana has a higher score than Iskandar and Swee Lee
do; or we can say Iskandar has a higher mark than Swee Lee and a lower one
than Diana. Does this mean that the ability difference between Swee Lee
and Iskandar is less than that between Iskandar and Diana? Because test scores
have no true zero, ratio statements like "Iskandar is twice as bright as Swee Lee"
are meaningless. In terms of appropriate statistical measures, no distinction is made
between interval and ratio data.
2.3.2 Norm-Referenced Versus Criterion-Referenced Measurement
The two basic types of grading are normative (comparative, or relative standard)
and criterion (mastery, or absolute standard). A relative comparison is being made if the
educator evaluates a student relative to the performance of others in the class or in a
well-defined norm group. We assume that the students in the class represent a distribution of
intelligence that will result in a similar distribution of learning. Descriptive statistics are
often associated with normative grading, and the terms "curving the scores" or "grading
on curves" are used. The well-known bell curve is centered on the mean (or median)
score, and the distribution is often indexed by the standard deviation of the scores. Most
standardized tests are interpreted in a normative fashion and hence are called Norm-Referenced.
On the other hand, whenever a decision about a student's status with respect to the content, or about his or
her achievement with respect to an explicit instructional objective, is made by comparing
his or her test score to some preset standard or criterion for success, the test score is said
to be given a Criterion-Referenced, as opposed to a norm-referenced, interpretation. More
details about these grading types can be found in Martuza (1977), Frisbie and Waltman
(1992) and Lawrence (2005). Table 2.1 briefly compares
Norm-Referenced and Criterion-Referenced grading, and Table 2.2 presents
descriptors of grade-level performance using rubrics. A more generic type
of indicator can take the form of a rubric, which is a descriptive scale with values
attached, as shown in the table. The rubric is an authentic assessment tool which is
particularly useful in assessing criteria which are complex and subjective. Designed
rubrics can be found in many books in the educational and psychological fields, such as
Stanley and Hopkins (1972), Martuza (1977), Ebel and Frisbie (1991) and Lawrence
(2005).
Table 2.1: Comparison of Norm-Referenced and Criterion-Referenced*

Norm-Referenced                               | Criterion-Referenced
----------------------------------------------|------------------------------------------------------
Compares the performance of individuals       | Compares the performance of individuals
against the reference group                   | against preset criteria
Spreads out the grade distribution            | Grades may be clustered at the high or low ends
Content dependent                             | Course objective dependent
Encourages competition                        | Encourages collaboration
Grades affected by outliers                   | Grades not affected by how other individuals perform
Does not motivate students to improve         | Can be used diagnostically to indicate strengths
                                              | and weaknesses

* Subject to change according to department/faculty grading policy or grading plan
Table 2.2: Rubrics for Descriptive Scale*

Grade                 Description
A (80-100 points)     Entire major and minor goals achieved
                      High level of skill development
                      Pluses for work submitted on time and carefully proofread
                      Minuses for work submitted late or resubmitted
                      Exceptional preparation for later learning
B (65-79 points)      Entire major and most minor goals achieved
                      Advanced development of most skills
                      Pluses for work submitted on time and carefully proofread
                      Minuses for work submitted late or resubmitted
                      Has prerequisites for later learning
C (50-64 points)      Most major and minor goals achieved
                      Demonstrates ability to use basic skills
                      Pluses for all work carefully proofread
                      Lacks a few prerequisites for later learning
D (40-49 points)      Sufficient goals achieved to warrant a passing effort
                      Some important skills not attained
                      Deficient in many of the prerequisites for later learning
E (0-39 points)       Few goals achieved
                      Most essential skills cannot be demonstrated
                      Lacks most prerequisites needed for later learning

* Subject to change according to department/faculty grading policy or grading plan
The decision to use either an absolute or a relative grading standard is the most
fundamental decision an educator must make with regard to performance assessment.
When the absolute standard is chosen, all methods and tools of evaluation must be
designed to yield content-referenced interpretations. An absolute standard for grading must
be established for each component which contributes to the course grade: tests, papers,
quizzes, presentations, projects and other assignments. If the decision is to use a relative
standard, all measures must be geared to provide norm-referenced interpretations, as
explained in this section. Obviously, in both cases criterion-referenced decisions need to
be made as long as several grading symbols are available [Ebel & Frisbie, 1991], which
is what the rubric descriptions above provide. Though a clear majority of institutions presently
use letter grading with relative standards, percent grading is by no means dead. The
percent grades are then converted to letter grades. Some instructors voice a preference
for absolute grading over relative grading for philosophical reasons but find the task of
establishing standards overbearing or, in some cases, too arbitrary. This type of decision
is very subjective and varies with the instructor's judgement.
2.4  Weighting Grading Components
Consider the problem an instructor faces in motivating students to study
for two tests, say when the instructor gives a midterm and a final. The problem is that if a
student does very well or very badly on the midterm, he (or his rivals) will feel less
incentive to work for the final if he or they are unable to change their rank. One way to
solve this problem is to weight the final more than the midterm, and to grade even more
coarsely on the midterm than on the final.
In the introduction to this chapter we looked at weighting as a grading
method. In this section, we follow Stanley and Hopkins (1972), Ebel and Frisbie
(1991), and Frisbie and Waltman (1992) for the weighting procedure. When educators
determine a course grade by combining grading components, just as when deciding which
components to use, each component carries more or less weight in determining the final
score. For example, the instructor may decide to give the final examination the highest weight
(say 3.0) compared with the mid-term test (say 1.5) and the project assignment (say 1.0), since
the final examination covers the entire course content of the semester, whereas the mid-term
test covers only weeks 1 to 8 of the semester. The component scores must be
pooled to form a composite score for the final summary. To obtain grades of maximum
validity, educators must give each component its proper weight, not too much nor too
little, according to how important each component score or grade is in describing
achievement and performance at the end of the grading period.
The standard deviation of its scores approximates the weight of
a component quite well. If one set of scores is twice as variable as another, the first set is
likely to carry about twice the weight of the second in the total. The weight of one component in
a composite depends on the variability of its scores. For example, Table 2.3
shows the scores on the Course Work components (Mid Term Test and
Assignment) and the Final Examination for three students (Diana, Iskandar and Swee Lee);
these are displayed in the first section of the table along with the total scores over the three components.
Each student made the highest score on one test, the middle score on a second and the lowest
on the third. Note, however, for future reference, that the ranks of their total scores over the three
tests are the same as their ranks on the Assignment (Test Z). The second section of the table
gives the total points for each grading component along
with descriptive statistics: the mean scores and the standard deviations of the scores on
the three tests. Test X has the highest number of total points, Test Y has the highest
mean score and Test Z has the scores with the greatest variability.
Table 2.3: Weighting the Test Scores

Student               Final Exam (X)   Mid Term (Y)   Assignment (Z)   Total
Diana                 55 (1)           79 (2)          8 (3)           142 (3)
Iskandar              50 (2)           75 (3)         25 (1)           150 (1)
Swee Lee              44 (3)           84 (1)         16 (2)           144 (2)

Test Characteristics:
Total Points          100.0            90.0           25.0             215.0
Mean Score            49.7             79.3           16.3             145.3
Standard Deviation    5.51             4.51           8.50             4.16

Weighted Score:       x1.54            x1.89          x1.00
Diana                 85               149             8               242
Iskandar              77               141            25               244
Swee Lee              68               158            16               242

* The number in parentheses after each score represents the rank
The question in weighting is: on which test was it most important to do well?
Which factors matter in assigning weights? On which test is the payoff for ranking first the
highest, and the penalty for ranking last the heaviest? Clearly Test Z, the test with the
greatest variability of scores. Which test ranked the students in the same order as their
final ranking, based on total scores? Again, Test Z. Thus, the influence of one
component on a composite depends not on total points or mean score but on score
variability [Ebel & Frisbie, 1991]. Furthermore, the importance of the test component, its
uniqueness with respect to the course objectives and the accuracy of the scores obtained from a
component are the keys to weighting the component [Frisbie and Waltman, 1992].
Now, if the instructor intends to give the components equal weight, this can be done by
scaling their scores to make the standard deviations equal, as shown in the last section
of Table 2.3. Since the Final Examination is the tougher test, its scores are multiplied by 1.54 to
change their standard deviation from 5.51 to 8.50, the same as on Test Z. With equal
standard deviations the tests carry equal weight, and a student with the same average rank
on the tests obtains approximately the same total score. Even if in practice instructors never
bother to do this, being aware of the principle should help in scoring various assignments
or projects that are subjectively evaluated. Moreover, when the whole possible range of
scores is used, score variability is closely related to the extent of the available score
scale. This means that scores on a 50-point mid-term test are likely to carry about five
times the weight of scores on a 10-point assignment project, provided that the scores extend
across the whole range in both cases.
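As a minimal sketch of this equal-weighting rule, using the Table 2.3 figures: each component is rescaled so that its standard deviation matches that of the most variable component (Test Z), and the rescaled scores are then summed into a composite.

```python
import statistics

# Raw component scores from Table 2.3 (Final Exam X, Mid Term Y, Assignment Z).
scores = {
    "Diana":    {"X": 55, "Y": 79, "Z": 8},
    "Iskandar": {"X": 50, "Y": 75, "Z": 25},
    "Swee Lee": {"X": 44, "Y": 84, "Z": 16},
}

components = ["X", "Y", "Z"]
sds = {c: statistics.stdev([s[c] for s in scores.values()]) for c in components}
target = max(sds.values())                          # SD of the most variable component (Z)
weights = {c: target / sds[c] for c in components}  # X -> about 1.54, Y -> about 1.89, Z -> 1.00

composites = {name: sum(weights[c] * s[c] for c in components)
              for name, s in scores.items()}
print(weights)
print(composites)   # Iskandar ends up with the highest weighted composite
```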
The most efficient means of proper weighting involves the computation of
standard scores, that is, T-scores, for each grading component. Each component is then
represented on a score scale that yields the same standard deviation (10 for T-scores) for
each measure. The details of T-scores are given in Chapter III.
CHAPTER 3
GRADING ON CURVES AND BAYESIAN GRADING
3.1  Introduction
As an introduction to grading, let ξ denote the assignment of all possible letter grades to the N
students in a particular course. There is a grading map

    ξ : x_i → G

which ranks students according to G(x) when the scores obtained are x_i ∈ ℝ. Each rank
corresponds to a grade. The map ξ depends precisely on the scores and not on the names,
effort, extra credit or personality of the students. A higher score implies a higher letter grade.
In this chapter we focus on two particular ways of generating ξ: Grading on
Curves and Bayesian Grading.
3.2  Grading On Curves
The best-known and simplest variety of relative grading standard is called
"grading on the curve" (GC). The 'curve' approximates the standard 'bell' shape,
usually referred to as the asymptotically normal distribution curve, or some symmetric
variant of it. Its use is based upon two assumptions: (a) the variable being rated is
normally distributed on a continuous scale and (b) the categories cover known intervals
on the continuum [Merle, 1968]. We assume that the students' ability and accomplishments
are normally distributed, as reflected by the raw-score distribution, which is often used to describe the
achievements of individuals in a large heterogeneous group. This method of assigning
grades based on group comparison is complicated by the need to establish
arbitrary quotas for each grade category. What percent should obtain A's, B's, C's, D's
or E's? Once these quotas are fixed, grades are assigned without regard to the actual level of
performance. Quota-setting strategies vary from instructor to instructor and from
department to department, and seldom carry a defensible rationale. While some
instructors defend the use of the normal or bell-shaped curve as an appropriate model for
setting quotas, using the normal curve is as arbitrary as using any other curve. In the
next sections, we have the following goals: (a) to formalize the treatment of
transformations, (b) to describe and illustrate some of the more commonly encountered
transformed score scales (z-score and T-score), and (c) to show the relationships that exist
between z-scores and T-scores.
3.2.1 Linearly Transformed Scores
A transformation is a rule (or set of rules) for converting scores from one scale
(the observed score, i.e. the raw score x) to a new set of scores (a standardized score, i.e. the
deviation score). Transformations can be classified as either linear or
nonlinear. A linear transformation converts the scores from one scale to another in such a
way that the shape of the distribution is not changed, whereas a nonlinear transformation
converts the scores from one scale to another in such a way that the shape of the
distribution is altered [Martuza, 1977]. One way to find out whether a transformation is
linear or not is to plot the raw scores against the transformed scores, as shown in Figure 3.1. If all
points in the plot fall exactly on a straight line, as in Figure 3.1(a), the
transformation is certainly linear; if the points do not fall on a straight line, as in Figure 3.1(b),
the transformation is nonlinear. Generally, a linear transformation preserves the
shape of a test-score distribution, whereas a nonlinear transformation always alters the
shape of the test-score distribution.
[Figure 3.1 plots transformed scores against raw scores: panel (a) shows points lying
exactly on a straight line (a linear transformation); panel (b) shows points that do not
lie on a straight line (a nonlinear transformation).]

Figure 3.1: Plot of the Raw Scores and Corresponding Transformed Scores
3.2.2 Model Set Up for Grading on Curves
Using group comparison for grading is appropriate when the class size is
sufficiently large (N ≥ 30) to provide a reference group representative of the students
typically enrolled in the course. We assume that the students' scores are independent of
each other, and we shall be concerned here with the most widely used statistical value:
the standard deviation.
3.2.3 Standard Deviation Method
When a distribution of raw scores is normal, or approximately so, standard scores
reveal a great deal of information. Standard scores are recommended because they allow
us to measure performance on each grading component with an identical, standard
yardstick. To transform a set of student raw scores to standard scores, we need only
estimate the mean and standard deviation of the set and divide the deviation
of each score from the mean by the standard deviation, i.e.,

    z = (x_i − µ) / σ                                                        (3.1)
Table A1 in Appendix A shows the probability for a range of z. In short, a standard
z-score indicates how many standard deviations the corresponding raw score of a particular
student is from the mean of the reference group. The mean and standard deviation of any
set of standard scores are 0 and 1, respectively. If an eleven-component letter-grade
scheme A, A-, B+, B, B-, C+, C, C-, D+, D and E with equal intervals is employed, and
if the practical limits of the z scale are considered to be -2.50 to +2.50 or -2.0 to +2.0,
each interval will extend over 0.4545 or 0.3636 standard deviation units respectively
[Merle, 1968]. The interval endpoints are the grade cutoff points, which form equal intervals on
the score scale. But the fact remains that the intervals for A's, B's or E's are set
arbitrarily, depending on the instructor's philosophy. Since the instructor
declares the standard deviation according to the final score performance, the percentages
of students getting A's, B's and the other letter grades are figured symmetrically. Under the
two assumptions in Section 3.2, the distribution of the eleven letter-grade components
follows the proportions shown in Figure 3.2. Furthermore, the transformation of
raw scores to standard scores may result in decimal numbers, some of them negative. In
order to eliminate decimals and negative signs, standard scores are frequently multiplied
by a constant and added to another constant. The general form of all such transformations is

    T = z σ_new + µ_new                                                      (3.2)
in which z, σ_new, µ_new and T represent the normalized z-score, the standard deviation,
the mean and the transformed score of the new normal distribution, respectively. A widely used
scheme is the one in which the standard scores are multiplied by 10 and added to 50
[Merle, 1968; Martuza, 1977; Spencer, 1983; Lawrence, 1995; Alex, 2003]. The
converted standard scores in this form are usually denoted T. When raw scores are
normally distributed, their T-score equivalents are identical to the well-known McCall T
scores, T = 10z + 50; related scales include the Stanine score, Stanine = 2z + 5, the standard
scores used to report the results of the Iowa Test of Educational Development (ITED),
ITED = 5z + 15, and the College Entrance Examination Board (CEEB) score,
CEEB = 100z + 500.
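As a brief sketch of the linear transformations in equations (3.1) and (3.2), the following computes z-scores and T-scores from a set of hypothetical raw scores; the class mean and standard deviation are estimated from the scores themselves, and the default constants give McCall T scores (mean 50, standard deviation 10).

```python
import statistics

def z_scores(raw):
    """Standardize raw scores: z = (x - mean) / sd, as in equation (3.1)."""
    mu = statistics.mean(raw)
    sigma = statistics.stdev(raw)
    return [(x - mu) / sigma for x in raw]

def t_scores(raw, new_mean=50.0, new_sd=10.0):
    """Linear rescaling T = z * sd_new + mean_new, as in equation (3.2)."""
    return [z * new_sd + new_mean for z in z_scores(raw)]

raw = [44, 50, 55, 61, 68, 72, 80]                   # hypothetical raw scores
print([round(t, 1) for t in t_scores(raw)])          # McCall T scores
print([round(t, 1) for t in t_scores(raw, 60, 10)])  # mean "set" at 60, as in the text below
```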
For example, consider raw scores with mean 60 and standard deviation 10.
This adjustment is made because we have stated that the average for students
should be 60 instead of the 50 of the McCall T scores, which means that more than half of the
students in a particular course should get a letter grade of C+ or above. C+ is
chosen instead of C- because we use eleven letter grades rather than the typical five.
Therefore fifty percent of the students will always have a score greater than the
mean (in this example, 60). See Table 3.1 and Figure 3.2. Under the Straight
Scale, if a raw score falls within a certain predetermined interval, the corresponding
letter grade is assigned. The method is easy to apply and is
frequently used in most university grading schemes. The intervals are created before any
actual raw scores are realized; in other words, this method does not even consider the
actual raw scores in making the cutoffs between letter grades. The Straight Scale also makes
no sense if the test that produced the raw scores is either too difficult or too easy, but
often it is impossible to know the difficulty of a test until after the raw scores are seen
[Alex, 2003].
The procedure for the standard deviation grading method is as follows:

i) Build a frequency distribution of the total scores by listing all obtainable scores and the number of students receiving each. Calculate the mean, median and standard deviation. Note that each grading variable should be weighted before the standard scores are combined.

ii) If the mean and median are similar in value, use the mean for further computations; otherwise use the median. If the median is chosen, add 0.1818 of the standard deviation to the median and subtract the same value from the median. These are the cutoff points for the range of C+'s.

iii) Add 0.3636 standard deviation to the upper cutoff of the C+'s to find the B- and B cutoff. Subtract the same value to find the C and C- cutoff. See Figure 3.2 for an illustration and Table 3.1 for the letter grade cutoffs when the median is taken to be 60.

In step (iii), if a different mean (or median) and standard deviation are used, ignore the third column of Table 3.1 and instead distribute the cutoff scores based on the estimated mean (or median) and standard deviation; the cutoff scores given in Appendix A, for example, correspond to different means and standard deviations. Further, review borderline cases by using the number of assignments completed, the quality of the assignments, or some other relevant achievement data to decide whether any borderline grades should be raised or lowered. A computational sketch of the cutoff construction is given below.
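A minimal sketch of steps (i)-(iii), assuming a hypothetical score vector and the 0.1818/0.3636 standard-deviation spacings quoted above; the grade labels follow the eleven-component scheme.

```python
import numpy as np

scores = np.array([38, 44, 51, 55, 58, 60, 61, 63, 67, 72, 76, 81, 88])  # assumed data
center = np.median(scores)            # use the median if it differs much from the mean
sd = scores.std(ddof=1)

# Cutoffs at center +/- 0.1818*sd, then steps of 0.3636*sd outward (Table 3.1 spacing)
offsets = np.array([-1.6362, -1.2726, -0.9090, -0.5454, -0.1818,
                     0.1818,  0.5454,  0.9090,  1.2726,  1.6362])
cutoffs = center + offsets * sd       # boundaries between the eleven grades
grades = ["E", "D", "D+", "C-", "C", "C+", "B-", "B", "B+", "A-", "A"]

for g, lo, hi in zip(grades, np.r_[-np.inf, cutoffs], np.r_[cutoffs, np.inf]):
    print(f"{g:2s}: {lo:7.2f} to {hi:7.2f}")
```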
[Figure 3.2: Relationship among Different Types of Transformation Scores in a Normal Distribution; μ = 60, σ = 10. The figure marks the grade cutoffs at z = ±0.18, ±0.54, ±0.91, ±1.27 and ±1.64 (T = 58.2 and 61.8, 54.6 and 65.5, 50.9 and 69.1, 47.2 and 72.7, 43.6 and 76.4), separating the letter grades E, D, D+, C-, C, C+, B-, B, B+, A- and A across the T-score range 0 to 100.]
Table 3.1: Grading on Curve Scales for the Scores between Which a Certain Letter Grade is Assigned, the Mean is "set" at C+

Letter   Straight Scale      Grading on Curve                     Percentage    Cumulative
Grade    From     To         From              To                 of Students   Percentage
A        85       100        μ + 1.6362σ       100                5.05%         5.05%
A-       80       84         μ + 1.2726σ       μ + 1.6362σ        5.15%         10.2%
B+       75       79         μ + 0.9090σ       μ + 1.2726σ        7.94%         18.14%
B        70       74         μ + 0.5454σ       μ + 0.9090σ        11%           29.12%
B-       65       69         μ + 0.1818σ       μ + 0.5454σ        13.7%         42.86%
C+       60       64         μ − 0.1818σ       μ + 0.1818σ        14.23%        57.07%
C        55       59         μ − 0.5454σ       μ − 0.1818σ        13.7%         70.3%
C-       50       54         μ − 0.9090σ       μ − 0.5454σ        11%           81.3%
D+       45       49         μ − 1.2726σ       μ − 0.9090σ        7.94%         89.24%
D        40       44         μ − 1.6362σ       μ − 1.2726σ        5.15%         94.39%
E        0        39         0                 μ − 1.6362σ        5.05%         99.44-100%
The advantage of this method is that it automatically adjusts the letter grades to the difficulty of the test that produced the raw scores. For instance, if a test is made more difficult, the mean (or median) of the raw scores decreases and the letter grade cutoffs shift down with it, so the grade distribution remains essentially the same. The mean or median could also be set to B- or B rather than C+. If the instructor has some notion of what the grade distribution should look like, some trial and error might be needed to decide how many standard deviations each grade cutoff should lie from the composite average. However, the method has serious drawbacks. The fixed standard deviation spacing for each letter grade cutoff is nearly always determined arbitrarily. In addition, the use of normal curves to model achievement in a single classroom is generally inappropriate, except in large required courses at the college or university level. Nevertheless, when a relative grading method is desired, the standard deviation method is attractive, despite its computational requirements.
3.3 Bayesian Grading

Bayesian grading is a contemporary method of assigning letter grades which was developed by Alex Strashny in 2003, who called it the Multi-Curve Grading (MCG) method. In this study, we refer to the method as Bayesian Grading (GB). In general, GB applies Bayesian inference through a Bayesian network to classify a class of students into several subgroups, each of which corresponds to a possible letter grade. The method is built on the Distribution-Gap grading method for finding the grade cutoffs [Ebel & Frisbie, 1991; Frisbie and Waltman, 1992; Alex, 2003].
3.3.1 Distribution-Gap

The Distribution-Gap method is another relative grading variation. It is formed by ranking the students' composite scores from high to low in the form of a frequency distribution. The frequency distribution is examined carefully for gaps, that is, short intervals in the consecutive score range within which no student scored. A horizontal line is drawn at the top of the first gap, which gives the cutoff for the A's, and a second gap is then sought. This process continues until all possible letter grade ranges (A-E) have been identified. The cutoffs are therefore assigned only after looking at the composite scores. For example, if the highest composite scores in a class were 241, 238, 235, 227, 226, ..., then the instructor might use the gap between 235 and 227 to separate the A and A- grades. The gap between 241 and 238 is too small and might produce too few A grades, whereas the gap between 235 and 227 is large enough, and 235 seems closer to 238 than to 227.

The major fallacy of this technique is its dependence on chance to form the gaps. The size and location of the gaps may depend as much on random measurement error as on actual achievement differences among students. If the scores from an equivalent set of measures could be obtained from the same group, the smaller gaps might appear in different places or the larger gaps might turn out to be somewhat smaller. For example, Farah's 227 might have been 233 (had there been less error in her score), and Johan's 235 might have been 230. Under those circumstances, the A-B gap would be less obvious, and many final grade decisions would have to be made by reviewing borderline cases. Measurement errors from different measures do not necessarily cancel each other out as they are expected to do on repeated measurement with the same instrument.
The major attraction of the distribution-gap method is that, when grades are assigned and the gaps are wide enough, few students appear to be right on the borderline of receiving a higher grade, which helps the instructor avoid disputes with students about near misses. Consequently, instructors receive fewer student complaints and fewer requests to re-examine or re-mark papers in search of the extra credit that would, for example, change a B+ grade to an A-. In effect, each group of scores separated by a gap is assigned a different letter grade. But when the gaps are narrow, too much emphasis is placed on borderline information that the instructor had already decided was not relevant or accurate enough to be included among the set of grading components that form the composite scores. Only occasionally will the distribution-gap method yield results comparable to those obtained with more dependable and defensible methods [Frisbie and Waltman, 1992]. In practice, the distribution-gap method is hard to apply objectively, since the gaps used to make the cutoffs are chosen subjectively.
3.3.2 Why Bayesian Inference?

In the Bayesian approach to statistics, an attempt is made to utilize all available information in order to reduce the amount of uncertainty present in an inferential or decision-making problem such as assigning grades. As new information is obtained, it is combined with any previous information (raw scores) to form the basis for statistical procedures. The formal mechanism used to combine the new information with previously available information is known as Bayes' Theorem [Robert, 1972]; this explains why the term "Bayesian" is used to describe this general approach to grading. It combines earlier understanding with currently measured data in a way that updates the degree of belief (subjective probability) of the instructors about their students' performance. The earlier understanding and experience is called the "prior belief", and the new belief that results from updating the prior belief is called the "posterior belief" [Press, 2003]. This inferential updating process is called Bayesian inference.
Prior belief
The belief or understanding held prior to observing the current data set, available either from an experiment or from other sources; for example, the average of the students' composite scores.

Posterior belief
The belief held after having observed the current data, and having examined those data in the light of how well they conform with the instructor's prior notions; for example, the revised average of the final scores and the corresponding interval.
Bayes’ theorem involves the use of probabilities, which is only natural, since
probability can be thought of as the mathematical language of uncertainty. At any given
point specifically in this study we say that at any given raw scores of students at the end
of instructional period, the instructors state of information about some uncertain score
value can be represented by a set of probabilities. When new information is obtained
which takes into account fairness and meaningful letter grades, these probabilities of
scores ‘maps’ the scores to letter grades are revised in order that they may represent all
of the available information. The principal approaches to inference guide modern data
analysis are frequentist, Bayesian and likelihoodist. We now describe each as following
[Carlin and Louis, 2000]:
Frequentist
Evaluates procedures by imagining repeated sampling from a particular model, which defines the probability distribution of the observed data conditional on unknown parameters. The properties of the procedures are studied for fixed values of the unknown parameters; a good procedure performs well over a broad range of parameter values. Frequentist procedures are also known as classical methods. For an example, see Section 3.2 on the Standard Deviation method, where the instructor fixes the z-scores corresponding to the letter grades.

Bayesian
Requires a sampling model and, in addition, a prior distribution on all unknown quantities (parameters) in the model. The prior and likelihood are used to compute the conditional distribution of the unknowns given the observed data (the posterior distribution), from which all statistical inferences arise. The Bayesian evaluates procedures for a repeated sampling experiment of unknowns drawn from the posterior distribution for a given data set.

Likelihoodist
The likelihoodist or "Fisherian" approach develops a sampling model but, like the frequentist, no prior. Inferences are restricted to procedures that use the data only as reported by the likelihood, as a Bayesian would.
Assigning grades through statistical (Bayesian) principles plays an important role in bringing mathematical expertise into grading policy formulation. In this study, the statistical analyses for grading are performed with the help of the software package WinBUGS. The details of the Bayesian network and the parameter estimation for assigning grades are developed in Chapter IV.
3.3.3 Preliminary View of Bayes’ Theorem
According to the Bayesian view, all quantities are of two kinds: (a) those known to the instructor making the inference and (b) those unknown; the former are described by their known values. Herein, we present Bayes' theorem compactly.

Consider a random variable X that has a probability distribution depending on a parameter θ, where θ is an element of a well-defined set Ω. If θ is the mean (or median) of the raw scores, Ω may be the real line ℜ. In the usual statistical model, with a random variable X having possible distributions indexed by a parameter θ, the raw score x is a realized value of the random variable and is known to the instructor, and the purpose is to make inferences concerning the unknown parameter. In the Bayesian approach, the instructor therefore wishes to calculate the probability distribution of θ given X = x. In order to make probability statements about θ given X, we must begin with a model providing a joint probability distribution for θ and X. The most basic Bayesian model consists of two stages: a likelihood specification $X\mid\theta \sim f(x\mid\theta)$ and a prior specification $\theta \sim \pi(\theta)$. Note that we have partitioned the letter grades into eleven components, each with an unknown parameter (mean or median). To simplify, we now introduce a random variable Θ that has a probability distribution over the set Ω and regard θ as a possible value of the random variable Θ. In addition, X and θ can be vectors of raw scores and of means (or medians) for each letter grade component, respectively.
In symbols, the above statements are written as

$$\Theta = \{\theta_1, \theta_2, \cdots, \theta_k\} \qquad (3.2)$$

$$X = \{x_1, x_2, \cdots, x_n\} \qquad (3.3)$$

where the raw score X is a vector of n observations whose probability distribution $f(x\mid\theta)$ depends on the values of $k$ unknown parameters Θ, so that the pdf of the vector X depends on Θ in a known way. We assume the prior π is known, and we are concerned with making inferences about the unknown θ. When θ is continuous, Bayes' theorem takes the form appropriate for continuous θ:

$$f\{\theta \mid x_1, x_2, \cdots, x_n\} = \frac{f\{x_1, x_2, \cdots, x_n \mid \theta\}\,\pi(\theta)}{\int f\{x_1, x_2, \cdots, x_n \mid \theta\}\,\pi(\theta)\, d\theta} \qquad (3.4)$$

where $f(\theta\mid\cdot)$ denotes the probability density (pdf) of the unknown θ subsequent to observing the raw scores $\{x_1, x_2, \cdots, x_n\}$ that bear on θ, $f(x_i\mid\cdot)$ denotes the likelihood function (joint conditional distribution) of the raw scores, and $\pi(\theta)$ denotes the probability density (pdf) of θ prior to observing any raw scores. In this form the theorem is still just a statement of conditional probability, as in Section 3.3.4. Equation 3.4 gives the posterior distribution of θ directly. We abbreviate Eq. 3.4 as

$$h\{\theta \mid x\} = \frac{L\{x \mid \theta\}\,\pi(\theta)}{m(x)} \qquad (3.5)$$

where $m(x) = \int f(x\mid\theta)\,\pi(\theta)\,d\theta$ is the marginal density of the raw scores x, $f(x\mid\theta)$ denotes the joint pdf of the data, $L\{x\mid\theta\}$ denotes the likelihood function of the raw scores and $h(\cdot)$ denotes the posterior pdf. Eq. 3.5 is a special case of Bayes' theorem in Eq. 3.4 [Carlin and Louis, 2000]. Given $\Theta = \theta$, the joint conditional pdf of X can also be written on the log scale as

$$\log L\{x\mid\theta\} = \log\left\{\prod_{i=1}^{n} f(x_i\mid\theta)\right\} = \log\Big\{ f\{x_1\mid\theta\}\, f\{x_2\mid\theta\} \cdots f\{x_n\mid\theta\} \Big\}.$$
3.3.4 Bayes’ Theorem
Bayes’ theorem is simply a statement of conditional probability. A general form
of Bayes’ theorem for events is defined as following details. Consider that A1 , A2 , ⋅⋅⋅, Ak
is any set of mutually exclusive and exhaustive events and event B and Aj are of special
interest [see Hogg and Craig, 1978 and Hogg et al. 2005 for proof]. Bayes’ theorem
provides a rule to find the conditional probability of Aj given B in terms of the
conditional probability of B given Aj . For these conversion, Bayes’ theorem some how
called a theorem about “inverse probability”. Press (2003) figured Bayes’ theorem for
events as shows in Eq.3.6 and by the law of total probability that:
$$P\{A_j \mid B\} = \frac{P\{B \mid A_j\}\, P\{A_j\}}{\sum_{i=1}^{k} P\{B \mid A_i\}\, P\{A_i\}} \qquad (3.6)$$

for $P\{B\} \neq 0$ (see Appendix B, Definition B1). The denominator of Bayes' theorem uses the fact that the $A_i$ are mutually exclusive and exhaustive: since $B = \bigcup_{i=1}^{k} (B \cap A_i)$, we have

$$P(B) = P\{B \cap A_1\} + P\{B \cap A_2\} + \cdots + P\{B \cap A_k\}$$

and, because $P\{B \cap A_i\} = P(A_i)\,P\{B \mid A_i\}$, it follows that

$$P(B) = P(A_1)P\{B \mid A_1\} + P(A_2)P\{B \mid A_2\} + \cdots + P(A_k)P\{B \mid A_k\} = \sum_{i=1}^{k} P(A_i)\, P\{B \mid A_i\}. \qquad (3.7)$$
The interpretation of $P\{A_j\}$ in Eq. 3.6 is personal, since it is our personal prior probability of event $A_j$; that is, it expresses the degree of belief about event $A_j$ prior to having any information about event B that may bear on $A_j$. Additionally, $P\{A_j \mid B\}$ denotes the posterior probability of event $A_j$, the degree of belief about event $A_j$ after having the information about B. From Eq. 3.7 we write the posterior probability as

$$P\{A_j \mid B\} = \frac{P\{B \cap A_j\}}{P\{B\}} = \frac{P\{A_j\}\, P\{B \mid A_j\}}{\sum_{i=1}^{k} P(A_i)\, P\{B \mid A_i\}} \qquad (3.8)$$

which is the well-known Bayes' theorem. Bayes' theorem provides a method of determining exactly what the probabilities are for mapping the raw scores to the assigned letter grades.
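As an illustrative sketch of Eq. (3.8) only, the snippet below updates hypothetical prior probabilities that a student belongs to each of three grade groups, given a likelihood for an observed score; all numbers are assumed for illustration and are not taken from the study's data.

```python
import numpy as np

# Hypothetical prior probabilities P(A_j) for three grade groups (assumed values)
prior = np.array([0.2, 0.5, 0.3])           # e.g. A, B, C
# Hypothetical likelihoods P(B | A_j) of the observed score under each group
likelihood = np.array([0.10, 0.60, 0.30])

posterior = prior * likelihood / np.sum(prior * likelihood)   # Eq. (3.8)
print(posterior)                             # posterior probabilities sum to one
```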
Prior probabilities and posterior probabilities each have their own meaning. Prior probabilities express the degree of belief the analyst holds prior to observing any data that may bear on the problem. For example, in policy analysis decisions are sometimes made without data, merely on the basis of informed judgment; in such cases prior probabilities are all one has, which accentuates the importance of formulating them very carefully. In science, business, medicine and engineering, inferences and decisions about unknown quantities are generally made by learning from previous understanding of the relevant theory and experience. If the sample size of the current data is large, the influence of the prior probabilities will usually disappear and the data will be left to 'speak for themselves'. However, if the sample size is small, the prior probabilities can weigh heavily against the small amount of observed data, and so can be extremely important. The posterior probabilities are the probabilities that result from applying Bayes' theorem, and the posterior probabilities of mutually exclusive and exhaustive events must sum to one for them to be bona fide (authentic) probabilities [Press, 2003]. In words, we can write Bayes' theorem as
$$\text{posterior probability} =
\begin{cases}
\dfrac{(\text{prior probability})(\text{likelihood})}{\sum (\text{prior probability})(\text{likelihood})} & \text{for the discrete case}\\[2ex]
\dfrac{(\text{prior probability})(\text{likelihood})}{\int (\text{prior probability})(\text{likelihood})} & \text{for the continuous case}
\end{cases}$$

Bayes' theorem also tells us that the probability for Θ posterior to the data X is proportional to the product of the distribution for Θ prior to the data and the likelihood for Θ given X [Box and Tiao, 1973]. That is,

$$\text{posterior probability} \propto \text{prior distribution} \times \text{likelihood}$$
(For a proof, see Appendix B, Example B1.) This form of the posterior distribution follows when the prior is a conjugate prior. Gelman et al. (1995) give the definition of a conjugate prior distribution: if F is a class of sampling distributions $f\{x\mid\theta\}$ and π is a class of prior distributions for θ, then the class π is conjugate for F if

$$f\{\theta\mid x\} \in \pi \quad \text{for all } f\{\cdot\mid\theta\} \in F \text{ and } f(\cdot) \in \pi.$$

The class of all densities is trivially conjugate no matter what class of sampling distributions is used; the case of most interest is taking π to be the set of all densities having the same functional form as the likelihood. Carlin and Louis (2000) explain that any distribution belonging to the exponential family has a conjugate prior. In addition, it is not really necessary to determine the marginal pdf in Eq. 3.4 and Eq. 3.5 to find the posterior pdf $f(\theta\mid\cdot)$ when the prior is conjugate. This is verified in Appendix B, Example B1.
In this study, we use Bayes' theorem to find the probability of assigning meaningful letter grades based on the students' raw scores. One way to interpret the theorem is that it provides a means for updating the instructor's degree of belief about the average raw score in the light of new raw score information that bears on that average. The updating takes place from the instructor's original degree of belief, P{average raw score for each letter grade}, to the instructor's updated belief, P{average raw score for each letter grade | raw scores}. The theorem might also be thought of as a rule for learning from the students' previous raw scores.
3.3.5 Model Set Up for Bayesian Grading
Letter grades are assigned according to the probabilities of particular raw scores. From the eleven letter grade components defined in Section 2.3, we can partition the sample space of letter grades into eleven parts. Each part, or component, has its own mean or median, so we have about eleven parameter sets in total; that is, the distribution of the scores is a weighted sum of several component distributions. In this study, we assume that the raw scores come from a Normal distribution, or more precisely from a Normal Mixture, since the scores are normally distributed within each component to which they tend to be assigned. A feature of this distribution is that if random variables are independent and normally distributed, then any linear combination of them is also normally distributed [Robert, 1972]. Figure 3.4 shows a Normal Mixture over the letter grades, where each of the components is clearly distinguished. The dotted line shows that the overall letter grade sample space is normally distributed with mean µ and variance σ². The eleven letter grade components are potentially present; however, this does not mean that all the letter grades will actually be assigned.
3.3.6 Bayesian Methods for Mixtures
Mixture models (finite mixture models) are typically used to model data where each observation is assumed to have arisen from one of k groups. Each group is suitably modeled by a density from some distribution family. The density of each group is referred to as a component of the mixture, and it is weighted by the relative frequency of the group in the population. This method is applied in GB, whereby raw scores may be clustered together into groups, each corresponding to a letter grade. A mixture model provides a convenient and flexible family of distributions for estimating or approximating distributions which are not well modeled by any standard parametric family, and it can be used as a parametric alternative to non-parametric density estimation. The advantages of the Bayesian approach to this model are that it can select the "best" model and that it provides a coherent way of combining results over different models [Stephens, 2000].
In this study, we consider a finite mixture model in which the raw score data $X = \{x_1, x_2, \cdots, x_n\}$ are assumed to be independent and identically distributed from a mixture distribution with G components. Eq. (3.9) is called the mixture density, in which the mixture proportions (component probabilities) are constrained to be non-negative and to sum to unity. Our interest is in the probability that a particular raw score belongs to a given component of the normal mixture. The raw scores are independently and identically distributed with density

$$p(x_i) = \sum_{g=1}^{G} \pi_g\, \phi\!\left(\mu_g, \sigma_g^2\right), \qquad i = 1, 2, \ldots, n \qquad (3.9)$$

where $x_i$ is the raw score of student i, g indexes the G = 11 components of the mixture, and $\pi_g$ is the component probability of component g, written collectively as $\pi = \{\pi_1, \pi_2, \cdots, \pi_G\}$ with $\sum_{g} \pi_g = 1$ (from Definition B1, Appendix B), that is, the total area under all the component curves must equal 1 and the $\pi_g$ cannot be negative [Robert, 1972]. Here $\phi(\cdot)$ is some parametric component density function (pdf), and $\mu_g$ and $\sigma_g^2$ are the mean and variance of component g, written as the vectors $\mu = \{\mu_1, \mu_2, \cdots, \mu_G\}$ and $\sigma^2 = \{\sigma_1^2, \sigma_2^2, \cdots, \sigma_G^2\}$. We may also denote

$$\theta_1 = \{\pi_1, \mu_1, \sigma_1^2\},\; \theta_2 = \{\pi_2, \mu_2, \sigma_2^2\},\; \cdots,\; \theta_G = \{\pi_G, \mu_G, \sigma_G^2\}$$

and therefore write the set of θ's compactly as $\Theta = \{\theta_1, \theta_2, \cdots, \theta_G\}$.
Eq. (3.9) is usually referred to as a mixture density with mixing probabilities $\pi_g$. This distribution is natural for assigning letter grades when the observed population of students is more realistically thought of as a combination of several distinct letter grade groups. Mixture models are also attractive because they can accommodate an arbitrarily large range of model anomalies, such as multiple modes and heavy tails within the letter grade intervals [Carlin and Louis, 2000]. Obviously, the more raw score observations there are, the more accurately the model is estimated; however, since the estimation procedure is Bayesian, the model can always be estimated no matter how small the data set is. A small numerical sketch of the mixture density is given below.
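The following sketch evaluates the mixture density of Eq. (3.9) for a hypothetical three-component normal mixture; the weights, means and variances are assumed values, not estimates from real grade data.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical mixture: weights, means and standard deviations (assumed values)
pi    = np.array([0.2, 0.5, 0.3])
mu    = np.array([45.0, 60.0, 78.0])
sigma = np.array([5.0, 6.0, 4.0])

def mixture_pdf(x):
    """Eq. (3.9): weighted sum of normal component densities."""
    x = np.atleast_1d(x)[:, None]                 # shape (n, 1) against (G,) parameters
    return np.sum(pi * norm.pdf(x, loc=mu, scale=sigma), axis=1)

print(mixture_pdf([50.0, 62.0, 80.0]))
```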
To illustrate the parameters involved in mixture models, we display a structural figure in which the basic observation units are grouped into larger units. This is shown in Figure 3.3, which displays the conditional independence relationships among the parameters involved. Such a structured illustration corresponds to Hierarchically Structured Data [Press, 2003], which relates naturally to the grade assigning problem. It is natural to model such a problem hierarchically, with the observed outcomes modeled conditionally on certain parameters, known as hyperparameters. Hierarchical models are generally characterized by expressing a marginal model P(y), where y represents the entire data vector, through a sequence of conditional models involving latent variables. Thus the mixture model for assigning letter grades is viewed hierarchically: the observed raw score random variables $x_i$ are modeled conditionally on the parameter vector Θ [Gelman et al., 1995]. Figure 3.3 can also be related to the Functional Mapping of Letter Grades (see Figure 1.1 in Section 2.3) for the definition of letter grades. The arrows point downwards to each variable from the conditioning variables (parents) of its prior model. Note that Level 3 displays the rank-ordered raw score observations, Level 2 the set of parameters θ for each letter grade, and Level 1 the overall mean and variance of the letter grade sample space.
[Figure 3.3: Hierarchical Representation of a Mixture. Level 1 shows the overall parameters (μ, σ²), Level 2 the component parameter sets Θ = {θ₁, θ₂, ···, θ_g}, and Level 3 the raw score observations x grouped by component.]
As shown in Figure 3.4, a very natural constraint is that the eleven letter grade components are ordered by their means. The mean for grade E is the lowest, grade D has a mean higher than E and lower than D+, and so on; grade A therefore has the highest rank, with a short interval belonging to the A's. That is,

$$\mu_1 < \mu_2 < \cdots < \mu_G$$

where G = 11 is the number of letter grade components. A principle of the distribution-gap method (see Section 3.3.1) applies here, since there are often gaps in students' scores. The idea behind distribution gaps is that if the component means are relatively far from each other, these gaps become visible. The cutoffs are then made at the gaps, which is equivalent to assigning a different letter grade to each component. Note that each letter grade component is associated with three parameters: the component probability, the component mean and the component variance, $\pi_g$, $\mu_g$ and $\sigma_g^2$ respectively. However, one component probability is determined once the others are known, so the model contains 3G − 1 free parameters [Alex, 2003]. Based on this model set-up for GB, we explain in detail in Chapter IV the estimation of the probability that each raw score corresponds to each letter grade. Instructor leniency factors are taken into consideration to counter over- or under-estimation when assigning letter grades.
[Figure 3.4: Normal Mixture Model Outlined on Each Letter Grade. The horizontal axis is the raw score with the letter grades E, D, D+, C-, C, C+, B-, B, B+, A-, A; separate normal curves distinguish each letter grade component, and the combined (mixture) distribution is overlaid.]
3.3.7 Mixture of Normal (Gaussian) Distribution
We have assumed that the observed raw scores are drawn independently and identically from a Normal distribution. In this subsection, we describe the univariate mixture of G Normal distributions. Recall the mixture model given by Eq. (3.9),

$$p(x_i) = \sum_{g=1}^{G} \pi_g\, N\!\left(\mu_g, \sigma_g^2\right), \qquad i = 1, 2, \ldots, n$$

where $N(\cdot)$ denotes the Normal density function, with probability density

$$N(x_i) = \frac{1}{\sqrt{2\pi}\,\sigma_g}\, e^{-\frac{1}{2}\left(\frac{x_i - \mu_g}{\sigma_g}\right)^2}.$$

Equivalently, for a single component we may write the likelihood function as

$$p\{x \mid \theta\} = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left\{\frac{-1}{2\sigma^2}(x_1 - \mu)^2\right\} \times \cdots \times \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left\{\frac{-1}{2\sigma^2}(x_n - \mu)^2\right\}
\;\propto\; \sigma^{-n}\exp\!\left\{\frac{-1}{2\sigma^2}\sum (x_i - \mu)^2\right\}$$

and, letting $s = \sum_{i=1}^{n}(x_i - \mu)^2$, we may write this Normal likelihood as

$$p\{x \mid \theta\} \propto \sigma^{-n}\exp\!\left\{\frac{-s}{2\sigma^2}\right\}.$$
Furthermore, by Eq. (3.9),

$$p(x_i) = \sum_{g=1}^{G} \pi_g \frac{1}{\sqrt{2\pi}\,\sigma_g} e^{-\frac{1}{2}\left(\frac{x_i-\mu_g}{\sigma_g}\right)^2}
= \pi_1 \frac{1}{\sqrt{2\pi}\,\sigma_1} e^{-\frac{1}{2}\left(\frac{x_i-\mu_1}{\sigma_1}\right)^2}
+ \pi_2 \frac{1}{\sqrt{2\pi}\,\sigma_2} e^{-\frac{1}{2}\left(\frac{x_i-\mu_2}{\sigma_2}\right)^2}
+ \cdots
+ \pi_G \frac{1}{\sqrt{2\pi}\,\sigma_G} e^{-\frac{1}{2}\left(\frac{x_i-\mu_G}{\sigma_G}\right)^2}$$

for $i = 1, 2, \ldots, n$. We have introduced $\pi_g$ as the component probability (mixture proportion) of component g; it may also be regarded as an indicator-type weight,

$$\pi_g = \begin{cases} \pi_g & \text{if the } i\text{th raw score is drawn from the } g\text{th mixture component}\\ 0 & \text{otherwise} \end{cases}$$

with $0 \leq \pi_g \leq 1$ and $\sum_{g=1}^{G}\pi_g = 1$. We can then say that, by Eq. (3.9), the probability that a particular raw score belongs to a component of the mixture is proportional to the ordinate of that component's density at the raw score. In other words, we may write Eq. (3.9) in the following form
$$p(x_i \in g) \propto \pi_g\, \phi\!\left(x_i \mid \mu_g, \sigma_g^2\right) \qquad (3.10)$$

Using Eq. (3.9) or Eq. (3.10), we compute the probability that a raw score belongs to each of the components. Note that our components correspond to the eleven letter grades; therefore, the probability that a raw score belongs to a particular letter grade is the same as the probability that it belongs to the corresponding component, as sketched below.
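A minimal sketch of Eq. (3.10): for each raw score, the normalized probability of membership in each component is proportional to the weighted component density. The mixture parameters reuse the assumed values from the previous sketch.

```python
import numpy as np
from scipy.stats import norm

pi    = np.array([0.2, 0.5, 0.3])        # assumed component probabilities
mu    = np.array([45.0, 60.0, 78.0])     # assumed component means
sigma = np.array([5.0, 6.0, 4.0])        # assumed component standard deviations

def membership(x):
    """Eq. (3.10): P(x_i belongs to component g), normalized over g."""
    w = pi * norm.pdf(np.atleast_1d(x)[:, None], loc=mu, scale=sigma)
    return w / w.sum(axis=1, keepdims=True)

print(np.round(membership([52.0, 61.0, 75.0]), 3))   # each row sums to one
```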
Let n be the total number of raw scores and let x be the n × 1 vector of raw scores. We have assumed that the raw scores are iid from the normal mixture, $x_i \mid \Theta \sim \sum_{g=1}^{G}\pi_g N(\mu_g, \sigma_g^2)$, so the likelihood corresponding to Eq. (3.9) can be expressed as

$$p\{x \mid G, \Theta\} = \prod_{i=1}^{n} \sum_{g=1}^{G} \pi_g\, N\!\left(\mu_g, \sigma_g^2\right) \qquad (3.11)$$

that is,

$$p\{x \mid G, \Theta\} = \prod_{i=1}^{n} \Big\{\pi_1 f(x_i; G, \theta_1) + \pi_2 f(x_i; G, \theta_2) + \cdots + \pi_G f(x_i; G, \theta_G)\Big\}
= \prod_{i=1}^{n}\left\{\pi_1 \frac{1}{\sqrt{2\pi}\,\sigma_1} e^{-\frac{1}{2}\left(\frac{x_i-\mu_1}{\sigma_1}\right)^2} + \pi_2 \frac{1}{\sqrt{2\pi}\,\sigma_2} e^{-\frac{1}{2}\left(\frac{x_i-\mu_2}{\sigma_2}\right)^2} + \cdots + \pi_G \frac{1}{\sqrt{2\pi}\,\sigma_G} e^{-\frac{1}{2}\left(\frac{x_i-\mu_G}{\sigma_G}\right)^2}\right\}$$
where $\Theta = \{\pi, \mu, \sigma^2\}$ consists of G × 1 parameter vectors. For the purpose of this study we set G equal to eleven and estimate the other parameters as shown in Chapter IV. The joint posterior distribution of these parameters can be expressed, as in Section 3.3.4, as

$$f\{\Theta \mid G, x\} \propto \text{likelihood} \times \text{prior distribution}$$

or

$$f\{\Theta \mid G, x\} \propto L\{x \mid G, \Theta\}\, h\{\Theta\}$$

which is equivalent to

$$f\{\pi, \mu, \sigma^2 \mid G, x\} \propto L\{x \mid G, \pi, \mu, \sigma^2\}\, h\{\pi, \mu, \sigma^2\}$$

where $h(\cdot)$ is the probability density of the prior on $\Theta = \{\pi, \mu, \sigma^2\}$. We choose conjugate priors, so that the posterior retains the same distributional family. These conjugate priors are taken to be Normal, Inverse-Gamma and Dirichlet distributed respectively, denoted $\mu \sim N(\nu_g, \delta_g^2)$, $\sigma^2 \sim IG(\alpha_g, \beta_g)$ and $\pi \sim Di(\eta)$. These are the distributions most statisticians choose for Normal observations.
3.3.8 Prior Distribution
Choosing the prior and its hyperparameters raises several issues. In some cases the instructors approximately know what the mean of each component should be. In many instances, however, h(·) is not known, which introduces the problem of personal or subjective probability; yet the choice of h(·) affects the posterior pdf [Hogg and Craig, 1978]. Instructors may also have no information at all, or may wish to ignore the information they have in order to check their results with an objective analysis.

Cornebise et al. (2005) give a rule for choosing the prior: we can use an empirical prior, whose hyperparameters are built from the data, or a noninformative prior, which carries no information at all. The latter is actually hard to achieve because a purely noninformative prior can be improper and cause trouble. Moreover, the prior is often chosen from a closed-under-sampling, or conjugate, family, such that conditioning on the sample only results in a change of the hyperparameters, not a change of family. For further reading on noninformative priors, we recommend Carlin and Louis (2000) and Box and Tiao (1973).
Here we derive the conjugate prior implementation for the posterior distribution. We employ the shorthand notation $f(x\mid\theta) = N(x\mid\mu, \sigma^2)$ to denote a Normal density with mean µ and variance σ². To justify the choices $\mu \sim N(\nu, \delta^2)$, $\sigma^2 \sim IG(\alpha, \beta)$ and $\pi \sim Di(\eta)$, we may also consider a noninformative prior for Θ and argue that all of the information in the posterior distribution is then generated from the data, so that the resulting inferences are objective rather than subjective. The main issue is how to select a prior which provides little information relative to what is expected to be provided by the intended observations.

Now, we consider the case of a single parameter, which can be applied directly to the hyperparameter problem. Suppose a random sample of n raw score observations comes from a Normal distribution $N(\mu, \sigma^2)$. The likelihood function of µ given the n independent raw score observations from the Normal population $N(\mu, \sigma^2)$ is

$$L(\mu \mid x, \sigma) = \prod_{i=1}^{n} f(x_i \mid \mu) = \prod_{i=1}^{n}\frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{1}{2}\left(\frac{x_i-\mu}{\sigma}\right)^2}
= \left[2\pi\sigma^2\right]^{-n/2}\exp\!\left\{-\frac{1}{2\sigma^2}\sum (x_i-\mu)^2\right\} \qquad (3.12)$$

Since $\sum (x_i-\mu)^2 = \sum (x_i-\bar{x})^2 + n(\mu-\bar{x})^2$, and given that $\sum (x_i-\bar{x})^2$ is a fixed constant and σ² is known, we simplify Eq. (3.12) as follows:

$$L(\mu \mid x, \sigma) = \left[2\pi\sigma^2\right]^{-n/2}\exp\!\left\{-\frac{1}{2\sigma^2}\Big[\sum (x_i-\bar{x})^2 + n(\mu-\bar{x})^2\Big]\right\}
\propto \exp\!\left\{-\frac{n}{2\sigma^2}(\mu-\bar{x})^2\right\} = \exp\!\left\{-\frac{1}{2}\left(\frac{\mu-\bar{x}}{\sigma/\sqrt{n}}\right)^2\right\} \qquad (3.13)$$
In words, the likelihood of µ is proportional to a Normal distribution centered at $\bar{x}$ with standard deviation $\sigma/\sqrt{n}$. Eq. (3.13) is said to be a data-translated likelihood, since the likelihood has the form $L(\theta\mid x) = g(\theta - t(x))$ and thus different data give rise to the same functional form for the likelihood [Lee, 1989]. This motivates taking $\mu_g \sim N(\nu_g, \delta_g^2)$, where $N(\nu_g, \delta_g^2)$ is the Normal distribution with mean $\nu_g$ and variance $\delta_g^2$ for each letter grade component, so that the conjugate prior of $\mu_g$ is again given by a Normal distribution. The likelihood contribution for µ (or $\mu_g$) is

$$p(\mu \mid x) \propto \exp\!\left\{-\frac{1}{2}\left(\frac{\mu - \bar{x}}{\sigma/\sqrt{n}}\right)^2\right\}$$

and the conjugate prior of $\mu_g$ is Normally distributed, $p(\mu_g) = N(\nu_g, \delta_g^2)$, with pdf

$$p(\mu_g) = \frac{1}{\sqrt{2\pi}\,\delta_g}\exp\!\left\{-\frac{1}{2}\left(\frac{\mu_g - \nu_g}{\delta_g}\right)^2\right\}.$$
Next, for the component variance we consider $L(\sigma \mid x, \mu)$, starting from the same expression as in Eq. (3.12) but with σ² now of interest, given the n raw score observations and known µ. Thus we have

$$L(\sigma \mid x, \mu) = \left[2\pi\sigma^2\right]^{-n/2}\exp\!\left\{-\frac{1}{2\sigma^2}\Big[\sum (x_i-\mu)^2\Big]\right\}
\propto \sigma^{-n}\exp\!\left\{-\frac{1}{2\sigma^2}\left[n\sum\frac{(x_i-\mu)^2}{n}\right]\right\} = \sigma^{-n}\exp\!\left\{-\frac{ns^2}{2\sigma^2}\right\} \qquad (3.14)$$
where $s^2 = \sum_{i=1}^{n}(x_i-\mu)^2 / n$. Box and Tiao (1973) have shown that the likelihood curves of Eq. (3.14) in the original metric σ are not data translated, so the noninformative prior should not be taken as locally uniform in σ. However, the corresponding likelihood curves in terms of log σ are exactly data translated, and the noninformative prior should therefore be locally uniform in log σ.

In principle, we might take any form of prior distribution for the variance σ². Following the argument above, we rewrite Eq. (3.14) as a log-likelihood,

$$\log L(\sigma \mid x, \mu) = \log\!\left\{\left[2\pi\sigma^2\right]^{-n/2}\exp\!\left\{-\frac{ns^2}{2\sigma^2}\right\}\right\} = -n\log\sigma - \frac{n}{2}\log 2\pi - \frac{ns^2}{2\sigma^2} \propto -n\log\sigma - \frac{ns^2}{2\sigma^2}$$

and, multiplying the likelihood by $s^n$ (multiplication by a constant leaves the likelihood unchanged), we obtain

$$\log L(\sigma \mid x, \mu) = \log\!\left\{\left[2\pi\sigma^2\right]^{-n/2} s^n \exp\!\left\{-\frac{ns^2}{2\sigma^2}\right\}\right\} \propto -n\log\sigma + n\log s - \frac{ns^2}{2\sigma^2} \qquad (3.15)$$

which depends on σ and the data only through $\log(s/\sigma)$, confirming that the likelihood is data translated in log σ. A prior that is locally uniform in log σ corresponds, on the σ scale, to

$$p(\sigma) \propto \left|\frac{d\log\sigma}{d\sigma}\right| = \frac{1}{\sigma} \qquad (3.16)$$
which, changing variable from σ to σ², gives the prior distribution of σ² as

$$p(\sigma^2) \propto \frac{1}{\sigma^2} \qquad (3.17)$$

Using Eq. (3.17) together with the likelihood in Eq. (3.15) and finding the appropriate normalizing constant, we obtain

$$p(\sigma^2 \mid x) = \frac{1}{\Gamma(n/2)}\left(\frac{ns^2}{2}\right)^{n/2}\left[\sigma^2\right]^{-[(n/2)+1]}\exp\!\left(-\frac{ns^2}{2\sigma^2}\right)
= k\left[\sigma^2\right]^{-[(n/2)+1]}\exp\!\left(-\frac{ns^2}{2\sigma^2}\right) \qquad (3.18)$$

where $k = \dfrac{(ns^2/2)^{n/2}}{\Gamma(n/2)}$ is the normalizing constant required to make the distribution integrate to unity. To verify this, let $\alpha = n/2$ and $\beta = ns^2/2$; the right-hand side of Eq. (3.18) becomes

$$\text{RHS} = \frac{(ns^2/2)^{n/2}}{\Gamma(n/2)}\left[\sigma^2\right]^{-[(n/2)+1]}\exp\!\left(-\frac{ns^2}{2\sigma^2}\right)
= \frac{\beta^{\alpha}}{\Gamma(\alpha)}\left[\sigma^2\right]^{-[\alpha+1]}\exp\!\left(-\frac{\beta}{\sigma^2}\right)$$
Applying the integral formula in Appendix D (formula (ii)), we integrate the RHS to obtain

$$\int_0^{\infty}\frac{\beta^{\alpha}}{\Gamma(\alpha)}\left[\sigma^2\right]^{-[\alpha+1]}\exp\!\left(-\frac{\beta}{\sigma^2}\right)d\sigma^2
= \frac{\beta^{\alpha}}{\Gamma(\alpha)}\int_0^{\infty}\left[\sigma^2\right]^{-[\alpha+1]}\exp\!\left(-\frac{\beta}{\sigma^2}\right)d\sigma^2
= \frac{\beta^{\alpha}}{\Gamma(\alpha)}\,\Gamma(\alpha)\,\beta^{-\alpha} = 1. \;\blacksquare$$

Therefore, the conjugate prior of σ² is of Inverse-Gamma type, with pdf

$$p(\sigma^2) = \frac{\beta^{\alpha}}{\Gamma(\alpha)}\left[\sigma^2\right]^{-[\alpha+1]}\exp\!\left(-\frac{\beta}{\sigma^2}\right).$$

Carlin and Louis (2000) state that if $X \sim IG(\alpha, \beta)$ then $E[X] = 1/\big[\beta(\alpha-1)\big]$ and $Var[X] = 1/\big[\beta^2(\alpha-1)^2(\alpha-2)\big]$, provided α > 1 and α > 2 respectively. (Note that Carlin and Louis parametrize the Inverse-Gamma so that the scale parameter enters the density as exp{−1/(βσ²)}; their β is therefore the reciprocal of the β appearing in the pdf above.) They also call the Inverse-Gamma the reciprocal gamma, since X = 1/Y where $Y \sim G(\alpha, \beta)$. The Inverse-Gamma is very commonly used in Bayesian statistics as the conjugate prior for a variance parameter σ² arising in a normal likelihood. Choosing α and β appropriately for such a prior can be aided by solving the IG moments for $\mu \equiv E[X]$ and $\sigma^2 \equiv Var[X]$. The result is

$$\alpha = \left(\frac{\mu}{\sigma}\right)^2 + 2 \qquad \text{and} \qquad \beta = \frac{1}{\mu\left[(\mu/\sigma)^2 + 1\right]}$$

Setting the prior mean and standard deviation both equal to µ (i.e. $\mu/\sigma = 1$) thus produces α = 3 and $\beta = \dfrac{1}{2\mu}$.
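As a small check of the moment-matching rule above (a sketch only; the target prior mean and standard deviation are assumed values), the snippet computes α and β and verifies the implied Inverse-Gamma mean under the Carlin and Louis parametrization.

```python
# Hypothetical prior guess: mean and standard deviation of sigma^2 (assumed values)
prior_mean, prior_sd = 25.0, 25.0        # equal mean and sd => alpha = 3, beta = 1/(2*mean)

alpha = (prior_mean / prior_sd) ** 2 + 2
beta = 1.0 / (prior_mean * ((prior_mean / prior_sd) ** 2 + 1))

implied_mean = 1.0 / (beta * (alpha - 1))    # E[X] = 1 / (beta*(alpha-1))
print(alpha, beta, implied_mean)             # 3.0, 0.02, 25.0
```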
Finally, in this section we apply the most widely used prior distribution for the component probabilities, the Dirichlet distribution with parameter η. This distribution is a distribution over the set of all probability vectors. Typically, the vector π is assigned a Dirichlet prior with elements $\{\eta_1, \eta_2, \cdots, \eta_G\}$, with the typical choice $\eta_1 = \eta_2 = \cdots = \eta_G = 1$. To make all components a priori equally likely, all the elements of η must be equal; we therefore take π to be symmetric Dirichlet distributed, with pdf

$$p(\pi \mid \eta) = \frac{\Gamma(\eta_0)}{\prod_{g=1}^{G}\Gamma(\eta_g)}\prod_{g=1}^{G}\pi_g^{\eta_g - 1} \qquad (3.19)$$

The Dirichlet is the multivariate generalization of the beta; for G = 2, $Di(\eta) = Beta(\eta_1, \eta_2)$. If $X \sim Di(\eta)$, with $X = \{x_1, x_2, \ldots, x_G\}'$, $0 \leq x_g \leq 1$ and $\sum_{g=1}^{G} x_g = 1$, and $\eta = (\eta_1, \eta_2, \ldots, \eta_G)'$ with $\eta_g \geq 0$, then $E[X_g] = \eta_g/\eta_0$,

$$Var[X_g] = \frac{\eta_g(\eta_0 - \eta_g)}{\eta_0^2(\eta_0 + 1)} \qquad \text{and} \qquad Cov(X_g, X_h) = \frac{-\eta_g \eta_h}{\eta_0^2(\eta_0 + 1)}$$

where $\eta_0 = \sum_{g=1}^{G}\eta_g$. We do not describe the Dirichlet distribution further here; more information can be found in Carlin and Louis (2000) pp. 51 and 327, Congdon (2003) p. 58, and Gelman et al. (1995) pp. 79 and 476.
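To visualize the symmetric Dirichlet prior of Eq. (3.19), the sketch below draws a few component-probability vectors for G = 11 components with all η_g = 1; it is an illustration only, not part of the estimation procedure.

```python
import numpy as np

rng = np.random.default_rng(0)
G = 11                         # eleven letter grade components
eta = np.ones(G)               # symmetric Dirichlet: all components a priori equally likely

pi_draws = rng.dirichlet(eta, size=3)   # each row is a probability vector summing to one
print(np.round(pi_draws, 3))
print(pi_draws.sum(axis=1))             # [1. 1. 1.]
```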
3.3.9 Posterior Distribution
In Section 3.3.7 we wrote the posterior distribution as proportional to the product of the likelihood and the prior distribution, that is,

$$f\{\pi, \mu, \sigma^2 \mid G, x\} \propto L\{x \mid G, \pi, \mu, \sigma^2\}\, h\{\pi, \mu, \sigma^2\} \qquad (3.20)$$
Therefore, to find the posterior of the component means we substitute the prior distribution of $\mu_g$ and the likelihood function of the raw scores into Eq. (3.20). We have shown that $p(\mu \mid x) \propto \exp\{-\tfrac{1}{2}((\mu-\bar{x})/(\sigma/\sqrt{n}))^2\}$. Suppose a priori the parameter µ is distributed as

$$p(\mu) = \frac{1}{\sqrt{2\pi}\,\delta_g}\exp\!\left\{-\frac{1}{2}\left(\frac{\mu - \nu_g}{\delta_g}\right)^2\right\}$$

and the likelihood of µ given the $n_g$ raw score observations in component g is proportional to a Normal density,

$$L(\mu \mid x) \propto \exp\!\left\{-\frac{1}{2}\left(\frac{\mu - \bar{x}}{\sigma_g/\sqrt{n_g}}\right)^2\right\}, \qquad -\infty < \mu < \infty,$$

where $\bar{x} = \dfrac{\sum_{x_i \in g} x_i}{n_g}$ is a function of the observations x. Then, by Bayes' theorem, the posterior distribution of µ given the raw score data is

$$p(\mu \mid x) = \frac{p(\mu)\,L(\mu \mid x)}{\int p(\mu)\,L(\mu \mid x)\,d\mu} = \frac{f(\mu \mid x)}{\int f(\mu \mid x)\,d\mu}$$

where

$$f(\mu \mid x) = p(\mu)\,L(\mu \mid x) \propto
\exp\!\left\{-\frac{1}{2}\left[\left(\frac{\mu - \nu_g}{\delta_g}\right)^2 + \left(\frac{\mu - \bar{x}}{\sigma_g/\sqrt{n_g}}\right)^2\right]\right\} \qquad (3.21)$$

Using the identity [Box and Tiao, 1973]

$$A(z-a)^2 + B(z-b)^2 = (A+B)(z-c)^2 + \frac{AB}{A+B}(a-b)^2, \qquad c = \frac{1}{A+B}(Aa + Bb),$$

we write the terms in the exponential as

$$\left(\frac{\mu - \nu_g}{\delta_g}\right)^2 + \left(\frac{\mu - \bar{x}}{\sigma_g/\sqrt{n_g}}\right)^2 = \frac{1}{\delta_g^2}(\mu - \nu_g)^2 + \frac{1}{\sigma_g^2/n_g}(\mu - \bar{x})^2$$

so that, with $A = \dfrac{1}{\delta_g^2}$, $B = \dfrac{1}{\sigma_g^2/n_g}$, $z = \mu$, $a = \nu_g$ and $b = \bar{x} = \dfrac{\sum_{x_i\in g} x_i}{n_g}$,

$$c = \upsilon_g = \frac{1}{\dfrac{1}{\delta_g^2} + \dfrac{1}{\sigma_g^2/n_g}}\left(\frac{\nu_g}{\delta_g^2} + \frac{\bar{x}}{\sigma_g^2/n_g}\right)
= \left(\frac{1}{\delta_g^2} + \frac{n_g}{\sigma_g^2}\right)^{-1}\left(\frac{\nu_g}{\delta_g^2} + \frac{\sum_{x_i\in g} x_i}{\sigma_g^2}\right) \qquad (3.22)$$

Therefore

$$\left(\frac{\mu - \nu_g}{\delta_g}\right)^2 + \left(\frac{\mu - \bar{x}}{\sigma_g/\sqrt{n_g}}\right)^2 = \left(\frac{1}{\delta_g^2} + \frac{n_g}{\sigma_g^2}\right)(\mu - \upsilon_g)^2 + d$$

where d is a constant independent of µ. Thus,

$$f(\mu \mid x) = \exp\!\left\{-\frac{1}{2}\left(\frac{1}{\delta_g^2} + \frac{n_g}{\sigma_g^2}\right)(\mu - \upsilon_g)^2\right\}\exp\!\left(-\frac{d}{2}\right)$$

so that

$$\int_{-\infty}^{\infty} f(\mu \mid x)\,d\mu = \exp\!\left(-\frac{d}{2}\right)\int_{-\infty}^{\infty}\exp\!\left\{-\frac{1}{2}\left(\frac{1}{\delta_g^2} + \frac{n_g}{\sigma_g^2}\right)(\mu - \upsilon_g)^2\right\}d\mu
= \sqrt{2\pi}\left(\frac{1}{\delta_g^2} + \frac{n_g}{\sigma_g^2}\right)^{-1/2}\exp\!\left(-\frac{d}{2}\right).$$

Substituting the results for $f(\mu \mid x)$ and $\int f(\mu \mid x)\,d\mu$, we obtain

$$p(\mu \mid x) = \frac{f(\mu \mid x)}{\int f(\mu \mid x)\,d\mu}
= \frac{\left(\dfrac{1}{\delta_g^2} + \dfrac{n_g}{\sigma_g^2}\right)^{1/2}}{\sqrt{2\pi}}\exp\!\left\{-\frac{1}{2}\left(\frac{1}{\delta_g^2} + \frac{n_g}{\sigma_g^2}\right)(\mu - \upsilon_g)^2\right\}$$

with $\upsilon_g$ as given in Eq. (3.22). Defining

$$V_g = \left[\frac{1}{\delta_g^2} + \frac{n_g}{\sigma_g^2}\right]^{-1} \qquad \text{and} \qquad M_g = \frac{\nu_g}{\delta_g^2} + \frac{\sum_{x_i\in g} x_i}{\sigma_g^2},$$

so that $\upsilon_g = V_g M_g$, we can write the posterior compactly as

$$p\left(\mu_g \mid x, \sigma_g\right) = \frac{1}{\sqrt{2\pi V_g}}\exp\!\left\{-\frac{1}{2}\,\frac{(\mu - V_g M_g)^2}{V_g}\right\} \qquad (3.23)\;\blacksquare$$

Applying this result, the full conditional (posterior) distribution of $\mu_g$ is $\mu_g \mid \cdots \sim N\!\left(V_g M_g,\; V_g\right)$.
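A brief sketch of the full conditional in Eq. (3.23): given hypothetical prior hyperparameters and the scores currently allocated to one grade component, compute V_g and M_g and draw μ_g. All numbers are assumed for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

x_g = np.array([58.0, 61.0, 63.0, 59.0])   # scores currently allocated to component g (assumed)
nu_g, delta2_g = 60.0, 100.0               # prior mean and variance of mu_g (assumed)
sigma2_g = 25.0                            # current value of the component variance (assumed)
n_g = len(x_g)

V_g = 1.0 / (1.0 / delta2_g + n_g / sigma2_g)          # posterior variance, Eq. (3.22)
M_g = nu_g / delta2_g + x_g.sum() / sigma2_g           # precision-weighted information
mu_g = rng.normal(loc=V_g * M_g, scale=np.sqrt(V_g))   # draw from N(V_g M_g, V_g), Eq. (3.23)
print(V_g * M_g, mu_g)
```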
Now we want to find the posterior distribution of $\sigma_g^2$. Suppose a priori the parameter $\sigma_g^2$ is Inverse-Gamma distributed, which in the Carlin and Louis (2000) parametrization of Section 3.3.8 has density

$$p(\sigma_g^2) = \frac{1}{\Gamma(\alpha_g)\,\beta_g^{\alpha_g}}\left[\sigma_g^2\right]^{-[\alpha_g+1]}\exp\!\left(-\frac{1}{\beta_g\,\sigma_g^2}\right).$$

Using the conditions of Section 3.3.7 we have

$$p(\sigma_g^2 \mid x) \propto p(\sigma_g^2)\, L(\sigma_g^2 \mid x)
\propto \left[\sigma_g^2\right]^{-[\alpha_g+1]}\sigma_g^{-n_g}\exp\!\left\{-\left[\frac{1}{\beta_g\sigma_g^2} + \frac{1}{2}\sum_{x_i\in g}\left(\frac{x_i-\mu_g}{\sigma_g}\right)^2\right]\right\}$$

which simplifies to

$$p(\sigma_g^2 \mid x) \propto \left[\sigma_g^2\right]^{-\left[\alpha_g + (n_g/2) + 1\right]}\exp\!\left\{-\frac{\beta_g^{-1} + \tfrac{1}{2}\sum_{x_i\in g}(x_i-\mu_g)^2}{\sigma_g^2}\right\}.$$

Using the appropriate normalizing constant so that the distribution integrates to unity, we integrate the right-hand side as before:

$$\int_0^{\infty}\frac{\left[\beta_g^{-1} + \tfrac{1}{2}\sum_{x_i\in g}(x_i-\mu_g)^2\right]^{\alpha_g+(n_g/2)}}{\Gamma\!\left(\alpha_g+(n_g/2)\right)}\left[\sigma_g^2\right]^{-\left[\alpha_g+(n_g/2)+1\right]}\exp\!\left\{-\frac{\beta_g^{-1} + \tfrac{1}{2}\sum_{x_i\in g}(x_i-\mu_g)^2}{\sigma_g^2}\right\}d\sigma_g^2 = 1.$$

Therefore the conditional (posterior) distribution of $\sigma_g^2$ is

$$p\left(\sigma_g^2 \mid x, \mu_g\right) = \frac{\left[\beta_g^{-1} + \tfrac{1}{2}\sum_{x_i\in g}(x_i-\mu_g)^2\right]^{\alpha_g+(n_g/2)}}{\Gamma\!\left(\alpha_g+(n_g/2)\right)}\left[\sigma_g^2\right]^{-\left[\alpha_g+(n_g/2)+1\right]}\exp\!\left\{-\frac{\beta_g^{-1} + \tfrac{1}{2}\sum_{x_i\in g}(x_i-\mu_g)^2}{\sigma_g^2}\right\} \qquad (3.24)\;\blacksquare$$

that is,

$$1/\sigma_g^2 \mid \cdots \sim G\!\left(\alpha_g + n_g/2,\; \left[\beta_g^{-1} + \tfrac{1}{2}\sum_{x_i\in g}(x_i-\mu_g)^2\right]^{-1}\right)$$

or, equivalently,

$$\sigma_g^2 \mid \cdots \sim IG\!\left(\alpha_g + n_g/2,\; \left[\beta_g^{-1} + \tfrac{1}{2}\sum_{x_i\in g}(x_i-\mu_g)^2\right]^{-1}\right).$$
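A small sketch of drawing from the full conditional of σ_g² via its Gamma form (1/σ_g² is Gamma distributed); the hyperparameters and allocated scores reuse the assumed values of the previous sketch.

```python
import numpy as np

rng = np.random.default_rng(2)

x_g = np.array([58.0, 61.0, 63.0, 59.0])   # scores allocated to component g (assumed)
mu_g = 60.0                                # current value of the component mean (assumed)
alpha_g, beta_g = 3.0, 1.0 / (2 * 25.0)    # prior hyperparameters from the elicitation above

shape = alpha_g + len(x_g) / 2.0
rate = 1.0 / beta_g + 0.5 * np.sum((x_g - mu_g) ** 2)

precision = rng.gamma(shape, scale=1.0 / rate)   # draw 1/sigma_g^2 ~ Gamma(shape, scale)
sigma2_g = 1.0 / precision
print(sigma2_g)
```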
3.4 Interval Estimation

Credible or probability intervals (or credible sets) are usually used in Bayesian work, so as not to confuse them with confidence intervals [Carlin and Rubin, 2000; Press, 2003; Hogg et al., 2005]. Suppose the posterior distribution for the parameter θ is given by $F(\theta \mid x_1, x_2, \cdots, x_n) \equiv F(\theta \mid \text{raw scores})$. Then, for some preassigned α, we can find an interval (a, b):

Definition 3.1 [Carlin and Rubin, 2000]
A 100(1 − α)% credible set for Θ is a subset C = (a, b) of Θ such that

$$1 - \alpha \leq P\{a < \theta < b \mid x\} =
\begin{cases}
\displaystyle\int_C p(\theta \mid x)\, d\theta = F(b) - F(a) & \text{continuous case}\\[1.5ex]
\displaystyle\sum_C p(\theta \mid x) & \text{discrete case}
\end{cases} \qquad (3.25)$$

Definition 3.1 enables a direct probability statement about the likelihood of θ falling in C: the probability that θ lies in C given the observed data x is at least (1 − α). The interval (a, b) is called a credibility interval for θ at credibility level α. As with confidence intervals, if we choose α = 0.05 we refer to such an interval as a 95 percent credibility interval. Press (2003) refers to (a, b) as a 95 percent Bayesian confidence interval, to distinguish it from a frequentist confidence interval. We use "≤" instead of "=" in Definition 3.1 in order to accommodate the discrete setting, where obtaining an interval with coverage probability exactly (1 − α) may not be possible. In the continuous case, statisticians often use "=" to simplify the definition; the meaning is similar, and intervals with exactly the right coverage are preferred in order to minimize the size of the interval and thus obtain a more precise estimate.
More generally, the Highest Posterior Density (HPD) interval gives a technique for minimizing the interval and obtaining a precise estimate. Having preassigned α, and with a known functional form of F, the condition above does not by itself specify the interval (a, b); which interval should be chosen? It is not uniquely defined. Consequently, we choose the specific (1 − α) interval that contains most of the posterior probability. To do so, we choose the smallest interval (a, b) satisfying two properties [Press, 2003]:

Property I: F(b) − F(a) = 0.95

Property II: If $p(\theta \mid x_1, x_2, \cdots, x_n)$ denotes the posterior density, then for a < θ < b, $p(\theta \mid x_1, x_2, \cdots, x_n)$ has a greater value than for any other interval for which Property I holds.

The posterior density at every point inside the HPD interval is greater than at every point outside the interval. Also, for a given credibility level, the HPD interval is as small as possible. The HPD interval always exists and is unique, as long as, for all intervals of content (1 − α), the posterior density is not uniform in any interval of the space of θ [Box and Tiao, 1978; Section 2.8, pp. 122-125]. The first property is employed to give a formal definition.

Definition 3.2 [Box and Tiao, 1978]
Let $p(\Theta \mid x_1, x_2, \cdots, x_n)$ be a posterior density function. An interval I in the parameter space of Θ is called an HPD interval of content (1 − α) if:
i) $\Pr\{\Theta \in I \mid x\} = 1 - \alpha$
ii) for $\theta_1 \in I$ and $\theta_2 \notin I$, $p\{\theta_1 \mid x\} \geq p\{\theta_2 \mid x\}$.
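Once posterior draws are available (e.g. from the Gibbs sampler of Chapter IV), an equal-tail credible interval in the sense of Definition 3.1 can be read off from sample quantiles. The sketch below uses simulated draws in place of real posterior output.

```python
import numpy as np

rng = np.random.default_rng(3)
posterior_draws = rng.normal(loc=60.0, scale=1.5, size=5000)   # stand-in for MCMC output

alpha = 0.05
lower, upper = np.quantile(posterior_draws, [alpha / 2, 1 - alpha / 2])
print(f"95% equal-tail credible interval: ({lower:.2f}, {upper:.2f})")
```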
Now, let φ = f(θ) define a one-to-one transformation of the parameters from θ to φ. Any interval of content (1 − α) in the space of Θ transforms into an interval of the same content in the space of φ if the transformation is linear. As in the univariate case, when a noninformative prior is used, which is equivalent to assuming that some transformed set of parameters φ = f(θ) is locally uniformly distributed, standardized HPD intervals calculated in terms of φ are available.

In Chapter IV, we present sampling using the Gibbs Sampling approach to estimate the posterior distribution and exhibit graphical representations from which the intervals for the parameter estimates can be read. The entire set of results and plots is computed via the WinBUGS programming package.
CHAPTER 4

NUMERICAL IMPLEMENTATION OF THE BAYESIAN GRADING

4.1 Introduction to Markov Chain Monte Carlo Methods
We have set up a model to assign grades through conditional Bayesian reasoning. In this chapter we explain how to estimate the model given in Chapter III (Equations (3.9) and (3.10)). We discuss a general simulation approach for numerically calculating the quantities that arise in the Bayesian prior-posterior analysis. In the letter grade assigning problem we are interested in finding the optimal mean values for each well-defined grade component; that is, we wish to learn about the unknown parameters θ of the posterior density. Suppose $\theta \sim p(\theta \mid x)$ and we seek the posterior expectation of some function of the parameters, for example

$$E\big[f(\mu, \sigma, \pi) \mid x\big] = \int f(\mu, \sigma, \pi)\, p(\mu, \sigma, \pi \mid x)\, d(\mu, \sigma, \pi) \approx \frac{1}{N}\sum_{t=1}^{N} f\!\left(\mu^{(t)}, \sigma^{(t)}, \pi^{(t)}\right)$$

or, more simply,

$$\gamma \equiv E\big[f(\theta) \mid x\big] = \int f(\theta)\, p(\theta \mid x)\, d\theta, \qquad
\hat{\gamma} = \frac{1}{N}\sum_{i=1}^{N} f\!\left(\theta^{(i)}\right)$$

where θ is a vector-valued parameter and x is the vector of raw scores; $\hat{\gamma}$ converges to $E[f(\theta) \mid x]$ with probability 1 as N → ∞. Typically this integral cannot be computed analytically, and the dimension of the integration exceeds three or four. In such cases we can compute the integral by Monte Carlo sampling methods, more precisely what are called non-iterative Monte Carlo methods [Carlin and Louis, 2000].
The idea behind the Markov Chain (MC) approach is to set aside the immediate task at hand and to ask how the posterior density $p(\theta \mid x)$ may be sampled. Press (2003) explains that if we had iid draws $\theta^{(1)}, \theta^{(2)}, \cdots, \theta^{(N)} \sim p(\theta \mid x)$ from the posterior density, then, provided the sample is large enough, we could estimate not just the above integral but also other features of the posterior density by forming the relevant sampling-based estimates from those draws. For example, the sample average of the draws would be our simulation-based estimate of the posterior mean. Under suitable Strong Laws of Large Numbers these estimates converge to the posterior quantities as the simulation size grows.

Furthermore, it is possible to sample complex and high-dimensional posterior densities by a set of methods called Markov Chain Monte Carlo (MCMC). The objectives of MCMC are to generate a sample from the joint posterior probability distribution and to estimate expectations of the parameters θ. These methods involve the simulation of a suitably constructed Markov chain that converges to the target density of interest (the posterior density). A Markov chain is defined by the property that the conditional density of $\theta^{(t)}$, conditioned on the entire preceding history of the chain, depends only on the previous value $\theta^{(t-1)}$. The underlying rationale of MCMC simulation is to construct a transition density that converges to the posterior density from any starting point $\theta^{0}$, in the sense that for any measurable set S under p, $\Pr\left(\theta^{(t)} \in S \mid x, \theta^{0}\right)$ converges to $\int_S p(\theta \mid x)\, d\theta$ as t → ∞.

In this study we do not explain Markov chains in detail, since they are not the focus of this study; however, we use the concept of transition probabilities, and the Chapman-Kolmogorov formula is applied. For extensions of the related topics the reader may refer to Walsh (2004), Gelman et al. (1995), Carlin and Louis (2000) and Press (2003).
One problem with applying Monte Carlo integration lies in obtaining samples from a complex probability distribution p(x). This problem is solved by MCMC methods. It was first addressed by mathematical physicists seeking to integrate very complex functions by random sampling, and the resulting, most general MCMC approach is called the Metropolis-Hastings (M-H) algorithm, introduced by Metropolis et al. (1953) [Press, 2003]. A second technique for constructing Markov chain samplers is the Gibbs sampling algorithm, introduced by Geman and Geman (1984), Tanner and Wong (1987) and Gelfand and Smith (1990) [Press, 2003]. The Gibbs sampling algorithm is in fact the best-known special case of the M-H algorithm.
4.2 Gibbs Sampling

In this study we do not discuss the M-H algorithm, since our objective is to treat the problem using Gibbs sampling, which is one of the simplest MCMC algorithms and a special case of the M-H algorithm. Gibbs sampling is also known as alternating conditional sampling [Gelman et al., 1995]. Gibbs sampling is defined in terms of subvectors of θ. Suppose the parameter vector has been divided into g components or subvectors, $\Theta = \{\theta_1, \theta_2, \cdots, \theta_g\}$. The Markov chain is constructed by sampling from the set of full conditional densities

$$\Big\{\, p(\theta_1 \mid x, \theta_2, \theta_3, \cdots, \theta_g);\; p(\theta_2 \mid x, \theta_1, \theta_3, \cdots, \theta_g);\; \cdots;\; p(\theta_g \mid x, \theta_1, \theta_2, \cdots, \theta_{g-1}) \,\Big\}$$

Each iteration of the Gibbs sampler cycles through the subvectors of Θ, drawing each subset conditional on the values of all the others; there are g steps in iteration t. At each iteration t, an ordering of the g subvectors of Θ is chosen and, in turn, each $\theta_j^t$ is sampled from the known conditional distribution given all the other components of Θ, $p\big(\theta_j \mid x, \Theta_{-j}^{t-1}\big)$, where $\Theta_{-j}^{t-1}$ represents all of the parameters except $\theta_j$ at their current values $\{\theta_1^{t}, \theta_2^{t}, \cdots, \theta_{j-1}^{t}, \theta_{j+1}^{t-1}, \cdots, \theta_g^{t-1}\}$, and x is the vector of raw scores for the students. Thus, each subvector $\theta_j$ is updated conditional on the latest values of the other components of Θ, namely the iteration-t values for the components already updated and the iteration-(t−1) values for the others. In this way we obtain the updated posterior distribution for each letter grade component. The Gibbs sampler algorithm is:
Algorithm 4.1 (Gibbs Sampling)

1. Specify arbitrary initial values $\Theta^0 = \{\theta_1^0, \theta_2^0, \cdots, \theta_g^0\}$.

2. For t = 1, 2, ..., B + T, construct $\Theta^{(t)}$ as follows:

   Draw $\theta_1^t \sim p_1\{\theta_1 \mid \theta_2^{t-1}, \theta_3^{t-1}, \cdots, \theta_g^{t-1}\}$,
   Draw $\theta_2^t \sim p_2\{\theta_2 \mid \theta_1^{t}, \theta_3^{t-1}, \cdots, \theta_g^{t-1}\}$,
   Draw $\theta_3^t \sim p_3\{\theta_3 \mid \theta_1^{t}, \theta_2^{t}, \cdots, \theta_g^{t-1}\}$,
   ...
   Draw $\theta_g^t \sim p_g\{\theta_g \mid \theta_1^{t}, \theta_2^{t}, \cdots, \theta_{g-1}^{t}\}$.

   Example (t = 1): draw $\theta_1^1 \sim p_1\{\theta_1 \mid \theta_2^0, \theta_3^0, \cdots, \theta_g^0\}$, draw $\theta_2^1 \sim p_2\{\theta_2 \mid \theta_1^1, \theta_3^0, \cdots, \theta_g^0\}$, draw $\theta_3^1 \sim p_3\{\theta_3 \mid \theta_1^1, \theta_2^1, \cdots, \theta_g^0\}$, ..., draw $\theta_g^1 \sim p_g\{\theta_g \mid \theta_1^1, \theta_2^1, \cdots, \theta_{g-1}^1\}$.

3. After completing one iteration of the scheme, use the draws from the previous step to construct $\Theta^{t+1}$. After t such iterations we obtain $\{\theta_1^t, \theta_2^t, \cdots, \theta_g^t\}$.

Note that we must discard $\Theta^{(t)}$ for all t ≤ B, where B ≪ T is the "burn-in" period, since only afterwards do we presume that the limiting distribution has been reached. The remaining values of $\Theta^{(t)}$ are the simulated draws from the posterior distribution of Θ. Alternatively, we can state steps 2 and 3 as follows:
2. For t = 1, 2, ..., B + T, construct $\Theta^{(t+1)}$ as follows:

   Update $\theta_1^t$ to $\theta_1^{t+1} \sim p\{\theta_1 \mid x, \theta_2^{t}, \theta_3^{t}, \cdots, \theta_g^{t}\}$,
   Update $\theta_2^t$ to $\theta_2^{t+1} \sim p\{\theta_2 \mid x, \theta_1^{t+1}, \theta_3^{t}, \cdots, \theta_g^{t}\}$,
   ...
   Update $\theta_g^t$ to $\theta_g^{t+1} \sim p\{\theta_g \mid x, \theta_1^{t+1}, \theta_2^{t+1}, \cdots, \theta_{g-1}^{t+1}\}$.

3. The complete updated vector is then labelled $\Theta^{(t+1)}$. Repeat the above steps B + T times.
Theorem 4.1 [Carlin and Louis, 2000]
For the Gibbs sampling algorithm (as outlined above),
(a) $\Theta^{(t)} = \{\theta_1^{(t)}, \theta_2^{(t)}, \cdots, \theta_g^{(t)}\} \xrightarrow{d} \{\theta_1, \theta_2, \cdots, \theta_g\} \sim p\{\theta_1, \theta_2, \cdots, \theta_g\}$ as $t \to \infty$;
(b) the convergence in (a) is exponential in t in the $L^1$ norm.

From this theorem, all we require in order to obtain samples from the joint distribution of $\{\theta_1, \theta_2, \cdots, \theta_g\}$ is the ability to sample from the g corresponding full conditional distributions. In this study, the joint distribution of interest is the joint posterior distribution given the raw scores, denoted $p\{\theta_1, \theta_2, \cdots, \theta_g \mid x\}$. A sketch of such a sampler for the normal mixture model is given below.
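The following is a minimal, self-contained sketch of a Gibbs sampler for the normal mixture of Chapter III, using the full conditionals derived in Sections 3.3.9 and 4.4 (latent allocations, Dirichlet weights, Normal means, Inverse-Gamma variances). It is an illustration under assumed hyperparameters and simulated scores, not the WinBUGS implementation used in the study, and it does not enforce the ordering constraint on the component means of Section 3.3.5.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)

# Simulated raw scores and a small number of components (assumed, for illustration)
x = np.concatenate([rng.normal(45, 5, 30), rng.normal(60, 6, 50), rng.normal(78, 4, 20)])
G, n = 3, len(x)

# Hyperparameters (assumed): Dirichlet eta, Normal(nu, delta2), IG(alpha, beta)
eta = np.ones(G)
nu, delta2 = np.array([40.0, 60.0, 80.0]), 100.0
alpha, beta = 3.0, 1.0 / (2 * 25.0)

# Initial values
pi, mu, sigma2 = np.ones(G) / G, nu.copy(), np.full(G, 25.0)

B, T = 500, 1500
draws = []
for t in range(B + T):
    # 1) allocate each score to a component with probability prop. to pi_g * N(x | mu_g, sigma2_g)
    w = pi * norm.pdf(x[:, None], loc=mu, scale=np.sqrt(sigma2))
    w /= w.sum(axis=1, keepdims=True)
    y = np.array([rng.choice(G, p=row) for row in w])

    # 2) update component probabilities: Dirichlet(eta_g + n_g)
    n_g = np.bincount(y, minlength=G)
    pi = rng.dirichlet(eta + n_g)

    # 3) update means and variances component by component
    for g in range(G):
        xg = x[y == g]
        V = 1.0 / (1.0 / delta2 + len(xg) / sigma2[g])
        M = nu[g] / delta2 + xg.sum() / sigma2[g]
        mu[g] = rng.normal(V * M, np.sqrt(V))              # mu_g | ... ~ N(V_g M_g, V_g)
        rate = 1.0 / beta + 0.5 * np.sum((xg - mu[g]) ** 2)
        sigma2[g] = 1.0 / rng.gamma(alpha + len(xg) / 2.0, scale=1.0 / rate)

    if t >= B:                                             # keep post burn-in draws only
        draws.append(mu.copy())

print(np.mean(draws, axis=0))   # posterior mean estimates of the component means
```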
69
4.3
Introduction to WinBUGS Computer Program
As mentioned in Chapter I and Chapter III, for the Bayesian problem we estimate the posterior distribution via WinBUGS; we first describe WinBUGS in general. WinBUGS is the MS Windows operating-system version of Bayesian inference Using Gibbs Sampling, a versatile package designed to carry out MCMC computations for a wide variety of Bayesian models [Press, (2003)]. The software can be downloaded free of charge from the internet; the program and manuals are available over the web at www.mrc-bsu.cam.ac.uk/bugs/.
WinBUGS requires that the Bayesian model be expressible as a directed graph. The user is not obliged to actually draw the model as a directed graph; however, based on an understanding of the model, the instructor can work with the directed graph when assigning letter grades to students, and drawing it is an extremely helpful first step of the analysis. In practice, the menu on the WinBUGS window permits the user to draw a directed graph. For examples of WinBUGS programs and directed graphs, see Volume I, Volume II and Volume III in the Help menu of the WinBUGS window.
4.4	Model Description
Now we specify the unknown parameters for the letter-grade assignment problem. Let y_i be the component (i.e., the particular letter grade) of the mixture to which raw score x_i belongs. The unknown parameters in this study are therefore Θ = {y, π, µ, σ²}. Augmenting the unknown parameters with y makes it easier to find the conditional distributions needed by the Gibbs sampling algorithm. We augment the data x by the “missing data” Y = {y_1, y_2, ..., y_n}, in which each raw-score observation x_i is assumed to arise from a specific but unknown component y_i of the mixture [Stephens, 2000]. This variable is not observed and is therefore called a latent variable; it indicates the original population of observation x_i [Cornebise et al., 2005]. The model p(x_i ∈ G_g) ∝ π_g φ(x_i | µ_g, σ_g²) can then be written in terms of the latent variables, with y_i assumed to be a realization of independent and identically distributed discrete random variables Y_1, Y_2, ..., Y_n with probability mass function

p( y_i = g | π, µ_g, σ_g² ) = π_g ,        (4.1)

for i = 1, 2, ..., n and g = 1, 2, ..., G. Conditional on the Y's, the raw scores x_1, x_2, ..., x_n are assumed to be independent observations from the densities

p( x_i | y_i = g, π, µ, σ² ) = p( x_i ; µ_g, σ_g² ) = N( µ_g, σ_g² ).        (4.2)

Integrating out the latent variables Y_1, Y_2, ..., Y_n then yields the model in Eq.(3.10) mentioned above.
Let “· | ···” denote conditioning on all other parameters in Θ, the raw-score data, and G = 11. The conditional distributions are

p( y_i = g | ··· ) ∝ π_g N( x_i | µ_g, σ_g² ) ,

and the posterior for π is a Dirichlet with elements η_g + n_g, where n_g is the number of sample members assigned to the g-th letter grade [Congdon, 2003]. Thus we have π | ··· ~ Di(η_g + n_g), with n_g = #{ i : y_i = g }; that is, n_g is simply the number of raw scores allocated to group g according to the parameter y. Therefore, step 2 of the Gibbs sampling in Algorithm 4.1 becomes:
Algorithm 4.2 (Gibbs Sampling for Normal Mixture)
2. For t = 1, 2, ..., B + T, construct Θ^(t) as follows:

π | ··· ~ Di( η_1 + n_1, η_2 + n_2, ..., η_G + n_G )

µ_g | ··· ~ N( V_g M_g , V_g )

σ_g² | ··· ~ IG( α_g + n_g/2 , [ β_g^{-1} + (1/2) Σ_{i: y_i = g} ( x_i − µ_g )² ]^{-1} )

where V_g = [ 1/δ_g² + n_g/σ_g² ]^{-1}, M_g = ν_g/δ_g² + n_g x̄_g/σ_g², and x̄_g = Σ_{i: y_i = g} x_i / n_g. The Gibbs sampling updates are performed in the order π, µ_g, σ_g². Since we have chosen η_1 = η_2 = ... = η_G = 2, we may also write π | ··· ~ Di( 2 + n_1, 2 + n_2, ..., 2 + n_G ).
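For concreteness, the sketch below implements one sweep of Algorithm 4.2 in Python with NumPy rather than in WinBUGS. It is only an illustration of the conditional draws described above (allocations y, then π, µ_g and σ_g²); the argument names (nu, delta2, alpha, beta, eta) mirror the notation of this chapter and are assumptions of the sketch, not the Appendix E code.

```python
import numpy as np

rng = np.random.default_rng(0)

def gibbs_sweep(x, pi, mu, sigma2, nu, delta2, alpha, beta, eta):
    """One Gibbs sweep for the G-component normal mixture (Algorithm 4.2).
    pi, mu, sigma2, nu, delta2, alpha, beta, eta are length-G arrays."""
    x = np.asarray(x, dtype=float)
    G = len(pi)

    # 1) latent allocations: p(y_i = g | ...) proportional to pi_g * N(x_i | mu_g, sigma2_g)
    dens = pi * np.exp(-0.5 * (x[:, None] - mu) ** 2 / sigma2) / np.sqrt(sigma2)
    probs = dens / dens.sum(axis=1, keepdims=True)
    y = np.array([rng.choice(G, p=p) for p in probs])

    # 2) mixture weights: pi | ... ~ Dirichlet(eta_g + n_g)
    n_g = np.bincount(y, minlength=G)
    pi = rng.dirichlet(eta + n_g)

    # 3) component means: mu_g | ... ~ N(V_g * M_g, V_g)
    sums = np.bincount(y, weights=x, minlength=G)
    V = 1.0 / (1.0 / delta2 + n_g / sigma2)
    M = nu / delta2 + sums / sigma2
    mu = rng.normal(V * M, np.sqrt(V))

    # 4) component variances: sigma2_g | ... ~ IG(alpha_g + n_g/2, [1/beta_g + SS_g/2]^(-1))
    ss = np.bincount(y, weights=(x - mu[y]) ** 2, minlength=G)
    sigma2 = 1.0 / rng.gamma(alpha + n_g / 2.0, 1.0 / (1.0 / beta + 0.5 * ss))
    return y, pi, mu, sigma2
```

In WinBUGS these updates are generated automatically from the model specification; the sketch only makes the order of the full conditional draws explicit.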
Note that WinBUGS uses the precision instead of the variance to specify a normal distribution; we denote τ = 1/σ², or σ = 1/√τ. The model is specified as shown in Appendix E. WinBUGS treats everything inside the opening and closing brackets, {} or (), as a description of the model, so the words MODEL, DATA and INITIAL VALUE are not required; we include them only as reminders of the information involved in the model. The model contains two chains: the first chain (denoted INITIAL VALUE1) is used to define the distribution of the raw scores, and the second chain (denoted INITIAL VALUE2) defines the parameters involved in estimating those particular parameters.
WinBUGS requires that the Bayesian model be a directed graph. A directed graph consists of nodes connected by descending links. The directed graph for the model implementing the posterior distribution of Algorithm 4.2 is shown in Figure 4.1. Each parameter or variable in the model, including the hyperparameters of the normal distribution, is represented by a node in the graph. Each node has none, one or more “parents” and none, one or more “children”. Constants are shown in rectangles; variable nodes, which depend functionally or stochastically on their parents, are shown in ovals.
Note that the WinBUGS language is used to specify the model, that is, the prior and the likelihood; it is not a procedural programming language and does not specify a series of commands to be executed in sequence [Press, 2003]. The purpose of the WinBUGS model specification language is to “paint a word picture” of the directed graph. Having specified the model as a full joint distribution on all quantities, whether parameters or observables, we wish to sample values of the unknown parameters from their conditional posterior distribution given those stochastic nodes that have been observed. The basic idea behind the Gibbs sampling algorithm is to successively sample from the conditional distribution of each node given all the others in the graph (these are known as full conditional distributions).
[Figure 4.1: Graphical Model for Bayesian Grading — directed graph whose nodes include m[g], v[g], alpha.a[g], alpha.b[g], beta.b[g], alpha.tau[g], phi[], P[1:11], G[i], mu[i], tau[i], sigma and y[i].]
4.5	Setting the Priors and Initial Values
4.5.1 Setting the Prior
In this section, we explain how to set the prior parameters and their initial values. The prior parameters have to be set to something. They can be set to very uninformative values, but they still have to be set. Whatever values we choose, someone can always ask, “Why these particular values? Why not some other values?” The guideline is to choose priors that are uninformative and that make more or less sense. In practice it does not matter exactly which values we give the prior parameters: as long as they are reasonably uninformative, the end result is the same, because the result is then driven by the data. So the specific values of the prior parameters make little difference.
In this study, we need two vectors that determine the prior for the component means: ν and δ², where ν_g is the prior mean of the g-th component mean. Since the raw scores always lie in the interval [0, 100], we place the prior component means equidistantly on that interval; thus, for G = 11, ν_g ≈ 9g. The prior variance of the component means should be set to some high value, since the prior mean ν_g is a very uncertain guess at the true component mean. We therefore set the prior standard deviation to 20, so that the variance is 400. We could also set the variance to a higher value such as 500 or 600; the end result is the same.
For the component variances we follow the idea of Alex (2003): a priori, the expected standard deviation of each component is taken to be approximately 5. To be uninformative, the variance of σ is set to a relatively large value (here 4); again, if we do not like this number we may use another, and the answer will be much the same. Hogg and Craig (1978) remark that the terminology “mathematical expectation” has its origin in games of chance; the mathematical expectation of u(X), where X is a random variable, is defined whenever the corresponding integral (or sum) converges absolutely. In addition, we treat the component variance as having its own probability distribution. Using these facts, we approximate the expected value of the component variance from the expected value and standard deviation of σ as follows. To be uninformative we set

E[σ] = 5,  Var[σ] = 4.

From Hogg and Craig (1978),

Var[σ] = E[σ²] − (E[σ])²
∴ E[σ²] = Var[σ] + (E[σ])² = 4 + 5² = 29,
which gives an expected value of 29 for the component variance. Now, referring to Section 3.3.8 in Chapter 3, we have shown that σ² ~ IG(α, β). To make the prior distribution of the component variance reasonably vague, Carlin and Louis (2000, pg. 326) explain that we can set the prior mean and prior standard deviation both equal to µ, where µ ≡ E[σ²] and τ² ≡ Var[σ²]. This gives

α = (µ/τ)² + 2 = 1 + 2 = 3   and   β = 1 / ( µ[(µ/τ)² + 1] ) = 1 / ( µ[1 + 1] ) = 1/(2µ).

These parameters create a diffuse, uninformative inverse Gamma prior with mean µ. For that reason, we apply this idea in setting the prior of the component variances: for every letter-grade component g the variance follows an inverse Gamma distribution with parameters α = 3 and β = 1/(2(29)) = 0.0172. The standard deviation of the prior for the variance then equals its mean, i.e. E[σ²] = 29.
In addition, note that an inverted Gamma density for the variance is equivalent to saying that 1/σ² follows a Gamma distribution.
Finally, we need to set the prior for the component probabilities. We have chosen the component probability vector to be Dirichlet distributed with parameter η. To make all components a priori equally likely, all elements of η must be equal and greater than or equal to zero (η ≥ 0). The elements of η can be interpreted as the number of synthetic observations from each component of the mixture [Alex, 2003]. We anticipate that there may be two such synthetic observations per component: one at the lower tail and one at the upper tail, two borderline values for which we are uncertain which letter grade should be assigned. Therefore the elements of η are chosen equal to 2, i.e. η_1 = η_2 = ... = η_G = 2.
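Collecting the prior settings of this section in one place, the short sketch below simply evaluates the hyperparameter values derived above (ν_g ≈ 9g, δ_g² = 400, α_g = 3, β_g = 1/(2·29) and η_g = 2); the variable names are chosen for the example and do not correspond to the Appendix E code.

```python
import numpy as np

G = 11                                   # number of letter-grade components

# prior means of the component means, placed equidistantly on [0, 100]
nu = 9.0 * np.arange(1, G + 1)           # nu_g ≈ 9g
delta2 = np.full(G, 20.0 ** 2)           # prior variance of component means (sd = 20)

# inverse-Gamma prior for the component variances from E[sigma] = 5, Var[sigma] = 4
E_sigma, V_sigma = 5.0, 4.0
mu_var = V_sigma + E_sigma ** 2          # E[sigma^2] = 29
alpha = np.full(G, 3.0)                  # (mu/tau)^2 + 2 with prior mean = prior sd
beta = np.full(G, 1.0 / (2.0 * mu_var))  # 1/(2*29) ≈ 0.0172

# symmetric Dirichlet prior on the component probabilities
eta = np.full(G, 2.0)

print(nu, delta2[0], alpha[0], round(beta[0], 4), eta[0])
```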
Know that the raw score is drawn from a distribution of Eq.(3.9), where a
Dirichlet process is adopted for µ ' s and σ ' s . The Dirichlet process specifies a baseline
prior from which candidate values for µ g and σ g are drawn. Suppose this parameter are
unknown for each letter grades component, and that clustering in these values is
expected. Then for similar letter grades of raw scores within a cluster, the same value θ
of µ g and σ g would be appropriate for them. Theoretically, the maximum number G of
cluster could be n [Congdon, 2003]. In addition, the cluster indicator for raw score i is
chosen according to
G [ i ] ~ Categorical ( P)
4.5.2 Initial Values
The first step of the computation is to obtain crude estimates of the model parameters, and such initial parameter estimates are easily obtained. To set the initial parameter values Θ^(0), we first sort the data and subdivide them into G = 11 groups of equal size [Raftery, 1996]: the lowest observations are in group one, the lowest observations not in group one are in group two, and so on. The initial estimates are then obtained by estimating µ_g by x̄_g, the average of the observations in the g-th group, for each g = 1, 2, ..., G, and estimating σ_g² by the average of the G within-group sample variances s_g². In short, we set the initial values of the means and variances to the sample quantities of the corresponding groups, and we crudely estimate ν and δ² as the mean and variance of the G estimated values.
For the initial values of the component probabilities we set all of them to the fair proportion 1/G. Further explanation of setting initial values is given in Gelman et al. (1995), pg. 424-426.
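A sketch of this crude initialisation, again in Python with hypothetical function and variable names, is given below; it sorts the scores, splits them into G equal groups and reads off the group means, the averaged within-group variance and crude values of ν and δ².

```python
import numpy as np

def crude_initial_values(x, G=11):
    """Crude Gibbs starting values: sort, split into G equal groups,
    use the group means and the average within-group variance."""
    x = np.sort(np.asarray(x, dtype=float))
    groups = np.array_split(x, G)              # lowest scores in group one, etc.
    mu0 = np.array([g.mean() for g in groups])
    s2 = np.array([g.var(ddof=1) if len(g) > 1 else 0.0 for g in groups])
    sigma2_0 = np.full(G, s2.mean())           # common crude variance estimate
    pi0 = np.full(G, 1.0 / G)                  # fair initial component probabilities
    nu_hat, delta2_hat = mu0.mean(), mu0.var(ddof=1)   # crude nu and delta^2
    return mu0, sigma2_0, pi0, nu_hat, delta2_hat
```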
4.6	Label Switching in MCMC
The Gibbs sampler described above works in general. In mixture models, however, there is also an issue called label switching in the MCMC output. It is mainly caused by the nonidentifiability of the components under symmetric priors: in other words, it is impossible to identify which mixture component a draw was made from, and as a result the posterior densities of all components appear the same.
If sampling takes place under an unconstrained prior with G groups, then the parameter space has G! subspaces corresponding to the different ways of labeling the states [Congdon, 2003], and an MCMC run on an unconstrained prior may jump between these subspaces. A basic solution is to impose identifiability constraints when they can be found [Alex, 2003]; constraints may be imposed to ensure that components do not ‘flip over’ during estimation. For Bayesian mixtures, the invariance of the likelihood to permutations of the labels is not a problem that is as easily solved as in the frequentist approach [Jasra, Holmes and Stephens, 2005].
Fortunately, in the grade-assignment application one may specify that one mixture component is always greater than another: there is a very natural constraint, namely that the means are ordered, µ_1 < µ_2 < ··· < µ_G. Depending on the grading problem, one sort of constraint may be more appropriate for a particular raw-score data set; here the ordering of the subgroup means is well identified, that is, the mean raw score of the E's is less than the mean raw score of the D's, the D+'s, and so on. See Appendix E for the marginal posterior density estimates of the means of the different letter grades; the symmetries in the posterior distribution are immediately seen, with the posterior means being the same for each component and the classification probabilities all being close to 0.09 initially. Note that the Gibbs sampler draws are post-processed to implement this constraint, that is, an ordering on one of the parameters.
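A minimal sketch of this post-processing step is given below, assuming the saved draws are stored as arrays of shape (number of iterations, G); each draw is simply relabeled so that its component means are in increasing order, and the same permutation is applied to the variances and component probabilities.

```python
import numpy as np

def order_by_means(mu_draws, sigma2_draws, pi_draws):
    """Relabel each saved Gibbs draw so that mu_1 < mu_2 < ... < mu_G.

    All arrays have shape (n_draws, G); the permutation that sorts the
    means of a draw is also applied to its variances and probabilities.
    """
    order = np.argsort(mu_draws, axis=1)
    rows = np.arange(mu_draws.shape[0])[:, None]
    return (mu_draws[rows, order],
            sigma2_draws[rows, order],
            pi_draws[rows, order])
```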
4.7	Sampling Results
In this section, we present two real-life sampling results, observed from a small class and a large class of students. We assume that the final scores have been transformed to composite scores. In addition, we compare the letter-grade assignment from GB to the letter grades actually assigned by the instructors, so the reader can judge by visual inspection how well GB does. In Chapter II, we discussed the method of weighting raw scores that come from several sources (e.g. Test, Midterm Test, Project, Assignment, Studio or Lab Work), and we stressed that it is inappropriate to assign grades based on a combined score without transforming the scores into weighted composite scores, since the combination of several raw scores would produce a contaminated normal score distribution. However, in this study the instructors actually assigned grades based on the combined raw score, so for comparison purposes we also consider how well GB works on the combined score.
In this study, we focus on the letter-grade component means. The results will serve as decision support for instructors in assigning letter grades and evaluating their students' performance in the particular semester. To check whether the sampling estimates converge to their expected values, we present several WinBUGS outputs in the subsequent sections.
4.7.1.1 Case 1: Small Class
The model and raw-score data for Case 1 are given in Appendix E. We have a small class of 62 students who attended one course for a semester. The mean raw score is 75.9, the median is 74.5 and the standard deviation is 12.88. Table 4.1 shows the WinBUGS output of the marginal moments and quantiles of the mean of each letter grade after sampling. The time for 150,000 samples was less than 50 s on a 3.0 GHz Pentium 4 computer. A burn-in of at least 500 updates followed by a further 75,500 updates gave the parameter estimates; in other words, we discard µ_g(t) for all t ≤ 500, the burn-in period (or initial transient phase of the chain), and continue sampling the component means to 75,500 updates, which gives the optimal estimates of the letter-grade means.

Table 4.1 gives the component mean of each letter grade in this small class of 62 students. ‘Node’ is the column for the parameter that we want to estimate, followed by
the columns of the means and standard deviations of the corresponding nodes. Now, focus on the 95% equal-tail credibility intervals between the 2.5% and 97.5% quantiles (refer to Chapter III, Section 3.4). We can see that the MC error for µ_1 is far too large, which means that, for this particular class, no student should be assigned grade E by the instructor. Next, µ_2 (i.e. the mean for grade D) has lower bound 37.87 at α = 0.05 (or α/2 = 0.025). Therefore the instructor would decide to assign grade E if a student's raw score is less than 37; conversely, grade D should be assigned for raw scores between 37 and 43, grade D+ for raw scores greater than 43 and less than 53, and so on.
Furthermore, we have the probability that a raw score belongs to each of the corresponding grades. For example, the probability that a raw score of 96 is assigned grade A is about 0.0743, a raw score of 70 will be assigned grade B- by the instructor with probability 0.1756, and so on, as shown in Table 4.1.
In addition, Table 4.2 gives the minimum and maximum score for each letter grade and the percentage of students receiving the respective letter grade. We see that 25.81 percent of the students were assigned grade B- and that more than half of the students were assigned better, more meaningful grades.
The letter grades assigned by the Straight Scale and Standard Deviation methods are shown in Table 4.3. These results show that different methods assign different letter grades to the students. For example, the raw scores for grade B are between 74 and 82 for Bayesian grading, 70 to 74 for the Straight Scale and 90 to 92 for grading through the Standard Deviation method. Moreover, the cumulative percentages show that more than 50% of the class is assigned grade B and above by the GB method, more than 60% by the Straight Scale and more than 30% by the Standard Deviation method.
Table 4.1: Optimal Estimates of Component Means for Case 1

Node   Mean       Std. Dev   MC error   2.5%        Median     97.5%      Start   Sample
π1     0.0135     0.009429   2.57E-5    0.001647    0.01139    0.03707    501     150000
π2     0.03374    0.01476    3.706E-5   0.01117     0.03166    0.06801    501     150000
π3     0.03378    0.01476    4.052E-5   0.01118     0.03172    0.06816    501     150000
π4     0.05401    0.01855    4.707E-5   0.02371     0.05197    0.09575    501     150000
π5     0.05403    0.01846    4.61E-5    0.02393     0.05203    0.09541    501     150000
π6     0.08111    0.02243    5.829E-5   0.04297     0.07916    0.13       501     150000
π7     0.1756     0.03105    8.129E-5   0.1192      0.1741     0.2404     501     150000
π8     0.1756     0.03114    7.68E-5    0.1189      0.1742     0.2407     501     150000
π9     0.1893     0.03208    8.16E-5    0.1306      0.1879     0.256      501     150000
π10    0.1149     0.02622    6.407E-5   0.06869     0.1132     0.1709     501     150000
π11    0.07433    0.02148    5.709E-5   0.03788     0.07238    0.1213     501     150000
µ1     1.435E+6   3.2E+6     8863.0     -4.843E+6   1.43E+6    7.698E+6   501     150000
µ2     38.0       0.06298    1.609E-4   37.87       38.0       38.13      501     150000
µ3     45.0       0.05662    1.454E-4   44.89       45.0       45.11      501     150000
µ4     55.67      0.8745     0.005166   53.93       55.66      57.43      501     150000
µ5     60.0       0.02515    6.647E-5   59.95       60.0       60.05      501     150000
µ6     65.6       0.3317     9.094E-4   64.94       65.6       66.26      501     150000
µ7     69.5       0.1071     2.751E-4   69.29       69.5       69.71      501     150000
µ8     75.0       0.4676     0.001335   74.08       75.0       75.93      501     150000
µ9     84.0       0.5011     0.001446   83.01       84.0       84.99      501     150000
µ10    92.56      0.2583     6.781E-4   92.05       92.56      93.07      501     150000
µ11    95.33      0.1076     2.735E-4   95.12       95.33      95.55      501     150000
Table 4.2: Minimum and Maximum Score for Each Letter Grade, Percent of Students
and Probability of Raw Score Receiving that Grade for GB: Case 1

         GB                 Number of   Percentage   Cumulative       Probability
Grade    From     To        Students    %            Percentage %
A        95       100       3           4.84         4.84             0.0743
A-       92       94        7           11.29        16.13            0.1149
B+       83       91        10          16.13        32.26            0.1893
B        74       82        13          20.97        53.23            0.1756
B-       69       73        16          25.81        79.03            0.1756
C+       64       68        5           8.06         87.1             0.0811
C        59       63        3           4.84         91.94            0.054
C-       53       58        2           3.23         95.16            0.054
D+       44       52        2           3.23         98.39            0.0338
D        37       43        1           1.61         100              0.0337
E        0        36        0           0            100              0.0135
Table 4.3: Straight Scale and Standard Deviation Methods: Case 1

                       Straight Scale                 Standard Deviation
Letter    Score        Number of   Cumulative         Score            Number of   Cumulative
Grades                 Students    Percentage %                        Students    Percentage %
A         85-100       17          27.4               95.57-100.00     1           1.61
A-        80-84        8           40.3               90.89-95.57      9           16.13
B+        75-79        6           50.0               86.21-90.89      5           24.19
B         70-74        12          69.4               81.52-86.21      7           35.48
B-        65-69        10          85.5               76.84-81.52      6           45.16
C+        60-64        4           91.9               72.16-76.84      6           54.84
C         55-59        2           95.2               67.48-72.16      15          79.03
C-        50-54        1           96.8               62.79-67.48      5           87.10
D+        45-49        1           98.4               58.11-62.79      3           91.94
D         40-44        0           98.4               53.43-58.11      2           95.16
E         0-39         1           100.0              0.00-53.43       3           100.00
To examine the posterior density functions of the means, see Figure 4.2, which exhibits the sampled distribution of the mean for grade B+. These plots are smoothed kernel-density estimates of the component means; the smoother the curves, the better the posterior distribution sampling plot for the component means. The optimal posterior distributions for the remaining grades are given in Appendix E.
4.7.1.2 Convergence Diagnostics
In the previous section we mentioned that the sampling required 75,500 updates to reach an optimal solution. The question is: at which update do the solutions converge to the optimum? A Markov chain that approaches its stationary distribution slowly, or that exhibits high autocorrelation, can produce an inaccurate picture of the posterior distribution. WinBUGS offers three simple convergence diagnostics: the autocorrelation function, the Gelman-Rubin statistic and the trace plots. We apply these tools via the Sampling Monitor Tool (see Appendix E).
For this small-class example, consider the convergence diagnostics of the means of grades D and B+. First, the trace diagnostic, also called the time-series trace, is the plot of the generated random variable(s) against the iteration number. We find that the multiple chains cover the same range and do not show any trend or long cycle. Figure 4.3 (a) and (b) demonstrates the stable posterior of the means of grade D (µ_2) and grade B+ (µ_9); we conclude that convergence is achieved very quickly.
Next, Figure 4.4 (a) and (b) shows the running Gelman-Rubin convergence statistics. The green trace shows the width of the central 80% interval of the pooled runs, the blue trace is the average width of the 80% intervals within the individual runs, and their ratio (the red trace) rapidly approaches 1, i.e. R = (pooled/within) → 1. This indicates convergence.
Now we examine the running quantiles for the same grades. Figure 4.5 (a) and (b) shows convergence, as the quantiles of the parallel chains rapidly coincide.
Finally, we look at the autocorrelation functions of grades D and B+ (see Figure 4.6). Autocorrelation is not directly a convergence diagnostic, but a long-tailed autocorrelation plot suggests that the model is ill-conditioned and that the chain will converge more slowly. In this case no long tails appear in the autocorrelation plots, suggesting that the model is well conditioned and that the chain converges rapidly.
In summary, convergence is not an issue for GB, since all the convergence diagnostics show that the estimates of the component means converge very quickly. We can stop the iterations at any update at which the estimates satisfy the convergence diagnostics and the kernel density displays a smooth plot; once an update gives the optimal estimate, the following updates converge to the same value. See Appendix E for the convergence diagnostics of all letter grades.
[Figure 4.2: Kernel-Density Plots of Posterior Marginal Distribution of Mean for Grade B+ — mu.c[9], chains 1:2, after 1,000, 10,000, 50,000, 100,000 and 150,000 samples.]
[Figure 4.3: Monitoring Plots for Traces Diagnostics of Mean: (a) Grade D and (b) Grade B+.]

[Figure 4.4: Gelman-Rubin Convergence Diagnostics of Mean; (a) Grade D and (b) Grade B+.]

[Figure 4.5: Quantiles Diagnostics of Mean; (a) Grade D and (b) Grade B+.]

[Figure 4.6: Autocorrelations Diagnostics of Mean; (a) Grade D and (b) Grade B+.]
4.7.2 Case 2: Large Class
The model and raw-score data for Case 2 are given in Appendix E. We now consider a class of 498 students who attended one course for a semester. The mean is 71.53, the median is 73 and the standard deviation is 12.58. Table 4.4 shows the WinBUGS output of the marginal moments and quantiles of the mean of each letter grade after sampling. A burn-in of at least 500 updates followed by a further 75,500 updates gave the parameter estimates; in other words, we discard µ_g(t) for all t ≤ 500, the burn-in period (or initial transient phase of the chain), and then continue sampling to 75,500 updates, which gives the optimal estimates of the letter-grade means for this large class of students. The time for 150,000 samples was less than 3 minutes.
Table 4.4 shows the optimal estimates of the component means and component probabilities of each letter grade. From Table 4.4 the instructor should assign grade A for raw scores between 91 and 100, grade A- for raw scores between 84 and 90, and so on. The corresponding grade intervals are decided from the credibility interval of 2.5% to 97.5%, with α = 0.05 (or α/2 = 0.025). In addition, as for Case 1, Table 4.6 shows the letter grades together with their score ranges under the Straight Scale and Standard Deviation methods.
The probability that a student in this course fails is about 0.03 if the raw score is less than 33. Unlike Case 1, where no student was assigned grade E, here raw scores such as 32 and 25, for example, would both probably be assigned grade E. Most of the students in this class are again expected to obtain grade B-, as in the previous case; however, the raw scores for B- in this class are between 71 and 75 with probability 0.2593, while for Case 1 they were between 69 and 73 with probability 0.1756. This is not surprising, since we are examining grades for a different class and a different course, whose contents certainly differ. Clearly, a different course, a different class or a different number of students will have an impact on the instructor's grading plan.
Table 4.4: Optimal Estimates of Component Means for Case 2

Node   Mean       Std. Dev   MC error   2.5%       Median     97.5%      Start   Sample
π1     0.03145    0.00547    1.44E-5    0.02163    0.03115    0.043      501     150000
π2     0.03927    0.006077   1.514E-5   0.02825    0.03896    0.05204    501     150000
π3     0.0334     0.00563    1.509E-5   0.02323    0.03309    0.04527    501     150000
π4     0.04322    0.006361   1.651E-5   0.03159    0.04292    0.05644    501     150000
π5     0.05497    0.007151   1.911E-5   0.0418     0.05468    0.06973    501     150000
π6     0.09038    0.009001   2.303E-5   0.07356    0.09012    0.1088     501     150000
π7     0.2593     0.01372    3.658E-5   0.2327     0.2591     0.2866     501     150000
π8     0.1945     0.01238    3.078E-5   0.1708     0.1943     0.2193     501     150000
π9     0.1297     0.01053    2.718E-5   0.1098     0.1294     0.151      501     150000
π10    0.08255    0.008611   2.247E-5   0.06646    0.08226    0.1002     501     150000
π11    0.04125    0.006233   1.621E-5   0.02987    0.04096    0.05433    501     150000
µ1     33.73      0.5143     0.001466   32.72      33.73      34.74      501     150000
µ2     43.37      0.374      9.542E-4   42.63      43.37      44.11      501     150000
µ3     51.75      0.2213     5.475E-4   51.31      51.75      52.19      501     150000
µ4     59.29      0.2298     5.945E-4   58.83      59.29      59.74      501     150000
µ5     64.04      0.1606     4.145E-4   63.72      64.04      64.35      501     150000
µ6     67.44      0.1117     3.01E-4    67.23      67.44      67.66      501     150000
µ7     71.89      0.07132    1.775E-4   71.75      71.89      72.03      501     150000
µ8     76.48      0.08646    2.315E-4   76.31      76.48      76.65      501     150000
µ9     80.54      0.0997     2.724E-4   80.34      80.54      80.73      501     150000
µ10    84.32      0.151      3.633E-4   84.02      84.32      84.61      501     150000
µ11    92.55      0.5138     0.00135    91.54      92.55      93.56      501     150000
Table 4.5: Minimum and Maximum Score for Each Letter Grade, Percent of Students
and Probability of Raw Score Receiving that Grade for GB: Case 2

         GB                 Number of   Percentage   Cumulative       Probability
Grade    From     To        Students    %            Percentage %
A        91       100       13          2.6          2.6              0.04125
A-       84       90        32          6.4          9.0              0.08255
B+       80       83        64          12.9         21.9             0.1297
B        76       79        84          16.9         38.8             0.1945
B-       71       75        143         28.7         67.5             0.2593
C+       67       70        53          10.6         78.1             0.09038
C        63       66        32          6.4          84.5             0.05497
C-       58       62        23          4.6          89.2             0.04322
D+       51       57        16          3.2          92.4             0.0334
D        42       50        18          3.6          96.0             0.03927
E        0        41        20          4.0          100.0            0.03145
Table 4.6: Straight Scale and Standard Deviation Methods: Case 2

                       Straight Scale                 Standard Deviation
Letter    Score        Number of   Cumulative         Score            Number of   Cumulative
Grades                 Students    Percentage %                        Students    Percentage %
A         85-100       34          6.8                93.46-100        9           1.81
A-        80-84        75          21.9               89.01-93.46      6           3.01
B+        75-79        115         45.0               84.44-89.01      19          6.83
B         70-74        131         71.3               79.86-84.44      60          18.88
B-        65-69        60          83.3               75.29-79.86      99          38.76
C+        60-64        24          88.2               70.71-75.29      112         61.24
C         55-59        9           90.0               66.14-70.71      84          78.11
C-        50-54        16          93.2               61.56-66.14      32          84.54
D+        45-49        6           94.4               56.99-61.56      23          89.16
D         40-44        12          97.0               52.54-56.99      9           90.96
E         0-39         15          100.0              0-52.54          45          100.00
Now we compare Tables 4.2 and 4.5 with the grades the instructors assigned when applying the Straight Scale and Standard Deviation methods, shown in Tables 4.3 and 4.6. The results indicate that the grading plans via GB, Straight Scale and Standard Deviation differ in the grade intervals and in the number of students obtaining each grade.
Before we diagnose the convergence of the estimated values, we examine the posterior density functions of the means; see Figure 4.7, which exhibits the sampled distribution of the mean for grade B. The curves become smoother, giving a better posterior distribution sampling plot for grade B. The optimal posterior distributions for the remaining grades of Case 2 are given in Appendix E.
[Figure 4.7: Kernel-Density Plots of Posterior Marginal Distribution of Mean for Grade B — mu.c[8], chains 1:2, after 1,000, 10,000, 50,000, 100,000 and 150,000 samples.]
4.7.2.2 Convergence Diagnostics
As in Section 4.7.1.2, we now examine the convergence diagnostics of the estimated component means for Case 2. The time for 150,000 samples was less than 2.5 minutes on a 3.0 GHz Pentium 4 computer.
In this case, consider the convergence diagnostics of the means of grades A and B. The trace plots show that the multiple chains do not display any trend or long cycle and cover the same range; Figure 4.8 (a) and (b) shows that the plotted chains give reasonably stable posteriors for the means of grade A (µ_11) and grade B (µ_8). We stop the sampler at this point, concluding that an acceptable degree of convergence has been obtained.
Next, Figure 4.9 (a) and (b) displays the running Gelman-Rubin convergence statistics: the ratio (the red trace) of the pooled (green trace) and within-run (blue trace) interval widths rapidly approaches 1, i.e. R = (pooled/within) → 1, indicating convergence.
The running quantiles of the same grades are examined in Figure 4.10 (a) and (b); the quantiles of the parallel chains rapidly coincide, which implies convergence.
Finally, we look at the autocorrelation functions of grades A and B (see Figure 4.11). In this case no long tails appear in the autocorrelation plots and the observed autocorrelations are low, suggesting that the model is well conditioned and that the chain converges rapidly.
[Figure 4.8: Monitoring Plots for Traces Diagnostics of Mean: (a) Grade B and (b) Grade A.]

[Figure 4.9: Gelman-Rubin Convergence Diagnostics of Mean; (a) Grade B and (b) Grade A.]

[Figure 4.10: Quantiles Diagnostics of Mean; (a) Grade B and (b) Grade A.]

[Figure 4.11: Autocorrelations Diagnostics of Mean; (a) Grade B and (b) Grade A.]
4.8	Discussion
In the first part of Cases 1 and 2, we set T = 75,500 and at least B = 500. The iterations in the burn-in period are discarded to reduce the effect of the starting distribution; generally, one discards the first half of each sequence and focuses attention on the second half of the iterations [Casella and George (1992)]. The Gibbs sampler generates a Markov chain of random variables that converges to the distribution of interest (the target distribution), and we assume that the distribution of the simulated values θ^t, for large enough t, is close to the target distribution p(θ | x).
Another issue that sometimes arises is whether, once approximate convergence has been reached, one should use only every t-th simulation draw, for some value of t such as 1,000 or 50,000, in order to obtain approximately independent draws from the target distribution. In Cases 1 and 2 we have not found it useful to skip iterations, except when computer storage is a problem or the speed is too low. If the “effective” number of simulations is lower than the actual number of draws, the inefficiency is automatically reflected in the posterior intervals obtained from the simulation quantiles; see Table 4.1 and Table 4.4.
For the estimated values µ̂_g, we label the optimal sequences of length n as µ_{ig} (i = 1, 2, ..., n; g = 1, 2, ..., G) and compute B and W, the between- and within-sequence variances (see the Convergence Diagnostics sections), as

B = n/(G − 1) Σ_{g=1}^{G} ( µ̄_{·g} − µ̄_{··} )²   and   W = (1/G) Σ_{g=1}^{G} s_g² ,

where

µ̄_{·g} = (1/n) Σ_{i=1}^{n} µ_{ig} ,   µ̄_{··} = (1/G) Σ_{g=1}^{G} µ̄_{·g}   and   s_g² = 1/(n − 1) Σ_{i=1}^{n} ( µ_{ig} − µ̄_{·g} )²

[Gelman et al., 1995]. The between-sequence variance B contains a factor of n because it is based on the variance of the within-sequence means µ̄_{·g}, each of which is an average of n values µ_{ig}. If only one sequence is simulated, B cannot be calculated.
Therefore, we estimate the posterior variance of µ_g by a weighted average of W and B, namely

var⁺( µ | x ) = ((n − 1)/n) W + (1/n) B ,

which overestimates the posterior variance if the starting distribution is appropriately overdispersed, but is unbiased under stationarity, that is, when the starting distribution equals the target distribution. Meanwhile, for any finite n the within-sequence variance W should be an underestimate of var( µ | x ); in the limit T → ∞ the expectation of W approaches var( µ | x ).
We monitor convergence of the iterative simulation by estimating the factor by which the scale of the current distribution for µ might be reduced if the simulations were continued to the limit T → ∞. This potential scale reduction is estimated by

R̂ = var⁺( µ | x ) / W = [ ((n − 1)/n) W + (1/n) B ] / W ,

which tends to 1 as T → ∞. In practice, convergence is considered achieved when R̂ < 1.2 [Cornebise et al., 2005]. In the multiple-parameter case, this diagnosis must be carried out for each parameter separately, convergence being attained when all parameters have converged to their target distribution.
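The sketch below computes B, W, var⁺ and R̂ exactly as defined above for draws stored as an array of shape (n, G), with n iterations from each of G sequences; it is an illustrative restatement of the formulas, not WinBUGS output, and the toy usage at the end only checks that well-mixed sequences give R̂ close to 1.

```python
import numpy as np

def gelman_rubin(mu):
    """Potential scale reduction R-hat for an (n, G) array of draws:
    n iterations per sequence, G parallel sequences."""
    n, G = mu.shape
    seq_means = mu.mean(axis=0)                            # mu-bar_{.g}
    grand_mean = seq_means.mean()                          # mu-bar_{..}
    B = n / (G - 1) * np.sum((seq_means - grand_mean) ** 2)
    W = mu.var(axis=0, ddof=1).mean()                      # mean of the s_g^2
    var_plus = (n - 1) / n * W + B / n
    return var_plus / W

# toy usage: two well-mixed sequences should give R-hat close to 1
rng = np.random.default_rng(0)
draws = rng.normal(size=(75_500, 2))
print(gelman_rubin(draws))   # approximately 1.0; R-hat < 1.2 suggests convergence
```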
For example, consider the intervals for the grades of Case 2. Table 4.7 shows the posterior quantiles from the second halves of the Gibbs sampler sequences. In this case 75,500 iterations were sufficient for approximate convergence, with R̂ < 1.1 for all parameters.
Table 4.7: Posterior for 95% Credible Interval of Component Means and its Ratio R̂

Node   Mean     Std. dev.   2.5%     Median   97.5%    R̂
µ1     33.73    0.5143      32.72    33.73    34.74    1.000062
µ2     43.37    0.374       42.63    43.37    44.11    1.000035
µ3     51.75    0.2213      51.31    51.75    52.19    1.000017
µ4     59.29    0.2298      58.83    59.29    59.74    1.000016
µ5     64.04    0.1606      63.72    64.04    64.35    1.00001
µ6     67.44    0.1117      67.23    67.44    67.66    1.000006
µ7     71.89    0.07132     71.75    71.89    72.03    1.000004
µ8     76.48    0.08646     76.31    76.48    76.65    1.000004
µ9     80.54    0.0997      80.34    80.54    80.73    1.000005
µ10    84.32    0.151       84.02    84.32    84.61    1.000007
µ11    92.55    0.5138      91.54    92.55    93.56    1.000022

If R̂ is not near 1 for all of the estimates, then we need to continue the simulation updates. Once R̂ is near 1 for every parameter of interest, we simply collect the G × n samples from the second halves of the sequences and treat them as samples from the target distribution. Thus for Case 1 we may stop the iterations at 11 × 62 = 682 iterations and for Case 2 at 11 × 498 = 5478 iterations, at which point the sampling results for the optimal solution are approximately the same as those reported above. In the cases above, however, we aimed for the smoothest density plots before concluding that convergence was sufficient for the optimal estimates. If we instead take the ratio R̂ as the convergence diagnostic, the sampling takes less than 5 s and 15 s respectively for the two cases. Estimates of functions of the parameters are also easily obtained: suppose we seek an estimate of the distribution of γ = σ/µ, the coefficient of variation; we simply define the transformed Monte Carlo samples γ_i = σ_i / µ_i for i = 1, 2, ..., n and create kernel density estimates, as in Figure 4.2 and Figure 4.7, based on these values.
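As a sketch of this transformation step, with hypothetical arrays standing in for the stored posterior draws, the coefficient-of-variation samples are obtained by elementwise division of the draws and then smoothed with a kernel-density estimate:

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
mu_draws = rng.normal(76.5, 0.09, size=75_500)         # hypothetical posterior draws of mu_g
sigma_draws = np.sqrt(1.0 / rng.gamma(30.0, 1.0 / 900.0, size=75_500))  # and of sigma_g

gamma_draws = sigma_draws / mu_draws                   # transformed samples gamma_i = sigma_i / mu_i
kde = gaussian_kde(gamma_draws)                        # smoothed kernel-density estimate
grid = np.linspace(gamma_draws.min(), gamma_draws.max(), 200)
density = kde(grid)                                    # values to plot against the grid
print(gamma_draws.mean(), density.max())
```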
The method of monitoring convergence presented here has the key advantage of not requiring the user to examine time-series graphs (referred to in this study as “trace” graphs) of the simulated sequences. Inspection of such plots is a notoriously unreliable method of assessing convergence and is unwieldy when monitoring a large number of quantities of interest, such as can arise in complicated hierarchical models. Because the present method is based on means and variances, it is also most effective for quantities whose posterior distribution is approximately normal. Figure 4.12 shows the plots of the grade cumulative distribution functions for Case 1 (a) and Case 2 (b); the dotted lines represent the cumulative distributions of the Straight Scale and Standard Deviation methods and the smooth line is for the grades according to GB grading. Figures 4.13 and 4.14 show the cumulative density plots for each letter grade along with the histograms for Case 1 and Case 2.
[Figure 4.12: Cumulative Distribution Plots for Straight Scale (dotted line) and GB Method; (a) Case 1 and (b) Case 2.]

[Figure 4.13: Density Plots with Histogram for Case 1.]

[Figure 4.14: Density Plots and Histogram for Case 2.]
4.9	Loss Function and Leniency Factor
When making decisions about assigning letter grades, the instructor incurs a ‘loss’ whenever the accurate letter grade is missed. The loss function describes the loss that the instructor experiences by assigning a certain letter grade when the student warrants another letter grade [Alex, 2003]. For that reason, we need to minimize the expected loss to obtain the optimal letter grade, based on the probability distribution of the letter grades described in Chapter III and in the sections above.
The loss function is the objective function generally used in Bayesian statistical analysis, and it must be nonnegative [Berger, 1985; Press, 2003; Hogg et al. 2005]. Carlin and Louis (2000) show that specific loss-function forms correspond to point estimation, interval estimation and hypothesis testing. The notation used is:

prior distribution:       p(θ), θ ∈ Θ
sampling distribution:    p(x | θ)
allowable actions:        a ∈ A
decision rule:            d ∈ D : X → A
loss function:            L(θ, a)

In estimation problems, the action to be taken is the choice of an estimator θ̂, so that the action is a = θ̂. The loss function L(θ, a) quantifies the loss incurred when the true state of nature is θ and we take action a. The most often used loss functions are of quadratic form, referred to either as squared error loss (SEL) or weighted squared error loss (WSEL).
Generally, we write the quadratic loss function as

L( θ, θ̂ ) = c ( θ − θ̂ )² ,

where c is a constant. The quadratic loss function is symmetric, meaning that underestimates of θ are just as consequential as overestimates. The Bayes estimator with respect to a quadratic loss function is the mean of the posterior distribution [Press, 2003]. In this study, however, we assume that the instructor feels a different loss for overestimating and for underestimating the letter grade, so it is inappropriate to use a symmetric loss function: the instructor may feel worse about assigning a grade that is too low than about assigning one that is too high, since a grade that is too low adversely affects the student.
Suppose there is a raw-score data set X = (x_1, x_2, ..., x_n) and we wish to specify a Bayesian estimator θ̂(X) ≡ θ̂ depending upon X. The Bayesian decision maker should minimize the expected loss with respect to the decision maker's posterior distribution. Since we do not use the symmetric loss function, we take an alternative, the asymmetric loss function, which is piecewise linear:

L( θ, θ̂ ) = k_1 ( θ̂ − θ )   if θ̂ − θ ≥ 0 ,
L( θ, θ̂ ) = k_2 ( θ − θ̂ )   if θ̂ − θ < 0 .

The constants k_1 ≠ k_2 can be chosen to reflect the relative importance of underestimation and overestimation, and they will usually differ. This form of asymmetric loss function is also called the piecewise linear loss function, and its Bayes estimator is the k_2/(k_1 + k_2) percentile of the posterior [Press, 2003]. In the special case k_1 = k_2 = k, the Bayes estimator becomes the median of the posterior distribution; in this case the loss function is called the absolute error loss and equals L(θ, θ̂) = k |θ − θ̂|. Loss functions of these types are quite often a useful approximation to the true loss.
Following the asymmetric loss and the absolute loss, we have designed the loss function for assigning letter grades as

C( y_i, ŷ_i ) = c | ŷ_i − y_i |   if ŷ_i ≤ y_i ,
C( y_i, ŷ_i ) = | ŷ_i − y_i |     if ŷ_i > y_i ,        (4.3)

where y is the numeric equivalent of the letter grade that the student truly deserves, ŷ is the numeric equivalent of the letter grade that the instructor assigns, and c is a positive constant that reflects the instructor's preference. When c = 1, the instructor feels equally badly about underestimating and overestimating the grade; when c > 1, the instructor feels worse about underestimating than about overestimating; and conversely, if 0 < c < 1, the instructor feels worse about overestimating than about underestimating.
Now let the Bayes estimator be the q-th quantile with q = c/(c + 1); this is the optimal letter grade under this loss function. Since the distribution in Eq.(3.9) is not continuous, we choose the highest letter grade whose cumulative probability is less than q. If q = 0.5, the loss is symmetric and the optimal letter grade is the median [Alex, 2003]. For q > 0.5, the instructor perceives a loss from underestimating and therefore bumps the grade up. Alex (2003) calls q the Leniency Factor (LF), a factor that reflects how lenient or strict the instructor is in assigning grades. The LF can be classified as follows: 0.0 ≤ q < 0.5 means the instructor is strict, and 0.5 < q ≤ 1.0 means the instructor is lenient; in addition, Alex (2003) defines LF = 0.5 as the neutral instructor. The LFs of an instructor are summarized in Table 4.8.
Table 4.8: Leniency Factor and Loss Function Constant

Leniency Factor                                                          c            q = c/(c+1)
Lenient  (0.5 < q ≤ 1.0):   Underestimate                                c > 1        q > 0.5
Neutral  (q = 0.5):         Equally Underestimate and Overestimate       c = 1        q = 0.5
Strict   (0.0 ≤ q < 0.5):   Overestimate                                 0 < c < 1    q < 0.5
How is the LF computed for an instructor? The LF is based on how the instructor feels about overestimating and underestimating the letter grade: the constant c in the loss function is the loss the instructor attaches to underestimating relative to the loss attached to overestimating. For example, if the grader cares about underestimating the letter grade one and a half times as much as about overestimating it, then the LF is q = 1.5/(1.5 + 1) = 0.6, and the instructor is lenient in assigning grades.
Example 4.1:
Consider Case 2 again. Table 4.9 contains the probability and cumulative probability that a raw score belongs to each letter grade. The probabilities are Dirichlet distributed and, since this distribution is not continuous, we choose the highest letter grade whose cumulative probability is less than q [Press, 2003]. Now, if q = 0.5 (∴ c = 1), that is, the instructor chooses to be in neutral mode, the optimal letter grade is C+. If instead the instructor is lenient, with q > 0.5 (q = 0.55, ∴ c = 1.22), the optimal grade is B-. Assigning A- would involve a very high LF of 0.96 or above, i.e. q = 0.96 and ∴ c = 24.
Table 4.9: Cumulative Probability for GB; Case 2

         GB                 Probability    Cumulative
Grade    From     To                       Probability
E        0        41        0.03145        0.03145
D        42       50        0.03927        0.07072
D+       51       57        0.0334         0.10412
C-       58       62        0.04322        0.14734
C        63       66        0.05497        0.20231
C+       67       70        0.09038        0.29269
B-       71       75        0.2593         0.55199
B        76       79        0.1945         0.74649
B+       80       83        0.1297         0.87619
A-       84       90        0.08255        0.95874
A        91       100       0.04125        0.99999
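A small sketch of this decision rule is given below. It uses the grade probabilities copied from Table 4.9 and applies the rule exactly as stated above (the highest letter grade whose cumulative probability is still below q = c/(c+1)); the function name is only illustrative.

```python
GRADES = ["E", "D", "D+", "C-", "C", "C+", "B-", "B", "B+", "A-", "A"]
PROBS  = [0.03145, 0.03927, 0.0334, 0.04322, 0.05497, 0.09038,
          0.2593, 0.1945, 0.1297, 0.08255, 0.04125]        # Table 4.9, Case 2

def optimal_grade(c):
    """Optimal letter grade under the loss of Eq.(4.3): the highest grade
    whose cumulative probability stays below the leniency factor q = c/(c+1)."""
    q = c / (c + 1.0)
    cum, best = 0.0, GRADES[0]
    for grade, p in zip(GRADES, PROBS):
        cum += p
        if cum < q:
            best = grade     # cumulative probability up to this grade is still below q
    return q, best

print(optimal_grade(1.0))    # neutral instructor, q = 0.5
print(optimal_grade(24.0))   # very lenient instructor, q = 0.96
```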
4.10	Performance Measures
Two measures are used to determine how well the grading methods perform. First, we refer to the average loss defined in Eq.(4.3) and introduce the class loss (CC) as

CC = (1/n) Σ_{i=1}^{n} C_i ,        (4.4)
where n is the number of students in the class. Taking the grades assigned by the instructor in the class as the “true” grades, a lower CC means that a grading method assigns grades closer to those actually assigned by the instructor. For Case 1, see Appendix E: Table E1, Table E2, Table E3, Table E and Table E5. We have computed CC as shown in the first two columns of Table 4.10; note that in this example we compare the Straight Scale and GB methods against the grades assigned by the instructor in the class.
Another way to evaluate the performance of a grading plan is the raw coefficient of determination. Before turning to this coefficient, recall that the coefficient of correlation, r, is a statistical measure of how closely data fit a line; it measures the strength and direction of the relationship between two variables. The correlation coefficient ranges from -1 to +1: if there is a strong positive linear relationship between the variables, r is close to +1; if there is a strong negative linear relationship, r is close to -1; and when there is no linear relationship, or only a weak one, r is close to 0. The raw coefficient of determination is given by

R²_r = 1 − ( Σ_{i=1}^{n} e_i² ) / ( Σ_{i=1}^{n} y_i² ) ,

where e_i = y_i − ŷ_i and 0 ≤ R²_r ≤ 1. The raw coefficient of determination measures the variation of the dependent variable that is explained by the regression line and the independent variable. The value of R²_r is usually expressed as a percentage, ranging from 0% to 100%; the closer the value is to 100%, the better the model represents the data. Moreover, a higher value indicates that a grading method gets closer to the grades actually assigned. For the raw coefficients of determination in Case 1, see Table 4.10; these values are computed in Appendix E, Table E1-continued, Table E3-continued and Table E5-continued.
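The two performance measures can be computed directly from the numeric grade equivalents. The sketch below assumes hypothetical arrays y (grades actually assigned by the instructor) and y_hat (grades suggested by a grading method), both already converted to numbers, and uses the loss of Eq.(4.3) together with the definitions of CC and R²_r above.

```python
import numpy as np

def class_loss(y, y_hat, c=1.0):
    """Class loss CC of Eq.(4.4), averaging the loss C(y_i, y_hat_i) of Eq.(4.3)."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    diff = np.abs(y_hat - y)
    loss = np.where(y_hat <= y, c * diff, diff)   # underestimates weighted by c
    return loss.mean()

def raw_r2(y, y_hat):
    """Raw coefficient of determination R^2_r = 1 - sum(e_i^2)/sum(y_i^2)."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    e = y - y_hat
    return 1.0 - (e ** 2).sum() / (y ** 2).sum()

# hypothetical numeric grade equivalents for a few students
y     = [4.0, 3.7, 3.0, 2.3, 2.0]      # grades actually assigned by the instructor
y_hat = [4.0, 3.3, 3.0, 2.7, 2.0]      # grades suggested by a grading method
print(class_loss(y, y_hat, c=1.0), class_loss(y, y_hat, c=1.5), raw_r2(y, y_hat))
```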
Table 4.10: Performance of GB, Straight Scale and Standard Deviation Methods: Case 1

                       Neutral CC    Lenient CC    R²_r (%)
Straight Scale         0.7903        1.2677        98.98
Standard Deviation     1.4839        1.4839        93.71
GB                     0.1935        0.3097        99.66
Table 4.10 shows that R²_r for GB is higher than for the Straight Scale and Standard Deviation methods; the GB method therefore gets closer to the grades actually assigned by the instructor than either of the other methods. The difference between GB and the Straight Scale is small, since the percentage difference is low, but GB and the Standard Deviation method differ substantially because of the large difference in the R²_r values. This conclusion is supported by the CC values, which for both the lenient and the neutral settings are lower for GB than for the Straight Scale and Standard Deviation methods.
CHAPTER 5
CONCLUSION AND SUGGESTION
5.1	Conclusion
This study describes a personal grading plan built on a statistical approach; we have designed a statistical model to deal with grading philosophy. Grades are the instructors' evaluations of the academic work their students complete and of their performance in a laboratory, on stage or in a studio. Discussions of educational standards sometimes refer to instructors' grading plans and at other times to the expectations instructors communicate to students. Since philosophies and instruction change as the curriculum changes, instructors need to be prepared to adjust their grading plans accordingly. Furthermore, instructors should check what the grade distributions in their department have been like at their course level; normally, the written university policy is the norm against which the reasonableness of each instructor's grades will be judged.
Now consider grading on the curve, namely the Standard Deviation method, and the conditional Bayesian method. The Standard Deviation method takes into account the difficulty level of the examination, and the cutoff points are not tied to random error. When the instructor has some notion of what the grade distribution should be like, some trial and error might be needed to decide how many standard deviations each grade cutoff should be from the composite average; in other words, the mean and the gaps are decided somewhat arbitrarily. When grading on a standard deviation scale is desired, this is the most attractive method despite its computational requirements. However, curving grades within a single class is meaningless unless it is interpreted relative to the group of students being scored against, and we also do not recommend this method for classes with non-normal score distributions.
The conditional Bayesian method is inspired by an existing method called the Distribution Gap method. It screens students according to their performance relative to their peers and is useful in competitive circumstances where the feedback allows students to compare their performance with that of their peers; moreover, it requires no fixed percentages in advance. Essentially, this method removes the subjectivity from the Distribution Gap method, making it more applicable. Conditional Bayesian grading reflects the common belief that a class is composed of several subgroups, each of which should be assigned a different grade, and in this study we have shown that conditional Bayesian grading successfully separates the letter grades. In applying the conditional Bayesian method, the instructor needs to determine his or her own Leniency Factor, an intuitive measure that reflects how lenient the instructor wants to be in assigning letter grades; if the instructor is lenient, the suggested leniency factor is around 0.6.
In this study, we carried out a couple of experiments in which an experienced instructor assigned letter grades by judgmental grading. We then used the conditional Bayesian, Standard Deviation and Straight Scale methods to assign letter grades based on the raw scores. The study provides evidence that Bayesian grading gets very close to what the instructor actually assigned; the students benefit academically and the instructors improve the quality of their grade assignment. Another advantage of conditional Bayesian grading is that an instructor using it does not have to be experienced and does not have to spend much time going through all the raw scores, since all the work of assigning grades is done by the computer, and the quality of this grading method is as good as the judgmental grading of an experienced instructor. Conditional Bayesian grading is therefore easy to apply. Its main difficulty is that students may not easily understand the method, since they are used to the Straight Scale method; students of statistics, measurement and evaluation classes may understand it.
5.2	Suggestions
In this study we have given a detailed account of the development of grading methods and of conditional Bayesian grading through mixture modeling. For the conditional Bayesian method, a Gibbs sampler written from scratch is often not worth programming (unless it can be implemented quickly, for example in WinBUGS), since the chance of it failing to converge is too high; the challenge ahead is the assurance of its convergence. A drawback of using a Bayesian mixture model is the difficulty of simulating the high-dimensional, variable-dimensional target measure that is characteristic of such problems. An extension of Bayesian grading would be to include the instructor's attributes, more than one raw score per student, and students' attributes such as class attendance, personality factors and other variables that can affect the students' raw scores. Interested researchers can address this question as a multivariate mixture-modeling problem; the challenge is to include these attributes in the model used to assign grades to the students.
It is, however, premature to conclude from this study that conditional Bayesian grading is unambiguously desirable. The current study helps us to understand better the effect of grading methods, but it may still be the case that our measure of the instructor's grading method merely reflects some other unmeasured instructor attribute. Before the conditional Bayesian method can be adopted as a policy outcome, it is important to understand its distributional consequences at all levels, including the students, and these consequences must be understood by policymakers in order to implement a policy for a grading standard. In general, however, the Bayesian grading method is more appropriate and efficient.
REFERENCES
Alex, S. (2003). A Method for Assigning Letter Grades: Multi-Curve Grading. Dept. of
Economics, University of California - Irvine, 3151 SSPA, Irvine.
Ash, R.B., (1972), Real Analysis and Probability. New York: Academic Press Inc.
Berger, J.O. (1985), Statistical Decision Theory and Bayesian Analysis. 2nd Edition.
New York; Springer-Verlag New York, Inc.
Birnbaum, D.J., (2001). Grading System for Russian Fairy Tales.
www.clover.slavic.pitt.edu/~tales/02-1/grading.html
Box, G.E.P. and Tiao, G.C. (1973). Bayesian Inference in Statistical Analysis.
Massachusetts: Addison-Wesley Publishing Company, Inc.
Casella, G. and George, E. I. (1992). Explaining the Gibbs Sampler. The American
Statistician, Vol. 46, No. 3, pp. 167-174. American Statistical Association.
Congdon, P (2003). Applied Bayesian Modelling. West Sussex, England: John Wiley &
Son Ltd.
Cornebise, J., Maumy, M., and Philippe, G. A. (2005). Practical Implementation of the
Gibbs Sampler for Mixtures of Distributions: Application to the Determination of
Specifications in the Food Industry.
www.stat.ucl.ac.be/~lambert/BiostatWorkshop2005/slidesMaumy.pdf
Cross, L.H., (1995). Grading Students. ED398239. ERIC Clearinghouse on Assessment
and Evaluation, Washington DC. www.ericfacility.net/ericdigest.
Ebel, R. L. and Frisbie, D. A. (1991). Essentials of Educational Measurement. 3rd ed.
Englewood Cliffs, NJ: Prentice-Hall, Inc.
Figlio, D.N. and Lucas, M.E. (2003). Do High Grading Standards Affect Student
Performance? Journal of Public Economics 88 (2004), 1815-1834.
Frisbie, D.A and Waltman, K.K.(1992). Developing a personal grading plan.
Educational Measurement: Issues and Practice. Iowa: National Council on
Measurement in Education.
Gelman, A., Carlin, J.B., Stern, H.S., and Rubin, D.B. (1995). Bayesian Data Analysis.
London: Chapman & Hall.
Glass, G.V., and Stanley, J.C.(1970). Statistical Methods in Education and Psychology.
Englewood Cliffs, New Jersey: Prentice-Hall, Inc.
Hogg, R.V. and Craig, A.T. (1978). Introduction to Mathematical Statistics. 4th ed. New
York: MacMillan Publishing Co.,Inc.
Hogg, R.V., McKean, J.W. and Craig, A.T. (2005). Introduction to Mathematical
Statistics. 6th ed. New Jersey: Pearson Prentice Hall.
Jasra, A., Holmes, C.C., and Stephens, D.A. (2005). Markov Chain Monte Carlo Methods
and the Label Switching Problem in Bayesian Mixture Modeling. Statistical Science,
Vol. 20, No. 1, 50-57. Institute of Mathematical Statistics.
Johnson, B. and Christensen, L.(2000). Educational Research: Chapter 5 -Quantitative
and Qualitative Approaches. 2nd ed. Alabama ,Allyn and Bacon Inc.
Jones, P.N., (1991). Mixture Distributions Fitted to Instrument Counts. Rev. Sci.
Instrum. 62(5); Australia: American Institute of Physics.
Lawrence, D.A. (2005). A Guide to Teaching & Learning Practices: Chapter 13 - Grading.
Tallahassee: Florida State University. Unpublished.
Lawrence, H.C.(1995). Grading Students. ED398239. ERIC Clearinghouse on
Assessment and Evaluation, Washington, DC.
Lee, P.M., (1989). Bayesian Statistics. New York. Oxford University Press.
Martuza, V.R. (1977). Applying Norm-Referenced and Criterion-Referenced:
Measurement and Evaluation. Boston Massachusetts: Allyn and Bacon Inc.
Merle, W.T. (1968). Statistics in Education and psychology: A First Course. New York:
The Macmillan Company.
Newman, K. (2005). Bayesian Inference. Lecture Notes. http://www.creem.stand.ac.uk/ken/mt4531.html.
Peers, I.S.(1996). Statistical Analysis For Education and Psychology Researchers.
London: Falmer Press.
Dubey, P. and Geanakoplos, J. (2005). Grading Exams: 100, 99, ... , 1 or A, B, C?
Incentives in Games of Status. 3-6.
Press, S.J. (2003). Subjective and Objective Bayesian Statistics: Principles, Models, and
Applications. New Jersey: John Wiley & Sons, Inc.
Raftery, A.E. (1996). Hypothesis Testing and Model Selection. In W.R. Gilks, S. Richardson
and D.J. Spiegelhalter (eds), Markov Chain Monte Carlo in Practice, Chapman and Hall,
pp. 163-188.
Robert, L.W. (1972). An Introduction to Bayesian Inference and Decision. New York:
Holt, Rinehart and Winston, Inc.
Robert, M.H. (1998). Assessment and Evaluation of Developmental Learning:
Qualitative Individual Assessment and Evaluation Models. Westport: Greenwood
Publishing Group Inc.
Spencer, C. (1983). Grading on the Curve. Antic, 1(6): 64
Stanley, J.C. and Hopkins, K.D. (1972). Educational Psychological Measurement and
Evaluation. Englewood Cliffs, NJ.: Prentice Hall, Inc.
Stephens, M. (2000). Bayesian Analysis of Mixture Models with an Unknown Number of
Components - An Alternative to Reversible Jump Methods. The Annals of Statistics,
Vol. 28, No. 1, 40-74. University of Oxford.
Walsh, B. (2004). Markov Chain Monte Carlo. Lecture Notes for EEB 581.
Walvoord, B.E. and Anderson, V.J. (1998). Effective Grading: A Tool for Learning and
Assessment. San Francisco: Jossey-Bass Publisher.
APPENDIX A1
Table A.1: Normal Distribution Table (area φ(z) under the standard normal curve between 0 and z)

  z    0.00   0.01   0.02   0.03   0.04   0.05   0.06   0.07   0.08   0.09
 0.0  0.0000 0.0040 0.0080 0.0120 0.0160 0.0199 0.0239 0.0279 0.0319 0.0359
 0.1  0.0398 0.0438 0.0478 0.0517 0.0557 0.0596 0.0636 0.0675 0.0714 0.0753
 0.2  0.0793 0.0832 0.0871 0.0910 0.0948 0.0987 0.1026 0.1064 0.1103 0.1141
 0.3  0.1179 0.1217 0.1255 0.1293 0.1331 0.1368 0.1406 0.1443 0.1480 0.1517
 0.4  0.1554 0.1591 0.1628 0.1664 0.1700 0.1736 0.1772 0.1808 0.1844 0.1879
 0.5  0.1915 0.1950 0.1985 0.2019 0.2054 0.2088 0.2123 0.2157 0.2190 0.2224
 0.6  0.2257 0.2291 0.2324 0.2357 0.2389 0.2422 0.2454 0.2486 0.2517 0.2549
 0.7  0.2580 0.2611 0.2642 0.2673 0.2704 0.2734 0.2764 0.2794 0.2823 0.2852
 0.8  0.2881 0.2910 0.2939 0.2967 0.2995 0.3023 0.3051 0.3078 0.3106 0.3133
 0.9  0.3159 0.3186 0.3212 0.3238 0.3264 0.3289 0.3315 0.3340 0.3365 0.3389
 1.0  0.3413 0.3438 0.3461 0.3485 0.3508 0.3531 0.3554 0.3577 0.3599 0.3621
 1.1  0.3643 0.3665 0.3686 0.3708 0.3729 0.3749 0.3770 0.3790 0.3810 0.3830
 1.2  0.3849 0.3869 0.3888 0.3907 0.3925 0.3944 0.3962 0.3980 0.3997 0.4015
 1.3  0.4032 0.4049 0.4066 0.4082 0.4099 0.4115 0.4131 0.4147 0.4162 0.4177
 1.4  0.4192 0.4207 0.4222 0.4236 0.4251 0.4265 0.4279 0.4292 0.4306 0.4319
 1.5  0.4332 0.4345 0.4357 0.4370 0.4382 0.4394 0.4406 0.4418 0.4429 0.4441
 1.6  0.4452 0.4463 0.4474 0.4484 0.4495 0.4505 0.4515 0.4525 0.4535 0.4545
 1.7  0.4554 0.4564 0.4573 0.4582 0.4591 0.4599 0.4608 0.4616 0.4625 0.4633
 1.8  0.4641 0.4649 0.4656 0.4664 0.4671 0.4678 0.4686 0.4693 0.4699 0.4706
 1.9  0.4713 0.4719 0.4726 0.4732 0.4738 0.4744 0.4750 0.4756 0.4761 0.4767
 2.0  0.4772 0.4778 0.4783 0.4788 0.4793 0.4798 0.4803 0.4808 0.4812 0.4817
 2.1  0.4821 0.4826 0.4830 0.4834 0.4838 0.4842 0.4846 0.4850 0.4854 0.4857
 2.2  0.4861 0.4864 0.4868 0.4871 0.4875 0.4878 0.4881 0.4884 0.4887 0.4890
 2.3  0.4893 0.4896 0.4898 0.4901 0.4904 0.4906 0.4909 0.4911 0.4913 0.4916
 2.4  0.4918 0.4920 0.4922 0.4925 0.4927 0.4929 0.4931 0.4932 0.4934 0.4936
 2.5  0.4938 0.4940 0.4941 0.4943 0.4945 0.4946 0.4948 0.4949 0.4951 0.4952
 2.6  0.4953 0.4955 0.4956 0.4957 0.4959 0.4960 0.4961 0.4962 0.4963 0.4964
 2.7  0.4965 0.4966 0.4967 0.4968 0.4969 0.4970 0.4971 0.4972 0.4973 0.4974
 2.8  0.4974 0.4975 0.4976 0.4977 0.4977 0.4978 0.4979 0.4979 0.4980 0.4981
 2.9  0.4981 0.4982 0.4982 0.4983 0.4984 0.4984 0.4985 0.4985 0.4986 0.4986
 3.0  0.4987 0.4987 0.4987 0.4988 0.4988 0.4989 0.4989 0.4989 0.4990 0.4990
 3.1  0.4990 0.4991 0.4991 0.4991 0.4992 0.4992 0.4992 0.4992 0.4993 0.4993
 3.2  0.4993 0.4993 0.4994 0.4994 0.4994 0.4994 0.4994 0.4995 0.4995 0.4995
 3.3  0.4995 0.4995 0.4995 0.4996 0.4996 0.4996 0.4996 0.4996 0.4996 0.4997
 3.4  0.4997 0.4997 0.4997 0.4997 0.4997 0.4997 0.4997 0.4997 0.4997 0.4998
 3.5  0.4998 0.4998 0.4998 0.4998 0.4998 0.4998 0.4998 0.4998 0.4998 0.4998
 3.6  0.4998 0.4998 0.4999 0.4999 0.4999 0.4999 0.4999 0.4999 0.4999 0.4999
 3.7  0.4999 0.4999 0.4999 0.4999 0.4999 0.4999 0.4999 0.4999 0.4999 0.4999
 3.8  0.4999 0.4999 0.4999 0.4999 0.4999 0.4999 0.4999 0.4999 0.4999 0.4999
 3.9  0.5000
APPENDIX A2
Grading via Standard Deviation Method for Selected Means and Standard Deviation
(T = µ + zσ; the eleven letter grades E, D, D+, C-, C, C+, B-, B, B+, A-, A occupy the
successive intervals between the tabulated boundaries T. Values are reproduced as printed
in the original table.)

σ = 10
    z       µ=50   µ=51   µ=55   µ=59   µ=60   µ=63   µ=65   µ=70
 -1.9998   30.00  31.00  30.00  39.00  40.00  43.00  45.00  50.00
 -1.6362   33.64  34.64  33.64  42.64  43.64  46.64  48.64  53.64
 -1.2726   37.27  38.27  37.27  46.27  47.27  50.27  52.27  57.27
 -0.9090   40.91  41.91  40.91  49.91  50.91  53.91  55.91  60.91
 -0.5454   44.55  45.55  44.55  53.55  54.55  57.55  59.55  64.55
 -0.1818   48.18  49.18  48.18  57.18  58.18  61.18  63.18  68.18
  0.1818   51.82  52.82  56.82  60.82  61.82  64.82  66.82  71.82
  0.5454   55.45  56.45  60.45  64.45  65.45  68.45  68.45  75.45
  0.9090   59.09  60.09  64.09  68.09  69.09  72.09  72.09  79.09
  1.2726   62.73  63.73  67.73  71.73  72.73  75.73  75.73  82.73
  1.6362   66.36  67.36  71.36  75.36  76.36  79.36  79.36  86.36
  1.9998   70.00  71.00  75.00  79.00  80.00  83.00  83.00  90.00

σ = 8.5
    z       µ=50   µ=51   µ=55   µ=59   µ=60   µ=63   µ=65   µ=70
 -1.9998   33.00  34.00  38.00  42.00  43.00  46.00  48.00  53.00
 -1.6362   36.09  37.09  41.09  45.09  46.09  49.09  51.09  56.09
 -1.2726   39.18  40.18  44.18  48.18  49.18  52.18  54.18  59.18
 -0.9090   42.27  43.27  47.27  51.27  52.27  55.27  57.27  62.27
 -0.5454   45.36  46.36  50.36  54.36  55.36  58.36  60.36  65.36
 -0.1818   48.45  49.45  53.45  57.45  58.45  61.45  63.45  68.45
  0.1818   51.55  52.55  56.55  60.55  61.55  64.55  66.55  71.55
  0.5454   54.64  55.64  59.64  63.64  64.64  67.64  69.64  74.64
  0.9090   57.73  58.73  62.73  66.73  67.73  70.73  72.73  77.73
  1.2726   60.82  61.82  65.82  69.82  70.82  73.82  75.82  80.82
  1.6362   63.91  64.91  68.91  72.91  73.91  76.91  78.91  83.91
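A small sketch of how the boundaries in Table A.2 can be generated, assuming (as the table
suggests) that each boundary is the transformed score T = µ + zσ at the listed z cut points;
the function names and the treatment of scores outside the tabulated range are illustrative
assumptions, not part of the thesis.

import numpy as np

# z cut points used in Table A.2 (twelve boundaries for the eleven letter grades)
Z_CUTS = np.array([-1.9998, -1.6362, -1.2726, -0.9090, -0.5454, -0.1818,
                    0.1818,  0.5454,  0.9090,  1.2726,  1.6362,  1.9998])
GRADES = ["E", "D", "D+", "C-", "C", "C+", "B-", "B", "B+", "A-", "A"]

def sd_boundaries(mu, sigma):
    # Transformed-score boundaries T = mu + z * sigma for the Standard Deviation method
    return mu + Z_CUTS * sigma

def sd_grade(score, mu, sigma):
    # Assign a letter grade by locating the score among the T boundaries (assumption:
    # scores below the lowest boundary receive E, scores above the highest receive A)
    T = sd_boundaries(mu, sigma)
    idx = int(np.clip(np.searchsorted(T, score) - 1, 0, len(GRADES) - 1))
    return GRADES[idx]

print(np.round(sd_boundaries(50, 10), 2))   # 30.0, 33.64, ..., 70.0 as in Table A.2
print(sd_grade(72, 70, 8.5))                # illustrative call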
APPENDIX B
The Probability Set Function
Definition B1
If $P(C)$ is defined for each subset $C$ of the space $\Omega$ under consideration, and if

(a) $P(C) \ge 0$,

(b) $P(C_1 \cup C_2 \cup C_3 \cup \cdots) = P(C_1) + P(C_2) + P(C_3) + \cdots$, where the sets
$C_i$, $i = 1, 2, 3, \ldots$, are such that no two have a point in common, that is,
$C_i \cap C_j = \varnothing$ for $j \ne i$,

(c) $P(\Omega) = 1$,

then $P(C)$ is called the probability set function of the outcome of the random experiment.
For each subset $C$ of $\Omega$, the number $P(C)$ is called the probability that the outcome
of the random experiment is an element of the set $C$, or the probability of the event $C$,
or the probability measure of the set $C$. A probability set function tells us how the
probability is distributed over the various subsets $C$ of a sample space $\Omega$.

In Definition B1(b), if these subsets are such that no two have an element in common, they
are called mutually disjoint sets and the corresponding events $C_1, C_2, C_3, \ldots$ are
said to be mutually exclusive events. Moreover, if $\Omega = C_1 \cup C_2 \cup C_3 \cup \cdots$,
the mutually exclusive events are also exhaustive, and the probability of their union is then
equal to 1.

The probability set function induced by a random variable $X$ is sometimes denoted $P_X(A)$:
for a set $A$ in the space of $X$, let $C = \{c \in \Omega : X(c) \in A\}$; then
$\Pr\{X \in A\} = P_X(A) = P(C)$. The probability $P_X(A)$ is often called an induced
probability (Hogg and Craig, 1978).
Mixture Model
Example B1
Consider the model
$$X_i \mid \lambda \sim \text{iid Poisson}(\lambda), \qquad \Theta \sim \Gamma(\alpha, \beta),
\quad \alpha \text{ and } \beta \text{ known}.$$
A random sample is drawn from a Poisson distribution with mean $\lambda$, and the prior
distribution is a $\Gamma(\alpha, \beta)$ distribution. We let
$\mathbf{X}' = \{X_1, X_2, \ldots, X_n\}$. The joint conditional distribution of $\mathbf{X}$,
given $\Theta = \lambda$, is derived as follows.

Poisson pdf:
$$f(x \mid \lambda) = \begin{cases} \dfrac{e^{-\lambda}\lambda^{x}}{x!}, & x = 0, 1, 2, \ldots \\ 0, & \text{elsewhere.} \end{cases}$$

Likelihood function:
$$L(\mathbf{x} \mid \lambda) = \prod_{i=1}^{n} f(x_i \mid \lambda)
= f(x_1 \mid \lambda) f(x_2 \mid \lambda) \cdots f(x_n \mid \lambda)
= \prod_{i=1}^{n} \frac{e^{-\lambda}\lambda^{x_i}}{x_i!}
= \frac{e^{-n\lambda}\,\lambda^{\sum_{i=1}^{n} x_i}}{\prod_{i=1}^{n} x_i!}.$$

The prior pdf:
$$\pi(\lambda) = \frac{\lambda^{\alpha-1} e^{-\lambda/\beta}}{\Gamma(\alpha)\,\beta^{\alpha}},
\qquad 0 < \lambda < \infty.$$

Hence the joint mixed continuous pdf [Eq. 3.4 or 3.5] is given by
$$L(\mathbf{x} \mid \lambda)\,\pi(\lambda)
= \frac{e^{-n\lambda}\,\lambda^{\sum_{i=1}^{n} x_i}}{\prod_{i=1}^{n} x_i!}
\cdot \frac{\lambda^{\alpha-1} e^{-\lambda/\beta}}{\Gamma(\alpha)\,\beta^{\alpha}},$$
provided that $0 < \lambda < \infty$ and $x_i = 0, 1, 2, \ldots$ for $i = 1, 2, \ldots, n$,
and equal to zero elsewhere. The marginal distribution of the sample is then
$$m(\mathbf{x}) = \int_{0}^{\infty}
\frac{e^{-\lambda(n + 1/\beta)}\,\lambda^{\sum x_i + \alpha - 1}}{\prod x_i!\,\Gamma(\alpha)\,\beta^{\alpha}}\, d\lambda
= \frac{1}{\prod x_i!\,\Gamma(\alpha)\,\beta^{\alpha}}
\int_{0}^{\infty} e^{-\lambda(n + 1/\beta)}\,\lambda^{\sum x_i + \alpha - 1}\, d\lambda .$$
With the substitution $z = \lambda\,\dfrac{n\beta + 1}{\beta}$, so that
$\lambda = \dfrac{\beta z}{n\beta + 1}$ and $d\lambda = \dfrac{\beta}{n\beta + 1}\, dz$, and
using $\Gamma(\alpha) = \int_{0}^{\infty} z^{\alpha-1} e^{-z}\, dz$, the integral evaluates to
$$m(\mathbf{x}) = \frac{1}{\prod x_i!\,\Gamma(\alpha)\,\beta^{\alpha}}
\cdot \frac{\Gamma\!\left(\sum x_i + \alpha\right)}{\left(\dfrac{n\beta + 1}{\beta}\right)^{\sum x_i + \alpha}} .$$

Therefore the posterior pdf of $\Theta$, given $\mathbf{X} = \mathbf{x}$, is
$$f(\lambda \mid \mathbf{x}) = \frac{L(\mathbf{x} \mid \lambda)\,\pi(\lambda)}{m(\mathbf{x})}
= \frac{e^{-\lambda\left(\frac{n\beta + 1}{\beta}\right)}\,\lambda^{\sum x_i + \alpha - 1}}
{\Gamma\!\left(\sum x_i + \alpha\right)\left(\dfrac{\beta}{n\beta + 1}\right)^{\sum x_i + \alpha}},
\qquad (**)$$
provided $0 < \lambda < \infty$ and $x_i = 0, 1, 2, \ldots$ for $i = 1, 2, \ldots, n$, and
equal to zero elsewhere. We can see that $(**)$ is of the Gamma type with
$$\alpha^{*} = \sum_{i=1}^{n} x_i + \alpha \qquad \text{and} \qquad
\beta^{*} = \frac{\beta}{n\beta + 1}.$$

When the joint mixed pdf is divided by the marginal distribution to obtain the posterior
distribution, the denominator of $(**)$ does not depend upon $\lambda$; it depends only upon
the observed data $\mathbf{x}$. We may therefore write the denominator as a constant depending
on $\mathbf{x}$, say $c(\mathbf{x})$, so that the posterior is rewritten as
$$f(\lambda \mid \mathbf{x}) = c(\mathbf{x})\,\lambda^{\sum x_i + \alpha - 1}\,
e^{-\lambda\left(\frac{n\beta + 1}{\beta}\right)}, \qquad 0 < \lambda < \infty,$$
where
$$c(\mathbf{x}) = \frac{1}{\Gamma\!\left(\sum x_i + \alpha\right)
\left(\beta/(n\beta + 1)\right)^{\sum x_i + \alpha}} .$$

In addition, we may say that the posterior distribution is proportional to
$L(\mathbf{x} \mid \lambda)\,\pi(\lambda)$, that is,
$$f(\lambda \mid \mathbf{x}) \propto L(\mathbf{x} \mid \lambda)\,\pi(\lambda).$$
Generally we write this form as
$$f(\theta \mid \mathbf{x}) \propto L(\mathbf{x} \mid \theta)\,\pi(\theta),$$
where $\theta$ is the parameter of interest. In words,
$$\text{posterior distribution} \propto \text{prior distribution} \times \text{likelihood}.$$
Note that on the right-hand side of these relations all factors involving constants and
$\mathbf{x}$ alone (not $\theta$) can be dropped. For this example we simply write
$$f(\lambda \mid \mathbf{x}) \propto \lambda^{\sum x_i + \alpha - 1}\,
e^{-\lambda\left(\frac{n\beta + 1}{\beta}\right)},$$
or equivalently
$$f(\lambda \mid \mathbf{x}) \propto \lambda^{\sum x_i}\, e^{-n\lambda}\,
\lambda^{\alpha - 1} e^{-\lambda/\beta}. \qquad \blacksquare$$
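The conjugate form derived above is easy to check numerically. The following Python sketch,
which is not part of the thesis, compares the closed-form Gamma posterior with a brute-force
normalisation of likelihood times prior over a grid of λ values; the prior constants and the
Poisson counts are illustrative.

import numpy as np
from scipy import stats

alpha, beta = 3.0, 2.0                       # illustrative prior: Gamma(alpha, scale=beta)
x = np.array([4, 2, 5, 3, 4])                # illustrative Poisson counts
n, s = len(x), x.sum()

# Closed-form posterior from the derivation: Gamma(sum(x) + alpha, scale = beta / (n*beta + 1))
post = stats.gamma(a=s + alpha, scale=beta / (n * beta + 1))

# Brute-force check: normalise likelihood * prior on a lambda grid
lam = np.linspace(1e-6, 15, 20001)
dlam = lam[1] - lam[0]
log_kernel = (stats.poisson.logpmf(x[:, None], lam).sum(axis=0)
              + stats.gamma.logpdf(lam, a=alpha, scale=beta))
kernel = np.exp(log_kernel - log_kernel.max())
numeric_pdf = kernel / (kernel.sum() * dlam)

print(np.max(np.abs(numeric_pdf - post.pdf(lam))))   # small: the two densities agree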
APPENDIX C
Example C1: Weighting Grade Components
Suppose a student took two tests, one project assignment and one final exam, and the
instructor wants to weight the corresponding components at 20%, 20%, 10% and 50%
respectively. First, the instructor constructs the probability distribution of the letter
grade based on each component. These distributions are then combined using the weights: the
first test's distribution multiplied by 20% is added to the second test's distribution
multiplied by 20%, to the project's distribution multiplied by 10%, and to the final exam's
distribution multiplied by 50%. A sketch of this weighting is given below.
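The sketch below is illustrative only; the four per-component letter-grade distributions are
made up, and only the weights 20%, 20%, 10% and 50% come from the example.

import numpy as np

grades = ["A", "B", "C", "D", "E"]
weights = np.array([0.20, 0.20, 0.10, 0.50])       # test 1, test 2, project, final exam

# Illustrative letter-grade probability distributions for each component (each row sums to 1)
components = np.array([
    [0.30, 0.40, 0.20, 0.07, 0.03],   # test 1
    [0.25, 0.35, 0.25, 0.10, 0.05],   # test 2
    [0.50, 0.30, 0.15, 0.04, 0.01],   # project
    [0.20, 0.30, 0.30, 0.15, 0.05],   # final exam
])

combined = weights @ components                    # weighted mixture of the four distributions
for g, p in zip(grades, combined):
    print(f"{g}: {p:.3f}")                         # the combined distribution still sums to 1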
APPENDIX D
Some Useful Integrals
The Gamma, Inverse Gamma and Related Integrals
For $\alpha > 0$ and $\beta > 0$:

$$(i)\quad \int_{0}^{\infty} x^{\alpha - 1} e^{-\beta x}\, dx = \Gamma(\alpha)\,\beta^{-\alpha}$$

$$(ii)\quad \int_{0}^{\infty} x^{-(\alpha + 1)} e^{-\beta/x}\, dx = \Gamma(\alpha)\,\beta^{-\alpha}$$

$$(iii)\quad \int_{0}^{\infty} x^{\alpha - 1} e^{-\beta x^{2}}\, dx = \tfrac{1}{2}\,\Gamma(\alpha/2)\,\beta^{-\alpha/2}$$

$$(iv)\quad \int_{0}^{\infty} x^{-(\alpha + 1)} e^{-\beta x^{-2}}\, dx = \tfrac{1}{2}\,\Gamma(\alpha/2)\,\beta^{-\alpha/2}$$

Generally, for $\alpha > 0$, $\beta > 0$ and $a > 0$:

$$\int_{0}^{\infty} x^{\alpha - 1} e^{-\beta x^{a}}\, dx
= \int_{0}^{\infty} x^{-(\alpha + 1)} e^{-\beta x^{-a}}\, dx
= \frac{1}{a}\,\Gamma\!\left(\frac{\alpha}{a}\right)\beta^{-\alpha/a},$$

where $\Gamma(\alpha) = \int_{0}^{\infty} x^{\alpha - 1} e^{-x}\, dx$.
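As a quick numerical check of the general identity above, one can integrate both forms with
scipy and compare them against the closed-form value; the constants below are arbitrary
illustrative choices and the snippet is not part of the thesis.

import numpy as np
from scipy import integrate
from scipy.special import gamma

alpha, beta, a = 2.5, 1.7, 3.0        # illustrative positive constants

closed_form = gamma(alpha / a) * beta ** (-alpha / a) / a

lhs, _ = integrate.quad(lambda x: x ** (alpha - 1) * np.exp(-beta * x ** a), 0, np.inf)
rhs, _ = integrate.quad(lambda x: x ** (-(alpha + 1)) * np.exp(-beta * x ** (-a)), 0, np.inf)

print(closed_form, lhs, rhs)          # all three values should agree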
APPENDIX E
WinBUGS for Bayesian Grading
Specification Tool: used to check the model syntax (WinBUGS reports that "the model is
syntactically correct"), to enter the data, to compile the model, and to enter or generate
the initial values.

Sample Monitor Tool: used to select the nodes to be monitored; WinBUGS saves a file of the
values of each monitored node generated by MCMC, from which the posterior distribution can
be explored.

Update Tool: used to run the model by entering the desired number of iterations of the chain.

[Figure: complete WinBUGS session showing the model, the tool windows, and the output windows.]
Case 1: Small Class
a) Model
MODEL {
    # Likelihood: each raw score Y[i] belongs to an unknown letter-grade group G[i]
    for (i in 1:N) {
        Y[i] ~ dnorm(mu[i], tau[i])
        mu[i]  <- mu.c[G[i]]
        tau[i] <- tau.c[G[i]]
        G[i] ~ dcat(P[])                 # latent grade label with mixing proportions P
    }
    # Conjugate priors for the mean and precision of each of the M grade groups
    for (g in 1:M) {
        tau.c[g] ~ dgamma(alpha.b[g], beta.b[g])
        alpha.b[g] <- 3 + numgrade[g] / 2
        beta.b[g]  <- 1 / (2 * L[g])
        sigma.c[g] <- 1 / sqrt(tau.c[g])
        mu.c[g] ~ dnorm(alpha.a[g], alpha.tau[g])
        alpha.a[g]   <- (1 / (1 / v[g] + numgrade[g] * tau.c[g])) * (m[g] * v[g] + numgrade[g] * L[g] * tau.c[g])
        alpha.tau[g] <- (1 / (numgrade[g] * tau.c[g] + 1 / v[g]))
        m[g] ~ dnorm(q[g], 0.0025)       # prior guess for the group mean
        q[g] <- 9 * g
        v[g] <- 400
    }
    # Dirichlet prior on the mixing proportions of the 11 letter grades
    P[1:11] ~ ddirch(phi[])
}
~ DATA (N = 62)
38, 45, 52, 57, 58, 60, 60, 60, 64, 65, 65, 67, 67, 69, 69, 69, 69, 69, 69, 70,
70, 70, 70, 70, 70, 72, 72, 72, 73, 74, 74, 75, 76, 76, 78, 79, 79, 81, 81, 81,
82, 82, 83, 83, 84, 85, 85, 87, 89, 89, 90, 91, 92, 92, 93, 93, 94, 94, 94, 95, 95, 96.
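For readers without WinBUGS, the conditional structure that the model above exploits can also
be sketched directly. The following Python fragment is a deliberately simplified Gibbs sampler
for a normal mixture with a fixed number of grade groups, a known within-group standard
deviation and fixed mixing weights; it is illustrative only and is not the thesis model, which
also samples the group precisions and the Dirichlet mixing proportions.

import numpy as np

rng = np.random.default_rng(1)

y = np.array([38, 45, 52, 57, 58, 60, 64, 70, 72, 75, 81, 85, 90, 95], float)  # illustrative scores
M = 5                                   # number of grade groups (simplified)
sigma = 5.0                             # known within-group standard deviation (assumption)
prior_mean = np.linspace(40, 95, M)     # prior guesses for the group means
prior_sd = 20.0
weights = np.full(M, 1.0 / M)           # fixed mixing weights (assumption)

mu = prior_mean.copy()
for it in range(2000):
    # 1. Sample each latent grade label given the current group means
    loglik = -0.5 * ((y[:, None] - mu[None, :]) / sigma) ** 2 + np.log(weights)
    prob = np.exp(loglik - loglik.max(axis=1, keepdims=True))
    prob /= prob.sum(axis=1, keepdims=True)
    labels = np.array([rng.choice(M, p=p) for p in prob])

    # 2. Sample each group mean from its conjugate normal full conditional
    for g in range(M):
        members = y[labels == g]
        prec = 1 / prior_sd**2 + len(members) / sigma**2
        mean = (prior_mean[g] / prior_sd**2 + members.sum() / sigma**2) / prec
        mu[g] = rng.normal(mean, 1 / np.sqrt(prec))

print(np.round(np.sort(mu), 2))         # draws of the group means at the last iteration

Each iteration alternates between drawing the latent grade labels and drawing the group means
from their conjugate full conditionals, which are the two Gibbs steps that WinBUGS automates.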
b) Posterior Marginal Density

[Figure: marginal posterior density estimates for the means mu.c[1] to mu.c[11] of the
different letter grades (two chains, 150,000 samples).]
c) Convergence Diagnostics

[Figure: trace plots of the sampled values of mu.c[1] to mu.c[11] for the two chains.]

[Figure: Gelman-Rubin statistics for mu.c[1] to mu.c[11], plotted over iterations 501 to
60,000 of the two chains.]

[Figure: running quantiles of mu.c[1] to mu.c[11], plotted over iterations 3,501 to 60,000
of the two chains.]

[Figure: autocorrelation functions of mu.c[1] to mu.c[11] up to lag 40, for the two chains.]
Table E1: Measuring performance of class loss for Case 1: GB method (c = 1)

[Table: for each of the 62 raw scores (38 to 96), the table lists the instructor's numeric
grade, the numeric grade assigned by the GB (conditional Bayesian) method, and the
per-student class loss C, followed in the continued part by the squared GB grades, the
errors e (instructor minus GB) and the squared errors. Summary values as reported in the
original table: total class loss = 12, CC = 12/62 = 0.1935; Σe² = 13, ΣGB² = 3798,
1 - R = 0.003423, R = 0.9966.]
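The performance measures reported in Tables E1 to E6 can be reproduced from the two grade
vectors. The sketch below is not part of the thesis; it assumes that the per-student loss is
c times the absolute difference between the instructor's numeric grade and the method's
numeric grade (summed and divided by class size to give CC), and that R is one minus the
ratio of the summed squared grade differences to the summed squared method grades, an
interpretation that matches the totals reported in the tables.

import numpy as np

def class_loss(instructor, method, c=1.0):
    # Average class loss: c times the absolute grade difference, summed over students
    # and divided by the class size
    instructor, method = np.asarray(instructor, float), np.asarray(method, float)
    return c * np.abs(instructor - method).sum() / len(instructor)

def agreement_r(instructor, method):
    # R = 1 - (sum of squared grade differences) / (sum of squared method grades)
    instructor, method = np.asarray(instructor, float), np.asarray(method, float)
    return 1.0 - ((instructor - method) ** 2).sum() / (method ** 2).sum()

# Illustrative call with short made-up grade vectors (1 = E, ..., 11 = A)
inst = [2, 3, 4, 7, 8, 9, 10, 11]
gb   = [2, 3, 3, 7, 7, 9, 10, 11]
print(class_loss(inst, gb, c=1.0), agreement_r(inst, gb))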
Table E2: Measuring performance of class loss for Case 1: GB method (c = 1.6)

[Table: as Table E1 but with loss constant c = 1.6, so each unit of disagreement between the
instructor's grade and the GB grade contributes 1.6 to the class loss. Summary values as
reported: total class loss = 19.2, CC = 19.2/62 = 0.3097.]
Table E3: Measuring performance of class loss for Case 1: Straight Scale method (c = 1)

[Table: instructor's numeric grade, Straight Scale grade and per-student class loss C for the
62 raw scores, followed in the continued part by the squared Straight Scale grades, the
errors e and the squared errors. Summary values as reported: total class loss = 49,
CC = 49/62 = 0.7903; Σe² = 49, Σ(Straight Scale)² = 4821, 1 - R = 0.010164, R = 0.989836.]
Table E4: Measuring performance of class loss for Case 1: Straight Scale method (c = 1.6)

[Table: as Table E3 but with loss constant c = 1.6. Summary values as reported: total class
loss = 78.6, CC = 78.6/62 = 1.2677.]
Table E5: Measuring performance of class loss for Case 1: Standard Deviation method (c = 1)

[Table: instructor's numeric grade, Standard Deviation method grade and per-student class
loss C for the 62 raw scores, followed in the continued part by the squared Standard
Deviation grades, the errors e and the squared errors. Summary values as reported: total
class loss = 92, CC = 92/62 = 1.4839; Σe² = 180, Σ(Std. Dev.)² = 2860, 1 - R = 0.062937,
R = 0.937063.]
Table E6: Measuring performance of class loss for Case 1: Standard Deviation method (c = 1.6)

[Table: as Table E5 but with loss constant c = 1.6. Summary values as reported in the
original table: CC = 92/62 = 1.4839.]
Case 2: Large Class
a) Model
MODEL {
    # Same normal mixture model as in Case 1
    for (i in 1:N) {
        Y[i] ~ dnorm(mu[i], tau[i])
        mu[i]  <- mu.c[G[i]]
        tau[i] <- tau.c[G[i]]
        G[i] ~ dcat(P[])
    }
    for (g in 1:M) {
        tau.c[g] ~ dgamma(alpha.b[g], beta.b[g])
        alpha.b[g] <- 3 + numgrade[g] / 2
        beta.b[g]  <- 1 / (2 * L[g])
        sigma.c[g] <- 1 / sqrt(tau.c[g])
        mu.c[g] ~ dnorm(alpha.a[g], alpha.tau[g])
        alpha.a[g]   <- (1 / (1 / v[g] + numgrade[g] * tau.c[g])) * (m[g] * v[g] + numgrade[g] * L[g] * tau.c[g])
        alpha.tau[g] <- (1 / (numgrade[g] * tau.c[g] + 1 / v[g]))
        m[g] ~ dnorm(q[g], 0.0025)
        q[g] <- 9 * g
        v[g] <- 400
    }
    P[1:11] ~ ddirch(phi[])
}
~DATA (N=498)
29,30,30,30,31,32,34,34,35,36,36,36,37,38,38,40,40,41,41,41,42,42,42,43,43,43,43,44,45,45,
46,47,48,48,50,50,50,50,51,51,51,52,52,52,52,53,53,53,54,54,56,57,57,57,58,58,58,59,59,60,
60,60,60,60,60,61,61,61,61,61,61,62,62,62,62,62,62,63,63,64,64,64,64,65,65,65,65,65,65,65,6
5,65,65,65,65,65,65,65,66,66,66,66,66,66,66,66,66,66,66,67,67,67,67,67,67,67,67,67,67,67,
67,67,68,68,68,68,68,68,68,68,68,68,68,69,69,69,69,69,69,69,69,69,69,70,70,70,70,70,70,70,7
0,70,70,70,70,70,70,70,70,70,70,70,71,71,71,71,71,71,71,71,71,71,71,71,71,71,71,71,71,71,
71,71,71,71,71,71,71,71,71,71,71,71,71,72,72,72,72,72,72,72,72,72,72,72,72,72,72,72,72,72,7
2,72,72,72,72,72,72,72,72,72,72,72,72,72,72,72,72,72,72,73,73,73,73,73,73,73,73,73,73,73,
73,73,73,73,73,73,73,73,73,73,73,73,73,73,73,73,73,73,73,73,73,73,73,73,74,74,74,74,74,74,7
4,74,74,74,75,75,75,75,75,75,75,75,75,75,75,75,75,75,75,75,75,75,75,75,75,75,75,75,75,75,
75,75,75,75,75,76,76,76,76,76,76,76,76,76,76,76,76,76,76,76,76,76,76,76,76,77,77,77,77,77,7
7,77,77,77,77,77,77,77,77,77,77,78,78,78,78,78,78,78,78,78,78,78,78,78,78,78,78,78,78,78,
78,78,78,78,78,78,78,78,78,78,78,78,79,79,79,79,79,79,79,79,79,79,79,79,79,79,79,79,79,80,8
0,80,80,80,80,80,80,80,80,80,80,80,80,80,81,81,81,81,81,81,81,81,81,81,81,81,81,81,82,82,
82,82,82,82,82,82,82,82,82,82,82,82,82,82,82,82,82,83,83,83,83,83,83,83,83,83,83,83,83,83,8
3,83,83,84,84,84,84,84,84,84,84,84,84,84,85,85,85,85,86,86,86,86,86,87,87,87,87,87,88,88,
88,89,89,90,90,91,91,91,92,94,94,95,95,96,96,98,98,98
b) Posterior Marginal Density

[Figure: marginal posterior density estimates for the means mu.c[1] to mu.c[11] of the
different letter grades (two chains, 150,000 samples).]
c) Convergence Diagnostics

[Figure: trace plots of the sampled values of mu.c[1] to mu.c[11] for the two chains.]

[Figure: Gelman-Rubin statistics for mu.c[1] to mu.c[11], plotted over iterations 501 to
60,000 of the two chains.]

[Figure: running quantiles of mu.c[1] to mu.c[11], plotted over iterations 3,501 to 60,000
of the two chains.]

[Figure: autocorrelation functions of mu.c[1] to mu.c[11] up to lag 40, for the two chains.]
APPENDIX F
[Supplementary downloaded materials referenced in the original appendix: Bayes; Metropolis;
David A. Frisbie.]