Introduction to multilevel models: getting started with your own data

Course handouts
Introduction to Multilevel Models: Getting started
with your own data
University of Bristol
Monday 31st March – Friday 4th April 2008
Resources
Centre for Multilevel Modelling http://www.mlwin.com/
Provides access to general information about multilevel modelling and
MLwiN.
Includes the Multilevel newsletter (free electronic publication):
http://www.mlwin.com/publref/newsletters.html
Email discussion group:
www.jiscmail.ac.uk/multilevel/
Lemma will include a training repository:
http://www.ncrm.ac.uk/nodes/lemma/about.php
1.0 Introductions
Participants introduce themselves: Who are you?
Where are you from?
2.00 Multilevel Data Structures
Multilevel modelling is designed to explore and analyse data that come
from populations which have a complex structure.
In any complex structure we can identify atomic units. These are the
units at the lowest level of the system. The response or y variable is
measured on the atomic units.
Often, but not always, these atomic units are individuals.
Individuals are then grouped into higher level units, for example, schools.
By convention we then say that students are at level 1 and schools are at
level 2 in our structure.
2.01 Levels, classifications and units
A level (e.g. pupils, schools, households, areas) is made up of a number of
individual units (e.g. particular pupils, schools, etc.).
The terms classification and level can be used somewhat interchangeably, but
the term level implies a nested hierarchical relationship of units (in which
lower units nest in one, and only one, higher-level unit) whereas
classification does not.
2.02 Two-level hierarchical structures
Students within schools
[Figure: unit diagram (one node per unit) showing schools Sc1 to Sc4 with students St1-St3, St1-St2, St1-St3 and St1-St4 nested within them, alongside a classification diagram (one node per classification) showing Student nested within School.]
Students within a school are more alike than a random sample of
students. This is the ‘clustering’ effect of schools.
2.03 Data frame for student within school example
Classifications or levels: Student i, School j
Response: Student exam score_ij
Explanatory variables: Student previous examination score_ij, Student gender_ij, School type_j

Student i  School j  Exam score_ij  Previous exam score_ij  Gender_ij  School type_j
1          1         75             56                      M          State
2          1         71             45                      M          State
3          1         91             72                      F          State
1          2         68             49                      F          Private
2          2         37             36                      M          Private
3          2         67             56                      M          Private
1          3         82             76                      F          State
Questions we might ask of these data:
1. Do males make greater progress than females?
2.* Does the gender gap vary across schools?
3.* Are males more or less variable in their progress than females?
4.* What is the between-school variation in students' progress?
5.* Is School X (that is, a specific school) different from other schools in the sample in its effect?
6.* Are schools more variable in their progress for students with low prior attainment?
7. Do students make more progress in private than in state schools?
8.* Are students in state schools less variable in their progress?
* Requires a multilevel model to answer
2.04 Variables, levels, fixed and random
classifications
Given that school type (state or private) classifies schools, we could redraw our
classification diagram
    Student within School
as
    Student within School within School type
Do we now have a 3-level multilevel model?
We can divide classifications into two types : fixed classifications and
random classifications. The distinction has important implications for
how we handle the classifying variable in a statistical analysis.
For a classification to be a level in a multilevel model it must be a random
classification. It turns out that school type is not a random classification.
2.05 Random and Fixed Classifications
A classification is a random classification if its units can be regarded as a
random sample from a wider population of units. For example the students
and schools in our example are a random sample from a wider population
of students and schools. However, school type or indeed, student gender
has a small fixed number of categories. There is no wider population of
school types or genders to sample from.
Traditional or single level statistical models have only one random
classification which classifies the units on which measurements are made,
typically people. Multilevel models have more than one random
classification.
2.06 Other examples of two-level hierarchical
structures
Repeated measures, panel data
Multivariate response models
2.07 Repeated Measures data
In the previous example we have measures on an individual at two occasions: a
current and a prior test score. We can analyse change (that is, progress) by
specifying current attainment as the response and prior attainment as a
predictor variable.
[Data frame: the student-within-school data frame of 2.03, with classifications Student i and School j, response Student exam score_ij, and explanatory variables previous examination score_ij, gender_ij and School type_j.]
However, when there are measurements on more than two occasions there are
advantages in treating occasion as a level nested within individuals. Such a
two-level strict hierarchical structure is known as a repeated measurement or
panel design.
2.08 Classification, unit diagrams and data frames for
repeated measures structures
Unit diagram:
Person                P1           P2     P3 .....
Measurement occasion  O1 O2 O3 O4  O1 O2  O1 O2 O3

Wide form: 1 row per individual
Person  HOcc1  HOcc2  HOcc3  AgeOcc1  AgeOcc2  AgeOcc3  Gender
1       75     85     95     5        6        7        F
2       82     91     *      7        8        *        M
3       88     93     96     5        6        7        F

Long form: 1 row per occasion (required by MLwiN)
Classifications or levels: Occasion i, Person j
Response: Height_ij
Explanatory variables: Age_ij, Gender_j

Occasion i  Person j  Height_ij  Age_ij  Gender_j
1           1         75         5       F
2           1         85         6       F
3           1         95         7       F
1           2         82         7       M
2           2         91         8       M
1           3         88         5       F
2           3         93         6       F
3           3         96         7       F
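The wide-to-long conversion can be sketched in a few lines of Python. This is only an illustration using the three people in the handout's table; the dictionary layout and the function name `wide_to_long` are assumptions made for this example and are not part of MLwiN.

```python
# Wide form from the handout: one row per person.
# None stands for the missing measurements shown as "*".
wide = [
    {"person": 1, "height": [75, 85, 95], "age": [5, 6, 7], "gender": "F"},
    {"person": 2, "height": [82, 91, None], "age": [7, 8, None], "gender": "M"},
    {"person": 3, "height": [88, 93, 96], "age": [5, 6, 7], "gender": "F"},
]


def wide_to_long(rows):
    """Return one record per (occasion, person), skipping missing occasions."""
    long_rows = []
    for row in rows:
        pairs = zip(row["height"], row["age"])
        for occasion, (height, age) in enumerate(pairs, start=1):
            if height is None:
                continue  # unbalanced data: no row is created for this occasion
            long_rows.append({
                "occasion": occasion,     # level 1 identifier (i)
                "person": row["person"],  # level 2 identifier (j)
                "height": height,         # response
                "age": age,               # time-varying predictor
                "gender": row["gender"],  # time-invariant predictor, repeated
            })
    return long_rows


long_rows = wide_to_long(wide)
for r in long_rows:
    print(r["occasion"], r["person"], r["height"], r["age"], r["gender"])
```

Person 2 contributes only two rows, which is how the long form accommodates an unbalanced design.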
2.09 Repeated Measures (continued)
Atomic units are occasions, not individuals.
We model between-individual variation in growth (growth curves).
In a multilevel repeated measures model data need not be balanced
or equally spaced.
Explanatory variables can be time invariant (gender) or time varying
(age).
2.10 Multivariate responses within individuals
Sometimes we may wish to model not a single response (y-variable)
but several. For example, we may wish to consider jointly
English and Mathematics exam scores for students as two possibly
related responses. We can regard this as a multilevel model with
subjects (English and Maths) nested within students.
Unit diagram:
Student  St1  St2  St3  St4…
Subject  E M  E    E M  M
A multilevel multivariate response
model can estimate the covariance
(or correlation) matrix between
responses and efficiently handle
missing data.
2.11 Data frames for multivariate response models
Wide form: 1 row per individual
Student  English score  Maths score  Gender
1        95             75           M
2        55             *            F
3        65             40           F
4        *              75           M
Long form: 1 row per measurement (required by MLwiN)
Classifications or levels: Exam subject i, Student j
Response: Exam score_ij
Explanatory variables: EngIndic_ij, MathIndic_ij, GenderEng_j, GenderMath_j

Exam subject i  Student j  Exam score_ij  EngIndic_ij  MathIndic_ij  GenderEng_j  GenderMath_j
Eng  1          1          95             1            0             M            0
Math 2          1          75             0            1             0            M
Eng  1          2          55             1            0             F            0
Eng  1          3          65             1            0             F            0
Math 2          3          40             0            1             0            F
Math 2          4          75             0            1             0            M
2.12 Three level structures
Students:classes:schools
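[Figure: unit diagram showing students (St1-St3, St1-St2, St1-St3, St1-St4) nested within classes (C1, C2) nested within schools (Sc1, Sc2, Sc3), with a classification diagram showing Student within Class within School.]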
Multilevel models (MLM) allow a different number of students in each class and a
different number of classes in each school. Bennett (1976) used a single-level
model to assess whether teaching styles affected test scores for reading and
mathematics at age 11. The results prompted a call for a return to traditional or
formal teaching methods. This analysis did not take account of the dependency
structures in the data: students in a class are more similar than a random sample
of students, and likewise classes in a school. Subsequent multilevel analysis found
the effects of traditional methods to be non-significant.
2.13 Data Frame for 3 level model, students:
classes: schools
Classifications or levels: Student i, Class j, School k
Response: Current exam score_ijk
Explanatory variables: Student previous examination score_ijk, Student gender_ijk, Class teaching style_jk, School type_k

Student i  Class j  School k  Current exam score_ijk  Previous exam score_ijk  Gender_ijk  Teaching style_jk  School type_k
1          1        1         75                      56                       M           Formal             State
2          1        1         71                      45                       M           Formal             State
3          1        1         91                      72                       F           Formal             State
1          2        1         68                      49                       F           Informal           State
2          2        1         37                      36                       M           Informal           State
1          1        2         67                      56                       M           Formal             Private
2          1        2         82                      76                       F           Formal             Private
3          1        2         85                      50                       F           Formal             Private
1          1        3         54                      39                       M           Informal           State
2.14 Other three level structures
• Repeated measures within students within schools. This allows us to look
at how learning trajectories vary across students and schools.
• Multivariate responses on four health behaviours (drinking, smoking,
exercise and diet) on individuals within communities. Such a design allows
us to assess how correlated the behaviours are at the individual level and at
the community level, taking account of other characteristics at both the
individual and community level. We can also assess the extent to which there
are unhealthy communities as well as unhealthy individuals.
• A repeated cross-sectional design with students:cohorts:schools.
2.15 Repeated cross-sectional design
[Figure: unit diagram showing students (St1, St2, ...) nested within cohorts (1990, 1991) nested within schools (Sc1, Sc2, Sc3, ...), with a classification diagram showing Student within Cohort within School.]
Above are unit and classification diagrams where we have Exam scores for
groups of students who entered school in 1990 and a further group who
entered in 1991. The model can be extended to handle an arbitrary number of
cohorts. In a multilevel sense we do not have 2 cohort units but 2S cohort units
where S is the number of schools.
2.16 Four level hierarchical structures
By now you should be getting a feel for how basic random classifications
such as people, time, multivariate responses, institutions, families and
areas can be combined within a multilevel framework to model a wide
variety of nested population structures. Here are some examples of 4-level
nested structures.
• student within class within school within LEA
• multivariate responses within repeated measures within students within schools
• repeated measures within patients within doctor within hospital
• people within households within postcode sectors within regions
As a final example of a strict hierarchy we will consider a doubly nested
repeated measures structure.
2.17 Repeated measures within students within
cohorts within schools
[Figure: unit diagram showing measurement occasions (O1, O2) nested within students (St1, St2, ...) nested within cohorts (1990, 1991) nested within schools (Sc1, Sc2, ...).]
Cohorts are now repeated measures on schools and tell us about stability
of school effects over time
Measurement occasions are repeated measures on students and can tell us
about students’ learning trajectories.
2.18 Non-hierarchical structures
So far all our examples have been exact nesting with lower level units nested
in one and only one higher-level unit.
That is we have been dealing with strict hierarchies. But social reality can be
more complicated than that.
In fact we have found that we need two non-hierarchical structures which in
combination with strict hierarchies have been able to deal with all the different
types of designs, realities and research questions that we have met
•Cross-classified structures
•Multiple membership structures
2.19 Cross-classified Model
[Figure: unit diagram with pupils P1-P12 linked upwards to schools S1-S4 and downwards to areas A1-A3, and a classification diagram in which pupil is nested within both school and area.]
In this structure schools are not nested within areas. For example:
Pupils 2 and 3 attend school 1 but come from different areas.
Pupils 6 and 10 come from the same area but attend different schools.
Schools are not nested within areas and areas are not nested
within schools. School and area are cross-classified.
2.20 Tabulation of students by school and area to
reveal a cross-classified structure
[Figure: unit diagrams for schools S1-S4, pupils P1-P12 and areas A1-A3, contrasting schools nested within areas with schools crossed with areas.]

Nested structure (tabulated against area 1, area 2, area 3): all elements in a row lie in a single column
School 1  P1, P2, P3
School 2  P4, P5
School 3  P6, P7, P8
School 4  P9, P10, P11, P12

Cross-classified structure: elements in a row span multiple columns, elements in a column span multiple rows
          area 1   area 2   area 3
School 1  P1, P3   P2
School 2  P5       P4
School 3           P6, P7   P8
School 4           P10      P9, P11, P12
2.21 Data frame for pupils in a cross-classification
of schools and areas
Classifications or levels: Student i, School j, Area k
Response: Exam score_i(jk)
Explanatory variables: Student gender_i(jk), Area IMD_k, School type_j

Student i  School j  Area k  Exam score_i(jk)  Student gender  Area IMD_k  School type_j
1          1         1       75                M               24          State
2          1         2       71                F               46          State
3          1         1       91                F               24          State
4          2         2       68                M               46          Private
5          2         1       37                M               24          Private
6          3         2       67                F               46          Private
7          3         2       82                F               46          State
8          3         3       85                M               11          State
9          4         3       54                M               11          Private
10         4         2       91                M               46          Private
11         4         3       43                F               11          Private
12         4         3       66                M               11          Private
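Whether one classification is nested in another can be checked directly from the identifier columns of a data frame like this one. The sketch below is an illustration only: the helper name `is_nested` is hypothetical, and the (pupil, school, area) triples are copied from the handout's cross-classified example.

```python
from collections import defaultdict

# (pupil, school, area) identifiers from the handout's cross-classified example.
records = [
    (1, 1, 1), (2, 1, 2), (3, 1, 1), (4, 2, 2), (5, 2, 1), (6, 3, 2),
    (7, 3, 2), (8, 3, 3), (9, 4, 3), (10, 4, 2), (11, 4, 3), (12, 4, 3),
]


def is_nested(pairs):
    """True if every lower unit occurs with exactly one higher unit."""
    parents = defaultdict(set)
    for lower, higher in pairs:
        parents[lower].add(higher)
    return all(len(units) == 1 for units in parents.values())


pupils_in_schools = is_nested((p, s) for p, s, a in records)
pupils_in_areas = is_nested((p, a) for p, s, a in records)
schools_in_areas = is_nested((s, a) for p, s, a in records)
areas_in_schools = is_nested((a, s) for p, s, a in records)

print("pupils nested in schools:", pupils_in_schools)
print("pupils nested in areas:  ", pupils_in_areas)
print("schools nested in areas: ", schools_in_areas)
print("areas nested in schools: ", areas_in_schools)
```

Pupils nest in both schools and areas, but neither higher classification nests in the other, which is the defining pattern of a cross-classification.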
2.22 Other examples of cross-classified structures
• Exam marks within a cross-classification of student and examiner, where a
student's paper is marked by more than one examiner to get an indication of
examiner reliability.
[Table: marks m1-m8 for students 1-4 tabulated against examiners 1-3, two marks per student.]
Note that in this case we have at most 1 level one unit (mark) per cell in the
cross-classification.
• Students within a cross-classification of primary school by secondary school. We
may have students' exam scores at age 16 and wish to assess the relative
effects of primary and secondary schools on attainment at age 16.
• Patients within a cross-classification of GP practice and hospital.
2.24 Multiple membership models
Multiple membership arises where atomic units are seen as nested within more
than one unit from a higher level classification:
• Health outcomes where patients are treated by a number of nurses: patients
are multiple members of nurses.
• Students move schools, so some pupils are multiple members of
schools.
[Figure: unit diagram with pupils P1-P12 linked to schools S1-S4, some pupils linked to more than one school, and a classification diagram linking Pupil to the higher classification by a multiple membership relationship.]
2.23 Combining structures: crossed-classifications and multiple
membership relationships
[Figure: unit diagram with pupils P1-P12 linked to schools S1-S4 and to areas A1-A3.]
Let's take the cross-classified model of the previous slide, but
suppose:
Pupil 1 moves in the course of the study from residential
area 1 to 2 and from school 1 to 2.
Pupil 8 has moved schools but still lives in the same area.
Pupil 7 has moved areas but still attends the same school.
Now, in addition to schools being crossed with residential areas,
pupils are multiple members of both areas and schools.
2.24 Classification diagram for multiple
membership model
[Figure: unit diagram with pupils P1-P12 linked to schools S1-S4 and areas A1-A3, and a classification diagram in which Student is a multiple member of both School and Area.]

          area 1   area 2   area 3
School 1  P1, P3   P1, P2
School 2  P1, P5   P1, P4
School 3           P6, P7   P7, P8
School 4           P10      P8, P9, P11, P12

Students nested within a cross-classification of school by area
Students multiple members of schools
Students multiple members of areas
2.25 Combining structures : crossed, nested and multiple
membership relationships
[Figure: unit diagram with patients P1-P6 linked to nurses N1-N4, nurses nested within hospitals H1 and H2, and patients nested within GP practices GP1-GP3; the classification diagram links Patient, Nurse, Hospital and GP practice.]
Patients can be treated by more than one nurse during their stays in hospital:
patients are multiple members of nurses.
Nurses work in only one hospital, therefore nurses are nested within hospitals.
Patients are nested within referring GPs. GPs are crossed with nurses. GPs are
crossed with hospitals.
2.26 Distinguishing Variables and Levels
[Figure: unit diagram with pupils P1-P12 nested within schools S1-S4 nested within school type (state, private).]

Classifications or levels: Pupil i, School j, School type k
Response: Pupil exam score_ijk
Explanatory variables: Previous exam score_ijk, Pupil gender_ijk

Pupil i  School j  School type k  Pupil exam score_ijk  Previous exam score_ijk  Pupil gender_ijk
1        1         State          75                    56                       M
2        1         State          71                    45                       M
3        1         State          91                    72                       F
1        2         Private        68                    49                       F
2        2         Private        37                    36                       M
Etc.

Is this a 3-level structure? NO!
School type is not a random classification; it is a fixed classification, and
therefore a variable, not a level.
A classification is random if its units can be regarded as a random sample from
a wider population of units, e.g. pupils and schools.
A fixed classification has a small fixed number of categories. For example, state
and private are not two types sampled from a large number of types; on the
basis of these two we cannot generalise to a wider population of types of
schools. Similarly for gender.
3.0 Work with a partner, discussing what type of multilevel
data structure corresponds to each participant's data (20 mins)
Draw free-hand a classification diagram, giving labels for units at each level
and linking the nodes by appropriate arrows to reflect nested, crossed or
multiple membership (MM) relationships.
Complete a schematic data frame for your data set.
Either use overheads provided or whatever software you find convenient.
4.0 Discussion of Exercise 3.0
Each participant takes 2 minutes to present the multilevel
structure for their research problem
5: Modelling
varying relations: from graphs to equations
“There are NO general laws in social science that are
constant over time and independent of the context
in which they are embedded”
Rein (quoted in King, 1976)
5.1 Varying relations plot
• Simple set up
Two level model
houses at level 1 nested within districts at level 2
• Single continuous response: price of a house
• Single continuous predictor: size = number of rooms; this variable has been
centred around the average size of 5

Rooms  1   2   3   4   5  6  7  8
x1     -4  -3  -2  -1  0  1  2  3
5.3 General Structure for Statistical models
• Response = general trend + fluctuations
• Response = systematic component + stochastic element
• Response = fixed + random
• Specific case: the single level simple regression model

Response       Systematic part                                          Random part
House price =  Price of average-sized house  +  Cost of extra room  +  House residual variation
               (Intercept)                      (Slope)                 (Residual)
5.4 Simple regression model
[Figure: plot of $y$ against $x_1$ showing a line with intercept $\beta_0$ and slope $\beta_1$.]
$y$ is the outcome, price of a house
$x_1$ is the predictor, number of rooms,
which we shall deviate around its mean, 5

Rooms  1   2   3   4   5  6  7  8
x1     -4  -3  -2  -1  0  1  2  3
5.5 Simple regression model (cont)
$y_i = \beta_0 + \beta_1 x_{1i} + (e_i)$
$y_i$ is the price of house i
$x_1$ is the individual predictor variable
$\beta_0$ is the intercept
$\beta_1$ is the fixed slope term
$e_i$ is the residual/random term, one for every house
Summarizing the random term: ASSUME IID
Mean of the random term is zero
Constant variability (homoscedasticity)
No patterning of the residuals (i.e., they are independent)
$e_i \sim N(0, \sigma^2_e)$
$\sigma^2_e$ is the between-house variance, conditional on size
5.6 Random intercepts model
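[Figure: the citywide line $\hat{y} = \beta_0 + \beta_1 x_1$, with parallel district lines shifted above it (premium) or below it (discount) by the district differential $u_{0j}$.]
Differential shift for each district j: index the intercept
Micro-model:
$y_{ij} = \beta_{0j} + \beta_1 x_{1ij} + e_{ij}$
Macro-model (index parameter as a response):
$\beta_{0j} = \beta_0 + u_{0j}$
Price of average-sized house in district j = citywide price + differential for district j
Substitute macro into micro…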
5.7 Random intercepts COMBINED model
Substituting the macro model into the micro model yields
$y_{ij} = (\beta_0 + u_{0j}) + \beta_1 x_{1ij} + e_{ij}$
Grouping the random parameters in brackets
$y_{ij} = \beta_0 + \beta_1 x_{1ij} + (u_{0j} + e_{ij})$
• Fixed part: $\beta_0$, $\beta_1$
• Random part (Level 2): $u_{0j} \sim N(0, \sigma^2_{u0})$
• Random part (Level 1): $e_{0ij} \sim N(0, \sigma^2_{e0})$
• District and house differentials are independent: $\mathrm{Cov}[u_{0j}, e_{0ij}] = 0$
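To make the combined random intercepts model concrete, the sketch below simulates house prices from it and checks that the random part splits into a district component and a house component. All numerical values (a citywide price of 100, a slope of 10, standard deviations of 8 and 6) are hypothetical and chosen only for illustration; they are not estimates from the course data.

```python
import random
import statistics
from collections import defaultdict


def simulate(n_districts, n_houses, beta0, beta1, sd_u, sd_e, seed=1):
    """Simulate y_ij = beta0 + beta1*x1_ij + (u0_j + e_ij)."""
    rng = random.Random(seed)
    rows = []
    for j in range(n_districts):
        u0j = rng.gauss(0.0, sd_u)        # one differential per district
        for _ in range(n_houses):
            x1 = rng.randint(1, 8) - 5    # rooms centred on 5
            e = rng.gauss(0.0, sd_e)      # one differential per house
            rows.append((j, x1, beta0 + beta1 * x1 + u0j + e))
    return rows


# Hypothetical parameter values, for illustration only.
BETA0, BETA1, SD_U, SD_E = 100.0, 10.0, 8.0, 6.0
rows = simulate(2000, 10, BETA0, BETA1, SD_U, SD_E)

# Remove the fixed part to leave the random part (u0_j + e_ij).
by_district = defaultdict(list)
for j, x1, y in rows:
    by_district[j].append(y - (BETA0 + BETA1 * x1))

random_part = [r for values in by_district.values() for r in values]
total_var = statistics.pvariance(random_part)
between_var = statistics.pvariance(
    [statistics.fmean(values) for values in by_district.values()]
)

print("total variance of random part:", round(total_var, 1))
print("variance of district means:   ", round(between_var, 1))
```

Because the two differentials are independent, the total variance should be close to 8² + 6² = 100, and the variance of the district means close to 8² + 6²/10 = 67.6; the exact figures depend on the random draw.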
5.8 The meaning of the random terms
• Level 2: between districts
$u_{0j} \sim N(0, \sigma^2_{u0})$
• $\sigma^2_{u0}$: between-district variance, conditional on size
• Level 1: within districts, between houses
$e_{0ij} \sim N(0, \sigma^2_{e0})$
• $\sigma^2_{e0}$: within-district, between-house variance, conditional on size
5.9 Variants on the same model
• Combined model
$y_{ij} = \beta_0 + \beta_1 x_{1ij} + (u_{0j} + e_{ij})$
• Combined model in full
$y_{ij} = \beta_0 x_{0ij} + \beta_1 x_{1ij} + (u_{0j} x_{0ij} + e_{0ij} x_{0ij})$
• $x_{0ij}$ is the constant; a set of 1's
• In MLwiN: differentials at each level
5.10 Random intercepts and random slopes
5.11 Random intercepts and slopes model
Micro-model
$y_{ij} = \beta_{0j} x_{0ij} + \beta_{1j} x_{1ij} + e_{0ij} x_{0ij}$
Note: index the intercept and the slope associated with the constant
and number of rooms, respectively
Macro-model (Random Intercepts)
$\beta_{0j} = \beta_0 + u_{0j}$
Macro-model (Random Slopes)
$\beta_{1j} = \beta_1 + u_{1j}$
Slope for district j = citywide slope + differential slope for district j
Substitute macro models into micro model…
5.12 Random slopes model
Substituting the macro model into the micro model yields
$y_{ij} = (\beta_0 + u_{0j}) x_{0ij} + (\beta_1 + u_{1j}) x_{1ij} + e_{0ij} x_{0ij}$
Multiplying the parameters with the associated variable and
grouping them into fixed and random parameters yields the
combined model:
$y_{ij} = \beta_0 x_{0ij} + \beta_1 x_{1ij} + (u_{0j} x_{0ij} + u_{1j} x_{1ij} + e_{0ij} x_{0ij})$
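5.13 Characteristics of random intercepts & slopes model
$y_{ij} = \beta_0 x_{0ij} + \beta_1 x_{1ij} + (u_{0j} x_{0ij} + u_{1j} x_{1ij} + e_{0ij} x_{0ij})$
Fixed part: $\beta_0$, $\beta_1$
Random part (Level 2):
$\begin{bmatrix} u_{0j} \\ u_{1j} \end{bmatrix} \sim N(0, \Omega_u), \quad \Omega_u = \begin{bmatrix} \sigma^2_{u0} & \\ \sigma_{u0u1} & \sigma^2_{u1} \end{bmatrix}$
Random part (Level 1):
$[e_{0ij}] \sim N(0, \sigma^2_{e0})$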
5.14 Interpreting varying relationship plot through mean and variance-covariances

       Intercepts: terms associated    Slopes: terms associated      Intercept/slope terms
       with constant x0                with predictor x1             associated with x0, x1
Graph  Mean $\beta_0$  Variance $\sigma^2_{u0}$   Mean $\beta_1$  Variance $\sigma^2_{u1}$   Covariance $\sigma_{u0u1}$
A      +               0                          +               0                          undefined
B      +               +                          +               0                          undefined
C      +               +                          +               +                          +
D      +               +                          +               +                          -
E      +               +                          +               +                          0
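[Figure: plots of attain against pre-test illustrating the three models below.]

Single level model:
$y_i = \beta_0 + \beta_1 x_i + e_i$
$e_i \sim N(0, \sigma^2_e)$

Random intercepts model:
$y_{ij} = \beta_{0j} + \beta_1 x_{ij} + e_{ij}$
$\beta_{0j} = \beta_0 + u_{0j}$
$u_{0j} \sim N(0, \sigma^2_{u0})$
$e_{ij} \sim N(0, \sigma^2_e)$

Random intercepts and slopes model:
$y_{ij} = \beta_{0j} + \beta_{1j} x_{ij} + e_{ij}$
$\beta_{0j} = \beta_0 + u_{0j}$
$\beta_{1j} = \beta_1 + u_{1j}$
$\begin{bmatrix} u_{0j} \\ u_{1j} \end{bmatrix} \sim N(0, \Omega_u), \quad \Omega_u = \begin{bmatrix} \sigma^2_{u0} & \\ \sigma_{u01} & \sigma^2_{u1} \end{bmatrix}$
$e_{ij} \sim N(0, \sigma^2_e)$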
5.16 Random intercepts and slopes model in MLwiN
6.1 Fitting models in MLwiN
• Work through (at your own pace) Chapter
4 of the manual; Random slopes and
intercepts models
• Don’t be afraid to ask!
Summary of Sessions 5+6
S1: Type of questions tackled by multilevel
modelling I
• 2-level model: current attainment given prior attainment of
pupils(1) in schools(2)
• NB assuming a random sample of pupils from a random sample of
schools; (F) marks a question about the fixed part, (R) about the random part
• Do boys make greater progress than girls? (F)
• Are boys more or less variable in their progress than girls? (R)
• What is the between-school variation in progress? (R)
• Is School X different from other schools in the sample in its effect?
(F)
• continued…….
S2: Type of questions tackled by multilevel
modelling II
• Are schools more variable in their progress for pupils with low
prior attainment? (R)
• Does the gender gap vary across schools? (R)
• Do pupils make more progress in denominational schools?(F)
• Are pupils in denominational schools less variable in their
progress? (R)
• Do girls make greater progress in denominational schools? (F)
(cross-level interaction)
S3 Problems with not doing a multilevel analysis
•Substantive: the between school variability and what factors
reduce it are generally of fundamental interest to us. A single
level model gives us no estimate of between school variability.
•Technical: If the higher level clustering is not properly
accounted for in the model then inferences we make about other
predictors will be incorrect. We will tend to infer a relationship
where none exists.
S4 : Fixed and Random classifications
Random classification                               Fixed classification
Generalization of a level (e.g., schools)           Discrete categories of a variable (e.g. gender)
Random effects come from a distribution             Not a sample from a population
All schools contribute to between-school variance   Specific categories only contribute to their respective means
S5 When levels become variables...
• Schools can be treated as a variable and placed in the
fixed part; this is achieved by a set of dummy variables, one
for each school. The target of inference is each specific
school; each one is treated as an ‘island unto itself’.
There is no shrinkage but no ‘help’ from the rest of the data, hence
unreliable estimates when the number of pupils in a school is
small.
• Schools can be placed in the random part and treated as a level, with
generalization possible to ALL schools (or the ‘population’ of
schools); in addition we can predict specific school effects
given that they come from an overall distribution.
There is shrinkage towards zero for unreliably estimated schools.
S6 Recap on: Random intercept models(parallel
lines)
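$y_{ij} = \beta_{0j} + \beta_1 x_{ij} + e_{ij}$
$\beta_{0j} = \beta_0 + u_{0j}$
$u_{0j} \sim N(0, \sigma^2_{u0})$
$e_{ij} \sim N(0, \sigma^2_e)$
[Figure: parallel lines for school 1 and school 2, offset by $u_{0,1}$ and $u_{0,2}$ from the average line $\beta_0 + \beta_1 x_{1ij}$; horizontal axis from -3 to +3.]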
S7 Recap on: Random intercepts and slopes model
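$y_{ij} = \beta_{0j} + \beta_{1j} x_{ij} + e_{ij}$
$\beta_{0j} = \beta_0 + u_{0j}$
$\beta_{1j} = \beta_1 + u_{1j}$
$\begin{bmatrix} u_{0j} \\ u_{1j} \end{bmatrix} \sim N(0, \Omega_u), \quad \Omega_u = \begin{bmatrix} \sigma^2_{u0} & \\ \sigma_{u01} & \sigma^2_{u1} \end{bmatrix}$
$e_{ij} \sim N(0, \sigma^2_e)$
[Figure: non-parallel lines for school 1 and school 2 around the average line $\beta_0 + \beta_1 x_{1ij}$, with intercept differentials $u_{0,1}$, $u_{0,2}$ and slope differentials $u_{1,1}$, $u_{1,2}$; horizontal axis from -3 to 3.]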
S8 Model in Manual : p54
S9 Estimates in Manual : p54
S10 Plot of predictions for schools:
p56
7: Multilevel residuals
8.0 Contextual effects
In the previous sections we found that schools vary in both their intercepts
and slopes, resulting in crossing lines. The next question is: are there any
school-level variables that can explain this variation?
Interest lies in how the outcome of individuals in a cluster is affected by their
social contexts (measures at the cluster level). Typical questions are:
• Does school type affect students' exam results?
• Does area deprivation affect the health status of individuals in the area?
In our data set we have a contextual school ability measure, schav. The
mean intake score is formed for each school, these means are ranked and
the ranks are categorised into 3 groups:
low ≤ 25%, 25% < mid ≤ 75%, high > 75%
8.1 Exploring contextual effects and the tutorial data
Does school gender affect exam score by gender?
Do boys in boys’ schools do better or worse or the same compared with
boys in mixed schools?
Do girls in girls’ schools do better or worse or the same compared with
girls in mixed schools?
Does peer group ability affect individual pupil performance?
That is, given two pupils of equal intake ability, do they progress
differently depending on whether they are educated in a low, mid or
high ability peer group?
8.2 School gender effects
girl  boysch  girlsch  Group              Prediction
0     0       0        boy/mixed school   -0.189
1     0       0        girl/mixed school  -0.189 + 0.168
0     1       0        boy/boy school     -0.189 + 0.180
1     0       1        girl/girl school   -0.189 + 0.168 + 0.175
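The four predictions are simply the intercept plus whichever dummy-variable coefficients are switched on. As a small worked illustration (the dictionary keys and the function name `predict` are labels invented for this sketch; the coefficients are those quoted on the slide):

```python
# Coefficients quoted on the slide: intercept, girl, boysch, girlsch.
COEF = {"cons": -0.189, "girl": 0.168, "boysch": 0.180, "girlsch": 0.175}


def predict(girl, boysch, girlsch):
    """Fixed-part prediction for a pupil with the given 0/1 dummies."""
    return (COEF["cons"]
            + COEF["girl"] * girl
            + COEF["boysch"] * boysch
            + COEF["girlsch"] * girlsch)


groups = {
    "boy/mixed school": (0, 0, 0),
    "girl/mixed school": (1, 0, 0),
    "boy/boy school": (0, 1, 0),
    "girl/girl school": (1, 0, 1),
}
predictions = {name: predict(*dummies) for name, dummies in groups.items()}

for name, value in predictions.items():
    print(f"{name:18s} {value:+.3f}")
```

This gives -0.189, -0.021, -0.009 and +0.154 respectively, so on these coefficients girls in girls' schools have the highest predicted score and boys in mixed schools the lowest.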
8.3 Peer group ability effects
The effect of peer group ability is modelled as being constant across
gender, school gender and standlrt.
For example, comparing peer group ability effects for boys in mixed
schools and boys in boys’ schools:
boy, mixed school, low (reference group): $-0.265 + 0.552 \times \mathrm{standlrt}_{ij}$
boy, mixed school, mid: +0.067
boy, mixed school, high: +0.174
Boys’ school: +0.187, which shifts all three lines to give boy, boy school, low; boy, boy school, mid; and boy, boy school, high.
8.4 Cross level interactions
There may be interactions between school gender, peer group ability, gender
and standlrt. An interesting interaction is between peer group ability and
standlrt. This tests whether the effect of peer group differs across the standlrt
intake spectrum. For example, being in a high ability group may have a
different effect for pupils of different ability. This is a cross level interaction
because it is the interaction between a pupil level variable(standlrt) and a
school level variable(schav).
8.5 Cross level interactions cont’d
Which leads to three lines for the low, mid and high groupings:
$-0.347 + 0.455\,\mathrm{standlrt}_{ij}$ : low
$(-0.347 + 0.144) + (0.455 + 0.092)\,\mathrm{standlrt}_{ij}$ : mid
$(-0.347 + 0.290) + (0.455 + 0.180)\,\mathrm{standlrt}_{ij}$ : high
Note that high ability pupils (standlrt = 2.6) score about 0.76 sd higher
(0.290 + 0.180 × 2.6 = 0.758) if they are educated in high rather than low
ability peer groups.
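The size of the peer group effect at any intake score can be read off the three lines. The sketch below evaluates them using the coefficients quoted on the slide; the names `LINES` and `predict` are made up for this illustration.

```python
# (intercept, slope) for each peer-group line, built from the slide's coefficients.
LINES = {
    "low": (-0.347, 0.455),
    "mid": (-0.347 + 0.144, 0.455 + 0.092),
    "high": (-0.347 + 0.290, 0.455 + 0.180),
}


def predict(group, standlrt):
    """Predicted score for a pupil with intake score standlrt in a peer group."""
    intercept, slope = LINES[group]
    return intercept + slope * standlrt


# Gap between high and low ability peer groups across the intake range.
gaps = {x: predict("high", x) - predict("low", x) for x in (-2.6, 0.0, 2.6)}
for x, gap in gaps.items():
    print(f"standlrt = {x:+.1f}: high minus low = {gap:+.3f}")
```

Because the slopes differ, the gap is not constant: it is 0.290 at the mean intake score and grows as standlrt increases, which is what makes this a cross-level interaction.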
9.1 Repeated measures.
We may have repeated measurements on individuals, for example a
series of heights or test scores. Often we want to model people's growth.
We can fit this structure as a multilevel model with repeated
measurements nested within people. That is:
Person    P1        P2     P3…
Occasion  O1 O2 O3  O1 O2  O1 O2 O3 O4
9.2 Advantages of fitting repeated measures models in a
multilevel framework
Fitting these structures using a multilevel model has the advantages that
data can be
• Unbalanced (individuals can have different numbers of measurement
occasions)
• Unequally spaced (different individuals can be measured at different ages)
This is in contrast to traditional multivariate techniques, which require data
to be balanced and equally spaced.
Again, the multilevel model requires that missing response measurements are
missing completely at random (MCAR) or missing at random (MAR).
9.3 An example from the MLwiN user guide
Repeated measures model for children's reading scores
This random intercepts model treats growth as a linear process
with individuals varying only in their intercepts, for the 405
individuals in the data set.
The global mean is predicted by
$\beta_0 x_0 + \beta_1 x_{1ij}$
The jth child’s growth curve is predicted by
$(\beta_0 + u_{0j}) x_0 + \beta_1 \mathrm{age}_{ij}$
9.4 Further possibilities for repeated measures model
• We can go on and fit a random slope model, which in this
case allows the model to deal with children growing at
different rates.
• We can fit polynomials in age to allow for curvilinear
growth.
• We can also try to explain between-individual variation
in growth by introducing child-level variables.
• If appropriate we can include further levels of nesting. For
example, if children are nested within schools we could fit
a 3-level model [occasions:children:schools]. We could
then look to see if children's patterns of growth varied
across schools.
10.0 Variance functions or modelling heteroscedasticity
Tabulating normexam by gender we see that the means and variances for
boys and girls are (–0.140 and 1.051) and (0.093 and 0.940).
We may want to fit a model that estimates separate variances for boys and
girls. The notation we have been using so far assumes a common
intercept ($\beta_0$) and a single set of student residuals, $e_i$, with a common
variance $\sigma^2_e$. We need to use a more flexible notation to build this model.
10.1 Working with general notation in MLwiN
A model with no variables specified in
general notation looks like this.
A new first line is added stating that the response variable follows a
Normal distribution. We now have the flexibility to specify
alternative distributions for our response. We will explore these
models later.
The $\beta_0$ coefficient now has an explanatory variable $x_0$ associated with it.
The values $x_0$ takes determine the meaning of the $\beta_0$ coefficient. If $x_0$ is
a vector of 1s then $\beta_0$ will estimate an intercept common to all
individuals; in the absence of other predictors this would be the
overall mean. If $x_0$ is a (0,1) variable, say 1 for boys and 0 for girls, then
$\beta_0$ will estimate the mean for boys.
10.2 A simple variance function
The new notation allows us to set up this simple model, where $x_{0i}$ is a dummy
variable for boy and $x_{1i}$ is a dummy variable for girl. This model
estimates separate means and variances for the two groups.
This is an example of a variance function because the variance changes as a
function of explanatory variables. The function is:
$\mathrm{var}(y_i) = \sigma^2_{e0} x_{0i} + \sigma^2_{e1} x_{1i}$
10.3 Deriving the variance function
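We arrive at the expression
$\mathrm{var}(y_i) = \sigma^2_{e0} x_{0i} + \sigma^2_{e1} x_{1i}$   (1)
by taking the basic model
$y_i = \beta_{0i} x_{0i} + \beta_{1i} x_{1i}$
$\beta_{0i} = \beta_0 + e_{0i}$
$\beta_{1i} = \beta_1 + e_{1i}$
and rearranging it:
$y_i = \beta_0 x_{0i} + \beta_1 x_{1i} + e_{0i} x_{0i} + e_{1i} x_{1i}$
$\mathrm{var}(y_i) = \mathrm{var}(e_{0i} x_{0i} + e_{1i} x_{1i}) = \mathrm{var}(e_{0i} x_{0i}) + 2\,\mathrm{cov}(e_{0i} x_{0i}, e_{1i} x_{1i}) + \mathrm{var}(e_{1i} x_{1i})$
$= \mathrm{var}(e_{0i}) x_{0i}^2 + 2\,\mathrm{cov}(e_{0i}, e_{1i}) x_{0i} x_{1i} + \mathrm{var}(e_{1i}) x_{1i}^2 = \sigma^2_{e0} x_{0i}^2 + 2\sigma_{e01} x_{0i} x_{1i} + \sigma^2_{e1} x_{1i}^2$
The $\sigma_{e01}$ term drops out because a student cannot be both a boy and a girl,
so $x_{0i} x_{1i} = 0$. Also $x_{0i}$ and $x_{1i}$ are (0,1) variables, so $x_{0i}^2 = x_{0i}$ and
$x_{1i}^2 = x_{1i}$, and we arrive at (1).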
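The derivation can be checked numerically. In the sketch below the function name `level1_variance` is invented for illustration, and the two variances are the raw boy and girl variances from the tabulation of normexam, used here as stand-ins for model estimates (an assumption, since the handout does not print the fitted values).

```python
def level1_variance(x0, x1, var_e0, var_e1, cov_e01=0.0):
    """General quadratic form: var_e0*x0^2 + 2*cov_e01*x0*x1 + var_e1*x1^2."""
    return var_e0 * x0 ** 2 + 2 * cov_e01 * x0 * x1 + var_e1 * x1 ** 2


# Stand-in values: raw variances for boys and girls from the tabulation.
VAR_BOY, VAR_GIRL = 1.051, 0.940

# x0 = 1 for boys, x1 = 1 for girls; the two dummies are never both 1.
boy_var = level1_variance(1, 0, VAR_BOY, VAR_GIRL)
girl_var = level1_variance(0, 1, VAR_BOY, VAR_GIRL)

# The covariance is irrelevant because x0 * x1 is always 0:
boy_var_with_cov = level1_variance(1, 0, VAR_BOY, VAR_GIRL, cov_e01=0.3)

print("variance for boys: ", boy_var)
print("variance for girls:", girl_var)
print("unchanged by covariance:", boy_var == boy_var_with_cov)
```

Whatever value is supplied for the covariance, the variance for each group is just that group's own variance parameter, which is the content of expression (1).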
10.4 Variance functions at level 2
The notion of variance functions is powerful and not restricted to
level 1 variances.
The random slopes model fitted earlier produces
the following school level predictions which
show school level variability increasing with
intake score.
The model
$y_{ij} = \beta_{0ij} x_0 + \beta_{1j} x_{1ij}$
$\beta_{0ij} = \beta_0 + u_{0j} + e_{0ij}$
$\beta_{1j} = \beta_1 + u_{1j}$
can be rewritten as
$y_{ij} = \beta_0 x_0 + \beta_1 x_{1ij} + u_{0j} x_0 + u_{1j} x_{1ij} + e_{0ij} x_0$
$\begin{bmatrix} u_{0j} \\ u_{1j} \end{bmatrix} \sim N(0, \Omega_u), \quad \Omega_u = \begin{bmatrix} \sigma^2_{u0} & \\ \sigma_{u01} & \sigma^2_{u1} \end{bmatrix}$
$e_{0ij} \sim N(0, \sigma^2_e)$
So the between-school variance is
$\mathrm{var}(u_{0j} x_0 + u_{1j} x_{1ij}) = E((u_{0j} x_0 + u_{1j} x_{1ij})^2)$
$= \sigma^2_{u0} x_0^2 + 2\sigma_{u01} x_0 x_{1ij} + \sigma^2_{u1} x_{1ij}^2$
10.5 Two views of the level 2 variance
Given $x_0 = [1]$, we have
$\mathrm{var}(u_{0j} x_0 + u_{1j} x_{1ij}) = \sigma^2_{u0} x_0^2 + 2\sigma_{u01} x_0 x_{1ij} + \sigma^2_{u1} x_{1ij}^2 = \sigma^2_{u0} + 2\sigma_{u01} x_{1ij} + \sigma^2_{u1} x_{1ij}^2$
which shows that the level 2 variance is a polynomial function of $x_{1ij}$:
$\mathrm{var}(u_{0j} x_0 + u_{1j} x_{1ij}) = a + b x_{1ij} + c x_{1ij}^2 = 0.9 + (2 \times 0.018) x_{1ij} + 0.015 x_{1ij}^2$
• View 1: in terms of school lines, with predicted
intercepts and slopes varying across schools.
• View 2: in terms of a variance function
which shows how the level 2 variance
changes as a function of 1 or more
explanatory variables.
10.6 Elaborating the level 1 variance
Maybe the student level departures around their school's summary line are not constant.
[Figure: 2 schools, 2 students]
Note that at level 2 we have two interpretations of the level 2 random variation: random coefficients (varying slopes and intercepts across level 2 units) and variance functions. Each level 1 unit, by definition, contributes only one data point, so the first interpretation does not exist at level 1: you cannot have a slope given a single data point.
10.7 Variance functions at level 1
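If we allow standlrt ($x_{1ij}$) to have a random term at level 1, we get
$$y_{ij} = \beta_0 x_0 + \beta_1 x_{1ij} + u_{0j} x_0 + u_{1j} x_{1ij} + e_{0ij} x_0 + e_{1ij} x_{1ij}$$
$$\begin{pmatrix} u_{0j} \\ u_{1j} \end{pmatrix} \sim N(0, \Omega_u), \qquad \Omega_u = \begin{pmatrix} \sigma_{u0}^2 & \\ \sigma_{u01} & \sigma_{u1}^2 \end{pmatrix}$$
$$\begin{pmatrix} e_{0ij} \\ e_{1ij} \end{pmatrix} \sim N(0, \Omega_e), \qquad \Omega_e = \begin{pmatrix} \sigma_{e0}^2 & \\ \sigma_{e01} & \sigma_{e1}^2 \end{pmatrix}$$
So the student level variance is now:
$$\mathrm{var}(e_{0ij} x_0 + e_{1ij} x_{1ij}) = \sigma_{e0}^2 x_0^2 + 2\sigma_{e01} x_0 x_{1ij} + \sigma_{e1}^2 x_{1ij}^2 = 0.533 + (2 \times -0.015) x_{1ij} + 0.001 x_{1ij}^2$$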
The resulting graph shows level 1 variance decreasing with respect to standlrt. This accentuates the importance of school level factors in driving variation in the outcome score, particularly for high ability pupils.
10.8 Modelling the mean and variance simultaneously
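In our model
$$y_{ij} = \beta_0 x_0 + \beta_1 x_{1ij} + u_{0j} x_0 + u_{1j} x_{1ij} + e_{0ij} x_0 + e_{1ij} x_{1ij}$$
$$\begin{pmatrix} u_{0j} \\ u_{1j} \end{pmatrix} \sim N(0, \Omega_u), \qquad \Omega_u = \begin{pmatrix} \sigma_{u0}^2 & \\ \sigma_{u01} & \sigma_{u1}^2 \end{pmatrix}$$
$$\begin{pmatrix} e_{0ij} \\ e_{1ij} \end{pmatrix} \sim N(0, \Omega_e), \qquad \Omega_e = \begin{pmatrix} \sigma_{e0}^2 & \\ \sigma_{e01} & \sigma_{e1}^2 \end{pmatrix}$$
The global mean is predicted by
$$\beta_0 x_0 + \beta_1 x_{1ij}$$
The jth school mean is predicted by
$$(\beta_0 + u_{0j}) x_0 + (\beta_1 + u_{1j}) x_{1ij}$$
The student level variance is
$$\mathrm{var}(e_{0ij} x_0 + e_{1ij} x_{1ij}) = \sigma_{e0}^2 + 2\sigma_{e01} x_{1ij} + \sigma_{e1}^2 x_{1ij}^2$$
The school level variance is
$$\mathrm{var}(u_{0j} x_0 + u_{1j} x_{1ij}) = \sigma_{u0}^2 + 2\sigma_{u01} x_{1ij} + \sigma_{u1}^2 x_{1ij}^2$$
Whereas ordinary regression,
$$y_i = \beta_0 + \beta_1 x_{1i} + e_i, \qquad e_i \sim N(0, \sigma_e^2)$$
estimates the global relationship and has a single catch-all bucket for the variance.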
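As an illustration, the two variance functions can be evaluated side by side. This is a minimal sketch using the coefficients quoted in 10.5 and 10.7; the level 1 covariance is assumed to be negative, consistent with the decreasing level 1 variance described in 10.7. The helper names are made up for this example.

```python
def quadratic_variance(x, var0, cov01, var1):
    """Variance function var0 + 2*cov01*x + var1*x^2 (with x0 = 1)."""
    return var0 + 2 * cov01 * x + var1 * x**2


# Coefficients as quoted in sections 10.5 and 10.7 of these notes.
LEVEL2 = {"var0": 0.9, "cov01": 0.018, "var1": 0.015}
# Sign of the level 1 covariance is an assumption (see lead-in).
LEVEL1 = {"var0": 0.533, "cov01": -0.015, "var1": 0.001}


def school_share(x):
    """Proportion of total variance at school level for intake score x."""
    v2 = quadratic_variance(x, **LEVEL2)
    v1 = quadratic_variance(x, **LEVEL1)
    return v2 / (v1 + v2)


for x in (-2.0, 0.0, 2.0):
    print(f"standlrt={x:+.1f}  level2={quadratic_variance(x, **LEVEL2):.3f}  "
          f"level1={quadratic_variance(x, **LEVEL1):.3f}  share={school_share(x):.3f}")
```

Under these assumptions the school level share of the variance rises with intake score, because the level 2 variance increases while the level 1 variance falls.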
11.00 Applied Paper – Example of Variance
functions
Understanding the sources of differential parenting:
the role of child and family level effects. Jenny
Jenkins, Jon Rasbash and Tom O’Connor
Developmental Psychology 2003(1) 99-113
11.01 Mapping multilevel terminology to psychological
terminology
• Level 2 : Family, shared environment
Variables : family ses, marital problems
• Level 1 : Child, non-shared environment, child specific
Variables : age, sex, temperament
11.02 Background
• Recent studies in developmental psychology and behavioural genetics emphasise that non-shared environment is much more important in explaining children’s adjustment than shared environment. This has led to a focus on non-shared environment (Plomin et al, 1994; Turkheimer & Waldron, 2000).
• Has this meant that we have ignored the role of the shared family context, both empirically and conceptually?
11.03 Questions
• One key aspect of the non-shared environment that has been
investigated is differential parental treatment of siblings.
• Differential treatment predicts differences in sibling adjustment
• What are the sources of differential treatment?
• Child specific/non-shared: age, temperament, biological relatedness
• Can family level shared environmental factors influence differential
treatment?
11.04 The Stress/Resources Hypothesis
Do family contexts(shared environment) increase or decrease the extent
to which children within the same family are treated differently?
“Parents have a finite amount of resources in terms of time, attention,
patience and support to give their children. In families in which most of
these resources are devoted to coping with economic stress, depression
and/or marital conflict, parents may become less consciously or
intentionally equitable and more driven by preferences or child
characteristics in their childrearing efforts”. Henderson et al 1996.
This is the hypothesis we wish to test. We operationalised the
stress/resources hypothesis using four contextual variables:
socioeconomic status, single parenthood, large family size, and marital
conflict
11.05 How differential parental treatment has been analysed
Previous analyses in the literature exploring the sources of differential parental treatment ask mothers to rate two siblings in terms of the treatment (positive or negative) they give to each child. The difference between these two treatment scores is then analysed.
This approach has several major limitations…
11.06 The sibling pair difference model for exploring determinants of differential parenting
$$(y_{1i} - y_{2i}) = \beta_0 + \beta_1 x_{1i} + \dots$$
where $y_{1i}$ and $y_{2i}$ are parental ratings for siblings 1 and 2 in family $i$, and $x_{1i}$ is a family level variable, for example family SES.
Problems
• One measurement per family makes it impossible to separate shared and non-shared random effects.
• All information about the magnitude of the response is lost: (2,4) is treated the same as (22,24).
• It is not possible to introduce level 1 (non-shared) variables, since the data have been aggregated to level 2.
• Family sizes larger than two cannot be handled.
11.07 With a multilevel model…
$$y_{ij} = \beta_0 + \beta_1 x_{1ij} + \beta_2 x_{2j} + u_j + e_{ij}$$
$$u_j \sim N(0, \sigma_u^2), \qquad e_{ij} \sim N(0, \sigma_e^2)$$
where $y_{ij}$ is the $j$’th mother’s rating of her treatment of her $i$’th child, $x_{1ij}$ are child level (non-shared) variables, and $x_{2j}$ are family level (shared) variables.
$u_j$ and $e_{ij}$ are family and child (shared and non-shared environment) random effects.
Note that the level 1 variance $\sigma_e^2$ is now a measure of differential parenting.
11.08 Advantages of the multilevel approach
•Can handle more than two kids per family
•Unconfounds family and child allowing estimation of family and child
level fixed and random effects
•Can model parenting level and differential parenting in the same model.
11.09 Overall Survey Design
• National Longitudinal Survey of Children and Youth (NLSCY)
• Statistics Canada Survey, representative sample of children across the
provinces
• Nested design includes up to 4 children per family
• PMK respondent
• 4-11 year old children
• Criteria: another sibling in the age range, be living with at least one
biological parent, 4 years of age or older
• 8,474 children
• 3,860 families
• Family sizes: 4 children = 60, 3 children = 630, 2 children = 3157
11.10 Measures of parental treatment of child
Derived from factor analyses.
• PMK report of positive parenting: frequency of praise of child, talk or play focusing on child, activities enjoyed together (α = .81)
• PMK report of negative parenting: frequency of disapproval, annoyance, anger, mood related punishment (α = .71)
• Will talk today about positive parenting
PMK is the person most knowledgeable about the child.
Child specific factors
• Age
• Gender
• Child position in family
• Negative emotionality
• Biological relatedness to father and mother
Family context factors
• Socioeconomic status
• Family size
• Single parent status
• Marital dissatisfaction
11.12 Model 1: Null Model
$$y_{ij} = \beta_0 + u_j + e_{ij}, \qquad u_j \sim N(0, \sigma_u^2), \qquad e_{ij} \sim N(0, \sigma_e^2)$$
$$\hat\beta_0 = 12.51\,(0.04), \qquad \hat\sigma_u^2 = 5.13\,(0.17), \qquad \hat\sigma_e^2 = 3.8\,(0.08)$$
The baseline estimate of differential parenting is 3.8. We can now add further shared and non-shared explanatory variables and judge their effect on differential parenting by the reduction in the level 1 variance.
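Two quantities follow directly from these variance estimates: the share of the total variance that lies between families, and the proportional reduction in level 1 variance when predictors are added. The sketch below uses the null model estimates above and, for the reduction, the 3.8 to 2.3 change reported in 11.15; the function name is illustrative only.

```python
# Null model variance estimates quoted in 11.12.
VAR_FAMILY = 5.13   # level 2 (between families)
VAR_CHILD = 3.8     # level 1 (within families, differential parenting)

# Share of total variance that lies between families.
family_share = VAR_FAMILY / (VAR_FAMILY + VAR_CHILD)


def proportional_reduction(before, after):
    """Fraction by which a variance estimate falls between two models."""
    return (before - after) / before


# Level 1 variance before and after modelling age (values from 11.15).
age_reduction = proportional_reduction(3.8, 2.3)

print(f"between-family share of variance: {family_share:.3f}")
print(f"reduction in level 1 variance:    {age_reduction:.3f}")
```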
11.13 Model 2 : expanded model
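$$y_{ij} = \beta_{0j} + \beta_{1j}\,\mathrm{age}_{ij} + \beta_2\,\mathrm{age}_{ij}^2 + \beta_3\,\mathrm{girl}_{ij} + \beta_4\,\mathrm{notBioM}_{ij} + \beta_5\,\mathrm{notBioF}_{ij} + \beta_6\,\mathrm{oldestSib}_{ij} + \beta_7\,\mathrm{midSib}_{ij}$$
$$\quad + \beta_8\,\mathrm{hses}_j + \beta_9\,\mathrm{famsize}_j + \beta_{10}\,\mathrm{loneParent}_j + \beta_{11}\,\mathrm{allGirls}_j + \beta_{12}\,\mathrm{mixedGender}_j + \beta_{13}\,\mathrm{maritalprb}_j + \beta_{14}\,\mathrm{famsize}_j \times \mathrm{age}_{ij} + e_{ij}$$
$$\beta_{0j} = \beta_0 + u_{0j}, \qquad \beta_{1j} = \beta_1 + u_{1j}$$
$$\begin{pmatrix} u_{0j} \\ u_{1j} \end{pmatrix} \sim N(0, \Omega_u), \qquad e_{ij} \sim N(0, \sigma_e^2)$$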
11.14 Positive parenting
Child level predictors
• Strongest predictor of positive parenting is age. Younger siblings get more attention. This relationship is moderated by family membership.
• Non-bio mother and non-bio father reduce positive parenting
• Oldest sibling > youngest sibling > middle siblings
Family level predictors
• Household SES increases positive parenting
• Marital dissatisfaction, increasing family size, mixed or all girl sibships all decrease positive parenting
• Lone parenthood has no effect.
11.15 Differential parenting
Modelling age reduced the level 1 variance (our measure of differential parenting) from 3.8 to 2.3, a reduction of 40%. Other explanatory variables, both child specific and family (shared environment), provide no significant reduction in the level 1 variation.
Does this mean that there is no evidence to support the stress/resources hypothesis?
11.16 Testing the stress/resource hypothesis
• The mean and the variance are modelled simultaneously. So far we
have modelled the mean in terms of shared environment but not the
variance.
• We can elaborate model 2 by allowing the level 1 variance to be a function of the family level variables household socioeconomic status, large family size, and marital conflict. That is
$$\sigma_{ej}^2 = w_0 + 2w_1\,\mathrm{hses}_j + w_2\,\mathrm{hses}_j^2 + 2w_3\,\mathrm{marital}_j + 2w_4\,\mathrm{maritalprb.ses}_j + 2w_5\,\mathrm{familysize}_j$$
$$\hat w_0 = 1.84\,(0.1), \qquad \hat w_1 = 0.23\,(0.04), \qquad \hat w_2 = 0.17\,(0.07), \qquad \hat w_4 = 0.29\,(0.13), \qquad \hat w_5 = 0.11\,(0.05)$$
Reduction in the deviance with 7df is 78.
11.17 Graphically …
[Figure: differential parenting in positive parenting (vertical axis, 1 to 5) plotted against household SES (horizontal axis, -2.0 to 2.0), with one line for each of four groups: family size = 2, no marital problems; family size = 2, marital problems; family size > 2, marital problems; family size > 2, no marital problems.]
11.18 Modelling the mean and variance simultaneously
We show a possible pattern of how the mean, within family variance and
between family variance might behave as functions of HSES in the schematic
diagram below.
Here are 5 families of increasing HSES (in the actual data set there are 3900 families).
We can fit a linear function of SES to the mean.
The family means now vary around the dashed trend line. This is now the between family variation, which is pretty constant with respect to HSES.
However, the within family variation (our measure of differential parenting) decreases with HSES. This supports the stress/resources hypothesis.
12 Multivariate response models
We may have data on two or more responses we wish to analyse jointly. For example, we may have English and maths scores on pupils within schools. We can consider the response type as a level below pupil.
[Unit diagram: schools (S1, S2, ...) at the top, pupils (P1, P2, P3, P4, ...) within schools, and English (E) and maths (M) responses within pupils.]
12.01 Rearranging data
Often data come like this, with one row per person:

school  pupil  English  Maths
1       1      50       60
1       2      80       70
1       3      50       45
2       4      75       85
2       5      60       40

For MLwiN to analyse the data we require the data matrix to have one row per level 1 unit, that is, one row per response measurement:

school  pupil  response  x0  x1
1       1      50        1   0
1       1      60        0   1
1       2      80        1   0
1       2      70        0   1
1       3      50        1   0
1       3      45        0   1
2       4      75        1   0
2       4      85        0   1
2       5      60        1   0
2       5      40        0   1

x0 is 1 if the response for this record is English, 0 otherwise
x1 is 1 if the response for this record is Maths, 0 otherwise
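The rearrangement can be sketched in a few lines of Python. This is only an illustration of the reshaping logic, not MLwiN's own procedure; the function and key names are made up for the example. A missing response simply produces no record, which is how unbalanced data enter the model.

```python
# One row per pupil, as in the first layout above.
wide = [
    {"school": 1, "pupil": 1, "english": 50, "maths": 60},
    {"school": 1, "pupil": 2, "english": 80, "maths": 70},
    {"school": 1, "pupil": 3, "english": 50, "maths": 45},
    {"school": 2, "pupil": 4, "english": 75, "maths": 85},
    {"school": 2, "pupil": 5, "english": 60, "maths": 40},
]


def wide_to_long(rows, responses=("english", "maths")):
    """Return one record per response measurement with indicator columns."""
    long_rows = []
    for row in rows:
        for k, name in enumerate(responses):
            score = row.get(name)
            if score is None:
                continue  # missing response: no record is created
            record = {"school": row["school"], "pupil": row["pupil"],
                      "response": score}
            # x0 flags the first response type, x1 the second, and so on.
            for m in range(len(responses)):
                record[f"x{m}"] = 1 if m == k else 0
            long_rows.append(record)
    return long_rows


long_rows = wide_to_long(wide)
for record in long_rows:
    print(record)
```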
12.02 Writing down the model
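$$y_{1j} = \beta_{0j} x_{0ij}, \qquad \beta_{0j} = \beta_0 + u_{0j}$$
$$y_{2j} = \beta_{1j} x_{1ij}, \qquad \beta_{1j} = \beta_1 + u_{1j}$$
$$\begin{pmatrix} u_{0j} \\ u_{1j} \end{pmatrix} \sim N(0, \Omega_u), \qquad \Omega_u = \begin{pmatrix} \sigma_{u0}^2 & \\ \sigma_{u01} & \sigma_{u1}^2 \end{pmatrix}$$
where $y_{1j}$ is the English score for student $j$ and $y_{2j}$ is the maths score for student $j$.
The means and variances for English and maths ($\beta_0, \beta_1, \sigma_{u0}^2, \sigma_{u1}^2$) are estimated. The covariance between maths and English, $\sigma_{u01}$, is also estimated.
Note there is no level 1 ($e_{ij}$) variance. This can be seen if we consider the picture for one pupil.
[Figure: for one pupil, the English response lies $u_{0j}$ from the English mean $\beta_0$ and the maths response lies $u_{1j}$ from the maths mean $\beta_1$.]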
12.03 Advantages of framing a multivariate response model
as a multilevel model
The model has the following advantages over traditional multivariate techniques:
• It can deal with missing responses, provided response data are missing completely at random (MCAR) or missing at random (MAR), that is, missingness is related to explanatory variables in the model.
• Covariates can be added, giving us the conditional covariance matrix of the responses.
• Further levels can be added to the model.
12.04 Example from MLwiN user guide
pupils have two responses : written and coursework
mean for written = 46.8
Variance(written) = 178.7
mean for coursework = 73.36
Variance(coursework) = 265.4
covariance(written, coursework) = 102.3
That is we have two means and a covariance matrix, which we could
get from any stats package. However, the data are unbalanced. Of the
1905 pupils 202 are missing a written response and 180 are missing a
coursework response.
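The covariance matrix above implies a correlation between the two responses. A minimal worked calculation, using only the estimates quoted in this section:

```python
import math

# Estimates quoted above for the written and coursework responses.
VAR_WRITTEN = 178.7
VAR_COURSEWORK = 265.4
COVARIANCE = 102.3

# Correlation = covariance / (sd_written * sd_coursework).
correlation = COVARIANCE / math.sqrt(VAR_WRITTEN * VAR_COURSEWORK)
print(f"pupil level correlation: {correlation:.2f}")
```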
12.05 Further extensions
We can add further explanatory variables. For example, female: we see that females do better than males on coursework and worse than males on written exams.
We can add further levels. Here we partition the covariance structure into student and school components.
13.0 MCMC estimation in MLwiN
MCMC estimation is a big topic and is given a pragmatic and cursory treatment here. Interested students are referred to the manual “MCMC estimation in MLwiN” available from
http://www.cmm.bris.ac.uk/mlwin/download/manuals.shtml
In the workshop so far you have been using the IGLS (Iterative Generalised Least Squares) algorithm to estimate the models.
13.1 IGLS versus MCMC
IGLS
• Fast to compute
• Deterministic convergence, easy to judge
• Uses MQL/PQL approximations to fit discrete response models, which can produce biased estimates in some cases
• In samples with small numbers of level 2 units, confidence intervals for level 2 variance parameters assume Normality, which is inaccurate
• Cannot incorporate prior information
• Difficult to extend to new models
MCMC
• Slower to compute
• Stochastic convergence, harder to judge
• Does not use approximations when estimating discrete response models; estimates are less biased
• In samples with small numbers of level 2 units, Normality is not assumed when making inferences for level 2 variance parameters
• Can incorporate prior information
• Easy to extend to new models
13.2 Bayesian framework
MCMC estimation operates in a Bayesian framework. A Bayesian framework requires one to think about the prior information we have on the parameters we are estimating and to formally include that information in the model. We may decide that we are in a state of complete ignorance about the parameters we are estimating, in which case we must specify a so-called “uninformative prior”. The “posterior” distribution for a parameter $\theta$, given that we have observed $y$, is subject to the following rule:
$$p(\theta \mid y) \propto p(y \mid \theta)\,p(\theta)$$
where
$p(\theta \mid y)$ is the posterior distribution for $\theta$ given we have observed $y$
$p(y \mid \theta)$ is the likelihood of observing $y$ given $\theta$
$p(\theta)$ is the probability distribution arising from some statement of prior belief such as “we believe $\theta \sim N(1, 0.01)$”. Note that “we believe $\theta \sim N(1, 1)$” is a much weaker and therefore less influential statement of prior belief.
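To see how much the strength of the prior matters, consider the simplest case where the posterior is available in closed form: a Normal mean with known data variance and a Normal prior. This is a sketch under those assumptions; the data summary (10 observations with mean 2.0 and variance 1.0) is hypothetical.

```python
def normal_posterior(prior_mean, prior_var, ybar, n, data_var):
    """Posterior mean and variance for a Normal mean with known data variance."""
    prior_precision = 1.0 / prior_var
    data_precision = n / data_var
    post_var = 1.0 / (prior_precision + data_precision)
    post_mean = post_var * (prior_precision * prior_mean + data_precision * ybar)
    return post_mean, post_var


# Hypothetical data summary.
YBAR, N, DATA_VAR = 2.0, 10, 1.0

strong_mean, strong_var = normal_posterior(1.0, 0.01, YBAR, N, DATA_VAR)  # N(1, 0.01)
weak_mean, weak_var = normal_posterior(1.0, 1.0, YBAR, N, DATA_VAR)       # N(1, 1)

print(f"strong prior: posterior mean {strong_mean:.3f}")
print(f"weak prior:   posterior mean {weak_mean:.3f}")
```

With the tight prior the posterior mean stays close to the prior value of 1; with the weaker prior it moves most of the way to the sample mean of 2.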
13.3 Applying MCMC to multilevel models
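In a two level variance components model we have the following unknowns:
$$\beta,\ u,\ \sigma_u^2,\ \sigma_e^2$$
Their joint posterior is
$$p(\beta, u, \sigma_u^2, \sigma_e^2 \mid y) \propto p(y \mid \beta, u, \sigma_e^2)\,p(u \mid \sigma_u^2)\,p(\beta)\,p(\sigma_u^2)\,p(\sigma_e^2)$$
• Posterior, $p(\beta, u, \sigma_u^2, \sigma_e^2 \mid y)$: the final answers, a combination of likelihood and priors.
• Likelihood, $p(y \mid \beta, u, \sigma_e^2)\,p(u \mid \sigma_u^2)$: “what the data says”, estimated from the data.
• Prior beliefs, $p(\beta)\,p(\sigma_u^2)\,p(\sigma_e^2)$: supplied by the researcher.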
13.4 Gibbs sampling
Evaluating the expression for the joint posterior with all the parameters unknown is, for most models, virtually impossible. However, if we take each unknown parameter in turn and temporarily assume we know the values of the other parameters, then we can simulate from the so-called “conditional posterior” distribution. The Gibbs sampling algorithm cycles through the following simulation steps. First we assume some starting values for our unknown parameters:
$$\beta_{(0)},\ u_{(0)},\ \sigma_{u(0)}^2,\ \sigma_{e(0)}^2$$
13.5 Gibbs sampling cont’d
We then sample from the following conditional distributions in rotation. Firstly
$$p(\beta \mid y, u_{(0)}, \sigma_{u(0)}^2, \sigma_{e(0)}^2)$$
to get $\beta_{(1)}$, then
$$p(u \mid y, \beta_{(1)}, \sigma_{u(0)}^2, \sigma_{e(0)}^2)$$
to get $u_{(1)}$, then
$$p(\sigma_u^2 \mid y, \beta_{(1)}, u_{(1)}, \sigma_{e(0)}^2)$$
to get $\sigma_{u(1)}^2$, then finally
$$p(\sigma_e^2 \mid y, \beta_{(1)}, u_{(1)}, \sigma_{u(1)}^2)$$
to get $\sigma_{e(1)}^2$.
We have now updated all the unknowns in the model. This process is repeated many times until eventually we converge on the distribution of each of the unknown parameters.
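The cycle can be sketched for the two level variance components model $y_{ij} = \beta + u_j + e_{ij}$. This is a minimal illustration, not MLwiN's implementation: it assumes a flat prior for $\beta$ and Gamma(eps, eps) priors on the two precisions (an assumption made here so that every conditional posterior has a standard form), and it runs on simulated, hypothetical data.

```python
import math
import random


def simulate(n_groups, n_per_group, beta, var_u, var_e, rng):
    """Hypothetical two-level data: y_ij = beta + u_j + e_ij."""
    data = []
    for _ in range(n_groups):
        u_j = rng.gauss(0.0, math.sqrt(var_u))
        data.append([beta + u_j + rng.gauss(0.0, math.sqrt(var_e))
                     for _ in range(n_per_group)])
    return data


def gibbs(data, n_iter=1500, burn_in=500, eps=0.001, seed=1):
    """Gibbs sampler; returns post burn-in draws of (beta, var_u, var_e)."""
    rng = random.Random(seed)
    n_groups = len(data)
    n_total = sum(len(group) for group in data)
    beta, var_u, var_e = 0.0, 1.0, 1.0        # starting values
    u = [0.0] * n_groups
    chain = []
    for it in range(n_iter):
        # Step 1: beta given u and var_e (flat prior assumed).
        total = sum(y - u[j] for j, group in enumerate(data) for y in group)
        beta = rng.gauss(total / n_total, math.sqrt(var_e / n_total))
        # Step 2: each u_j given beta, var_u and var_e.
        for j, group in enumerate(data):
            precision = len(group) / var_e + 1.0 / var_u
            mean = sum(y - beta for y in group) / var_e / precision
            u[j] = rng.gauss(mean, math.sqrt(1.0 / precision))
        # Step 3: var_u given u (Gamma draw for the precision, then invert).
        rate_u = eps + 0.5 * sum(v * v for v in u)
        var_u = 1.0 / rng.gammavariate(eps + 0.5 * n_groups, 1.0 / rate_u)
        # Step 4: var_e given beta and u.
        sse = sum((y - beta - u[j]) ** 2
                  for j, group in enumerate(data) for y in group)
        rate_e = eps + 0.5 * sse
        var_e = 1.0 / rng.gammavariate(eps + 0.5 * n_total, 1.0 / rate_e)
        if it >= burn_in:
            chain.append((beta, var_u, var_e))
    return chain


def posterior_means(chain):
    """Average each parameter over the retained draws."""
    n = len(chain)
    return tuple(sum(draw[k] for draw in chain) / n for k in range(3))


data = simulate(50, 10, beta=2.0, var_u=0.5, var_e=1.0, rng=random.Random(42))
beta_hat, var_u_hat, var_e_hat = posterior_means(gibbs(data))
print(f"beta={beta_hat:.2f}  var_u={var_u_hat:.2f}  var_e={var_e_hat:.2f}")
```

The retained draws form the simulation chains from which point estimates (here the chain means) and intervals would be calculated.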
13.6 IGLS vs MCMC convergence
The IGLS algorithm converges deterministically to a single set of point estimates.
The MCMC algorithm converges on a distribution. Parameter estimates and intervals are then calculated from the simulation chains.
13.7 Other MCMC issues
By default MLwiN uses flat, uninformative priors; see page 5 of MCMC estimation in MLwiN (MEM).
For specifying informative priors see chapter 6 of MEM.
For model comparison in MCMC using the DIC statistic see chapters 3 and 4 of MEM.
For a description of the MCMC algorithms used in MLwiN see chapter 2 of MEM.
13.8 When to consider using MCMC in MLwiN
If you have discrete response data – binary, binomial, multinomial or Poisson
(chapters 11, 12, 20 and 21). Often PQL gives quick and accurate estimates for
these models. However, it is a good idea to check against MCMC to test for bias
in the PQL estimates.
If you have few level 2 units and you want to make accurate inferences about
the distribution of higher level variances.
Some of the more advanced models in MLwiN are only available in MCMC.
For example, factor analysis (chapter 19), measurement error in predictor
variables (chapter 14) and CAR spatial models (chapter 16)
Other models can be fitted in IGLS but are handled more easily in MCMC, such as multiple imputation (chapter 17), cross-classified (chapter 14) and multiple membership models (chapter 15).
All chapter references are to MCMC estimation in MLwiN.
14.0 Generalised Multilevel
Models 1 : Binary Responses
and Proportions
14.1 Generalised multilevel models
•So Far
Response at level 1 has been a continuous variable and
associated level 1 random term has been assumed to have
a Normal distribution
•Now a range of other data types for the response
All can be handled routinely by MLwiN
•Achieved by 2 aspects
a non-linear link between response and predictors
a non-Gaussian level 1 distribution
14.2 Typology of discrete responses
Response | Example | Model
Binary | Yes/No | Logit or probit or log-log model with binomial L1 random term
Proportion | Proportion unemployed | Logit etc. with binomial L1 random term
Categorical (multiple categories) | Travel by train, car, foot | Logit model with ordered or unordered multinomial random term
Count | No of crimes in area | Log model with L1 Poisson random term
Count | LOS | Log model with L1 NBD random term
14.3 Focus on modelling proportions
• Proportions, eg death rate or employment rate, can be conceived as the underlying probability of dying or the probability of being employed
• Four important attributes of a proportion that MUST be taken into account in modelling:
(1) Closed range: bounded between 0 and 1
(2) Anticipated non-linearity between response and predictors: as the predicted response approaches the bounds, greater and greater change in x is required to achieve the same change in outcome (examination analogy)
(3) Two numbers: the numerator is a subset of the denominator
(4) Heterogeneity: variance is not homoscedastic; two aspects
(a) the variance depends on the mean: as we approach the bounds of 0 and 1 there is less room to vary, ie the variance is a function of the predicted probability
(b) the variance depends on the denominator: small denominators result in highly variable proportions
14.4 Modelling Proportions
• Linear probability model: that is, use a standard regression model with a linear relationship and a Gaussian random term
• But 3 problems
(1) Nonsensical predictions: predicted proportions are unbounded, outside the range of 0 and 1
(2) Anticipated non-linearity as we approach the bounds
(3) Heterogeneity: inherent unequal variance, dependent on the mean and on the denominator
• Logit model with Binomial random term resolves all three problems (could use probit or complementary log-log)
14.5 The logistic model: resolves problems 1 & 2
• The relationship between the probability and predictor(s) can be represented by a logistic function, which resembles an S-shaped curve
• Models not the proportion but a non-linear transformation of it (solves problems 1+2)
14.6 The Logit transformation
• L = log_e(p/(1-p)); the logit is the log of the odds
• p = proportion having an attribute
• 1-p = proportion not having the attribute
• p/(1-p) = the odds of having an attribute compared to not having an attribute
• As p goes from 0 to 1, L goes from minus to plus infinity, so if we model L we cannot get predicted proportions that lie outside 0 and 1 (ie solves problem 1)
• Easy to move between proportions, odds and logits
14.7 Proportions, Odds and Logits
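From proportion to odds to logit:

  | Proportion/Probability | Proportion (p) | Odds | Odds (p/(1-p)) | Log of odds, log_e(p/(1-p))
A | 5 out of 10 | 0.5 | 5 to 5 | 1.0 | 0
B | 6 out of 10 | 0.6 | 6 to 4 | 1.5 | 0.41
C | 8 out of 10 | 0.8 | 8 to 2 | 4 | 1.39

From logit back to odds:

  | Logit | Odds
A | e^0 | 1.0
B | e^0.41 | 1.5
C | e^1.39 | 4

From logit back to proportion:

  | Logit | Proportion
A | e^0/(1 + e^0) | 0.5
B | e^0.41/(1 + e^0.41) | 0.6
C | e^1.39/(1 + e^1.39) | 0.8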
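The conversions in the tables above can be reproduced with three small functions. This is a sketch for checking the arithmetic; the function names are chosen for this example.

```python
import math


def odds(p):
    """Odds of having the attribute: p / (1 - p)."""
    return p / (1 - p)


def logit(p):
    """Log of the odds."""
    return math.log(odds(p))


def inverse_logit(value):
    """Back from a logit to a proportion: e^L / (1 + e^L)."""
    return math.exp(value) / (1 + math.exp(value))


for label, p in (("A", 0.5), ("B", 0.6), ("C", 0.8)):
    print(label, f"p={p}", f"odds={odds(p):.2f}", f"logit={logit(p):.2f}",
          f"back to p={inverse_logit(logit(p)):.2f}")
```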
14.8 The logistic model
The underlying probability or proportion is non-linearly related to the predictor:
$$\pi = \frac{e^{\beta_0 + \beta_1 x_1}}{1 + e^{\beta_0 + \beta_1 x_1}}$$
where $e$ is the base of the natural logarithm
• linearized by the logit transformation (log = natural logarithm):
$$\log\left(\frac{\pi}{1 - \pi}\right) = \beta_0 + \beta_1 x_1$$
14.9 The logistic model: key characteristics
• The logit transformation produces a linear function of the
parameters.
• Bounded between 0 and 1
• Thereby solving problems 1 and 2
14.10 Solving problem 3:assume Binomial variation
• Variance of the response in logistic models is presumed to be binomial:
$$\mathrm{Var}(y \mid \pi) = \frac{\pi(1 - \pi)}{n}$$
ie it depends on the underlying proportion and the denominator
• In practice this is achieved by replacing the constant variable at level 1 by a binomial weight, $z$, and constraining the level 1 variance to 1 for exact binomial variation
• The random (level 1) component can be written as
$$y_i = \pi_i + e_i z_i, \qquad z_i = \sqrt{\frac{\hat\pi_i(1 - \hat\pi_i)}{n_i}}, \qquad \sigma_e^2 = 1$$
14.11 Multilevel Logistic Model
• Assume the observed response comes from a Binomial distribution with a denominator for each cell and an underlying probability/proportion:
$$y_{ij} \sim \mathrm{Binomial}(n_{ij}, \pi_{ij})$$
• The underlying proportions/probabilities, in turn, are related to a set of individual and neighbourhood predictors by the logit link function:
$$\mathrm{logit}(\pi_{ij}) = \ln\frac{\pi_{ij}}{1 - \pi_{ij}} = \beta_0 + \beta_1 x_{1ij} + \beta_2 x_{2ij} + \beta_3 x_{3ij} + u_{0j}$$
• The right-hand side is the linear predictor: the fixed part plus the higher-level random part
14.12 Estimation 1
• Quasi-likelihood (MQL/PQL, 1st and 2nd order)
– Model is linearised and IGLS applied.
– 1st or 2nd order Taylor series expansion (to linearise the non-linear model).
– MQL versus PQL: whether higher-level effects are included in the linearisation (PQL includes them).
– MQL1 is the crudest approximation. Estimates may be biased downwards (esp. if within cluster sample size is small and between cluster variance is large, eg households). But stable.
– PQL2 is the best approximation, but may not converge.
– Tip: start with MQL1 to get starting values for PQL.
14.13 Estimation 2
•MCMC methods: get deviance of
model (DIC) for sequential model
testing, and good quality estimates
even where cluster size is small; start
with MQL1 and then switch to MCMC
14.14 Variance Partition Coefficient
For a 2-level Normal response random intercept model:
$$\mathrm{VPC} = \frac{\text{Level 2 variance}}{\text{Level 1 variance} + \text{Level 2 variance}}$$
For a 2-level binary response model:
$$y_{ij} \sim \mathrm{Binomial}(1, \pi_{ij}), \qquad \mathrm{logit}(\pi_{ij} \mid x_{ij}, u_j) = a + \beta x_{1ij} + u_j, \qquad \mathrm{var}(u_j) = \sigma_u^2$$
$$\mathrm{var}(y_{ij} - \pi_{ij}) = \pi_{ij}(1 - \pi_{ij})$$
so the level 1 variance is a function of the predicted probability.
The level 2 variance $\sigma_u^2$ is on the logit scale and the level 1 variance $\mathrm{var}(y_{ij} - \pi_{ij})$ is on the probability scale, so they cannot be directly compared. Also the level 1 variance depends on $\pi_{ij}$ and therefore on $x_{1ij}$.
Possible solutions include i) setting the level 1 variance equal to the variance of a standard logistic distribution; ii) a simulation method.
14.15 VPC 1: Threshold Model
Formulate the logit model as:
$$y_{ij}^* = \beta^T x_{ij} + u_j + \varepsilon_{ij}^*$$
where $y_{ij}^*$ is a continuous latent variable underlying $y_{ij}$, and $\varepsilon_{ij}^*$ has a standard logistic distribution with variance $\pi^2/3 \approx 3.29$ (here $\pi$ is the mathematical constant). Then
$$\mathrm{VPC} = \frac{\sigma_u^2}{\sigma_u^2 + 3.29}$$
But this ignores the fact that the level 1 variance is not constant, but is a function of the mean probability, which depends on the predictors in the fixed part of the model.
14.16 VPC 2: Simulation Method
(i) Generate $M$ values for the random effect $u$ from $N(0, \hat\sigma_u^2)$: $u_{(1)}, u_{(2)}, \dots, u_{(M)}$, say 5000 group-level logit values
(ii) For $m = 1, \dots, M$ compute (for any chosen value $x^*$):
$$\pi_{(m)}^* = [1 + \exp(-(\hat\beta^T x^* + u_{(m)}))]^{-1} \quad \text{and} \quad v_{1(m)}^* = \pi_{(m)}^*(1 - \pi_{(m)}^*)$$
(iii) The level 1 variance is the mean of $v_{1(m)}^*$ ($m = 1, \dots, M$) and the level 2 variance is the variance of $\pi_{(m)}^*$; then use the ordinary VPC
14.17 Multilevel modelling of binary data
• Exactly the same as proportions except
• The response is either 1 or 0
• The denominator is a set of 1’s
• So that a ‘Yes’ is 1 out of 1 , while a ‘No’ is 0 out of 1
14.18 Chapter 9 of Manual:
Contraceptive Use in Bangladesh
• 2867 women nested in 60 districts
• y=1 if using contraception at time of
survey, y=0 if not using
contraception
• Covariates: age (mean centred),
urban residence (vs. rural)
14.19 Random Intercept Model: PQL2
Parameter | Estimate (SE)
Fixed
$\beta_0$ | -0.69 (0.08)
$\beta_1$ (urban) | 0.71 (0.10)
$\beta_2$ (age) | 0.015 (0.004)
Random (between-district)
$\sigma_{u0}^2$ | 0.21 (0.06)
14.20 Variance Partition Coefficient
Threshold model approach: 0.21/(0.21+3.29) = 0.060
Simulation approach (M = 5000, mean age):
Urban: 0.050
Rural: 0.045
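Both approaches can be sketched in a few lines, using the PQL2 estimates from 14.19 (age is mean centred, so the age term is zero at mean age). This is an illustration of the method, not the code used to produce the figures above; because the simulation is random, its results should land near, not exactly on, the quoted values.

```python
import math
import random

# PQL2 estimates quoted in 14.19.
BETA0, BETA_URBAN = -0.69, 0.71
VAR_U = 0.21


def vpc_threshold(var_u):
    """Threshold model VPC: level 1 variance fixed at pi^2 / 3."""
    return var_u / (var_u + math.pi ** 2 / 3)


def vpc_simulation(linear_predictor, var_u, standard_normal_draws):
    """Simulation VPC at one chosen covariate pattern."""
    sd_u = math.sqrt(var_u)
    probs = [1 / (1 + math.exp(-(linear_predictor + sd_u * z)))
             for z in standard_normal_draws]
    m = len(probs)
    level1 = sum(p * (1 - p) for p in probs) / m           # mean of pi(1 - pi)
    mean_p = sum(probs) / m
    level2 = sum((p - mean_p) ** 2 for p in probs) / (m - 1)  # variance of pi
    return level2 / (level1 + level2)


rng = random.Random(2008)
draws = [rng.gauss(0.0, 1.0) for _ in range(5000)]   # M = 5000

urban_vpc = vpc_simulation(BETA0 + BETA_URBAN, VAR_U, draws)
rural_vpc = vpc_simulation(BETA0, VAR_U, draws)

print(f"threshold: {vpc_threshold(VAR_U):.3f}")
print(f"simulation, urban: {urban_vpc:.3f}  rural: {rural_vpc:.3f}")
```

The same random draws are reused for the urban and rural calculations so that the comparison between them is not blurred by simulation noise.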
14.21 MLwiN Gives
• UNIT (or subject) SPECIFIC estimates: the fixed effects conditional on higher level unit random effects, NOT the
• POPULATION-AVERAGE estimates: ie the marginal expectation of the dependent variable across the population, "averaged" across the random effects
• In non-linear models these are different, and the PA estimates will generally be smaller than the US estimates, especially as the size of the random effects grows
• Can derive PA from US but not vice versa (next version will give both)
14.22 Unit specific / Population average
• Probability of adverse reaction against dose
• Left: subject-specific curves; big differences between subjects for middle doses (the between-patient variance is large)
• Right: the population average dose response curve
• Subject-specific curves have a steeper slope in the middle range of the dose variable
15.0 Multilevel Multinomial Models
Logistic models handle the situation where we have a binary
response(two response categories eg alive/dead or pass/fail.)
Where we have a response variable with more than two
categories we use multinomial models.
Two types of multinomial response:
Unordered – eg voting preference (lab, tory, libdem, other) or cause of death.
Ordered – attitude scales (strongly disagree ... strongly agree) or exam grades.
First we deal with unordered multinomial responses.
15.1 Extending a binary to a multinomial model
Take a binary variable ($y_i$) which is 1 if an individual votes tory, 0 otherwise.
The underlying probability of individual $i$ voting tory is $\pi_i$.
We model the log odds of voting tory as a function of explanatory variables:
$$\log[\pi_i / (1 - \pi_i)] = \beta_0 + \beta_1 x_{1i} + \dots \qquad (1)$$
Let’s call $\pi_i = \pi_{1i}$ the probability of individual $i$ voting tory, and $\pi_{2i} = (1 - \pi_{1i})$ the probability of individual $i$ not voting tory.
We can now write (1) as
$$\log[\pi_{1i} / \pi_{2i}] = \beta_0 + \beta_1 x_{1i} + \dots$$
15.2 Moving to more than two response categories
Suppose now that $y_i$ can take three values {1,2,3}: vote tory, vote labour, vote lib dem. Now
$\pi_{1i}$ is the probability that individual $i$ votes tory
$\pi_{2i}$ is the probability that individual $i$ votes labour
$\pi_{3i}$ is the probability that individual $i$ votes lib dem
Now we must choose a reference category, say vote lib dem, and model the log odds of all remaining categories against the reference category. Therefore with $t$ categories we need $t-1$ equations to model this set of log odds. In our case
$$\log[\pi_{1i} / \pi_{3i}] = \beta_0 + \beta_1 x_{1i} + \dots$$
$$\log[\pi_{2i} / \pi_{3i}] = \beta_2 + \beta_3 x_{1i} + \dots$$
15.3 Notation
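The MLwiN software uses the notation
$$\log[\pi_{1i} / \pi_{3i}] = \beta_0 + \beta_1 x_{1i} + \dots$$
$$\log[\pi_{2i} / \pi_{3i}] = \beta_2 + \beta_3 x_{1i} + \dots$$
Often in papers you will see the more succinct notational form
$$\log[\pi_i^{(s)} / \pi_i^{(t)}] = \beta_0^{(s)} + \beta_1^{(s)} x_{1i}, \qquad s = 1, \dots, t-1$$
which becomes, for $s = 1$,
$$\log[\pi_i^{(1)} / \pi_i^{(3)}] = \beta_0^{(1)} + \beta_1^{(1)} x_{1i} + \dots$$
and for $s = 2$,
$$\log[\pi_i^{(2)} / \pi_i^{(3)}] = \beta_0^{(2)} + \beta_1^{(2)} x_{1i} + \dots$$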
15.4 Interpretation(odds ratios)
We can interpret as with logistic regression. In the political example, {1,2,3}: vote tory, vote labour, vote lib dem.
$$\log[\pi_{1i} / \pi_{3i}] = \beta_0 + \beta_1 x_{1i} + \dots$$
$$\log[\pi_{2i} / \pi_{3i}] = \beta_2 + \beta_3 x_{1i} + \dots$$
$\beta_1$ is the change in the log odds of voting tory as opposed to lib dem for a 1 unit increase in $x_{1i}$.
$\beta_3$ is the change in the log odds of voting labour as opposed to lib dem for a 1 unit increase in $x_{1i}$.
$\exp(\beta)$ gives odds ratios.
15.5 Interpretation(probabilities)
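$$\log[\pi_{1i} / \pi_{3i}] = \beta_0 + \beta_1 x_{1i} + \dots$$
$$\log[\pi_{2i} / \pi_{3i}] = \beta_2 + \beta_3 x_{1i} + \dots$$
Probability of voting tory for individual $i$:
$$\pi_{1i} = \frac{e^{(\beta_0 + \beta_1 x_{1i})}}{1 + e^{(\beta_0 + \beta_1 x_{1i})} + e^{(\beta_2 + \beta_3 x_{1i})}}$$
Probability of voting labour for individual $i$:
$$\pi_{2i} = \frac{e^{(\beta_2 + \beta_3 x_{1i})}}{1 + e^{(\beta_0 + \beta_1 x_{1i})} + e^{(\beta_2 + \beta_3 x_{1i})}}$$
Probability of voting lib dem (the reference category):
$$\pi_{3i} = 1 - \pi_{2i} - \pi_{1i}$$
Or in general notation
$$\pi_i^{(s)} = \frac{e^{(\beta_0^{(s)} + \beta_1^{(s)} x_{1i})}}{1 + \sum_{k=1}^{t-1} e^{(\beta_0^{(k)} + \beta_1^{(k)} x_{1i})}}, \qquad \pi_i^{(t)} = 1 - \sum_{k=1}^{t-1} \pi_i^{(k)}$$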
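These formulas translate directly into code. The sketch below takes the $t-1$ linear predictors and returns all $t$ probabilities, with the reference category last; the coefficient values are hypothetical and chosen only to show the calculation.

```python
import math


def multinomial_probabilities(linear_predictors):
    """Probabilities for t categories from t-1 linear predictors.

    The last probability returned is for the reference category.
    """
    exponentials = [math.exp(value) for value in linear_predictors]
    denominator = 1.0 + sum(exponentials)
    probabilities = [e / denominator for e in exponentials]
    probabilities.append(1.0 - sum(probabilities))   # reference category
    return probabilities


# Hypothetical coefficients for the tory and labour equations.
B0, B1 = 0.4, -0.3    # log odds of tory versus lib dem
B2, B3 = 0.1, 0.5     # log odds of labour versus lib dem
x1 = 1.0

p_tory, p_labour, p_libdem = multinomial_probabilities([B0 + B1 * x1, B2 + B3 * x1])
print(f"tory={p_tory:.3f}  labour={p_labour:.3f}  lib dem={p_libdem:.3f}")
```

Taking the log of `p_tory / p_libdem` recovers the first linear predictor, which is a quick check that the probabilities are consistent with the log odds equations.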
15.6 Multilevel Multinomial models
Suppose the individuals in the voting example are clustered into constituencies and we wish to include constituency effects in our model. We include intercept level residuals for each log odds equation in our model:
$$\log[\pi_{1ij} / \pi_{3ij}] = \beta_0 + \beta_1 x_{1ij} + u_{0j}$$
$$\log[\pi_{2ij} / \pi_{3ij}] = \beta_2 + \beta_3 x_{1ij} + u_{2j}$$
$u_{0j}$ is the effect of constituency $j$ on the log odds of voting tory as opposed to lib dem. So if $u_{0j}$ is 1, the log odds of voting tory as opposed to lib dem increase by 1 compared to a constituency where $u_{0j} = 0$ (that is, the average constituency).
Likewise $u_{2j}$ is the effect of constituency $j$ on the log odds of voting labour as opposed to lib dem.
15.7 Variance of level 2 random effects
$$\log[\pi_{1ij} / \pi_{3ij}] = \beta_0 + u_{0j} + \beta_1 x_{1ij}$$
$$\log[\pi_{2ij} / \pi_{3ij}] = \beta_2 + u_{2j} + \beta_3 x_{1ij}$$
$$\begin{pmatrix} u_{0j} \\ u_{2j} \end{pmatrix} \sim N(0, \Omega_u), \qquad \Omega_u = \begin{pmatrix} \sigma_{u0}^2 & \\ \sigma_{u02} & \sigma_{u2}^2 \end{pmatrix}$$
$\sigma_{u0}^2$ is the between constituency variance of the vote tory:lib dem log odds
$\sigma_{u2}^2$ is the between constituency variance of the vote labour:lib dem log odds
$\sigma_{u02}$ is the constituency level covariance between the tory and labour constituency level effects. A negative covariance means there is a tendency, in constituencies where labour do well as opposed to the lib dems, for the tories to do badly as opposed to the lib dems, and vice versa.
16.0 Ordered categorical data
Where there is an underlying ordering to the categories a convenient parameterisation is to work with cumulative probabilities that an individual crosses a threshold. For example, with exam grades:

Grade | Probability | Threshold | Cumulative probability
D | $\pi_{1i}$ | ≤ D | $\gamma_{1i} = \pi_{1i}$
C | $\pi_{2i}$ | ≤ C: (C,D) | $\gamma_{2i} = \pi_{1i} + \pi_{2i}$
B | $\pi_{3i}$ | ≤ B: (B,C,D) | $\gamma_{3i} = \pi_{1i} + \pi_{2i} + \pi_{3i}$
A | $\pi_{4i}$ | ≤ A: (A,B,C,D) | $\gamma_{4i} = \pi_{1i} + \pi_{2i} + \pi_{3i} + \pi_{4i} = 1$

With an ordered multinomial we work with the set of cumulative probabilities $\gamma$. As before, with $t$ categories the model has $t-1$ equations.
16.1 Writing the ordered multinomial model
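$$\log(\gamma_{1i} / (1 - \gamma_{1i})) = \beta_0 \qquad \text{log odds of} \le D$$
$$\log(\gamma_{2i} / (1 - \gamma_{2i})) = \beta_1 \qquad \text{log odds of} \le C$$
$$\log(\gamma_{3i} / (1 - \gamma_{3i})) = \beta_2 \qquad \text{log odds of} \le B$$
The threshold probabilities $\gamma_{ki}$ are given by antilogit($\beta_k$).
We must have $\beta_0 < \beta_1 < \beta_2$ to ensure $\gamma_1 < \gamma_2 < \gamma_3$.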
16.2 Adding covariates to the model
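$$\log(\gamma_{1i} / (1 - \gamma_{1i})) = \beta_0 + h_i \qquad \text{log odds of} \le D$$
$$\log(\gamma_{2i} / (1 - \gamma_{2i})) = \beta_1 + h_i \qquad \text{log odds of} \le C$$
$$\log(\gamma_{3i} / (1 - \gamma_{3i})) = \beta_2 + h_i \qquad \text{log odds of} \le B$$
$$h_i = \beta_3 x_{1i} + \dots$$
Note that the covariates $h_i$ are the same for each of the response threshold categories.
[Figure: log odds plotted against $x_i$ as three parallel lines: $\beta_2 + \beta_3 x_{1i}$ (log odds of ≤ B), $\beta_1 + \beta_3 x_{1i}$ (log odds of ≤ C) and $\beta_0 + \beta_3 x_{1i}$ (log odds of ≤ D).]
This means that the log odds ratios and odds ratios for threshold category membership are independent of the predictor variables.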
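A short sketch shows how the threshold equations turn into grade probabilities, and how a shared linear predictor $h_i$ shifts every threshold by the same amount on the logit scale. The threshold values and the value of $h_i$ are hypothetical.

```python
import math


def antilogit(value):
    """Inverse of the logit transformation."""
    return 1 / (1 + math.exp(-value))


def grade_probabilities(thresholds, h=0.0):
    """Cumulative and category probabilities for grades D, C, B, A.

    thresholds: increasing values for the <=D, <=C and <=B equations.
    h: linear predictor shared by all threshold equations.
    """
    if any(a >= b for a, b in zip(thresholds, thresholds[1:])):
        raise ValueError("thresholds must be strictly increasing")
    cumulative = [antilogit(b + h) for b in thresholds] + [1.0]
    # Category probabilities are differences of adjacent cumulative ones.
    categories = [cumulative[0]] + [hi - lo for lo, hi
                                    in zip(cumulative, cumulative[1:])]
    return cumulative, categories


THRESHOLDS = [-1.5, 0.0, 1.5]   # hypothetical beta_0 < beta_1 < beta_2

cum_base, cat_base = grade_probabilities(THRESHOLDS, h=0.0)
cum_shift, cat_shift = grade_probabilities(THRESHOLDS, h=0.8)

for grade, p0, p1 in zip("DCBA", cat_base, cat_shift):
    print(f"grade {grade}: h=0 -> {p0:.3f}   h=0.8 -> {p1:.3f}")
```

A positive $h_i$ raises every cumulative probability, so in this parameterisation it moves probability towards the lower grades.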
16.3 Proportional odds models
So far we have assumed that the odds ratios of response category membership remain constant with respect to predictor variables. This is known as the proportional odds assumption.
We can test the assumption that the odds ratios of response category membership are independent of predictor variables by fitting:
$$\log(\gamma_{1i} / (1 - \gamma_{1i})) = \beta_0 + \beta_3 x_{1i} \qquad \text{log odds of} \le D$$
$$\log(\gamma_{2i} / (1 - \gamma_{2i})) = \beta_1 + \beta_4 x_{1i} \qquad \text{log odds of} \le C$$
$$\log(\gamma_{3i} / (1 - \gamma_{3i})) = \beta_2 + \beta_5 x_{1i} \qquad \text{log odds of} \le B$$
Now if our assumptions are correct $\beta_3, \beta_4, \beta_5$ will be very similar. We can formally test $\beta_3 = \beta_4 = \beta_5$ using the intervals and tests window.
16.4 Multilevel ordered multinomial models
log(g1i/(1g1i)0hi
log odds of  D
log(g2i/(1g2i)1hi
log odds of  C
log(g3i/(1g3i)2hi
log odds of  B
hi= 3x1i+u0j
Log
odds
u0j is a random effect for school j,
which shifts all the threshold
probabilities equally for all kids in
school j. Again odds ratios for
category membership are
independent of u0j
k+ 3x1i
k+ 3x1i+ u0j for +ve u0j
k+ 3x1i+ u0j for -ve u0j
xi
16.5 Higher level variances
$u_{0j} \sim N(0, \sigma^2_{u0})$
The greater $\sigma^2_{u0}$, the greater the variability in the school
level shifts in the response threshold probabilities.
17 Non-hierarchical multilevel models
Two types :
•Cross-classified models
•Multiple membership models
17.01 Cross-classification
For example, hospitals by neighbourhoods. Hospitals will draw patients
from many different neighbourhoods and the inhabitants of a
neighbourhood will go to many hospitals. No pure hierarchy can be found
and patients are said to be contained within a cross-classification of
hospitals by neighbourhoods :
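[Table: cross-classification of patients (each x is one patient) by hospital
(rows, hospital 1 to hospital 4) and neighbourhood (columns, nbhd 1 to nbhd 3).
Non-empty cells by row: hospital 1: xx, x; hospital 2: x, x; hospital 3: xx, x;
hospital 4: x, xxx]
[Unit diagram: patients P1 to P12, each linked to one of hospitals H1 to H4
and one of neighbourhoods N1 to N3]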
17.02 Other examples of cross-classifications
• pupils within primary schools by secondary schools
• patients within GPs by hospitals
• interviewees within interviewers by surveys
• repeated measures within raters by individuals (e.g. patients by nurses)
17.03 Notation
With hierarchical models we have subscript notation that has one
subscript per level, and nesting is implied reading from the left. For
example, subscript pattern ijk denotes the i’th level 1 unit within the
j’th level 2 unit within the k’th level 3 unit.
If models become cross-classified we use the term classification
instead of level. With notation that has one subscript per
classification, that captures the relationship between classifications,
notation can become very cumbersome. We propose an alternative
notation that only has a single subscript no matter how many
classifications are in the model.
17.04 Single subscript notation
[Unit diagram: patients P1 to P12, each linked to one of hospitals H1 to H4
and one of neighbourhoods N1 to N3]
i         1   2   3   4   5   6   7   8   9   10  11  12
nbhd(i)   1   2   1   2   1   2   2   3   3   2   3   3
hosp(i)   1   1   1   2   2   2   3   3   4   4   4   4
We write the model as
$y_i = \beta_0 + u^{(2)}_{nbhd(i)} + u^{(3)}_{hosp(i)} + e_i$    (1)
where classification 2 is nbhd and classification 3 is
hospital. Classification 1 always corresponds to the
classification at which the response measurements are
made, in this case patients. For patients 1 and 11 equation
(1) becomes:
$y_1 = \beta_0 + u^{(2)}_1 + u^{(3)}_1 + e_1$
$y_{11} = \beta_0 + u^{(2)}_3 + u^{(3)}_4 + e_{11}$
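The single subscript notation amounts to a pair of lookup tables indexed by patient. A minimal sketch, using the nbhd(i) and hosp(i) values from 17.04 and hypothetical values for the random effects, is:

```python
# Lookup rows from 17.04 (patients 1 to 12).
nbhd = [1, 2, 1, 2, 1, 2, 2, 3, 3, 2, 3, 3]
hosp = [1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 4]

def random_effect_labels(i):
    """Subscripts of the nbhd and hospital effects for patient i (1-based)."""
    return nbhd[i - 1], hosp[i - 1]

def predicted_part(i, beta0, u_nbhd, u_hosp):
    """beta0 + u(2)_nbhd(i) + u(3)_hosp(i): equation (1) without e_i."""
    n, h = random_effect_labels(i)
    return beta0 + u_nbhd[n] + u_hosp[h]

# Hypothetical effect values keyed by unit label (assumptions, not estimates).
u_nbhd = {1: 0.3, 2: -0.1, 3: -0.2}
u_hosp = {1: 0.5, 2: 0.0, 3: -0.4, 4: -0.1}

print(random_effect_labels(1))    # (1, 1)
print(random_effect_labels(11))   # (3, 4)
print(predicted_part(11, 1.0, u_nbhd, u_hosp))
```

However many classifications are added, only one more lookup list is needed; the subscript on y stays a single i.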
17.05 Classification diagrams
In the single subscript notation we lose information about the
relationship (crossed or nested) between classifications. A useful way of
conveying this information is with the classification diagram, which has one
node per classification: nodes linked by arrows have a nested relationship
and unlinked nodes have a crossed relationship.
[Classification diagram: Neighbourhood, Hospital and Patient linked in a
single chain. Nested structure where hospitals are contained within
neighbourhoods.]
[Classification diagram: Patient linked separately to Hospital and to
Neighbourhood. Cross-classified structure where patients from a hospital come
from many neighbourhoods and people from a neighbourhood attend several
hospitals.]
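Whether two classifications are nested or crossed can be checked directly from the data. The sketch below is one possible check, not an MLwiN facility: a classification is treated as nested in another if each of its units occurs with exactly one unit of the other. It uses the patient lookups from 17.04.

```python
from collections import defaultdict

def is_nested(lower, higher):
    """True if every unit of `lower` appears with exactly one unit of `higher`."""
    parents = defaultdict(set)
    for lo, hi in zip(lower, higher):
        parents[lo].add(hi)
    return all(len(units) == 1 for units in parents.values())

patient = list(range(1, 13))
nbhd = [1, 2, 1, 2, 1, 2, 2, 3, 3, 2, 3, 3]
hosp = [1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 4]

print(is_nested(patient, hosp))   # True: each patient has one hospital
print(is_nested(patient, nbhd))   # True: each patient has one neighbourhood
print(is_nested(hosp, nbhd))      # False: hospitals draw from several nbhds
print(is_nested(nbhd, hosp))      # False: nbhds send patients to several hospitals
```

When neither classification is nested in the other, as for hospital and neighbourhood here, the two are crossed.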
17.06 Data example : Artificial insemination by donor
1901 women
279 donors
1328 donations
12100 ovulatory cycles
response is whether conception occurs in a given cycle
In terms of a unit diagram:
[Unit diagram: cycles (c1 c2 c3 c4…) grouped under women (w1, w2, w3), and
donations (d1, d2, d3…) grouped under donors (m1, m2, m3)]
Or a classification diagram:
[Classification diagram: nodes Donor, Donation, Woman and Cycle]
17.07 Model for artificial insemination data
We can write the model as
$y_i \sim \mathrm{Binomial}(1, \pi_i)$
$\mathrm{logit}(\pi_i) = (X\beta)_i + u^{(2)}_{woman(i)} + u^{(3)}_{donation(i)} + u^{(4)}_{donor(i)}$
$u^{(2)}_{woman(i)} \sim N(0, \sigma^2_{u(2)})$
$u^{(3)}_{donation(i)} \sim N(0, \sigma^2_{u(3)})$
$u^{(4)}_{donor(i)} \sim N(0, \sigma^2_{u(4)})$
Results:
Parameter           Description              Estimate(se)
$\beta_0$           intercept                -4.04(2.30)
$\beta_1$           azoospermia *            0.22(0.11)
$\beta_2$           semen quality            0.19(0.03)
$\beta_3$           womens age>35            -0.30(0.14)
$\beta_4$           sperm count              0.20(0.07)
$\beta_5$           sperm motility           0.02(0.06)
$\beta_6$           insemination too early   -0.72(0.19)
$\beta_7$           insemination too late    -0.27(0.10)
$\sigma^2_{u(2)}$   women variance           1.02(0.21)
$\sigma^2_{u(3)}$   donation variance        0.644(0.21)
$\sigma^2_{u(4)}$   donor variance           0.338(0.07)
17.08 Multiple membership models
Where level 1 units are members of more than one higher
level unit. For example,
• Pupils change schools/classes and each school/class has
an effect on pupil outcomes
• Patients are seen by more than one nurse during the
course of their treatment
17.09 Notation
$y_i = (XB)_i + \sum_{j \in nurse(i)} w^{(2)}_{i,j} u^{(2)}_j + e_i$    (1)
$u^{(2)}_j \sim N(0, \sigma^2_{u(2)})$
$e_i \sim N(0, \sigma^2_e)$
Note that nurse(i) now indexes the set of nurses that treat patient i and
$w^{(2)}_{i,j}$ is a weighting factor relating patient i to nurse j.
For example, with four patients and three nurses, we may have the following
weights:
           n1(j=1)   n2(j=2)   n3(j=3)
p1(i=1)    0.5       0         0.5
p2(i=2)    1         0         0
p3(i=3)    0         0.5       0.5
p4(i=4)    0.5       0.5       0
Here patient 1 was seen by nurses 1 and 3 but not nurse 2, and so on. If
we substitute the values of $w^{(2)}_{i,j}$, i and j from the table into (1) we get
the series of equations:
$y_1 = (XB)_1 + 0.5u^{(2)}_1 + 0.5u^{(2)}_3 + e_1$
$y_2 = (XB)_2 + 1u^{(2)}_1 + e_2$
$y_3 = (XB)_3 + 0.5u^{(2)}_2 + 0.5u^{(2)}_3 + e_3$
$y_4 = (XB)_4 + 0.5u^{(2)}_1 + 0.5u^{(2)}_2 + e_4$
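The weighted sum in (1) can be sketched in a few lines. The weights below are those in the four-patient, three-nurse table; the nurse effects are hypothetical values used only to show the arithmetic.

```python
# Weights from the table: weights[i][j] relates patient i to nurse j.
# Zero weights are simply omitted.
weights = {
    1: {1: 0.5, 3: 0.5},
    2: {1: 1.0},
    3: {2: 0.5, 3: 0.5},
    4: {1: 0.5, 2: 0.5},
}

def nurse_contribution(i, u):
    """Sum over the nurses treating patient i of weight times nurse effect."""
    return sum(w * u[j] for j, w in weights[i].items())

# Hypothetical nurse effects u_j (assumption, not estimates).
u = {1: 2.0, 2: -1.0, 3: 0.5}

for i in sorted(weights):
    total_weight = sum(weights[i].values())
    print(i, total_weight, nurse_contribution(i, u))
```

In this example the weights for each patient sum to 1, so a patient seen by a single nurse receives that nurse's full effect.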
17.10 Classification diagrams for multiple membership relationships
Double arrows indicate a multiple membership relationship between
classifications.
[Classification diagram: patient joined to nurse by a double arrow]
We can mix multiple membership, crossed and hierarchical structures in a
single model.
[Classification diagram: nodes hospital, nurse, GP practice and patient]
Here patients are multiple members of nurses, nurses are nested within
hospitals and GP practice is crossed with both nurse and hospital.
17.11 Example involving nesting, crossing and multiple membership – Danish chickens
Production hierarchy
10,127 child flocks
725 houses
304 farms
Breeding hierarchy
10,127 child flocks
200 parent flocks
As a unit diagram:
[Unit diagram: farms (f1, f2…) containing houses (h1, h2), which contain
child flocks (c1 c2 c3…); child flocks are also linked to parent flocks
(p1, p2, p3, p4, p5…)]
As a classification diagram:
[Classification diagram: nodes Farm, House, Parent flock and Child flock]
17.12 Model and results
$y_i \sim \mathrm{Binomial}(1, \pi_i)$
$\mathrm{logit}(\pi_i) = (XB)_i + \sum_{j \in p.flock(i)} w^{(2)}_{i,j} u^{(2)}_j + u^{(3)}_{house(i)} + u^{(4)}_{farm(i)}$
$u^{(2)}_j \sim N(0, \sigma^2_{u(2)})$
$u^{(3)}_{house(i)} \sim N(0, \sigma^2_{u(3)})$
$u^{(4)}_{farm(i)} \sim N(0, \sigma^2_{u(4)})$
Results:
Parameter           Description             Estimate(se)
$\beta_0$           intercept               -2.322(0.213)
$\beta_1$           1996                    -1.239(0.162)
$\beta_2$           1997                    -1.165(0.187)
$\beta_3$           hatchery 2              -1.733(0.255)
$\beta_4$           hatchery 3              -0.211(0.252)
$\beta_5$           hatchery 4              -1.062(0.388)
$\sigma^2_{u(2)}$   parent flock variance   0.895(0.179)
$\sigma^2_{u(3)}$   house variance          0.208(0.108)
$\sigma^2_{u(4)}$   farm variance           0.927(0.197)
17.13 Alspac data
All the children born in the Avon area in 1990 followed up
longitudinally
Many measurements made including educational
attainment measures
Children span 3 school year cohorts(say 1994,1995,1996)
Suppose we wish to model development of numeracy over
the schooling period. We may have the following attainment
measures on a child :
[Timeline: measurements m1 to m8 over the schooling period, the earlier ones
taken in primary school and the later ones in secondary school]
17.14 Structure for primary schools
[Classification diagram: nodes Primary school, Area, P School Cohort, Pupil,
P. Teacher and M. Occasion]
• Measurement occasions within pupils
• At each occasion there may be a different teacher
• Pupils are nested within primary school cohorts
• All this structure is nested within primary school
• Pupils are nested within residential areas
17.15 A mixture of nested and crossed relationships
[Classification diagram: nodes Primary school, P School Cohort, Area, Pupil,
P. Teacher and M. occasions]
Nodes directly connected by a single arrow are nested, otherwise nodes are
cross-classified. For example, measurement occasions are nested within pupils.
However, cohorts are cross-classified with primary teachers, that is teachers
teach more than one cohort and a cohort is taught by more than one teacher.
            T1   T2   T3
Cohort 1    95   96   97
Cohort 2    96   97   98
Cohort 3    98   99   00
17.16 Multiple membership
It is reasonable to suppose the attainment of a child in a particular year is
influenced not only by the current teacher, but also by teachers in previous
years. That is, measurement occasions are “multiple members” of teachers.
[Diagram: measurement occasions m1 to m4 paired with teachers t1 to t4]
We represent this in the classification diagram by using a double arrow.
[Classification diagram: nodes Primary school, Area, P School Cohort, Pupil,
M. occasions and P. Teacher, with a double arrow between M. occasions and
P. Teacher]
17.17 What happens if pupils move area?
[Classification diagram without pupils moving residential areas: nodes
Primary school, Area, P School Cohort, P. Teacher, Pupil and M. occasions]
If pupils move area, then pupils are no longer nested within areas. Pupils and
areas are cross-classified. Also it is reasonable to suppose that pupils'
measured attainments are affected by the areas they have previously lived in.
So measurement occasions are multiple members of areas.
[Classification diagram where pupils move between residential areas: nodes
Primary school, P School Cohort, Area, P. Teacher, Pupil and M. occasions]
BUT…
17.18 If pupils move area they will also move schools
[Classification diagram where pupils move between areas but not schools:
nodes Primary school, P School Cohort, Area, P. Teacher, Pupil and
M. occasions]
If pupils move schools they are no longer nested within primary school or
primary school cohort. Also we can expect, for the mobile pupils, both their
previous and current cohort and school to affect measured attainments.
[Classification diagram where pupils move between schools and areas: nodes
Primary school, Area, P School Cohort, Pupil, M. occasions and P. Teacher]
17.19 If pupils move area they will also move schools cont'd
And secondary schools…
[Classification diagram: nodes Primary school, Area, P School Cohort, Pupil,
P. Teacher and M. occasions]
We could also extend the above model to take account of secondary school,
secondary school cohort and secondary school teachers.
17.20 Other predictor variables
Remember we are partitioning the variability in attainment over time
between primary school, residential area, pupil, p. school cohort,
teacher and occasion. We also have predictor variables for these
classifications, eg pupil social class, teacher training, school budget and
so on. We can introduce these predictor variables to see to what extent
they explain the partitioned variability.
18 Significance testing and model comparison
• Individual fixed part and random coefficients at each
level
• Simultaneous and complex comparisons
• Comparing nested models: likelihood ratio test
• Use of Deviance Information Criteria
18.1 Individual coefficients
• Akin to t tests in regression models
• Either specific fixed effect or specific variance-covariance component
– H0: $\beta_1$ is 0; H1: $\beta_1$ is not 0
– H0: $\sigma^2_{u0}$ is 0; H1: $\sigma^2_{u0}$ is not 0
• Procedure: divide the estimated coefficient by its standard error
– Judge against a z distribution
– If the absolute ratio exceeds 1.96 then significant at 0.05 level
• Approximate procedure; asymptotic test, small sample properties not
well known.
• OK for fixed part coefficients but not for random (typically small numbers
of higher-level units; variance distribution is likely to have + skew)
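The procedure for a single fixed coefficient can be sketched as follows. The two estimates are the semen quality and sperm motility rows of the results table in 17.07; the function name is illustrative and the two-sided p-value uses the normal approximation described above.

```python
import math

def wald_z_test(estimate, se):
    """Ratio of estimate to standard error and its two-sided normal p-value."""
    z = estimate / se
    p = math.erfc(abs(z) / math.sqrt(2.0))  # P(|Z| > |z|) for standard normal Z
    return z, p

# Fixed part estimates (se) taken from the table in 17.07.
z_quality, p_quality = wald_z_test(0.19, 0.03)    # semen quality
z_motility, p_motility = wald_z_test(0.02, 0.06)  # sperm motility

print(f"semen quality:  z={z_quality:.2f}, p={p_quality:.2g}")
print(f"sperm motility: z={z_motility:.2f}, p={p_motility:.2g}")
```

The first ratio is well beyond 1.96 and the second well inside it, so only the first coefficient would be judged significant at the 0.05 level.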
18.2 Simultaneous/complex comparisons & recommended for random part testing
• Example: Testing H0: $\beta_2 - \beta_3 = 0$ AND $\beta_3 = 5$
• H0: $[C][\beta] = [k]$
• [C] is the contrast matrix (p by q) specifying the nature of the hypothesis
(q is the number of parameters in the model; p is the number of simultaneous
tests)
FILL contrast matrix with
1 if parameter involved
-1 if involved as a difference
0 if not involved
• $[\beta]$ is a vector of parameters (fixed or random) of length q
• [k] is a vector of values that the parameters are contrasted against
(usually the null); these have to be set
• For this example:
– q = 4 (intercept and 3 slope terms)
– p = 2 (2 sets of tests)
$$[C][\beta] = [k]: \quad \begin{pmatrix} 0 & 0 & 1 & -1 \\ 0 & 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \beta_3 \end{pmatrix} = \begin{pmatrix} 0 \\ 5 \end{pmatrix}$$
• Overall test against chi square with p degrees of freedom
• Output
– Result of the contrast
– Chi-square statistic for each test separately
– Chi-square statistic for overall test; all contrasts simultaneously
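To see how the pieces fit together, the sketch below evaluates the two contrasts of the example and an overall chi-square statistic. It assumes the usual Wald form, (Cβ − k)' [C V C']⁻¹ (Cβ − k), where V is the estimated covariance matrix of the parameters; the handout does not state the formula MLwiN uses, and the estimates and covariance matrix here are hypothetical.

```python
import math

def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def wald_two_contrasts(C, beta, V, k):
    """Contrast results and overall Wald chi-square for p = 2 contrasts."""
    # Result of each contrast: C beta - k.
    diff = [sum(c * b for c, b in zip(row, beta)) - kv for row, kv in zip(C, k)]
    # Covariance matrix of the two contrasts: C V C' (2 by 2).
    S = matmul(matmul(C, V), transpose(C))
    det = S[0][0] * S[1][1] - S[0][1] * S[1][0]
    inv = [[S[1][1] / det, -S[0][1] / det],
           [-S[1][0] / det, S[0][0] / det]]
    stat = sum(diff[r] * inv[r][c] * diff[c] for r in range(2) for c in range(2))
    p_value = math.exp(-stat / 2.0)  # chi-square upper tail, exact for 2 df
    return diff, stat, p_value

C = [[0, 0, 1, -1],   # beta_2 - beta_3 = 0
     [0, 0, 0, 1]]    # beta_3 = 5
k = [0, 5]

# Hypothetical estimates and (diagonal) covariance matrix, for illustration.
beta_hat = [1.0, 0.5, 4.0, 4.5]
V = [[0.04, 0, 0, 0],
     [0, 0.01, 0, 0],
     [0, 0, 0.25, 0],
     [0, 0, 0, 0.25]]

diff, stat, p_value = wald_two_contrasts(C, beta_hat, V, k)
print(diff, stat, p_value)
```

The explicit 2 by 2 inverse keeps the sketch free of external libraries; with more than two simultaneous tests a general matrix inverse would be needed.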
Testing in fixed part
1 slope for Standlrt
2 BoySch from mixed
3 GirlSch from mixed
4 BoySch from GirlSch
Model > Intervals & tests > Fixed coefficients; 4 tests
Basic Statistics > Tail Areas Chi square;
CPRObability 1.586 1
0.20790
Testing in random part
1 school variance
2 difference between school and student variance
Model > Intervals & tests > Random coefficients; 2 tests
Basic Statistics > Tail Areas Chi square;
CPRObability 25.019 1
5.6768e-007
18.6 Do we need a quadratic variance function at level 2?
->CPRObability 32.126 3
4.9230e-007
Benchmarks:
CPRO 4 1
0.046
CPRO 6 2
0.050
CPRO 8 3
0.046
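The tail areas returned by CPRObability can be reproduced outside MLwiN. The sketch below is a standard-library Python stand-in, not the MLwiN implementation: it computes the upper tail of a chi-square distribution with integer degrees of freedom from the closed forms for 1 and 2 df plus a recurrence.

```python
import math

def chi2_upper_tail(x, df):
    """P(chi-square with integer df exceeds x), for x >= 0 and df >= 1."""
    half = x / 2.0
    if df % 2 == 1:
        tail, k = math.erfc(math.sqrt(half)), 1   # closed form for 1 df
    else:
        tail, k = math.exp(-half), 2              # closed form for 2 df
    # Recurrence: Q(x; k+2) = Q(x; k) + (x/2)^(k/2) exp(-x/2) / Gamma(k/2 + 1)
    while k < df:
        tail += half ** (k / 2.0) * math.exp(-half) / math.gamma(k / 2.0 + 1.0)
        k += 2
    return tail

# Statistic and df pairs quoted in the handout.
for stat, df in [(4, 1), (6, 2), (8, 3), (32.126, 3), (1.586, 1)]:
    print(f"CPRO {stat} {df}: {chi2_upper_tail(stat, df):.5g}")
```

The three benchmark pairs all sit close to the 0.05 level, which is what makes them useful reference points for 1, 2 and 3 degrees of freedom.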
18.7 Comparing nested models: likelihood ratio test
• Akin to F tests in regression models, i.e., is a more complex model a
significantly better fit to the data; or is a simpler model a
significantly worse fit
• Procedure:
– Calculate the difference in the deviance of the two models
– Calculate the change in complexity as the difference in the number of
parameters between models
– Compare the difference in deviance with a chi-square distribution with
df = difference in number of parameters
• Example: tutorial data
do we get a significant improvement in the fit if we move from a constant
variance function for schools to a quadratic involving Standlrt?
-2*log(lh) is 9305.78: quadratic
-2*log(lh) is 9349.42: constant
->calc b3 = b2-b1
43.644
->cpro 43.410 2
3.7466e-010
NB the constant variance model is a significantly worse fit; ie need quadratic
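The three steps of the procedure can be sketched with the two deviances quoted for the tutorial data. The variable names are illustrative; the 2 degrees of freedom follow the cpro command above, and the closed form exp(-x/2) is the exact chi-square upper tail for 2 df.

```python
import math

# -2*log(likelihood) values quoted for the tutorial data.
deviance_constant = 9349.42    # constant school variance
deviance_quadratic = 9305.78   # quadratic variance function in Standlrt

# Step 1: difference in deviance between the simpler and more complex model.
lr_statistic = deviance_constant - deviance_quadratic

# Step 2: difference in number of parameters (df used in the handout).
df = 2

# Step 3: chi-square tail area; for df = 2 this is exactly exp(-x/2).
p_value = math.exp(-lr_statistic / 2.0)

print(f"change in deviance = {lr_statistic:.2f} on {df} df, p = {p_value:.3g}")
```

Because the deviances are printed to two decimal places, the difference computed here can differ slightly in the last digits from values calculated inside MLwiN at full precision.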
18.9 Deviance Information Criterion
• Diagnostic for model comparison
• Goodness of fit criterion that is penalized for model complexity
• Generalization of the Akaike Information Criterion (AIC; where df is known)
• Used for comparing non-nested models (eg same number but different variables)
• Valuable in MLwiN for testing improved goodness of fit of non-linear model
(eg Logit) because Likelihood (and hence Deviance) is incorrect
• Estimated by MCMC sampling; on output get
Bayesian Deviance Information Criterion (DIC)
Dbar       D(thetabar)   pD     DIC
9763.54    9760.51       3.02   9766.56
Dbar: the average deviance from the complete set of iterations
D(thetaBar): the deviance at the expected value of the unknown parameters
pD: the estimated degrees of freedom consumed in the fit, ie Dbar - D(thetaBar)
DIC: Fit + Complexity; Dbar + pD
NB lower values = better parsimonious model
• Somewhat controversial!
Spiegelhalter, D.J., Best, N.G., Carlin, B.P. and van der Linde, A. (2002).
Bayesian measures of model complexity and fit. Journal of the Royal
Statistical Society, Series B 64: 583-640.
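The relationship between the four quantities in the DIC output can be checked by hand. The sketch below recomputes pD and DIC from the printed Dbar and D(thetabar); the function name is illustrative, and small differences in the last digit from the printed pD and DIC are to be expected because the inputs are themselves rounded.

```python
def dic_summary(dbar, d_thetabar):
    """Return (pD, DIC) from the average deviance and the deviance at the mean."""
    p_d = dbar - d_thetabar   # effective number of parameters
    dic = dbar + p_d          # fit plus complexity
    return p_d, dic

# Values printed in the MCMC output above.
p_d, dic = dic_summary(9763.54, 9760.51)
print(f"pD = {p_d:.2f}, DIC = {dic:.2f}")
```

When comparing two models, the same calculation is applied to each and the model with the lower DIC is preferred.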
18.10 Some guidance
• any decrease in DIC suggests a better model
• But stochastic nature of MCMC; so, with small difference in DIC you should confirm if
this is a real difference by checking the results with different seeds and/or starting values.
More experience with AIC, and common rules of thumb………
18.11 Example: Tutorial dataset example
Model 1: NULL model: a constant and level 1 variance
Model 2: additionally include slope for Standlrt
Model 3: 65 fixed school effects (64 dummies and constant)
Model 4: school as random effects
Model 5: 65 fixed school intercepts and slopes
Model 6: random slopes model; quadratic variance function
Best = Model 6
Note: random models (4 & 6) have more nominal parameters
than their fixed equivalents but fewer effective parameters and a
lower DIC value (due to distributional assumptions)