
PML Project

BAYESIAN HIERARCHICAL
MODELLING
Theory and an application
Index:
Bayesian paradigm
Introduction to grouped data
No-pooling scenario
Complete pooling scenario
Partial pooling scenario (Hierarchical Modelling)
Sources of Variability
Application to BES dataset
Bayesian Paradigm
⦁ A statistical analysis that answers research questions about the unknown parameters of statistical models using probability statements.
⦁ It rests on the assumption that all model parameters are random quantities and can therefore incorporate prior knowledge.
⦁ It follows a simple rule of probability, Bayes' rule, which provides a formalism for combining prior information with the evidence from the data at hand; these two components are combined to form the so-called "posterior distribution" of the model parameters.
⦁ The posterior distribution results from updating the prior knowledge about the model parameters with the evidence from the observed data.
Plugging the information contained in the posterior distribution into the predictive distribution, we obtain the final model: the posterior predictive distribution.
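In symbols (generic notation, since the slides' formulas are not reproduced here: θ denotes the model parameters, y the observed data, and ỹ a future observation):

π(θ | y) = f(y | θ) π(θ) / ∫ f(y | θ) π(θ) dθ ∝ f(y | θ) π(θ)   (Bayes' rule: posterior ∝ likelihood × prior)

f(ỹ | y) = ∫ f(ỹ | θ) π(θ | y) dθ   (posterior predictive distribution)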
Introduction to grouped data
Typically, when we model a dataset, we make an underlying assumption about
the source of the data: observations are assumed to be independently and
identically distributed (i.i.d.), following a single distribution with one or
more unknown parameters.
In many situations, treating the observations as i.i.d. may not be sensible,
because this hypothesis can be too restrictive. Indeed, in many applications
some observations share characteristics (such as age, ethnicity, etc.) that
distinguish them from other observations, so that multiple distinct groups are observed.
Example:
As an example, consider a study in which students'
scores on a standardized test are collected from five
different high schools in a given year. Suppose a
researcher is interested in learning about the mean
score of the test. Since five different schools
participated in this study and students’ scores might
vary from school to school, it makes sense for the
researcher to learn about the mean score for each
school and compare students’ mean performance
across schools.
To start modelling these education data, it would be
inappropriate to use Yi as the random variable for the
score of the i-th student (i = 1, …, n, where n is the
total number of students from all five schools), since
this ignores the inherent grouping of the observations.
Example:
Instead, the researcher adds a school label j to Yi to reflect the grouping.
Let Yij denote the score of student i in school j, where j = 1, …, 5 and
i = 1, …, nj, with nj the number of students in school j.
Since the scores are continuous, the Gaussian model is a reasonable choice
for the data distribution.
Within school j, one assumes that the scores are i.i.d. from a Gaussian data
model with mean and standard deviation depending on the school.
Specifically, one assumes a school-specific mean μj and a school-specific standard deviation σj for the Gaussian data model of school j.
Combining the information for the five schools, one has:
Yij | μj, σj ~ Normal(μj, σj),   i = 1, …, nj,   j = 1, …, 5.
No pooling scenario
One approach for handling this group estimation problem is to find separate estimates for each group (we will call
this approach “no pooling”).
So let us:

⦁ focus on the observations in group j: Yij | μj, σj ~ Normal(μj, σj), i = 1, …, nj;

⦁ choose a prior distribution π(μj, σj);

⦁ follow the Bayesian paradigm and make inference on the posterior π(μj, σj | y1j, …, ynj j).

If we assume that the prior distributions on the individual parameters for the groups are independent, we are essentially fitting J separate Bayesian models. Inferences about one particular group will be independent of the inferences on the remaining groups. This "no pooling" approach may be reasonable, especially if we think the parameters (μj, σj) of the j-th model are completely unrelated to those of the other models.
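A minimal sketch of the no-pooling approach, assuming PyMC and simulated scores (the slides do not include code, so the library and data are illustrative assumptions): one independent Gaussian model is fitted per school.

import numpy as np
import pymc as pm

# hypothetical data: 25 simulated test scores for each of J = 5 schools
rng = np.random.default_rng(0)
scores = [rng.normal(70 + 2 * j, 8, size=25) for j in range(5)]

# "no pooling": fit J separate, independent Bayesian models
posteriors = []
for y in scores:
    with pm.Model():
        mu = pm.Normal("mu", mu=70, sigma=20)      # prior on the school mean
        sigma = pm.HalfNormal("sigma", sigma=10)   # prior on the school sd
        pm.Normal("y", mu=mu, sigma=sigma, observed=y)
        posteriors.append(pm.sample(1000, progressbar=False))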
Complete Pooling Scenario
If we assume that every individual is equivalent (i.e.
the observations are i.i.d.), then we can pool the
data, ignoring the grouping variable.
This is the so-called "Complete Pooling Approach",
and it consists of a "standard" Bayesian regression model in which all observations share the same parameters:
Yij | μ, σ ~ Normal(μ, σ) for all i and j, with a single prior π(μ, σ).
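The corresponding sketch under the same assumptions as above (PyMC, simulated scores): all observations are stacked and a single pair (μ, σ) is estimated.

import numpy as np
import pymc as pm

# hypothetical data: the same five schools, but the grouping is ignored
rng = np.random.default_rng(0)
y = np.concatenate([rng.normal(70 + 2 * j, 8, size=25) for j in range(5)])

with pm.Model():
    mu = pm.Normal("mu", mu=70, sigma=20)      # one common mean
    sigma = pm.HalfNormal("sigma", sigma=10)   # one common sd
    pm.Normal("y", mu=mu, sigma=sigma, observed=y)
    idata = pm.sample(1000, progressbar=False)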
Partial pooling scenario (Hierarchical model)
Within a population there may be subpopulations sharing some common
features. Thus, we should statistically account for membership in these distinct groups.
Hierarchical models are extensions of regression models in which the data are structured in
groups and the parameters can vary by group.
Partial pooling scenario
(Hierarchical model)
We can formalize hierarchical modelling through the following three-level structure:
⦁ Individual level: the observed Yij | θj ~ f(y | θj);
⦁ Group level: the unobserved group parameters θj | φ ~ g(θ | φ);
⦁ Heterogeneity level: the unobserved φ, depending on a hyperparameter, with hyperprior φ ~ h(φ).
Partial pooling scenario
(Hierarchical model)
To recap:
We model the outcomes in the groups through the group-specific sampling density function f(yij | θj) and the common prior distribution g(θj | φ) for the group parameters.
An important and appealing feature of this approach is that we learn simultaneously:

⦁ about each group, through the posterior of θj;

⦁ about the population of groups, through the posterior of φ.
In essence, hierarchical modelling takes into account information from multiple levels,
acknowledging differences and similarities among groups.
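A minimal sketch of the partial-pooling model for the schools example, under the same assumptions as the previous sketches (PyMC, simulated scores); the three levels of the hierarchy are marked in the comments.

import numpy as np
import pymc as pm

# hypothetical data: 25 simulated scores per school, stacked with a school index
rng = np.random.default_rng(0)
y = np.concatenate([rng.normal(70 + 2 * j, 8, size=25) for j in range(5)])
school = np.repeat(np.arange(5), 25)

with pm.Model():
    # heterogeneity level: hyperpriors on the population of school means
    mu_pop = pm.Normal("mu_pop", mu=70, sigma=20)
    tau = pm.HalfNormal("tau", sigma=10)
    # group level: school-specific means drawn from the common prior
    mu_j = pm.Normal("mu_j", mu=mu_pop, sigma=tau, shape=5)
    # individual level: within-school Gaussian data model
    sigma = pm.HalfNormal("sigma", sigma=10)
    pm.Normal("y", mu=mu_j[school], sigma=sigma, observed=y)
    idata = pm.sample(1000, progressbar=False)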
Hierarchical Linear Regression with
predictors
We can think of a generalization of linear regression where the intercepts, and possibly the slopes,
are allowed to vary by group. Applying the same principle of partial pooling to more
general models, we now consider a predictor xi.
In this case we can obtain the following models:
1) Varying intercept model
2) Varying slope model
3) Varying intercept and slope model
Hierarchical Linear Regression with predictors
1) The varying intercept model: yij = αj + β xij + εij, with αj ~ Normal(μα, σα);
2) The varying slope model: yij = α + βj xij + εij, with βj ~ Normal(μβ, σβ);
3) The varying intercept and varying slope model: yij = αj + βj xij + εij, with group-specific (αj, βj) drawn from a common prior;
where εij ~ Normal(0, σ) in all three cases.
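A minimal sketch of model 3 (varying intercept and slope), again assuming PyMC and simulated data; models 1 and 2 are recovered by fixing the slopes or the intercepts, respectively, to a single common value.

import numpy as np
import pymc as pm

# hypothetical data: predictor x, outcome y, group index g for J = 3 groups
rng = np.random.default_rng(1)
J, n = 3, 40
g = np.repeat(np.arange(J), n)
x = rng.normal(size=J * n)
y = (1.0 + 0.5 * g) + (2.0 - 0.3 * g) * x + rng.normal(0, 1, size=J * n)

with pm.Model():
    # hyperpriors for the populations of intercepts and slopes
    mu_a = pm.Normal("mu_a", 0, 5)
    sigma_a = pm.HalfNormal("sigma_a", 2)
    mu_b = pm.Normal("mu_b", 0, 5)
    sigma_b = pm.HalfNormal("sigma_b", 2)
    # group-specific intercepts and slopes
    a = pm.Normal("a", mu=mu_a, sigma=sigma_a, shape=J)
    b = pm.Normal("b", mu=mu_b, sigma=sigma_b, shape=J)
    sigma = pm.HalfNormal("sigma", 2)
    pm.Normal("y", mu=a[g] + b[g] * x, sigma=sigma, observed=y)
    idata = pm.sample(1000, progressbar=False)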
Sources of Variability
Recalling the schools' example, where we have
Yij | μj, σ ~ Normal(μj, σ) and μj | μ, τ ~ Normal(μ, τ),
we can see that there are two sources of variability among the Yij:
⦁ the standard deviation σ measures the variability of the Yij within the groups;
⦁ the standard deviation τ measures the variability between the groups.
Sources of Variability
After observing the data, we can compare σ and τ through the statistic R:
R = τ² / (τ² + σ²)
Interpretation:
⦁ R close to 1 → most of the total variability is attributed to the between-group variance;
⦁ R close to 0 → most of the total variability is attributed to the within-group variance.
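A minimal sketch of how R can be computed from posterior draws of τ and σ (the draws below are simulated stand-ins for actual MCMC output):

import numpy as np

# stand-ins for posterior draws of the between- and within-group sds
rng = np.random.default_rng(2)
tau_draws = np.abs(rng.normal(2.0, 0.3, size=4000))
sigma_draws = np.abs(rng.normal(8.0, 0.5, size=4000))

# posterior distribution of R = tau^2 / (tau^2 + sigma^2)
R_draws = tau_draws**2 / (tau_draws**2 + sigma_draws**2)
print(R_draws.mean())  # close to 0 here: most variability is within groups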
Application to BES dataset
The goal of the analysis is to understand how life expectancy is influenced by the other variables. We start our work
without considering the partitioning into geographical zones, i.e. with a completely pooled model; after that, we move to
partial pooling, the so-called hierarchical modelling.
The BES project was born in 2010 to measure "benessere equo e sostenibile" (equitable and sustainable well-being), with the goal of assessing the
progress of society in terms of economy, society and environment. The data matrix we use is composed of 110
rows × 8 columns, with variables:
Ripartizione: qualitative variable that represents the geographical partition of Italy into Nord, Centro and Mezzogiorno.
Spvita2016: continuous quantitative variable that measures life expectancy at birth in 2016.
Dipl2016: continuous quantitative variable that represents the percentage of people who completed high school in 2016.
Occ2016: employment rate in 2016.
Redd2016: average disposable income per capita in 2016.
Omic2016: number of homicides per 100,000 people in 2016.
Urb2016: extent of the historic urban fabric.
Preliminary analysis:
The histogram of Spvita (the target variable) looks
approximately normally distributed, so we performed the
Shapiro-Wilk test for normality to check whether a Bayesian
linear regression was suitable for the analysis.
We also computed the correlation matrix between the variables,
to get a preliminary idea of which covariates have the greatest
effect on the response variable.
Preprocessing: having noticed the differences in scale, it is convenient to standardize the variables before starting the
analysis.
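A minimal preprocessing sketch (pandas and the file name bes.csv are illustrative assumptions; the slides do not show code):

import pandas as pd

# hypothetical file holding the 110 x 8 BES data matrix described above
df = pd.read_csv("bes.csv")

# standardize the numeric columns; keep the grouping variable as-is
num_cols = df.columns.drop("Ripartizione")
df[num_cols] = (df[num_cols] - df[num_cols].mean()) / df[num_cols].std()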
We start our analysis with a classic Bayesian linear regression model. After some experimentation with the prior distributions for the
parameters, the number of iterations and the choice of the optimizer, the best model we obtained is the one with
Laplace priors on the slopes and the intercept (choosing the Laplace distribution instead of the Normal as the prior for the "slope" and
"intercept" parameters can affect the ability of the model to fit the data, because the
Laplace distribution is characterized by heavier tails, which can allow for more flexibility in the model).
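A minimal sketch of this pooled model with Laplace priors, assuming PyMC (the slides do not name the library) and the hypothetical bes.csv from the preprocessing step:

import pandas as pd
import pymc as pm

# standardized data, as prepared in the preprocessing sketch
df = pd.read_csv("bes.csv")  # hypothetical file name
num_cols = df.columns.drop("Ripartizione")
df[num_cols] = (df[num_cols] - df[num_cols].mean()) / df[num_cols].std()

X = df[["Dipl2016", "Occ2016", "Redd2016", "Omic2016", "Urb2016"]].to_numpy()
y = df["Spvita2016"].to_numpy()

with pm.Model():
    alpha = pm.Laplace("alpha", mu=0, b=1)                   # heavy-tailed prior on the intercept
    beta = pm.Laplace("beta", mu=0, b=1, shape=X.shape[1])   # heavy-tailed priors on the slopes
    sigma = pm.HalfNormal("sigma", sigma=1)
    pm.Normal("y", mu=alpha + pm.math.dot(X, beta), sigma=sigma, observed=y)
    idata = pm.sample(2000, progressbar=False)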
For this model we have obtained the following results:
Diagnostics: we can evaluate the overall goodness of
fit of the model with the Root Mean Square
Error (RMSE) and the R-squared.
RMSE: the average deviation between the observed
values and their predictions (the lower, the better).
R-squared: the proportion of the variance of the
dependent variable (y) that can be explained by the
independent variables (X) (the fit is good if it is
close to 1).
The residuals lie between −2 and 2 and show no
pattern.
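A minimal sketch of the two diagnostics, given the observed y and, say, the posterior-mean predictions y_hat:

import numpy as np

def rmse(y, y_hat):
    # average deviation between observations and predictions
    return np.sqrt(np.mean((y - y_hat) ** 2))

def r_squared(y, y_hat):
    # share of the variance of y explained by the model
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1 - ss_res / ss_tot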
A hierarchical Bayesian model allows one to embed a hierarchical structure in the model parameters, allowing for greater
flexibility in data modelling. Since our dataset has a geographical partition into Nord, Centro and Mezzogiorno, it is
convenient to incorporate this information into the model. We start with a simple varying intercept model
(with fixed slopes) and then we try a varying slope model.
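A minimal sketch of the varying-intercept model on the BES data, under the same assumptions as above (PyMC, hypothetical bes.csv, standardized variables):

import pandas as pd
import pymc as pm

df = pd.read_csv("bes.csv")  # hypothetical file name
num_cols = df.columns.drop("Ripartizione")
df[num_cols] = (df[num_cols] - df[num_cols].mean()) / df[num_cols].std()

zone = pd.Categorical(df["Ripartizione"]).codes   # Nord / Centro / Mezzogiorno -> 0, 1, 2
X = df[["Dipl2016", "Occ2016", "Redd2016", "Omic2016", "Urb2016"]].to_numpy()
y = df["Spvita2016"].to_numpy()

with pm.Model():
    mu_a = pm.Normal("mu_a", 0, 5)                    # hyperprior on the population of intercepts
    tau = pm.HalfNormal("tau", 2)
    a = pm.Normal("a", mu=mu_a, sigma=tau, shape=3)   # one intercept per geographical zone
    beta = pm.Normal("beta", 0, 1, shape=X.shape[1])  # slopes shared across zones (fixed slopes)
    sigma = pm.HalfNormal("sigma", 1)
    pm.Normal("y", mu=a[zone] + pm.math.dot(X, beta), sigma=sigma, observed=y)
    idata = pm.sample(2000, progressbar=False)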
These are the results for the varying intercept model:
These are the results for the varying slope model:
Conclusions:
In our application, we can conclude that the simple Bayesian linear regression performs better than the hierarchical model.
This is probably due to the data structure, in which the groups do not differ enough to justify a hierarchical procedure.
Moreover, the size of the dataset matters: as the sample grows, it becomes easier to recognize patterns in the
data, and a more complex model may perform better in those cases.