BAYESIAN HIERARCHICAL MODELLING
Theory and an application

Index:
⦁ Bayesian paradigm
⦁ Introduction to grouped data
⦁ No-pooling scenario
⦁ Complete pooling scenario
⦁ Partial pooling scenario (hierarchical modelling)
⦁ Sources of variability
⦁ Application to the BES dataset

Bayesian paradigm
⦁ A statistical analysis that answers research questions about the unknown parameters of statistical models by using probability statements.
⦁ It rests on the assumption that all model parameters are random quantities, and can therefore incorporate prior knowledge.
⦁ It follows a simple rule of probability, Bayes' rule, which provides a formalism for combining prior information with the evidence from the data at hand; these two components form the so-called posterior distribution of the model parameters, p(θ | y) ∝ p(y | θ) p(θ).
⦁ The posterior distribution results from updating the prior knowledge about the model parameters with the evidence from the observed data.
⦁ Combining the information in the posterior with the data model, we obtain the predictive distribution and hence the final model.

Introduction to grouped data
Typically, when we model a dataset, we make an underlying assumption about the source of the data: the observations are assumed to be independent and identically distributed (i.i.d.), following a single distribution with one or more unknown parameters.
In many situations, treating the observations as i.i.d. may not be sensible, because this hypothesis can be too restrictive. Indeed, in many applications some observations share characteristics (such as age, ethnicity, etc.) that distinguish them from the other observations, so that multiple distinct groups are observed.

Example: consider a study in which students' scores on a standardized test are collected from five different high schools in a given year. Suppose a researcher is interested in learning about the mean score on the test. Since five different schools participated in the study and students' scores may vary from school to school, it makes sense for the researcher to learn about the mean score of each school and to compare the students' mean performance across schools.
To start modelling these education data, it is inappropriate to use Y_i as the random variable for the score of the i-th student (i = 1, ..., n, where n is the total number of students from all five schools), since this ignores the inherent grouping of the observations.
Instead, the researcher adds a school label j to Y_i to reflect the grouping. Let Y_ij denote the score of student i in school j, where j = 1, ..., 5 and i = 1, ..., n_j, with n_j the number of students in school j. Since the scores are continuous, the Gaussian model is a reasonable choice for the data distribution. Within school j, one assumes that the scores are i.i.d. from a Gaussian data model with a mean and standard deviation depending on the school: a school-specific mean μ_j and a school-specific standard deviation σ_j. Combining the information for the five schools, one has

Y_ij | μ_j, σ_j ~ Normal(μ_j, σ_j),   i = 1, ..., n_j,   j = 1, ..., 5.

No-pooling scenario
One approach for handling this group estimation problem is to find separate estimates for each group (we will call this approach "no pooling"). For each group j we:
⦁ focus on the observations in group j, Y_1j, ..., Y_n_j j;
⦁ choose a prior distribution for the group parameters (μ_j, σ_j);
⦁ follow the Bayesian paradigm and make inference on (μ_j, σ_j) through their posterior distribution.
If we assume that the prior distributions on the individual parameters of the groups are independent, we are essentially fitting J separate Bayesian models. Inferences about one particular group are then independent of the inferences on the remaining ones. This "no pooling" approach may be reasonable, especially if we think the parameters of the j-th model are completely unrelated to those of the others.
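As a concrete illustration, here is a minimal sketch of the no-pooling approach in Python with PyMC, on simulated school scores; the data, priors and sampler settings are illustrative assumptions, not taken from the original analysis. Each school gets its own independent model, so no information is shared across groups.

```python
import numpy as np
import pymc as pm

rng = np.random.default_rng(42)
# Hypothetical scores for J = 5 schools (simulated for illustration)
scores = [rng.normal(loc=70 + 2 * j, scale=8, size=25) for j in range(5)]

# "No pooling": fit J independent Bayesian models, one per school
posteriors = []
for y_j in scores:
    with pm.Model():
        mu_j = pm.Normal("mu_j", mu=70, sigma=20)     # prior on the school mean
        sigma_j = pm.HalfNormal("sigma_j", sigma=10)  # prior on the school sd
        pm.Normal("y", mu=mu_j, sigma=sigma_j, observed=y_j)
        # Each posterior depends only on that school's data
        posteriors.append(pm.sample(1000, tune=1000, progressbar=False))
```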
Complete pooling scenario
If we assume that every individual is equivalent (i.e. that all the observations are i.i.d.), then we can pool the data, ignoring the grouping variable. This is the so-called "complete pooling" approach, and it consists of a standard Bayesian regression model fitted to all the observations at once, with a single set of parameters shared by every group.

Partial pooling scenario (hierarchical model)
Within a population there may be some subpopulations sharing common features, so we should statistically acknowledge membership in these distinct groups. Hierarchical models are extensions of regression models in which the data are structured in groups and the parameters can vary by group.
We can formalize hierarchical modelling through the following three-level structure:
§ Individual level: the observed Y_ij follow the group-specific sampling density f(y_ij | θ_j);
§ Group level: the unobserved group parameters θ_j follow a common prior distribution g(θ_j | φ);
§ Heterogeneity level: the unobserved hyperparameter φ follows a hyperprior distribution h(φ).

To recap: we model the outcomes in the groups through the group-specific sampling density function and the common prior distribution for the parameters. An important and appealing feature of this approach is that we learn simultaneously:
⦁ about each group, through the posterior distribution of its parameters θ_j;
⦁ about the population of groups, through the posterior distribution of the hyperparameter φ.
In essence, hierarchical modelling takes into account information from multiple levels, acknowledging both differences and similarities among the groups.

Hierarchical linear regression with predictors
We can think of a generalization of linear regression in which the intercepts, and possibly the slopes, are allowed to vary by group. Applying the same partial-pooling principle to more general models, we can now consider a predictor x and obtain the following models:
1) The varying intercept model: y_ij = α_j + β x_ij + ε_ij, with α_j ~ Normal(μ_α, σ_α);
2) The varying slope model: y_ij = α + β_j x_ij + ε_ij, with β_j ~ Normal(μ_β, σ_β);
3) The varying intercept and varying slope model: y_ij = α_j + β_j x_ij + ε_ij, with the pairs (α_j, β_j) drawn from a common population distribution.

Sources of variability
Recalling the schools' example, now with a common within-school standard deviation, so that Y_ij | μ_j, σ ~ Normal(μ_j, σ) and μ_j | μ, τ ~ Normal(μ, τ), we can see that there are two sources of variability among the Y_ij:
⦁ the standard deviation σ measures the variability of the observations within the groups;
⦁ the standard deviation τ measures the variability between the groups.
After observing the data, we can compare σ and τ through the statistic R:

R = τ² / (τ² + σ²).

Interpretation:
• R close to 1 --> most of the total variability is attributed to the between-group variance;
• R close to 0 --> most of the total variability is attributed to the within-group variance.
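To make the partial-pooling idea and the R statistic concrete, here is a minimal sketch of the varying intercept model in Python with PyMC, again on simulated grouped data; all names, priors and settings are illustrative assumptions. The last lines compute the posterior distribution of R = τ² / (τ² + σ²) directly from the draws of the two standard deviations.

```python
import numpy as np
import pymc as pm

rng = np.random.default_rng(0)
J, n_j = 5, 25
group = np.repeat(np.arange(J), n_j)        # group label for each observation
x = rng.normal(size=J * n_j)                # one standardized predictor
true_alpha = rng.normal(0.0, 0.5, size=J)   # simulated group intercepts
y = true_alpha[group] + 0.8 * x + rng.normal(0.0, 1.0, size=J * n_j)

with pm.Model():
    # Heterogeneity level: hyperpriors for the population of group intercepts
    mu_alpha = pm.Normal("mu_alpha", mu=0, sigma=5)
    tau = pm.HalfNormal("tau", sigma=5)     # between-group sd
    # Group level: intercepts partially pooled toward mu_alpha
    alpha = pm.Normal("alpha", mu=mu_alpha, sigma=tau, shape=J)
    # Individual level: common slope and within-group sd
    beta = pm.Normal("beta", mu=0, sigma=5)
    sigma = pm.HalfNormal("sigma", sigma=5)  # within-group sd
    pm.Normal("y", mu=alpha[group] + beta * x, sigma=sigma, observed=y)
    idata = pm.sample(1000, tune=1000, progressbar=False)

# Posterior of R = tau^2 / (tau^2 + sigma^2): the share of total
# variability attributed to between-group differences
post = idata.posterior
R = post["tau"] ** 2 / (post["tau"] ** 2 + post["sigma"] ** 2)
print("posterior mean of R:", float(R.mean()))
```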
Application to the BES dataset
The BES project was born in 2010 to measure "benessere equo e sostenibile" (fair and sustainable well-being), with the goal of assessing the progress of society with regard to the economy, society and the environment.
The goal of the analysis is to understand how life expectancy is influenced by the other variables. We start our work without considering the partitioning into geographical zones, i.e. with a completely pooled model; after that we move to partial pooling, the so-called hierarchical modelling. (A code sketch of the full workflow is given at the end of this section.)
The data matrix we use is composed of 110 rows × 8 columns, with the variables:
⦁ Ripartizione: qualitative variable representing the geographical partition of Italy into Nord, Centro and Mezzogiorno;
⦁ Spvita2016: continuous quantitative variable measuring life expectancy at birth in 2016;
⦁ Dipl2016: continuous quantitative variable representing the percentage of people who completed high school in 2016;
⦁ Occ2016: employment rate in 2016;
⦁ Redd2016: average disposable income per capita in 2016;
⦁ Omic2016: number of murders per 100,000 people in 2016;
⦁ Urb2016: consistency of the historic urban fabric.

Preliminary analysis: the histogram of Spvita (the target variable) looks like a normal distribution, so we ran the Shapiro test for normality to check whether a Bayesian linear regression was suitable for the analysis. We also computed the correlation matrix between the variables, to get a preliminary idea of which covariates have the greatest effect on the response variable.

Preprocessing: having noticed the differences in scale, it is convenient to standardize the variables before starting the analysis.

We start our analysis with a classic Bayesian linear regression model. After some analysis of the prior distributions for the parameters, the number of iterations and the choice of the optimizer, the best model we obtained is the one with a Laplace distribution for the slope and the intercept. Choosing a Laplace prior for the "slope" and "intercept" parameters instead of a Normal one can affect the ability of the model to fit the data, because the Laplace distribution is characterized by heavier tails, which allow for more flexibility in the model. For this model we obtained the following results.

Diagnostics: we can evaluate the overall goodness of fit of the model with the Root Mean Square Error (RMSE) and the R-squared.
⦁ RMSE: the average deviation between the observed values and their predictions, RMSE = sqrt((1/n) Σ_i (y_i − ŷ_i)²) (the lower, the better);
⦁ R-squared: the proportion of the variance of the dependent variable (y) that can be explained by the independent variables (X), R² = 1 − Σ_i (y_i − ŷ_i)² / Σ_i (y_i − ȳ)² (a good fit if close to 1).
The residuals lie between −2 and 2 and show no pattern.

A hierarchical Bayesian model allows us to embed a hierarchical structure in the model parameters, allowing for greater flexibility in modelling the data. Since our dataset has a geographical partition into Nord, Centro and Mezzogiorno, it is convenient to incorporate this information into the model. We start with a simple varying intercept model (with fixed slope) and then try a varying slope model. These are the results for the varying intercept model, followed by those for the varying slope model.

Conclusions: in our application, we can conclude that the simple Bayesian linear regression performs better than the hierarchical model. This is probably due to the data structure, in which the groups do not differ enough to justify a hierarchical procedure. Moreover, the size of the dataset matters: as the sample size grows, it becomes easier to recognize patterns in the data, and a more complex model may perform better in those cases.
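To tie the application together, here is a minimal sketch of the workflow described in this section, in Python with PyMC. It assumes the BES data sit in a CSV file with the columns listed above; the file name bes2016.csv is hypothetical, and the priors and sampler settings are illustrative rather than the exact ones used in the original analysis.

```python
import numpy as np
import pandas as pd
import pymc as pm

# Hypothetical file name; assumes the columns described above
bes = pd.read_csv("bes2016.csv")
predictors = ["Dipl2016", "Occ2016", "Redd2016", "Omic2016", "Urb2016"]

# Standardize response and predictors (the scales differ substantially)
X = ((bes[predictors] - bes[predictors].mean()) / bes[predictors].std()).values
y = ((bes["Spvita2016"] - bes["Spvita2016"].mean()) / bes["Spvita2016"].std()).values
zone = pd.Categorical(bes["Ripartizione"]).codes  # Nord/Centro/Mezzogiorno -> 0, 1, 2

# Complete pooling: one regression, Laplace (heavy-tailed) priors on coefficients
with pm.Model():
    intercept = pm.Laplace("intercept", mu=0, b=1)
    slopes = pm.Laplace("slopes", mu=0, b=1, shape=len(predictors))
    sigma = pm.HalfNormal("sigma", sigma=1)
    pm.Normal("y", mu=intercept + pm.math.dot(X, slopes), sigma=sigma, observed=y)
    idata_pooled = pm.sample(1000, tune=1000, progressbar=False)

# Diagnostics: RMSE and R-squared based on the posterior-mean prediction
b_hat = idata_pooled.posterior["slopes"].mean(dim=("chain", "draw")).values
a_hat = float(idata_pooled.posterior["intercept"].mean())
y_hat = a_hat + X @ b_hat
rmse = np.sqrt(np.mean((y - y_hat) ** 2))
r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
print(f"RMSE = {rmse:.3f}, R-squared = {r2:.3f}")

# Partial pooling: varying intercept by geographical zone (slope fixed)
with pm.Model():
    mu_a = pm.Normal("mu_a", mu=0, sigma=1)
    tau = pm.HalfNormal("tau", sigma=1)               # between-zone sd
    a_zone = pm.Normal("a_zone", mu=mu_a, sigma=tau, shape=3)
    slopes = pm.Laplace("slopes", mu=0, b=1, shape=len(predictors))
    sigma = pm.HalfNormal("sigma", sigma=1)           # within-zone sd
    pm.Normal("y", mu=a_zone[zone] + pm.math.dot(X, slopes), sigma=sigma, observed=y)
    idata_hier = pm.sample(1000, tune=1000, progressbar=False)
```

Comparing the RMSE and R-squared of the two fitted models is what supports the conclusion above that, for these data, the pooled model is preferable.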