Scottish Social Survey Network: Master Class 1
Data Analysis with Stata
Dr Vernon Gayle and Dr Paul Lambert
23rd January 2008, University of Stirling
The SSSN is funded under Phase II of the ESRC Research Development Initiative

Multilevel data and analysis with Stata (in 15 minutes)

Generalised linear model
• Y = BX + e
• Y = outcome variable(s)
• X = explanatory variables
• e = error term for each individual response
Generalised linear mixed models
  – Adding complexity to the GLM, such as by disaggregating the error structures

The work of statistical modelling
• Yi = BXi + ei
• Most of the time:
  – we have a single Y
  – we ignore e
  – we concentrate on what goes into B

Example
• Data: British Household Panel Survey 2005 adult interviews (7k adults in work)
• Y = GHQ scale score for adults in employment (General Health Questionnaire, higher = worse subjective well-being)
• X = various possible measures, including gender, age, marital status, occupational advantage, education, partner’s GHQ
• You can run this example, the files are at:

Results from four linear models

                   Model 1    Model 2    Model 3    Model 4
Cons               11.03**    6.29**     6.14**     6.56**
Fem                           1.25**     1.28**     1.39**
Age                           0.22**     0.23**     0.22**
Age-squared                   -0.0024**  -0.0026**  -0.0024**
Cohab              -0.33*     -0.77**    -0.76**    -1.52**
Own CAMSIS                               -0.01*     -0.01
Father’s CAMSIS                                     0.01
Degree/Diploma                                      -0.05
Vocational qual                                     -0.13
No qual                                             -0.11
Works > 10hrs                                       0.13
Partner’s GHQ                                       0.08**
R2                 0.0009     0.0234     0.0244     0.0293

Some regression assumptions
• All variables are measured without error
• All relevant predictors of the dependent variable are included in the analysis
• Expected value of the error is zero
• Homoscedasticity of the error (constant error variance)
• No autocorrelation (no relation between error terms for different cases)
  – [above using: Menard, S. 1995. Applied Logistic Regression Analysis, London: Sage.]

Multilevel modelling
• What if there was some connection between some of the cases within the dataset?
  – This occurs by design in certain projects
    • e.g.
educational research, where the sample includes multiple children from the same school
  – Some connections (‘hierarchical clusters’) are standard in most social surveys

[Diagram: clustering structures in a survey dataset. Labels: Regions; PSU1, PSU2, PSU3; Person Groups; Individuals; Wave 1, Wave 2, Wave 3; Interviewers (Interviewer1: waves 1 and 3; Interviewer2: wave 2 only; Interviewer3)]

How to account for hierarchy / clustering in individual data?
1. We could try a unique dummy variable for every cluster
  – Country: Y = BX + scot + wal + nir + e
  – ‘areg’ in Stata allows several hundred variables like this
  – often called a ‘hierarchical fixed effect’
  – but many hierarchies have too many clusters for this to be satisfactory
2. We could use higher level explanatory variables
  – e.g. average unemployment rate in local authority district
  – these are also ‘hierarchical fixed effects’
3. We could try telling the model that we expect the error terms to be related
  – these are ‘hierarchical random effects’ = multilevel models

Creating a multilevel model
• Linear model: Yi = BXi + ei
• Multilevel model (‘random intercepts’): Yij = BXij + uj + eij
• Multilevel model (‘random coefficients’): Yij = BXij + UBj + uj + eij
  – i indexes individuals, j indexes clusters; uj is an error term shared by all cases in cluster j

How to implement multilevel models?
• In SPSS and Stata, extension specifications can be used to fit the simplest random intercepts model

Stata examples
• regress ghq fem age age2 cohab
• regress ghq fem age age2 cohab, robust cluster(ohid)
• xtmixed ghq fem age age2 cohab || ohid:

Comments
• Models which ignore clustering should be unbiased but inefficient
• The simplest multilevel model:
  – Shouldn’t change coefficient estimates (unbiased)
  – Should change confidence intervals (inefficient)

3-level model in Stata (xtmixed)

The same model in MLwiN

A controversial claim about Stata
• Stata is the best package to use for multilevel modelling, because:
  – It is integrated with data management capacity: easy to change variables; change cases; add higher level explanatory variables; etc
  – It has a wide range of hierarchical model estimators
  – It allows easy comparison between long-standing hierarchical estimators (from economics) and new random effects models
• By contrast:
  – Other mainstream packages don’t have an adequate range of model estimators
  – Specialist packages (e.g. MLwiN; HLM) do have more advanced modelling estimators, but they inhibit data manipulation / serious model building
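
One way to try the comparison of estimators described above is to store each model and tabulate the coefficients and standard errors side by side. The following is a minimal sketch, not the presenters' own do-file. It assumes the example dataset is already in memory with the variables from the ‘Stata examples’ slide (ghq, fem, age, age2, cohab) and the cluster identifier ohid. The stored-estimate names (ols, clus, fe, ri, rc) are hypothetical labels, and the random slope on fem is an arbitrary choice made purely for illustration. In Stata 13 and later, xtmixed is available under the name mixed.

```stata
* Assumption: the example dataset is already loaded (use ..., clear)

* 1. Linear regression that ignores clustering
regress ghq fem age age2 cohab
estimates store ols

* 2. Same coefficients, standard errors adjusted for clustering within ohid
regress ghq fem age age2 cohab, vce(cluster ohid)
estimates store clus

* 3. 'Hierarchical fixed effect': absorb one dummy per cluster
areg ghq fem age age2 cohab, absorb(ohid)
estimates store fe

* 4. Random intercepts multilevel model (uj for each ohid)
xtmixed ghq fem age age2 cohab || ohid:
estimates store ri

* 5. Random coefficients model (illustrative: slope of fem varies by ohid)
xtmixed ghq fem age age2 cohab || ohid: fem
estimates store rc

* Side-by-side coefficients and standard errors
estimates table ols clus fe ri rc, b(%9.3f) se(%9.3f)
```

If the ‘Comments’ slide holds for these data, the fem, age, age2 and cohab coefficients should look similar in the ols, clus and ri columns, while the standard errors differ. The xtmixed rows in the table also include the estimated variance parameters, which have no counterpart in the regress columns.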