Introduction to Stata using the Northern Ireland Household Panel Survey (NIHPS)
Katrina Lloyd (QUB)
Patricia McKee (UU)

Format
• 9:15 Intro to NIHPS
• 9:30 Intro to Stata
• 10:30 – 11:00 Coffee break
• 11:00 Stata files – log / do; Advantages of Stata
• 12:30 Questions / examples

NIHPS
• NIHPS began in 2001 and is an extension of the BHPS (1991)
• ISER at Essex University has overall responsibility for the survey
• NISRA carries out fieldwork in NI
• 6 waves of NIHPS data available from the UK Data Archive (2001-2006)

NIHPS
• NIHPS follows a representative sample of individuals
• Household-based interviewing:
  – All adults aged 16+
  – From Wave 4, all children aged 11-15 (Youth Panel)
• Unique value is that NIHPS measures change at the individual level

NIHPS
• Achieved sample (full interviews all years)
  – Wave 1 - 3,458 individuals in 1,978 households
  – Wave 2 - 2,692 individuals
  – Wave 3 - 2,414 individuals
  – By Wave 6 - 2,151 individuals
• Attrition

Content of the NIHPS
• NIHPS has 3 components:
• Core component, asked every year
  – Includes health, housing, finances
• Rotating core component, asked every 3 years
  – Includes wealth, assets and debt, parenting
• Variable component, asked once in the panel
  – Includes race, place of birth, age left school

NIHPS datasets
• Cross-sectional files for each wave
• Longitudinal files for individuals
• Files linked by common variables
  – PID (unique Personal Identification Number)
  – wHID (Household ID – changes year on year)
  – wPNO (person number – changes year on year)
• w is the wave prefix: k, l, m, n, o, p correspond to the years 2001-2006 respectively

NIHPS
Record Type   Record Description
wHHSAMP     - household-level data for issued households
wHHRESP     - household-level data for responding households
wINDSAMP    - individual-level data for issued households
wINDALL     - enumerated individuals' data (including children and nonrespondents)
wINDRESP    - individual-level data for respondents
wEGOALT     - relationship of each individual in a household
wINCOME     - income and payment data
wJOBHIST    - information from the employment history

NIHPS additional files
wMARRIAG    - one record for each reported legal marriage
wCOHABIT    - one record for each cohabitation spell outside marriage
wCHILDAD    - information about adopted and/or stepchildren
wCHILDNT    - information about natural children
wCHILD      - information on children and parenting styles
wYOUTH      - responses to the Young persons questionnaire
wLIFEMST    - information about employment status spells

NIHPS additional files
For ALL Waves
XWAVEID     - information for matching individuals between waves
XWLSTEN     - information on the latest known sample status of individuals
XWAVEDAT    - central source of data on individuals which is fixed and only measured once in the panel, e.g. race

Files using today: wINDALL
kindall.dta   obs: 5,188   vars: 52
lindall.dta   obs: 4,589   vars: 54
mindall.dta   obs: 4,210   vars: 55
nindall.dta   obs: 3,940   vars: 55
oindall.dta   obs: 3,809   vars: 55
pindall.dta   obs: 3,650   vars: 55

Stata windows
• Previous commands
• Results
• Variables
• Commands

Edit Preferences
• Click on the Edit tab
• Come down to Preferences
• Select General Preferences

LOG files – record your session
• Start
  – Either click the icon or select File > Log > Begin
• Types
  – .smcl = Stata formatted
  – .log = a text (ASCII) file
• Choices
  – View existing file
  – Append new to old
  – Overwrite with new
• Closure
  – When you exit
  – Choose to suspend / resume

Log file
• Choose folder
• Give filename
• Choose type
Note: if a log file is on, its name appears below the Results window and above the Command window

DO files
• Text file containing commands, used instead of typing commands at the keyboard
• Contents of the Review window (previous commands) can be saved into a do-file
• Do-files may call other do-files, which may call other do-files, nested up to 64 deep; alternatively, a master.do can call up to 1,000 do-files one after the other

Do file
Note: comment
Select commands to run and click the icon

Built-in Variables
• _pi contains the value π to machine
precision
• _n contains the number of the current obs.
  – E.g.
      age   23  34  45  56
      _n     1   2   3   4
• _N contains the total number of obs.
  – E.g.
      age   23  34  45  56
      _N     4   4   4   4
Note: Stata respects case, so these are 3 distinct names: myvar Myvar MYVAR

Example of _n and _N
    use kindall, clear
    sort khid kpno                     // sort file by hhold and pno within hhold
    gen totcases = _N                  // generate total number of obs
    * For each hhold generate no of people in hhold
    bysort khid: gen totninhh = _N
    * For each hhold generate the person's number within the hhold
    bysort khid: gen nwithinhh = _n
    list pid khid kpno totninhh nwithinhh in 1/20
    tab totninhh nwithinhh, miss       // crosstab, including missing

gen totcases = _N // generate total number of obs
tab totcases

    totcases      Freq.    Percent      Cum.
        5188      5,188     100.00    100.00
       Total      5,188     100.00

bysort khid: gen totninhh = _N
tab totninhh

    totninhh      Freq.    Percent      Cum.
           1        518       9.98      9.98
           2      1,238      23.86     33.85   <- 2 persons
           3        915      17.64     51.48
           4      1,176      22.67     74.15   <- 4 persons
           5        830      16.00     90.15
           6        252       4.86     95.01
           7        175       3.37     98.38
           8         56       1.08     99.46
           9         18       0.35     99.81
          10         10       0.19    100.00
       Total      5,188     100.00

list pid khid kpno totninhh nwithinhh in 1/20

    Case        pid        khid   kpno   totninhh   nwithinhh
      1.  118500023    11850027      1          3           1
      2.  118500058    11850027      2          3           2
      3.  118500074    11850027      3          3           3
      4.  118500317    11850043      1          1           1
      5.  118501135    11850116      1          1           1

Saved Results
summarize produces summary statistics
sum kage12

    Variable      Obs       Mean    Std. Dev.   Min   Max
      kage12     5188   35.46164    22.59792      0    97

It also saves 19 scalars in r( ), such as:
  r(N)    – no of obs
  r(mean) – mean
  r(sum)  – sum of the variable (here, age)
  r(sd)   – std deviation
  r(p1)   – 1st percentile
  r(p95)  – 95th percentile
Some are only available with sum kage12, detail
To list the results stored in r( ), type return list

. sum kage12, detail

                      age at 1.12.2001
          Percentiles      Smallest
     1%            0              0
     5%            3              0
    10%            6              0       Obs                5188
    25%           16              0       Sum of Wgt.        5188
    50%           34                      Mean           35.46164
                            Largest       Std. Dev.      22.59792
    75%           53             92
    90%           68             94       Variance       510.6658
    95%           75             96       Skewness       .2723639
    99%           83             97       Kurtosis       2.072386

After sum kage12, detail type return list

scalars:
    r(N)        = 5188
    r(sum_w)    = 5188
    r(mean)     = 35.46164225134927
    r(Var)      = 510.66577343513
    r(sd)       = 22.59791524533026
    r(skewness) = .2723638715033958
    r(kurtosis) = 2.072386222684342
    r(sum)      = 183975
    r(min)      = 0
    r(max)      = 97
    r(p1)       = 0
    r(p5)       = 3
    r(p10)      = 6
    r(p25)      = 16
    r(p50)      = 34
    r(p75)      = 53
    r(p90)      = 68
    r(p95)      = 75
    r(p99)      = 83

LOCAL variables (local macros)
E.g. a local named var is referred to as `var'
The opening ` is from the key beside 1 and the closing ' is the apostrophe key near L

Programming - loop over items/values
• foreach var in
  – loops over items
  – Can be a varlist, newlist or numlist
• forvalues x =
  – loops over consecutive values
  – loop is executed as long as `x' is in range

Example
    * Comment: set up a local macro testvars
    local testvars "khgr2r khgsex kage12"
    * Start of loop – note the opening { and the ending }
    * Could also use: foreach x in khgr2r khgsex kage12 {
    foreach x of local testvars {
        display "the current variable is `x'"
        tab `x'        // displays frequencies
        sum `x'        // produces summary statistics
        ret list       // displays all the saved results
    } // end of loop

Merging data files
• Two kinds of merges
  – One-to-one
  – Match-merge
• Result contained in new var _merge
  – 1 = obs occurred ONLY in master dataset
  – 2 = obs occurred ONLY in using dataset
  – 3 = obs occurred in BOTH master and using datasets

Example of merging
    * Forward slashes are used because a backslash directly before `x'
    * would stop Stata expanding the macro
    local dirdata "j:/nihps/nihps data/"
    * Keep the key and two variables from each wave, sorted by pid
    foreach x in k l m n o p {
        use "`dirdata'`x'indall", clear
        keep pid `x'age12 `x'newhy
        sort pid
        save temp`x', replace
    }
    * Start from wave k and match-merge the later waves on pid
    * (old-style merge syntax: master and using files must be sorted by pid)
    use tempk, clear
    foreach x in l m n o p {
        merge pid using temp`x', _merge(mer`x')
        sort pid
    }

Command to check number of obs:
    tab1 *newhy

kindall.dta   obs: 5,188   vars: 52
lindall.dta   obs: 4,589   vars: 54
mindall.dta   obs: 4,210   vars: 55
nindall.dta   obs: 3,940   vars: 55
oindall.dta   obs: 3,809   vars: 55
pindall.dta   obs: 3,650   vars: 55
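
The merging example above uses the older form of merge, which needs both files sorted on the key. In Stata 11 and later, the same kind of match-merge is usually written with an explicit merge type. The block below is a minimal sketch under that assumption. The two toy datasets, the tempfile names wavek and wavel, and all values are made up for illustration and are not NIHPS data.

```stata
* Toy stand-in for a wave k file (hypothetical values, not NIHPS data)
clear
input pid kage12
1 30
2 40
3 50
end
tempfile wavek
save `wavek'

* Toy stand-in for a wave l file: pid 2 has dropped out, pid 4 has joined
clear
input pid lage12
1 31
3 51
4 20
end
tempfile wavel
save `wavel'

* Match-merge on pid; 1:1 requires pid to be unique in both files
use `wavek', clear
isid pid                                  // stops with an error if pid is not unique
merge 1:1 pid using `wavel', generate(merl)

* merl: 1 = wave k only, 2 = wave l only, 3 = in both waves
tab merl
list
```

With these toy files, pid 2 gets merl = 1, pid 4 gets merl = 2, and pids 1 and 3 get merl = 3. Unlike the older syntax, merge 1:1 does not need the files sorted beforehand, and it stops with an error if pid is duplicated in either file. Because tempfile names are local macros, the block should be run in one go from a do-file.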