Developing Geographical Information Systems In A Cohort Study
Andy Boyd
ALSPAC, Social Medicine
University of Bristol

Geographical Data Matching - the ALSPAC resource
• Overview of our data, the issues involved and our plan for the future
• Time for questions
• Time for discussion on how other studies have developed their GIS data resource

Defining GIS
GIS combine mapping and a record of location with database technology. This can be used in the storage, analysis, management or presentation of data.
[Figure: E. W. Gilbert's 1955 version of John Snow's 1855 Soho Cholera Outbreak Map]

Scope of this presentation
• Not about GIS tools
• Not about GIS analysis or techniques
• It is about the capture and storage of data in an accessible manner to allow future GIS analysis
• Uses ALSPAC as an example

The ALSPAC GIS dataset
• Geographic identifiers collected directly from the cohort
• Data collected via external data sources
• Geographical data linkage
• Precision of geographic variables – accuracy
• Precision of geographic variables – ethics
• Providing the data as an integral part of the resource
• Current data availability

ALSPAC administered data collection
Residential Address (~50,000 address points)
• updated from cohort (self reported)
• team who tracks lost cases
• email
• second contacts
• database searches (osis, electoral roll)
School the young person attends / wishes to attend
• via questionnaire (ALSPAC questionnaires/assessments administered in schools, primary to secondary transition questionnaire)
• clinic attendance interview
• collected from the school

Linkage to external data sources
Validation / Cleaning
• Validation and cleaning of self-reported data using data collected via record linkage (NSTS: NHS tracing; NPD: National Pupil Database; Royal Mail/OS products)
Missing Data
• Enhancing the resource through record linkage
Data collection via geographical identifiers
• Accessing existing data organised around geographical IDs (census data, neighbourhood data)
• Primary data collection (distance to overhead power lines, air quality, commuting, school selection)

Data Collection through Record Linkage
• Office for National Statistics (ONS) Tracing
  – Health Authority
  – Embarkation
• NSTS (NHS Strategic Tracing Service)
  – Address registered with GP
• National Pupil Database (DCSF, DIUS*, UCAS*)
  – School Address
  – Pupil Residential Address
• DWP*
• Home Office*
* Linkage currently being investigated

GIS – ALSPAC Resource
• ~50,000 ALSPAC residential address points, each associated with a date range which can then be linked to ALSPAC data collection
• Schools attendance data from NPD: ~17,000
• Schools attendance data from ALSPAC collection: ~10,000
[Figure: The geographic relation between household income and polluting factories – FoE 1999]

GIS Precision
• Spatial data held at many geographic levels
• Geographies range in scale from 0.1 metres to regional/national data
• Tied together via address, postcode or grid reference as central ID
• Key resources include:
  – National Statistics Postcode Directory (NSPD, formerly the All Fields Postcode Directory, AFPD): geo-linking database
  – Deprivation & socio-economic indices (IMD, Townsend, Acorn)
  – Census data

GIS – How we link cases to data
• Master file of postcodes (NSPD)
• Postcodes linked to grid reference
• Grid references of various scales
• Postcodes/grid references mapped to:
  – Electoral geographies
  – Census geographies
• Ethics:
  – We don't generally identify residence at postcode or equivalent level
[Figure: Ordnance Survey – The National Grid]

GIS – How we link geographies
Current Situation
• Use postcode / postcode centroid grid reference as our highest-precision variable
• Link geographies using NSPD/AFPD appropriate to the measure required
Proposed Method
• Use property reference number (UPRN) / property centroid grid reference as highest-precision variable

GIS Problems
• Shifting geographies across time points
• Royal Mail change postcode areas (and therefore postcode centroids)
• Postcodes are 'recycled'
• Postcode not precise enough in some cases
• Postcode boundaries are not coterminous with other geographic boundaries

Accuracy issues with analysis at postcode level
[Figure panels: Address level; Postcode level]

Linkage problems with the cohort data
• Missing data
  – Especially problematic for the cases who didn't enrol in the original recruitment
  – Gaps in the address data
  – Recorded move date is often the date we were informed, not the actual move date
• However…
  – ONS matched 99.7% of mothers, so we have their old & new NHS numbers and cleaned data (original recruitment cases only)

GIS Data Availability
• Collected as an administrative resource
• Not yet cleaned, documented and presented to usual ALSPAC standards
• Initiatives under way to validate the record and fill gaps in it
• Schools GIS data largely not yet processed
• Aim to build into the standard ALSPAC resource

GIS Ethics
• Data at postcode level or finer are treated as a personal identifier
• Research proposals to use these data need ALSPAC Law & Ethics approval
• Broader geographical data can be released in the normal manner
• A two-stage process is used to collect and process precise data
• Data collected via linkage are not available for all cases due to ethical decisions

GIS Data Access
Step 1 – Postcodes (or full addresses) provided to the researcher with a unique collection ID and no other data attached
Step 2 – Researcher attaches their data and returns the file to ALSPAC
Step 3 – Collection ID converted to the appropriate collaborator ID; postcode data removed
Step 4 – Requested ALSPAC data added to the file and the data sent to the researcher

Andy Boyd
A.W.Boyd@Bristol.ac.uk
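
The four-step GIS Data Access process above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not ALSPAC's implementation: every function name, field name, ID format, postcode and value here is hypothetical. It only shows how a one-off collection ID might let a researcher derive postcode-level measures without ever seeing postcodes and study data together.

```python
import secrets


def export_postcodes(cohort):
    """Step 1: release postcodes keyed only by a one-off collection ID."""
    key = {}      # collection ID -> internal ID; stays with the data custodian
    export = []
    for internal_id, record in cohort.items():
        collection_id = secrets.token_hex(8)
        while collection_id in key:          # guard against a repeated ID
            collection_id = secrets.token_hex(8)
        key[collection_id] = internal_id
        export.append({"collection_id": collection_id,
                       "postcode": record["postcode"]})
    return export, key


def attach_researcher_data(export, lookup):
    """Step 2: the researcher adds a postcode-derived value and returns the file."""
    return [dict(row, exposure=lookup.get(row["postcode"])) for row in export]


def finalise(returned, key, collaborator_ids, cohort, requested):
    """Steps 3 and 4: swap IDs, drop the postcode, add the requested study data."""
    released = []
    for row in returned:
        internal_id = key[row["collection_id"]]
        out = {"collaborator_id": collaborator_ids[internal_id],
               "exposure": row["exposure"]}   # postcode deliberately not copied
        for name in requested:
            out[name] = cohort[internal_id][name]
        released.append(out)
    return released


# Entirely made-up records and postcodes, for illustration only.
cohort = {
    "int-001": {"postcode": "ZZ1 1AA", "birth_weight_g": 3400},
    "int-002": {"postcode": "ZZ2 2BB", "birth_weight_g": 3100},
}
collaborator_ids = {"int-001": "collab-A", "int-002": "collab-B"}
exposure_lookup = {"ZZ1 1AA": 0.8, "ZZ2 2BB": 2.5}  # hypothetical derived measure

export, key = export_postcodes(cohort)
returned = attach_researcher_data(export, exposure_lookup)
released = finalise(returned, key, collaborator_ids, cohort, ["birth_weight_g"])
```

In this sketch the file sent out at Step 1 holds only a collection ID and a postcode, and the file released at Step 4 holds only a collaborator ID, the derived measure and the requested variables. The mapping between the two kinds of ID never leaves the custodian.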