GEODE: Grid Enabled Occupational Data Environment

Paul Lambert and Larry Tan
University of Stirling

Paul Lambert, Larry Tan, Ken Turner, & Vernon Gayle, University of Stirling
Ken Prandy, Cardiff University
Richard Sinnott, University of Glasgow
Erik Bihagen, Stockholm University
Marco van Leeuwen, Intl. Institute for Social History (Amsterdam)

www.geode.stir.ac.uk

GEODE - NeSC workshop, Oct 2006


‘The Grid’ and New Technologies of Data Collection

‘The Grid’ and ‘eScience’:
1. Online Coordination of electronic resources and collaborations
2. (Distributed computing)
• Large scale
• Collaborative
• Heterogeneous
• Standard protocols / information management systems

UK eSocial Science:
1) Investment in assessing / implementing technology
2) Computationally demanding data analysis
3) Qualitative and quantitative data collection technologies
4) **Data sharing, processing and access**


GEODE: Survey records’ occupational data

• The importance of occupational micro-data
• Collecting occupational data
  1) Initial occupational records (textual description)
  2) Processing occupational records:
     Text descriptions
     →(1) Standardised Occupational Index (e.g. unit group: OUG)
     →(2) Substantive occupational summary (e.g. social class code)
• Good practice: Preservation of original, OUG and substantive variables
• NSIs favour transparent occupational data coding (1) and translation systems (2)


Occupational data collection and processing

(1) Text records → OUG data
  Currently:
  • Text coding software (e.g. CASCOT)
  • Manual look-up
  GEODE:
  • Linkage to existing resources
  • Further facilities possible but not planned (users typically have adequate resources)

(2) OUG data → summary indicators
  Currently:
  • Numerous aggregate occupational information resources
  • Bespoke data programming requirements
  GEODE:
  • Core provision: management and access of these data resources
  • Service to large volumes of users


Some illustrative occupational information resources

Resource                                      Index units        # distinct files      Updates?
                                                                 (average size kb)
CAMSIS, www.camsis.stir.ac.uk                 Local OUG*(e.s.)   200 (100)             y
CAMSIS value labels, www.camsis.stir.ac.uk    Local OUG          50 (50)               n
ISEI tools, home.fsw.vu.nl/~ganzeboom         Int. OUG           20 (50)               y
E-Sec matrices, www.iser.essex.ac.uk/esec     Int. OUG*(e.s.)    20 (200)              n
Hakim gender seg codes (Hakim 1998)           Local OUG          2 (paper)             n


What’s the problem?

External user (micro-social data)
id   oug   sex   .
1    110   1     .
2    320   1     .
3    320   2     .
4    874   1     .
5    874   2     .

Occ info (index file) (aggregate)
oug   CS-M   CS-F   EGP
110   60     58     I
320   69     71     II
874   39     51     VIIa

User’s output (micro-social data)
id   oug   CS   .
1    110   60   .
2    320   69   .
3    320   71   .
4    874   39   .
5    874   51   .

Indexed mainly by Occupational Unit Group (OUG). But…
• Numerous alternative occupational data files (time; country; format)
• Alternative OUG schemes; other index factors (‘employment status’)
• Inconsistent translations to social classifications – ‘by file or by fiat’
• Dynamic updates to occupational data resources
• Low uptake of existing occupational information resources
• Strict security constraints on users’ micro-social survey data


GEODE: Grid Enabled Occupational Data Environment

Strategy:
1) Occupational data index service (depository)
   i. Semantic data curation (DDI)
   ii. Data storage (OGSA-DAI)
   iii. Data indexing / access (OGSA-DAI)
2) User-friendly ‘portal’ access
   • Entry to an international virtual organisation for data depositors and users (GridSphere, GT4, OGSA-DAI)
   • Facilitate linking occupational information to users’ datasets (OGSA-DAI)
(initial focus on CAMSIS resources)


Occupational information depository

1.1) Semantic curation of occupational information
• Establish a ‘GEODE-M’ metadata subset (.xml)
• Founded on Michigan Data Documentation Initiative
  <docDscr>
  <stdyDscr>: Release date; Country; Time period; Author
  <fileDscr> <otherMat>: Format; Missing data; Data extensions
  <dataDscr> <varGrp><var>: OUG variable; Other identifier variables; Output variables
• Minimise curation requirements
• Web proforma entry [via Portal using GridSphere]


Technical Objectives

• Create a virtual community of occupational information researchers
  – Gateway for occupational information
  – Data abstraction
  – Uniform access to resources
• Accessible via a portal
• Occupational data curation
  – Annotation of data using DDI
• Occupational matching services
  – e.g. Linking surveyed data to CAMSIS scores


GEODE - Architecture

• VO members can deploy own data services, also occupational matching services
  – Scalable
  – Distributed
• Possible application for other types of social science data
  – Annotation with DDI
  – Custom services can be deployed


GEODE - Prototype

• Simple occupational matching services
• VO of Occupational Data Resources
• Portal for searching external resources


GEODE - Prototype

• Windows environment
• Java
• GridSphere Portal Framework
• Globus Toolkit 4
  – Index Service (Virtual Organization)
  – OGSA-DAI WSRF (Data Access Middleware)
    • Custom OGSA-DAI resources and activities
    • Accesses CSV, Relational data resources


GEODE - Prototype

• Data Documentation Initiative
  – Annotate the data resources
• Occupational Matching Grid Services
  – Checks if DDI of target resource is compatible (e.g. category specified matches requirement)
  – Map occupational unit group to data
  – Returns mapped/matched results
• Demonstration of prototype


Future Work

• Possible extension of VO to other social science related datasets
  – With services
• Variety of occupational data analysis services
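

Illustrative sketch: linking micro-data to an index file

The linkage shown on the ‘What’s the problem?’ slide (user records indexed by OUG and sex, matched against an aggregate index file to obtain a CAMSIS score) can be sketched in a few lines of Python. This is only a minimal illustration of the matching logic, not the GEODE implementation, which the slides describe as Grid services built on OGSA-DAI. The names `INDEX`, `MICRO`, `SEX_TO_COLUMN` and `link_camsis` are hypothetical, and the sex coding (1 = male, 2 = female) is an assumption inferred from the example output on that slide.

```python
# Aggregate index file from the slide, keyed by occupational unit group (OUG).
INDEX = {
    110: {"CS-M": 60, "CS-F": 58, "EGP": "I"},
    320: {"CS-M": 69, "CS-F": 71, "EGP": "II"},
    874: {"CS-M": 39, "CS-F": 51, "EGP": "VIIa"},
}

# User's micro-social data from the slide (other variables omitted).
MICRO = [
    {"id": 1, "oug": 110, "sex": 1},
    {"id": 2, "oug": 320, "sex": 1},
    {"id": 3, "oug": 320, "sex": 2},
    {"id": 4, "oug": 874, "sex": 1},
    {"id": 5, "oug": 874, "sex": 2},
]

# Assumption: sex code 1 selects the male scale, 2 the female scale.
SEX_TO_COLUMN = {1: "CS-M", 2: "CS-F"}


def link_camsis(records, index):
    """Attach a CAMSIS score (CS) to each record by OUG and sex.

    Records whose OUG or sex code has no match get CS = None,
    so unmatched cases stay visible instead of being dropped.
    """
    linked = []
    for rec in records:
        entry = index.get(rec["oug"])
        column = SEX_TO_COLUMN.get(rec["sex"])
        score = entry[column] if entry is not None and column is not None else None
        linked.append({"id": rec["id"], "oug": rec["oug"], "CS": score})
    return linked


if __name__ == "__main__":
    for row in link_camsis(MICRO, INDEX):
        print(row["id"], row["oug"], row["CS"])
    # Prints the "User's output" table from the slide:
    # 1 110 60
    # 2 320 69
    # 3 320 71
    # 4 874 39
    # 5 874 51
```

Returning `None` for unmatched records is a design choice made for this sketch: it keeps every input case in the output, which matters when the index file uses a different OUG scheme from the user's data.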