GEODE: Grid Enabled Occupational Data Environment
Paul Lambert and Larry Tan
University of Stirling

Paul Lambert, Larry Tan, Ken Turner, & Vernon Gayle (University of Stirling)
Ken Prandy (Cardiff University)
Richard Sinnott (University of Glasgow)
Erik Bihagen (Stockholm University)
Marco van Leeuwen (Intl. Institute for Social History, Amsterdam)
www.geode.stir.ac.uk
GEODE - NeSC workshop, Oct 2006
‘The Grid’ and New Technologies of Data Collection
‘The Grid’ and ‘eScience’:
1. Online coordination of electronic resources and collaborations
2. (Distributed computing)
• Large scale
• Collaborative
• Heterogeneous
• Standard protocols / information management systems
UK eSocial Science:
1) Investment in assessing / implementing technology
2) Computationally demanding data analysis
3) Qualitative and quantitative data collection technologies
4) **Data sharing, processing and access**
GEODE: Survey records’ occupational data
The importance of occupational micro-data
Collecting occupational data
1) Initial occupational records (textual description)
2) Processing occupational records:
Text descriptions
→ (1) Standardised Occupational Index (e.g. unit group: OUG)
→ (2) Substantive occupational summary (e.g. social class code)
Good practice:
• Preservation of original, OUG and substantive variables
• NSIs favour transparent occupational data coding (1) and translation systems (2)
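The two processing steps and the good-practice rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the lookup tables, the placeholder job-title strings and the `OccupationRecord` and `process` names are hypothetical. Real text coding would use a tool such as CASCOT and a published translation file. The OUG codes and EGP classes echo the example values used later in these slides.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class OccupationRecord:
    """Keeps all three variables side by side, as the good-practice note suggests."""
    text: str                      # original textual description, kept verbatim
    oug: Optional[int]             # step (1): occupational unit group
    social_class: Optional[str]    # step (2): substantive summary, e.g. a class code


def process(text: str,
            text_to_oug: Mapping[str, int],
            oug_to_class: Mapping[int, str]) -> OccupationRecord:
    """Code a text description to an OUG, then translate the OUG to a class."""
    oug = text_to_oug.get(text.strip().lower())
    social_class = oug_to_class.get(oug) if oug is not None else None
    return OccupationRecord(text, oug, social_class)


# Hypothetical lookups: the job titles are placeholders, not real coding rules.
TEXT_TO_OUG = {"job title a": 110, "job title b": 874}
OUG_TO_EGP = {110: "I", 320: "II", 874: "VIIa"}

record = process("Job title A", TEXT_TO_OUG, OUG_TO_EGP)
unknown = process("unlisted job", TEXT_TO_OUG, OUG_TO_EGP)
```

Because the original text and the OUG are retained, a record such as `unknown` can be re-coded later without returning to the respondent, and a different translation file can be applied to the same OUG.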
Occupational data collection and processing
(1) Text records → OUG data
Currently:
• Text coding software (e.g. CASCOT)
• Manual look-up
GEODE:
• Linkage to existing resources
• Further facilities possible but not planned (users typically have adequate resources)

(2) OUG data → summary indicators
Currently:
• Numerous aggregate occupational information resources
• Bespoke data programming requirements
GEODE:
• Core provision: management and access of these data resources
• Service to large volumes of users
Some illustrative occupational information resources
Resource                                     | Index units       | # distinct files (average size kb) | Updates?
CAMSIS, www.camsis.stir.ac.uk                | Local OUG*(e.s.)  | 200 (100)                          | y
CAMSIS value labels, www.camsis.stir.ac.uk   | Local OUG         | 50 (50)                            | n
ISEI tools, home.fsw.vu.nl/~ganzeboom        | Int. OUG          | 20 (50)                            | y
E-Sec matrices, www.iser.essex.ac.uk/esec    | Int. OUG*(e.s.)   | 20 (200)                           | n
Hakim gender seg codes (Hakim 1998)          | Local OUG         | 2 (paper)                          | n
(e.s. = employment status)
What’s the problem?
External user (micro-social data):
id | oug | sex | .
1  | 110 | 1   | .
2  | 320 | 1   | .
3  | 320 | 2   | .
4  | 874 | 1   | .
5  | 874 | 2   | .

Occ info (index file) (aggregate):
oug | CS-M | CS-F | EGP
110 | 60   | 58   | I
320 | 69   | 71   | II
874 | 39   | 51   | VIIa

User’s output (micro-social data):
id | oug | CS | .
1  | 110 | 60 | .
2  | 320 | 69 | .
3  | 320 | 71 | .
4  | 874 | 39 | .
5  | 874 | 51 | .
Indexed mainly by Occupational Unit Group (OUG). But…
• Numerous alternative occupational data files (time; country; format)
• Alternative OUG schemes; other index factors (‘employment status’)
• Inconsistent translations to social classifications – ‘by file or by fiat’
• Dynamic updates to occupational data resources
• Low uptake of existing occupational information resources
• Strict security constraints on users’ micro-social survey data
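The linkage illustrated on this slide can be sketched as a simple lookup in Python. The values are the ones shown in the example tables. The function name `link_camsis` and the reading of `sex` (1 selects the male scale CS-M, 2 the female scale CS-F) are assumptions inferred from the example output, not something the slide states.

```python
# Aggregate index file: one row per occupational unit group (values from the slide).
OCC_INDEX = {
    110: {"CS-M": 60, "CS-F": 58, "EGP": "I"},
    320: {"CS-M": 69, "CS-F": 71, "EGP": "II"},
    874: {"CS-M": 39, "CS-F": 51, "EGP": "VIIa"},
}

# The external user's micro-social data (values from the slide).
SURVEY = [
    {"id": 1, "oug": 110, "sex": 1},
    {"id": 2, "oug": 320, "sex": 1},
    {"id": 3, "oug": 320, "sex": 2},
    {"id": 4, "oug": 874, "sex": 1},
    {"id": 5, "oug": 874, "sex": 2},
]


def link_camsis(records, index):
    """Attach a sex-specific CAMSIS score (CS) to each survey record via its OUG."""
    linked = []
    for rec in records:
        entry = index.get(rec["oug"])
        # Assumption: sex == 1 uses the male scale, anything else the female scale.
        column = "CS-M" if rec["sex"] == 1 else "CS-F"
        score = entry[column] if entry is not None else None  # None for unknown OUGs
        linked.append({"id": rec["id"], "oug": rec["oug"], "CS": score})
    return linked


output = link_camsis(SURVEY, OCC_INDEX)
```

In this sketch only the aggregate index needs to be shared. The merge itself can run wherever the micro-data are held, which is one way to respect the security constraints listed above.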
GEODE: Grid Enabled Occupational Data Environment
Strategy:
1) Occupational data index service (depository)
i. Semantic data curation (DDI)
ii. Data storage (OGSA-DAI)
iii. Data indexing / access (OGSA-DAI)
2) User-friendly ‘portal’ access
• Entry to an international virtual organisation for data depositors and users (GridSphere, GT4, OGSA-DAI)
• Facilitate linking occupational information to users’ datasets (OGSA-DAI) (initial focus on CAMSIS resources)
Occupational information depository
1.1) Semantic curation of occupational information
• Establish a ‘GEODE-M’ metadata subset (.xml)
– Founded on Michigan Data Documentation Initiative
• Minimise curation requirements
– Web proforma entry [via Portal using GridSphere]

Metadata elements and the information they record:
<docDscr> <stdyDscr>: Release date; Country; Time period; Author
<fileDscr> <otherMat>: Format; Missing data; Data extensions
<dataDscr> <varGrp> <var>: OUG variable; Other identifier variables; Output variables
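As a rough illustration of what a GEODE-M style record might look like, the sketch below builds a small XML document with Python's standard `xml.etree.ElementTree`. The section names (`docDscr`, `stdyDscr`, `fileDscr`, `dataDscr`, `varGrp`, `var`) come from the slide. Everything else is an assumption: the `codeBook` wrapper, the generic `field` element, the grouping of variables by role, the attribute names and the placeholder values. The result has not been validated against the DDI schema, and `otherMat` is omitted for brevity.

```python
import xml.etree.ElementTree as ET


def build_geode_m(study, file_info, variables):
    """Build a GEODE-M style metadata record (illustrative, not schema-valid DDI)."""
    root = ET.Element("codeBook")
    ET.SubElement(root, "docDscr")

    # Study-level description: release date, country, time period, author, etc.
    stdy = ET.SubElement(root, "stdyDscr")
    for name, value in study.items():
        ET.SubElement(stdy, "field", name=name).text = value  # 'field' is hypothetical

    # File-level description: format, missing data codes, and similar.
    file_dscr = ET.SubElement(root, "fileDscr")
    for name, value in file_info.items():
        ET.SubElement(file_dscr, "field", name=name).text = value

    # Variable description: one group per role (OUG, other identifiers, outputs).
    data = ET.SubElement(root, "dataDscr")
    for role, names in variables.items():
        group = ET.SubElement(data, "varGrp", name=role)
        for var_name in names:
            ET.SubElement(group, "var", name=var_name)
    return root


# Placeholder values only; variable names follow the example tables in these slides.
record = build_geode_m(
    study={"country": "XX", "timePeriod": "YYYY", "author": "placeholder"},
    file_info={"format": "csv", "missingData": "-9"},
    variables={"oug": ["oug"], "identifier": ["sex"], "output": ["CS-M", "CS-F", "EGP"]},
)
xml_text = ET.tostring(record, encoding="unicode")
```

Keeping the record this small reflects the stated aim of minimising curation requirements: a web proforma would only need to collect the handful of fields passed to `build_geode_m`.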
Technical Objectives

• Create a virtual community of occupational information researchers
– Gateway for occupational information
– Data abstraction
– Uniform access to resources
• Accessible via a portal
• Occupational data curation
– Annotation of data using DDI
• Occupational matching services
– e.g. Linking surveyed data to CAMSIS scores
GEODE - Architecture

• VO members can deploy their own data services and occupational matching services
– Scalable
– Distributed
• Possible application to other types of social science data
– Annotation with DDI
– Custom services can be deployed
GEODE – Prototype



• Simple occupational matching services
• VO of Occupational Data Resources
• Portal for searching external resources
GEODE - Prototype




• Windows environment
• Java
• GridSphere Portal Framework
• Globus Toolkit 4
– Index Service (Virtual Organization)
– OGSA-DAI WSRF (Data Access Middleware)
  • Custom OGSA-DAI resources and activities
  • Accesses CSV, Relational data resources
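The idea of uniform access to CSV and relational resources can be illustrated with a small Python sketch. It does not use OGSA-DAI or any Globus API. The `OccupationalResource` protocol, the two wrapper classes and the `egp_by_oug` helper are hypothetical names, and SQLite stands in for whatever relational store a depositor might use. The data values are the OUG and EGP codes from the example tables in these slides.

```python
import csv
import io
import sqlite3
from typing import Dict, Iterator, Protocol


class OccupationalResource(Protocol):
    """Anything that can yield its rows as dictionaries of strings."""
    def rows(self) -> Iterator[Dict[str, str]]: ...


class CsvResource:
    def __init__(self, text: str) -> None:
        self._text = text

    def rows(self) -> Iterator[Dict[str, str]]:
        yield from csv.DictReader(io.StringIO(self._text))


class SqlResource:
    def __init__(self, conn: sqlite3.Connection, table: str) -> None:
        self._conn = conn
        self._table = table  # trusted table name; a sketch, not hardened code

    def rows(self) -> Iterator[Dict[str, str]]:
        cursor = self._conn.execute(f'SELECT * FROM "{self._table}"')
        columns = [description[0] for description in cursor.description]
        for values in cursor:
            # Normalise every value to str so both back ends look identical.
            yield {col: str(val) for col, val in zip(columns, values)}


def egp_by_oug(resource: OccupationalResource) -> Dict[str, str]:
    """Client code sees only rows(), never the storage format."""
    return {row["oug"]: row["EGP"] for row in resource.rows()}


csv_res = CsvResource("oug,EGP\n110,I\n320,II\n874,VIIa\n")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE occ (oug INTEGER, EGP TEXT)")
conn.executemany("INSERT INTO occ VALUES (?, ?)", [(110, "I"), (320, "II"), (874, "VIIa")])
sql_res = SqlResource(conn, "occ")
```

Calling `egp_by_oug(csv_res)` and `egp_by_oug(sql_res)` yields the same mapping, which is the property a data access layer is meant to provide to matching services.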
GEODE - Prototype

• Data Documentation Initiative
– Annotate the data resources
• Occupational Matching Grid Services
– Checks if DDI of target resource is compatible (e.g. category specified matches requirement)
– Maps occupational unit group to data
– Returns mapped/matched results
• Demonstration of prototype
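The three steps of the matching service (check compatibility, map, return results) can be sketched as follows. This is a plain Python illustration, not the prototype's grid service. The metadata key `oug_scheme`, the scheme labels `SCHEME-A` and `SCHEME-B`, and the function names are hypothetical stand-ins for whatever the DDI annotation actually records.

```python
def is_compatible(metadata, required_scheme):
    """Step 1: does the resource's annotation declare the OUG scheme we need?"""
    return metadata.get("oug_scheme") == required_scheme


def match(records, resource, required_scheme, output_var):
    """Steps 2 and 3: map each record's OUG to the resource and return the results."""
    if not is_compatible(resource["metadata"], required_scheme):
        raise ValueError("resource metadata does not match the required OUG scheme")
    table = resource["data"]
    return [
        # Unknown OUGs are kept in the output with a None value rather than dropped.
        {**rec, output_var: table.get(rec["oug"], {}).get(output_var)}
        for rec in records
    ]


# Hypothetical annotated resource; the EGP values echo the example tables.
RESOURCE = {
    "metadata": {"oug_scheme": "SCHEME-A"},
    "data": {110: {"EGP": "I"}, 320: {"EGP": "II"}, 874: {"EGP": "VIIa"}},
}

matched = match([{"id": 1, "oug": 110}, {"id": 2, "oug": 999}], RESOURCE, "SCHEME-A", "EGP")
```

Failing early on a metadata mismatch guards against the ‘by file or by fiat’ translations mentioned earlier, where codes from one scheme are silently matched against a file indexed by another.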
Future Work

• Possible extension of VO to other social science related datasets
– With services
• Variety of occupational data analysis services