• Applications: Many large-scale research questions in the social sciences may only be answered fully using a multi-disciplinary computationally-intensive analysis;
• Data: The complexity of observational social science data can make data curation, data management and the subsequent analysis particularly difficult;
• Methodologies: Much of the quantitative technology presently used in the social sciences dates back to the 1960s and 1970s. Many assumptions in this technology were made in order to minimise computation;
• Computational Culture: Currently, most social scientists in the UK perform their analyses using standard packages or software written for single processors, limiting the scope of the substantive research questions.
• Funded by the ESRC, EPSRC and HEFCE (£1.2M); consists of an array of 103 dual-processor Sun Blade workstations, each having between 1 and 8 gigabytes of memory.
• Fileserver with 1300 gigabytes of disk storage.
• Sixteen of the workstations have "Myrinet" cards installed to allow very high speed communication between them, supporting parallel programs which distribute large amounts of data.
• Jobs are submitted to the array from the HPC frontend machine through the Sun Grid Engine/Codine queuing system or via Globus.
• This in turn distributes each submitted job to one of the many execution hosts, or holds it until a host becomes available.
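The dispatch-or-hold behaviour described above can be illustrated with a small sketch. This is a toy model only, not the Sun Grid Engine or Globus implementation: the class `ToyScheduler`, the host names and the job names are all hypothetical, and real queuing systems also weigh priorities, resource requests and load.

```python
from collections import deque


class ToyScheduler:
    """Toy batch queue: send a job to a free host, or hold it until one frees up."""

    def __init__(self, hosts):
        self.free_hosts = deque(hosts)  # execution hosts that are currently idle
        self.pending = deque()          # jobs held because no host was free
        self.running = {}               # job name -> host it is running on

    def submit(self, job):
        """Dispatch the job if a host is free; otherwise hold it. Returns the host or None."""
        if self.free_hosts:
            host = self.free_hosts.popleft()
            self.running[job] = host
            return host
        self.pending.append(job)
        return None

    def finish(self, job):
        """Mark a job as done and hand its host to the oldest held job, if any."""
        host = self.running.pop(job)
        if self.pending:
            next_job = self.pending.popleft()
            self.running[next_job] = host
        else:
            self.free_hosts.append(host)


# Hypothetical two-host array and three submitted jobs.
sched = ToyScheduler(["node01", "node02"])
first = sched.submit("job-a")   # runs on node01
second = sched.submit("job-b")  # runs on node02
third = sched.submit("job-c")   # no free host, so the job is held
sched.finish("job-a")           # node01 is released and job-c starts on it
```

The held job is started in first-come, first-served order here purely for simplicity; that ordering is an assumption of the sketch.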
Rob Allan’s HPCGrid InfoPortal web page: http://esc.dl.ac.uk/InfoPortal/
We are normally visible here, but it's not picking us up at the moment as there seem to be Monitoring and Discovery Service (MDS) registration problems at grid-support.
Lancaster’s HPC details can be found at: http://giis.globus.org/ldapbrowser/login.php
• Running Globus 2.4 with the following enhancements:
- Andrew McNab's GridPP Pool Account patch (http://www.gridpp.ac.uk/authz/gridmapdir/) to accommodate external job submissions from users without a local HPC account
- a modified version of the original release of Marko Krznaric's SGE Integration Package (new version is at http://www.lesc.ic.ac.uk/projects/epic-gtsge.html)
• Currently investigating adding GT3 functionality to HPC services
• The new LESC EPIC package adds SGE jobmanager functionality to GT3
• Funded by NWDA (£1.77M)
• Four 10GbE links
– 10GbE Carlisle to Lancaster
– 2x 10GbE Lancaster to SJIV C-PoP at Warrington
– 10GbE Lancaster to Daresbury Labs
• Eight 1GbE links
– Carlisle-Lancaster, Carlisle-Penrith
– Penrith-Kendal, Kendal-Lancaster
– Lancaster-Preston, Lancaster-Chorley
– Lancaster-SJIV C-PoP at Warrington
• A Training and Support Environment for Advanced Quantitative Methods in the Social Sciences
• An OGSA Component Based Approach to Middleware for Statistical Modelling
• JISC-funded e-Social Science ReDRESS portal
A Training and Support Environment for Advanced Quantitative Methods in the Social Sciences (ESRC)
Short Courses and Masterclasses (£154k over 2 yrs).
1. Courses cover the main methods of data collection, fundamental aspects of research design, and statistical methods of data analysis;
2. Courses viewed on-line via web browser;
3. Software courses to cover packages and languages ranging from PC to HPC specific software, such as SAS, SPSS, GAUSS and LIMDEP, and programming languages such as C++, FORTRAN and parallel programming.
• National Consultancy Service (£39k over 2 yrs)
An OGSA Component Based Approach to Middleware for Statistical Modelling (£100k)
• SABRE: Statistical software written in Fortran designed to model recurrent events. Standard generalised linear models can be fitted as well as various mixture models with random effects
• R: A free-to-use language and environment for statistical computing and graphics providing a wide variety of statistical and graphical techniques
• Middleware for e-Social Science: Development of a parallel, multilevel, multiprocess (OGSA) implementation of SABRE as an R object to enable social scientists to disentangle the full stochastic complexity of socio-economic processes
Multilevel, multiprocess models
• Most random effect models are for responses of a single type: dichotomous, ordinal or count. A single link function and family are specified. (Takes 2 days on a 2GHz 0.5MB RAM P4)
• Multi-process models are models with two or more substantively different outcomes and correlated random effects.
• Some examples of two process models include health status and mortality, or getting pregnant and finishing school. Each process may, but need not, include repeated outcomes.
• The models can also be used when the data possess a hierarchical structure, e.g. a multi-stage cluster sample, where the responses at the lower levels are more correlated than those higher up, e.g. responses from individual pupils in the same class are more correlated than those from different classes at the same school.
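The hierarchical correlation pattern described above can be checked with a small simulation. This is a minimal sketch and not SABRE: it assumes a continuous response built from independent school, class and pupil effects, each normal with unit variance, and the function names `pearson` and `simulate` are hypothetical. Under those assumptions, two pupils in the same class share two of the three components, while pupils in different classes of the same school share only one.

```python
import random


def pearson(xs, ys):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def simulate(n_schools=3000, seed=1):
    """Simulate pupils nested in classes nested in schools.

    Returns (correlation for same-class pairs,
             correlation for different-class, same-school pairs).
    All variance components are assumed equal to 1.
    """
    rng = random.Random(seed)
    same_class, diff_class = [], []
    for _ in range(n_schools):
        school = rng.gauss(0, 1)   # random effect shared by the whole school
        class_a = rng.gauss(0, 1)  # random effect shared within class A
        class_b = rng.gauss(0, 1)  # random effect shared within class B
        pupil_1 = school + class_a + rng.gauss(0, 1)
        pupil_2 = school + class_a + rng.gauss(0, 1)
        pupil_3 = school + class_b + rng.gauss(0, 1)
        same_class.append((pupil_1, pupil_2))
        diff_class.append((pupil_1, pupil_3))
    r_same = pearson(*zip(*same_class))
    r_diff = pearson(*zip(*diff_class))
    return r_same, r_diff


r_same, r_diff = simulate()
print(f"same class: {r_same:.2f}, different class, same school: {r_diff:.2f}")
```

With equal variance components the theoretical correlations are 2/3 within a class and 1/3 between classes of the same school; the simulated values should be close to these, not equal to them.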
• Introductory material from roadshows
• Specific material from the Agenda Setting Workshops
• On-line demonstrators
• Course timetables/notes
• Video/audio material
• Associated reference material and FAQs
• Links to JISC national collections
• Links to partner institutions in Social Science
• World wide links
• E-mail for students/staff
• Additional help for self learners
• Examination and monitoring results
• Single sign-on/certificate-based authentication (same as the Grid and Athens)
• Role-based authorisation (students, staff, managers, developers etc.)
• Database back end for managing users and resources (OGSA-DAI)
• Content management for staff and developers
• Active portal services for Grid-based demonstrations (OGSA, Web services)
• Active monitoring suite to capture workflow and mine for enhanced requirements
• XML/XSLT-driven dynamic pages
• uPortal or Jetspeed framework with services based on BlackBoard, HPCPortal and DataPortal
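As an illustration of the role-based authorisation item in the list above, the following minimal sketch maps roles to permissions and checks a request against them. The role names follow the list (students, staff, managers, developers); the permission names and the function `is_authorised` are hypothetical and do not describe the portal's actual design.

```python
# Hypothetical permission sets for each role named in the requirements list.
ROLE_PERMISSIONS = {
    "student": {"view_course"},
    "staff": {"view_course", "edit_content"},
    "manager": {"view_course", "view_monitoring"},
    "developer": {"view_course", "edit_content", "deploy_service"},
}


def is_authorised(roles, permission):
    """Return True if any of the user's roles grants the requested permission."""
    return any(permission in ROLE_PERMISSIONS.get(role, set()) for role in roles)


# A user may hold several roles; unknown roles grant nothing.
print(is_authorised(["student"], "edit_content"))           # False
print(is_authorised(["student", "staff"], "edit_content"))  # True
```

Keeping permissions attached to roles, not to individual users, means that a new user only needs a role assignment once authentication has established who they are.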
• Nesstar is a web-based facility that allows 66 major datasets to be explored online, with simple sub-setting and simple analyses.
• Only uses one data set at a time;
• Has very limited facilities for sub-setting and none for fusing;
• Restricted statistical facilities, e.g. descriptive analysis, linear regression;
• No facilities for handling missing data;
• Not currently Grid enabled.
• A free web-based service using R, allowing users to submit R jobs and get output back to their web session
• Rweb needs more menus; R has a very extensive statistical library that is not used in Rweb;
• Rweb uses R and not Rmpi. For use in a Grid environment we would need these hooks to extend functionality;
• R also lacks some of the key multiprocess/multilevel and selection model frameworks appropriate to social science data; these are being developed.
1. Social scientists have much less experience and expertise in the use of the Grid than those typically from other research council areas;
2. There is a significant intellectual gap between such disciplines and computer science;
3. Distributed systems are also inherently complex and associated middleware products are not easy to use;
4. The Open Middleware Infrastructure Institute (OMII) will provide (open-source) middleware and associated services, but not specifically targeted at the social science community;
5. Need to build a more computer-literate collaborative culture for Social Science.
We propose:
1. To promote the use of component-based software development and visual composition tools and scripting languages for ease of use;
2. To offer a middleware consultancy service for application developers;
3. To exploit local expertise to develop bespoke middleware solutions for customers;
4. To develop exemplar e-Social Science demonstrators for end users;
5. To exploit state-of-the-art software development technologies such as aspect-oriented programming to enhance flexibility.
Multiple video streams can be delivered into an AG or portlet environment
[Diagram labels: Video Corpus; Researcher A, Researcher B, Researcher C]
[Diagram labels: Main ESDS Data Sets; TTWA Data, NOMIS; Select Data Set and Appropriate Variables; Merge Files; Add Variables; Contextual Data; Working Data; Results; Data Management A, B, C; Middleware; Analysis A, B, C]
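The data-management steps named in the diagram labels (select a data set and variables, merge files, add contextual variables, produce working data) can be sketched as follows. This is an illustration only: the function names, variable names and all values are made up for the example and are not taken from ESDS, TTWA or NOMIS data.

```python
def select_variables(records, variables):
    """Keep only the requested variables from each record."""
    return [{v: rec[v] for v in variables} for rec in records]


def add_contextual(records, contextual, key):
    """Merge area-level contextual variables onto each record via a shared key."""
    merged = []
    for rec in records:
        extra = contextual.get(rec[key], {})  # records with no match are left unchanged
        merged.append({**rec, **extra})
    return merged


# Made-up individual-level records and made-up area-level contextual data.
survey = [
    {"person": 1, "ttwa": "A", "employed": 1, "age": 34},
    {"person": 2, "ttwa": "B", "employed": 0, "age": 51},
]
area_data = {
    "A": {"unemployment_rate": 4.2},
    "B": {"unemployment_rate": 7.9},
}

selected = select_variables(survey, ["person", "ttwa", "employed"])
working = add_contextual(selected, area_data, key="ttwa")
```

The resulting `working` list plays the role of the working data that the analysis steps would consume.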
Lancaster/Daresbury
Other Contributors/Steering Committee
… plus other contributors in the UK, from the USA & Europe
Key components will be accessible on the GRID and linked into the portal and demonstrators
• NWDA NW-GRID (400K kit, 4 staff over 3 years, starts Dec 2003)
A collaboration between Lancaster (£1.0M), Daresbury (£1.0M), Liverpool and Manchester. Staff and equipment (Grids) at each site.
Projects at Lancaster in Environmental Science, Physics, Computing, Sociology, Economics, Applied Statistics and Grid Training
• Our existing quantitative tools rely heavily on assumptions; they come out of a technology that was formed in the 1960s and 1970s, when computers were 10^9 times slower.
• What new research agendas are now relevant? The 3 exponentials will change everything.
• The new opportunities for collaboration and evidence-based research will lead to new (e)science, not just to faster legacy approaches.
• We can now move away from the assumption-ridden technologies and develop robust nonparametric procedures for decomposing the complexity of socioeconomic processes.
• There will be amazing opportunities to make a difference/test policy instruments and address some grand challenges be they in reducing drug abuse, crime and poverty, or improving educational attainment.