L C o E e S S

The Main e-Social Science Issues

• Applications: Many large-scale research questions in the social sciences may only be answered fully using a multi-disciplinary computationally-intensive analysis;

• Data: The complexity of observational social science data can make data curation, data management and the subsequent analysis particularly difficult;

• Methodologies: Much of the quantitative technology presently used in the social sciences dates back to the 1960s and 1970s. Many assumptions in this technology were made in order to minimise computation;

• Computational Culture: Currently, most social scientists in the UK perform their analyses using standard packages or software written for single processors, limiting the scope of the substantive research questions.

Lancaster’s Infrastructure

• Lancaster’s HPC

• NW Trunk

Lancaster’s HPC

• Funded by the ESRC, EPSRC and HEFCE (£1.2M); consists of an array of 103 dual-processor Sun Blade workstations, each with between 1 and 8 gigabytes of memory.

• Fileserver with 1300 gigabytes of disk storage.

• Sixteen of the workstations have "Myrinet" cards installed to allow very high speed communication between them, supporting parallel programs which distribute large amounts of data.

• Jobs are submitted to the array from the HPC frontend machine through the Sun Grid Engine/Codine queuing system or via Globus.

• This in turn distributes each submitted job to one of the many execution hosts, or holds it until a host becomes available.
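The dispatch-or-hold behaviour described above can be illustrated with a toy model. This is only a sketch of the general idea, not how Sun Grid Engine is implemented: the `ToyQueue` class, the host names and the job names are all hypothetical, and real schedulers also weigh priorities, resource requests and load.

```python
from collections import deque


class ToyQueue:
    """Toy FIFO dispatcher: run a job on a free host, or hold it until one frees up."""

    def __init__(self, hosts):
        self.free_hosts = deque(hosts)  # execution hosts with no job running
        self.pending = deque()          # jobs held until a host is available
        self.running = {}               # job name -> host it is running on

    def submit(self, job):
        """Dispatch the job if a host is free; otherwise hold it in the queue."""
        if self.free_hosts:
            self.running[job] = self.free_hosts.popleft()
        else:
            self.pending.append(job)

    def finish(self, job):
        """Mark a job as done and hand its host to the oldest held job, if any."""
        host = self.running.pop(job)
        if self.pending:
            self.running[self.pending.popleft()] = host
        else:
            self.free_hosts.append(host)


# Hypothetical example: two hosts, three jobs.
queue = ToyQueue(hosts=["node01", "node02"])
for name in ["job-a", "job-b", "job-c"]:
    queue.submit(name)
# job-c is held at this point because both hosts are busy.
queue.finish("job-a")
# job-c has now been dispatched to the host that job-a released.
```
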

Lancaster’s HPC

Rob Allan’s HPCGrid InfoPortal web page: http://esc.dl.ac.uk/InfoPortal/

We are normally visible here, but it's not picking us up at the moment as there seem to be Monitoring and Discovery Service (MDS) registration problems at grid-support.

Lancaster’s HPC details can be found at: http://giis.globus.org/ldapbrowser/login.php

Lancaster’s HPC

• Running Globus 2.4 with the following enhancements:

- Andrew McNab's GridPP Pool Account patch (http://www.gridpp.ac.uk/authz/gridmapdir/) to accommodate external job submissions from users without a local HPC account

- a modified version of the original release of Marko Krznaric's SGE Integration Package (new version is at http://www.lesc.ic.ac.uk/projects/epic-gtsge.html)

• Currently investigating adding GT3 functionality to HPC services

• The new LESC EPIC package adds SGE jobmanager functionality to GT3

NW Trunk

• Funded by NWDA (£1.77M)

• Four 10GbE links

– 10GbE Carlisle to Lancaster

– 2x 10GbE Lancaster to SJIV C-PoP at Warrington

– 10GbE Lancaster to Daresbury Labs

• Eight 1GbE links

– Carlisle-Lancaster, Carlisle-Penrith

– Penrith-Kendal, Kendal-Lancaster

– Lancaster-Preston, Lancaster-Chorley

– Lancaster-SJIV C-PoP at Warrington

Existing Lancaster Projects

• A Training and Support Environment for Advanced Quantitative Methods in the Social Sciences

• An OGSA Component Based Approach to Middleware for Statistical Modelling

• JISC-funded e-Social Science ReDRESS portal

A Training and Support Environment for Advanced Quantitative Methods in the Social Sciences (ESRC)

Short Courses and Masterclasses (£154k over 2 yrs).

1. Courses cover the main methods of data collection, fundamental aspects of research design, and statistical methods of data analysis;

2. Courses viewed on-line via web browser;

3. Software courses covering packages that range from PC to HPC-specific software, such as SAS, SPSS, GAUSS and LIMDEP, and programming topics such as C++, FORTRAN and parallel programming.

• National Consultancy Service (£39k over 2 yrs)

An OGSA Component Based Approach to Middleware for Statistical Modelling (£100k)

• SABRE: Statistical software written in Fortran designed to model recurrent events. Standard generalised linear models can be fitted as well as various mixture models with random effects

• R: A free-to-use language and environment for statistical computing and graphics providing a wide variety of statistical and graphical techniques

• Middleware for e-Social Science: Development of a parallel, multilevel, multiprocess (OGSA) implementation of SABRE as an R object to enable social scientists to disentangle the full stochastic complexity of socio-economic processes

Multilevel, multiprocess models

• Most random effect models are for responses of a single type, either dichotomous, ordinal or count. A single link function and family are specified. (Such models can take 2 days to fit on a 2GHz 0.5MB RAM P4.)

• Multiprocess models are models with two or more substantively different outcomes, linked by correlated random effects.

• Some examples of two process models include health status and mortality, or getting pregnant and finishing school. Each process may, but need not, include repeated outcomes.

• The models can also be used when the data possess a hierarchical structure, e.g. multi-stage cluster sample, where the responses at the lower levels are more correlated than those higher up, e.g. responses on individual pupils in the same class are more correlated than those between classes at the same school.
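The hierarchical point above (pupils in the same class resemble each other more than pupils in different classes) can be made concrete with a small simulation. The sketch below is an illustration only, not SABRE's method: it assumes a simple two-level random-intercept model with balanced classes, and the function names, sample sizes and variances are all made up. It estimates the intraclass correlation (ICC), the share of total variance that sits at the class level, using the standard one-way ANOVA estimator.

```python
import random
from statistics import mean


def simulate_classes(n_classes, pupils_per_class, sd_class, sd_pupil, seed=0):
    """Simulate pupil scores as a shared class effect plus individual noise."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_classes):
        class_effect = rng.gauss(0.0, sd_class)  # random intercept shared by the class
        data.append([class_effect + rng.gauss(0.0, sd_pupil)
                     for _ in range(pupils_per_class)])
    return data


def intraclass_correlation(data):
    """One-way ANOVA estimate of the ICC, assuming equally sized groups."""
    n_groups = len(data)
    n = len(data[0])
    group_means = [mean(group) for group in data]
    grand_mean = mean(group_means)
    ms_between = n * sum((m - grand_mean) ** 2 for m in group_means) / (n_groups - 1)
    ms_within = sum((y - m) ** 2
                    for group, m in zip(data, group_means)
                    for y in group) / (n_groups * (n - 1))
    var_between = max((ms_between - ms_within) / n, 0.0)  # truncate negative estimates
    return var_between / (var_between + ms_within)


# Equal class-level and pupil-level variances give a true ICC of 0.5.
scores = simulate_classes(n_classes=200, pupils_per_class=20, sd_class=1.0, sd_pupil=1.0)
icc = intraclass_correlation(scores)
print(f"estimated ICC: {icc:.2f} (true value by construction: 0.50)")
```

Ignoring a non-zero ICC and treating the pupils as independent understates standard errors, which is one reason the random effects are modelled explicitly.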

ReDRESS Portal (Content)

• Introductory material from roadshows

• Specific material from the Agenda Setting Workshops

• On-line demonstrators

• Course timetables/notes

• Video/audio material

• Associated reference material and FAQs

• Links to JISC national collections

• Links to partner institutions in Social Science

• World wide links

• E-mail for students/staff

• Additional help for self learners

• Examination and monitoring results

ReDRESS Portal (functionality)

• Single sign-on/certificate-based authentication (same as the Grid and Athens)

• Role-based authorisation (students, staff, managers, developers etc.)

• Database back end for managing users and resources (OGSA-DAI)

• Content management for staff and developers

• Active portal services for Grid-based demonstrations (OGSA, Web services)

• Active monitoring suite to capture workflow and mine for enhanced requirements

• XML/XSLT-driven dynamic pages

• uPortal or Jetspeed framework with services based on BlackBoard, HPCPortal and DataPortal

ReDRESS will use/contribute to this technology

ReDRESS Content: (Existing Tools)

• Nesstar is a web-based facility that allows 66 major datasets to be explored online, with simple sub-setting and simple analyses.

• Only uses one data set at a time;

• Has very limited facilities for sub-setting and none for fusing;

• Restricted statistical facilities, e.g. descriptive analysis, linear regression;

• No facilities for handling missing data;

• Not currently Grid enabled.

ReDRESS Content:

(Existing Tools)

• Rweb is a free web-based service using R, allowing users to submit R jobs and get output back to their web session;

• Rweb needs more menus: R has a very extensive statistical library, most of which is not exposed in Rweb;

• Rweb uses R and not Rmpi. For use in a Grid environment we would need these hooks to extend functionality;

• R also lacks some of the key multiprocess/multilevel and selection model frameworks appropriate to social science data; these are being developed.

Content: New Tools / Middleware

1. Social scientists typically have much less experience and expertise in the use of the Grid than researchers in other research council areas;

2. There is a significant intellectual gap between such disciplines and computer science;

3. Distributed systems are also inherently complex and associated middleware products are not easy to use;

4. The Open Middleware Infrastructure Institute (OMII) will provide (open-source) middleware and associated services, but these are not specifically targeted at the social science community;

5. Need to build a more computer-literate collaborative culture for Social Science.

Content: New Tools / Middleware

We propose:

1. To promote the use of component-based software development and visual composition tools and scripting languages for ease of use;

2. To offer a middleware consultancy service for application developers;

3. To exploit local expertise to develop bespoke middleware solutions for customers;

4. To develop exemplar e-Social Science demonstrators for end users;

5. To exploit state-of-the-art software development technologies such as aspect-oriented programming to enhance flexibility.

New Tools: Ex 1. VIDGRID

Multiple video streams can be delivered into an AG or portlet environment

[Diagram labels: Video Corpus; Researcher A; Researcher B; Researcher C]

New Tools: Ex 2. The Analysis Cycle

[Diagram labels: Main ESDS Data Sets; TTWA Data, NOMIS; Select Data Set and Appropriate Variables; Merge Files: Add Variables; Contextual Data; Working Data; Results]
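The select and merge steps of the analysis cycle can be sketched in a few lines. This is a hedged illustration, not the project's tooling: the records, the area codes, the `unemployment_rate` variable and the `build_working_data` helper are all hypothetical, standing in for an individual-level survey extract and area-level contextual data of the kind the slide mentions.

```python
# Hypothetical individual-level records, keyed by a made-up area code.
individuals = [
    {"person_id": 1, "area": "A01", "employed": 1, "age": 34},
    {"person_id": 2, "area": "A02", "employed": 0, "age": 51},
    {"person_id": 3, "area": "A01", "employed": 1, "age": 27},
]

# Hypothetical contextual (area-level) data, e.g. a local unemployment rate.
contextual = {
    "A01": {"unemployment_rate": 4.2},
    "A02": {"unemployment_rate": 7.9},
}


def build_working_data(records, context, variables):
    """Select the requested variables and attach area-level context to each record."""
    working = []
    for rec in records:
        row = {name: rec[name] for name in variables}
        # Records whose area has no contextual entry are kept without extra variables.
        row.update(context.get(rec["area"], {}))
        working.append(row)
    return working


# Select three variables, then merge in the contextual data to form the working data.
working_data = build_working_data(individuals, contextual,
                                  ["person_id", "area", "employed"])
```

The resulting working data set is what would then be passed on to the analysis step.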

New Tools: Ex 2. Linking Components

[Diagram labels: Data Management A, B and C; Middleware; Analysis A, B and C]

The ReDRESS Community

Lancaster/Daresbury

Other Contributors/Steering Committee

… plus other contributors from the UK, the USA and Europe

Key components will be accessible on the GRID and linked into the portal and demonstrators

New Lancaster Projects

• NWDA NW-GRID (400K kit, 4 staff over 3 years, starts Dec 2003)

A collaboration between Lancaster (£1.0M), Daresbury (£1.0M), Liverpool and Manchester. Staff and equipment (Grids) at each site.

Projects at Lancaster in Environmental Science, Physics, Computing, Sociology, Economics, Applied Statistics and Grid Training

The e-Social Science Future

• Our existing quantitative tools rely heavily on assumptions; they come out of a technology that was formed in the 1960s and 1970s, when computers were 10**9 times slower.

• What new research agendas are now relevant? The 3 exponentials will change everything.

• The new opportunities for collaboration and evidence-based research will lead to new (e-)science, not just to faster legacy approaches.

• We can now move away from the assumption-ridden technologies and develop robust nonparametric procedures for decomposing the complexity of socio-economic processes.

• There will be amazing opportunities to make a difference/test policy instruments and address some grand challenges be they in reducing drug abuse, crime and poverty, or improving educational attainment.
