LCoEeSS

The Main e-Social Science Issues
• Applications: Many large-scale research questions in the social sciences may only be answered fully using a multi-disciplinary, computationally intensive analysis;
• Data: The complexity of observational social science data can make data curation, data management and the subsequent analysis particularly difficult;
• Methodologies: Much of the quantitative technology presently used in the social sciences dates back to the 1960s and 1970s. Many assumptions in this technology were made in order to minimise computation;
• Computational Culture: Currently, most social scientists in the UK perform their analyses using standard packages or software written for single processors, limiting the scope of the substantive research questions.

Lancaster’s Infrastructure
• Lancaster’s HPC
• NW Trunk

Lancaster’s HPC
• Funded by the ESRC, EPSRC and HEFCE (£1.2M); consists of an array of 103 dual-processor SunBlade workstations, each having between 1 and 8 gigabytes of memory.
• Fileserver with 1300 gigabytes of disk storage.
• Sixteen of the workstations have "Myrinet" cards installed to allow very high speed communication between them, supporting parallel programs which distribute large amounts of data.
• Jobs are submitted to the array from the HPC frontend machine through the Sun Grid Engine/Codine queuing system or via Globus.
• This in turn distributes each submitted job to one of the many execution hosts, or holds it until a host becomes available.

Rob Allan’s HPCGrid InfoPortal web page: http://esc.dl.ac.uk/InfoPortal/
We are normally visible here, but it is not picking us up at the moment as there seem to be Monitoring and Discovery Service (MDS) registration problems at grid-support.
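As a rough illustration of the queuing behaviour described for Lancaster's HPC (each submitted job goes to a free execution host, or is held until one becomes available), the sketch below simulates first-come, first-served dispatch in Python. It is not Sun Grid Engine code: the function name `simulate_queue`, the job tuples and the assumption that all jobs are submitted at time 0 are hypothetical, and a real scheduler also weighs priorities and resource requests.

```python
import heapq


def simulate_queue(jobs, n_hosts):
    """Simulate FIFO dispatch of jobs to a fixed pool of execution hosts.

    jobs: list of (name, runtime) tuples, all assumed submitted at time 0.
    Returns a list of (name, host, start_time, end_time) tuples.
    """
    # Each heap entry is (time at which the host becomes free, host id).
    free_at = [(0, host) for host in range(n_hosts)]
    heapq.heapify(free_at)

    schedule = []
    for name, runtime in jobs:
        # The earliest-available host takes the next job in the queue.
        # If that host is still busy, the job is in effect held until
        # the host becomes free (start > 0).
        start, host = heapq.heappop(free_at)
        end = start + runtime
        schedule.append((name, host, start, end))
        heapq.heappush(free_at, (end, host))
    return schedule


if __name__ == "__main__":
    demo_jobs = [("a", 5), ("b", 3), ("c", 2), ("d", 4)]
    for row in simulate_queue(demo_jobs, n_hosts=2):
        print(row)
```

With two hosts, jobs "a" and "b" start at once, while "c" and "d" are held until a host frees up.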
Lancaster’s HPC details can be found at: http://giis.globus.org/ldapbrowser/login.php

Lancaster’s HPC
• Running Globus 2.4 with the following enhancements:
  - Andrew McNab's GridPP Pool Account patch (http://www.gridpp.ac.uk/authz/gridmapdir/) to accommodate external job submissions from users without a local HPC account
  - a modified version of the original release of Marko Krznaric's SGE Integration Package (new version is at http://www.lesc.ic.ac.uk/projects/epic-gtsge.html)
• Currently investigating adding GT3 functionality to HPC services
• The new LESC EPIC package adds SGE jobmanager functionality to GT3

NW Trunk
• Funded by NWDA (£1.77M)
• Four 10GbE links
  – 10GbE Carlisle to Lancaster
  – 2x 10GbE Lancaster to SJIV C-PoP at Warrington
  – 10GbE Lancaster to Daresbury Labs
• Eight 1GbE links
  – Carlisle-Lancaster, Carlisle-Penrith
  – Penrith-Kendal, Kendal-Lancaster
  – Lancaster-Preston, Lancaster-Chorley
  – Lancaster-SJIV C-PoP at Warrington

Existing Lancaster Projects
• A Training and Support Environment for Advanced Quantitative Methods in the Social Sciences
• An OGSA Component Based Approach to Middleware for Statistical Modelling
• JISC-funded e-Social Science ReDRESS portal

A Training and Support Environment for Advanced Quantitative Methods in the Social Sciences (ESRC)
• Short Courses and Masterclasses (£154k over 2 yrs)
1. Courses cover the main methods of data collection, fundamental aspects of research design, and statistical methods of data analysis;
2. Courses viewed online via web browser;
3. Software courses to cover packages and languages ranging from PC to HPC-specific software, such as SAS, SPSS, GAUSS and LIMDEP, and programming languages such as C++ and FORTRAN, together with parallel programming.
• National Consultancy Service (£39k over 2 yrs)

An OGSA Component Based Approach to Middleware for Statistical Modelling (£100k)
• SABRE: Statistical software written in Fortran, designed to model recurrent events. Standard generalised linear models can be fitted, as well as various mixture models with random effects.
• R: A free-to-use language and environment for statistical computing and graphics, providing a wide variety of statistical and graphical techniques.
• Middleware for e-Social Science: Development of a parallel, multilevel, multiprocess (OGSA) implementation of SABRE as an R object, to enable social scientists to disentangle the full stochastic complexity of socio-economic processes.

Multilevel, multiprocess models
• Most random effect models are for responses of a single type, either dichotomous, ordinal or count. A single link function and family are specified. (Such models can take 2 days to fit on a 2GHz 0.5MB RAM P4.)
• Multi-process models are models with two or more substantively different outcomes and correlated random effects.
• Some examples of two-process models include health status and mortality, or getting pregnant and finishing school. Each process may, but need not, include repeated outcomes.
• The models can also be used when the data possess a hierarchical structure, e.g. a multi-stage cluster sample, where the responses at the lower levels are more correlated than those higher up; for example, responses from individual pupils in the same class are more correlated than those between classes at the same school.
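As a minimal sketch of the two-process case above, the model can be written in generic notation. Everything here is an assumption for illustration rather than SABRE's own parameterisation: $y_{ij}^{(r)}$ is the $j$-th outcome of process $r \in \{1, 2\}$ for individual $i$, $x_{ij}^{(r)}$ its covariates, $g_r$ the link function, $f_r$ the response density, and the random effects $u_i^{(r)}$ are taken to be bivariate normal with density $\phi$.

```latex
\begin{aligned}
g_1\!\left(\mathrm{E}\!\left[y_{ij}^{(1)} \mid u_i^{(1)}\right]\right)
  &= x_{ij}^{(1)\top}\beta^{(1)} + u_i^{(1)}, \\
g_2\!\left(\mathrm{E}\!\left[y_{ij}^{(2)} \mid u_i^{(2)}\right]\right)
  &= x_{ij}^{(2)\top}\beta^{(2)} + u_i^{(2)}, \\
\begin{pmatrix} u_i^{(1)} \\ u_i^{(2)} \end{pmatrix}
  &\sim N\!\left(
     \begin{pmatrix} 0 \\ 0 \end{pmatrix},
     \begin{pmatrix}
       \sigma_1^2 & \rho\,\sigma_1\sigma_2 \\
       \rho\,\sigma_1\sigma_2 & \sigma_2^2
     \end{pmatrix}
   \right), \\
L_i(\theta)
  &= \iint \prod_{r=1}^{2} \prod_{j=1}^{n_i^{(r)}}
     f_r\!\left(y_{ij}^{(r)} \mid x_{ij}^{(r)}, u_i^{(r)}\right)
     \phi\!\left(u_i^{(1)}, u_i^{(2)}\right)
     \, du_i^{(1)} \, du_i^{(2)}.
\end{aligned}
```

Under these assumptions the two processes are linked only through the correlation $\rho$. Each individual's likelihood contribution requires a two-dimensional integral, which is one reason such models are computationally demanding.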
ReDRESS Portal (Content)
• Introductory material from roadshows
• Specific material from the Agenda Setting Workshops
• On-line demonstrators
• Course timetables/notes
• Video/audio material
• Associated reference material and FAQs
• Links to JISC national collections
• Links to partner institutions in Social Science
• World wide links
• E-mail for students/staff
• Additional help for self learners
• Examination and monitoring results

ReDRESS Portal (functionality)
• Single sign-on/certificate-based authentication (same as the Grid and Athens)
• Role-based authorisation (students, staff, managers, developers etc.)
• Database back end for managing users and resources (OGSA-DAI)
• Content management for staff and developers
• Active portal services for Grid-based demonstrations (OGSA, Web services)
• Active monitoring suite to capture workflow and mine for enhanced requirements
• XML/XSLT-driven dynamic pages
• uPortal or Jetspeed framework with services based on BlackBoard, HPCPortal and DataPortal

ReDRESS will use/contribute to this technology

ReDRESS Content: (Existing Tools)
• Nesstar is a web-based facility that allows 66 major datasets to be explored online, with simple sub-setting and simple analyses.
• Only uses one data set at a time;
• Has very limited facilities for sub-setting and none for fusing;
• Restricted statistical facilities, e.g. descriptive analysis, linear regression;
• No facilities for handling missing data;
• Not currently Grid enabled.

ReDRESS Content: (Existing Tools)
• Rweb is a free web-based service using R, allowing users to submit R jobs and get output back to their web session;
• Rweb needs more menus; R has a very extensive statistical library available, which is not used in Rweb;
• Rweb uses R and not Rmpi.
For use in a Grid environment we would need these hooks to extend functionality;
• R also lacks some of the key multiprocess/multilevel and selection model frameworks appropriate to social science data; these are being developed.

Content: New Tools / Middleware
1. Social scientists have much less experience and expertise in the use of the Grid than those typically from other research council areas;
2. There is a significant intellectual gap between such disciplines and computer science;
3. Distributed systems are also inherently complex and associated middleware products are not easy to use;
4. The Open Middleware Infrastructure Institute (OMII) will provide (open-source) middleware and associated services, but not specifically targeted at the social science community;
5. Need to build a more computer-literate collaborative culture for Social Science.

Content: New Tools / Middleware
We propose:
1. To promote the use of component-based software development, visual composition tools and scripting languages for ease of use;
2. To offer a middleware consultancy service for application developers;
3. To exploit local expertise to develop bespoke middleware solutions for customers;
4. To develop exemplar e-Social Science demonstrators for end users;
5. To exploit state-of-the-art software development technologies such as aspect-oriented programming to enhance flexibility.

New Tools: Ex 1. VIDGRID
Multiple video streams can be delivered into an AG or portlet environment
[Diagram labels: Video Corpus; Researcher A; Researcher B; Researcher C]

New Tools: Ex 2. The Analysis Cycle
[Diagram labels: Main ESDS Data Sets; TTWA Data, NOMIS; Select Data Set and Appropriate Variables; Merge Files; Add Variables; Working Data; Results; Contextual Data]

New Tools: Ex 2.
Linking Components
[Diagram labels: Data Management A, B, C; Analysis A, B, C; Middleware]

The ReDRESS Community
Lancaster/Daresbury
Other Contributors/Steering Committee
… plus other contributors in the UK, from the USA & Europe
Key components will be accessible on the Grid and linked into the portal and demonstrators

New Lancaster Projects
• NWDA NW-GRID (400K kit, 4 staff over 3 years, starts Dec 2003)
A collaboration between Lancaster (£1.0M), Daresbury (£1.0M), Liverpool and Manchester. Staff and equipment (Grids) at each site. Projects at Lancaster in Env. Science, Physics, Computing, Sociology, Economics, Applied Statistics and Grid Training

The e-Social Science Future
• Our existing quantitative tools rely heavily on assumptions; they come out of a technology that was formed in the 1960s and 1970s, when computers were 10**9 times slower.
• What new research agendas are now relevant? The 3 exponentials will change everything.
• The new opportunities for collaboration and evidence-based research will lead to new (e)science, not just making legacy approaches faster.
• We can now move away from the assumption-ridden technologies and develop robust nonparametric procedures for decomposing the complexity of socioeconomic processes.
• There will be amazing opportunities to make a difference, test policy instruments and address some grand challenges, be they in reducing drug abuse, crime and poverty, or improving educational attainment.