Data Services: What, Why, How
e-Research Meeting
NeSC, 2nd March 2005

Neil Chue Hong
Project Manager, EPCC
N.ChueHong@epcc.ed.ac.uk
+44 131 650 5957

Overview
• The difficulty with data
• Data Services
• Data Middleware
• Data Repositories

e-Research within The University of Edinburgh

The Data Deluge
• Entering an age of data
  – Data explosion
  – CERN: LHC will generate 1 GB/s = 10 PB/y
  – VLBA (NRAO) generates 1 GB/s today
  – Pixar generates 100 TB per movie
  – Storage getting cheaper
• Data stored in many different ways
  – Data resources
  – Relational databases
  – XML databases / files
  – Result files
• Need ways to facilitate
  – Data discovery
  – Data access
  – Data integration
• Empower e-Business and e-Science
  – The Grid is a vehicle for achieving this

What is e-Science?
• Goal: to enable better research
• Method: invention and exploitation of advanced computational methods
  – to generate, curate and analyse research data
    – From experiments, observations and simulations
    – Quality management, preservation and reliable evidence
  – to develop and explore models and simulations
    – Computation and data at extreme scales
    – Trustworthy, economic, timely and relevant results
  – to enable dynamic distributed virtual organisations
    – Facilitate collaboration with resource sharing
    – Security, reliability, accountability, and manageability
• Multiple, independently managed sources of data
  – each with its own time-varying structure
• Creative researchers discover new knowledge by combining data from multiple sources

Composing Observations in Astronomy
• Number and sizes of data sets as of mid-2002, grouped by wavelength
• 12-waveband coverage of large areas of the sky
• Total about 200 TB of data
• Doubling every 12 months
• Largest catalogues near 1B objects
Data and images courtesy Alex Szalay, Johns Hopkins

Data Services: challenges
• Scale
  – Many sites, large collections, many uses
• Longevity
  – Research requirements outlive technical decisions
• Diversity
  – No “one size fits all” solutions will work
  – Primary data, data products, metadata, administrative data, …
• Many Data Resources
  – Independently owned & managed
  – No common goals
  – No common design
  – Work hard for agreements on foundation types and ontologies
  – Autonomous decisions change data, structure, policy, …
  – Geographically distributed
• and I haven’t even mentioned security yet!

The Discovery Process
• Choosing data sources
  – How do you find them?
  – How do they describe and advertise them?
  – Is the equivalent of Google possible?
• Obtaining access to that data
  – Overcoming administrative barriers
  – Overcoming technical barriers
• Understanding that data and extracting from multiple sources
  – The parts you care about for your research
• Combining them using sophisticated models
  – The picture of reality in your head
• Analysis on scales required by statistics
  – Coupling data access with computation
• Repeated Processes
  – Examining variations, covering a set of candidates
  – Monitoring the emerging details

Small problems
• Not just “Grand Challenges”!
  – Also the small problems
• For instance:
  – What happens to data when a researcher leaves a team?
  – How can a research leader point to “popular” data when a new researcher joins?
  – How can you manage your data when you start to run out of local storage space?
  – How do I get my data from one format/database to another?
  – How do I combine my data with your data?
• You need to manage your data: metadata

What is a data service?
• An interface to a stored collection of data
  – e.g. Google and Amazon
  – web services
• But the data could be:
  – replicated
  – shared
  – federated
  – virtual
  – incomplete
• Don’t care about the underlying representation
  – do care about the information it represents

Examples of Data Services
• Many Data Services and applications
  – Commercial databases
  – Web interfaces
  – Applications developed individually by groups and projects
• Also many places to get hold of public data
  – Publications and citation servers
  – Results servers
• Highlight a few of these
  – principally ones trying to bridge the gap between “local” and “distributed”
• But… no such thing as a free lunch
  – Things are not yet “Plug and Play”
  – You will need to expend some effort to use these tools effectively

OGSA-DAI / DQP
• Data Access and Integration / Distributed Query Processing
  – http://www.ogsadai.org.uk
  – Provides a way to access and query heterogeneous, structured data resources
    – Relational databases
    – XML databases
    – files
  – Provides a framework for extending services
    – more smarts, closer to the data
  – “Everything looks like a database”
• National Grid Service starting to host
  – both through OGSA-DAI and Oracle

SRB
• Storage Resource Broker
  – http://www.sdsc.edu/srb/
  – Provides a way to access data sets and resources based on their attributes and/or logical names rather than their names or physical locations
    – may be heterogeneous, distributed and/or replicated
  – Many different ways of connecting
  – Can connect SRB systems together
    – zoneSRB
  – “Everything looks like a filesystem”

SRM and more
• Storage Resource Managers
  – http://forge.gridforum.org/projects/gsm-wg/
  – a joint effort between a number of institutions
    – EU DataGrid/CERN, FermiLab, LBNL, JL
  – to define a standardised interface to “Storage Resource Managers” so that different implementations can work together
  – principally between physics communities, extending further now
• Many other examples of data middleware
  – Replication management and location: RLS, QCDGrid
  – Many “datagrids”: SciDAC, Gfarm
  – GridFTP for efficient transfer
  – Packaged software: Virtual Data Toolkit

EDINA and friends
• EDINA
  – http://edina.ac.uk/
  – Offers the UK tertiary education and research community networked access to a library of data, information and research resources, e.g. geographical data
• Digital Curation Centre
  – http://www.dcc.ac.uk
  – Supports UK institutions in storing, managing and preserving data, to ensure its enhancement and continuing long-term use
• Other national data centres:
  – MIMAS, UKDA, CCLRC DataPortal…

Summary
• Data is important to research
  – across all disciplines
• There is already a large amount of data
  – but it’s sometimes difficult to find and bring together
• Data Services are built to standards
  – which define particular functionality
• Data Services should be composable
  – so that it is easier to work with data
• There is already software out there
  – so it is possible to evaluate against your requirements
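
A sketch: composing two data services

The summary point that data services should be composable, and the earlier point that a data service hides the underlying representation, can be illustrated with a small Python sketch. Everything here is hypothetical and chosen only for illustration: the `DataService` interface, the `lookup` method, the `RelationalService` and `FileService` classes, and the sample records. None of it is the OGSA-DAI, SRB or SRM API. One service is backed by a relational table and the other by a result file, yet the caller combines them through the same interface.

```python
import csv
import io
import sqlite3
from typing import Protocol


class DataService(Protocol):
    """Hypothetical interface: callers see information, not storage details."""

    def lookup(self, object_id: str) -> dict:
        """Return what this service knows about object_id (empty if nothing)."""


class RelationalService:
    """A service backed by a relational table (in-memory SQLite for the sketch)."""

    def __init__(self, rows):
        self._db = sqlite3.connect(":memory:")
        self._db.execute(
            "CREATE TABLE catalogue (object_id TEXT PRIMARY KEY, waveband TEXT)"
        )
        self._db.executemany("INSERT INTO catalogue VALUES (?, ?)", rows)

    def lookup(self, object_id):
        row = self._db.execute(
            "SELECT waveband FROM catalogue WHERE object_id = ?", (object_id,)
        ).fetchone()
        return {"waveband": row[0]} if row else {}


class FileService:
    """A service backed by a result file (CSV text held in memory for the sketch)."""

    def __init__(self, csv_text):
        reader = csv.DictReader(io.StringIO(csv_text))
        self._records = {rec["object_id"]: rec for rec in reader}

    def lookup(self, object_id):
        rec = self._records.get(object_id, {})
        # Drop the key column; the caller already knows the identifier.
        return {k: v for k, v in rec.items() if k != "object_id"}


def combine(services, object_id):
    """Merge what every service knows about one object into a single record."""
    merged = {"object_id": object_id}
    for service in services:
        merged.update(service.lookup(object_id))
    return merged


# Made-up sample data: "my data" in a database, "your data" in a result file.
catalogue = RelationalService([("obj-1", "infrared"), ("obj-2", "radio")])
results = FileService("object_id,flux\nobj-1,3.2\n")

record = combine([catalogue, results], "obj-1")
print(record)
```

The caller of `combine` never learns which service is a database and which is a file. A real deployment would also have to deal with the challenges listed earlier (differing schemas, access control, distribution), which this sketch deliberately ignores.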