Data Services
What, Why, How
e-Research Meeting
NeSC, 2nd March 2005
Neil Chue Hong
Project Manager, EPCC
N.ChueHong@epcc.ed.ac.uk
+44 131 650 5957
Overview
• The difficulty with data
• Data Services
• Data Middleware
• Data Repositories
e-Research within The University of Edinburgh
The Data Deluge
• Entering an age of data
– Data Explosion
– CERN: LHC will generate 1GB/s = 10PB/y
– VLBA (NRAO) generates 1GB/s today
– Pixar generates 100 TB/movie
– Storage getting cheaper
• Data stored in many different ways
– Data resources
– Relational databases
– XML databases / files
– Result files
• Need ways to facilitate
– Data discovery
– Data access
– Data integration
• Empower e-Business and e-Science
– The Grid is a vehicle for achieving this
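A rough check on the CERN figure above: 1 GB/s sustained around the clock for a calendar year would come to about 31.5 PB, so the quoted 10 PB/y is consistent with roughly 10^7 seconds of data taking per year. That running time is an assumption made here for illustration and is not stated on the slide. A minimal Python sketch of the arithmetic, using decimal units:

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # 31,536,000 s in a calendar year
GB_PER_PB = 1_000_000               # decimal units: 1 PB = 10^6 GB


def petabytes(rate_gb_per_s: float, seconds: float) -> float:
    """Volume in PB produced at a constant rate over the given time."""
    return rate_gb_per_s * seconds / GB_PER_PB


# 1 GB/s sustained around the clock for a whole year.
full_year_pb = petabytes(1.0, SECONDS_PER_YEAR)

# 1 GB/s for an ASSUMED 10^7 s of running per year (hypothetical duty cycle).
running_pb = petabytes(1.0, 1e7)

print(f"full year: {full_year_pb:.1f} PB, 1e7 s of running: {running_pb:.1f} PB")
```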
What is e-Science?
• Goal: to enable better research
• Method: Invention and exploitation of advanced computational methods
– to generate, curate and analyse research data
– From experiments, observations and simulations
– Quality management, preservation and reliable evidence
– to develop and explore models and simulations
– Computation and data at extreme scales
– Trustworthy, economic, timely and relevant results
– to enable dynamic distributed virtual organisations
– Facilitate collaboration with resource sharing
– Security, reliability, accountability, and manageability
Multiple, independently managed sources of data – each with its own time-varying structure
Creative researchers discover new knowledge by combining data from multiple sources
Composing Observations in Astronomy
Number and sizes of data sets as of mid-2002, grouped by wavelength
• 12 waveband coverage of large areas of the sky
• Total about 200 TB data
• Doubling every 12 months
• Largest catalogues near 1B objects
Data and images courtesy Alex Szalay, Johns Hopkins
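To give a feel for what "doubling every 12 months" implies, here is a small Python sketch that extrapolates from the 200 TB total quoted above. It assumes the doubling rate stays constant, which the slide does not claim; the function name and the chosen time horizons are illustrative only.

```python
def projected_tb(initial_tb: float, doubling_months: float, months_elapsed: float) -> float:
    """Size after exponential growth with a fixed doubling period."""
    return initial_tb * 2 ** (months_elapsed / doubling_months)


# Extrapolate from about 200 TB, ASSUMING the 12-month doubling continues.
for years in (1, 2, 5):
    size = projected_tb(200, 12, 12 * years)
    print(f"after {years} year(s): about {size:.0f} TB")
```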
Data Services: challenges
• Scale
– Many sites, large collections, many uses
• Longevity
– Research requirements outlive technical decisions
• Diversity
– No “one size fits all” solutions will work
– Primary Data, Data Products, Meta Data, Administrative data, …
• Many Data Resources
– Independently owned & managed
– No common goals
– No common design
– Work hard for agreements on foundation types and ontologies
– Autonomous decisions change data, structure, policy, …
– Geographically distributed
• and I haven’t even mentioned security yet!
The Discovery Process
• Choosing data sources
– How do you find them?
– How do they describe and advertise them?
– Is the equivalent of Google possible?
• Obtaining access to that data
– Overcoming administrative barriers
– Overcoming technical barriers
• Understanding that data and extracting from multiple sources
– The parts you care about for your research
• Combining them using sophisticated models
– The picture of reality in your head
• Analysis on scales required by statistics
– Coupling data access with computation
• Repeated Processes
– Examining variations, covering a set of candidates
– Monitoring the emerging details
Small problems
• Not just “Grand Challenges”!
– Also the small problems
• For instance:
– What happens to data when a researcher leaves a team?
– How can a research leader point to “popular” data when a new researcher joins?
– How can you manage your data when you start to run out of local storage space?
– How do I get my data from one format/database to another?
– How do I combine my data with your data?
• You need to manage your data: metadata
What is a data service?
• An interface to a stored collection of data
– e.g. Google and Amazon
– web services
• But the data could be:
– replicated
– shared
– federated
– virtual
– incomplete
• Don’t care about the underlying representation
– do care about the information it represents
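The idea of an interface that hides the underlying representation can be sketched in Python as follows. All names here (DataService, LocalStore, FederatedStore, describe) are hypothetical and invented for illustration; they do not correspond to any particular product. The point is that the caller uses the same call whether the data is held locally or federated across several stores.

```python
from typing import Iterable, Protocol


class DataService(Protocol):
    """Hypothetical interface: callers see only this, not the storage behind it."""

    def lookup(self, key: str) -> list[str]: ...


class LocalStore:
    """A single in-memory collection."""

    def __init__(self, records: dict[str, str]) -> None:
        self._records = records

    def lookup(self, key: str) -> list[str]:
        value = self._records.get(key)
        return [] if value is None else [value]


class FederatedStore:
    """Presents several independent services as one."""

    def __init__(self, members: Iterable[DataService]) -> None:
        self._members = list(members)

    def lookup(self, key: str) -> list[str]:
        results: list[str] = []
        for member in self._members:
            results.extend(member.lookup(key))
        return results


def describe(service: DataService, key: str) -> str:
    """Client code: identical for local and federated services."""
    hits = service.lookup(key)
    summary = ", ".join(hits) if hits else "no data"
    return f"{key}: {summary}"


optical = LocalStore({"m31": "optical image"})
radio = LocalStore({"m31": "radio map"})
federated = FederatedStore([optical, radio])

print(describe(optical, "m31"))
print(describe(federated, "m31"))
```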
Examples of Data Services
• Many Data Services and applications
– Commercial databases
– Web interfaces
– Applications developed individually by groups and projects
• Also many places to get hold of public data
– Publications and citation servers
– Results servers
• Highlight a few of these
– principally ones trying to bridge the gap between “local” and “distributed”
• But… no such thing as a free lunch
– Things are not yet “Plug and Play”
– You will need to expend some effort to use these tools effectively
OGSA-DAI / DQP
• Data Access and Integration / Distributed Query Processing
– http://www.ogsadai.org.uk
– Provides a way to access and query heterogeneous, structured data resources
– Relational databases
– XML databases
– files
– Provides a framework for extending services
– more smarts, closer to the data
– “Everything looks like a database”
• National Grid Service starting to host
– both through OGSA-DAI and Oracle
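The "everything looks like a database" idea can be illustrated with a short Python sketch that exposes file-style data through the same query language as a relational table. This uses only the Python standard library (csv, sqlite3) and is not the OGSA-DAI API; the table name, column names and sample rows are made up for the example.

```python
import csv
import io
import sqlite3

# Made-up sample data standing in for a result file.
CSV_TEXT = "name,waveband\nsurveyA,optical\nsurveyB,radio\nsurveyC,optical\n"


def load_csv_as_table(conn: sqlite3.Connection, table: str, text: str) -> None:
    """Load CSV text into a table so it can be queried with SQL."""
    rows = list(csv.reader(io.StringIO(text)))
    header, data = rows[0], rows[1:]
    columns = ", ".join(f'"{column}"' for column in header)
    placeholders = ", ".join("?" for _ in header)
    conn.execute(f'CREATE TABLE "{table}" ({columns})')
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', data)


conn = sqlite3.connect(":memory:")
load_csv_as_table(conn, "surveys", CSV_TEXT)

# The file is now queried exactly as a relational table would be.
optical_surveys = conn.execute(
    "SELECT name FROM surveys WHERE waveband = ? ORDER BY name", ("optical",)
).fetchall()
print(optical_surveys)
```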
SRB
• Storage Resource Broker
– http://www.sdsc.edu/srb/
– Provides a way to access data sets and resources based on their attributes and/or logical names rather than their names or physical locations.
– may be heterogeneous, distributed and/or replicated
– Many different ways of connecting
– Can connect SRB systems together
– zoneSRB
– “Everything looks like a filesystem”
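Access by logical name or attribute, rather than by physical location, can be sketched in Python with a small catalogue. This is a hypothetical illustration and not the SRB interface; the Catalogue class, its methods, and the site and path strings are all invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Catalogue:
    """Hypothetical catalogue mapping logical names to replicas and attributes."""

    replicas: dict[str, list[str]] = field(default_factory=dict)
    attributes: dict[str, dict[str, str]] = field(default_factory=dict)

    def register(self, logical_name: str, physical_location: str, **attrs: str) -> None:
        """Record one physical copy of a data set, plus any descriptive attributes."""
        self.replicas.setdefault(logical_name, []).append(physical_location)
        self.attributes.setdefault(logical_name, {}).update(attrs)

    def locate(self, logical_name: str) -> list[str]:
        """Return every known physical location for a logical name."""
        return list(self.replicas.get(logical_name, []))

    def find(self, **wanted: str) -> list[str]:
        """Return logical names whose attributes match all requested values."""
        return sorted(
            name
            for name, attrs in self.attributes.items()
            if all(attrs.get(key) == value for key, value in wanted.items())
        )


catalogue = Catalogue()
catalogue.register("survey/run42", "siteA:/disk3/r42.dat", instrument="camera1")
catalogue.register("survey/run42", "siteB:/tape/0017", instrument="camera1")
catalogue.register("survey/run43", "siteA:/disk3/r43.dat", instrument="camera2")

print(catalogue.locate("survey/run42"))
print(catalogue.find(instrument="camera2"))
```

The user asks for "survey/run42" or for data from a given instrument; which site or replica serves the request is left to the catalogue.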
SRM and more
• Storage Resource Managers
– http://forge.gridforum.org/projects/gsm-wg/
– a joint effort between a number of institutions
– EU DataGrid/CERN, FermiLab, LBNL, JL
– to define a standardised interface to “Storage Resource Managers” so that different implementations can work together
– principally between physics communities, extending further now
• Many other examples of data middleware
– Replication management and location: RLS, QCDGrid
– Many “datagrids”: SciDAC, Gfarm
– GridFTP for efficient transfer
– Packaged software: Virtual Data Toolkit
EDINA and friends
• EDINA
– http://edina.ac.uk/
– Offers the UK tertiary education and research community networked access to a library of data, information and research resources, e.g. geographical data
• Digital Curation Centre
– http://www.dcc.ac.uk
– Supports UK institutions in storing, managing and preserving their data, to ensure its enhancement and continuing long-term use.
• Other national data centres:
– MIMAS, UKDA, CCLRC DataPortal…
Summary
• Data is important to research
– across all disciplines
• There is already a large amount of data
– but it’s sometimes difficult to find and bring together
• Data Services are built to standards
– which define particular functionality
• Data Services should be composable
– so that it is easier to work with data
• There is already software out there
– so it is possible to evaluate against your requirements