Data-Intensive Research Theme: Welcome

Malcolm Atkinson
mpa@nesc.ac.uk
3 November 2010
Data-Intensive Research Theme opening lecture
Welcome to the e-Science Institute
DIR Theme Goals
• Improve understanding of the data and computational challenges of data-intensive research
• Initiate computing-science research to address key challenges, drawing on database knowledge and experience
Today’s Programme
10:35  Opening theme talk (Malcolm Atkinson)
11:30  Short break
11:40  Shaping our questions
11:45  Your talks (three volunteers)
12:30  Lunch
13:30  Breakout groups: models & analysis
15:00  Plenary
15:30  Coffee break
16:00  Closing DPA theme talk (Shantenu Jha)
17:15  Joint DIR & DPA reception
Previous work
• Data-Intensive workshop at eSI
  • Report draft: bit.ly/cfMRn3
  • http://wikis.nesc.ac.uk/escienvoy/DataIntensive_Research:_how_should_we_improve_our_ability_to_use_data
  • http://wiki.esi.ac.uk/Data-Intensive_Research
  • Twitter hashtag: #datares
• USA data-use report (Atkinson & De Roure)
  • Draft: bit.ly/c0G2rn
This DIR theme
• Twitter hashtag: #datares
• http://www.esi.ac.uk/research-themes/15
• http://wiki.esi.ac.uk/Data-IntensiveResearch_Theme
Data-Intensive Research: Can Database Experience Help?
Malcolm Atkinson, Paolo Bresana, Martin Kersten and Alex Szalay
Order of Service
• The data bonanza
• Data-intensive challenges
• Data-intensive constraints
• The shape of answers
• Our question
Definitions
What Is DATA?
• collections of data from instruments, observatories, surveys and simulations;
• results from previous research and earlier surveys;
• data from engineering and built-environment design, planning and production processes;
• data from diagnostic, laboratory, personal and mobile devices;
• streams of data from sensors in the built and natural environment;
• data from monitoring digital communications;
• data transferred during the transactions for business, administration, healthcare and government;
• digital material produced by news feeds, publishing, broadcasting and entertainment;
• documents in collections and held privately; the texts and multi-media ‘images’ in web pages, wikis, blogs, emails and tweets; and
• digitised representations of diverse collections of objects, e.g. of museums’ curated objects and books in literary collections.
What is Data-Intensive?
A problem is data-intensive when considerable care is needed over the use and handling of data in order to solve it.
Data-Intensive Research Events
1993  Oregon DI Systems
1996–98  Bermuda agreement
1999  SDSS Archive DB
2001  Human Genome
2001  DI Computing Environments
2002  BaBar@SLAC
2003  Fort Lauderdale
2003  Hey & Trefethen, The Data Deluge
2004  Digital Curation Centre
2007  NSF DataNet call
2007  XLDB series starts
2008  SciDB starts
2008  Yahoo DI workshop
2009  Harnessing data
2009  Beyond the Data Deluge
2009  Governments’ use of Linked Data
2009  NSF CISE DI call
2009  Toronto Statement
2009  The Fourth Paradigm book
2009  JISC Research Data Management
2009  e-IRG DMTF report
2010  DIR workshop, Edinburgh
2010  DIEW, Japan
2010  DIDC workshop, HPDC
The Data Bonanza
Growth in data
• Faster, cheaper, more sensitive digital devices
• Ubiquitous digital devices
• Automated experimentation and observation
• More and larger simulations
• Ubiquitous connectivity
• Increasing bandwidth and storage capacity
Images from Mario Caccamo’s talk at DIR workshop: wiki.esi.ac.uk/Data-Intensive_Research
Growth in Data
• Business, administration and government
• Healthcare, engineering, planning, transport, communication, ...
• Entertainment, social interaction, games and logging
• Mandated data retention
• Personal data retention
Data IS EVERYWHERE
• It never will be in one place
• Almost all of it is in files
• There are a very large number of small data collections
• A small number of very large collections
• Most questions are best answered using multiple data sources
• Most questions are asked against single data collections
Data IS DIVERSE
• There are many islands of standardisation
• There are many agreed interchange formats
• There are many devices generating data in proprietary forms
• People continuously invent new representations
• Most data organisation grows serendipitously
• Investment in current practices cannot be ignored
Data-Intensive Challenges
Answering society’s big questions
• How to feed everybody
• How to live with climate change
• How to run stable economies
• How to provide health and well-being to an ageing population
• How to deliver sustainable energy
• How to live peacefully and safely on planet Earth
• How to act most effectively in an emergency
Scientists’ hard questions
• What happened at the start of the universe
• How can we understand living organisms
• How does our brain work
• How do our planet’s systems work
• Is there a universal language
Answering RESEARCHERS’ HARD questions
• How to detect and verify subtle correlations
• How to characterise very infrequent phenomena
• How to understand very complex systems
• How to collaborate by sharing data
• How to recognise what data is needed
• How to decide what methods to use
• How best to help in an emergency
Scientific Data Analysis Today
• Scientific data is doubling every year, reaching PBs
• Data is everywhere, never will be at a single location
• Need randomised, incremental algorithms
– Best result in 1 min, 1 hour, 1 day, 1 week
• Architectures increasingly CPU-heavy, IO-poor
• Data-intensive scalable architectures needed
• Most scientific data analysis done on small to midsize
BeoWulf clusters, from faculty startup
• Universities hitting the “power wall”
• Soon we cannot even store the incoming data stream
• Not scalable, not maintainable…
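The call for randomised, incremental algorithms is essentially a call for anytime estimators: scan records in random order and report a running answer whose error bar shrinks the longer you let it run. A minimal sketch in Python (the data source here is synthetic; a real datascope would draw random records from the archive):

```python
import random
import statistics

def anytime_mean(draw_sample, budgets):
    """Randomised, incremental estimation: keep drawing random samples and
    report the running mean and its standard error at each sample budget,
    so a rough answer arrives in minutes and sharpens over hours or days."""
    samples = []
    results = {}
    for budget in sorted(budgets):
        while len(samples) < budget:
            samples.append(draw_sample())
        mean = statistics.fmean(samples)
        stderr = statistics.stdev(samples) / len(samples) ** 0.5
        results[budget] = (mean, stderr)
    return results

# Synthetic stand-in for "read one random record from a petabyte archive".
random.seed(42)
draw = lambda: random.gauss(100.0, 15.0)
for budget, (mean, err) in anytime_mean(draw, [100, 10_000]).items():
    print(f"after {budget:>6} samples: {mean:6.2f} ± {err:.2f}")
```

The 100-sample answer is available almost immediately; the 10,000-sample answer has a roughly ten-times-smaller standard error, which is the "best result in 1 min, 1 hour, 1 day" trade-off in miniature.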
We have a data bonanza
We need a method bonanza
Data-Intensive Constraints
Many LIMITS TO GROWTH
•
Cost of equipment for storage and computation
•
Energy and operational costs
•
Time and cost of data movement
•
Limited supplies of skilled practitioners
Cost of a Petabyte
[Chart: petabyte storage costs, from backblaze.com, Aug 2009]
Slide from Alex Szalay’s talk at XLDB4 workshop: www-conf.slac.stanford.edu/xldb10/
DISC Needs Today
• Disk space, disk space, disk space!
• Current problems not on Google scale yet:
  – 10–30 TB easy, 100 TB doable, 300 TB really hard
  – For detailed analysis we need to park data for several months
• Sequential IO bandwidth
  – If access is not sequential for a large data set, we cannot do it
• How can we move 100 TB within a university?
  – 1 Gbps: 10 days
  – 10 Gbps: 1 day
  – 100 lb box: a few hours (but need to share the backbone)
• From outside?
  – Dedicated 10 Gbps or FedEx
Slide from Alex Szalay’s talk at XLDB4 workshop: www-conf.slac.stanford.edu/xldb10/
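The transfer times quoted above are plain bandwidth arithmetic; a quick sanity check (assuming decimal terabytes and a fully utilised link):

```python
def transfer_days(terabytes, gbps):
    """Days to move a dataset over a network link at full utilisation."""
    bits = terabytes * 1e12 * 8          # decimal TB -> bits
    return bits / (gbps * 1e9) / 86400   # seconds -> days

print(f"100 TB at  1 Gbps: {transfer_days(100, 1):.1f} days")   # ~9.3 days
print(f"100 TB at 10 Gbps: {transfer_days(100, 10):.1f} days")  # ~0.9 days
```

Real links rarely sustain full utilisation, so the slide's round figures of 10 days and 1 day are the optimistic floor; hence the half-serious appeal to a shipped box of disks.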
Popularity / Sales
• Power distribution
• 80:20 rule
• Netflix vs Blockbuster
[Long-tail chart: popularity/sales against products/results, showing a short head and a long tail]
Slide from Carole Goble’s talk at e-Science AHM 2010: www.allhands.org.uk/events/all-hands-
The Shape of Data-Intensive Answers
Scientific information continua allowed in a “new world”:
• between experimental data and publications (new paradigm)
• between different scientific disciplines (multidisciplinary)
• between past, present and future (preservation)
• between different institutions (organisation)
• between humans and computers (e-Infrastructure)
• between research and education (public mission)
[Image: Klein bottle with Moebius band; see “Imaging maths: Inside the Klein bottle” at http://plus.maths.org/issue26/index.html. The Klein bottle is a non-orientable surface found by Felix Klein in 1882 while working on a topological classification of surfaces.]
Slide from Yannis Ioannidis’s talk at GDRI2020 & GRL2020 Workshop, Stellenbosch, October 2010
More CONTINUA
• From well-resourced teams of experts to the long tail of small groups and individuals
• From small to large
• From new opportunity to well-established practices in a collaborating global community
• Across variations in computing, communication and storage technology
• Across variations in platforms and e-Infrastructure
Find a service & relax
Intellectual ramps
• Easy and low risk to start
• Progress to advanced skills
• For research data users
• No obligation
• Go as far as you want
How do we build RAMPS?
• Intellectual ramps
  – Fitting with existing practice
  – Embedded in existing tools
  – Incrementally gaining understanding
  – Not a dead end
• Technical ramps
  – Fitting with existing technology
Datascopes for the naked mind
To reveal evidence in data you could never see before
[Image: radio telescope; credit NRAO/AUI/NSF]
Changed our place in the universe
What FRAMEWORK FOR DATASCOPES?
• Query systems
• Map-reduce systems
• Workflow systems
• Batched-analysis data scans
• Data-streaming systems
MAKING Your DATASCOPE
• Specialised by choosing data sources
• Specialised by selection and transformation of source data
• Specialised by choice of rules for combining data
• Specialised by choice of data aggregations
• Specialised by how results are presented
• Specialised by which results are preserved
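Those six specialisation points can be read as the plug-in slots of a pipeline. A hypothetical sketch (none of these names are an existing API; each slot is simply a researcher-supplied function):

```python
def datascope(sources, select, transform, combine, aggregate, present,
              preserve=None):
    """Compose the specialisation points of a datascope into one pipeline:
    choose sources, select and transform records, combine the streams,
    aggregate, present, and optionally preserve the result."""
    streams = [[transform(r) for r in src if select(r)] for src in sources]
    result = aggregate(combine(streams))
    if preserve is not None:
        preserve(result)
    return present(result)

# Illustrative use: average the positive readings from two (fake) feeds.
out = datascope(
    sources=[[3, -1, 4], [1, 5, -9]],
    select=lambda r: r > 0,
    transform=float,
    combine=lambda streams: [x for s in streams for x in s],
    aggregate=lambda xs: sum(xs) / len(xs),
    present=lambda v: f"mean = {v:.2f}",
)
print(out)  # mean = 3.25
```

The point of the sketch is that each slot is independent: a researcher who only understands the science fills in `select` and `aggregate`, while the framework owns data movement and execution.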
Algorithms and Code
• All of those specialisations require algorithms
• Each algorithm is written and handled as code
• They capture generic and domain-specific knowledge
• They are often hand-optimised
• Maintaining different versions for each framework and platform is unsustainable
Datascope and Ramp
Images from Roger Barga’s talk at AHM 2010 www.allhands.org.uk/events/all-hands-meeting-
Gray’s Laws of Data Engineering
Jim Gray:
• Scientific computing is revolving around data
• Need scale-out solution for analysis
• Take the analysis to the data!
• Start with “20 queries”
• Go from “working to working”
DISC: Data Intensive Scalable Computing
Slide from Alex Szalay’s talk at DIR workshop wiki.esi.ac.uk/Data-Intensive_Research
Cyberbricks/Amdahl Blades
• Scale down the CPUs to the disks!
  – Solid State Disks (SSDs)
  – 1 low-power CPU per SSD
• Current SSD parameters
  – OCZ Vertex 120GB, 250MB/s read, 10,000+ IOPS, $350
  – Power consumption 0.2W idle, 1–2W under load
• Low-power motherboards
  – Intel dual Atom N330 + NVIDIA ION chipset, 28W at 1.6GHz
• Combination achieves perfect Amdahl blade
  – 200MB/s = 1.6Gbit/s, matching 1.6GHz of Atom
Slide from Alex Szalay’s talk at Microsoft e-Science workshop: research.microsoft.com/en-us/events/escience2010/
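The "perfect Amdahl blade" refers to Amdahl's balanced-system rule of thumb: roughly one bit of sequential I/O per second for every cycle per second of CPU. Checking the slide's figures:

```python
def amdahl_ratio(io_mb_per_s, cpu_ghz):
    """Bits of sequential I/O per CPU cycle; ~1 means an Amdahl-balanced node."""
    bits_per_s = io_mb_per_s * 1e6 * 8
    cycles_per_s = cpu_ghz * 1e9
    return bits_per_s / cycles_per_s

print(f"{amdahl_ratio(200, 1.6):.2f}")  # 200 MB/s vs 1.6 GHz Atom -> 1.00
print(f"{amdahl_ratio(250, 1.6):.2f}")  # the Vertex's peak read rate -> 1.25
```

At 200 MB/s the SSD feeds the 1.6 GHz Atom exactly one bit per cycle, which is why the pairing is called a perfect Amdahl blade; a conventional server with the same disk and a much faster CPU would sit far below 1, i.e. CPU-heavy and IO-poor.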
Questions a priori
• How can we enable researchers who understand their field, the data and the methods to specialise, tune and control their datascope?
• How can we enable researchers who understand their field or an analytic technique to capture that as an algorithm just once?
• How can we optimise a datascope taking account of the data, the computational environment and the user-defined algorithms?
Our Question
How can we help?