Data-Intensive Research Theme welcome
Malcolm Atkinson
mpa@nesc.ac.uk
3 November 2010
Data-Intensive Research Theme opening lecture

Welcome to the e-Science Institute

DIR Theme Goals
• Improve understanding of data-intensive research: its data and computational challenges
• Initiate computing science research to address key challenges, drawing on database knowledge and experience

Today's Programme
10:35 Opening theme talk (Malcolm Atkinson)
11:30 Short break
11:40 Shaping our questions
11:45 Your talks (Three Volunteers)
12:30 Lunch
13:30 Breakout groups: Models & Analysis
15:00 Plenary
15:30 Coffee break
16:00 Closing DPA theme (Shantenu Jha)
17:15 Joint DIR & DPA reception

Previous work
• Data-Intensive workshop at eSI
  • Report draft: bit.ly/cfMRn3
  • http://wikis.nesc.ac.uk/escienvoy/Data-Intensive_Research:_how_should_we_improve_our_ability_to_use_data
  • http://wiki.esi.ac.uk/Data-Intensive_Research
  • Twitter hash tag: #datares
• USA data-use report (Atkinson & De Roure)
  • Draft: bit.ly/c0G2rn

This DIR theme
• Twitter hash tag: #datares
• http://www.esi.ac.uk/research-themes/15
• http://wiki.esi.ac.uk/Data-Intensive_Research_Theme

Data-Intensive Research: Can database experience help?
Malcolm Atkinson, Paolo Bresana, Martin Kersten and Alex Szalay

Order of Service
• The data bonanza
• Data-intensive challenges
• Data-intensive constraints
• The shape of answers
• Our question

Definitions

What Is DATA?
• collections of data from instruments, observatories, surveys and simulations;
• results from previous research and earlier surveys;
• data from engineering and built-environment design, planning and production processes;
• data from diagnostic, laboratory, personal and mobile devices;
• streams of data from sensors in the built and natural environment;
• data from monitoring digital communications;
• data transferred during the transactions of business, administration, healthcare and government;
• digital material produced by news feeds, publishing, broadcasting and entertainment;
• documents in collections and held privately; the texts and multimedia 'images' in web pages, wikis, blogs, emails and tweets; and
• digitised representations of diverse collections of objects, e.g. museums' curated objects and books in literary collections.

What is Data-Intensive?
A problem is data-intensive when considerable care is needed over the use and handling of data in order to solve it.

Data-Intensive Research Events
1993 Oregon DI Systems
1996, 97 & 98 Bermuda agreement
1999 SDSS Archive DB
2001 Human Genome
2001 DI Comp. Environm's
2002 BaBar@SLAC
2003 Fort Lauderdale
2003 Hey & Trefethen D.Del.
2004 Digital Curation Cen.
2007 NSF DataNet call
2007 XLDB series starts
2008 SciDB starts
2008 Yahoo DI workshop
2009 Harnessing data
2009 Beyond data del.
2009 Gov's use Linked D.
2009 NSF CISE DI call
2009 Toronto Statement
2009 4th Paradigm book
2009 JISC Research DM
2009 e-IRG DMTF report
2010 DIR workshop, Edin.
2010 DIEW Japan
2010 DIDC workshop, HPDC

The Data Bonanza

Growth in data
• Faster, cheaper, more sensitive digital devices
• Ubiquitous digital devices
• Automated experimentation and observation
• More and larger simulations
• Ubiquitous connectivity
• Increasing bandwidth and storage capacity
Images from Mario Caccamo's talk at the DIR workshop: wiki.esi.ac.uk/Data-Intensive_Research

Growth in Data
• Business, administration and government
• Healthcare, engineering, planning, transport, communication, ...
• Entertainment, social interaction, games and logging
• Mandated data retention
• Personal data retention

Data IS EVERYWHERE
• It never will be in one place
• Almost all of it is in files
• There are a very large number of small data collections
• A small number of very large collections
• Most questions are best answered using multiple data sources
• Most questions are asked against single data collections

Data IS DIVERSE
• There are many islands of standardisation
• There are many agreed interchange formats
• There are many devices generating data in proprietary forms
• People continuously invent new representations
• Most data organisation grows serendipitously
• Investment in current practices cannot be ignored

Data-Intensive Challenges

Answering society's big questions
• How to feed everybody
• How to live with climate change
• How to run stable economies
• How to provide health and well-being to an ageing population
• How to deliver sustainable energy
• How to live peacefully and safely on planet Earth
• How to act most effectively in an emergency

Scientists' hard questions
• What happened at the start of the universe?
• How can we understand living organisms?
• How does our brain work?
• How do our planet's systems work?
• Is there a universal language?

Answering RESEARCHERS' HARD questions
• How to detect and verify subtle correlations
• How to characterise very infrequent phenomena
• How to understand very complex systems
• How to collaborate by sharing data
• How to recognise what data is needed
• How to decide what methods to use
• How best to help in an emergency

Scientific Data Analysis Today
• Scientific data is doubling every year, reaching PBs
• Data is everywhere, and never will be at a single location
• Need randomised, incremental algorithms
  – Best result in 1 min, 1 hour, 1 day, 1 week
• Architectures increasingly CPU-heavy, IO-poor
• Data-intensive scalable architectures needed
• Most scientific data analysis is done on small-to-midsize Beowulf clusters, from faculty startup funds
• Universities hitting the "power wall"
• Soon we cannot even store the incoming data stream
• Not scalable, not maintainable...

We have a data bonanza
We need a method bonanza

Data-Intensive Constraints

Many LIMITS TO GROWTH
• Cost of equipment for storage and computation
• Energy and operational costs
• Time and cost of data movement
• Limited supplies of skilled practitioners

Cost of a Petabyte
From backblaze.com, Aug 2009
Slide from Alex Szalay's talk at the XLDB4 workshop: www-conf.slac.stanford.edu/xldb10/

DISC Needs Today
• Disk space, disk space, disk space!!!!
• Current problems not on Google scale yet:
  – 10-30 TB easy, 100 TB doable, 300 TB really hard
  – For detailed analysis we need to park data for several months
• Sequential IO bandwidth
  – If not sequential for a large data set, we cannot do it
• How do we move 100 TB within a university?
  – 1 Gbps: 10 days
  – 10 Gbps: 1 day
  – 100 lbs box: a few hours (but need to share the backbone)
• From outside?
  – Dedicated 10 Gbps or FedEx
Slide from Alex Szalay's talk at the XLDB4 workshop: www-conf.slac.stanford.edu/xldb10/

Popularity / Sales
• Power distribution
• 80:20 rule
• Netflix vs Blockbuster
[Long-tail chart: popularity/sales against products/results, with a short head and a long tail]
Slide from Carole Goble's talk at e-Science AHM 2010: www.allhands.org.uk/events/all-hands-

The Shape of Data-Intensive Answers

Scientific information continua allowed in a "new world":
• between experimental data and publications (new paradigm)
• between different scientific disciplines (multidisciplinary)
• between past, present and future (preservation)
• between different institutions (organisation)
• between humans and computers (e-Infrastructure)
• between research and education (public mission)

Reference: Klein Bottle with Moebius Band.
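As an aside, the data-movement times quoted in "DISC Needs Today" follow from simple arithmetic. A minimal sketch, using the raw link rates from the slide and ignoring protocol overhead and contention:

```python
# Rough time to move 100 TB at the raw link rates quoted in
# "DISC Needs Today"; protocol overhead and contention ignored.

def transfer_days(data_bytes: float, link_bits_per_s: float) -> float:
    """Transfer time in days at a given raw link rate."""
    seconds = data_bytes * 8 / link_bits_per_s
    return seconds / 86400  # seconds per day

data = 100e12  # 100 TB

print(f"1 Gbps : {transfer_days(data, 1e9):.1f} days")   # 9.3 days
print(f"10 Gbps: {transfer_days(data, 10e9):.1f} days")  # 0.9 days
```

The raw figures round to the slide's "10 days" and "1 day"; a 100 lbs box of disks driven across town wins comfortably, which is why the slide ends with "dedicated 10 Gbps or FedEx".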
Reference to the article "Imaging maths: Inside the Klein bottle" at http://plus.maths.org/issue26/index.html. The Klein bottle is a non-orientable surface found by Felix Klein in 1882 while working on a topological classification of surfaces.
Slide from Yannis Ioannidis's talk at the GDRI2020 & GRL2020 Workshop, Stellenbosch, October 2010

More CONTINUA
• From well-resourced teams of experts to the long tail of small groups and individuals
• From small to large
• From new opportunity to well-established practices in a collaborating global community
• Across variations in computing, communication and storage technology
• Across variations in platforms and e-Infrastructure

Find a service & relax

Intellectual ramps
• Easy and low risk to start
• Progress to advanced skills
• For research data users
• No obligation
• Go as far as you want

How do we build RAMPS?
• Intellectual ramps
  – Fitting with existing practice
  – Embedded in existing tools
  – Incrementally gaining understanding
  – Not a dead end
• Technical ramps
  – Fitting with existing technology

Datascopes for the naked mind
To reveal evidence in data you could never see before
[Image: NRAO/AUI/NSF]
Changed our place in the universe

What FRAMEWORK FOR DATASCOPES?
• Query systems
• Map-reduce systems
• Workflow systems
• Batched-analysis data scans
• Data-streaming systems

MAKING Your DATASCOPE
• Specialised by choosing data sources
• Specialised by selection and transformation of source data
• Specialised by choice of rules for combining data
• Specialised by choice of data aggregations
• Specialised by how results are presented
• Specialised by which results are preserved

Algorithms and Code
• All of those specialisations require algorithms
• Each algorithm is written and handled as code
• They capture generic and domain-specific knowledge
• They are often hand-optimised
• Different versions for each framework and platform are unsustainable

Datascope and Ramp
Images from Roger Barga's talk at AHM 2010
www.allhands.org.uk/events/all-hands-meeting-

Gray's Laws of Data Engineering
Jim Gray:
• Scientific computing is revolving around data
• Need scale-out solutions for analysis
• Take the analysis to the data!
• Start with "20 queries"
• Go from "working to working"
DISC: Data-Intensive Scalable Computing
Slide from Alex Szalay's talk at the DIR workshop: wiki.esi.ac.uk/Data-Intensive_Research

Cyberbricks / Amdahl Blades
• Scale down the CPUs to the disks!
  – Solid State Disks (SSDs)
  – 1 low-power CPU per SSD
• Current SSD parameters
  – OCZ Vertex 120 GB, 250 MB/s read, 10,000+ IOPS, $350
  – Power consumption 0.2 W idle, 1-2 W under load
• Low-power motherboards
  – Intel dual Atom N330 + NVIDIA ION chipset: 28 W at 1.6 GHz
• Combination achieves a perfect Amdahl blade
  – 200 MB/s = 1.6 Gbit/s of IO for 1.6 GHz of Atom
Slide from Alex Szalay's talk at Microsoft e-Science: research.microsoft.com/en-us/events/escience2010/

Questions a priori
• How can we enable researchers who understand their field, the data and the methods to specialise, tune and control their datascope?
• How can we enable researchers who understand their field or an analytic technique to capture that as an algorithm just once?
• How can we optimise a datascope, taking account of the data, the computational environment and the user-defined algorithms?

Our Question
How can we help?
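As a closing footnote to the Cyberbricks / Amdahl Blades slide: the "perfect Amdahl blade" claim is that the blade meets Amdahl's balance rule of roughly one bit of sequential IO per CPU cycle. A minimal check, using only the figures from the slide:

```python
# Check the "perfect Amdahl blade" claim from the Cyberbricks slide:
# Amdahl's balance rule asks for ~1 bit of sequential IO per CPU cycle.

def amdahl_number(io_bytes_per_s: float, cpu_hz: float) -> float:
    """Bits of sequential IO delivered per CPU cycle."""
    return io_bytes_per_s * 8 / cpu_hz

ssd_read = 200e6    # 200 MB/s sustained read from the SSD
atom_clock = 1.6e9  # 1.6 GHz Atom

ratio = amdahl_number(ssd_read, atom_clock)
print(f"{ratio:.2f} bits of IO per cycle")  # 1.00: perfectly balanced
```

The ratio is exactly 1.0: 200 MB/s is 1.6 Gbit/s, matching the 1.6 GHz clock, which is the balance the slide calls perfect.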