XLDB ’09
Luke Lonergan
llonergan@greenplum.com

“Big” Numbers for GP Today
• 70K/day – query rate
• 6.5 PB – dataset size
• 100+ GB/s – analysis rate
• 3+ GB/s – net loading rate
• 100,000/s – transaction rate
• 56 TB/kW, 1.6 GB/s/kW – power rate
• 100s – number of data/compute nodes

Things I’ve Heard
• Tiered computing
  – Organizational / political / geographic boundaries require it
• Metadata computing for HEP
  – “10 TB sounds small, but it’s not easy”
• Processing for radio astronomy, HEP
  – Data-intensive computing
  – Requires an efficient pipeline from raw data to consumables

Thoughts
• A lot of plumbing! Moving data around, pipeline processing
  – The core engine should do this, so the plumbing isn’t rebuilt over and over
• Need for specialized access methods and storage classes
• “Computing in data” is key to success

GP Basic Features
• Access methods (illustrative SQL sketches follow in the appendix at the end of this deck)
  – Compression, column store, heap store, external tables, indexes (GiST, GIN, R-tree, bitmap, B-tree, …)
  – Network ingest / export directly into the parallel pipeline
  – Logical partitioning by range, list
• Parallel programming languages
  – SQL:2003 with analytics
  – MapReduce in Perl, Python, C, SQL, …
  – PL/R, PL/Python, PL/Perl, C, PL/pgSQL, SQL, …

Enterprise Data Clouds
• Elastic / adaptive infrastructure for data warehousing and analytics
  – IT Operations deploys pools of low-cost commodity infrastructure
    • Physical servers, virtual infrastructure, or an on-ramp to the public cloud
  – DBAs and analysts provision sandboxes and warehouses in minutes
    • Assemble the data they need (common, private, etc.) for agile analytics
[Diagram: DBAs, analysts, and IT Operations provision warehouses of various sizes (8–120 nodes, plus free capacity) from a shared infrastructure pool]

Use Case: Big Telco Data Mart Consolidation
Goals:
• Reduce maintenance and support costs from the proliferation of data mart platforms
• Reduce risks and exposure due to data in shadow IT systems
• Break down silo walls – provide a unified way to find and access all data
Approach:
• Embrace data – encourage ‘physical consolidation’ in advance of data model unification
• Provide a ‘self-serve’ model to bring shadow IT into the light
• Allow unified data access and pragmatic ‘logical’ data model unification, incrementally
[Diagram: many independent data marts and data sources consolidating onto a 100-node US-West cluster]

Use Case: Big Ad Network Project Sandboxes
Goals:
• Remove IT barriers to analyst productivity and value creation
• Dramatically reduce IT resource constraints and delays – i.e., realize ideas sooner
• Combine centralized EDW data with freshly discovered feeds and other useful sources
Approach:
• Self-serve creation of project warehouses in minutes – elastically expanded as needed
• Load new data feeds without requiring formal modeling
• Bring together any data within the EDC – even if globally distributed – and analyze it
[Diagram: a self-serve dashboard provisions an analyst’s new warehouse and private data feed across EDC clusters in US-West (200 nodes), US-East (100 nodes), Europe (100 nodes), and Asia (200 nodes)]

GP is Software – Develop Now
• Download at:
  – gpn.greenplum.com
  – Get the VMware image, or run it on OS X, Linux, or Solaris

Think Big. Think Fast.
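
Appendix: Illustrative SQL Sketches
These sketches are not from the talk; they are minimal examples of the access-method features listed under “GP Basic Features.” The table and column names (events, event_time, device_id, payload) are hypothetical, and the storage clauses follow Greenplum’s DDL. First, compression, column store, and logical range partitioning in one table definition:

```sql
-- Hedged sketch: an append-only, zlib-compressed column store,
-- hash-distributed across segments and range-partitioned by month.
-- Table and column names are hypothetical.
CREATE TABLE events (
    event_time timestamp,
    device_id  bigint,
    payload    text
)
WITH (appendonly=true, orientation=column, compresstype=zlib)
DISTRIBUTED BY (device_id)
PARTITION BY RANGE (event_time)
(
    START ('2009-01-01'::timestamp) INCLUSIVE
    END   ('2010-01-01'::timestamp) EXCLUSIVE
    EVERY (INTERVAL '1 month')
);
```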
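
“Network ingest directly into the parallel pipeline” maps to external tables served by gpfdist. A minimal sketch, assuming a gpfdist daemon running on a hypothetical host etl-host:8081 and serving CSV files; every segment then scans the files in parallel rather than funneling the load through a single master:

```sql
-- Hedged sketch: parallel ingest through a gpfdist-backed external table.
-- Host, port, and file paths are illustrative.
CREATE EXTERNAL TABLE events_staging (
    event_time timestamp,
    device_id  bigint,
    payload    text
)
LOCATION ('gpfdist://etl-host:8081/feeds/*.csv')
FORMAT 'CSV';

-- Segments pull rows in parallel, straight into the column store.
INSERT INTO events SELECT * FROM events_staging;
```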
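
“SQL:2003 with analytics” refers to windowed aggregates, which the engine evaluates in parallel across segments. A sketch of a running per-device event count over the same hypothetical table:

```sql
-- Hedged sketch: a SQL:2003 window function over the events table.
SELECT device_id,
       event_time,
       count(*) OVER (PARTITION BY device_id
                      ORDER BY event_time) AS running_events
FROM   events;
```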