I. o o II. o o o III. o o o New Insights through Big Data 4 New World of Big Data & DW – Yet another ‘Hype’? BI and analytics Data warehouse ETL … data warehousing has reached the most significant tipping point since its inception. The biggest, possibly most elaborate data management system in IT is changing. – Gartner, “The State of Data Warehousing in 2012” Data sources 5 Impact on traditional DW BI and analytics 3 Data warehouse Near real-time insights ETL 1 Increasing data volumes Data sources 2 New data sources and types Non-relational data 4 Cloud-born data 6 The modern Data Warehouse Data sources Non-relational data When to use what? Gartner Survey April 2014 10% 4% 5% 15% 57% Piloting on premise Piloting in the cloud Production on-premise with cluster Production on premise with appliance 14% 2014 – think Hadoop will replace existing DW solution (2013: 10%) ↓ 50% decline even with the presence of Hadoop 2.0 60% 50% → primarily includes SQL over HDFS (!) 40% 30% 20% 10% 0% Interactive Analytics DBMS Stream Processing Search Graph applications Challenges for a modern data warehouse Keep existing & legacy investment Limited scalability and ability to handle new data types Acquire Big Data solution Buy new tier-one hardware appliance Acquire business intelligence & tools Significant training and data silos High acquisition and migration costs Complex with low adoption Microsoft’s Analytics Platform Systems (APS) 12 Introducing the Microsoft Analytics Platform System The turnkey modern data warehouse appliance Enterprise-ready Big Data Next-generation performance at scale Engineered for optimal value • Relational and non-relational data in a single appliance • Near real-time performance with In-Memory Columnstore • Enterprise-ready Hadoop aka HDInsight Region (HDI) • Ability to scale out to accommodate growing data • Industry’s lowest data warehouse appliance price per terabyte • Integrated querying across HDI and PDW via PolyBase • Removal of data warehouse bottlenecks with MPP SQL Server • Integration with Microsoft BI/Excel & 3rd party BI tools • Concurrency that fuels rapid adoption • Value through a single appliance solution • Value with flexible hardware options using commodity hardware Core APS Features MPP and In-Memory Columnstore for best-in-class performance Columnstore index representation 100x Up to faster queries C1 C2 C3 C4 C5 C6 15x Up to more compression Updateable clustered columnstore vs. table with customary indexing Parallel query execution • Query • Results • Store data in columnar format for massive compression Load data into or out of memory for next-generation performance with up to 60% improvement in data loading speed Updateable and clustered for real-time trickle loading Core APS Features HDInsight Region What’s HDInsight in the appliance? o Hadoop bits run on same HW as PDW region o Separate appliance region – interconnected via Infiniband to PDW region o Hortonworks HDP 2.0 bits on Windows Server – including YARN execution framework o PolyBase certified with HDInsight on-premise 15 Core APS Features Your Apps PolyBase for the (Big) Data Continuum 1. Querying all of data sources through APS/PDW o allowing to query relational & nonrelational stores (on-premise) with single query interface o On-premise data sources include not only Hadoop, but also RDBMSs or document-oriented stores* cloud service HDInsight Azure SQL Database (Hadoop) Azure Storage 2. Microsoft Azure 1. On-premise Your Apps IBM DB2 2. APS/PDW as cloud-attach for Azure o allowing to scale storage & leverage cloud compute (SQL DB, Azure HDI, Azure DocDB, Azure ML, Azure Search) o Trickle effect (data migration from on-premise to Azure) Oracle Hadoop APS with PolyBase SQL Server Split-based query processing Supporting various Hadoop distributions Best-In-Class Performance Openness & Flexibility Integration with Microsoft Azure Hybrid Mature T-SQL language surface Seamless with Microsoft BI & Third-Party BI tools Preserving Existing skill T-SQL Semantics sets & tools 17 18 PolyBase – Building Blocks & Functionality 19 Data Scientists, BI Users, DB Admins Your Apps Microsoft APS APS control & data nodes w/ PolyBase PowerPivot PowerView External Table Hadoop External External File Data Source Format Relational DW PolyBase Split-Based Query Processing Web Apps Social Apps Mobile Apps Sensor &RFID Key principle - Leverage computation power of external Hadoop clusters Microsoft Azure o Transparent pushing of computation into Hadoop on-the-fly o No map/reduce or Hadoop skills needed o Reducing data volume to be transferred Indefinite Azure Storage Generalizing key principle in future releases o Apply for all external data sources o Azure for indefinite storage + as appliance extension by making use of Azure compute o Leveraging compute of other data sources relational data sources or NoSQL platforms Azure Express Route On-premise Your Apps Extensible approach o Ideas to plug-in other push-down engines in future o Further performance improvements in near future Hadoop APS with PolyBase PolyBase Query Use Cases Your Apps PolyBase query scenarios (SELECT) 1. Run T-SQL over HDFS/Hadoop data 2. Combine data from different Hadoop clusters (e.g. production & dev/test) via PolyBase 3. Join relational data with non-relational data in HDFS/Hadoop T-SQL Customer Value o Ease-of-use & Improved Time-To-Insights ― Build the data lake w/o heavily investing into new resources, i.e. Java & map/reduce experts ― Leverage familiar & mature T-SQL scripts and constructs ― Seamless tool integration w/ PolyBase Hadoop Microsoft APS w/ PolyBase Hadoop Dev/Test PolyBase Import & Export Use Cases PolyBase ETL scenarios (CTAS/CETAS) Your Apps o Storing subset of Hadoop in PDW for frequent access, i.e. in SQL Server’s columnar format (→ updatable CCI) o Enabling data aging scenarios to more economic storage (→ ‘data lake’) Customer Value T-SQL ― Avoids the need of maintaining a separate import or export utility ― Allows building multi-temperature DW platforms o PDW/APS acts as hot query engine processing most recent/relevant data sets o Aged data immediately accessible via external tables o No need for deleting any data anymore Hadoop Microsoft APS w/ PolyBase PolyBase for Hybrid Use Cases (Azure Integration) APS as Azure-attach o Azure as cheap storage for keeping your data forever o Mesh-up on-premise and cloud apps on your own terms o PolyBase as bridge between on-premise and Azure Your Apps cloud service HDInsight Azure SQL Database (Hadoop) Customer Value Azure Storage o Indefinite storage and compute ― Azure as extension for your on-premise data assets o Cloud transition on your own terms ― Move only subsets of on-premise data, e.g. non-sensitive data o Leverage new Azure data services ― Reduced capex & availability of new emerging data services in Azure for on-premise focused users T-SQL Your Apps APS w/ PolyBase PolyBase for Round-Trip Scenarios Your Apps PolyBase RoundTrip Scenario (CETAS) o Processing non-relational data with relational data and exporting it to your Big Data platform of choice cloud service HDInsight Azure SQL Database (Hadoop) Customer Value ― Total freedom & flexibility o Azure Storage User can decide where to store the results of querying different data sets ― Simplicity o Example: 1 T-SQL statement for a) querying different Hadoop data sets, b) combining with relational data and c) storing results in Azure or in a different Hadoop cluster T-SQL Your Apps APS w/ PolyBase Hadoop Dev/Test 26 Real-world customer production use cases with PolyBase 27 ‘We love Open-Source, but we love SQL Server Analysis Services (even more)’ o SQL Server SMP & Hadoop cluster Pain Points with current solution 1. ‘We are currently using a mix of SQOOP & BCP.’ 2. The integration with SQL Server should be better, simpler & as fast as possible’ !s 3. ‘Take a definition of an aggregate in Hive and map it through SQL DDL’! 4. ‘Loading data from Hadoop to SQL must be easier’. 5. ‘Hadoop is hard to deploy and maintain'. insights ‘Pay-as-you-drive’ - Driver Discount & Policy adjustment o SQL Server 2012 PDW & HDP 2.0/2.1 on Linux What they want to accomplish 1. ‘We have increasing set of non-relational data – sensor data & web data which we want to store in Hadoop.’ 2. ‘We want to get answers to our complex questions fast’. 3. ‘We don’t want to learn a new tool or programming language or paradigm to get the answers’. 4. ‘We know how to use Microsoft tools and services’. 5. ‘We don’t want to maintain a complex infrastructure – we trust Microsoft & SQL Server to help us’. Investing into Online Shopping website (‘Korea’s Amazon’) o SQL Server 2012 PDW & HDP 1.3/HDP 2.0 on Linux What they want to accomplish 1. ‘We want perform complex data mining on customer purchase data – basket analysis’. 2. ‘We want to understand the social media data (reviews/Twitter) – specifically around our products & stores’. 3. ‘We will use Hadoop to keep all of our data ~ envisioned to be around 480 TB. PDW will be the efficient analysis engine for the hot & more recent data’. 4. ‘PDW & PolyBase are much faster than Hive’. 5. ‘We’re interested in using data mining cloud services in Azure (hybrid scenarios)’ Solution Architecture (Details) – ShinSeGae Web log data(160GB/daily) – 1. External Tables A, B, C 2. Unstructured/semi-structured text data - External Tables D, E, F Text (Board/SNS/ Internal Text ) 3. Complex Event Processing (Storm) SSG.com (renewal) Message Queues (KAFKA, Open source) Weather.. Recommendation engine & personalized advertising Tracking Log Servers PolyBase Queries Company emails – External Tables G, H, I Mails Online Shopping Mall raw/cold/ warm data HDP on Linux Campaign Analytic information (right customer targeting) 10 GB Ethernet APS/PDW Operational Data Store EDW Recent/hot data stored in PDW EIS OLAP (Tabular) DATA Mining Visualization (Silverlight) BI analyst What they want to accomplish o On-premise: 15-20 HDP Hadoop cluster + 1 Rack PDW o How they want using Azure: 1. Cost-optimal backup for Hadoop data [WASB] 2. Scale-out environment to spin-up more compute on demand at busy times 3. Platform to leverage new data services, in particular Azure HDI and Azure Machine Learning o Daily - 50 GB compressed data into Azure o Your Apps cloud service HDInsight Azure SQL Database (Hadoop) Azure Storage Either planning to use custom API or PolyBase CETAS o Planning to use PolyBase to combine data from on-premise and Azure Your Apps T-SQL APS with scale-out PDW region ‘Understanding network quality’ o SQL Server 2012 PDW & Hadoop on Linux What they want to accomplish 1. ‘We collect millions of network records for quality assessment and capacity planning – on a daily basis’. 2. ‘Hadoop will be used for storage and ETL of these network record files’. 3. ‘PDW for more operational analysis, ad-hoc analysis, operational reports’. 4. ‘We are using Polybase along with Oozie-based orchestration for a seamless & automated integration’. Solution Architecture (Details) – Telcom Capturing Network logs (>300 GB/per day) – External Tables A, B, C High-frequency Event Processing (Network logs) Usage of Hive’s Metadata stores HCatalog Hadoop cluster (18+ servers) PolyBase Queries APS/PDW Operational Data Store Infiniband raw/cold data (Petabyte of network logs) EDW Hot operational PDW data Oozie Workflows Remote procedure calls via stored procedures to trigger PolyBase queries Network quality analysis Capacity Planning Visualization (PowerPivot) BI analyst/Planner/ Decision-maker ‘Analyzing oil drilling rigs’ o SQL Server 2012 PDW & external Hadoop What they want to accomplish 1. ‘Each drilling rig is equipped with sensor devices and we want to store the incoming sensor data into Hadoop’. 2. ‘We want to use PolyBase to access these sensor data’. 3. ‘We want to monitor our rigs in near real-time’. 4. ‘We want to conduct predictive analysis – which rigs are running poorly, which rigs show similar symptoms’. ‘Bridging the gap between cloud & on-prem’ o SQL Server 2012 PDW & Azure HDInsight What they want to accomplish 1. ‘HDInsight/Hadoop in the cloud to store and massage our raw data (XML files) generated by our web-application’. 2. ‘PDW to keep the data on-prem (legal requirement) and to have an efficient query engine for analysis purposes’. 3. ‘PolyBase is a great way of accessing our files in the cloud via simple T-SQL.’. 4. ‘With this solution, we can allow web users to quickly ask questions while the heavy, more complex business analysis is accomplished by PDW users’. Solution Architecture (Details) – Government HDI tools for data transformation Web apps- Generating tons of smaller XML files (~7KB each) Web Application for Tax Filing (e-invoice) Other Web Feeds Transforming to large text files ~ 10 GBs each (External Tables) PolyBase Queries Operational Data Store HDI on Azure Azure Blob Storage APS/PDW Public Internet or Azure Express Route cheap data store – alternative to Hadoop on-prem solution EDW PDW/APS for fast query response & data processing of hot data Microsoft BI stack IBM Cognos ‘What are you drinking? Why is the machine down’? o SQL Server/APS with PDW & HDI region What they want to accomplish 1. ‘We want a complete solution stack – we do not have Hadoop experts in-house and don’t have the money to get it’. 2. ‘We want to store all raw data coming from vending machines into Hadoop’. 3. ‘360 degree of all our data – structured customer data & unstructured data coming from vending machines’. 4. ‘Predicate maintenance of machines’. ‘Data Management Services for Microsoft Teams’ o APS with PDW & HDInsight region What they want to accomplish o Offering data management services to other Microsoft groups o Cookie scoring for msn.com o 3TB of data for each analysis o Added PolyBase for Reporting & PowerQuery o Previously – o Analysis solely done via HiveQL o Results are writing back in SQL Server for reporting o PowerQuery via HiveODBC driver Solution Architecture (Details) – Internal Microsoft Data Scientist Data scientist group 1 - using chaining of Hive queries & PowerQuery via HiveODBC Hive & PowerQuery via Hive ODBC System Center & AdminConsole msn.com – Log files Microsoft servers – Log files Secure Gateway & AD Integration Analyzing ~3 TB Web Traffic Polybase Queries HDI region 1 scale unit HDI region Infiniband Data scientist group 2 - Using Polybase for existing tooling (T-SQL, BI tools) performing processing of complex analytical queries & consistent management experience PowerQuery/Power View/PowerMap PDW region Full Rack PDW APS with PDW & HDI region Analytical queries via SSDT 44 microsoft.com/sqlserver and Amazon Kindle Store microsoftvirtualacademy.com Azure Machine Learning, DocumentDB, and Stream Analytics http://channel9.msdn.com/Events/TechEd www.microsoft.com/learning http://microsoft.com/technet http://developer.microsoft.com