SQLCAT: Big Data – All Abuzz About Hive

Cindy Gross, SQLCAT BI/Big Data PM, Microsoft
  http://blogs.msdn.com/cindygross
  @SQLCindy
  Cindy.Gross@microsoft.com
Ed Katibah, SQLCAT Spatial PM, Microsoft
  http://blogs.msdn.com/b/edkatibah/
  @Spatial_Ed
  Ed.Katibah@Microsoft.com

November 6-9, Seattle, WA

BIG AGENDA

  What’s the social sentiment for my brand or products?
  How do I better predict future outcomes?
  How do I optimize my fleet based on weather and traffic patterns?

  Massive Volumes
    Increases ad revenue by processing 3.5 billion events per day
    Processes 464 billion rows per quarter, with average query time under 10 secs.
  Cloud Connectivity
    Measures and ranks online user influence by processing 3 billion signals per day
    Connects across 15 social networks via the cloud for data and API access
  Real-Time Insight
    Uses sentiment analysis and web analytics for its internal cloud
    Improves operational decision making for IT managers and users

MANAGE ANY DATA, ANY SIZE, ANYWHERE
  VVVVROOM!

BIG DATA

BIG DATA REQUIRES AN END-TO-END APPROACH
  INSIGHT
    Self-Service
    Collaboration
    Corporate Apps
    Devices
  DATA ENRICHMENT
    Discover
    Combine
    Refine
  DATA MANAGEMENT
    Relational
    Non-relational
    Analytical
    Streaming

HADOOP ARCHITECTURE
  Distributed Processing (MapReduce)
  Distributed Storage (HDFS)

HIVE ARCHITECTURE
  Hive
  Hadoop

DEMO: Analyzing a Frankenstorm

Behind the Scenes

GET HDINSIGHT
  Sign up for Windows Azure HDInsight Service: http://HadoopOnAzure.com (Cloud CTP)
  Download Microsoft HDInsight Server: http://microsoft.com/bigdata (On-Prem CTP)

CREATE TABLE
    CREATE EXTERNAL TABLE censusP (
      State_FIPS int,
      County_FIPS int,
      Population bigint,
      Pop_Age_Over_69 bigint,
      Total_Households bigint,
      Median_Household_Income bigint,
      KeyID string)
    COMMENT 'US Census Data'
    PARTITIONED BY (Year string)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    STORED AS TEXTFILE;

    ALTER TABLE censusP ADD PARTITION (Year = '2010')
    LOCATION '/user/demo/census/2010';

INSIDE A HIVE TABLE
  DATA TYPES
  EXTERNAL / INTERNAL
  PARTITIONED BY | CLUSTERED BY | SKEWED BY
  Terminators
  ROW FORMAT DELIMITED | SERDE
  STORED AS
  FIELDS/COLLECTION ITEMS/MAP KEYS TERMINATED BY
  LOCATION

METADATA
  Metadata is stored in a MetaStore database such as
    Derby
    SQL Azure
    SQL Server
  View
    SHOW TABLES 'ce.*';
    DESCRIBE census;
    DESCRIBE census.population;
    DESCRIBE EXTENDED census;
    DESCRIBE FORMATTED census;
    SHOW FUNCTIONS "x.*";
    SHOW FORMATTED INDEXES ON census;

DATA TYPES
  Primitives
    Numbers: Int, SmallInt, TinyInt, BigInt, Float, Double
    Characters: String
    Special: Binary, Timestamp
  Collections
    STRUCT<City:String, State:String> | Struct ('Boise', 'Idaho')
    ARRAY <String> | Array ('Boise', 'Idaho')
    MAP <String, String> | Map ('City', 'Boise', 'State', 'Idaho')
    UNIONTYPE <BigInt, String, Float>
  Properties
    No fixed lengths
    NULL handling depends on SerDe

STORAGE – EXTERNAL AND INTERNAL
    CREATE EXTERNAL TABLE census(…)
    LOCATION '/user/demo/census';
    LOCATION 'hdfs:///user/demo/census';
    LOCATION 'asv://user/demo/census';
  Use EXTERNAL when
    Data is also used outside of Hive
    Data needs to remain even after a DROP TABLE
    You use a custom location such as ASV
    Hive should not own data and control settings, directories, etc.
    You are not creating the table based on an existing table (AS SELECT)
  And
    ASV = Azure Storage Vault (blob store)
    INTERNAL is NOT a keyword, just leave off EXTERNAL

STORAGE – PARTITION AND BUCKET
    CREATE EXTERNAL TABLE census (…)
    PARTITIONED BY (Year string)
    CLUSTERED BY (population) INTO 256 BUCKETS
  Partition
    Directory for each distinct combination of string partition values
    Partition key name cannot be defined in table itself
    Allows partition elimination
    Useful in range searches
    Can slow performance if partition is not referenced in query
  Buckets
    Split data based on hash of a column
    One HDFS file per bucket within partition sub-directory
    Performance may improve for aggregates and join queries
    Sampling

STORAGE – FILE FORMATS AND SERDES
    CREATE EXTERNAL TABLE census (…)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001'
    STORED AS TEXTFILE | RCFILE | SEQUENCEFILE | AVRO
  Format
    TEXTFILE is common, useful when data is shared and all alphanumeric
    Extensible storage formats via custom input, output formats
    Extensible on disk/in-memory representation via custom SerDes

CREATE INDEX
    CREATE INDEX census_population ON TABLE census (population)
    AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
    WITH DEFERRED REBUILD
    IN TABLE census_population_index;

    ALTER INDEX census_population ON census REBUILD;
  Key Points
    No keys
    Index data is another table
    Requires REBUILD to include new data
    SHOW FORMATTED INDEXES ON MyTable;
  Indexing May Help
    Avoid many small partitions
    GROUP BY

CREATE VIEW
    CREATE VIEW censusBigPop (state_fips, county_fips, population)
    AS SELECT state_fips, county_fips, population
    FROM census
    WHERE population > 500000
    ORDER BY population;
  Sample Code
    SELECT * FROM censusBigPop;
    DESCRIBE FORMATTED censusBigPop;
  Key Points
    Not materialized
    Can have ORDER BY or LIMIT

QUERY
    SELECT c.state_fips, c.county_fips, c.population
    FROM census c
    WHERE c.median_household_income > 100000
    GROUP BY c.state_fips, c.county_fips, c.population
    ORDER BY county_fips
    LIMIT 100;
  Key Points
    Minimal caching, statistics, or optimizer
    Generally reads entire data set for every query
  Performance
    The order of columns and tables can make a difference to performance
    Use partition elimination for range filtering

SORTING
  ORDER BY
    One reducer does final sort, can be a big bottleneck
  SORT BY
    Sorted only within each reducer, much faster
  DISTRIBUTE BY
    Determines how map data is distributed to reducers
  SORT BY + DISTRIBUTE BY = CLUSTER BY (when both use the same columns)
    Can mimic ORDER BY, better perf if even distribution

JOINS
  Supported Hive Join Types
    Equality
    OUTER - LEFT, RIGHT, FULL
    LEFT SEMI
  Not Supported
    Non-Equality
    IN/EXISTS subqueries (rewrite as LEFT SEMI JOIN)
  Characteristics
    Multiple MapReduce jobs unless same join columns in all tables
    Put largest table last in query to save memory
    Joins are done left to right in query order
    JOIN ON completely evaluated before WHERE starts

EXPLAIN
    EXPLAIN SELECT * FROM census;
    EXPLAIN SELECT * FROM census WHERE population > 100000;
    EXPLAIN EXTENDED SELECT * FROM census;
  Characteristics
    Does not execute the query
    Shows parsing
    Lists stages, temp files, dependencies, modes, output operators, etc.

    ABSTRACT SYNTAX TREE:
      (TOK_QUERY (TOK_FROM (TOK_TABREF (TOK_TABNAME census))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR TOK_ALLCOLREF))))
    STAGE DEPENDENCIES:
      Stage-0 is a root stage
    STAGE PLANS:
      Stage: Stage-0
        Fetch Operator
          limit: -1

CONFIGURE HIVE
  Configuration
    Hive default configuration: <install-dir>/conf/hive-default.xml
    Configuration variables: <install-dir>/conf/hive-site.xml
    Hive configuration directory: HIVE_CONF_DIR environment variable
    Log4j configuration: <install-dir>/conf/hive-log4j.properties
    Typical Log: c:\Hadoop\hive-0.9.0\logs\hive.log

WHY USE HIVE
  BUZZ!
  Cross-pollinate your existing SQL skills!
  Makes Hadoop cross-correlations, joins, filters easier
  Allows storage of intermediate results for faster/easier querying
  Batch based processing
  Individual queries still often slower than a relational database
  E2E insight may be much faster

BI ON BIG DATA
  Gain Insights
    Mash-up Hive + other data in Excel
    Hive data source to PowerPivot for in-memory analytics
    Power View on top of PowerPivot for spectacular visualizations leading to insights
    Securely share on SharePoint for collaboration, re-use, centralized data
  Microsoft on top of Hadoop / Hive includes
    PowerPivot
    Power View
    Analysis Services
    PDW
    StreamInsight
    SQL Server
    SQL Azure
    Excel

BIG DEAL

NEXT STEPS
  Get Involved
    Read a bit
      http://sqlblog.com/blogs/lara_rubbelke/archive/2012/09/10/big-data-learningresources.aspx
      Programming Hive Book
      http://blogs.msdn.com/cindygross
    Sign up: Windows Azure HDInsight Service http://HadoopOnAzure.com (Cloud CTP)
    Download Microsoft HDInsight Server http://microsoft.com/bigdata (On-Prem CTP)
    Think about how you can fit Big Data into your company data strategy
    Suggest uses, be prepared to combat misuses

BIG DATA REFERENCES
  Hadoop: The Definitive Guide by Tom White
  SQL Server Sqoop: http://bit.ly/rulsjX
  JavaScript: http://bit.ly/wdaTv6
  Twitter: https://twitter.com/#!/search/%23bigdata
  Hive: http://hive.apache.org
  Excel to Hadoop via Hive ODBC: http://tinyurl.com/7c4qjjj
  Hadoop On Azure Videos: http://tinyurl.com/6munnx2
  Klout: http://tinyurl.com/6qu9php
  Microsoft Big Data: http://microsoft.com/bigdata
  Denny Lee: http://dennyglee.com/category/bigdata/
  Carl Nolan: http://tinyurl.com/6wbfxy9
  Cindy Gross: http://tinyurl.com/SmallBitesBigData

MICROSOFT BIG DATA AT PASS SUMMIT
  BIA-305-A SQLCAT: Big Data – All Abuzz About Hive
    Wednesday 10:15am | Cindy Gross, Dipti Sangani, Ed Katibah
  BIA-204-M MAD About Data: Solve Problems and Develop a “Data Driven Mindset”
    Wednesday 10:15am | Darwin Schweitzer
  AD-300-M Bootstrapping Data Warehousing in Azure for Use with Hadoop
    Thursday 10:15am | Steve Howard, James Podgorski, Olivier Matrat, Rafael Fernandez
  BIA-306-M How Klout Changed the Landscape of Social Media with Hadoop and BI
    Thursday 1:30pm | Denny Lee, Dave Mariani
  AD-316-M Harnessing Big Data with Hadoop
    Friday 8am | Mike Flasko
  DBA-410-S Big Data Meets SQL Server
    Friday 9:45am | David DeWitt
  AD-315-M NoSQL and Big Data Programmability
    Friday 4:15pm | Michael Rys

DON’T MISS!
  Win prizes with new online evaluations
  Build experience with Hands On Labs (NEW: TCC 304)
  Attend David DeWitt’s spotlight session: Big Data Meets SQL Server, DBA-410-S, Room 6E, Friday, 9:45 AM
  Be SQL Server 2012 Certified with onsite testing
  Find hidden session announcements by following: Room 212-214 @sqlserver #sqlpass
  Visit the SQL Clinic and new “I MADE THAT!” Developer Chalk talks (NEW: 4C-3 & 4C-4)

PASS RESOURCES
  Free SQL Server and BI training
  Free 1-day Training Events
  Regional Event
  Local and Virtual User Groups
  Free Online Technical Training
  This is Community
  Learning Center

Thank you for attending this session and the 2012 PASS Summit in Seattle
Please fill out evaluations!

SQLCAT: Big Data – All Abuzz About Hive

Cindy Gross, SQLCAT BI/Big Data PM, Microsoft
  http://blogs.msdn.com/cindygross
  @SQLCindy
  Cindy.Gross@microsoft.com
Dipti Sangani, SQL Big Data PM, Microsoft
  Dipti.Sangani@microsoft.com
Ed Katibah, SQLCAT Spatial PM, Microsoft
  http://blogs.msdn.com/b/edkatibah/
  @Spatial_Ed
  Ed.Katibah@Microsoft.com
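
As a follow-up to the JOINS and SORTING slides, here is a minimal HiveQL sketch of two rewrites the deck mentions: an IN subquery expressed as a LEFT SEMI JOIN, and ORDER BY approximated with DISTRIBUTE BY plus SORT BY. The census table and its columns come from the slides. The big_states table is hypothetical and exists only for illustration, so treat this as a sketch under those assumptions rather than the presenters' demo code.

```sql
-- The JOINS slide lists IN/EXISTS subqueries as not supported, so a
-- query of this shape would need to be rewritten:
--   SELECT c.state_fips, c.county_fips, c.population
--   FROM census c
--   WHERE c.state_fips IN (SELECT s.state_fips FROM big_states s);

-- Rewrite as a LEFT SEMI JOIN. big_states is a hypothetical table with
-- a state_fips column. Columns from the right-hand table may be
-- referenced only in the ON clause, not in SELECT or WHERE.
SELECT c.state_fips, c.county_fips, c.population
FROM census c
LEFT SEMI JOIN big_states s ON (c.state_fips = s.state_fips);

-- ORDER BY sends every row through a single reducer for the final sort.
-- DISTRIBUTE BY routes all rows for one state to the same reducer, and
-- SORT BY orders rows within each reducer, so the result is sorted per
-- reducer rather than globally.
SELECT state_fips, county_fips, population
FROM census
DISTRIBUTE BY state_fips
SORT BY state_fips, population DESC;
```

The second query only stands in for ORDER BY when per-reducer ordering is enough, for example when each state's rows are consumed together. How much it helps depends on how evenly state_fips spreads rows across reducers.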