MVPS13 PowerPoint Template 16x9

advertisement
Cindy Gross
CAT PM
SQL, BI, Big Data
HDInsight: Jiving about Hadoop
and Hive with CAT
http://blogs.msdn.com/cindygross/
@SQLCindy
Cindy.Gross@microsoft.com
Why are we
here?
Objectives
 Quick Overview: Big Data,
Hadoop, HDInsight, Open
Source
 What Hive is
 Why Hive for Hadoop?
 Why Hive for SQL Pros?
 How Hive fits into
Hadoop/HDInsight
 Hive is better together with
SQL, AS, BI
Key Takeaways
 How Hive fits
 Hive DDL and DML
 Formats, Structure
 Storage options
Big Data
What’s the social
sentiment for my
brand or
products
How do I better
predict future
outcomes?
How do I optimize
my fleet based on
weather and traffic
patterns?
Increases ad revenue by processing 3.5
billion events per day
Measures and ranks online user influence
by processing 3 billion signals per day
Uses sentiment analysis and web analytics
for its internal cloud
Massive Volumes
Cloud Connectivity
Real-Time Insight
Processes 464 billion rows per quarter, with
average query time under 10 secs.
Connects across 15 social networks via the
cloud for data and API access
Improves operational decision making for
IT managers and users
Big Data Technologies
SQL
Server and PDW
Office, Analysis
Services
Analysis
Services
HDInsight
Office, Analysis
Services
Cloud (Azure) Flexibility + On-Premises Option
How it fits together
Apache Hadoop
Hortonworks
HDInsight
Open Source Community
Microsoft Partner
We Consume Code
We Contribute Code
Heavy Contributors to
Open Source Hadoop
HDInsight Service,
HDInsight Server Built
on Hortonworks
Platform
Core Code Same Across
Distributions
Trusted in Open Source
Community
Additional
Functionality
Stats
processing
(RHadoop)
Scripting
(Pig)
Query
(Hive)
Event Pipeline
(Flume)
Distributed Processing
(MapReduce)
Distributed Storage
(HDFS)
Machine
Learning
(Mahout)
Data Integration
( ODBC / SQOOP/ REST)
Metadata
(HCatalog)
NoSQL Database
(HBase)
Pipeline / Workflow
(Oozie)
Graph
(Pegasus)
Legend
Red = Core
Hadoop
Blue = Data
processing
Purple = Microsoft
integration points
and value adds
Yellow = Data
Movement
Green = Packages
White = Coming
Soon
Hive Architecture
Hive
Hadoop
Why Hive for
Hadoop?
Enables BI tools via ODBC, structure
Structure without full relational modeling
Familiar HiveQL - skillset reuse
Simplifies Hadoop data access
Microsoft Confidential
Hive Characteristics
Batch oriented
Data Warehouse focused
Entire data sets (table scans)
Generates/runs MapReduce (not faster than MR!)
Limited indexing, no stats, no cache
Programmer is the optimizer
Append only (mostly)
Microsoft Confidential
Why Hive for SQL
Pros?
Someone in your org will be doing it, why not you?
Fit projects to appropriate tech
Adds to, complements SQL, AS, BI
New opportunities for biz and for you
Explore, archive, prototype, pre-aggregate,
refine algorithms, some self-service
Microsoft Confidential
SQL/AS still needed
for….
Updates, OLTP, ACID
Subsets, indexes/aggs, built-in optimizer, caching
Apps, data, structure, infrastructure already exists
Each query has to be fast
You know what you need to know
Where it makes sense
Microsoft Confidential
Create Table
Not Partitioned
CREATE EXTERNAL TABLE baconUnPart (type string COMMENT 'type of bacon')
COMMENT 'SQL Bacon!'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/user/demo/bacon';
Partitioned
CREATE EXTERNAL TABLE baconPart (type string COMMENT 'type of bacon strips')
COMMENT 'SQL Bacon strips'
PARTITIONED BY (year string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
ALTER TABLE baconPart ADD PARTITION (Year = ‘1’)
LOCATION '/user/demo/bacon1';
ALTER TABLE baconPart ADD PARTITION (Year = ‘2’)
LOCATION '/user/demo/bacon2';
Inside a Hive
Table
DATA TYPES
EXTERNAL / INTERNAL
PARTITIONED BY | CLUSTERED BY | SKEWED BY
Terminators
ROW FORMAT DELIMITED | SERDE
STORED AS
Fields/Collection Items/Map Keys
TERMINATED BY
LOCATION
MetaData
Metadata is stored in a MetaStore database such as
Derby
SQL Azure
SQL Server
See Schema
SHOW TABLES 'ba.*';
DESCRIBE baconunpart;
DESCRIBE baconunpat.type;
DESCRIBE EXTENDED baconunpart;
DESCRIBE FORMATTED baconunpart;
SHOW FUNCTIONS "x.*";
SHOW FORMATTED INDEXES ON baconunpart;
Data Types
Primitives
Numbers: Int, SmallInt, TinyInt, BigInt, Float, Double
Characters: String
Special: Binary, Timestamp
Collections
STRUCT<City:String, State:String> | Struct (‘Boise’, ‘Idaho’)
ARRAY <String> | Array (‘Boise’, ‘Idaho’)
MAP <String, String> | Map (‘City’, ‘Boise’, ‘State’, ‘Idaho’)
UNIONTYPE <BigInt, String, Float>
Properties
No fixed string lengths
NULL handling depends on SerDe
Storage –
External and
Internal
CREATE EXTERNAL TABLE baconUnPart(…)
LOCATION '/user/demo/bacon';
LOCATION ‘hdfs:///user/demo/bacon';
LOCATION ‘asv://user/demo/bacon';
Use EXTERNAL when
Data also used outside of Hive
Data needs to remain even after a DROP TABLE
Use custom location such as ASV
Hive should not own data and control settings, dirs, etc.
Use INTERNAL when
You want Hive to manage the data and storage
Short term usage
Creating table based on existing table (AS SELECT)
Storage –
Partition and
Bucket
CREATE EXTERNAL TABLE baconPart (…)
PARTIONED BY (Year string)
CLUSTERED BY (type) into 256 BUCKETS
Partition
Directory for each distinct combo of string partition values
Partition key name cannot be defined in table itself
Allows partition elimination
Useful in range searches
Can slow performance if partition is not referenced in query
Buckets
Split data based on hash of a column
One HDFS file per bucket within partition sub-directory
Performance may improve for aggregates and join queries
Sampling
set hive.enforce.bucketing = true;
Storage – File
Formats
CREATE EXTERNAL TABLE baconPart (…)
ROW FORMAT DELIMITED
FIELDS TERMINATED by ‘\001‘
STORED AS TEXTFILE, RCFILE, SEQUENCEFILE, AVRO
Format
Generally remove headers before loading files
TEXTFILE is common, useful when data is shared and alphanumeric
Extensible storage formats via custom input, output formats
Extensible on disk/in-memory representation via custom SerDes
Storage –
SerDes
CREATE EXTERNAL TABLE CustomSerDeUsage(…)
ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
LOCATION ….
SerDes
Create your own Java Serialization/Deserialization
Includes parse input/output, optimization
Usually overrides CREATE TABLE DDL
Common SerDes: CSV, XML, JSON, Custom
Library: org.apache.hadoop.hive.serde2
Storage –
HDFS and
ASV
ASV://containername@account.blob.core.windows.net/user/demo/
HDFS:///user/demo/
Storage Format
HDFS is Hadoop distributed file system
ASV is Azure Storage Vault using an API on top of HDFS
ASV allows reuse across clusters and with other apps
ASV data quickly available to new HDInsight clusters
CREATE
INDEX
CREATE INDEX baconPart_idx
ON TABLE baconPart (type)
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
WITH DEFERRED REBUILD
IN TABLE baconPart_index;
ALTER INDEX baconPart_idx ON baconPart REBUILD;
Key Points
No keys
Index data is another table
Requires REBUILD to include new data
SHOW FORMATTED INDEXES on MyTable;
Indexing May Help
Avoid many small partitions
GROUP BY
Create View
CREATE VIEW baconOneYear (type)
AS SELECT type
FROM baconPart
WHERE year = 1
ORDER BY type;
Sample Code
SELECT * FROM baconOneYear;
DESCRIBE FORMATTED baconOneYear;
Key Points
Not materialized
Can have ORDER BY or LIMIT
Query
SELECT c.state_fips, c.county_fips, c.population
FROM census c
WHERE c.median_household_income > 100000
GROUP BY c.state_fips, c.county_fips, c.population
ORDER BY county_fips
LIMIT 100;
Key Points
Minimal caching, statistics, or optimizer
Generally reads entire data set for every query
Performance
The order of columns, tables can make a difference to performance
Use partition elimination for range filtering
Sorting
ORDER BY
One reducer does final sort, can be a big bottleneck
SORT BY
Sorted only within each reducer, much faster
DISTRIBUTE BY
Determines how map data is distributed to reducers
SORT BY + DISTRIBUTE BY = CLUSTER BY
Can mimic ORDER BY, better perf if even distribution
Joins
Supported Hive Join Types
Equality
OUTER - LEFT, RIGHT, FULL
LEFT SEMI
Not Supported
Non-Equality
IN/EXISTS subqueries (rewrite as LEFT SEMI JOIN)
Characteristics
Multiple MapReduce jobs unless same join columns in all tables
Put largest table last in query to save memory
Joins are done left to right in query order
JOIN ON completely evaluated before WHERE starts
EXPLAIN
EXPLAIN SELECT * FROM baconPart;
EXPLAIN SELECT * FROM baconPart WHERE year > 1;
EXPLAIN EXTENDED SELECT * FROM baconPart;
Characteristics
Does not execute the query
Shows parsing
Lists stages, temp files, dependencies, modes, output operators, etc.
ABSTRACT SYNTAX TREE:
(TOK_QUERY (TOK_FROM (TOK_TABREF (TOK_TABNAME baconPart))) (TOK_INSERT (TOK_DESTINATION
(TOK_DIR TOK_TMP_FIL
E)) (TOK_SELECT (TOK_SELEXPR TOK_ALLCOLREF))))
STAGE DEPENDENCIES:
Stage-0 is a root stage
STAGE PLANS:
Stage: Stage-0
Fetch Operator
limit: -1
Configure
Hive
Configuration
cd %hive_home%\bin
<install-dir> currently: C:\Hadoop\hadoop-1.1.0-SNAPSHOT
Hive default configuration <install-dir>/conf/hive-default.xml
Configuration variables <install-dir>/conf/hive-site.xml
Hive configuration directory HIVE_CONF_DIR environment variable
Log4j configuration <install-dir>/conf/hive-log4j.properties
Typical Log: c:\Hadoop\hive-0.9.0\logs\hive.log
Why Use
Hive
BUZZ!
BI on Big Data
Cross-pollinate your existing SQL skills!
Makes Hadoop cross-correlations, joins, filters easier
Allows storage of intermediate results for faster/easier querying
Batch based processing
E2E insight may be much faster
Get the right projects on the right technologies
Next Steps
Get Involved
Read a bit
http://sqlblog.com/blogs/lara_rubbelke/archive/2012/09/10/big-data-learningresources.aspx
Programming Hive Book
http://blogs.msdn.com/cindygross
Subscribe to Windows Azure HDInsight Service http://HadoopOnAzure.com (Cloud CTP)
Download Microsoft HDInsight Server http://microsoft.com/bigdata (On-Prem CTP)
Think about how you can fit Big Data into your company data strategy
Suggest uses, be prepared to combat misuses
Big Data References
Hadoop: The Definitive Guide by Tom White
SQL Server Sqoop http://bit.ly/rulsjX
JavaScript http://bit.ly/wdaTv6
Twitter https://twitter.com/#!/search/%23bigdata
Hive http://hive.apache.org
Excel to Hadoop via Hive ODBC http://tinyurl.com/7c4qjjj
Hadoop On Azure Videos http://tinyurl.com/6munnx2
Klout http://tinyurl.com/6qu9php
Microsoft Big Data http://microsoft.com/bigdata
Denny Lee http://dennyglee.com/category/bigdata/
Carl Nolan http://tinyurl.com/6wbfxy9
Cindy Gross http://tinyurl.com/SmallBitesBigData
What we
covered
Objectives
 Quick Overview: Big Data,
Hadoop, HDInsight, Open
Source
 What Hive is
 Why Hive for Hadoop?
 Why Hive for SQL Pros?
 How Hive fits into
Hadoop/HDInsight
 Hive is better together with
SQL, AS, BI
Key Takeaways
 How Hive fits
 Hive DDL and DML
 Formats, Structure
 Storage options
Cindy Gross
CAT PM
SQL, BI, Big Data
HDInsight: Jiving about Hadoop
and Hive with CAT
http://blogs.msdn.com/cindygross/
@SQLCindy
Cindy.Gross@microsoft.com
Download