Microsoft Big Data and Analytics
Executive Summary
Microsoft has established a firm foothold in the world of
traditionally structured data with Microsoft SQL Server* and
an even firmer foothold in the world of data analysis with tools
such as Microsoft Excel*. However, the big data era requires
solutions to store, query, and analyze data beyond that which is
traditionally structured in relational databases or spreadsheets.
Microsoft has responded to this big data challenge not only
by offering a new big data solution, but also by describing
a broad solution for comprehensive data management and
analysis that is supported by a combination of new and old
Microsoft products.

The big data trend in recent years has been largely driven
by the popular, open-source software framework of Apache
Hadoop*. Apache Hadoop allows massive amounts of data
that is not structured into relational databases to be stored
in clusters of commodity servers and then analyzed for
correlations, trends, and other potentially valuable information.
So popular has Apache Hadoop become as a big data solution
that to many, the terms “big data” and Apache Hadoop have
become synonymous.

Microsoft is offering an Apache Hadoop component with
Microsoft HDInsight*, a set of services built on Hortonworks
Data Platform* (HDP*) for Windows*. More specifically, HDInsight
can refer to either of two separate Microsoft products, both still
in preview and months away from general release: HDInsight
Server, an on-premises solution, and Windows Azure HDInsight
Service*, a completely cloud-based solution.

Although Microsoft does offer these two new Apache Hadoop
products for storing and mining both semi-structured and
unstructured data, the company has also been keen to steer the
big data conversation away from the need for big data solutions
per se and toward the need for a universal data management
and analysis solution. Until recently, in fact, Microsoft used
the term “big data” to refer to this universal vision, but its most
recent messaging makes a distinction between “big data” of
Apache Hadoop and other forms of data. Microsoft’s broader
vision is supported in part by Microsoft SQL Server 2012
Parallel Data Warehouse* (PDW), which is a data-warehouse
hardware appliance that stores only structured data but that
also supports queries of both structured and unstructured data
through Microsoft’s proprietary PolyBase technology. Microsoft
also positions SQL Server Analysis Services (SSAS), Excel, and
Microsoft SharePoint Server* as part of its “all data” tool set,
along with optional analysis add-ons for Microsoft Office* such
as PowerPivot, Power View, Power Map, and Power Query.

Contents
Executive Summary
Evaluating the Microsoft Data Platform
  Is Microsoft Really Democratizing Big Data?
  Does Microsoft Offer a Truly Comprehensive Data-Management Solution?
  Conclusion
Microsoft’s Big Data Vision
Microsoft’s General Claims about its Comprehensive Data Solution
  Claim: “The Microsoft big data solution offers an integrated
  platform for managing data of any type or size.”
  Claim: “Microsoft’s big data solution gives you the power to …
  enable anyone in your organization to easily glean insight from
  your data so they can make smarter decisions.”
Microsoft HDInsight*: Microsoft’s Apache Hadoop* Solution
  Creating HDInsight Service Clusters
  HDInsight Storage Options
  HDInsight Management
  Getting Data in and out of HDInsight
  Technical Notes about HDInsight
  Microsoft’s Claims about HDInsight
    Claim: “[HDInsight lets you] accelerate the deployment with the
    cloud by deploying an Apache Hadoop cluster on Windows Azure*
    in just 10 minutes.”
    Claim: “Microsoft simplifies programming on Apache Hadoop.”
    Claim: “[Microsoft big data lets you] seamlessly extend
    privileges across HDInsight with Active Directory*.”
    Claim: “HDInsight is 100% compatible with Apache Hadoop.”
SQL Server 2012* Parallel Data Warehouse: An (Almost) All-in-One
Data Solution
  PDW Hardware Specifications
    Dell Parallel Data Warehouse Appliance
    HP AppSystem for Microsoft SQL Server 2012 Parallel Data Warehouse
  How PDW Works
  Comparison of Data Warehousing Appliances
  Big Data Integration – PolyBase
    CREATE EXTERNAL TABLE Statement
    CREATE TABLE AS SELECT Statement
    Querying the Data
    Pushing Data to Apache Hadoop from PDW
    Roadmap for PolyBase
  ETL in PDW
  Microsoft’s Claims about PDW
    Claim: PolyBase for PDW provides “seamless integration of Apache
    Hadoop data with the data warehouse in a single query.”
    Claim: “HDFS Bridge in PolyBase … enable[s] direct communication
    between HDFS data nodes and PDW compute nodes.”
Business Intelligence and Analytics
  Apache Hive* ODBC Driver
  PowerPivot
  Power Query
  Power View
  Power Map
  Microsoft’s Claims about BI
    Claim: “HDInsight democratizes the power of big data BI.”
    Claim: “[Microsoft lets you] analyze big data with familiar tools.”
Conclusion
Notes

Evaluating the Microsoft Data Platform
Microsoft makes two alluring pitches for its suite of data
products. The first is that its solution can bring the power of
big data to the masses, making queries easier to submit and
data easier to analyze with tools that are already ubiquitous.
The second claim is that the Microsoft solution offers a single,
comprehensive solution to manage all enterprise data,
regardless of size, structure, or speed.

Is Microsoft Really Democratizing Big Data?
Despite the near-exuberant rhetoric about bringing big data
analysis to the masses, Microsoft’s progress on this count has
been somewhat modest. Microsoft is indeed lowering the barrier
to entry for big data, but only incrementally. Its clearest success
along these lines comes in deployment and management.
Whether on premises or in the cloud, HDInsight is easy to set up
and manage compared to other big data solutions, especially
for IT personnel who lack Linux* expertise. This solid innovation,
however, does not simplify deriving value from data in the cluster.
For this ultimate purpose, HDInsight only modestly reduces the
difficulty of searching, analyzing, and mining Apache Hadoop
data compared to other Apache Hadoop solutions. Microsoft’s
unique contribution toward simplifying data mining from Apache
Hadoop clusters is to offer a set of programming libraries that
allows programmers to run operations against Apache Hadoop
data in simpler programming languages, such as JavaScript*
and .NET* languages such as C# and F#. It also offers an
interactive JavaScript console that allows programmers to run
JavaScript commands against data in Apache Hadoop files one
line of code at a time. As a comparison, the classic “WordCount”
program in Apache Hadoop requires approximately 60 lines of
code in Java*, but only 15 in JavaScript. Such advancements will
allow more people to gain insights from data stored in Apache
Hadoop files, but that wider group must still be programmers.

One area where Microsoft is truly democratizing data analysis
and visualization is on the client end, in Excel. Excel has the
ability to take data stored in individual Apache Hadoop files, run
traditional database queries against this data, perform analysis
on tables in this data, and finally present this data in impressive
visualizations that can provide valuable insights. However, it
is essential to understand first that Excel can load data from
any Apache Hadoop source, not just from HDInsight. Excel
allows users to import Apache Hadoop data from any source
by means of a special add-on driver (the Apache Hive* Open
Database Connectivity [ODBC] driver from Microsoft). As much
as Microsoft is attempting to connect Excel and HDInsight as
part of a single solution, there is no substantial advantage to
choosing HDInsight as the particular backend source of Apache
Hadoop data in Excel. Moreover, Excel does not allow users
to perform complex operations, such as machine learning,
that analyze or mine vast amounts of data from an Apache
Hadoop cluster in the way the term “big data” suggests. With
Excel, information workers can merely import individual Apache
Hadoop files and perform analyses and visualizations on tables
stored in these files.

Does Microsoft Offer a Truly Comprehensive Data-Management Solution?
The closest Microsoft comes to a comprehensive data solution
today is with its PDW hardware appliance, which includes
SQL Server 2012 to store structured data and which can also
connect to Apache Hadoop data from an external source. PDW
thus enables unified access to both structured and unstructured
data. However, PDW does not currently favor any particular
Apache Hadoop solution as the external source of unstructured
data, making it a big data solution far from specific to Microsoft.
Microsoft’s current big data solution is also limited in that none of
its components can handle streaming unstructured data, such
as from social media or user clickstreams.

PDW might have clear limitations today, but in the future the
appliance is likely to fulfill Microsoft’s promise of delivering
truly comprehensive data management on-premises, if at a
high price. This comprehensive vision is set to be realized
with the next version of PDW, which will likely include a
pre-installed version of HDInsight Server (at least as an option).
The ability to perform real-time queries of unstructured data
streams is also likely to be incorporated into future versions of
HDInsight, making Microsoft’s data-handling capabilities truly
comprehensive. Microsoft has not confirmed that it will include
PolyBase in a future, general release of SQL Server outside
of PDW, but such a move is also plausible. Adding PolyBase
broadly to SQL Server would bring the capabilities of handling
structured and unstructured data to the wider database market.

A cloud-based, on-demand solution that meets Microsoft’s
promise of comprehensive data management is also eventually
likely to arrive. Windows Azure* already allows users to
create, store, and manage databases in Windows Azure SQL
Database* online, so all of the components of a comprehensive
data solution will be available through Windows Azure when
Windows Azure HDInsight Service matures. It is unclear if a
wider distribution of PolyBase to SQL Server would extend to
a cloud-based version, however. Even if it does, the bottleneck
of upload speeds on large, proprietary data sets could limit the
usefulness of the cloud-only option for some data-heavy firms.
However, the cloud-only solution will present an attractive
option for firms that generate data online or that work with
public data files.

Conclusion
Microsoft provides a vision for big data within a larger context
of all data, structured and unstructured. While this vision
is tantalizing for the future, it ultimately lacks substance
today. Democratizing big data would hold some of the same
revolutionary promise that personal computing and later the
Internet realized in the last three decades, yet it is far from
clear that Microsoft will ultimately consummate this revolution.
PolyBase shows potential for managing and analyzing
structured and semi-structured enterprise data by using familiar
database skills, but it is currently only available in a high-end
data-warehouse appliance. Using Excel as a frontend for
big-data analysis is another alluring vision, but it too is limited to
dealing with structured and semi-structured data. Moreover,
if Excel continues to be agnostic about the big-data backend
supporting it, it does not provide an argument for companies to
pick HDInsight over any other Apache Hadoop solution. Most
fundamentally, Microsoft’s solution for unstructured big data is
still not released, and it will be a matter of time before general
usage can truly reveal its strengths and its faults.

Despite these reservations, there are reasons to be optimistic
about Microsoft’s chances of bringing big data to the masses in
the future. Compared to other companies, Microsoft has more
of the components in place for a comprehensive data solution,
including popular database management software in SQL
Server, a rapidly-maturing cloud provider in Windows Azure,
widely-used business intelligence tools, and the resources to
invest in this comprehensive vision for the long term.

Microsoft’s Big Data Vision
Microsoft is currently developing a big data solution whose
main components are likely to be released over the next year.
These products have not yet been finalized, but their features
have been made public, and Microsoft’s own statements about
its soon-to-be-released big data tools provide insight not only
into the company’s big data strategy, but also into its broader
data strategy in general. This paper provides an overview of this
broader strategy and an analysis of Microsoft’s big data claims.
Big data as a trend relies technically on the open-source
software framework of Apache Hadoop. Originally created
at Yahoo!, Apache Hadoop allows nearly unlimited amounts
of unstructured or semi-structured data (such as is found in
log files) to be stored in clusters of inexpensive servers and
then analyzed for correlations, trends, causal relationships,
and other insights. Apache Hadoop has become the industry
standard for big data, and for many, the terms “big data”
and Apache Hadoop have become synonymous. For many
companies selling a big data solution, the conversation about
big data begins and ends with Apache Hadoop.
Microsoft’s vision of big data differs from many others in that it
has publicly positioned Apache Hadoop as only a component
of a more comprehensive data strategy. This comprehensive
strategy includes not only the unstructured and semi-structured
data that are the accepted mainstays of big data, but also data
that is structured (such as in the traditional database tables
of a data warehouse), along with the business intelligence
tools used to analyze all data, whether unstructured,
semi-structured, or structured. This broader “all data” vision allows
Microsoft to draw into the big data conversation the company’s
existing strengths in products such as SQL Server and
Excel. By re-imagining the business staples of SQL Server
and Excel as having a role in a big data solution, Microsoft
is targeting its suite of big data products toward the many
businesses that have already invested heavily in these tools and
accumulated large amounts of potentially useful data in them.
Microsoft is also targeting the many companies that have strong
skills in common software tools but that lack the specialized
knowledge reserved for data scientists and pure Apache
Hadoop experts.

The most central component of Microsoft’s big data strategy
is provided by HDInsight, an Apache Hadoop solution built
from a particular Apache Hadoop distribution, namely HDP for
Windows. (HDP for Windows, developed by Hortonworks, Inc., is
in fact the first distribution of Apache Hadoop that runs natively
on Windows, and it is already publicly available as a free tool.)
HDInsight can actually refer to either of two separate Apache
Hadoop products, both still available only as preview versions:
HDInsight Server, an on-premises solution, and Windows Azure
HDInsight Service, a completely cloud-based solution. Both
of these options are touted as versions of Apache Hadoop
that are easier to set up and use than are the Apache Hadoop
products offered by competitors. More recently, Microsoft has
also described HDInsight as a solution for analyzing data that
is semi-structured in particular, such as data sourced from
smartphones, web sites, RFID tags, and Twitter feeds. Microsoft
has also hinted that its search engine technologies, Bing* and
Microsoft FAST Search, will act as the solutions to interact with
completely unstructured data, such as documents. Code from
both products was in fact incorporated into the search function
in Microsoft SharePoint 2013*. However, Microsoft has not
elaborated on the particular role it sees for its search engines
within its comprehensive data strategy.

A second cornerstone of Microsoft’s big data vision is SQL Server
2012 PDW, a hardware appliance that supports queries of data
stored both in SQL tables and Apache Hadoop files through
Microsoft’s proprietary PolyBase technology. PDW is already
available at a price of approximately $1.5M. (Note that data
warehouses commonly cost as much as $30M, so while high, the
cost of PDW is actually low relative to that of the competition.)

The third, and currently final, component of Microsoft’s big data
solution is its business intelligence (BI) and visualization tools.
These tools include Excel most importantly, but also Microsoft
SharePoint Server and Microsoft Office 365*, along with optional
analysis add-ons such as PowerPivot, Power View, Power Map,
and Power Query.

Future components might be added to this suite of big data
products as they become available. For example, the next
version of SQL Server will include an in-memory online
transaction processing (OLTP) engine, currently code-named
Hekaton. Hekaton will allow any new products based on it to
efficiently process data captured in real time, such as from
data streams. It is plausible that Microsoft’s big data strategy
will eventually reflect this new functionality provided by Hekaton
and include a real-time data analysis tool.

Although Microsoft is describing these various products as
components of an integrated big data solution, they do not
function cohesively today. It is more accurate to view these
components as a list of separate tools that might slowly
become integrated over time.

Another limitation to keep in mind about Microsoft’s big data
solution is that its central component, HDInsight, is still a work
in progress and many months away from release. Moreover,
there is even a question about whether HDInsight will be
outdated when it finally is released. HDInsight is currently
based on HDP for Windows 1.x, which in turn is based on
the Linux-exclusive Apache Hadoop 1.0. The next version of
Apache Hadoop based on Linux, version 2.0, is currently
in community preview and is scheduled for general release
in late summer 2013; it offers an architectural overhaul
that promises to dramatically improve performance and
extensibility. Hortonworks’s port of Apache Hadoop 2.0 to HDP
for Windows 2.0 is currently being targeted for late 2013. Any
future version of HDInsight that incorporates the updates in
Apache Hadoop 2.0 can only be built after HDP for Windows
2.0 is finalized in late 2013.

Microsoft’s General Claims about its
Comprehensive Data Solution
Microsoft’s claims surrounding its all-data solution fall into three
broad categories: that Microsoft provides a platform to manage
data of any type and size, that the Microsoft solution provides a
way to analyze all data, and that the Microsoft solution enables
information worker generalists to glean insights from big data.
While these claims are generally accurate, careful examination
of each claim yields a more nuanced picture.

Claim: “The Microsoft big data solution offers an integrated
platform for managing data of any type or size.”1
In discussing its comprehensive data solution, Microsoft places
data into two broad categories for management: structured
data (managed by SQL Server) and semi- and unstructured
data (managed by HDInsight). The fact that Microsoft points to
two products actually hints at the lack of an integrated platform
for data management: data for SQL Server and Apache
Hadoop are not integrated into a single platform.

Even within each discrete product, data is not necessarily
integrated. On the one hand, it is true that SQL Server is the
management tool for structured data. On the other hand,
managing data in HDInsight is more complex. For companies
choosing the cloud-based Windows Azure HDInsight Service
as their Apache Hadoop option, both semi-structured and
unstructured data are likely to be stored and managed in
Windows Azure blob storage. For firms choosing the
on-premises HDInsight Server option, semi-structured and
unstructured data are likely to be managed in separate
locations. Semi-structured data will likely be stored and
managed in Apache Hadoop Distributed File System (HDFS).
Unstructured data, such as documents, spreadsheets,
presentations, videos, and audio recordings, will likely be
managed not in Apache Hadoop but in SharePoint, in a
Microsoft product-centric IT deployment.

The technology that currently comes closest to realizing
the claim of an integrated platform is PolyBase. PolyBase
should not be viewed as a silver bullet, however. Beyond the
fact that it is currently locked away in a specialized, expensive
data warehouse appliance, it is unclear to what extent it will
integrate with Microsoft’s principal tool for unstructured data
querying, Bing, or with Microsoft FAST Search for queries in
SharePoint. As with so many other aspects of the Microsoft data
vision, time alone will tell how and to what degree organizations
can implement them.

In general, Microsoft does not currently offer a comprehensive
data management solution but a set of tools and products that
allows organizations to handle structured, semi-structured, and
unstructured data.

Claim: “Microsoft’s big data solution gives you the power to
… enable anyone in your organization to easily glean insight
from your data so they can make smarter decisions.”2
This claim exaggerates the democratizing power of the
Microsoft big data solution. Microsoft’s integration of ubiquitous
and well-understood tools for big data analytics (particularly
Excel) should not be confused with making big data queries
and analysis inherently easier. Using laymen’s tools for big data
work is not the same as putting big data insights within reach
of all laymen.

That said, this represents a key part of Microsoft’s competitive
advantage in the big data arena, particularly with the saturation
of Excel in the enterprise productivity market. Many more
knowledge workers are familiar with Excel than with even SQL
queries, for example, opening up direct examination of big data
sets to a larger pool of analysts who previously had to work
through middlemen like data scientists. Moreover, Excel
add-ins such as PowerPivot, Power View, Power Map, and Power
Query definitively put more analytical power in the hands of end
users than before.

IT organizations looking at these solutions, however, should
keep their eyes wide open for the behind-the-scenes work that
can go into preparing data sets for wider use within a company.
A sample data set of electrical usage of households in two
Dallas suburbs used to demonstrate Power Map and Power
View in Excel provides a telling example. The Microsoft team
loaded Dallas County Appraisal District flat-file records into
SQL Server, converted geographical coordinates within them
from a planar to an ellipsoid projection with a third-party tool,
and calculated the centroid of each land parcel in SQL Server
to obtain a longitude and latitude figure for each plot before
exporting the data to Excel. (All of this before adding details
to the data set, such as simulated rates of electricity usage.)
The result was a rich data set that could be dissected by
information workers across a variety of dimensions, including
time. The route to get there was anything but trivial, however.
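
To give a sense of what one of these preparation steps involves, the
following is a minimal sketch of a parcel-centroid calculation. It is
not the method the Microsoft team used (the paper says only that the
centroid was calculated in SQL Server); it assumes each parcel is a
simple polygon whose vertices are already (longitude, latitude) pairs,
and it treats those coordinates as planar, which is only an
approximation over a small area. The function name and the sample
parcel are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (longitude, latitude), treated as planar x/y


def polygon_centroid(vertices: List[Point]) -> Point:
    """Return the area-weighted centroid of a simple polygon.

    Uses the standard shoelace-based centroid formula. The vertex list
    must trace the boundary in order (either direction) and must not
    repeat the first vertex at the end.
    """
    if len(vertices) < 3:
        raise ValueError("a polygon needs at least three vertices")

    twice_area = 0.0  # signed, equals 2 * polygon area
    cx = 0.0
    cy = 0.0
    n = len(vertices)
    for i in range(n):
        x0, y0 = vertices[i]
        x1, y1 = vertices[(i + 1) % n]  # wrap around to close the ring
        cross = x0 * y1 - x1 * y0
        twice_area += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross

    if twice_area == 0:
        raise ValueError("degenerate polygon with zero area")

    # Centroid = sum / (6 * area) = sum / (3 * twice_area)
    return (cx / (3.0 * twice_area), cy / (3.0 * twice_area))


# Hypothetical rectangular parcel, 4 units wide and 2 units tall.
parcel = [(0.0, 0.0), (4.0, 0.0), (4.0, 2.0), (0.0, 2.0)]
centroid = polygon_centroid(parcel)
```

Even this one step presumes clean, correctly ordered boundary data, which
is part of why the preparation effort described above is easy to
underestimate.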
Microsoft HDInsight*: Microsoft’s
Apache Hadoop* Solution
HDInsight is the brand Microsoft has assigned to its two
upcoming Apache Hadoop products: the cloud-based
Windows Azure HDInsight Service and the on-premises
HDInsight Server. Both of these solutions are built from a core
of HDP for Windows. HDInsight in both cases thus refers to a
product composed of this basic Hortonworks Apache Hadoop
distribution in addition to extensive software customizations
added by Microsoft. (HDInsight and HDP for Windows do not,
in other words, refer to distinct components that communicate
with each other.)
Of the two versions of HDInsight, Microsoft has promoted the
cloud-based Windows Azure HDInsight Service to a much
greater degree. This product, hosted on Windows Azure, is
also expected to be released first, most likely in Q4 2013.
The emphasis on the cloud-based HDInsight suggests that
this version of the product aligns more closely with Microsoft’s
chosen market positioning for HDInsight in general.
The HDInsight Service web page (found at
http://www.windowsazure.com/en-us/services/hdinsight/) describes the
service by featuring words and phrases such as “gain insight
from any data, any size, anywhere,” “provides simplicity,” “ease
of management,” “simplicity of Windows Azure,” “simple and
straightforward,” “seamless scale,” “quickly create,” “cost
savings only possible on a cloud environment,” “glean insights
on all your data with familiar tools,” and “analyze all your data
easily.” The messaging is clear: HDInsight Service is simple,
cost-efficient, and takes advantage of existing knowledge.

Simple as it might be, HDInsight Service is not the only
cloud-based Apache Hadoop solution. Other such products
include Amazon’s Elastic MapReduce*,3 Joyent Solution
for Apache Hadoop*,4 and InfoChimps Cloud::Hadoop*.5
Microsoft’s offering differs from these others most obviously
in that it runs on Windows and that it is integrated into the
Windows Azure platform.

Another idiosyncrasy of HDInsight is that it is currently based
on Apache Hadoop 1.0.3 and HDP for Windows 1.1.0, even
though (as of August 2013) the most recent stable releases of
Apache Hadoop based on Linux are Apache Hadoop 1.2.1 and
HDP 1.3.1. Even the most recent version of HDP for Windows
is a later version: 1.3. Because Apache Hadoop is a
quickly-maturing platform, the difference in incremental updates
can be significant. For example, HDP 1.3.1 features a revision of
the Hive query language called the Stinger Initiative that supports
50 times faster performance6 and increased compatibility with
the SQL query language, but this technology is not currently
included in HDInsight. In addition, the next full version of Apache
Hadoop, Apache Hadoop 2.0, is expected to be released in
Q3 2013 and to be incorporated into HDP for Windows in Q4.
Apache Hadoop 2.0 is an important update that will dramatically
improve the efficiency and extensibility of the platform, but it is
not clear when these updates will reach HDInsight.

Creating HDInsight Service Clusters
Clusters created in HDInsight are intended to be disposable
as a way to minimize costs. HDInsight was designed with the
expectation that users will create an HDInsight cluster, load
the data needed, run the analyses desired, and then destroy
the cluster.

HDInsight promises to be simple, and as far as the procedure
to create a new cluster is concerned, it lives up to this promise.
With the “Quick Create” option in particular, the user merely
chooses the cluster size (as defined by the number of nodes)
and then assigns a name, password, and storage account for
the cluster. Once the user clicks the option to create the cluster,
the process takes 15 to 20 minutes.

HDInsight Storage Options
HDInsight allows data to be stored in local HDFS, as does any
Apache Hadoop distribution. However, an
option unique to HDInsight Service is the Azure Storage Vault
(ASV) protocol, which builds on the HDFS API to map Apache
Hadoop operations to Windows Azure blob storage instead of
to local HDFS. Through ASV, customers can keep their Apache
Hadoop data in an inexpensive Windows Azure blob storage
account and avoid having to import this data into the physical
compute nodes of the HDInsight cluster. Because the data
accessed through ASV isn’t physically stored in the HDInsight
cluster, the data remains in Windows Azure blob storage before
clusters are created and after they are destroyed.

After users spin up an HDInsight cluster, they can point
operations such as Hive queries toward data that has been
stored in Windows Azure blob storage by using a URI
beginning with asv:// or asvs://. The drawback to ASV is that,
because this data is not stored in the Apache Hadoop cluster
itself, performance is not always optimized. However, write
performance on Windows Azure blob storage is much faster
than it is on HDFS, and with large file reads, temporary writes
can be used so often that ASV can actually even result in better
overall performance than local HDFS storage can. Figure 1
shows the setting to configure ASV for HDInsight.
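
As a rough illustration of the asv:// addressing described above, the
sketch below builds such a URI and embeds it in a Hive external-table
statement. The paper states only that the URI begins with asv:// or
asvs://; the rest of the layout used here
(container@account.blob.core.windows.net/path) is an assumption drawn
from Windows Azure documentation of the period and should be checked
against the preview in use. The helper names, the storage account, the
container, and the table definition are all hypothetical.

```python
def build_asv_uri(container: str, account: str, path: str,
                  secure: bool = False) -> str:
    """Build an ASV URI for a blob path.

    Assumed layout (not confirmed by the paper):
        asv[s]://<container>@<account>.blob.core.windows.net/<path>
    """
    if not container or not account:
        raise ValueError("container and account must be non-empty")
    scheme = "asvs" if secure else "asv"  # asvs:// is the encrypted variant
    return "{}://{}@{}.blob.core.windows.net/{}".format(
        scheme, container, account, path.lstrip("/"))


def hive_external_table_ddl(table: str, location_uri: str) -> str:
    """Return a HiveQL statement for a two-column, comma-delimited
    external table whose data lives at location_uri.

    The column names are placeholders for illustration only.
    """
    return (
        "CREATE EXTERNAL TABLE {} (event_time STRING, message STRING)\n"
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','\n"
        "STORED AS TEXTFILE\n"
        "LOCATION '{}';"
    ).format(table, location_uri)


# Hypothetical storage account and container.
uri = build_asv_uri("weblogs", "examplestore", "/2013/08/")
ddl = hive_external_table_ddl("web_events", uri)
```

Because the table is external and its location is in blob storage, the
underlying files would persist after the cluster is destroyed, which
matches the disposable-cluster model described earlier.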

Figure 1. HDInsight cluster management screen

If optimal performance is important, it is advisable to run tests
with data stored in both ASV and local HDFS and compare
the results. Note, however, that the cost of storing data in HDFS
on HDInsight node instances is much higher than the cost of
storing a comparable amount of data in Windows Azure blob
storage. Another drawback to HDFS over ASV is that data
stored in HDFS is removed when the cluster is destroyed.
Figure 2 illustrates the relationship between an HDInsight
cluster, HDFS, and ASV.

Figure 2. Relationship between an HDInsight cluster, HDFS, and Windows Azure Blob Storage

HDInsight Management
Windows Azure HDInsight Service and its on-premises
counterpart HDInsight Server share the same web-based
management interface, shown in Figure 3. The graphical user
interface (GUI) provides options such as an interactive JavaScript
and Hive console for a cluster, a remote desktop connection to
the name (main) node, and monitoring data.

For more fine-grained management of HDInsight clusters and
their associated storage, Windows PowerShell* is available.
Windows PowerShell cmdlets for HDInsight are currently in
version 0.9 and are available through the Microsoft .NET SDK
For Apache Hadoop web site on Codeplex
(https://hadoopsdk.codeplex.com/releases/view/109811).

Figure 3. HDInsight cluster dashboard

Beyond these current tools, Microsoft has stated that in the
future, Microsoft System Center* will provide tools to manage
HDInsight. Given this information, it seems most likely that
this System Center integration will become available in the first
full release of System Center after the official public release
of HDInsight.

Getting Data in and out of HDInsight
HDInsight offers a number of standard Apache Hadoop
ecosystem tools for loading and unloading data, such as the
Apache Hadoop command or, if the source is a relational
database, the Apache Sqoop* tool (included in all Apache
Hadoop distributions). To load log file data, the standard
Apache Hadoop ecosystem tool Apache Flume* is used.

To move data into or out of Windows Azure blob storage (as
opposed to HDFS), users have more options. For example,
one can use any number of tools that make use of the HDFS
API, such as the free graphical tools Azure Storage Explorer*
and CloudXplorer* or the command-line tool AzCopy*. One can also use JavaScript via the interactive console, the Apache Hadoop command line (using the Apache Hadoop command), or a .NET language such as C#. Yet another option is Windows PowerShell.

After data has been unloaded, it’s typically necessary to clean it before it can be consumed, analyzed, or displayed in a visualization. These data cleaning operations are often referred to as extract, transform, and load (ETL). For ETL operations with HDInsight, the standard Apache Hadoop tool Apache Pig* can be used. However, Microsoft also makes ETL for Apache Hadoop possible through SSIS, by means of the Hive ODBC Driver; the Hive ODBC Driver allows external applications such as Excel and SQL Server to connect to Apache Hadoop data.

Technical Notes about HDInsight

HDInsight was developed with ease of use in mind and has not been optimized for other features such as performance. In addition, it is unlikely that HDInsight will ever be built on the very latest version of Apache Hadoop because these versions are written on Linux. As a result, HDInsight will be late to adopt cutting-edge features and frameworks such as Intel’s Project Rhino, which provides a common security framework for Apache Hadoop; or Intel® Advanced Encryption Standard New Instruction (Intel® AES-NI), which speeds performance on encryption; or cell-level security in Apache Hadoop, such as is being developed in the Apache Accumulo* project. Regarding security, the only claims Microsoft is in fact making about HDInsight and security relate to its integration with Active Directory* Domain Services.

Microsoft’s Claims about HDInsight

Microsoft’s main claims about HDInsight usually suggest that the product makes Apache Hadoop easier. What follows are some representative examples of Microsoft’s claims, each followed by a brief analysis.

Claim: “[HDInsight lets you] accelerate the deployment with the cloud by deploying an Apache Hadoop cluster on Windows Azure* in just 10 minutes.”7

The claim is specific and easy to verify, but it also suggests something general: that creating an HDInsight cluster in Windows Azure is a trivial exercise and is far easier than setting up one’s own hardware cluster.

Although it takes closer to 20 minutes to set up an HDInsight cluster, it is true that by using Windows Azure HDInsight Service the circumscribed process of setting up an HDInsight cluster is quick and easy. However, this statement is essentially misleading because it ignores the necessary step of uploading data into the cloud. This uploading process is necessary unless the enterprise data destined for Apache Hadoop is already stored in Windows Azure blob storage (an uncommon scenario). To upload 1 TB of uncompressed data at a rate of 1 MB/second would require approximately 12 days. Compression can reduce the transfer times by 80 to 90 percent, but even assuming the rate can be increased to a brisk 1 TB per day, the process of uploading 100 TB would still take 100 days. (Windows Azure does not yet allow customers to ship physical disks to speed the process of loading data, but this service is planned before the end of 2013.8)

In addition, regardless of how complicated or time-consuming the process of deploying an Apache Hadoop cluster might be, this difficulty of deployment is not a major deterrent to the sound use of Apache Hadoop. In the broader scheme, ease of installation is a nice-to-have feature of HDInsight that does not help businesses derive any value whatsoever from an Apache Hadoop cluster.

Note that for the on-premises version of this product, the true ease of installation cannot yet be verified because the current preview of HDInsight Server (for on-premises deployment) can only be installed as a single node.
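
The upload-time figures quoted in the analysis of the 10-minute claim can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: it assumes decimal units (1 TB = 10^6 MB), treats the "80 to 90 percent" compression benefit as a reduction in bytes sent, and uses the hypothetical names upload_days and ONE_TB_PER_DAY, none of which come from the paper.

```python
SECONDS_PER_DAY = 86_400
MB_PER_TB = 1_000_000  # assumption: decimal units, 1 TB = 10^6 MB


def upload_days(size_tb, rate_mb_per_s, reduction=0.0):
    """Return the days needed to upload size_tb terabytes.

    rate_mb_per_s is the sustained upload rate in MB/second.
    reduction is the fraction of bytes saved by compression (0.8 = 80 percent).
    """
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    megabytes = size_tb * MB_PER_TB * (1.0 - reduction)
    return megabytes / rate_mb_per_s / SECONDS_PER_DAY


# A rate of 1 TB per day expressed in MB/second (about 11.6 MB/second).
ONE_TB_PER_DAY = MB_PER_TB / SECONDS_PER_DAY

print(f"{upload_days(1, 1):.1f}")                  # 1 TB at 1 MB/s, uncompressed
print(f"{upload_days(1, 1, reduction=0.8):.1f}")   # same, with 80 percent fewer bytes
print(f"{upload_days(100, ONE_TB_PER_DAY):.0f}")   # 100 TB at 1 TB per day
```

The first call gives roughly 11.6 days, which the text rounds to "approximately 12 days"; the last call confirms that even at 1 TB per day, 100 TB takes 100 days.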
Claim: “Microsoft simplifies programming on Apache Hadoop.”9

The claim that a procedure has been simplified can mean either that it has been made simple, or that it has merely been made simpler. In this case it is true that Microsoft has made programming on Apache Hadoop a little simpler, but it is not true that it has made programming on Apache Hadoop simple.

Microsoft’s programmatic addition to Apache Hadoop has been to create a .NET software development kit (SDK) and a set of JavaScript libraries for HDInsight, in addition to providing an interactive JavaScript console to Apache Hadoop. (The .NET SDK allows programmers to write essential Apache Hadoop MapReduce jobs in .NET languages such as C# and F#.) These additions in principle should make programming for Apache Hadoop easier for the many programmers who are not Java specialists. However, programming MapReduce jobs will remain fundamentally complex even in these other languages. For the IT decision maker, the take-away is that developers comfortable in any .NET language or JavaScript will now be able to program MapReduce jobs and quickly perform queries in a console against data stored in Apache Hadoop.

Claim: “[Microsoft big data lets you] seamlessly extend privileges across HDInsight with Active Directory*.”10

The implication of this claim is that the integration of HDInsight with Active Directory Domain Services makes managing HDInsight easier.

Apache Hadoop is in fact integrated with Active Directory Domain Services, but not yet to the high degree that is suggested in the claim. The locus of integration is currently with user accounts, authentication, and authorization: Windows accounts are used to manage Apache Hadoop, and it’s not necessary to create user accounts within HDInsight itself. In fact, with HDInsight, no aspect of authentication and authorization remains siloed in Apache Hadoop; security is handled by Windows Azure, Active Directory Domain Services, or local Windows security. In addition, HDInsight creates a special Windows user account named “hadoop” that the services related to Apache Hadoop use for logon credentials. These services, and the “hadoop” logon account, are shown in Figure 4.

Figure 4. HDInsight services

In general, IT should not soon expect dramatic improvements in the manageability of Apache Hadoop because of its loose integration with Active Directory Domain Services. However, it is likely that Apache Hadoop and Active Directory Domain Services will become more integrated over time, leading to (for example) specific HDInsight group policy objects (GPOs) and other administrative benefits. HDInsight will likely need some years to mature before that will happen, however.

Claim: “HDInsight is 100% compatible with Apache Hadoop.”11

Buried within Microsoft’s general claim that “HDInsight makes Apache Hadoop easier” is the implicit claim that HDInsight really is Apache Hadoop. Is it? In general, yes. Apache Hadoop runs inside HDInsight, and it is true that Apache Hadoop files from other Apache Hadoop distributions are 100 percent compatible with it. In addition, one can download an Apache Hadoop component such as Apache Mahout* straight from the Apache web site, and it will run on an HDInsight cluster without errors.

However, it is not true (as the claim might be interpreted) that HDInsight has the same features as all standard versions of Apache Hadoop. At the time of this writing, for example, HDP 1.3.1 and Apache Hadoop 1.2.1 support features that have not yet appeared in HDInsight. This lag time between Apache
Hadoop versions is likely to persist indefinitely, and it remains to be seen whether in some cases it could actually lead to file or code incompatibilities.

In general, the take-away for the IT decision maker is that HDInsight is likely to be running a slightly outdated version of standard Apache Hadoop. Today, code and syntax are 100 percent portable from standard Apache Hadoop, but in the future, exceptions to this rule cannot be ruled out. Ultimately, however, Microsoft has made clear that it wants to remain 100 percent compatible with Apache Hadoop, so if such an incompatibility should arise, it will likely be a temporary problem.

SQL Server 2012* Parallel Data Warehouse: An (Almost) All-in-One Data Solution

Another pillar in Microsoft’s all-data product lineup is SQL Server 2012 PDW. PDW is a massively parallel processing (MPP) data warehousing appliance that combines custom software built on SQL Server 2012 with commodity hardware. Currently, the appliance is sold in various scalable configurations only by Dell and Hewlett-Packard. At the lowest end, both vendors sell a one-quarter rack version (of a standard 42U rack). The Dell appliance can scale up to 6 racks, and the HP counterpart can scale up to 7 racks.

A key concept in understanding PDW is that it represents a scale-out solution, as opposed to a scale-up solution. When users run T-SQL queries against PDW, the queries are broken down and distributed among all required nodes. The processing itself is therefore distributed and not centralized. As nodes are added to the appliance, the raw processing power of PDW increases in an essentially linear manner.

Storage in the PDW appliance is both replicated and distributed. Smaller tables (approximately 5 GB or smaller) are replicated among all nodes for improved performance. Larger tables are broken up and distributed across nodes.

PDW Hardware Specifications

The PDW versions from Dell and Hewlett-Packard are not identical, but they do share some common specifications. First, both vendors assign 256 GB of RAM to each physical node in the appliance. Second, for both Dell and HP, the first rack in the appliance (or only rack, if there’s only one) includes one node assigned control and management responsibilities. Microsoft also specifies that one extra node per rack should remain essentially unused and be included for failover, so this is another common element from both vendors. Finally, in both the Dell and HP solutions, nodes are connected with InfiniBand* and Ethernet, both of which are implemented with redundancy. These control and failover nodes along with the redundant networking components occupy 6U in the first (or only) rack, and 5U in all subsequent racks (because the control node is needed only in the first rack).

Dell Parallel Data Warehouse Appliance

Dell’s PDW product is officially called the Dell Parallel Data Warehouse Appliance. The following list provides additional detailed hardware specifications about the Dell PDW configuration options, beyond the elements described above:

• Basic scale unit of 10U: 3 servers in a 2U enclosure, and two 4U drive arrays
• Basic scale unit = 3 Dell PowerEdge R620* compute nodes, 2 Dell PowerVault MD3060e* JBOD SAS arrays (102 drives)
• Up to 3 scale units (9 compute nodes) per rack
• ¼–6 racks
• 3–54 compute nodes total
• 1, 2, or 3 TB storage capacity per drive
• 22.65–1,223.1 TB raw free storage space
• 79–6,116 TB user storage (with compression)
• 6U available for customer space on first rack, 7U on other racks
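
The rack figures quoted for each vendor are internally consistent, which can be checked with simple arithmetic. The following sketch uses only numbers stated in this section (the Dell figures above and the HP figures given in the HP subsection): a 42U rack, 6U or 5U of shared control, failover, and networking components, and each vendor's scale-unit size. The dictionary layout and function names are hypothetical and chosen purely for illustration.

```python
RACK_U = 42                               # standard rack height cited in the text
SHARED_U = {"first": 6, "other": 5}       # control/failover/networking space per rack

# Per-vendor scale-unit figures as quoted in the hardware lists.
VENDORS = {
    "Dell": {"scale_unit_u": 10, "units_per_rack": 3, "nodes_per_unit": 3, "max_racks": 6},
    "HP":   {"scale_unit_u": 7,  "units_per_rack": 4, "nodes_per_unit": 2, "max_racks": 7},
}


def customer_space_u(vendor, rack):
    """Rack units left over once shared components and all scale units are installed."""
    v = VENDORS[vendor]
    return RACK_U - SHARED_U[rack] - v["scale_unit_u"] * v["units_per_rack"]


def max_compute_nodes(vendor):
    """Compute nodes in a fully populated, maximum-size appliance."""
    v = VENDORS[vendor]
    return v["nodes_per_unit"] * v["units_per_rack"] * v["max_racks"]


for name in VENDORS:
    print(name,
          customer_space_u(name, "first"),
          customer_space_u(name, "other"),
          max_compute_nodes(name))
```

The results match the published lists: 6U and 7U of customer space with up to 54 compute nodes for Dell, and 8U and 9U with up to 56 compute nodes for HP.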
HP AppSystem for Microsoft SQL Server 2012 Parallel Data Warehouse

HP’s PDW product is called the HP AppSystem for Microsoft SQL Server 2012 Parallel Data Warehouse. The HP AppSystem offers a different range of hardware options:

• Basic scale unit of 7U: two 1U servers and one 5U drive array
• Basic scale unit = 2 HP ProLiant Gen8 DL360 compute nodes, 1 HP P6000 JBOD SAS array (70 drives)
• Up to 4 scale units (8 compute nodes) per rack
• ¼–7 racks
• 2–56 compute nodes
• 1, 2, or 3 TB storage capacity per drive
• 15.1–1,268.4 TB raw free storage space
• 53–6,342 TB user storage (with compression)
• 8U available for customer space on first rack, 9U on other racks

These different hardware specifications for the first rack from each vendor are shown in Figure 5.

Figure 5. Comparison of SQL Server 2012 Parallel Data Warehouse hardware specifications between Dell and HP

How PDW Works

Despite the many components included in PDW, to external clients the appliance looks just like a single instance of SQL Server 2012. T-SQL queries to PDW are directed from clients toward the PDW control node, and the control node eventually responds to the client with the results of the query. To answer the query, the control node uses its metadata to break up an original query into smaller parts and send these smaller component queries to the appropriate nodes. The control node then compiles into one response the results received from these various nodes and then sends this response to the client.

PDW virtualizes all servers on its physical nodes and uses failover clustering to protect these virtualized workloads. No one node (including the control node) represents a single point of failure.

Figure 6 shows a view of the PDW from the perspective of an administrator.

Figure 6. SQL Server 2012 Parallel Data Warehouse management portal
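
The control-node flow described under How PDW Works (split the query, run the parts on the nodes that hold the data, merge the partial results) is a scatter-gather pattern. The sketch below is a toy model of that idea, not PDW's implementation: it assumes rows are spread across nodes by hashing a column, uses CRC32 purely as a stand-in hash, and all function names are hypothetical.

```python
from collections import Counter
from zlib import crc32


def node_for(key, n_nodes):
    """Pick a node for a distribution key (CRC32 is only a stand-in hash)."""
    return crc32(key.encode("utf-8")) % n_nodes


def distribute(rows, n_nodes):
    """Spread rows across nodes by hashing the 'url' column."""
    nodes = [[] for _ in range(n_nodes)]
    for row in rows:
        nodes[node_for(row["url"], n_nodes)].append(row)
    return nodes


def node_count_by_url(local_rows):
    """Component query run on one node: count local rows per URL."""
    return Counter(row["url"] for row in local_rows)


def control_node_query(nodes):
    """Send the component query to every node and merge the partial results."""
    total = Counter()
    for local_rows in nodes:
        total.update(node_count_by_url(local_rows))
    return dict(total)


rows = [{"url": u} for u in ["/home", "/cart", "/home", "/help", "/home", "/cart"]]
nodes = distribute(rows, n_nodes=4)
print(control_node_query(nodes))
```

Because every row with the same key lands on the same node, each node can aggregate its own slice independently, which is why adding nodes adds processing capacity in a roughly linear way.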
Vendor and Appliance | Memory (GB) | Total Cores | Compression | User Storage (TB, Compressed) | List Price
EMC Greenplum Data Computing Appliance* | 768 | 48 | 4 to 1 | 144 | $2,000,000
IBM PureData System for Analytics N1001-010* | n/a | 112 | 4 to 1 | 128 | $1,599,000
Microsoft SQL Server 2012 Parallel Data Warehouse (Dell)* | 2,304 | 144 | 5 to 1 | 340 | $1,569,970
Oracle Exadata Database Machine X3-2* | 2,048 | 128 | 10 to 1 | 450 | $13,580,000
Teradata Data Warehouse Appliance 2690* | 768 | 96 | 4 to 1 | 146 | $1,168,000
Table 1. Comparison of hardware specifications for full-rack implementations of data warehousing appliances from several vendors
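
The value argument built on Table 1 rests on cost per unit of storage, which the table does not list directly. A minimal sketch of that calculation follows, dividing each list price by the compressed user storage figure from Table 1. The shortened appliance labels and the helper name price_per_tb are assumptions made for readability.

```python
# (list price in USD, compressed user storage in TB), taken from Table 1
APPLIANCES = {
    "EMC Greenplum DCA": (2_000_000, 144),
    "IBM PureData N1001-010": (1_599_000, 128),
    "Microsoft PDW (Dell)": (1_569_970, 340),
    "Oracle Exadata X3-2": (13_580_000, 450),
    "Teradata 2690": (1_168_000, 146),
}


def price_per_tb(appliances):
    """Return list price divided by compressed user storage for each appliance."""
    return {name: price / tb for name, (price, tb) in appliances.items()}


costs = price_per_tb(APPLIANCES)
for name, cost in sorted(costs.items(), key=lambda item: item[1]):
    print(f"{name}: ${cost:,.0f} per TB")
```

On these figures the Dell PDW configuration comes out lowest, at roughly $4,600 per TB of compressed user storage. Note that the vendors quote different compression ratios, so the comparison is only as good as those assumptions.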
Vendor | I/O Bandwidth (GB/sec) | Price per GB/sec of I/O Bandwidth
EMC | 24 | $83,333
Microsoft | 108 | $14,537
Oracle | 100 | $136,440
Table 2. Comparison of input/output (I/O) rates among three data warehouse appliances
Comparison of Data Warehousing Appliances

Within the playing field of data warehousing appliances, Microsoft makes essentially three pitches in favor of PDW: that it offers a great value, that it has excellent performance, and that it connects seamlessly to Apache Hadoop.

Table 1 compares hardware specifications for full-rack implementations of data warehousing appliances from various vendors.12 Table 2 compares input/output (I/O) rates among three data warehouse appliances.13

With respect to value, an advantage highlighted by Table 1 is that, compared to other solutions, the SQL Server 2012 PDW displays a low cost per unit storage. Microsoft is able to attain these cost reductions mainly by using direct-attached storage (DAS) with its nodes instead of storage area network (SAN) storage, an option made possible because of a Windows Server 2012 feature called Storage Spaces. Storage Spaces allows flexible SAN-like storage provisioning from a JBOD SAS array that is attached to one node only.

With respect to performance, Table 2 shows that the I/O throughput of PDW compares favorably with that of the EMC and Oracle solutions. (Data from IBM and Teradata are not available.) Microsoft claims PDW is also able to speed I/O performance (over 10 times) through the use of columnstore indexing and batch processing, both members of the xVelocity* family of memory-optimized technologies in SQL Server 2012.14

Regarding the integration of PDW and Apache Hadoop, Microsoft is careful not to claim that it is unique among data warehouses in offering this capability. In fact, all of the data warehouse appliance vendors mentioned in Table 1 have presented a product roadmap involving some integration with Apache Hadoop. Of these, however, the PolyBase roadmap is distinctive in its plan to deeply integrate Apache Hadoop processing with PDW processing.

The next section provides more detail about PolyBase and its product roadmap.
Big Data Integration – PolyBase

PolyBase is a PDW-only feature that provides a means to integrate Apache Hadoop data with SQL Server and to make this data accessible through T-SQL queries. The manner in which PolyBase integrates T-SQL with Apache Hadoop is illustrated in Figure 7.

Figure 7. PolyBase integration of T-SQL with Apache Hadoop

To achieve this integration, PDW must first be connected to an Apache Hadoop source. Administrators can then integrate the external Apache Hadoop data into SQL data on PDW by using either a CREATE EXTERNAL TABLE statement or a CREATE TABLE AS SELECT (CTAS) statement. Administrators can also push data from PDW to Apache Hadoop by means of a CREATE EXTERNAL TABLE AS SELECT (CETAS) statement.

CREATE EXTERNAL TABLE Statement

When an external table is created from Apache Hadoop data, PDW frames a SQL structure around the external data. Users can then query the external table as if it were a normal table residing in a SQL database. If the data is updated in the Apache Hadoop source, query results will show the updated data. However, query performance isn’t optimized. Figure 8 shows an example of a CREATE EXTERNAL TABLE statement that creates a table called ClickStream from an Apache Hadoop file called employee.tbl.

Figure 8. Example of a CREATE EXTERNAL TABLE statement from Apache Hadoop

CREATE TABLE AS SELECT Statement

The CTAS statement can be run after an external table is created. When a PDW administrator creates a table as a select statement from an external table, this external data is physically copied into a SQL table that resides in PDW. In this case, PDW can perform parallel processing on the remote Apache Hadoop data, and when the table is created, the administrator can optimize its storage in PDW by distributing it across nodes. The imported Apache Hadoop data then persists in PDW until the new table is deleted. Creating a table as a select statement optimizes query response times, but the imported data is not updated from its source if that source data should ever change.

The following example shows a basic CTAS statement:

CREATE TABLE ClickStream_PDW WITH
DISTRIBUTION = HASH(url)
AS SELECT url, event_date, user_IP FROM
ClickStream

Note that Apache Hadoop data does not need to persist as an isolated table. Imported data can also be mashed up with native relational data through JOIN statements.
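
The trade-off between the two import styles (an external table always reflects the source, while a CTAS copy is faster to query but goes stale) can be illustrated by analogy. The sketch below uses Python's built-in sqlite3 module, with a view standing in for the external table and CREATE TABLE ... AS SELECT standing in for the PDW CTAS statement. This is an analogy only: SQLite has no PolyBase, no DISTRIBUTION option, and no Hadoop connection, and the table name hadoop_file is hypothetical.

```python
import sqlite3


def live_versus_snapshot():
    """Return (rows seen through the view, rows in the copied table)."""
    conn = sqlite3.connect(":memory:")
    # Stand-in for the source data that lives outside the warehouse.
    conn.execute("CREATE TABLE hadoop_file (url TEXT, event_date TEXT, user_IP TEXT)")
    conn.execute("INSERT INTO hadoop_file VALUES ('/home', '2013-06-01', '10.0.0.1')")

    # Analogue of an external table: a SQL structure framed around the source.
    conn.execute("CREATE VIEW ClickStream AS SELECT * FROM hadoop_file")
    # Analogue of CTAS: the data is physically copied at creation time.
    conn.execute(
        "CREATE TABLE ClickStream_PDW AS "
        "SELECT url, event_date, user_IP FROM ClickStream"
    )

    # The source changes after the copy was made.
    conn.execute("INSERT INTO hadoop_file VALUES ('/cart', '2013-06-02', '10.0.0.2')")

    live = conn.execute("SELECT COUNT(*) FROM ClickStream").fetchone()[0]
    snapshot = conn.execute("SELECT COUNT(*) FROM ClickStream_PDW").fetchone()[0]
    conn.close()
    return live, snapshot


print(live_versus_snapshot())
```

The view reports both rows while the copied table still holds only the row that existed when it was created, mirroring the statement that CTAS data "is not updated from its source."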
Querying the Data

After data is imported into a table in PDW, users can perform ordinary T-SQL queries on it, as shown in the three examples in Figure 9.

Figure 9. Examples of T-SQL queries performed on data imported to a SQL Server 2012 Parallel Data Warehouse table

Pushing Data to Apache Hadoop from PDW

Finally, PDW administrators also have the option of migrating data from PDW to an Apache Hadoop source. To achieve this, a CETAS statement is used, as in the following example:

CREATE EXTERNAL TABLE ClickStream (url,
event_date, user_IP)
WITH (LOCATION = 'hdfs://MyHadoop:5000/users/outputDir',
FORMAT_OPTIONS (FIELD_TERMINATOR = '|'))
AS SELECT url, event_date, user_IP FROM ClickStream_PDW

Roadmap for PolyBase

Currently, PolyBase is in phase 1 of a multi-phase rollout. Phase 1 allows data to be imported directly from and exported directly to HDFS on Apache Hadoop. Because MapReduce is bypassed and parallel processing is used, performance for import and export operations is normally optimized.

Phase 2 goes beyond integrating Apache Hadoop data into PDW and will move toward integrating the processing power of Apache Hadoop clusters into PDW queries. This next phase will include a PDW query optimizer that, for all queries of both SQL and Apache Hadoop data, makes a cost-based decision about when to process queries with SQL and when to push queries onto HDFS data as MapReduce jobs.

The goals of PolyBase phase 3 have not been finalized, but Microsoft has publicly stated that it is considering compatibility with Apache Hadoop MapReduce 2.0 (YARN) and more efficient alternatives to MapReduce.

No dates have been given for the release of PolyBase phase 2 or phase 3.

Besides this roadmap for planned functionality in PolyBase, Microsoft has occasionally hinted that the technology will eventually be integrated into its SQL Server product, perhaps as soon as the next release (SQL Server 2014).

ETL in PDW

The Microsoft specifications for PDW do not include any ETL server, such as a dedicated instance of SQL Server loaded with SSIS. Both Dell and HP include SQL Server tools installed on the control node, but it is expected that many firms will use a pre-existing ETL server to connect to PDW.

Using SSIS packages to import data is sensible if these packages are already created. It should be noted, however, that in PDW, ordinary T-SQL queries offer much better performance as a way to import data.15

Microsoft’s Claims about PDW

This paper focuses on Microsoft’s comprehensive data strategy and how the various components of that strategy might work together. Although Microsoft makes claims about PDW that relate to its value and its performance, these claims do not relate to its big data strategy. One important claim that Microsoft is making about PDW, however, does relate to its comprehensive data strategy: that PDW integrates Apache Hadoop data with traditional relational data. We will look at two representative examples.
Claim: PolyBase for PDW provides “seamless integration of Apache Hadoop data with the data warehouse in a single query.”16

This claim essentially states that a single query executed against PDW will return both Apache Hadoop data and SQL data. The implication of the claim, along with the word “seamless,” is that all Apache Hadoop data will easily be brought into the SQL world and made accessible to all users through ordinary T-SQL statements.

The claim can be construed as true if it is limited to describing the availability of data that has already been imported from Apache Hadoop, but it is essentially misleading in describing this process as “seamless.” In truth, the only Apache Hadoop data that can be queried through SQL statements is data that an administrator has located and made the effort to import with CREATE EXTERNAL TABLE or CTAS statements. In addition, the only Apache Hadoop data that is capable of being imported is data that is semi-structured with delimiters such as commas. Most Apache Hadoop data, however, is not structured at all.

Although importing Apache Hadoop data into a SQL table is certainly a useful capability, this is not a particularly common use case for PolyBase. It is useful only for data that can fit easily into a table (such as log files) and whose location within the Apache Hadoop cluster is known. In the words of Yale researcher Daniel Abadi, “[PolyBase lets you] dynamically get at data in Hadoop/HDFS that could theoretically have been stored in the DBMS all along, but for some reason is being stored in Hadoop instead of the DBMS.”17 It should also be noted that other rival technologies offer a more seamless integration of SQL with Apache Hadoop, such as Hadapt Adaptive Analytical Platform* and the Hortonworks Stinger initiative.

Claim: “HDFS Bridge in PolyBase … enable[s] direct communication between HDFS data nodes and PDW compute nodes.”18

This particular claim is more conservative than the last. It states merely that instances of SQL Server in PDW can communicate directly with data nodes in HDFS through a PDW component called the HDFS Bridge. The implication of the claim is that IT personnel do not need to use additional tools (such as Hive) or write additional MapReduce scripts to import data from Apache Hadoop to PDW or export data from PDW to Apache Hadoop. SQL communicates with Apache Hadoop directly.

The claim offers a reasonable description of what PolyBase can do, and if anything, it might sell the technology a little short. PolyBase doesn’t merely allow users to bypass MapReduce and import and export data directly; it also allows PDW to use parallel processing when it performs queries on an Apache Hadoop cluster. On the other hand, the claim also hints at the lack of advanced integration between PDW and Apache Hadoop. The two technologies are connected only through a bridge, so importing and exporting data is required before data in one system can be accessed from the other.

Other companies are indeed working on solutions that dispense with the need for such a bridge. When assessing Microsoft’s comprehensive data vision, therefore, it’s important to recognize that PolyBase does not represent the sole cutting edge in the integration of SQL and Apache Hadoop.
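
The restriction to delimited, table-shaped data is easier to see with a concrete record. The sketch below parses pipe-delimited text (the same field terminator used in the CETAS example) into rows with the three ClickStream columns and rejects any line that does not fit that shape. It is a plain-Python illustration of why only delimiter-structured files map cleanly onto a table; it is not how PolyBase reads HDFS, and the function name and sample values are hypothetical.

```python
import csv
import io

COLUMNS = ("url", "event_date", "user_IP")


def parse_delimited(text, delimiter="|"):
    """Turn delimited text into a list of row dictionaries keyed by COLUMNS.

    Raises ValueError for any non-empty line whose field count does not
    match the expected table shape.
    """
    rows = []
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    for line_number, fields in enumerate(reader, start=1):
        if not fields:          # skip blank lines
            continue
        if len(fields) != len(COLUMNS):
            raise ValueError(
                f"line {line_number}: expected {len(COLUMNS)} fields, got {len(fields)}"
            )
        rows.append(dict(zip(COLUMNS, fields)))
    return rows


sample = "/home|2013-06-01|10.0.0.1\n/cart|2013-06-02|10.0.0.2\n"
for row in parse_delimited(sample):
    print(row)
```

Free-form text, images, or nested documents have no such fixed field count, which is the practical reason most Apache Hadoop data cannot simply be framed as an external table.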
Business Intelligence and Analytics
BI typically refers to tools used to collect, analyze, and view
enterprise data for the purpose of meeting business goals.
These three functions of collecting, analyzing, and viewing data
for Microsoft have traditionally been filled by SSIS, SQL Server
Analysis Services (SSAS), and SQL Server Reporting Services
(SSRS), respectively. Microsoft BI has traditionally relied heavily
on IT personnel, for example, to create packages in SSIS for
importing data, to develop online analytic processing (OLAP)
cubes in SSAS, and then to build reports in SSRS that are
finally delivered to end-users.
In the big data era, however, Microsoft has been expanding its vision of BI to include what it calls “managed self-service BI.” In this new vision, IT manages access to data sources, and end-users connect to these data sources as needed with client tools, most notably Excel. Users import data as tables into Excel and then shape and visualize the data as needed. Possible sources of enterprise data can still include databases, but they also now include Apache Hadoop and other sources, such as web pages and Open Data Protocol (OData) feeds. (OData is a data access protocol released under the Microsoft Open Specification Promise.) Excel in particular is able to achieve high performance when handling data sets and processing visualizations because it uses the xVelocity in-memory analytics engine for these purposes. (This in-memory engine was first available only in SQL Server 2008 R2.)

The next section describes some new Microsoft BI features available in Excel and some other tools that users can employ to connect to and manipulate big data.

Apache Hive* ODBC Driver

The Hive ODBC Driver is currently a critical piece of software in Microsoft’s all-data strategy. This driver allows Apache Hadoop data sets to be imported into SQL Server, Excel, and Analysis Services through HiveQL queries. Unlike PolyBase, which is available only in PDW, the Hive ODBC Driver connects to Apache Hadoop by allowing HiveQL queries to be translated to MapReduce jobs. Performance is much lower than with PolyBase. (Because PolyBase provides better functionality than the Hive ODBC Driver, there is no such driver for PolyBase.)

The Hive ODBC Driver is central to Microsoft big data because the company’s strategy does not provide any BI tools specifically for Apache Hadoop. Microsoft’s goal with big data and BI is merely to provide a method to import Apache Hadoop data into well-known Microsoft tools, where this data can be shaped, analyzed, and visualized just like any other data can. Figure 10 shows how the Hive ODBC driver (labeled “ODBC for Hive”) is used to connect Windows Azure HDInsight Service to Excel, SQL Server, and Analysis Services.
Figure 10. Connection of Windows Azure HDInsight Service to Excel, SQL Server, and Analysis Services through the Hive ODBC driver
PowerPivot
The PowerPivot add-in for Excel first appeared in Excel 2010,
allowing users to load large amounts of highly compressed
data into Excel from different sources, create relationships
within that data, and then perform analysis on the data. In
Excel 2013, much of this functionality is now built directly in.
Without installing the PowerPivot add-in, one can already
import large data sets (millions of rows) from multiple data
sources, create relationships between data from different
sources and between multiple tables in a PivotTable, create
implicit calculated fields, and manage data connections.
In Excel 2013, data is also now automatically loaded into the xVelocity in-memory analytics engine even before the PowerPivot add-in is installed.
When the PowerPivot add-in is installed in Excel 2013, more
advanced modeling capabilities become available, such as
the ability to filter and rename data as it is imported, to define
custom calculated fields throughout a workbook, to define key
performance indicators (KPIs) to use in PivotTables, and to use
the Data Analysis Expressions (DAX) expression language to
create advanced formulas.
Data imported into PowerPivot can originate from databases, like
SQL Server, IBM DB2*, and Oracle, or from other types of data
sources, like Apache Hadoop, OData feeds, reporting services
reports, and text files.19 Figure 11 shows the functions available in the PowerPivot ribbon in Excel 2013.

Figure 11. Excel 2013 PowerPivot ribbon

Power Query

Power Query is a new tool whose name has recently been updated from its preview name, Data Explorer. Power Query allows users to query external sources of data, such as the Internet in general or HDInsight, and import detected tabular data sets. Once imported, the data in the table can be modified, combined with other data, analyzed, and visualized by using other tools.

Figure 12 shows how Power Query can be used to import data from HDInsight and other external sources. Figure 13 shows a result when Power Query is used to perform an online search for “most populous metropolitan areas in North America.” When the data set is selected in the right column, a table containing the data set is automatically created.

Figure 12. Importing data from HDInsight to Excel 2013 Power Query

Figure 13. Excel 2013 Power Query results of an online search for “most populous metropolitan areas in North America”
Power View
Power View is a visualization tool that first appeared in
SharePoint 2010 and that is now also available in SQL Server
2012 SP1 and Excel 2013. Power View allows users to create
interactive charts and maps from tabular data in Excel and
then add them to a view or dashboard, as shown in Figure 14.
Figure 14. Excel 2013 Power View output of most populous metropolitan areas in North America
Power Map
Power Map is a new visualization tool that until recently was
known by its preview name, GeoFlow. Power Map provides
3D geographical visualizations of data superimposed on a
globe. Such data can range from remote sensor output to data
from Twitter. An important constraint, however, is that Power
Map can work only with data preformatted in a table and
cannot work with live or streaming data. Figure 15 shows an
example of a visualization created in Power Map.
Figure 15. Example of a visualization created in Excel 2013 Power Map
Microsoft’s Claims about BI
Microsoft’s claims about its BI tools within the context of its
big data strategy are essentially variations of one point, that
its BI tools bring big data to the masses. These claims can be
mostly accurate or mostly misleading, depending on how they
are phrased.
Claim: “HDInsight democratizes the power of big data BI.”20
This claim explicitly states that HDInsight itself, and not
some other component in the Microsoft big data strategy, is
democratizing big data BI. The implication of the claim is that
by opting for HDInsight over another Apache Hadoop solution,
firms will have an advantage in their ability to derive valuable
insights from their data.
In truth, Microsoft’s big data BI strategy is to bring Apache
Hadoop data into its suite of existing BI tools, and these tools
are almost completely agnostic about the particular source
of the Apache Hadoop data. Microsoft does not have any BI
solution for HDInsight in particular. It is true that, in Microsoft
Excel, importing data from an HDInsight account is arguably
easier than importing it from a generic Apache Hadoop file,
but at the moment this difference is negligible. Moreover, the
principal interface between Excel and Apache Hadoop data,
the Hive ODBC Driver, is backend agnostic, meaning that it
could draw its data from a competing Apache Hadoop
distribution as easily as from HDInsight.
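The backend-agnostic point can be illustrated with a minimal Python sketch. It does not use the Hive ODBC Driver or open any real connection: the HiveBackend interface, the FakeBackend stand-in, the import_table helper, the query text, and the placeholder rows are all hypothetical and exist only for illustration. The sketch shows a client-side import step that depends only on the ability to run a HiveQL query, so the same call works unchanged whichever Apache Hadoop distribution answers it.

```python
from __future__ import annotations

from typing import Protocol


class HiveBackend(Protocol):
    """Hypothetical interface: anything that can run HiveQL and return rows."""

    def run(self, hiveql: str) -> list[tuple]: ...


class FakeBackend:
    """Stand-in for a real cluster; records each query and returns canned rows."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.queries: list[str] = []

    def run(self, hiveql: str) -> list[tuple]:
        self.queries.append(hiveql)
        # Placeholder rows for illustration only, not real data.
        return [("row-1", 1), ("row-2", 2)]


def import_table(backend: HiveBackend, hiveql: str) -> list[tuple]:
    """Client-side import step; it never inspects which backend it was given."""
    return backend.run(hiveql)


# Hypothetical query; the table name is an assumption.
QUERY = "SELECT * FROM sample_table LIMIT 2"

hdinsight = FakeBackend("HDInsight")
other = FakeBackend("another Apache Hadoop distribution")

rows_a = import_table(hdinsight, QUERY)
rows_b = import_table(other, QUERY)
```

In this sketch, swapping one backend for the other changes nothing in import_table or in the query text, which is the sense in which the interface, rather than HDInsight itself, carries the BI workflow.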
Over time, it is plausible that Microsoft will continue to develop
HDInsight and Excel in a way that optimizes this connection far
more, but for now, it is not accurate to suggest that HDInsight
itself lowers the barrier to entry for big data BI.
Claim: “[Microsoft lets you] analyze big data with familiar tools.”21
If Microsoft Excel can be considered a familiar tool, then this
claim is accurate. In Excel, information workers can perform a
HiveQL query to import over a million rows of Apache Hadoop
data, clean the data so that it fits into a table, and then analyze
this data with advanced tools such as DAX statements.
(Note that professional data analysts can also import Apache
Hadoop data into the tools they are used to, such as SSAS,
and perform the same analytics on this data as they could with
any data that is originally in a static, tabular format.)
However, there are some important caveats to keep in mind
about using “familiar tools” to analyze big data. First, one
can’t import all Apache Hadoop data into Excel (or SSAS).
Users can only import data that lends itself to being shaped
into a table, such as comma-delimited files and other forms
of semi-structured data. Microsoft is in fact heavily promoting
semi-structured data as the type of Apache Hadoop data it
can handle with its existing tools, but semi-structured data
represents only a small percentage of the data that is kept in
Apache Hadoop clusters. Second, the fact that one can use
Excel to perform analytics on big data does not mean that
this task is in any way easy to perform. The ability to import
the useful data through HiveQL, fashion this data into a clean
table filled with the right information, and then perform the right
analytics in a way that yields valuable insights is a set of skills
reserved for specialists such as data analysts familiar with tools
such as the DAX scripting language.
It is true that Microsoft has lowered the barrier to entry for
reading semi-structured data stored in Apache Hadoop and
especially for creating visualizations of tabular data. It is likely
that in future releases of Excel and HDInsight, Microsoft will
continue to make these processes gradually easier. However, it
is unlikely that Microsoft will succeed in bringing true analytics
(as opposed to mere visualizations) to the masses, whether
for structured data or unstructured data. True data analysis
that is capable of revealing valuable and non-obvious insights,
after all, is a discipline that requires specialized mathematical
and statistical skills that go beyond simple familiarity with a
given scripting interface or software tool.
Conclusion
The main products Microsoft includes in its “all-data” vision
(HDInsight, PDW, and Excel) comprise a compelling and
comprehensive set of features, albeit with some significant
limitations. Even with these limitations, however, this vision
offers a unique take on big data that is not available through
other vendors.
On the positive side, Microsoft offers the many firms that are
already heavily invested in Microsoft products a way to ease
into big data with minimal adjustment. For example, HDInsight
will be manageable from System Center and increasingly
integrated into Active Directory Domain Services, reducing
administrative overhead compared to other Apache Hadoop
solutions. Furthermore, companies planning to deploy future
releases of SQL Server will find that this product is likely to
include PolyBase, and by extension, a T-SQL connection to
Apache Hadoop. Existing BI expertise in Excel, meanwhile,
can be used by connecting to both Apache Hadoop and
relational data sources. Finally, for the many organizations
already moving their servers or data into Windows Azure,
Windows Azure HDInsight Service offers an attractive option
because it can connect directly to data stored in Windows
Azure blob storage.
Another legitimate advantage of Microsoft’s vision is ease of
implementation, and to a lesser degree, ease of use. Spinning
up a data cluster in HDInsight Service is decidedly easier
than creating a physical Apache Hadoop cluster on premises,
even if uploading big data sets to the cloud can be extremely
time-consuming. For ease of use, the .NET SDK and JavaScript
libraries for HDInsight, as well as the interactive JavaScript
console, do make programming for and interacting with
Apache Hadoop somewhat easier than with other Apache
Hadoop distributions.
All of this said, although the components of Microsoft’s big
data offerings are compatible with each other, they are not truly
integrated into a single solution and do not yet achieve any
significant synergistic effects when used together.
Another significant limitation to Microsoft’s big data
strategy is that while Apache Hadoop is rapidly improving in
performance, extensibility, and ease of use, Microsoft has not
yet proven that it can keep pace with these changes. By the
time HDInsight is officially released, HDInsight might in fact
already be outdated, if not far surpassed, by superior Apache
Hadoop-based alternatives.
The BI portion of Microsoft’s all-data vision does not yet
fully live up to claims of democratizing big data. Microsoft’s
solution manages to simplify relatively easy portions of dealing
with big data, such as server-cluster installation or
semi-structured data visualization. Benefits such as these provide
evolutionary business value but still leave fundamentally
difficult aspects of working with big data (such as crafting
queries for unstructured data, sanitizing data for visualization,
and implementing effective machine-learning algorithms)
unaddressed. Until Microsoft finds ways to help non-specialist
information workers ask the right questions of any kind of data,
the company will not truly live up to its claim of bringing big
data to the masses.
But perhaps the most damaging critique one can level against
Microsoft’s all-data solution is that it is currently little more
than a marketing vision (though a compelling one). In reality,
HDInsight is now functional, but only as a preview, and the
other main components of the vision, PDW and Excel, are
essentially pre-existing products that have been repositioned
as critical components of a grand big data marketing strategy.
Today, what Microsoft is truly offering is merely the promise
of an all-data solution, a promise that might or might not be
realized soon.
None of these limitations can negate the fact that the unusual
breadth of Microsoft’s product line across Apache Hadoop,
relational databases, data warehousing, spreadsheets, the
cloud, business intelligence, and server administration makes
the company a unique contender among big data vendors.
Microsoft has more of the components for a comprehensive
data solution either in place or credibly maturing than any
other company. Moreover, Microsoft provides a vision of
data management and analysis that could be revolutionary
if fully realized. However, it is important to underscore that
Microsoft’s all-data vision is currently just that: a vision. Those
waiting for the Microsoft all-data solution will have to be
patient and wait to see whether its promise matures into the
truly comprehensive, integrated, and synergistic suite of data
management and analysis products currently promised.
Notes
1. Microsoft. “Microsoft Big Data.” http://www.microsoft.com/en-us/sqlserver/solutions-technologies/business-intelligence/big-data.aspx.
2. Microsoft. “From Data to Insights with Microsoft Big Data.”
http://download.microsoft.com/download/6/E/3/6E335796-3003-4B2D-BB55-0A33E003F879/Microsoft_Big_Data_Booklet.pdf.
3. Microsoft. “What Version of Hadoop Is in Windows Azure HDInsight?” http://www.windowsazure.com/en-us/manage/services/hdinsight/howto-hadoop-version/.
4. Apache. “Hadoop Releases.” http://hadoop.apache.org/releases.html#download.
5. Hortonworks. “Hortonworks Data Platform (HDP).” http://hortonworks.com/products/hdp/.
6. Hortonworks. “HDP for Windows.” http://hortonworks.com/products/hdp-windows/.
7. Microsoft. “Microsoft Big Data.” http://www.microsoft.com/en-us/sqlserver/solutions-technologies/business-intelligence/big-data.aspx.
8. Windows Azure Storage Team Blog. “Windows Azure Storage BUILD Talk – What’s Coming, Best Practices and Internals.”
http://blogs.msdn.com/b/windowsazurestorage/archive/2013/06/28/windows-azure-storage-build-talk-what-s-coming-best-practices-and-internals.aspx.
9. Microsoft. “Microsoft Big Data.” http://www.microsoft.com/en-us/sqlserver/solutions-technologies/business-intelligence/big-data.aspx.
10. Microsoft. “Microsoft Big Data.” http://www.microsoft.com/en-us/sqlserver/solutions-technologies/business-intelligence/big-data.aspx.
11. Microsoft. “Microsoft Big Data.” http://www.microsoft.com/en-us/sqlserver/solutions-technologies/business-intelligence/big-data.aspx.
12. Value Prism Consulting. “Microsoft’s SQL Server Parallel Data Warehouse Provides High Performance and Great Value.” March 2013.
http://www.valueprism.com/resources/resources/Resources/PDW%20Compete%20Pricing%20FINAL.pdf.
13. Value Prism Consulting. “Microsoft’s SQL Server Parallel Data Warehouse Provides High Performance and Great Value.” March 2013.
http://www.valueprism.com/resources/resources/Resources/PDW%20Compete%20Pricing%20FINAL.pdf.
14. For more information about columnstore indexing, see http://social.technet.microsoft.com/wiki/contents/articles/3540.sql-server-columnstore-index-faq.aspx.
For more information about other technologies in xVelocity, see http://technet.microsoft.com/en-us/library/hh922900.aspx.
15. Data Warehouse Junkie. “Rock Your Data with SQL Server 2012 Parallel Data Warehouse (PDW) – POC Experiences.” June 2013.
http://dwjunkie.wordpress.com/2013/06/27/rock-your-data-with-sql-server-2012-parallel-data-warehouse-pdw-poc-experiences/.
16. Microsoft. “Appliance: Parallel Data Warehouse (PDW).” http://www.microsoft.com/en-us/sqlserver/solutions-technologies/data-warehousing/pdw.aspx.
17. Monash Research. “SQL-Hadoop Architectures Compared.” June 2013. http://www.dbms2.com/2013/06/02/sql-hadoop-architectures-compared/.
18. Microsoft. “PolyBase.” http://www.microsoft.com/en-us/sqlserver/solutions-technologies/data-warehousing/polybase.aspx.
19. The complete list of supported data sources for PowerPivot can be found at
http://office.microsoft.com/en-us/excel-help/get-data-using-the-powerpivot-add-in-HA102836921.aspx.
20. Microsoft. “Business Insights Newsletter Article.” December 2012.
http://www.microsoft.com/enterprise/newsletter/bi-newsletter/articles/december2012.aspx#fbid=f36OYx4e5uq.
21. Microsoft. “Microsoft Big Data.”
http://www.microsoft.com/en-us/sqlserver/solutions-technologies/business-intelligence/big-data.aspx.
The analysis in this document was done by Prowess Consulting and derived from work done with Intel.
Results have been simulated and are provided for informational purposes only. Any difference in system hardware or software design or
configuration may affect actual performance.
Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries.
*Other names and brands may be claimed as the property of others.
Prowess, the Prowess logo, and SmartDeploy are trademarks of Prowess Consulting, LLC.
Copyright © 2014 Prowess Consulting, LLC. All rights reserved.
Other trademarks are the property of their respective owners.