Microsoft Big Data and Analytics

Executive Summary

Microsoft has established a firm foothold in the world of traditionally structured data with Microsoft SQL Server* and an even firmer foothold in the world of data analysis with tools such as Microsoft Excel*. However, the big data era requires solutions to store, query, and analyze data beyond that which is traditionally structured in relational databases or spreadsheets. Microsoft has responded to this big data challenge not only by offering a new big data solution, but also by describing a broad solution for comprehensive data management and analysis that is supported by a combination of new and old Microsoft products.

The big data trend in recent years has been largely driven by the popular, open-source software framework of Apache Hadoop*. Apache Hadoop allows massive amounts of data that is not structured into relational databases to be stored in clusters of commodity servers and then analyzed for correlations, trends, and other potentially valuable information. So popular has Apache Hadoop become as a big data solution that to many, the terms “big data” and Apache Hadoop have become synonymous.

Microsoft is offering an Apache Hadoop component with Microsoft HDInsight*, a set of services built on Hortonworks Data Platform* (HDP*) for Windows*. More specifically, HDInsight can refer to either of two separate Microsoft products, both still in preview and months away from general release: HDInsight Server, an on-premises solution, and Windows Azure HDInsight Service*, a completely cloud-based solution.

Although Microsoft does offer these two new Apache Hadoop products for storing and mining both semi-structured and unstructured data, the company has also been keen to steer the big data conversation away from the need for big data solutions per se and toward the need for a universal data management and analysis solution. Until recently, in fact, Microsoft used the term “big data” to refer to this universal vision, but its most recent messaging makes a distinction between “big data” of Apache Hadoop and other forms of data. Microsoft’s broader vision is supported in part by Microsoft SQL Server 2012 Parallel Data Warehouse* (PDW), which is a data-warehouse hardware appliance that stores only structured data but that also supports queries of both structured and unstructured data through Microsoft’s proprietary PolyBase technology. Microsoft also positions SQL Server Analysis Services (SSAS), Excel, and Microsoft SharePoint Server* as part of its “all data” tool set, along with optional analysis add-ons for Microsoft Office* such as PowerPivot, Power View, Power Map, and Power Query.

Contents

Executive Summary
Evaluating the Microsoft Data Platform
  Is Microsoft Really Democratizing Big Data?
  Does Microsoft Offer a Truly Comprehensive Data-Management Solution?
  Conclusion
Microsoft’s Big Data Vision
Microsoft’s General Claims about its Comprehensive Data Solution
  Claim: “The Microsoft big data solution offers an integrated platform for managing data of any type or size.”
  Claim: “Microsoft’s big data solution gives you the power to … enable anyone in your organization to easily glean insight from your data so they can make smarter decisions.”
Microsoft HDInsight*: Microsoft’s Apache Hadoop* Solution
  Creating HDInsight Service Clusters
  HDInsight Storage Options
  HDInsight Management
  Getting Data in and out of HDInsight
  Technical Notes about HDInsight
  Microsoft’s Claims about HDInsight
    Claim: “[HDInsight lets you] accelerate the deployment with the cloud by deploying an Apache Hadoop cluster on Windows Azure* in just 10 minutes.”
    Claim: “Microsoft simplifies programming on Apache Hadoop.”
    Claim: “[Microsoft big data lets you] seamlessly extend privileges across HDInsight with Active Directory*.”
    Claim: “HDInsight is 100% compatible with Apache Hadoop.”
SQL Server 2012* Parallel Data Warehouse: An (Almost) All-in-One Data Solution
  PDW Hardware Specifications
    Dell Parallel Data Warehouse Appliance
    HP AppSystem for Microsoft SQL Server 2012 Parallel Data Warehouse
  How PDW Works
  Comparison of Data Warehousing Appliances
  Big Data Integration – PolyBase
    CREATE EXTERNAL TABLE Statement
    CREATE TABLE AS SELECT Statement
    Querying the Data
    Pushing Data to Apache Hadoop from PDW
    Roadmap for PolyBase
  ETL in PDW
  Microsoft’s Claims about PDW
    Claim: PolyBase for PDW provides “seamless integration of Apache Hadoop data with the data warehouse in a single query.”
    Claim: “HDFS Bridge in PolyBase … enable[s] direct communication between HDFS data nodes and PDW compute nodes.”
Business Intelligence and Analytics
  Apache Hive* ODBC Driver
  PowerPivot
  Power Query
  Power View
  Power Map
  Microsoft’s Claims about BI
    Claim: “HDInsight democratizes the power of big data BI.”
    Claim: “[Microsoft lets you] analyze big data with familiar tools.”
Conclusion
Notes

Evaluating the Microsoft Data Platform

Microsoft makes two alluring pitches for its suite of data products. The first is that its solution can bring the power of big data to the masses, making queries easier to submit and data easier to analyze with tools that are already ubiquitous. The second claim is that the Microsoft solution offers a single, comprehensive solution to manage all enterprise data, regardless of size, structure, or speed.

Is Microsoft Really Democratizing Big Data?

Despite the near-exuberant rhetoric about bringing big data analysis to the masses, Microsoft’s progress on this count has been somewhat modest. Microsoft is indeed lowering the barrier to entry for big data, but only incrementally. Its clearest success along these lines comes in deployment and management. Whether on premises or in the cloud, HDInsight is easy to set up and manage compared to other big data solutions, especially for IT personnel who lack Linux* expertise. This solid innovation, however, does not simplify deriving value from data in the cluster. For this ultimate purpose, HDInsight only modestly reduces the difficulty of searching, analyzing, and mining Apache Hadoop data compared to other Apache Hadoop solutions. Microsoft’s unique contribution toward simplifying data mining from Apache Hadoop clusters is to offer a set of programming libraries that allows programmers to run operations against Apache Hadoop data in simpler programming languages, such as JavaScript* and .NET* languages such as C# and F#. It also offers an interactive JavaScript console that allows programmers to run JavaScript commands against data in Apache Hadoop files one line of code at a time. As a comparison, the classic “WordCount” program in Apache Hadoop requires approximately 60 lines of code in Java*, but only 15 in JavaScript. Such advancements will allow more people to gain insights from data stored in Apache Hadoop files, but that wider group must still be programmers.

One area where Microsoft is truly democratizing data analysis and visualization is on the client end, in Excel. Excel has the ability to take data stored in individual Apache Hadoop files, run traditional database queries against this data, perform analysis on tables in this data, and finally present this data in impressive visualizations that can provide valuable insights. However, it is essential to understand first that Excel can load data from any Apache Hadoop source, not just from HDInsight. Excel allows users to import Apache Hadoop data from any source by means of a special add-on driver (the Apache Hive* Open Database Connectivity [ODBC] driver from Microsoft). As much as Microsoft is attempting to connect Excel and HDInsight as part of a single solution, there is no substantial advantage to choosing HDInsight as the particular backend source of Apache Hadoop data in Excel. Moreover, Excel does not allow users to perform complex operations, such as machine learning, that analyze or mine vast amounts of data from an Apache Hadoop cluster in the way the term “big data” suggests. With Excel, information workers can merely import individual Apache Hadoop files and perform analyses and visualizations on tables stored in these files.

Does Microsoft Offer a Truly Comprehensive Data-Management Solution?

The closest Microsoft comes to a comprehensive data solution today is with its PDW hardware appliance, which includes SQL Server 2012 to store structured data and which can also connect to Apache Hadoop data from an external source. PDW thus enables unified access to both structured and unstructured data. However, PDW does not currently favor any particular Apache Hadoop solution as the external source of unstructured data, making it a big data solution far from specific to Microsoft. Microsoft’s current big data solution is also limited in that none of its components can handle streaming unstructured data, such as from social media or user clickstreams.

PDW might have clear limitations today, but in the future the appliance is likely to fulfill Microsoft’s promise of delivering truly comprehensive data management on-premises, if at a high price. This comprehensive vision is set to be realized with the next version of PDW, which will likely include a pre-installed version of HDInsight Server (at least as an option). The ability to perform real-time queries of unstructured data streams is also likely to be incorporated into future versions of HDInsight, making Microsoft’s data-handling capabilities truly comprehensive. Microsoft has not confirmed that it will include PolyBase in a future, general release of SQL Server outside of PDW, but such a move is also plausible. Adding PolyBase broadly to SQL Server would bring the capabilities of handling structured and unstructured data to the wider database market.

A cloud-based, on-demand solution that meets Microsoft’s promise of comprehensive data management is also eventually likely to arrive. Windows Azure* already allows users to create, store, and manage databases in Windows Azure SQL Database* online, so all of the components of a comprehensive data solution will be available through Windows Azure when Windows Azure HDInsight Service matures. It is unclear if a wider distribution of PolyBase to SQL Server would extend to a cloud-based version, however. Even if it does, the bottleneck of upload speeds on large, proprietary data sets could limit the usefulness of the cloud-only option for some data-heavy firms. However, the cloud-only solution will present an attractive option for firms that generate data online or that work with public data files.

Conclusion

Microsoft provides a vision for big data within a larger context of all data, structured and unstructured. While this vision is tantalizing for the future, it ultimately lacks substance today. Democratizing big data would hold some of the same revolutionary promise that personal computing and later the Internet realized in the last three decades, yet it is far from clear that Microsoft will ultimately consummate this revolution. PolyBase shows potential for managing and analyzing structured and semi-structured enterprise data by using familiar database skills, but it is currently only available in a high-end data-warehouse appliance. Using Excel as a frontend for big-data analysis is another alluring vision, but it too is limited to dealing with structured and semi-structured data. Moreover, if Excel continues to be agnostic about the big-data backend supporting it, it does not provide an argument for companies to pick HDInsight over any other Apache Hadoop solution. Most fundamentally, Microsoft’s solution for unstructured big data is still not released, and it will be a matter of time before general usage can truly reveal its strengths and its faults.

Despite these reservations, there are reasons to be optimistic about Microsoft’s chances of bringing big data to the masses in the future. Compared to other companies, Microsoft has more of the components in place for a comprehensive data solution, including popular database management software in SQL Server, a rapidly maturing cloud provider in Windows Azure, widely used business intelligence tools, and the resources to invest in this comprehensive vision for the long term.

Microsoft’s Big Data Vision

Microsoft is currently developing a big data solution whose main components are likely to be released over the next year. These products have not yet been finalized, but their features have been made public, and Microsoft’s own statements about its soon-to-be-released big data tools provide insight not only into the company’s big data strategy, but also into its broader data strategy in general. This paper provides an overview of this broader strategy and an analysis of Microsoft’s big data claims.

Big data as a trend relies technically on the open-source software framework of Apache Hadoop.
Originally created at Yahoo!, Apache Hadoop allows nearly unlimited amounts of unstructured or semi-structured data (such as is found in log files) to be stored in clusters of inexpensive servers and then analyzed for correlations, trends, causal relationships, and other insights. Apache Hadoop has become the industry standard for big data, and for many, the terms “big data” and Apache Hadoop have become synonymous. For many companies selling a big data solution, the conversation about big data begins and ends with Apache Hadoop.

Microsoft’s vision of big data differs from many others in that it has publicly positioned Apache Hadoop as only a component of a more comprehensive data strategy. This comprehensive strategy includes not only the unstructured and semi-structured data that are the accepted mainstays of big data, but also data that is structured (such as into traditional database tables in a data warehouse), along with the business intelligence tools used to analyze all data, whether unstructured, semi-structured, or structured. This broader “all data” vision allows Microsoft to draw into the big data conversation the company’s existing strengths in products such as SQL Server and Excel. By re-imagining the business staples of SQL Server and Excel as having a role in a big data solution, Microsoft is targeting its suite of big data products toward the many businesses that have already invested heavily in these tools and accumulated large amounts of potentially useful data in them. Microsoft is also targeting the many companies that have high skills in common software tools but that lack the specialized knowledge reserved for data scientists and pure Apache Hadoop experts.

The most central component of Microsoft’s big data strategy is provided by HDInsight, an Apache Hadoop solution built from a particular Apache Hadoop distribution, namely HDP for Windows. (HDP for Windows, developed by Hortonworks, Inc., is in fact the first distribution of Apache Hadoop that runs natively on Windows, and it is already publicly available as a free tool.)
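The map-and-reduce pattern that Apache Hadoop distributes across such clusters can be pictured with the classic word-count job discussed elsewhere in this paper. The sketch below is plain, single-machine Python rather than Hadoop’s Java API or Microsoft’s JavaScript libraries, and the sample input is invented; it only illustrates the map, shuffle, and reduce phases that Hadoop would run in parallel across cluster nodes.

```python
from collections import defaultdict

def map_phase(line):
    # Emit a (word, 1) pair for every word in one line of input.
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    # Group emitted pairs by key; Hadoop performs this between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Sum the counts collected for each word.
    return {word: sum(counts) for word, counts in groups.items()}

def word_count(lines):
    pairs = []
    for line in lines:  # in Hadoop, input lines are split across many mappers
        pairs.extend(map_phase(line))
    return reduce_phase(shuffle(pairs))

if __name__ == "__main__":
    log = ["big data big clusters", "big insight"]
    print(word_count(log))  # {'big': 3, 'data': 1, 'clusters': 1, 'insight': 1}
```

On a real cluster, the map and reduce functions would be the only user-written pieces; the framework handles splitting the input, shuffling, and fault tolerance, which is where most of the 60 lines of a Java WordCount go.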
HDInsight can actually refer to either of two separate Apache Hadoop products, both still available only as preview versions: HDInsight Server, an on-premises solution, and Windows Azure HDInsight Service, a completely cloud-based solution. Both of these options are touted as versions of Apache Hadoop that are easier to set up and use than are the Apache Hadoop products offered by competitors. More recently, Microsoft has also described HDInsight as a solution for analyzing data that is semi-structured in particular, such as data sourced from smartphones, web sites, RFID tags, and Twitter feeds. Microsoft has also hinted that its search engine technologies, Bing* and Microsoft FAST Search, will act as the solutions to interact with completely unstructured data, such as documents. Code from both products was in fact incorporated into the search function in Microsoft SharePoint 2013*. However, Microsoft has not elaborated on the particular role it sees for its search engines within its comprehensive data strategy.

A second cornerstone of Microsoft’s big data vision is SQL Server 2012 PDW, a hardware appliance that supports queries of data stored both in SQL tables and Apache Hadoop files through Microsoft’s proprietary PolyBase technology. PDW is already available at a price of approximately $1.5M. (Note that data warehouses commonly cost as much as $30M, so while high, the cost of PDW is actually low relative to that of the competition.)

The third, and currently final, component of Microsoft’s big data solution is its business intelligence (BI) and visualization tools. These tools include Excel most importantly, but also Microsoft SharePoint Server and Microsoft Office 365*, along with optional analysis add-ons such as PowerPivot, Power View, Power Map, and Power Query.

Future components might be added to this suite of big data products as they become available. For example, the next version of SQL Server will include an in-memory online transaction processing (OLTP) engine, currently code-named Hekaton. Hekaton will allow any new products based on it to efficiently process data captured in real time, such as from data streams. It is plausible that Microsoft’s big data strategy will eventually reflect this new functionality provided by Hekaton and include a real-time data analysis tool.

Although Microsoft is describing these various products as components of an integrated big data solution, they do not function cohesively today. It is more accurate to view these components as a list of separate tools that might slowly become integrated over time. Another limitation to keep in mind about Microsoft’s big data solution is that its central component, HDInsight, is still a work in progress and many months away from release. Moreover, there is even a question about whether HDInsight will be outdated when it finally is released. HDInsight is currently based on HDP for Windows 1.x, which in turn is based on the Linux-exclusive Apache Hadoop 1.0. The next version of Apache Hadoop based on Linux, version 2.0, is currently in community preview and is scheduled for general release in late summer 2013; it offers an architectural overhaul that promises to dramatically improve performance and extensibility. Hortonworks’s port of Apache Hadoop 2.0 to HDP for Windows 2.0 is currently being targeted for late 2013. Any future version of HDInsight that incorporates the updates in Apache Hadoop 2.0 can only be built after HDP for Windows 2.0 is finalized in late 2013.

Microsoft’s General Claims about its Comprehensive Data Solution

Microsoft’s claims surrounding its all-data solution fall into three broad categories: that Microsoft provides a platform to manage data of any type and size, that the Microsoft solution provides a way to analyze all data, and that the Microsoft solution enables information worker generalists to glean insights from big data. While these claims are generally accurate, careful examination of each claim yields a more nuanced picture.

Claim: “The Microsoft big data solution offers an integrated platform for managing data of any type or size.”1

In discussing its comprehensive data solution, Microsoft places data into two broad categories for management: structured data (managed by SQL Server) and semi- and unstructured data (managed by HDInsight). The fact that Microsoft points to two products actually hints at the lack of an integrated platform for data management: data for SQL Server and Apache Hadoop are not integrated into a single platform. Even within each discrete product, data is not necessarily integrated. On the one hand, it is true that SQL Server is the management tool for structured data. On the other hand, managing data in HDInsight is more complex. For companies choosing the cloud-based Windows Azure HDInsight Service as their Apache Hadoop option, both semi-structured and unstructured data are likely to be stored and managed in Windows Azure blob storage. For firms choosing the on-premises HDInsight Server option, semi-structured and unstructured data are likely to be managed in separate locations. Semi-structured data will likely be stored and managed in the Apache Hadoop Distributed File System (HDFS). Unstructured data, such as documents, spreadsheets, presentations, videos, and audio recordings, will likely be managed not in Apache Hadoop but in SharePoint, in a Microsoft product-centric IT deployment.
The technology that currently comes closest to realizing the claim of an integrated platform is PolyBase. PolyBase should not be viewed as a silver bullet, however. Beyond being currently locked away in a specialized, expensive data-warehouse appliance, it is unclear to what extent it will integrate with Microsoft’s principal tool for unstructured data querying, Bing, or with Microsoft FAST Search for queries in SharePoint. As with so many other aspects of the Microsoft data vision, time alone will tell how and to what degree organizations can implement them.

In general, Microsoft does not currently offer a comprehensive data management solution but a set of tools and products that allows organizations to handle structured, semi-structured, and unstructured data.

Claim: “Microsoft’s big data solution gives you the power to … enable anyone in your organization to easily glean insight from your data so they can make smarter decisions.”2

This claim exaggerates the democratizing power of the Microsoft big data solution. Microsoft’s integration of ubiquitous and well-understood tools for big data analytics (particularly Excel) should not be confused with making big data queries and analysis inherently easier. Using laymen’s tools for big data work is not the same as putting big data insights within reach of all laymen. That said, this represents a key part of Microsoft’s competitive advantage in the big data arena, particularly with the saturation of Excel in the enterprise productivity market. Many more knowledge workers are familiar with Excel than with even SQL queries, for example, opening up direct examination of big data sets to a larger pool of analysts who previously had to work through middlemen like data scientists. Moreover, Excel add-ins such as PowerPivot, Power View, Power Map, and Power Query definitively put more analytical power in the hands of end users than before.

IT organizations looking at these solutions, however, should keep their eyes wide open for the behind-the-scenes work that can go into preparing data sets for wider use within a company. A sample data set of electrical usage of households in two Dallas suburbs used to demonstrate Power Map and Power View in Excel provides a telling example. The Microsoft team loaded Dallas County Appraisal District flat-file records into SQL Server, converted geographical coordinates within them from a planar to an ellipsoid projection with a third-party tool, and calculated the centroid of each land parcel in SQL Server to obtain a longitude and latitude figure for each plot before exporting the data to Excel. (All of this before adding details to the data set, such as simulated rates of electricity usage.) The result was a rich data set that could be dissected by information workers across a variety of dimensions, including time. The route to get there was anything but trivial, however.

Microsoft HDInsight*: Microsoft’s Apache Hadoop* Solution

HDInsight is the brand Microsoft has assigned to its two upcoming Apache Hadoop products: the cloud-based Windows Azure HDInsight Service and the on-premises HDInsight Server. Both of these solutions are built from a core of HDP for Windows. HDInsight in both cases thus refers to a product composed of this basic Hortonworks Apache Hadoop distribution in addition to extensive software customizations added by Microsoft. (HDInsight and HDP for Windows do not, in other words, refer to distinct components that communicate with each other.)

Of the two versions of HDInsight, Microsoft has promoted the cloud-based Windows Azure HDInsight Service to a much greater degree. This product, hosted on Windows Azure, is also expected to be released first, most likely in Q4 2013. The emphasis on the cloud-based HDInsight suggests that this version of the product aligns more closely with Microsoft’s chosen market positioning for HDInsight in general.
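The centroid step in the Dallas County data-preparation example above can be pictured with a small sketch. This is hypothetical Python, not the SQL Server spatial routine the Microsoft team actually used, and it assumes the parcel coordinates have already been converted to a suitable projection; it computes the area-weighted centroid of a simple polygon.

```python
def parcel_centroid(vertices):
    """Area-weighted centroid of a simple polygon given as (x, y) pairs.

    Toy stand-in for the geometry step described above; `vertices`
    is a hypothetical list of parcel corner coordinates.
    """
    area2 = 0.0  # twice the signed polygon area
    cx = cy = 0.0
    n = len(vertices)
    for i in range(n):
        x0, y0 = vertices[i]
        x1, y1 = vertices[(i + 1) % n]
        cross = x0 * y1 - x1 * y0  # shoelace term for this edge
        area2 += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    return cx / (3.0 * area2), cy / (3.0 * area2)

# A square parcel: the centroid falls at its center.
print(parcel_centroid([(0, 0), (2, 0), (2, 2), (0, 2)]))  # (1.0, 1.0)
```

The point of the sketch is not the formula itself but how much of this kind of computation happens before any “self-service” visualization reaches an information worker.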
The HDInsight Service web page (found at http://www.windowsazure.com/en-us/services/hdinsight/) describes the service by featuring words and phrases such as “gain insight from any data, any size, anywhere,” “provides simplicity,” “ease of management,” “simplicity of Windows Azure,” “simple and straightforward,” “seamless scale,” “quickly create,” “cost savings only possible on a cloud environment,” “glean insights on all your data with familiar tools,” and “analyze all your data easily.” The messaging is clear: HDInsight Service is simple, cost-efficient, and takes advantage of existing knowledge.

Simple as it might be, HDInsight Service is not the only cloud-based Apache Hadoop solution. Other such products include Amazon’s Elastic MapReduce*, Joyent Solution for Apache Hadoop*, and InfoChimps Cloud::Hadoop*. Microsoft’s offering differs from these others most obviously in that it runs on Windows and that it is integrated into the Windows Azure platform.

Creating HDInsight Service Clusters

Clusters created in HDInsight are intended to be disposable as a way to minimize costs. HDInsight was designed with the expectation that users will create an HDInsight cluster, load the data needed, run the analyses desired, and then destroy the cluster.

HDInsight promises to be simple, and as far as the procedure to create a new cluster is concerned, it lives up to this promise. With the “Quick Create” option in particular, the user merely chooses the cluster size (as defined by the number of nodes) and then assigns a name, password, and storage account for the cluster. Once the user clicks the option to create the cluster, the process takes 15 to 20 minutes.

HDInsight Storage Options

HDInsight allows data to be stored in the local HDFS file system, as does any Apache Hadoop distribution.
However, an option unique to HDInsight Service is the Azure Storage Vault Another idiosyncrasy of HDInsight is that it is currently based (ASV) protocol, which builds on the HDFS API to map Apache on Apache Hadoop 1.0.3 and HDP for Windows 1.1.0, even Hadoop operations to Windows Azure blob storage instead of though (as of August 2013) the most recent stable releases of to local HDFS. Through ASV, customers can keep their Apache Apache Hadoop based on Linux are Apache Hadoop 1.2.1 and Hadoop data in an inexpensive Windows Azure blob storage HDP 1.3.1. Even the most recent version of HDP for Windows account and avoid having to import this data into the physical is a later version: 1.3. Because Apache Hadoop is a quickly- compute nodes of the HDInsight cluster. Because the data maturing platform, the difference in incremental updates can accessed through ASV isn’t physically stored in the HDInsight be significant. For example, HDP 1.3.1 features a revision of the cluster, the data remains in Windows Azure blob storage before Hive query language called the Stinger Initiative that supports clusters are created and after they are destroyed. 3 4 5 6 50 times faster performance and increased compatibility with the SQL query language, but this technology is not currently After users spin up an HDInsight cluster, they can point included in HDInsight. In addition, the next full version of Apache operations such as Hive queries toward data that has been Hadoop, Apache Hadoop 2.0, is expected to be released in stored in Windows Azure blob storage by using a URI Q3 2013 and to be incorporated into HDP for Windows in Q4. beginning with asv:// or asvs://. The drawback to ASV is that, Apache Hadoop 2.0 is an important update that will dramatically because this data is not stored in the Apache Hadoop cluster improve the efficiency and extensibility of the platform, but it is itself, performance is not always optimized. 
However, write performance on Windows Azure blob storage is much faster than it is on HDFS, and with large file reads, temporary writes are used so often that ASV can actually result in better overall performance than local HDFS storage. Figure 1 shows the setting that configures ASV for HDInsight.

Figure 1. HDInsight cluster management screen

If optimal performance is important, it is advisable to run tests with data stored both in ASV and in local HDFS and compare the results. Note, however, that the cost of storing data in HDFS on HDInsight node instances is much higher than the cost of storing a comparable amount of data in Windows Azure blob storage. Another drawback to HDFS relative to ASV is that data stored in HDFS is removed when the cluster is destroyed. Figure 2 illustrates the relationship between an HDInsight cluster, HDFS, and ASV.

Figure 2. Relationship between an HDInsight cluster, HDFS, and Windows Azure Blob Storage

HDInsight Management

Windows Azure HDInsight Service and its on-premises counterpart, HDInsight Server, share the same web-based management interface, shown in Figure 3. The graphical user interface (GUI) provides options such as an interactive JavaScript and Hive console to a cluster, a remote desktop connection to the name (main) node, and monitoring data.

Figure 3. HDInsight cluster dashboard

For more fine-grained management of HDInsight clusters and their associated storage, Windows PowerShell* is available. The Windows PowerShell cmdlets for HDInsight are currently in version 0.9 and are available through the Microsoft .NET SDK for Apache Hadoop web site on CodePlex (https://hadoopsdk.codeplex.com/releases/view/109811).

Beyond these current tools, Microsoft has stated that in the future, Microsoft System Center* will provide tools to manage HDInsight. Given this information, it seems most likely that System Center integration will become available in the first full release of System Center after the official public release of HDInsight.

Getting Data in and out of HDInsight

HDInsight offers a number of standard Apache Hadoop ecosystem tools for loading and unloading data, such as the Apache Hadoop command or, if the source is a relational database, the Apache Sqoop* tool (included in all Apache Hadoop distributions). To load log file data, the standard Apache Hadoop ecosystem tool Apache Flume* is used.

To load data into or out of Windows Azure blob storage (as opposed to HDFS), users have more options. For example, one can use any number of tools that make use of the HDFS API, such as the free graphical tools Azure Storage Explorer* and CloudXplorer* or the command-line tool AzCopy*. One can also use JavaScript via the interactive console, the Apache Hadoop command line (using the Apache Hadoop command), or a .NET language such as C#. Yet another option is Windows PowerShell.

After data has been unloaded, it's typically necessary to clean it before it can be consumed, analyzed, or displayed in a visualization. These data-cleaning operations are often referred to as extract, transform, and load (ETL). For ETL operations with HDInsight, the standard Apache Hadoop tool Apache Pig* can be used. However, Microsoft also makes ETL for Apache Hadoop possible through SSIS, by means of the Hive ODBC Driver; the Hive ODBC Driver allows external applications such as Excel and SQL Server to connect to Apache Hadoop data.

Technical Notes about HDInsight

HDInsight was developed with ease of use in mind and has not been optimized for other features, such as performance. In addition, it is unlikely that HDInsight will ever be built on the very latest version of Apache Hadoop, because those versions are written on Linux. As a result, HDInsight will be late to adopt cutting-edge features and frameworks such as Intel's Project Rhino, which provides a common security framework for Apache Hadoop; Intel® Advanced Encryption Standard New Instructions (Intel® AES-NI), which speeds performance on encryption; and cell-level security in Apache Hadoop, such as is being developed in the Apache Accumulo* project. Regarding security, the only claims Microsoft is in fact making about HDInsight and security relate to its integration with Active Directory* Domain Services.

Claim: "[HDInsight lets you] accelerate the deployment with the cloud by deploying an Apache Hadoop cluster on Windows Azure* in just 10 minutes."7

The claim is specific and easy to verify, but it also suggests something general: that creating an HDInsight cluster in Windows Azure is a trivial exercise and is far easier than setting up one's own hardware cluster.

Although it takes closer to 20 minutes to set up an HDInsight cluster, it is true that by using Windows Azure HDInsight Service the circumscribed process of setting up an HDInsight cluster is quick and easy. However, the statement is essentially misleading because it ignores the necessary step of uploading data into the cloud. This uploading process is necessary unless the enterprise data destined for Apache Hadoop is already stored in Windows Azure blob storage (an uncommon scenario). To upload 1 TB of uncompressed data at a rate of 1 MB/second would require approximately 12 days. Compression can reduce the transfer times by 80 to 90 percent, but even assuming the rate can be increased to a brisk 1 TB per day, the process of uploading 100 TB would still take 100 days. (Windows Azure does not yet allow customers to ship physical disks to speed the process of loading data, but this service is planned before the end of 2013.8)

In addition, regardless of how complicated or time-consuming the process of deploying an Apache Hadoop cluster might be, this difficulty of deployment is not a major deterrent to the sound use of Apache Hadoop.
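The transfer estimates quoted above are straightforward arithmetic, reproduced here as a quick sketch (taking 1 TB as roughly 10^6 MB):

```python
SECONDS_PER_DAY = 86_400

def upload_days(total_mb: float, rate_mb_per_s: float) -> float:
    """Days needed to move total_mb at a sustained rate_mb_per_s."""
    return total_mb / rate_mb_per_s / SECONDS_PER_DAY

# 1 TB (~10^6 MB) at 1 MB/s: roughly 12 days, as stated above.
days_1tb = upload_days(1_000_000, 1.0)

# Even at a brisk 1 TB/day (with compression helping), 100 TB takes 100 days.
days_100tb = 100 / 1.0
```

The exact figure for 1 TB at 1 MB/s works out to about 11.6 days, which the text rounds to 12.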
In the broader scheme, ease of installation is a nice-to-have feature of HDInsight that does not help businesses derive any value whatsoever from an Apache Hadoop cluster.

Note that for the on-premises version of this product, the true ease of installation cannot yet be verified, because the current preview of HDInsight Server (for on-premises deployment) can only be installed as a single node.

Microsoft's Claims about HDInsight

Microsoft's main claims about HDInsight usually suggest that the product makes Apache Hadoop easier. What follows are some representative examples of Microsoft's claims, each followed by a brief analysis.

Claim: "Microsoft simplifies programming on Apache Hadoop."9

The claim that a procedure has been simplified can mean either that it has been made simple, or that it has merely been made simpler. In this case it is true that Microsoft has made programming on Apache Hadoop a little simpler, but it is not true that it has made programming on Apache Hadoop simple. Microsoft's programmatic addition to Apache Hadoop has been to create a .NET software development kit (SDK) and a set of JavaScript libraries for HDInsight, in addition to providing an interactive JavaScript console to Apache Hadoop. (The .NET SDK allows programmers to write essential Apache Hadoop MapReduce jobs in all .NET languages, such as C# and F#.) These additions in principle should make programming for Apache Hadoop easier for the many programmers who are not Java specialists.
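To give a feel for the programming model itself, here is a minimal word-count job in the classic mapper/reducer shape, written as a plain-Python sketch. This is illustrative only; it does not use Microsoft's .NET SDK or JavaScript libraries, although jobs written with them follow the same two-phase pattern.

```python
from collections import defaultdict
from typing import Iterable, Iterator

def mapper(line: str) -> Iterator[tuple[str, int]]:
    # Map phase: emit a (word, 1) pair for every token in the input line.
    for word in line.split():
        yield (word.lower(), 1)

def reducer(word: str, counts: Iterable[int]) -> tuple[str, int]:
    # Reduce phase: sum the counts grouped under a single key.
    return (word, sum(counts))

def run_job(lines: Iterable[str]) -> dict[str, int]:
    # A real framework shuffles and sorts between the two phases across
    # the cluster; a local dict stands in for that machinery here.
    grouped: dict[str, list[int]] = defaultdict(list)
    for line in lines:
        for word, one in mapper(line):
            grouped[word].append(one)
    return dict(reducer(w, c) for w, c in grouped.items())

result = run_job(["big data big insight", "data at scale"])
```

Even in this toy form, the developer must think in terms of key-value pairs and distributed grouping rather than ordinary sequential code, which is where the real complexity lies.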
However, programming MapReduce jobs will remain fundamentally complex even in these other languages. For the IT decision maker, the take-away is that developers comfortable in any .NET language or JavaScript will now be able to program MapReduce jobs and quickly perform queries in a console against data stored in Apache Hadoop.

Claim: "[Microsoft big data lets you] seamlessly extend privileges across HDInsight with Active Directory*."10

The implication of this claim is that the integration of HDInsight with Active Directory Domain Services makes managing HDInsight easier.

Apache Hadoop is in fact integrated with Active Directory Domain Services, but not yet to the high degree that is suggested in the claim. The locus of integration is currently with user accounts, authentication, and authorization: Windows accounts are used to manage Apache Hadoop, and it's not necessary to create user accounts within HDInsight itself. In fact, with HDInsight, no aspect of authentication and authorization remains siloed in Apache Hadoop; security is handled by Windows Azure, Active Directory Domain Services, or local Windows security. In addition, HDInsight creates a special Windows user account named "hadoop" that the services related to Apache Hadoop use for logon credentials. These services, and the "hadoop" logon account, are shown in Figure 4.

Figure 4. HDInsight services

In general, IT should not soon expect dramatic improvements in the manageability of Apache Hadoop because of its loose integration in Active Directory Domain Services. However, it is likely that Apache Hadoop and Active Directory Domain Services will become more integrated over time, leading to (for example) specific HDInsight group policy objects (GPOs) and other administrative benefits. HDInsight will likely need some years to mature before that happens, however.

Claim: "HDInsight is 100% compatible with Apache Hadoop."11

Buried within Microsoft's general claim that "HDInsight makes Apache Hadoop easier" is the implicit claim that HDInsight really is Apache Hadoop. Is it? In general, yes. Apache Hadoop runs inside HDInsight, and it is true that Apache Hadoop files from other Apache Hadoop distributions are 100 percent compatible with it. In addition, one can download an Apache Hadoop component such as Apache Mahout* straight from the Apache web site, and it will run on an HDInsight cluster without errors.

However, it is not true (as the claim might be interpreted) that HDInsight has the same features as all standard versions of Apache Hadoop. At the time of this writing, for example, HDP 1.3.1 and Apache Hadoop 1.2.1 support features that have not yet appeared in HDInsight. This lag between Apache Hadoop versions is likely to persist indefinitely, and it remains to be seen whether in some cases it could actually lead to file or code incompatibilities.

In general, the take-away for the IT decision maker is that HDInsight is likely to be running a slightly outdated version of standard Apache Hadoop. Today, code and syntax are 100 percent portable from standard Apache Hadoop, but in the future, exceptions to this rule cannot be ruled out. Ultimately, however, Microsoft has made clear that it wants to remain 100 percent compatible with Apache Hadoop, so if such an incompatibility should arise, it will likely be a temporary problem.

SQL Server 2012* Parallel Data Warehouse: An (Almost) All-in-One Data Solution

PDW Hardware Specifications

The PDW versions from Dell and Hewlett-Packard are not identical, but they do share some common specifications. First, both vendors assign 256 GB of RAM to each physical node in the appliance. Second, for both Dell and HP, the first rack in the appliance (or only rack, if there's only one) includes one node assigned control and management responsibilities. Microsoft also specifies that one extra node per rack should remain essentially unused and be included for failover, so this is another common element from both vendors.
Finally, in both the Dell and HP solutions, nodes are connected with InfiniBand* and Ethernet, both of which are implemented with redundancy. These control and failover nodes, along with the redundant networking components, occupy 6U in the first (or only) rack and 5U in all subsequent racks (because the control node is needed only in the first rack).

Stepping back from the hardware details, PDW is another pillar in Microsoft's all-data product lineup: a massively parallel processing (MPP) data-warehousing appliance that combines custom software built on SQL Server 2012 with commodity hardware. Currently, the appliance is sold in various scalable configurations only by Dell and Hewlett-Packard. At the lowest end, both vendors sell a one-quarter-rack version (of a standard 42U rack). The Dell appliance can scale up to 6 racks, and the HP counterpart can scale up to 7 racks.

A key concept in understanding PDW is that it represents a scale-out solution, as opposed to a scale-up solution. When users run T-SQL queries against PDW, the queries are broken down and distributed among all required nodes. The processing itself is therefore distributed, not centralized. As nodes are added to the appliance, the raw processing power of PDW increases in an essentially linear manner.

Storage in the PDW appliance is both replicated and distributed. Smaller tables (approximately 5 GB or smaller) are replicated among all nodes for improved performance. Larger tables are broken up and distributed across nodes.

Dell Parallel Data Warehouse Appliance

Dell's PDW product is officially called the Dell Parallel Data Warehouse Appliance. The following list provides detailed hardware specifications for the Dell PDW configuration options, beyond the common elements described above:

• Basic scale unit of 10U: 3 servers in a 2U enclosure and two 4U drive arrays
• Basic scale unit = 3 Dell PowerEdge R620* compute nodes, 2 Dell PowerVault MD3060e* JBOD SAS arrays (102 drives)
• Up to 3 scale units (9 compute nodes) per rack
• ¼–6 racks
• 3–54 compute nodes total
• 1, 2, or 3 TB storage capacity per drive
• 22.65–1,223.1 TB raw free storage space
• 79–6,116 TB user storage (with compression)
• 6U available for customer space on first rack, 7U on other racks

HP AppSystem for Microsoft SQL Server 2012 Parallel Data Warehouse

HP's PDW product is called the HP AppSystem for Microsoft SQL Server 2012 Parallel Data Warehouse. The HP AppSystem offers a different range of hardware options:

• Basic scale unit of 7U: two 1U servers and one 5U drive array
• Basic scale unit = 2 HP ProLiant Gen8 DL360 compute nodes, 1 HP P6000 JBOD SAS array (70 drives)
• Up to 4 scale units (8 compute nodes) per rack
• ¼–7 racks
• 2–56 compute nodes
• 1, 2, or 3 TB storage capacity per drive
• 15.1–1,268.4 TB raw free storage space
• 53–6,342 TB user storage (with compression)
• 8U available for customer space on first rack, 9U on other racks

These different hardware specifications for the first rack from each vendor are shown in Figure 5.

Figure 5.
Comparison of SQL Server 2012 Parallel Data Warehouse hardware specifications between Dell and HP

How PDW Works

Despite the many components included in PDW, to external clients the appliance looks just like a single instance of SQL Server 2012. T-SQL queries to PDW are directed from clients toward the PDW control node, and the control node eventually responds to the client with the results of the query. To answer the query, the control node uses its metadata to break up the original query into smaller parts and send these smaller component queries to the appropriate nodes. The control node then compiles the results received from these various nodes into one response and sends this response to the client.

PDW virtualizes all servers on its physical nodes and uses failover clustering to protect these virtualized workloads. No one node (including the control node) represents a single point of failure.

Figure 6 shows a view of the PDW from the perspective of an administrator.

Figure 6. SQL Server 2012 Parallel Data Warehouse management portal

Comparison of Data Warehousing Appliances

Within the playing field of data warehousing appliances, Microsoft makes essentially three pitches in favor of PDW: that it offers a great value, that it has excellent performance, and that it connects seamlessly to Apache Hadoop. Table 1 compares hardware specifications for full-rack implementations of data warehousing appliances from various vendors.12 Table 2 compares input/output (I/O) rates among three data warehouse appliances.13

Vendor and Appliance | Memory (GB) | Total Cores | Compression | User Storage (TB, Compressed) | List Price
EMC Greenplum Data Computing Appliance* | 768 | 48 | 4 to 1 | 144 | $2,000,000
IBM PureData System for Analytics N1001-010* | n/a | 112 | 4 to 1 | 128 | $1,599,000
Microsoft SQL Server 2012 Parallel Data Warehouse (Dell)* | 2,304 | 144 | 5 to 1 | 340 | $1,569,970
Oracle Exadata Database Machine X3-2* | 2,048 | 128 | 10 to 1 | 450 | $13,580,000
Teradata Data Warehouse Appliance 2690* | 768 | 96 | 4 to 1 | 146 | $1,168,000

Table 1. Comparison of hardware specifications for full-rack implementations of data warehousing appliances from several vendors

Vendor | I/O Bandwidth (GB/sec) | Price per GB/sec of I/O Bandwidth
EMC | 24 | $83,333
Microsoft | 108 | $14,537
Oracle | 100 | $136,440

Table 2. Comparison of input/output (I/O) rates among three data warehouse appliances

With respect to performance, Table 2 shows that the I/O throughput of PDW compares favorably with that of the EMC and Oracle solutions. (Data from IBM and Teradata are not available.) Microsoft claims PDW is also able to speed I/O performance (over 10 times) through the use of columnstore indexing and batch processing, both members of the xVelocity* family of memory-optimized technologies in SQL Server 2012.14
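The price-per-bandwidth column in Table 2 follows arithmetically from the list prices in Table 1 and the I/O rates in Table 2, as a quick check shows. The EMC and Microsoft figures reproduce Table 2; the Oracle figure computes to $135,800 against Table 2's $136,440, suggesting the two tables use slightly different list-price bases.

```python
# List prices (Table 1) and I/O bandwidth in GB/s (Table 2).
appliances = {
    "EMC":       {"price": 2_000_000,  "gbps": 24},
    "Microsoft": {"price": 1_569_970,  "gbps": 108},
    "Oracle":    {"price": 13_580_000, "gbps": 100},
}

# Dollars per GB/s of I/O bandwidth, rounded to whole dollars.
price_per_gbps = {
    vendor: round(v["price"] / v["gbps"])
    for vendor, v in appliances.items()
}
```

On these numbers, PDW's cost per unit of I/O bandwidth is roughly one-fifth of EMC's and less than one-ninth of Oracle's.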
With respect to value, an advantage highlighted by Table 1 is that, compared to other solutions, SQL Server 2012 PDW displays a low cost per unit of storage. Microsoft is able to attain these cost reductions mainly by using direct-attached storage (DAS) with its nodes instead of storage area network (SAN) storage, an option made possible by a Windows Server 2012 feature called Storage Spaces. Storage Spaces allows flexible, SAN-like storage provisioning from a JBOD SAS array that is attached to one node only.

Regarding the integration of PDW and Apache Hadoop, Microsoft is careful not to claim that it is unique among data warehouses in offering this capability. In fact, all of the data warehouse appliance vendors mentioned in Table 1 have presented a product roadmap involving some integration with Apache Hadoop. Of these, however, the PolyBase roadmap is distinctive in its plan to deeply integrate Apache Hadoop processing with PDW processing. The next section provides more detail about PolyBase and its product roadmap.

Big Data Integration – PolyBase

PolyBase is a PDW-only feature that provides a means to integrate Apache Hadoop data with SQL Server and to make this data accessible through T-SQL queries. The manner in which PolyBase integrates T-SQL with Apache Hadoop is illustrated in Figure 7.

Figure 7. PolyBase integration of T-SQL with Apache Hadoop

To achieve this integration, PDW must first be connected to an Apache Hadoop source. Administrators can then integrate the external Apache Hadoop data into SQL data on PDW by using either a CREATE EXTERNAL TABLE statement or a CREATE TABLE AS SELECT (CTAS) statement. Administrators can also push data from PDW to Apache Hadoop by means of a CREATE EXTERNAL TABLE AS SELECT (CETAS) statement.

CREATE EXTERNAL TABLE Statement

When an external table is created from Apache Hadoop data, PDW frames a SQL structure around the external data. Users can then query the external table as if it were a normal table residing in a SQL database. If the data is updated in the native Apache Hadoop source, query results will show the updated data. However, query performance isn't optimized. Figure 8 shows an example of a CREATE EXTERNAL TABLE statement that creates a table called ClickStream from an Apache Hadoop file called employee.tbl.

Figure 8. Example of a CREATE EXTERNAL TABLE statement from Apache Hadoop

CREATE TABLE AS SELECT Statement

The CTAS statement can be run after an external table is created. When a PDW administrator creates a table as a select statement from an external table, this external data is physically copied into a SQL table that resides in PDW. In this case, PDW can perform parallel processing on the remote Apache Hadoop data, and when the table is created, the administrator can optimize its storage in PDW by distributing it across nodes. The imported Apache Hadoop data then persists in PDW until the new table is deleted. Creating a table as a select statement optimizes query response times, but the imported data is not updated from its source if that source data should ever change. The following example shows a basic CTAS statement:

CREATE TABLE ClickStream_PDW
WITH (DISTRIBUTION = HASH(url))
AS SELECT url, event_date, user_IP FROM ClickStream

Note that Apache Hadoop data does not need to persist as an isolated table. Imported data can also be mashed up with native relational data through JOIN statements.

Querying the Data

After data is imported into a table in PDW, users can perform ordinary T-SQL queries on it, as shown in the three examples in Figure 9.

Figure 9. Examples of T-SQL queries performed on data imported to a SQL Server 2012 Parallel Data Warehouse table

Pushing Data to Apache Hadoop from PDW

Finally, PDW administrators also have the option of migrating data from PDW to an Apache Hadoop source. To achieve this, a CETAS statement is used, as in the following example:

CREATE EXTERNAL TABLE ClickStream (url, event_date, user_IP)
WITH (LOCATION = 'hdfs://MyHadoop:5000/users/outputDir',
    FORMAT_OPTIONS (FIELD_TERMINATOR = '|'))
AS SELECT url, event_date, user_IP FROM ClickStream_PDW

Roadmap for PolyBase

Currently, PolyBase is in phase 1 of a multi-phase rollout. Phase 1 allows data to be imported directly from, and exported directly to, HDFS on Apache Hadoop. Because MapReduce is bypassed and parallel processing is used, performance for import and export operations is normally optimized.

Phase 2 goes beyond integrating Apache Hadoop data into PDW and will move toward integrating the processing power of Apache Hadoop clusters into PDW queries. This next phase will include a PDW query optimizer that, for all queries of both SQL and Apache Hadoop data, makes a cost-based decision about when to process queries with SQL and when to push queries onto HDFS data as MapReduce jobs.

The goals of PolyBase phase 3 have not been finalized, but Microsoft has publicly stated that it is considering compatibility with Apache Hadoop MapReduce 2.0 (YARN) and more efficient alternatives to MapReduce. No dates have been given for the release of PolyBase phase 2 or phase 3. Besides this roadmap of planned functionality for PolyBase, Microsoft has occasionally hinted that the technology will eventually be integrated into its SQL Server product, perhaps as soon as the next release (SQL Server 2014).

ETL in PDW

The Microsoft specifications for PDW do not include any ETL server, such as a dedicated instance of SQL Server loaded with SSIS. Both Dell and HP include SQL Server tools installed on the control node, but it is expected that many firms will use a pre-existing ETL server to connect to PDW. Using SSIS packages to import data is sensible if these packages are already created. It should be noted, however, that in PDW, ordinary T-SQL queries offer much better performance as a way to import data.15

Microsoft's Claims about PDW

This paper focuses on Microsoft's comprehensive data strategy and how the various components of that strategy might work together. Although Microsoft makes claims about PDW that relate to its value and its performance, these claims do not relate to its big data strategy. One important claim that Microsoft is making about PDW, however, does relate to its comprehensive data strategy: that PDW integrates Apache Hadoop data with traditional relational data. We will look at two representative examples.

Claim: PolyBase for PDW provides "seamless integration of Apache Hadoop data with the data warehouse in a single query."16

This claim essentially states that a single query executed against PDW will return both Apache Hadoop data and SQL data. The implication of the claim, along with the word "seamless," is that all Apache Hadoop data will easily be brought into the SQL world and made accessible to all users through ordinary T-SQL statements.

The claim can be construed as true if it is limited to describing the availability of data that has already been imported from Apache Hadoop, but it is essentially misleading in describing this process as "seamless." In truth, the only Apache Hadoop data that can be queried through SQL statements is data that an administrator has located and made the effort to import with CREATE EXTERNAL TABLE or CTAS statements. In addition, the only Apache Hadoop data that is capable of being imported is data that is semi-structured with delimiters such as commas. Most Apache Hadoop data, however, is not structured at all.

Although importing Apache Hadoop data into a SQL table is certainly a useful capability, this is not a particularly common use case for PolyBase. It is useful only for data that can fit easily into a table (such as log files) and whose location within the Apache Hadoop cluster is known. In the words of Yale researcher Daniel Abadi, "[PolyBase lets you] dynamically get at data in Hadoop/HDFS that could theoretically have been stored in the DBMS all along, but for some reason is being stored in Hadoop instead of the DBMS."17 It should also be noted that other rival technologies offer a more seamless integration of SQL with Apache Hadoop, such as the Hadapt Adaptive Analytical Platform* and the Hortonworks Stinger Initiative.

Claim: "HDFS Bridge in PolyBase … enable[s] direct communication between HDFS data nodes and PDW compute nodes."18

This particular claim is more conservative than the last. It states merely that instances of SQL Server in PDW can communicate directly with data nodes in HDFS through a PDW component called the HDFS Bridge. The implication of the claim is that IT personnel do not need to use additional tools (such as Hive) or write additional MapReduce scripts to import data from Apache Hadoop to PDW or export data from PDW to Apache Hadoop. SQL communicates with Apache Hadoop directly.

The claim offers a reasonable description of what PolyBase can do, and if anything, it might sell the technology a little short. PolyBase doesn't merely allow users to bypass MapReduce and import and export data directly; it also allows PDW to use parallel processing when it performs queries on an Apache Hadoop cluster. On the other hand, the claim also hints at the lack of advanced integration between PDW and Apache Hadoop. The two technologies are connected only through a bridge, so importing and exporting data is required before data can be accessed from one system in the other.

Other companies are indeed working on solutions that dispense with the need for such a bridge. When assessing Microsoft's comprehensive data vision, therefore, it's important to recognize that PolyBase does not represent a singular cutting edge in the integration of SQL and Apache Hadoop.

Business Intelligence and Analytics

BI typically refers to tools used to collect, analyze, and view enterprise data for the purpose of meeting business goals. These three functions of collecting, analyzing, and viewing data have traditionally been filled for Microsoft by SSIS, SQL Server Analysis Services (SSAS), and SQL Server Reporting Services (SSRS), respectively. Microsoft BI has traditionally relied heavily on IT personnel, for example, to create packages in SSIS for importing data, to develop online analytic processing (OLAP) cubes in SSAS, and then to build reports in SSRS that are finally delivered to end users.

In the big data era, however, Microsoft has been expanding its vision of BI to include what it calls "managed self-service BI." In this new vision, IT manages access to data sources, and end users connect to these data sources as needed with client tools, most notably Excel.
Users import data as tables into Excel and then shape and visualize the data as needed. Possible sources of enterprise data can still include databases, but they also now include Apache Hadoop and other sources, such as web pages and Open Data Protocol (OData) feeds. (OData is a data access protocol released under the Microsoft Open Specification Promise.) Excel in particular is able to achieve high performance when handling data sets and processing visualizations because it uses the xVelocity in-memory analytics engine for these purposes. (This in-memory engine was first available only in SQL Server 2008 R2.) The next section describes some new Microsoft BI features available in Excel and some other tools that users can employ to connect to and manipulate big data.

Apache Hive* ODBC Driver

The Hive ODBC Driver is currently a critical piece of software in Microsoft's all-data strategy. This driver allows Apache Hadoop data sets to be imported into SQL Server, Excel, and Analysis Services through HiveQL queries. Unlike PolyBase, which is available only in PDW, the Hive ODBC Driver connects to Apache Hadoop by allowing HiveQL queries to be translated to MapReduce jobs. Performance is much lower than with PolyBase. (Because PolyBase provides better functionality than the Hive ODBC Driver, there is no such driver for PolyBase.)

The Hive ODBC Driver is central to Microsoft big data because the company's strategy does not provide any BI tools specifically for Apache Hadoop. Microsoft's goal with big data and BI is merely to provide a method to import Apache Hadoop data into well-known Microsoft tools, where this data can be shaped, analyzed, and visualized just like any other data. Figure 10 shows how the Hive ODBC driver (labeled "ODBC for Hive") is used to connect Windows Azure HDInsight Service to Excel, SQL Server, and Analysis Services.

Figure 10. Connection of Windows Azure HDInsight Service to Excel, SQL Server, and Analysis Services through the Hive ODBC driver
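Because the driver is plain ODBC, the same path that Excel and SSAS use can be exercised from any ODBC-capable client. Below is a hedged Python sketch using the third-party pyodbc module; the DSN name, table, and column names are hypothetical, and a configured Hive ODBC data source is assumed.

```python
# HiveQL is assembled as ordinary SQL text; the Hive ODBC Driver
# translates it into MapReduce jobs on the cluster, so expect latency
# closer to a batch job than to a relational database query.
HIVEQL = "SELECT url, hits FROM clickstream WHERE hits > 100"

def fetch_clickstream(dsn: str = "HiveDSN"):
    """Run a HiveQL query through a Hive ODBC data source.

    'HiveDSN', 'clickstream', 'url', and 'hits' are hypothetical names;
    pyodbc is a third-party module, and the Hive ODBC Driver must
    already be installed and registered as a DSN.
    """
    import pyodbc
    with pyodbc.connect(f"DSN={dsn}", autocommit=True) as conn:
        return conn.cursor().execute(HIVEQL).fetchall()
```

The same query text could be pasted into Excel's Hive data-import dialog; the scripted form simply makes the translation layer explicit.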
PowerPivot

The PowerPivot add-in for Excel first appeared in Excel 2010, allowing users to load large amounts of highly compressed data into Excel from different sources, create relationships within that data, and then perform analysis on the data. In Excel 2013, much of this functionality is now built directly in. Without installing the PowerPivot add-in, one can already import large data sets (millions of rows) from multiple data sources, create relationships between data from different sources and between multiple tables in a PivotTable, create implicit calculated fields, and manage data connections. In Excel 2013, data is also now automatically loaded into the xVelocity in-memory analytics engine, even before the PowerPivot add-in is installed.

When the PowerPivot add-in is installed in Excel 2013, more advanced modeling capabilities become available, such as the ability to filter and rename data as it is imported, to define custom calculated fields throughout a workbook, to define key performance indicators (KPIs) to use in PivotTables, and to use the Data Analysis Expressions (DAX) language to create advanced formulas. Data imported into PowerPivot can originate from databases, such as SQL Server, IBM DB2*, and Oracle, or from other types of data sources, such as Apache Hadoop, OData feeds, reporting services reports, and text files.19 Figure 11 shows the functions available in the PowerPivot ribbon in Excel 2013.

Figure 11. Excel 2013 PowerPivot ribbon

Power Query

Power Query is a new tool whose name has recently been updated from its preview name, Data Explorer. Power Query allows users to query external sources of data, such as the Internet in general or HDInsight, and import detected tabular data sets. Once imported, the data in the table can be modified, combined with other data, analyzed, and visualized by using other tools. Figure 12 shows how Power Query can be used to import data from HDInsight and other external sources. Figure 13 shows a result when Power Query is used to perform an online search for "most populous metropolitan areas in North America." When the data set is selected in the right column, a table containing the data set is automatically created.

Figure 12. Importing data from HDInsight to Excel 2013 Power Query

Figure 13. Excel 2013 Power Query results of an online search for "most populous metropolitan areas in North America"

Power View

Power View is a visualization tool that first appeared in SharePoint 2010 and that is now also available in SQL Server 2012 SP1 and Excel 2013. Power View allows users to create interactive charts and maps from tabular data in Excel and then add them to a view or dashboard, as shown in Figure 14.

Figure 14. Excel 2013 Power View output of most populous metropolitan areas in North America

Power Map

Power Map is a new visualization tool that until recently was known by its preview name, GeoFlow. Power Map provides 3D geographical visualizations that are superimposed on a globe. The data visualized can range from remote sensor output to data from Twitter. An important constraint, however, is that Power Map can work only with data preformatted in a table and cannot work with live or streaming data. Figure 15 shows an example of a visualization created in Power Map.

Figure 15. Example of a visualization created in Excel 2013 Power Map

Microsoft's Claims about BI

Microsoft's claims about its BI tools within the context of its big data strategy are essentially variations of one point: that its BI tools bring big data to the masses. These claims can be mostly accurate or mostly misleading, depending on how they are phrased.

Claim: "HDInsight democratizes the power of big data BI."20

This claim explicitly states that HDInsight itself, and not some other component in the Microsoft big data strategy, is democratizing big data BI. The implication of the claim is that by opting for HDInsight over another Apache Hadoop solution, firms will have an advantage in their ability to derive valuable insights from their data.

In truth, Microsoft's big data BI strategy is to bring Apache Hadoop data into its suite of existing BI tools, and these tools are almost completely agnostic about the particular source of the Apache Hadoop data.
Microsoft does not have any BI solution for HDInsight in particular. It is true that in Microsoft Excel, some might consider that importing data from an HDInsight account is easier than doing so from a generic Apache Hadoop file, but this difference is at the moment negligible. Moreover, the principal interface between Excel and Apache Hadoop data, the Hive ODBC Driver, is backend-agnostic, meaning that it could draw its data from a competing Apache Hadoop distribution as easily as from HDInsight.

Over time, it is plausible that Microsoft will continue to develop HDInsight and Excel in a way that optimizes this connection far more, but for now, it is not accurate to suggest that HDInsight itself lowers the barrier to entry for big data BI.

Claim: “[Microsoft lets you] analyze big data with familiar tools.”21

If Microsoft Excel can be considered a familiar tool, then this claim is accurate. In Excel, information workers can perform a HiveQL query to import over a million rows of Apache Hadoop data, clean the data so that it fits into a table, and then analyze this data with advanced tools such as DAX statements. (Note that professional data analysts can also import Apache Hadoop data into the tools they are used to, such as SSAS, and perform the same analytics on this data as they could with any data that is originally in a static, tabular format.)

However, there are some important caveats to keep in mind about using “familiar tools” to analyze big data. First, one cannot import all Apache Hadoop data into Excel (or SSAS). Users can import only data that lends itself to being shaped into a table, such as comma-delimited files and other forms of semi-structured data. Microsoft is in fact heavily promoting semi-structured data as the type of Apache Hadoop data it can handle with its existing tools, but semi-structured data represents only a small percentage of the data that is kept in Apache Hadoop clusters.

Second, the fact that one can use Excel to perform analytics on big data does not mean that this task is in any way easy to perform. The ability to import the useful data through HiveQL, fashion this data into a clean table filled with the right information, and then perform the right analytics in a way that yields valuable insights is a set of skills reserved for specialists, such as data analysts familiar with tools like the DAX scripting language.

It is true that Microsoft has lowered the barrier to entry for reading semi-structured data stored in Apache Hadoop and especially for creating visualizations of tabular data. It is likely that in future releases of Excel and HDInsight, Microsoft will continue to make these processes gradually easier. However, it is unlikely that Microsoft will succeed in bringing true analytics (as opposed to mere visualizations) to the masses, whether for structured data or unstructured data. True data analysis that is capable of revealing valuable and non-obvious insights, after all, is a discipline that requires specialized mathematical and statistical skills that go beyond simple familiarity with a given scripting interface or software tool.

Conclusion

The main products Microsoft includes in its “all-data” vision—HDInsight, PDW, and Excel—comprise a compelling and comprehensive set of features, albeit with some significant limitations. Even with these limitations, however, this vision offers a unique take on big data that is not available through other vendors.

On the positive side, Microsoft offers the many firms that are already heavily invested in Microsoft products a way to ease into big data with minimal adjustment. For example, HDInsight will be manageable from System Center and increasingly integrated into Active Directory Domain Services, reducing administrative overhead compared to other Apache Hadoop solutions.
Furthermore, companies planning to deploy future releases of SQL Server will find that this product is likely to include PolyBase, and by extension, a T-SQL connection to Apache Hadoop. Existing BI expertise in Excel, meanwhile, can be applied by connecting to both Apache Hadoop and relational data sources. Finally, for the many organizations already moving their servers or data into Windows Azure, Windows Azure HDInsight Service offers an attractive option because it can connect directly to data stored in Windows Azure blob storage.

Another legitimate advantage of Microsoft’s vision is ease of implementation and, to a lesser degree, ease of use. Spinning up a data cluster in HDInsight Service is decidedly easier than creating a physical Apache Hadoop cluster on premises, even if uploading big data sets to the cloud can be extremely time-consuming. For ease of use, the .NET SDK and JavaScript libraries for HDInsight, as well as the interactive JavaScript console, do make programming for and interacting with Apache Hadoop somewhat easier than with other Apache Hadoop distributions.

All of this said, although the components of Microsoft’s big data offerings are compatible with each other, they are not truly integrated into a single solution and do not yet achieve any significant synergistic effects when used together.

Another significant limitation to Microsoft’s big data strategy is that while Apache Hadoop is rapidly improving in performance, extensibility, and ease of use, Microsoft has not yet proven that it can keep pace with these changes. By the time HDInsight is officially released, it might in fact already be outdated, if not far surpassed, by superior Apache Hadoop-based alternatives.

The BI portion of Microsoft’s all-data vision does not yet fully live up to claims of democratizing big data. Microsoft’s solution manages to simplify relatively easy portions of dealing with big data, such as server-cluster installation or semi-structured data visualization. Benefits such as these provide evolutionary business value but still leave fundamentally difficult aspects of working with big data—such as crafting queries for unstructured data, sanitizing data for visualization, and implementing effective machine-learning algorithms—unaddressed. Until Microsoft finds ways to help non-specialist information workers ask the right questions of any kind of data, the company will not truly live up to its claim of bringing big data to the masses.

But perhaps the most damaging critique one can level against Microsoft’s all-data solution is that it is currently little more than a marketing vision (though a compelling one). In reality, HDInsight is now functional, but only as a preview, and the other main components of the vision, PDW and Excel, are essentially pre-existing products that have been repositioned as critical components of a grand big data marketing strategy. Today, what Microsoft is truly offering is merely the promise of an all-data solution, a promise that might or might not be realized soon.

None of these limitations can negate the fact that the unusual breadth of Microsoft’s product line, across Apache Hadoop, relational databases, data warehousing, spreadsheets, the cloud, business intelligence, and server administration, makes the company a unique contender among big data vendors. Microsoft has more of the components for a comprehensive data solution either in place or credibly maturing than any other company. Moreover, Microsoft provides a vision of data management and analysis that could be revolutionary if fully realized. However, it is important to underscore that Microsoft’s all-data vision is currently just that: a vision. Those waiting for the Microsoft all-data solution will have to be patient and wait to see whether its promise matures into the truly comprehensive, integrated, and synergistic suite of data management and analysis products currently promised.
Notes

1. Microsoft. “Microsoft Big Data.” http://www.microsoft.com/en-us/sqlserver/solutions-technologies/business-intelligence/big-data.aspx.
2. Microsoft. “From Data to Insights with Microsoft Big Data.” http://download.microsoft.com/download/6/E/3/6E335796-3003-4B2D-BB55-0A33E003F879/Microsoft_Big_Data_Booklet.pdf.
3. Microsoft. “What Version of Hadoop Is in Windows Azure HDInsight?” http://www.windowsazure.com/en-us/manage/services/hdinsight/howto-hadoop-version/.
4. Apache. “Hadoop Releases.” http://hadoop.apache.org/releases.html#download.
5. Hortonworks. “Hortonworks Data Platform (HDP).” http://hortonworks.com/products/hdp/.
6. Hortonworks. “HDP for Windows.” http://hortonworks.com/products/hdp-windows/.
7. Microsoft. “Microsoft Big Data.” http://www.microsoft.com/en-us/sqlserver/solutions-technologies/business-intelligence/big-data.aspx.
8. Windows Azure Storage Team Blog. “Windows Azure Storage BUILD Talk – What’s Coming, Best Practices and Internals.” http://blogs.msdn.com/b/windowsazurestorage/archive/2013/06/28/windows-azure-storage-build-talk-what-s-coming-best-practices-and-internals.aspx.
9. Microsoft. “Microsoft Big Data.” http://www.microsoft.com/en-us/sqlserver/solutions-technologies/business-intelligence/big-data.aspx.
10. Microsoft. “Microsoft Big Data.” http://www.microsoft.com/en-us/sqlserver/solutions-technologies/business-intelligence/big-data.aspx.
11. Microsoft. “Microsoft Big Data.” http://www.microsoft.com/en-us/sqlserver/solutions-technologies/business-intelligence/big-data.aspx.
12. Value Prism Consulting. “Microsoft’s SQL Server Parallel Data Warehouse Provides High Performance and Great Value.” March 2013. http://www.valueprism.com/resources/resources/Resources/PDW%20Compete%20Pricing%20FINAL.pdf.
13. Value Prism Consulting. “Microsoft’s SQL Server Parallel Data Warehouse Provides High Performance and Great Value.” March 2013. http://www.valueprism.com/resources/resources/Resources/PDW%20Compete%20Pricing%20FINAL.pdf.
14. For more information about columnstore indexing, see http://social.technet.microsoft.com/wiki/contents/articles/3540.sql-server-columnstore-index-faq.aspx. For more information about other technologies in xVelocity, see http://technet.microsoft.com/en-us/library/hh922900.aspx.
15. Data Warehouse Junkie. “Rock Your Data with SQL Server 2012 Parallel Data Warehouse (PDW) – POC Experiences.” June 2013. http://dwjunkie.wordpress.com/2013/06/27/rock-your-data-with-sql-server-2012-parallel-data-warehouse-pdw-poc-experiences/.
16. Microsoft. “Appliance: Parallel Data Warehouse (PDW).” http://www.microsoft.com/en-us/sqlserver/solutions-technologies/data-warehousing/pdw.aspx.
17. Monash Research. “SQL-Hadoop Architectures Compared.” June 2013. http://www.dbms2.com/2013/06/02/sql-hadoop-architectures-compared/.
18. Microsoft. “PolyBase.” http://www.microsoft.com/en-us/sqlserver/solutions-technologies/data-warehousing/polybase.aspx.
19. The complete list of supported data sources for PowerPivot can be found at http://office.microsoft.com/en-us/excel-help/get-data-using-the-powerpivot-add-in-HA102836921.aspx.
20. Microsoft. “Business Insights Newsletter Article.” December 2012. http://www.microsoft.com/enterprise/newsletter/bi-newsletter/articles/december2012.aspx#fbid=f36OYx4e5uq.
21. Microsoft. “Microsoft Big Data.” http://www.microsoft.com/en-us/sqlserver/solutions-technologies/business-intelligence/big-data.aspx.

The analysis in this document was done by Prowess Consulting and derived from work done with Intel. Results have been simulated and are provided for informational purposes only. Any difference in system hardware or software design or configuration may affect actual performance.

Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries.

*Other names and brands may be claimed as the property of others.
Prowess, the Prowess logo, and SmartDeploy are trademarks of Prowess Consulting, LLC. Copyright © 2014 Prowess Consulting, LLC. All rights reserved. Other trademarks are the property of their respective owners.