Conclusions Paper

The Use of Open Source Is Growing. So Why Do Organizations Still Turn to SAS?

Insights from a presentation at the 2014 Hadoop Summit
Featuring Brian Garrett, Principal Solutions Architect at SAS

Contents
Commercial Analytics Software and/or Open Source?
SAS Offers Unique Value to Open Source Users
    Experience and Expertise
    Proven Value to Customers
    Innovation and Leadership in Analytics
SAS Brings Value to Open Source Solutions
    SAS/IML® Software Integration
    SAS® Enterprise Miner™
SAS Supports the Entire Analytics Life Cycle
    Preparing Data Using SAS®
    Building and Validating Models Using SAS®
    Deploying and Monitoring Models Using SAS®
The Facts on Total Cost of Ownership
SAS® Analytical Innovations for Hadoop
    SAS® High-Performance Analytics
    SAS® Visual Analytics at Scale
    SAS® Visual Statistics at Scale
    SAS® In-Memory Statistics
Making SAS® Accessible to Professors, Students, Researchers and Independent Learners
Learn More

Commercial Analytics Software and/or Open Source?

It’s a hot topic today.
As customers debate which is the best way to go, recent findings by Nucleus Research suggest that many organizations have realized that they can meet both internal and external stakeholder requirements by finding the right balance of SAS enterprise-class analytics solutions and open source solutions. Why? Because SAS is optimized for operational and production analysis and includes integrated capabilities for data management and more, while open source quickly brings new analytic algorithms to market.¹

But how can these technologies coexist in the real world to meet different business needs? Do they “play” together well? How do they work together to leverage Hadoop? These and other questions were answered in a recent presentation at the 2014 Hadoop Summit by Brian Garrett, Principal Solutions Architect at SAS. His presentation “With the Rise of Open Source, Why Organizations Still Turn to SAS” highlighted little-known facts about SAS investments in software enhancements that allow analysts to incorporate R algorithms into analytic processes as part of a comprehensive, enterprise-class SAS analytics platform.

¹ SAS Analytics and Open Source. Nucleus Research, April 2014.

SAS Offers Unique Value to Open Source Users

Everyone knows SAS. What they may not know is what SAS offers to users of open source analytics – and especially to those users storing data in Hadoop clusters. Garrett explained, “From my perspective, SAS offers four key strengths: experience and expertise, value to customers, innovation, and leadership in analytics.”

Experience and Expertise

SAS – the world’s largest privately held software company – was founded in 1976 and currently employs more than 13,000 people worldwide across 400 office locations. “Our employees are one of our biggest assets because of the depth and breadth of their experience,” said Garrett. The average tenure at SAS is over 10 years, which is important because
1) usually, original code developers are just down the hall and available for consultation, and 2) SAS employees have a tremendous amount of intellectual property in their heads – IP that’s readily available to peers and customers.

“In addition, SAS employee turnover rate averages 3 percent annually – compared to the industry average of 22 percent,” noted Garrett. “Eighty-five percent of our statistics, statistical product and testing teams have advanced degrees – and fully half of them have PhDs in math, statistics or operations research. So we manage to attract and retain some of the best and brightest analytical talent in the industry.”

For example, several years ago, SAS created an advanced analytics lab to tackle some of the newest emerging technologies; 98 percent of the lab’s employees have advanced degrees, and one-third have PhDs. “It’s this level of expertise that separates SAS’ development and testing processes. They can – and do – perform rigorous testing and validation of all algorithms used in products so that they are proven reliable, accurate and ready for enterprise use,” continued Garrett.

“About 25 percent of our revenues are put back into research and development. That’s huge for a software company. But we believe it’s critical to driving innovation and leadership in this space.”
Brian Garrett, Principal Solutions Architect, SAS

Proven Value to Customers

SAS software is currently in use across approximately 70,000 customer sites in 135 countries – including 90 of the top 100 companies in the Fortune 500. “One-fourth of our revenue comes from the financial services industries, and three-quarters comes from all other major segments,” noted Garrett. “So we’re delivering value across a wide range of industries.” Figure 1 details these industries.

Figure 1: Percentage of SAS business by industry.

Equally important, SAS Business Solutions go far beyond analytic algorithms.
They encompass solutions for data management, analytics, business intelligence, and high-performance analytics. Under this broad umbrella, SAS offers many horizontal and vertical solutions that help organizations use their data to solve contemporary and industry-specific business problems.

“Our analytic solutions all use the same underlying core technology, meaning they use the same algorithms and statistical methods in very specific ways to address such issues as fraud, credit risk, merchandising effectiveness, customer intelligence and more,” noted Garrett. “So while Hadoop enables the persistence of large amounts of data in a relatively cheap fashion, SAS focuses on what you can do with data, what it can tell you.”

SAS continually incorporates extensive customer input when building and enhancing products, ensuring close alignment between software functionality and business need. At the same time, SAS provides comprehensive training and documentation for all products, which is critical to ramping up and supporting users so customers quickly realize value from their SAS investments.

Innovation and Leadership in Analytics

Customers, industries and research organizations seeking innovative solutions to new analytic problems rely on SAS. To develop these solutions, SAS developers engage in conversations with customers across many industries to learn about their most pressing business problems. At the same time, SAS experts are actively involved in professional conferences and work with leading academic researchers – for example, to learn about new methodologies and algorithms, as well as to evaluate the solutions’ robustness and effectiveness for use by customers.

By bridging these two worlds – academia and business – SAS delivers innovative methods that matter to our customers, drawing on rapidly expanding analytical disciplines.
At the same time, through continual and longstanding engagement with universities, SAS shares its own best practices with professors and students, broadening the practices’ application to other disciplines and industries. Advanced analytics developers at SAS learn about the broad range of problems that customers have – and at the same time, they keep up with new developments in their respective disciplines. This combination enables SAS to develop new methods and algorithms that make a real difference in practice.

SAS Brings Value to Open Source Solutions

“The open source community brings a tremendous amount of value to the analytics community,” explained Garrett. “It brings together people from very different backgrounds and experiences to solve complex problems. And at SAS, we believe there’s a lot of really good, collaborative work that’s been done identifying new problems and finding new ways to solve them. And that’s why SAS has opened up key parts of its software to integrate with certain open source software.”

SAS/IML® Software Integration

For example, SAS/IML – a matrix manipulation product that supports matrix-vector computation – gives users the flexibility to create custom functions. It also:
• Takes advantage of built-in functions, subroutines and SAS procedures.
• Enables interfaces with R language so users can submit R code within SAS.
• Moves between SAS and R data structures.

“With SAS/IML, users can submit R code within SAS,” stated Garrett. “So you can write R code and be running SAS/IML software. Say you want to do some matrix manipulations and some functions within R. Now you can do that.”

To learn more, see:
• www.youtube.com/watch?v=rUaTTre24kI
• www.youtube.com/watch?v=nmRQ3MtkG6A

SAS® Enterprise Miner™

SAS Enterprise Miner – a data mining product – also integrates with open source solutions. SAS Enterprise Miner has an easy-to-use, drag-and-drop interface that allows people to do data mining with ease and to create models. Integration with open source provides access to R modeling packages. In addition, if a model supports PMML (Predictive Model Markup Language), SAS Enterprise Miner can convert the PMML to SAS score code for deployment in production. (If a model does not generate PMML, users can still assess the model against other SAS models.)
• Enables R functions and packages.

“You can leverage R modeling packages from a SAS environment and generate corresponding PMML models, compare R and SAS models in one interface, and create ensemble models,” stated Garrett.

As shown in Figure 2, using SAS Enterprise Miner, users can simply drag and drop particular blocks and connect them together to create a process flow. In PMML output mode, the Open Source Integration node translates the R model into SAS DATA step code using PMML. The node then scores all imported data partitions with the generated SAS score code so users can easily compare R and SAS models. The node automatically runs standard SAS Enterprise Miner assessments for supervised predictive models.

Figure 2: Using SAS® Enterprise Miner™ and R.

“But the real value of the integration is that people can also create ensemble models using R and SAS,” explained Garrett. “SAS Enterprise Miner can import R models and be used for model transformations, imputations and more. People can then build R models with ease, as it wraps a framework around the R model. They can also take the two models, put them together, and make an even better model (for instance, by better targeting populations that impact critical business decisions).”

Figure 3 illustrates how a blended model can combine the best of both an open source and SAS model to achieve the greatest lift, and thus the greatest improvement in overall performance.

Figure 3: Creating ensemble models.
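
The blending idea can be illustrated with a small, tool-independent sketch. The Python below is not SAS Enterprise Miner code and does not reproduce Figure 3. It assumes two models have already produced a predicted probability for each customer, uses made-up scores and outcomes, and uses a plain weighted average as a stand-in for whatever ensemble method the product applies, which the presentation does not specify. The function names `blend` and `lift_at` are hypothetical.

```python
from typing import List, Sequence


def blend(p_a: Sequence[float], p_b: Sequence[float], weight: float = 0.5) -> List[float]:
    """Weighted average of two models' predicted probabilities."""
    if len(p_a) != len(p_b):
        raise ValueError("score lists must have the same length")
    return [weight * a + (1.0 - weight) * b for a, b in zip(p_a, p_b)]


def lift_at(scores: Sequence[float], labels: Sequence[int], top_n: int) -> float:
    """Response rate among the top_n highest scores divided by the overall rate."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    top_rate = sum(label for _, label in ranked[:top_n]) / top_n
    base_rate = sum(labels) / len(labels)
    return top_rate / base_rate


# Made-up data (assumption): 10 customers, 1 = responded, 0 = did not.
labels = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0]
# Hypothetical scores from an R model and a SAS model; each errs on a different customer.
r_scores = [0.9, 0.8, 0.5, 0.3, 0.2, 0.7, 0.1, 0.4, 0.2, 0.1]
sas_scores = [0.5, 0.1, 0.9, 0.2, 0.3, 0.8, 0.2, 0.7, 0.1, 0.1]

blended = blend(r_scores, sas_scores)
for name, scores in [("R", r_scores), ("SAS", sas_scores), ("blend", blended)]:
    print(name, round(lift_at(scores, labels, top_n=3), 2))
```

With these invented numbers, each single model places two of the three responders in its top three, while the averaged scores place all three there, so the blended lift is higher. Real results depend entirely on the data and on how the component models are combined.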
SAS Supports the Entire Analytics Life Cycle

As discussed previously, SAS integration with open source optimizes how data scientists can explore data and develop models. But as shown in Figure 4, SAS also fills in the gaps across the entire analytics life cycle. For example, many open source analytics software products do not support data management and preparation or model deployment and monitoring – or if they do, it’s in a way that is too difficult or cumbersome. In contrast, SAS provides an end-to-end solution across the entire analytics life cycle.

Figure 4 illustrates how SAS innovates across the entire analytics life cycle – not just algorithms or a new DATA step. “So if you start at the top of the diagram, you can identify a specific problem that needs to be solved,” explained Garrett. “Our solutions then support the data preparation phase. Once data is prepared, people want to explore their data – and SAS data exploration solutions make that easy to do. And so on, across the entire life cycle. We’ve built our success on the fact that we can offer solutions for every step in the analytics life cycle – even the building, care and management of data models and the scoring.”

Figure 4: SAS® solutions support the entire analytics life cycle.

Preparing Data Using SAS®

Preparing data for analytics is different from preparing it for traditional IT purposes. For data to be analytics-ready, all data preparation steps must be completed, including data aggregation, data transformation (for distribution transformations), data enrichment (for deriving new variables), and analytical data cleaning (for missing values). Completing these steps takes a great deal of time and effort; in fact, as much as 60 to 80 percent of the time spent on an open source analytics project goes to data preparation.

“R packages are not very good when they run into different types of data or different formats,” explained Garrett.
“Nor does R allow for automatic treatment of measurement levels, character variables and missing values.”

Building and Validating Models Using SAS®

Once the hard work of data preparation is done, the fun begins. SAS gives users a wide range of algorithms and methods to solve a given problem. With SAS, users can create better-performing models using innovative algorithms and industry-specific methods, as well as verify results with visual assessment and validation metrics. SAS software also helps users easily compare predictions and assessment statistics from models built using different approaches, since they can be viewed side by side.

Deploying and Monitoring Models Using SAS®

Once a model is finalized, it needs to be put into production. “SAS has several technologies for model deployment and monitoring, including score code that can deploy in traditional relational databases and in Hadoop,” commented Garrett. “Our SAS Scoring Accelerator for Hadoop actually allows you to pull in a particular model, build in the parameters, build equations into calculations, and then push it into Hadoop. It’s a one-click conversion process that saves you weeks and even months of time if you are trying to convert to SQL or Java, so you can build your model quickly and run it in parallel.”

At the same time, SAS Model Manager streamlines the steps of creating, managing, deploying and operationalizing analytic models. Users can import R models into SAS Model Manager and then transform the scored output into a SAS data set for reporting. And by using the SAS Scoring Accelerator for Hadoop, they can push score code directly into Hadoop, significantly reducing data movement. The solution’s performance monitoring and retraining capabilities help users take quick action if model performance starts to degrade.

The Facts on Total Cost of Ownership

However, some nagging questions persist: From a financial perspective, does it make sense to run open source and SAS solutions concurrently?
Isn’t open source significantly less expensive because the software is free?

Garrett addressed these concerns head-on. “People often assume that open source is less costly because there’s no software to license,” he explained. “But the total cost of a solution encompasses much more than just license fees.” In fact, it actually comprises four variables:
• Hardware.
• Software.
• Human capital for lines of business (HC LOB).
• Human capital for IT (HC IT).

As shown in Figure 5, as organizations modernize their analytic ecosystems over time, their total cost of ownership should decrease dramatically. “For example, when companies move from legacy platforms and warehouses … to grid computing plus Hadoop and a comprehensive suite of high-performance, extremely scalable algorithms for distributed computing that uses the latest analytical innovations – and all this computing happens in memory – they see dramatic reductions in TCO,” noted Garrett.

Figure 5: Modernizing the analytic ecosystem leads to lower IT costs.

Stated Garrett: “People assume that by ‘building their own’ Hadoop distribution using free or much lower-cost open source software, they can get to the same place – or close to it – even cheaper. But this kind of thinking ignores the other cost variables. Because if I overlay total open source costs on this same chart, the costs from a human capital perspective actually grow.” Figure 6 illustrates the unexpected cost increases.

Figure 6: Open source solutions increase human capital costs (IT and line of business).

Why? Because organizations have to either hire or reallocate staff to do considerable extra work to get all of the technology built, integrated, tested and running. “People are busy coding, integrating, caring for systems and models, and more,” explained Garrett. “At this conference, I spoke to a gentleman who has been building his own Hadoop distribution for over five years!
All this work – the human resources being consumed in both IT and the associated lines of business – costs money.” These are just some of the costs that are often overlooked when people consider using R or other open source software.

SAS® Analytical Innovations for Hadoop

Garrett ended his presentation by sharing some recent analytical innovations developed by SAS that leverage Hadoop. These include in-memory analytics, visual analytics, and visual statistics.

SAS® High-Performance Analytics

SAS offers high-performance analytics that run in memory for lightning-fast processing – even for the largest data sets on Hadoop. “We had a customer who was building risk models that were taking longer and longer to run,” explained Garrett. “The customer needed to do weekly or daily models, but they were approaching the limits of their time frame to do it. So they asked us to help them speed things up. But the amount of data they needed to process – and the computational intensity required to do it – was quickly outpacing Moore’s Law. So we took the algorithms they needed and rebuilt them to run them in parallel across a set of machines.”

SAS then worked with the appliance vendors to build databases that can run the calculations in a massively parallel fashion. As Hadoop became more popular, SAS re-architected about 50 algorithms and individual statistical procedures so they can run inside Hadoop at high speeds. These procedures are used by a large number of SAS products and solutions.

“Not every algorithm is parallelizable – but we’ve taken the ones most frequently used by customers and built them out to run in parallel,” noted Garrett.
“Whether it’s a logistic regression or neural networks or SVM (Support Vector Machines), or if you have millions or hundreds of millions of records that you’re trying to do these sorts of calculations on, you can use a tool like SAS Enterprise Miner to leverage modern machine learning algorithms to run in memory – and in a highly distributed fashion.”

SAS® Visual Analytics at Scale

Business analysts need tools to help them “play” with large volumes of data and reveal hidden patterns. “But traditional business intelligence tools begin to fall over as you get into billions of records,” commented Garrett. “And today, many customers are way beyond the hundreds of millions of records. So our challenge was to build in-memory tools that would allow for high-speed data visualization, visual analytics, and more when people are dealing with massive data volumes. SAS is proud to have released a number of products like this.”

The first one is SAS Visual Analytics – a drag-and-drop tool that goes against millions, hundreds of millions, even billions of records kept in a distributed, in-memory, analytical store. “When I say distributed, I mean taking the data that you’ve splayed out across your HDFS across all of your particular nodes,” noted Garrett. “And enabling each of those particular machines to play their part to lift that data in parallel, up into RAM. And maybe you do it directly on your new cluster, where you take some of the RAM off of each of your Hadoop nodes. Or you build up a rack of machines nearby for that special purpose – a ‘math rack,’ if you will. Instead of lifting directly up into memory, you lift in parallel up into that particular math rack that can perform the specific analysis you want.”

These visualizations are accomplished using the SAS® LASR™ Analytic Server – an in-memory, distributed, stateless system. Explained Garrett: “It’s very different than an in-memory database, because you don’t ask for rows and columns out of this. You ask for analysis, such as a forecast, a decision tree, a correlation matrix, and so on. And you can do all this in a massively parallel fashion and get results at incredibly high speeds.”

SAS® Visual Statistics at Scale

SAS hasn’t forgotten about the data miners, modelers, and data scientists who need to perform exploratory modeling, such as:
• Supervised learning (for example, logistic regression, linear regression and generalized linear modeling).
• Unsupervised learning (using decision trees and clustering).
• Model assessments and comparisons (such as lift, ROC, and classification rate).
• Discovery at the observational level (for instance, to identify outliers and influence points).

“To support exploratory modeling, SAS has developed a new offering called SAS Visual Statistics,” Garrett remarked. “SAS Visual Statistics allows multiple users to quickly and interactively customize their models. They can add or change variables, remove outliers, etc., and instantaneously see how those changes affect model outcomes. And they can look at multiple models to determine which one provides the most predictive power.”

SAS® Visual Statistics at a Glance
• Interactive and exploratory predictive modeling in a superior visual environment.
• Seamless integration of data exploration and model development.
• Concurrent access to data loaded in memory in a multiuser environment.
• Support for predictive modeling techniques such as clustering, linear and logistic regression, interactive decision trees, and general linear models.
• Group BY processing.

SAS® In-Memory Statistics

For people who prefer to write code rather than use a GUI, SAS offers SAS In-Memory Statistics. “This solution does everything SAS Visual Statistics can do – but in a concurrent, multiuser environment within a programming environment,” added Garrett.
“For example, users can perform descriptive statistics, regression (both linear and logistic), decision trees, random forests, generalized linear models, text mining, forecasting, and clustering. And because data persists in memory, organizations benefit from faster computation time.”

With SAS In-Memory Statistics, users can also work with raw data and then program as needed to apply a wide variety of advanced analytical methods and machine learning algorithms. It also includes a recommendation engine that generates both explicit and implicit recommendations.

Making SAS® Accessible to Professors, Students, Researchers and Independent Learners

Analytic skills are highly sought by today’s employers. SAS understands this – and it’s why we make SAS software available and free for people who want to learn it. Our goal is to seed the market with analytical talent. To this end, SAS has created the SAS Analytics U program, which makes SAS software readily available to professors, instructors, students and researchers in an academic setting, as well as to independent learners seeking to learn SAS to attain skills required for a current or future job.

Adoption of SAS has been strong as a result of this new program. In fact, more than 100,000 people took advantage of the program in the first six months of its availability. “SAS has made its software available for free to people who want to learn it with its SAS Analytics U program,” concluded Garrett. “They also make videos, training, full documentation, and technical support available. Even if you are an independent learner – you can get a copy of SAS today and play with the product in a noncommercial environment.”

SAS Analytics U is a comprehensive global program that offers professors, students, academic researchers and independent learners access to:
• Free SAS software.
• Helpful resources to install, learn and use SAS.
• Free online classes.
• Interactive, online SAS Analytics U Community.
Learn More

To learn more about analytic solutions for Hadoop users, please see the research brief “Eyes Wide Open: Open Source Analytics Software” from the International Institute for Analytics, which is available at sas.com/openeyes.

To contact your local SAS office, please visit: sas.com/offices

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are trademarks of their respective companies. Copyright © 2014, SAS Institute Inc. All rights reserved.