 Conclusions Paper
The Use of Open Source Is Growing.
So Why Do Organizations Still Turn to SAS?
Insights from a presentation at the 2014 Hadoop Summit
Featuring
Brian Garrett, Principal Solutions Architect at SAS
Contents
Commercial Analytics Software and/or Open Source?
SAS Offers Unique Value to Open Source Users
  Experience and Expertise
  Proven Value to Customers
  Innovation and Leadership in Analytics
SAS Brings Value to Open Source Solutions
  SAS/IML® Software Integration
  SAS® Enterprise Miner™
SAS Supports the Entire Analytics Life Cycle
  Preparing Data Using SAS®
  Building and Validating Models Using SAS®
  Deploying and Monitoring Models Using SAS®
The Facts on Total Cost of Ownership
SAS® Analytical Innovations for Hadoop
  SAS® High-Performance Analytics
  SAS® Visual Analytics at Scale
  SAS® Visual Statistics at Scale
  SAS® In-Memory Statistics
Making SAS® Accessible to Professors, Students, Researchers and Independent Learners
Learn More
Commercial Analytics Software
and/or Open Source?
It’s a hot topic today. As customers debate which is the best way
to go, recent findings by Nucleus Research suggest that many
organizations have realized that they can meet both internal and
external stakeholder requirements by finding the right balance
of SAS enterprise-class analytics solutions and open source
solutions. Why? Because SAS is optimized for operational and
production analysis and includes integrated capabilities for data
management and more, while open source quickly brings new
analytic algorithms to market.1
But how can these technologies coexist in the real world to
meet different business needs? Do they “play” together well?
How do they work together to leverage Hadoop?
These and other questions were answered in a recent presentation at the 2014 Hadoop Summit by Brian Garrett, Principal
Solutions Architect at SAS. His presentation “With the Rise of
Open Source, Why Organizations Still Turn to SAS” highlighted
little-known facts about SAS investments in software enhancements that allow analysts to incorporate R algorithms into
analytic processes as part of a comprehensive, enterprise-class
SAS analytics platform.
SAS Offers Unique Value
to Open Source Users
Everyone knows SAS. What they may not know is what SAS
offers to users of open source analytics – and especially to those
users storing data in Hadoop clusters. Garrett explained, “From
my perspective, SAS offers four key strengths: experience and
expertise, value to customers, innovation, and leadership in
analytics.”
Experience and Expertise
SAS – the world’s largest, privately held software company –
was founded in 1976 and currently employs more than 13,000
people worldwide across 400 office locations. “Our employees
are one of our biggest assets because of the depth and breadth
of their experience,” said Garrett. The average
tenure at SAS is over 10 years, which is important because
1) usually, original code developers are just down the hall and
available for consultation, and 2) SAS employees have a tremendous amount of intellectual property in their heads – IP that’s
readily available to peers and customers.
1 SAS Analytics and Open Source. Nucleus Research, April 2014.
“In addition, SAS employee turnover rate averages 3 percent
annually – compared to the industry average of 22 percent,”
noted Garrett. “Eighty-five percent of our statistics, statistical
product and testing teams have advanced degrees – and fully
half of them have PhDs in math, statistics or operations research.
So we manage to attract and retain some of the best and
brightest analytical talent in the industry.” For example, several
years ago, SAS created an advanced analytics lab to tackle
some of the newest emerging technologies – and 98 percent of
the lab’s employees have advanced degrees and one-third have PhDs.
“It’s this level of expertise that separates SAS’ development
and testing processes. They can – and do – perform rigorous
testing and validation of all algorithms used in products so that
they are proven reliable, accurate and ready for enterprise use,”
continued Garrett.
“About 25 percent of our revenues
are put back into research and development. That’s huge for a software
company. But we believe it’s critical to
driving innovation and leadership in
this space.”
Brian Garrett, Principal Solutions Architect, SAS
Proven Value to Customers
SAS software is currently in use across approximately 70,000
customer sites in 135 countries – including 90 of the top 100
companies in the Fortune 500. “One-fourth of our revenue
comes from the financial services industries, and three-quarters
comes from all other major segments,” noted Garrett. “So we’re
delivering value across a wide range of industries.” Figure 1
details these industries.
Figure 1: Percentage of SAS business by industry.
Equally important, SAS Business Solutions go far beyond
analytic algorithms. They encompass solutions for data management, analytics, business intelligence, and high-performance
analytics. Under this broad umbrella, SAS offers many horizontal
and vertical solutions that help organizations use their data to
solve contemporary and industry-specific business problems.
“Our analytic solutions all use the same underlying core technology, meaning they use the same algorithms and statistical
methods in very specific ways to address such issues as fraud,
credit risk, merchandizing effectiveness, customer intelligence
and more,” noted Garrett. “So while Hadoop enables the persistence of large amounts of data in a relatively cheap fashion, SAS
focuses on what you can do with data, what it can tell you.”
SAS consistently incorporates extensive customer input when
building and enhancing products, ensuring close alignment
between software functionality and business need. At the
same time, SAS provides comprehensive training and documentation for all products, which is critical to ramping up and
supporting users so customers quickly realize value from their
SAS investments.
Innovation and Leadership in Analytics
Customers, industries and research organizations seeking
innovative solutions to new analytic problems rely on SAS. To
develop these solutions, SAS developers engage in conversations with customers across many industries to learn about
their most pressing business problems. At the same time, SAS
experts are actively involved in professional conferences and
work with leading academic researchers – for example, to
learn about new methodologies and algorithms, as well as to
evaluate the solutions’ robustness and effectiveness for use by
customers.
By bridging these two worlds – academia and business – SAS
delivers innovative methods that matter to our customers,
drawing on rapidly expanding analytical disciplines. At the same
time, through continual and longstanding engagement with
universities, SAS shares its own best practices with professors
and students, broadening the practices’ application to other
disciplines and industries.
Advanced analytics developers at
SAS learn about the broad range
of problems that customers have
– and at the same time, they keep
up with new developments in their
respective disciplines. This combination enables SAS to develop
new methods and algorithms that
make a real difference in practice.
SAS Brings Value to Open
Source Solutions
“The open source community brings a tremendous amount of
value to the analytics community,” explained Garrett. “It brings
together people from very different backgrounds and experiences to solve complex problems. And at SAS, we believe
there’s a lot of really good, collaborative work that’s been done
identifying new problems and finding new ways to solve them.
And that’s why SAS has opened up key parts of its software to
integrate with certain open source software.”
SAS/IML® Software Integration
For example, SAS/IML – a matrix manipulation product that
supports matrix-vector computation – gives users the flexibility
to create custom functions. It also:
• Takes advantage of built-in functions, subroutines and SAS
procedures.
• Enables interfaces with R language so users can submit R
code within SAS.
• Moves between SAS and R data structures.
• Enables R functions and packages.
“With SAS/IML, users can submit R code within SAS,” stated
Garrett. “So you can write R code and be running SAS/IML
software. Say you want to do some matrix manipulations and
some functions within R. Now you can do that.” To learn more, see:
• www.youtube.com/watch?v=rUaTTre24kI
• www.youtube.com/watch?v=nmRQ3MtkG6A
SAS® Enterprise Miner™
SAS Enterprise Miner – a data mining product – also integrates
with open source solutions. SAS Enterprise Miner has an easy-to-use, drag-and-drop interface that allows people to do data
mining with ease and to create models. Integration with open
source provides access to R modeling packages. In addition,
if a model supports PMML (Predictive Modeling Markup
Language), SAS Enterprise Miner can convert the PMML to SAS
score code for conversion to production. (If a model does not
generate PMML, users can still assess the model against other
SAS models.)
“You can leverage R modeling packages from a SAS environment and generate corresponding PMML models, compare R
and SAS models in one interface, and create ensemble models,”
stated Garrett. As shown in Figure 2, using SAS Enterprise
Miner, users can simply drag and drop particular blocks and
connect them together to create a process flow. In PMML
output mode, the Open Source Integration node translates the
R model into SAS DATA step code using PMML. The node then
scores all imported data partitions with the generated SAS score
code so users can easily compare R and SAS models. The node
automatically runs standard SAS Enterprise Miner assessments
for supervised predictive models.
Figure 2: Using SAS® Enterprise Miner™ and R.
“But the real value of the integration is that people can also
create ensemble models using R and SAS,” explained Garrett.
“SAS Enterprise Miner can import R models and be used for
model transformations, imputations and more. People can then
build R models with ease, as it wraps a framework around the R
model. They can also take the two models, put them together,
and make an even better model (for instance, by better
targeting populations that impact critical business decisions).”
Figure 3 illustrates how a blended model can combine the best
of both an open source and SAS model to achieve the greatest
lift, and thus the greatest improvement in overall performance.
Figure 3: Creating ensemble models.
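To make the blending idea concrete, the following is a minimal sketch in Python, not SAS Enterprise Miner code. It assumes two models have already scored the same records, averages their predicted probabilities, and measures lift in the top-scored fraction. The function names, the equal weighting and the sample scores are hypothetical and chosen only for illustration.

```python
def blend_scores(scores_a, scores_b, weight_a=0.5):
    """Weighted average of two models' predicted probabilities."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both models must score the same records")
    return [weight_a * a + (1.0 - weight_a) * b
            for a, b in zip(scores_a, scores_b)]


def lift_at(scores, labels, fraction=0.2):
    """Response rate in the top-scored fraction divided by the overall rate."""
    n_top = max(1, int(len(scores) * fraction))
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    top_rate = sum(label for _, label in ranked[:n_top]) / n_top
    overall_rate = sum(labels) / len(labels)
    return top_rate / overall_rate


# Hypothetical toy data: 1 = responder, 0 = non-responder.
labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
open_source_scores = [0.9, 0.3, 0.8, 0.2, 0.1, 0.1, 0.2, 0.1, 0.3, 0.2]
sas_scores = [0.4, 0.9, 0.2, 0.8, 0.1, 0.2, 0.1, 0.3, 0.1, 0.2]
blended = blend_scores(open_source_scores, sas_scores)

for name, scores in [("open source", open_source_scores),
                     ("SAS", sas_scores),
                     ("blended", blended)]:
    print(name, round(lift_at(scores, labels), 2))
```

On this toy data each model ranks only one of the two responders at the top, while the blend ranks both, so its lift is higher. That outcome is a property of the invented scores. A blend is not guaranteed to beat its components on real data.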
SAS Supports the Entire
Analytics Life Cycle
As discussed previously, SAS integration with open source optimizes how data scientists can explore data and develop models.
But as shown in Figure 4, SAS also fills in the gaps across the
entire analytics life cycle. For example, many open source
analytics software products do not support data management
and preparation or model deployment and monitoring – or
if they do, it’s in a way that is too difficult or cumbersome. In
contrast, SAS solutions allow you to have an end-to-end solution
across the entire analytics life cycle.
Figure 4 illustrates how SAS innovates across the entire analytics
life cycle – not just algorithms or a new DATA step. “So if you start
at the top of the diagram, you can identify a specific problem
that needs to be solved,” explained Garrett. “Our solutions then
support the data preparation phase. Once data is prepared,
people want to explore their data – and SAS data exploration
solutions make that easy to do. And so on, across the entire life
cycle. We’ve built our success on the fact that we can offer solutions for every step in the analytics life cycle – even the building,
care and management of data models and the scoring.”
Figure 4: SAS® solutions support the entire analytics life cycle.
Preparing Data Using SAS®
Preparing data for analytics is different from preparing it for
traditional IT purposes. For data to be analytics-ready, all data
preparation steps must be completed, including data aggregation, data transformation (for distribution transformations), data
enrichment (for deriving new variables), and analytical data
cleaning (for missing values).
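The four preparation steps can be sketched in a few lines of plain Python. This is an illustrative sketch only, not SAS code: the transaction layout, the column names and the choice of a log transform and median imputation are all assumptions made for the example.

```python
import math
from collections import defaultdict
from statistics import median

# Hypothetical raw transactions: (customer_id, amount, age or None if missing)
transactions = [
    ("c1", 120.0, 34),
    ("c1", 80.0, 34),
    ("c2", 15.0, None),
    ("c3", 900.0, 51),
    ("c3", 100.0, 51),
]


def prepare(rows):
    """Build one analytics-ready record per customer."""
    # Aggregation: roll transactions up to the customer level.
    totals, counts, ages = defaultdict(float), defaultdict(int), {}
    for customer, amount, age in rows:
        totals[customer] += amount
        counts[customer] += 1
        ages[customer] = age

    # Cleaning: impute missing ages with the median of the known ages.
    known_ages = [age for age in ages.values() if age is not None]
    fill_age = median(known_ages)

    table = {}
    for customer in totals:
        table[customer] = {
            "total_spend": totals[customer],
            # Transformation: log1p reduces the skew of the spend distribution.
            "log_spend": math.log1p(totals[customer]),
            # Enrichment: derive a new variable from existing ones.
            "avg_ticket": totals[customer] / counts[customer],
            "age": ages[customer] if ages[customer] is not None else fill_age,
        }
    return table


prepared = prepare(transactions)
```

Even in this small example, most of the code handles shape and quality of the data rather than modeling, which is consistent with the time split described in this section.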
Completing these steps takes a great deal of time and effort;
in fact, as much as 60 to 80 percent of the time spent on an open
source analytics project goes to data preparation. “R packages
are not very good when they run into different types of data or
different formats,” explained Garrett. “Nor does R allow for automatic treatment of measurement levels, character variables and
missing values.”
Building and Validating Models Using SAS®
Once the hard work of data preparation is done, the fun begins.
SAS gives users lots of algorithms and methods to solve a given
problem. With SAS, users can create better-performing models
using innovative algorithms and industry-specific methods,
as well as verify results with visual assessment and validation
metrics. SAS software also helps users easily compare predictions and assessment statistics from models built using different
approaches, since they can be viewed side by side.
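As a rough illustration of a side-by-side comparison, the sketch below computes two common assessment statistics for two sets of model scores on the same validation labels. It is plain Python, not a SAS procedure. The model names, scores and the 0.5 cutoff are hypothetical.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


def classification_rate(scores, labels, cutoff=0.5):
    """Share of records classified correctly at the given cutoff."""
    correct = sum((s >= cutoff) == bool(y) for s, y in zip(scores, labels))
    return correct / len(labels)


def compare(models, labels):
    """Return the same assessment statistics for every model, keyed by name."""
    return {
        name: {
            "auc": auc(scores, labels),
            "classification_rate": classification_rate(scores, labels),
        }
        for name, scores in models.items()
    }


# Hypothetical validation data and scores from two modeling approaches.
labels = [1, 0, 1, 0, 1, 0]
models = {
    "model_a": [0.9, 0.4, 0.6, 0.7, 0.8, 0.1],
    "model_b": [0.55, 0.6, 0.4, 0.3, 0.9, 0.2],
}
report = compare(models, labels)
for name, stats in report.items():
    print(name, stats)
```

The pairwise AUC is quadratic in the number of records, which is fine for a sketch but would be replaced by a rank-based calculation on large data.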
Deploying and Monitoring Models Using SAS®
Once a model is finalized, it needs to be put into production.
“SAS has several technologies for model deployment and
monitoring, including score code that can deploy in traditional
relational databases and in Hadoop,” commented Garrett. “Our
SAS Scoring Accelerator for Hadoop actually allows you to pull
in a particular model, build in the parameters, build equations
into calculations, and then push it into Hadoop. It’s a one-click
conversion process that saves you weeks and even months
of time if you are trying to convert to SQL or Java, so you can
build your model quickly and run it in parallel.”
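To show what converting a model to SQL involves, here is a small sketch that renders a logistic regression as a SQL scoring expression. It does not represent how the SAS Scoring Accelerator works; it only illustrates the kind of translation that would otherwise be done by hand. The coefficients, column names and table name are hypothetical.

```python
import math


def logistic_to_sql(intercept, coefficients, table):
    """Render a logistic regression as a SQL SELECT that computes the score.

    Column and table names are interpolated as-is, so they must come from a
    trusted source in any real use.
    """
    terms = [repr(float(intercept))]
    terms += ["{!r} * {}".format(float(weight), column)
              for column, weight in coefficients.items()]
    linear = " + ".join(terms)
    return "SELECT 1.0 / (1.0 + EXP(-({}))) AS score FROM {}".format(linear, table)


def score(intercept, coefficients, row):
    """Reference implementation used to check the generated SQL by hand."""
    z = intercept + sum(weight * row[column]
                        for column, weight in coefficients.items())
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical model parameters.
intercept = -1.5
coefficients = {"balance": 0.02, "late_payments": 0.8}

sql = logistic_to_sql(intercept, coefficients, "customers")
print(sql)
print(score(intercept, coefficients, {"balance": 50, "late_payments": 1}))
```

A single logistic regression translates easily. The effort grows quickly once a model includes transformations, imputations or ensembles, which is where automated score code generation saves time.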
At the same time, SAS Model Manager streamlines the steps
of creating, managing, deploying and operationalizing analytic
models. Users can import R models into SAS Model Manager
and then transform the scored output into a SAS data set
for reporting. And by using the SAS Scoring Accelerator for
Hadoop, they can push score code directly into Hadoop,
significantly reducing data movement. The solution’s performance monitoring and retraining capabilities help users act
quickly if model performance starts to degrade.
The Facts on Total Cost
of Ownership
However, some nagging questions persist: From a financial
perspective, does it make sense to run open source and SAS
solutions concurrently? Isn’t open source significantly less
expensive because the software is free?
Garrett addressed these concerns head-on. “People often
assume that open source is less costly because there’s no
software to license,” he explained. “But the total cost of a
solution encompasses much more than just license fees.”
In fact, it comprises four variables:
• Hardware.
• Software.
• Human capital for lines of business (HC LOB).
• Human capital for IT (HC IT).
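The arithmetic behind the four variables can be written down directly. In the sketch below, the class name and every figure are invented placeholders used only to show the calculation. They are not taken from the presentation and say nothing about real costs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TcoEstimate:
    """Total cost of ownership split into the four variables listed above."""
    hardware: float
    software: float
    hc_lob: float  # human capital for lines of business
    hc_it: float   # human capital for IT

    def total(self):
        return self.hardware + self.software + self.hc_lob + self.hc_it


# Placeholder figures in arbitrary units, for illustration only.
licensed = TcoEstimate(hardware=100, software=80, hc_lob=60, hc_it=60)
zero_license_fee = TcoEstimate(hardware=100, software=0, hc_lob=110, hc_it=130)

print("licensed:", licensed.total())
print("zero license fee:", zero_license_fee.total())
```

The point of the calculation is structural: setting the software term to zero lowers the total only if the two human capital terms do not rise by more than the license fee saved.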
Figure 5: Modernizing the analytic ecosystem leads to lower IT costs.
Figure 6: Open source solutions increase human capital costs (IT and line of business).
As shown in Figure 5, as organizations modernize their analytic
ecosystems over time, their total cost of ownership should
decrease dramatically. “For example, when companies move
from legacy platforms and warehouses … to grid computing
plus Hadoop and a comprehensive suite of high-performance,
extremely scalable algorithms for distributed computing that
uses the latest analytical innovations – and all this computing
happens in memory – they see dramatic reductions in TCO,”
noted Garrett.
Stated Garrett: “People assume that by ‘building their own’
Hadoop distribution using free or much lower-cost open source
software, they can get to the same place – or close to it – even
cheaper. But this kind of thinking ignores the other cost variables. Because if I overlay total open source costs on this same
chart, the costs from a human capital perspective actually grow.”
Figure 6 illustrates the unexpected cost increases.
Why? Because organizations have to either hire or reallocate
staff to do considerable extra work to get all of the technology built, integrated, tested and running. “People are busy
coding, integrating, caring for systems and models, and more,”
explained Garrett. “At this conference, I spoke to a gentleman
who has been building his own Hadoop distribution for over
five years! All this work – the human resources being consumed
in both IT and the associated lines of business – costs money.”
These are just some of the costs that are often overlooked when
people consider using R or other open source software.
SAS® Analytical Innovations
for Hadoop
Garrett ended his presentation by sharing some recent analytical innovations developed by SAS that leverage Hadoop. These
include in-memory analytics, visual analytics, and visual statistics.
SAS® High-Performance Analytics
SAS offers high-performance analytics that run in memory for
lightning-fast processing – even for the largest data sets on
Hadoop. “We had a customer who was building risk models
that were taking longer and longer to run,” explained Garrett.
“The customer needed to do weekly or daily models, but they
were approaching the limits of their time frame to do it. So
they asked us to help them speed things up. But the amount of
data they needed to process – and the computational intensity
required to do it – was quickly outpacing Moore’s Law. So we
took the algorithms they needed and rebuilt them to run them
in parallel across a set of machines.” SAS then worked with the
appliance vendors to build databases that can run the calculations in a massively parallel fashion.
As Hadoop became more popular, SAS re-architected about 50
algorithms and individual statistical procedures so they can run
inside Hadoop at high speeds. These procedures are used by a
large number of SAS products and solutions.
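The general pattern behind running a calculation in parallel across machines can be sketched as follows: each partition of the data computes a small summary, and the summaries are then combined. This is an illustrative Python sketch, not a SAS procedure. Threads stand in for cluster nodes, and the function names and sample partitions are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_stats(chunk):
    """Per-partition summary: count, sum and sum of squares."""
    return len(chunk), sum(chunk), sum(x * x for x in chunk)


def parallel_mean_variance(partitions, workers=4):
    """Combine per-partition summaries into a global mean and population variance."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(partial_stats, partitions))
    n = sum(part[0] for part in parts)
    total = sum(part[1] for part in parts)
    total_sq = sum(part[2] for part in parts)
    mean = total / n
    # Simple sum-of-squares formula; numerically naive but enough for a sketch.
    variance = total_sq / n - mean * mean
    return mean, variance


# Hypothetical data already split across three "nodes".
partitions = [[1.0, 2.0], [3.0, 4.0], [5.0]]
mean, variance = parallel_mean_variance(partitions)
print(mean, variance)
```

Only three numbers per partition travel between workers, regardless of how many records each partition holds. Algorithms whose results can be assembled from such summaries parallelize well. Those that need repeated passes over all of the data at once are harder to distribute.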
“Not every algorithm is parallelizable – but we’ve taken the
ones most frequently used by customers and built them out to
run in parallel,” noted Garrett. “Whether it’s a logistic regression or neural networks or SVM (Support Vector Machines),
or if you have millions or hundreds of millions of records that
you’re trying to do these sorts of calculations on, you can use
a tool like SAS Enterprise Miner to leverage modern machine
learning algorithms to run in memory – and in a highly distributed fashion.”
SAS® Visual Analytics at Scale
Business analysts need tools to help them “play” with large
volumes of data and reveal hidden patterns. “But traditional
business intelligence tools begin to fall over as you get into
billions of records,” commented Garrett. “And today, many
customers are way beyond the hundreds of millions of records.
So our challenge was to build in-memory tools that would allow
for high-speed data visualization, visual analytics, and more
when people are dealing with massive data volumes. SAS is
proud to have released a number of products like this.”
The first one is SAS Visual Analytics – a drag-and-drop tool that
goes against millions, hundreds of millions, even billions of
records kept in a distributed, in-memory, analytical store. “When
I say distributed, I mean taking the data that you’ve splayed out
across your HDFS across all of your particular nodes,” noted
Garrett. “And enabling each of those particular machines to play
their part to lift that data in parallel, up into RAM. And maybe
you do it directly on your new cluster, where you take some of
the RAM off of each of your Hadoop nodes. Or you build up a
rack of machines nearby for that special purpose – a ‘math rack,’
if you will. Instead of lifting directly up into memory, you lift in
parallel up into that particular math rack that can perform the
specific analysis you want.”
These visualizations are accomplished using the SAS® LASR™
Analytic Server – an in-memory, distributed, stateless system.
Explained Garrett: “It’s very different than an in-memory
database, because you don’t ask for rows and columns out
of this. You ask for analysis, such as a forecast, a decision tree,
a correlation matrix, and so on. And you can do all this in a
massively parallel fashion and get results at incredibly high
speeds.”
SAS® Visual Statistics at Scale
SAS hasn’t forgotten about the data miners, modelers, and data
scientists who need to perform exploratory modeling, such as:
• Supervised learning (for example, logistic regression, linear
regression and generalized linear modeling).
• Unsupervised learning (using decision trees and clustering).
• Model assessments and comparisons (such as lift, ROC, and
classification rate).
• Discovery at the observational level (for instance, to identify
outliers and influence points).
“To support exploratory modeling, SAS has developed a new
offering called SAS Visual Statistics,” Garrett remarked. “SAS
Visual Statistics allows multiple users to quickly and interactively
customize their models. They can add or change variables,
remove outliers, etc., and instantaneously see how those changes
affect model outcomes. And they can look at multiple models to
determine which one provides the most predictive power.”
SAS® Visual Statistics at a Glance
• Interactive and exploratory predictive modeling in a superior visual
environment.
• Seamless integration of data exploration and model development.
• Concurrent access to data loaded in
memory in a multiuser environment.
• Support for predictive modeling techniques such as clustering, linear and
logistic regression, interactive decision
trees, and general linear models.
• Group BY processing.
SAS® In-Memory Statistics
For people who prefer to write code rather than use a GUI, SAS
offers SAS In-Memory Statistics. “This solution does everything
SAS Visual Statistics can do – but in a concurrent, multiuser
environment within a programming environment,” added
Garrett. “For example, users can perform descriptive statistics,
regression (both linear and logistic), decision trees, random
forests, generalized linear models, text mining, forecasting, and
clustering. And because data persists in memory, organizations
benefit from faster computation time.”
With SAS In-Memory Statistics, users can also work with raw
data and then program as needed to generate a wide variety of
advanced analytical methods and machine learning algorithms.
It also includes a recommendation engine that generates both
explicit and implicit recommendations.
Making SAS® Accessible to
Professors, Students, Researchers
and Independent Learners
Analytic skills are highly sought by today’s employers. SAS
understands this – and it’s why we make SAS software available
and free for people who want to learn it. Our goal is to seed the
market with analytical talent.
To this end, SAS has created the SAS Analytics U program, which
makes SAS software readily available to professors, instructors, students and researchers in an academic setting, as well
as to independent learners seeking to learn SAS to attain skills
required for a current or future job.
Adoption of SAS has been strong as a result of this new
program. In fact, more than 100,000 people took advantage of it in the first six months of its availability.
“SAS has made its software available for free to people who
want to learn it with its SAS Analytics U program,” concluded
Garrett. “They also make videos, training, full documentation,
and technical support available. Even if you are an independent
learner – you can get a copy of SAS today and play with the
product in a noncommercial environment.”
SAS Analytics U is a comprehensive global program that offers
professors, students, academic researchers and independent learners access to:
• Free SAS software.
• Helpful resources to install, learn and use SAS.
• Free online classes.
• Interactive, online SAS Analytics U Community.
Learn More
To learn more about analytic solutions for Hadoop users, please
see the research brief “Eyes Wide Open: Open Source Analytics
Software” from the International Institute for Analytics, which is
available at sas.com/openeyes.
To contact your local SAS office, please visit: sas.com/offices
SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of
SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product
names are trademarks of their respective companies. Copyright © 2014, SAS Institute Inc. All rights reserved.