Improving network security using big data and machine learning
IT Showcase Article
Situation
Detecting malware and protecting
intellectual property are not new concerns
for IT, but new aspects of the computing
environment at large organizations give
these old concerns new urgency.
Solution
Microsoft IT is working with a new big data
platform and applying machine learning to
help detect and analyze patterns that can
indicate malware or the oversharing of
sensitive information.
Benefits
• The ability to collect and analyze data on a large scale that is relevant to security initiatives.
• The ability to detect patterns in the data that would be invisible to human analysts.
• Improved security with regard to malware on the company network and protection of intellectual property.
Products and technologies
• Microsoft Azure Storage
• Microsoft Azure HDInsight
• Microsoft Azure Machine Learning
Published August 2015
Microsoft IT is improving network security using big data and
machine learning to detect usage patterns that might otherwise
escape notice. The team is using these tools to address two
particularly significant security problems in large organizations:
the challenging landscape created by the sheer volume of
machines on the network (exacerbated by the “bring your own
device” workplace culture), and the ability to overshare sensitive
information with collaboration software.
Intrusion detection: the first step in fighting malware is to see it
Malware, by nature, tries to be inconspicuous. At an organization as large as Microsoft, with half a million machines and devices connecting to the corporate network, staying inconspicuous is naturally a little easier.
Microsoft IT needed a better way to stop hackers and the malware these hackers
could introduce into the corporate network. Within Microsoft IT, the Information
Security and Risk Management (ISRM) team began to develop a long-term
intrusion detection program. ISRM saw the value in deploying big data and
machine learning in symbiotic combination to advance their intrusion detection
capabilities.
With a big data platform, ISRM could create a massive collection of events from
hundreds of thousands of machines across the network. On top of that data, ISRM
could build a series of algorithms to look for anomalous or suspicious behavior,
and fine-tune the machine learning to improve the performance of the model.
For example, consider the algorithm ISRM and its partners developed to isolate
rare processes. An event on every single Windows device indicates when a process
has started. Across half a million machines, that’s about three billion events per
day. ISRM runs an algorithm that shows them if a directory or process only appears
on one or two machines in the entire company. The output for this algorithm is a
rare directory, a rare executable file, or both. Typically, this produces about 10,000
records per day, which is a much more manageable number (as opposed to three
billion) that ISRM then feeds into its machine-learning tool. With far fewer false
positives to analyze, machine learning improves because it focuses on true
positives.
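ISRM's production job runs at HDInsight scale; the following is a minimal sketch of the counting logic in Python, assuming a hypothetical event shape of (machine_id, directory, executable) and a rarity threshold of two machines.

```python
from collections import defaultdict

# Hypothetical process-start event: (machine_id, directory, executable).
# At production scale this aggregation runs as a distributed job (e.g., on
# HDInsight); plain Python keeps the core logic visible.
RARE_THRESHOLD = 2  # flag artifacts seen on at most this many machines

def find_rare_artifacts(events):
    """Return directories and executables observed on very few machines."""
    machines_by_dir = defaultdict(set)
    machines_by_exe = defaultdict(set)
    for machine_id, directory, executable in events:
        machines_by_dir[directory].add(machine_id)
        machines_by_exe[executable].add(machine_id)
    rare_dirs = {d for d, m in machines_by_dir.items() if len(m) <= RARE_THRESHOLD}
    rare_exes = {e for e, m in machines_by_exe.items() if len(m) <= RARE_THRESHOLD}
    return rare_dirs, rare_exes

# Example: billions of daily events reduce to a short list of rare records
# that can then be fed into the machine-learning tool.
events = [
    ("host-001", r"C:\Program Files\Common", "svchost.exe"),
    ("host-002", r"C:\Program Files\Common", "svchost.exe"),
    ("host-003", r"C:\Users\x\AppData\tmp9f", "dropper.exe"),
]
print(find_rare_artifacts(events))
```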
Why a big data platform?
ISRM partnered with scientists in the Data and Decision Sciences Group (DDSG) to
build this big data platform. Microsoft decided to build a new big data platform in
addition to using a security information and event management (SIEM) solution
because it wanted to be able to predict potential problems.
SIEMs provide real-time detection and are a valuable part of the overall security
solution. But they can be difficult to scale to the size of the detection program
Microsoft was looking for. Moreover, as principal IT service engineering manager
Jenn LeMond noted, while SIEM is good for some types of detection (for example,
when you already know what you are looking for), it is not very good at predictive
detection (when you don’t know exactly what you are looking for).
“That’s when you need to move to a big data platform, that’s what machine
learning is supposed to get you,” LeMond said. “I want the machine to tell me what
the data looks like from a normal perspective, and then tell me what the outliers
are.”
So the ISRM and DDSG teams agreed that the components of this new solution
would include:
• Microsoft Azure Storage. Provides the ability to store and process hundreds of terabytes of data and handles millions of requests per second, on average.
• Azure HDInsight. Provides big data analysis and accommodates custom algorithms developed by ISRM and DDSG.
• Azure Machine Learning (Azure ML). Provides predictive analytics and includes some generic “plug-and-play” algorithms to apply to big data.
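To ground the role of Azure Storage, here is a hedged sketch of landing a raw event feed as a blob using the azure-storage-blob Python SDK. The connection string, container name, and file paths are all placeholders, and this modern SDK is illustrative rather than what ISRM used in 2015.

```python
from azure.storage.blob import BlobServiceClient  # pip install azure-storage-blob

# All names below are invented for this illustration.
service = BlobServiceClient.from_connection_string("<connection-string>")
container = service.get_container_client("security-events")

# Land one day's raw event feed as a blob for downstream HDInsight jobs.
with open("process_events.csv", "rb") as feed:
    container.upload_blob(name="raw/2015-08-01/process_events.csv",
                          data=feed, overwrite=True)
```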
Data collection: the most time-consuming part of the solution
Over the two years that this program has been running, ISRM has found that the
most laborious part of the process has been collecting the data and aggregating it
into a usable form. It is a complex, technical process that includes transforming the format of some data, transforming the values of some data to standardize it with other feeds, interacting with real-time systems, and accepting streaming data.
The first feed of data took about six months. The second feed took about half that
time. “As we learn about how to build tables more effectively, with the correct
schema, the amount of time to input new feeds is beginning to reduce,” LeMond
said.
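As an illustration of that transformation work, the sketch below reshapes a hypothetical syslog-style record into an assumed common schema; the field names and input format are invented for the example.

```python
from datetime import datetime, timezone

# Hypothetical common schema: every feed is reshaped into these fields
# before it lands in the big data platform.
COMMON_FIELDS = ("timestamp_utc", "host", "event_type", "detail")

def normalize_syslog(line):
    """Transform one raw syslog-style record into the assumed common schema."""
    # Assumed input format: "<epoch> <host> <event_type> <detail...>"
    epoch, host, event_type, detail = line.split(" ", 3)
    return {
        "timestamp_utc": datetime.fromtimestamp(int(epoch), tz=timezone.utc).isoformat(),
        "host": host.lower(),              # standardize values across feeds
        "event_type": event_type.upper(),
        "detail": detail.strip(),
    }

print(normalize_syslog("1438387200 HOST-001 proc_start C:\\tools\\scan.exe"))
```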
Other key requirements that were recognized during the data collection phase of
the project have included:
• Accommodating thousands of transactions per second.
• Thinking about the design and schema to support future cross-correlation.
• Designing the system to perform well, not just for the analytics but also for ad hoc data exploration.
Data analysis: a vital, collaborative effort to make the output meaningful
ISRM found that one of the biggest challenges in the big data solution was getting
rid of the noise of false positives in the output. In other words, it was one thing to
collect data, and quite another to create a meaningful and effective tool for looking
ahead, predictively. Data scientist Serge Berger calls this “actionable foresight.” To
get actionable foresight, ISRM found that they needed a great deal of data analysis.
LeMond cites the work her team did to solve the problem of identifying rare paths
and separating the meaningful results from the noise: “When we were looking for
rare paths, we discovered that, for example, Windows 8 installs executables into
three randomly named directories. That meant the way our algorithm worked,
every single directory was flagged as rare, until we calculated the directory path
segments into a mathematical formula.
“There was a game called ‘Unicorn.exe.’ We had 20,000 of those, all coming up as
rare because the directory paths were random. When we calculated the segment
paths together, all of that rolled up into a single 20,000 unicorn.exe, which was no
longer rare.”
It was necessary to understand the data to improve the results of the original algorithm. ISRM was thus able to tune the algorithm and significantly reduce the false positives in the output.
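The article does not give the exact mathematical formula ISRM used; one plausible illustration is to collapse path segments that look machine-generated, so randomly named install directories roll up into a single record.

```python
import re
from collections import Counter

# Illustrative stand-in for ISRM's formula: collapse path segments that
# look machine-generated (long hex strings) into a wildcard so random
# install directories roll up together.
RANDOM_SEGMENT = re.compile(r"^[0-9a-f]{8,}$", re.IGNORECASE)

def canonical_path(path):
    segments = path.strip("\\").split("\\")
    return "\\".join("*" if RANDOM_SEGMENT.match(s) else s for s in segments)

paths = [
    r"C:\Packages\3fa9c2d41b77\unicorn.exe",
    r"C:\Packages\9e04b8aa52c1\unicorn.exe",
    r"C:\Packages\77d1f0e3bc29\unicorn.exe",
]
counts = Counter(canonical_path(p) for p in paths)
print(counts)  # all three roll up to C:\Packages\*\unicorn.exe -> no longer rare
```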
At present, none of the cloud analytics platforms on the market can handle the
amount of raw data the ISRM project generates, underscoring the necessity of
doing analytics and Hive database queries first. Once the Hive queries have run, ISRM is able to use Azure ML to run secondary sets of logic on their output.
ISRM considers the data analysis phase to be so important that:
• They recommend pairing a data scientist with an IT security professional for optimal results. The more the two of them can communicate effectively and iterate on the algorithms the data scientist creates, the more effective the output will be for the data analysts working downstream.
• They recommend conducting data analysis both before and after running the machine learning algorithms (discussed below). The first round of data analysis is necessary to optimize the big data platform for machine learning. The second round is necessary to train the algorithms over time. The process is iterative.
• Depending on the size and complexity of the environment, they recommend accelerating development by adding more data analysts to work with the data scientists. They have found that this is where most time is spent.
Running machine learning to make anomalies visible
A quick look at the statistics of this program illustrates exactly why machine-learning
technology is an integral part of the solution. Seven to eight terabytes of data are
generated per day across the company, and tens of thousands of events per day
are produced by the process, network, and account analytics that the ISRM team
employs. As LeMond succinctly puts it, “This is too much for a human being to
look at.”
Machine learning illuminates patterns in the data and makes anomalies visible,
giving the security experts a better chance of detecting malware.
As Berger explains this portion of the solution, “We are looking for particular
patterns associated with events. We’re actually not interested in individual events,
as these are not very useful in uncovering malware. But a certain pattern that exists
within a period of time (for example, five to seven days) may allow you to detect
malware with a high degree of accuracy.”
So DDSG aggregates the big data into time-series data built from events. Specifically, system log (syslog) and Windows log events are analyzed over time and turned into temporal data.
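A minimal sketch of that aggregation, assuming hypothetical (host, timestamp, event_type) records: events are binned into per-host daily counts, producing the kind of temporal data described above over a multi-day window.

```python
from collections import defaultdict
from datetime import datetime

def to_time_series(events):
    """Bin raw log events into per-host daily counts by event type."""
    series = defaultdict(lambda: defaultdict(int))
    for host, ts, event_type in events:
        day = datetime.fromisoformat(ts).date()
        series[host][(day, event_type)] += 1
    return series

events = [
    ("host-001", "2015-08-01T09:12:00", "PROC_START"),
    ("host-001", "2015-08-01T09:15:00", "NET_CONNECT"),
    ("host-001", "2015-08-02T02:44:00", "PROC_START"),
]
for host, counts in to_time_series(events).items():
    print(host, dict(counts))
```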
To understand how it works, it’s helpful to recognize the different types of machine
learning: deterministic and probabilistic (which itself has unsupervised, semi-supervised, and fully supervised aspects).
Deterministic machine learning
ISRM recommends using deterministic machine learning as a starting point. Start
with small data sets, identify known bad behavior, and provide this information to
the data scientists to build the initial models.
Whatever behavior has been identified, the deterministic portion of machine learning looks for a deviation from a baseline. The baseline is called normal, and a deviation that exceeds a certain level of tolerance is called anomalous behavior.
To take a simplified, real-world example of deterministic machine learning, if your
data set consists of the annual incomes of a group of people dining at a restaurant,
the baseline would be the mean annual income of those people. If a billionaire
were to walk into the restaurant, deterministic machine learning could then identify
that individual as an extreme outlier from the norm, with a huge deviation from the
established baseline.
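The restaurant example can be expressed directly. This sketch flags any income that deviates from the mean baseline by more than a chosen tolerance; the figures and the two-standard-deviation tolerance are illustrative.

```python
import statistics

# The restaurant scenario: diners' incomes, plus a billionaire walking in.
# Figures are invented; with so few points the outlier inflates the standard
# deviation, so a tolerance of two deviations is used for the illustration.
incomes = [52_000, 61_000, 48_000, 75_000, 58_000, 1_000_000_000]

baseline = statistics.mean(incomes)   # the "normal" baseline
spread = statistics.pstdev(incomes)
TOLERANCE = 2                         # deviations beyond this are anomalous

outliers = [x for x in incomes if abs(x - baseline) > TOLERANCE * spread]
print(outliers)  # -> [1000000000], an extreme deviation from the baseline
```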
Probabilistic machine learning
ISRM then uses probabilistic machine learning to find patterns in the data that may have gone undetected by the blunter deterministic technique. Probabilistic learning uses the principle of clustering like things together to predict patterns and outcomes.
“Malware is much rarer than not-malware,” Berger explains. “The clusters with
malware tend to be much less populous than those with just the normal software.
As a result, we solve many problems, including the problem of multiple false
positives.”
This phase of probabilistic machine learning is unsupervised, meaning that the
algorithm runs without human intervention and produces a map of sorts. The map
shows populous clusters, less populous clusters, and perhaps a few outliers or very
sparsely populated clusters.
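A hedged sketch of the unsupervised phase, using scikit-learn's k-means on synthetic two-dimensional behavior features (real features would come from the time-series data described earlier): the sparsely populated cluster is the one handed to a human expert.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic behavior features per process (e.g., rates of process starts
# and outbound connections); invented for this illustration.
rng = np.random.default_rng(0)
normal = rng.normal(loc=[5.0, 3.0], scale=0.5, size=(500, 2))  # populous cluster
suspect = rng.normal(loc=[12.0, 9.0], scale=0.3, size=(4, 2))  # sparse cluster
X = np.vstack([normal, suspect])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# "Malware is much rarer than not-malware": the sparsely populated cluster
# is the one escalated for human review.
sizes = np.bincount(labels)
sparse_cluster = int(np.argmin(sizes))
print(f"cluster sizes: {sizes}, review cluster {sparse_cluster}")
```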
At this point, a human expert is called in to look at the clusters and decide if a
cluster is a normal process or malware. This is called the semi-supervised phase of
machine learning.
Ultimately, Berger reports, ISRM will use a fully supervised phase of machine
learning, in which it is possible to predict by behavior whether a particular software
is malware. Transactions and communications are scored for the probability of
being anomalous. This phase can be deployed in real time, and if the machine-learning model is well tuned, it can result in 90 percent accuracy, according to
Berger.
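As one illustration of the fully supervised phase, the sketch below trains a logistic-regression model on synthetic labeled features and scores new observations with a probability of being anomalous; the data and model choice are assumptions, not ISRM's actual pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for features labeled during the semi-supervised phase.
rng = np.random.default_rng(1)
X_benign = rng.normal([5.0, 3.0], 0.5, size=(500, 2))
X_malware = rng.normal([12.0, 9.0], 0.5, size=(25, 2))
X = np.vstack([X_benign, X_malware])
y = np.array([0] * 500 + [1] * 25)

# class_weight="balanced" compensates for malware being much rarer.
model = LogisticRegression(class_weight="balanced").fit(X, y)

# Score fresh observations for the probability of being anomalous.
new_events = np.array([[5.1, 2.9], [11.8, 9.2]])
print(model.predict_proba(new_events)[:, 1])
```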
Closing the loop: reporting findings of malware
At the end of this process, if ISRM finds malware within the Microsoft corporate
network, they submit the sample of the malware to OneProtect, the anti-malware
product at Microsoft. Once the signature shows up in OneProtect, the anti-malware
product automatically starts cleaning this malware across the world. And most
importantly, ISRM is able to learn the pattern of behavior of this malware so that it
can detect similar malware in the future.
Oversharing detection: new steps to protect intellectual property
Another group in ISRM is using machine learning on behalf of corporate security
efforts—but in this case, the goal isn’t to detect malware, it’s to protect the
intellectual property of the company.
Olav Opedal, Principal IT Enterprise Architect, conducted an analysis of attitudes
and behaviors across Microsoft, specifically in the context of using a cloud-based
tool such as SharePoint Online. His study found that 23 percent of employees are
overly altruistic and will share highly sensitive information with too many people.
Oversharing is seldom malicious in nature; usually, the sharer does not understand
the impact of their sharing decision. Some of the reasons for this oversharing are:
• The sharer might enjoy the feeling of importance associated with having sought-after, “insider” information.
• The sharer may not understand the sensitive nature of the information or the privacy level of the forum in which the sharing takes place.
In some rare cases, the sharer may be divulging information with malicious intent.
Whatever the motivation of the sharers, Opedal and his team realized they needed
a way to address the potential for the leak of intellectual property that could occur
through this social phenomenon of oversharing.
How the process works
The process they conceived starts with keywords. For example, during the development of Cortana, it was useful to know which documents on SharePoint referred to the keyword “Cortana.”
They found that so many documents contained this keyword that it would have
been a Herculean task for one human, or even several humans, to sift through all
the documents, weeding out the sensitive references from the harmless.
So instead, Opedal worked with the tools team in ISRM, as well as Berger, to build a process that scans the fast index of both SharePoint Online and on-premises SharePoint. Then they download the documents that have a probability of being sensitive. Opedal described this as essentially turning a big data problem into a small data one.
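A minimal sketch of that reduction, with an invented corpus and document IDs: filter the indexed documents down to those mentioning the keyword, so only candidates with some probability of being sensitive are downloaded for closer scoring.

```python
# Keyword and corpus are illustrative; the real process scanned the
# SharePoint index rather than in-memory strings.
KEYWORD = "cortana"

def probably_sensitive(documents):
    """Yield only document IDs worth downloading for closer scoring."""
    for doc_id, text in documents:
        if KEYWORD in text.lower():
            yield doc_id

corpus = [
    ("doc-001", "Lunch menu for building 25"),
    ("doc-002", "Cortana feature roadmap - confidential"),
]
print(list(probably_sensitive(corpus)))  # -> ['doc-002']
```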
Individuals then rate whether each document in a random sample is sensitive or not, and machine-learning models are trained on that labeled sample.
“We combine behavioral sciences with methods and ideas from mathematics to
create this into a classification problem,” Opedal explained. “It is a classification
problem in the sense that the documents fall into one of two categories: either
overshared intellectual property or not. Then using a random sample of documents
classified as falling into one or the other of these categories, we build a model that
can score this with a level of probability.”
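One way to realize that classification problem, sketched with scikit-learn; the TF-IDF-plus-logistic-regression pipeline and the tiny training sample are illustrative stand-ins for the human-rated random sample.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Placeholder for the human-rated random sample of documents.
train_docs = [
    "cortana feature roadmap confidential do not distribute",
    "unreleased product specifications internal only",
    "team lunch menu for friday",
    "building shuttle schedule",
]
train_labels = [1, 1, 0, 0]  # 1 = overshared intellectual property

# Two-category classifier: overshared intellectual property or not.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(train_docs, train_labels)

# Score a new document with a probability, as the quote describes.
new_doc = ["draft cortana launch plan shared with everyone"]
print(model.predict_proba(new_doc)[0][1])
```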
The challenge of changing the behavior of oversharing
Opedal believes that the key to reducing this risky behavior of oversharing isn’t
simply to shut off the ability of employees to share information externally. It’s more
effective to create a signal to users when inappropriate sharing occurs, thereby
training them to avoid it.
For example, with OneDrive for Business, many users would select the “share with
everyone” folder for highly sensitive information.
When Microsoft attempted to solve the problem by denying external sharing on
OneDrive for Business and SharePoint Online, they found that in many cases, users
simply switched over to Dropbox. The number of Dropbox users grew to 12,000
during this period.
A new, streamlined process was implemented, which promoted IT-supported
solutions that met required policies and standards. When external sharing was
again permitted on OneDrive and SharePoint, the risk of users using consumer
solutions (such as Dropbox) decreased.
“After monitoring the situation and studying the phenomenon,” Opedal said, “we
decided to deploy an internally built tool focused on machine learning scanning,
and a tool called AutoSites, which signals users when inappropriate sharing occurs.
We have found that this reduces oversharing.”
AutoSites was built by the Discovery and Collaboration team in IT, based on the
requirements Opedal provided to them.
With the success of the oversharing detection program overall, it is now being upgraded to include the capability to identify documents and files on SharePoint Online that contain passwords.
For more information
For more information about Microsoft products or services, call the Microsoft Sales
Information Center at (800) 426-9400. In Canada, call the Microsoft Canada Order
Centre at (800) 933-4750. Outside the 50 United States and Canada, please contact
your local Microsoft subsidiary. To access information via the web, go to:
http://www.microsoft.com
http://www.microsoft.com/ITShowcase
© 2015 Microsoft Corporation. All rights reserved. Microsoft and Windows are either registered
trademarks or trademarks of Microsoft Corporation in the United States and/or other countries.
The names of actual companies and products mentioned herein may be the trademarks of their
respective owners. This document is for informational purposes only. MICROSOFT MAKES NO
WARRANTIES, EXPRESS OR IMPLIED, IN THIS SUMMARY.