
EXTERNAL MONITORING FOR INTERNAL DATA BREACHES
by
GREGORY WILLIAMS
B.S. Colorado Technical University, 2005
A thesis submitted to the Graduate Faculty of the
University of Colorado Colorado Springs
in partial fulfillment of the
requirements for the degree of
Master of Engineering in Information Assurance
Department of Computer Science
2014
© Copyright By Gregory Williams 2014
All Rights Reserved
This thesis for Master of Engineering degree by
GREGORY WILLIAMS
has been approved for the
Department of Computer Science
by
_______________________________________
Dr. C. Edward Chow, Chair
_______________________________________
Dr. Chuan Yue
_______________________________________
Dr. Xiaobo Zhou
_________________________
Date
Williams, Gregory (M.E., Information Assurance)
External Monitoring for Internal Data Breaches
Thesis directed by Professor C. Edward Chow
Data breaches and malware are commonplace these days. Even some of the largest organizations have fallen victim to hackers or to inadvertent data exposure. Hackers have exposed details of their activities across the internet on sites like Pastebin.com, twitter.com, and others. What happens when
a user on a compromised site uses the same credentials as the organization you
are hired to protect? Or what if a piece of stealthy malware doesn’t trip an
intrusion detection system signature?
Traditionally internal information security relies on analyzing internal logs
and events from desktops and servers to determine whether or not a malicious
security event happened. Credentials from a user in an organization can be stolen via phishing or malware, which typically can be detected easily. However, what if the organization doesn't know that a user reused their organizational username and password on another website? What if that website were compromised? What if a piece of spam-sending malware were so stealthy that the internal organization didn't know about it? How can an organization identify such a security event if it doesn't know its data may have been compromised, because the internal logs say everything is fine? There is a need for internal notification of these externally visible security incidents. The SAFES app helps
to import and analyze external threat information internally so that organizations are able to react to an external or internal security event more quickly by leveraging external organizations' information about them.
ACKNOWLEDGEMENTS
I would like to thank my family. First my wife, who continually pushes me to excel in what I love to do and who sacrificed her time to watch me pursue my dreams. My daughters, Annie, Emily, and Adalynn, who sacrificed their time with me so I could pursue this. Dr. Chow, who continually asks the questions that make you think. And finally Jerry Wilson, who recognized that I had a passion for this and lets me do what I love every day: practice information security.
TABLE OF CONTENTS
Chapter 1: Introduction
    How we got here – Threat landscape
        Hackers
        Hacktivism
        Malware and Botnets
        Users
    Threat Management
        Attack Methods and Data Types
        Detection Methods
        Monitoring and Analysis Tools
        Framework of Log Management
        SIEM
Chapter 2: The Problem of Data Breaches, Detection, and Response
Chapter 3: Splunk
    Splunk as a Log Management Tool
    Splunk as an Application Development Platform
    How Requirements are Met
Chapter 4: Design of SAFES
    Requirements
    Logging Sources
        Windows 2012 domain controller security logs
        Intrusion detection system logs
        Microsoft Smart Network Data Services
        Pastebin.com Alerts
        Shadow Server Foundation Events
        Google Alerts
    Data Schemas
        Domain schema
        IP address schema
        Username schema
        Password schema
    Confidence Levels
Chapter 5: Implementation of SAFES
    Splunk Installation
    Third Party apps
        Splunk for IMAP
        RSS Scripted Input
    Other Inputs
    Installation of the SAFES App
    GUI
    Confidence Engine and Alerting
Chapter 6: Experiments
    Simulated Botnet Activity
    Simulated External Data Breach
    Simulated External Spam Detection
Conclusions and Future Work
    Future Work
Bibliography
Appendix A: Installing SAFES from Start to Finish
TABLES
Table 1. Information Collected from the torpig botnet
FIGURES
Figure 1. Sample Google Alert
Figure 2. Pastebin.com Alert Setup
Figure 3. Google Alerts Setup
Figure 4. SAFES Overview Dashboard
CHAPTER 1
INTRODUCTION
Data breaches occur every day. They can impact individuals and
organizations. They can affect individuals by exposing information that shouldn’t
be known to the public or attackers. They can affect organizations as well.
Malware installed on a system could allow attackers to steal users’ credentials
from inside the organization. Malware can also start sending out spam
messages or allow the compromised computer to become part of a larger botnet.
Hackers can attack an organization's assets and expose sensitive data. Oftentimes data breaches are caught by an intrusion detection system (IDS) or by an individual who has been notified that their account has been compromised. However, what if internal detection methods fail? What if an organization doesn't know it has been compromised because its tools cannot detect it? What if another organization has a data breach and a user's credentials are reused on the compromised organization's systems? The third party organization's data could be at risk of exposure due to valid credentials leaked from the compromised system.
Consider if an organization could utilize the third party systems across the
internet as a sort of intrusion detection system that processes information
collected and reports on possible data breaches as they are happening or in
some cases before they happen. The third party organization’s information could
be utilized to alert an internal organization about a possible data breach. The
information collected is out there but disparate. This thesis will propose and demonstrate a system that can collect an internal organization's data held by third parties and automatically analyze, correlate, and alert on that third-party data so that potential data breaches that go undetected by the internal organization but are detected by external organizations will be known to the internal organization.
How we got here – Threat Landscape
Data breaches, data loss, and cybercrime happen frequently and they
have been increasing for years. Everything that is connected has the potential of
being breached (Baun, 2012). Hackers, hacktivists, malware and the users
themselves all contribute to the problem of data breaches.
Hackers
There are many organizations that track data breaches including the
Identity Theft Resource Center (ITRC), the Privacy Rights Clearinghouse, and a
number of private firms. Research on data breaches was conducted by Garrison
and Ncube in 2010. The research looked at data breaches that occurred
between 2005 and 2009, specifically breaches for which the actual number of records breached was known. Based upon their research, data breaches can be broken up into five distinct categories: stolen, hacker, insider, exposed and missing (Garrison & Ncube, 2011). The exposed and hacker categories are
directly related to the problem that we are most concerned with. The exposed
category covers unprotected data that can be found in different mediums. These
mediums can include disks, files, hard drives, servers and desktop computers.
The data on those mediums contained personal information such as social
security numbers, customer records, parents, children, etc. (Garrison & Ncube,
2011). The hacker category covers unauthorized access to a computer system
or server. The data also revealed six specific types of organizations: business,
education, federal/military, financial, local/state government and media. Their
analysis looked to see if there was any specific data leading them to understand
if certain categories of breaches or organizations had more breaches than
others. What they found was interesting in both categories.
Exposed and hacker categories for data breaches covered 466 data
breaches with over 2.5 million records breached, which are 49.21% of the
incidents and 75.91% of the total number of records breached respectively
(Garrison & Ncube, 2011). The exposed category could be reduced significantly
if it were not for careless employees or employers (Garrison & Ncube, 2011).
During the five years of the study, it is also noted that 48.43% of the data breach records were from hackers compromising a system (Garrison & Ncube, 2011). Additionally, breaches in the exposed category totaled 28% of incidents and 26.58% of the total number of records compromised. Nearly 75% of the
records that were breached came from the exposed and hacked categories.
Keep in mind that these numbers were from 2009. Their research did not touch on the new trend in hacking that started in 2010: hacktivism.
Hacktivism
Hacktivism is more or less a combination of hacking and activism
(Hampson, 2012). Hacking is typically done for self-interest, whereas hacktivism
is done for social or political goals. The information the hacktivists obtain is
typically shared out to the public (Mansfield-Devine, 2011). Hacktivists desire
publicity. Typically they will claim that it is for the greater good, and to promote
security awareness (Mansfield-Devine, 2011). More so, it's about making a
public statement.
The group Anonymous has a long history with activism. It was only recently that they started to use their skills for hacking. The first public demonstrations by Anonymous were protests against the Church of Scientology. The next major event was Operation Payback against the music industry for its pursuit of filesharers. Other operations ensued, including hacking (Mansfield-Devine, 2011). One of the larger operations to expose data in recent years was
the Sony data breach. The breach affected 75 million accounts. Information that
was compromised included name, address, country, email address, birthdate,
logins (usernames and passwords) and other data that may have been obtained
during the compromise including credit card information. The data was
apparently leaked by Anonymous (Fisher, 2013). A partial database dump was
posted onto Pastebin. What's more concerning is that Sony did not encrypt its customers' information. Usernames and passwords were in plain text (Fisher,
2013).
LulzSec, in my opinion, was far more damaging to the public. LulzSec
protested only for the "lulz" - the pure joy of mayhem (Mansfield-Devine, 2011). In LulzSec's 50 days of hacking and hacktivism stunts, organizations like PBS, Sony, the CIA, and the Serious Organised Crime Agency (SOCA) were compromised
or taken down by denial of service attacks. LulzSec also admitted to
compromising the security firm HBGary (Mansfield-Devine, 2011). LulzSec's
campaign "Antisec" was geared toward the public awareness of security
weaknesses. Public awareness came through the exposure of personal
information on many individuals during those 50 days of “Antisec”. This included
information such as email addresses, passwords, usernames, social security
numbers, sensitive emails, etc. (Mansfield-Devine, 2011).
Organizations that are targeted by hacktivists are not the only victims of
an attack. The information that is leaked by hacktivists, such as usernames,
passwords, email addresses, even physical addresses puts everyone at risk.
The attacks on Arizona law enforcement put police officers at risk (Poulsen, 2011), since the leaked information could have been used to pursue revenge. LulzSec actually encouraged its followers to log into the personal accounts of the victims whose data it leaked and embarrass them (Mansfield-Devine, 2011), embarrassment made possible because the exposed data sometimes came from not-so-upstanding websites. Ultimately, hacktivists believe that users will just change their passwords; however, this assumes that the victims know that their personal information has been compromised (Mansfield-Devine, 2011).
The information that is leaked because of hacktivism may be used to log
into other organizations. This is worrisome, especially if a user used the same password on multiple sites or, worse, for an organization's website. Leaked
information may be easily obtained from other sites and should be monitored by
an organization so that organizational information is not compromised if
organizational credentials are compromised.
Hacktivists seek to expose information to get the attention of a business.
They do this either to promote a cause or to point out how weak the business's security is. In 2011, there were 419 incidents of reported data breaches, involving 23 million records (Fisher, 2013). Hacking is not only a way to steal information, funds, or make fraudulent transactions; it is also a way to get a business's attention. However, from a consumer's standpoint, it doesn't matter whether the data was exposed by a hacker or a hacktivist; the information was exposed. Once the information has been exposed, the
fear of identity theft is left with the consumer. Consumers’ information can also
be exposed by malware.
Malware and Botnets
The number of hidden and unidentified malware infections causes a degree of uncertainty and concern when it comes to the protection of sensitive information (Sherstobitoff, 2008). Banking trojans, for example, are
enabling the rise of financial and economic fraud (Sherstobitoff, 2008). Online
fraud and phishing campaigns are also on the rise. Electronic records can be
hacked or spied upon through malware (Kapoor & Nazareth, 2013). Information
shared on Pastebin.com that could be considered sensitive data can include lists
of compromised accounts, database dumps, lists of compromised hosts with
backdoors, stealer malware dumps and lists of premium accounts (Matic, Fattori,
Bruschi & Cavallaro, 2012). Passwords have become compromised because of
malware on workstations and on network equipment (Ives, Walsh & Schneider,
2004).
One category of malware that is of particular interest is botnets. Botnets
are a means for cyber criminals to carry out malicious tasks, send spam email,
steal personal information and launch denial of service attacks. Researchers at
the University of California Santa Barbara were able to take over the torpig
botnet for 10 days several years ago. Data collected from the botnet was
astounding. During the course of the botnet takeover 1.2 million IP addresses
were seen communicating with the command and control servers (Stone-Gross
et al., 2009). There were 180 thousand infected machines with 70 GB of
information also collected during this time (Stone-Gross et al., 2009). Most of
that information was of a personal nature. Due to the way the torpig botnet was
set up, information was able to be gathered from a variety of user installed
applications, such as email clients, FTP clients, browsers, and system programs
(Stone-Gross et al., 2009). Data that was sent through these applications was able to be seen, encrypted and uploaded to the attackers every twenty minutes. However, the encryption algorithm was broken in late 2008, which allowed researchers to analyze the botnet (Jackson, 2008).
The torpig botnet also used phishing attacks on its victims to collect
information that was not able to be gathered from passive data collection.
Phishing attacks are very difficult to detect, especially when the botnet has taken over a computer, because a suspicious website can be made to look legitimate (Stone-Gross et al., 2009).
The botnet communicated information back to the command and control
server over an HTTP POST request. During the time that the botnet was
observed the following personal data was collected:
Type                  Amount
Mailbox account       54,090
Email                 1,258,862
Form data             11,966,532
HTTP account          411,039
FTP account           12,307
POP account           415,206
SMTP account          100,472
Windows password      1,235,122

Table 1. Information Collected from the torpig botnet
Another aspect of the torpig botnet was evidence that different operators used the botnet for different tasks, meaning that the botnet was offered as a service for a fee. The torpig botnet also stole financial information, including 8,310 accounts at 410 different financial institutions.
Researchers also looked at passwords and provided analysis from what
they saw. The botnet saw nearly 300 thousand unique usernames and passwords sent by 52,540 different infected machines (Stone-Gross et al., 2009). Of the 173 thousand unique passwords collected, 56 thousand were recovered in about 65 minutes using permutation, substitution and other simple replacement rules by a password cracker. Another 14 thousand were recovered in the next 10 minutes when the cracker was given a wordlist. This means that more than 40% of the passwords were recovered in under 75 minutes (Stone-Gross et al., 2009).
This information is astounding. Password reuse and weak passwords are among the most troubling scenarios for administrators and security personnel inside an
organization. If passwords are reused across different websites or organizations,
those credentials can possibly get attackers into other systems in other
organizations.
Users
Computer users are also part of how we got here. Users have credentials,
and most of the time users are trusted. Users can represent two different types of threats: internal threats and external threats. The external threat is located outside the organization, while the insider threat is located inside the organization. External threats may be actual attackers trying to gain access to systems or user credentials remotely. Insider threats may be users within the organization who may steal data. A third kind of threat proposed by researchers is called the external-insider threat, where the threat originates from an outsider to the organization who has user credentials that can place internal systems at risk (Franqueira, van Cleeff, van Eck, and Wieringa, 2010). External insiders add unique challenges to security because they are trusted, however they may have a more lax security posture.
Insider attacks tend to be more harmful than outsider attacks (Franqueira et al., 2010). Other research has shown that the number of records breached was higher with insider threats than with external threats. Value-webs, as defined in their research, are cross-organizational cooperative networks that consist of an internal organization having some operational relationship with an external organization (Franqueira et al., 2010). This could be, for example, an HVAC company monitoring heating and cooling information for a large organization's data center. This is the same scenario that maps to the Target
corporation data breach in 2013 (Krebs, 2014).
Insiders are individuals that are trusted and have some authorized access over the organization's assets. They have legitimate privileges which may be
required to perform certain sensitive and authorized tasks. This can represent a
problem if authorization or privilege goes unchecked. It can allow insiders to
acquire information they wouldn't normally have access to, causing increased
risk to the organization.
External insiders are individuals that are not trusted, but have some authorized access to the organization's assets. External insiders need to have access granted to them to fulfill the value-web contract they have with the internal organization. This presents a risk if the privileges allow specialized
access.
Insider threats can also be classified into specific kinds of actors.
Masqueraders are individuals that steal the identity of a legitimate user
(Franqueira et al., 2010). Misfeasors are legitimate users that are authorized to
use systems to access information but misuse their privilege. Clandestine users
are individuals who evade access control and audit mechanisms and aren't
identified until they fit the other two classifications.
There are challenges, however, in identifying misuse. An organization
may not log enough details to have information to tie back events to a specific
user. In fact, only 19% of analyzed organizations that had data breaches in 2008
had a unique ID tying events back to a user (Franqueira et al., 2010). In the other 81% of cases, shared system access was used; therefore, anyone who knew the credentials for the account that caused the data breach could have been the
cause of the breach (Franqueira et al., 2010). This may be from user password
reuse.
Password reuse is a major problem. Microsoft conducted one of the largest studies of password habits ever conducted between July and October 2006 (Florencio & Herley, 2008). During that time, when a user downloaded the Windows Live Toolbar, a portion of users were given the choice to opt in to a component that measured password habits. The component measured or estimated the number of passwords in use by users, the average number of accounts requiring passwords, how many times users entered passwords per day, how often passwords were shared between sites and how frequently passwords were forgotten.
There are numerous kinds of websites that we must remember passwords
for. Passwords are so important that when entered into a web form, they are
masked. SSL is also a key component of securing the password so that it cannot
be seen by observers on the network (Florencio & Herley, 2008). Given the burden of remembering passwords, one might think password managers would be in greater use.
For most users a small set of passwords is maintained in their memory.
For example, if a user has 30 different accounts, only 5 to 6 passwords are remembered for those 30 accounts, not 30 different passwords (Florencio & Herley, 2008). Passwords are typically managed by writing them on paper, by memory, by trial and error, and by resetting them, not by password managers.
Phishing has also increased in past years. Keyloggers and malware are also on
the rise. These allow attackers to easily steal strong passwords as well as weak
ones.
The key component of the Windows Live Toolbar used in the study was a module that monitored and recorded Password Re-use Events (PREs). This module contained an HTML password locator that would scan the document in search of the HTML code input type="password" and extract the HTML value field associated with that input. If a password was found, it was hashed and added to the Protected Password List (PPL) (Florencio & Herley, 2008). Another component was the Realtime Password Locator (RPL). The RPL maintained a 16-character FIFO buffer that stored the last 16 keys typed while the browser was in focus. This allowed the researchers to determine that if a series of characters entered matched an existing hash, a password re-use event had occurred. The URL was also matched so that duplicate visits to the same site did not produce a Password Reuse Event (PRE). PREs were only sent to the researchers' servers if an actual PRE occurred. Unique passwords that were only used once didn't produce a PRE.
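To make the mechanism concrete, the following minimal Python sketch illustrates the general idea of suffix hashing against a protected password list, as described above. It is an illustration only, not the actual toolbar component: the function names, the SHA-256 hash, and the reporting logic are assumptions introduced for this example.

import hashlib
from collections import deque

# Hashes of passwords previously captured from password form fields; this
# plays the role of the Protected Password List described above.
protected_password_hashes = set()
# Pairs of (password hash, URL) already reported, so that repeat visits to
# the same site do not produce duplicate reports.
reported_pairs = set()
# FIFO buffer of the last 16 keys typed while the browser is in focus.
recent_keys = deque(maxlen=16)

def record_password(password):
    """Hash a value found in an input type="password" field and remember it."""
    protected_password_hashes.add(hashlib.sha256(password.encode()).hexdigest())

def on_keystroke(key, current_url):
    """Check every suffix of the keystroke buffer against the protected list."""
    recent_keys.append(key)
    buffer = "".join(recent_keys)
    for start in range(len(buffer)):
        digest = hashlib.sha256(buffer[start:].encode()).hexdigest()
        if digest in protected_password_hashes:
            pair = (digest, current_url)
            if pair not in reported_pairs:
                reported_pairs.add(pair)
                print("Possible password re-use event at", current_url)

# Example: a password recorded on one site later typed on another site.
record_password("hunter2")
for ch in "hunter2":
    on_keystroke(ch, "http://other-site.example")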
The research also showed what sites had the most PREs. The top five
were Live.com, google.com, yahoo.com, myspace.com, and passport.net.
What's more concerning is the fact that 101 PRE reports were from known
phishing sites (Florencio & Herley, 2008). This means that a user would have
been successfully phished, allowing the phisher to compromise not only the
account of whatever site they might have been spoofing or phishing, but the rest
of the sites for which the user used the same password. Users choose weak passwords. The average user has 6.5 passwords, each of which is shared across about 4 websites (Florencio & Herley, 2008). Users have roughly 25 accounts across the internet that take a password, and a user types around 8 passwords a day. The average number of sites that share the same password is 5.67. Weak passwords are shared across more sites on average (6.0) than strong passwords (4.48). During the study the average client used 7 passwords that were shared, and 5 of them had been re-used within 3 days of installing the toolbar (Florencio & Herley, 2008). Password reuse can lead to
direct compromise of other websites if an attacker leaks usernames/passwords
for a compromised site that the user belongs to. The concern is not so much protecting your internal users from being compromised by one of your servers; it is the fact that users' password habits allow an internal attack to potentially go
undetected because an attacker uses legitimate credentials from an external
resource. If a password is reused on one account, a hacker may be able to pivot
to another system or account and use that same password for that person's
access. This happens in a corporate environment. As e-commerce becomes
more mainstream, accounts belonging to sites that become compromised might
be used to compromise other systems. For example, a user might have
accounts at Bank of America, AOL and Amazon.com.
Users are poorly equipped to deal with today's need for multiple passwords for multiple accounts. Users do not realize that reusing the password for a high security website lowers its security to that of the low security website where it is reused (Ives et al., 2004). Password reuse can allow attackers to gain access to other websites. Users also tend to use short passwords containing only letters that generally tend to be personal in nature. They typically do not update their passwords. One theory behind this is the fact that humans have cognitive limitations which prevent a user from remembering a more complex
password over time (Zhang & McDowell, 2009). Because of the cognitive
limitations users are often less than optimal decision makers when it comes to
reasonable thought about risk, especially about passwords. When presented
with password creation, users tend to favor quick decisions to save cognitive
thought (Zhang & McDowell, 2009).
Passwords provide protection to a company intranet, bank accounts, email
accounts, etc. Breach of a password can lead to personal data loss. Therefore
the study by Zhang and McDowell (2009) about password perceptions suggests
that users tend to pick stronger passwords based on perceived risk. However, the effort and time it takes to create strong passwords and update them are negatively associated with users' password protection intentions. Even though users gain more accounts over time,
that doesn't necessarily mean that a user will use a new password for each
account. If the password for a company network were compromised, this could lead to data loss, specifically of confidential and sensitive data. Multiple passwords in use across multiple sites require more cognitive effort, and therefore users reuse passwords out of aversion to coming up with multiple unique
passwords (Zhang & McDowell, 2009). The Protection Motivation Theory (PMT)
suggests that the perceived severity of losing the data that is protected by the password is not related to password protection intentions. Users who perceive a severe
consequence for losing the data to a data breach or compromised password do
not necessarily intend to take more effort to protect it.
Zhang and McDowell (2009) also suggest that users do not choose strong
passwords due to the added response cost of the password. Users typically use
passwords for the main tasks in their jobs, and using a password to access data adds to their cognitive load. When a new password is requested, users do not want to add to that load, so they choose passwords that are familiar and easy to remember; thus, password reuse may be commonplace.
Users are the weakest link in password control due to our reuse of
passwords. Password policies that prohibit the reuse of passwords are often
abused. Security on a system can be compromised even if an attacker knows a single password. An attacker that knows a little information can impersonate a user and gain access to secured information. Social engineering, shoulder surfing, dumpster diving, and phishing are all ways that a user can be targeted to obtain a password or other information.
Password reuse is almost commonplace for most users since users have
multiple accounts. The more accounts we have, the more passwords will be
reused. A user’s password reuse will grow as a user creates more accounts.
Ease of remembering was the primary reason why users reused passwords, and protecting private information was the primary reason for not reusing passwords. Users typically base the complexity of their passwords on the data that they are trying to protect.
A study conducted by Notoatmodjo and Thomborson (2009) found a very strong correlation between the number of accounts and password reuse. The R value for the correlation was .799 and the adjusted R value was .790, which is significant (Notoatmodjo & Thomborson, 2009). When asked why they reused passwords, 35% of users indicated that the password was easy to remember (Notoatmodjo & Thomborson, 2009). Only 19% said the
reason for password reuse is that the site does not contain valuable data
(Notoatmodjo & Thomborson, 2009).
Based on user behavior, 11 out of the 24 participants reused a password
even for high importance accounts (Notoatmodjo & Thomborson, 2009). 23 out
of 24 reused passwords for low importance accounts.
Even though perceptions about password security severity increase with the amount of sensitive data stored on a website, that doesn't mean an organization can forgo protecting its information by monitoring external data sources.
Threat Management
Hackers, hacktivists, malware and botnets, and most importantly users are
the actors in how data is breached. In order to start identifying how we can
mitigate attacks and identify potential areas of remediation, there has to be an
understanding of what mechanisms are allowing data breaches to take place.
Attack Methods and Data Types
There are many reports on how data is breached. However, most of the
information out there is from self-reported incidents, not from the actual
companies reporting the data. 7Safe collected and analyzed the information from
forensic investigations that they conducted. The data was collected over a period
of 18 months. During those 18 months, they analyzed 62 cases of data breaches
(Maple & Phillips, 2010). Investigating the data breaches and analyzing the data
should help in identifying how future attacks can be prevented.
The data breaches investigated came from many different sectors: business, financial, sports and retail. Retailers store a lot of information on their customers, including credit card information. From the year 2000 to 2008, card-not-present fraud, which is common in e-commerce, increased 350%, whereas online shopping increased by 1,077% (Maple & Phillips, 2010). 69% of the
organizations that were breached were retail; the next highest was financial at
7%. 85% of the data compromised was credit card information, followed by
sensitive company data, non-payment card information, and intellectual property
(Maple & Phillips, 2010). 80% of the data breaches were external, 2% internal
and 18% were business partners (Maple & Phillips, 2010). 86% of the attacks
involved a web-server that was customer facing (Maple & Phillips, 2010). 62% of
the attacks were of average complexity, 11% were simple attacks and 27% were
sophisticated attacks that required advanced skill and knowledge of
programming and operating systems. Sophisticated attacks typically happen
over a long period of time. SQL injection made up 40% of the breaches and
another 20% were from SQL injection combined with malware. 10% were strictly
malware and 30% were poor server configuration or authentication. Significance
is highly placed on SQL injection which accounts for many of the data breaches
that occur. Since many organizations have databases that contain sensitive or
vast amounts of information, SQL injection accounts for large data breaches
(Mansfield-Devine, 2011; Maple & Phillips, 2010; Weir, Aggarwal, Collins, &
Stern, 2010). Types of data that are breached are also worrisome.
There are several different internet data collection domains which contain
specific personal information:
Healthcare: Healthcare information has been able to help users interact with their
providers. However, much of the information on the healthcare websites is
private and sensitive information. Attackers could use this information to steal
someone’s identity or, worse, exploit one’s medical weaknesses (Aïmeur & Lafond,
2013; Kapoor & Nazareth, 2013).
E-Commerce: Browsing habits, sites visited, products looked at are often
information that is collected on a user. Attackers could use this information to
exploit a user’s interest (Aïmeur & Lafond, 2013).
E-learning: Students typically share information and that information is accessible
to other students.
Information is stored on the above domains. Information is also collected
by not only looking at what accounts users create and the information contained
within those accounts, but also by looking at a user’s habits on the internet.
Social media, online data brokers, search engines and geolocation data all contain data that can be collected, analyzed and parsed, linking users to what's important to them (Aïmeur & Lafond, 2013). This is a major concern if data from these sources is leaked. Surveillance, interrogation, aggregation, identification, insecurity, secondary use, exclusion, breach of confidentiality, disclosure, exposure, increased accessibility, blackmail, appropriation and distortion are all concerns if data were to be leaked (Aïmeur & Lafond, 2013).
Users give away a lot of information about themselves: Identifying
information such as name, age, gender, address, phone number, mother's
maiden name, SSN, income, occupation, etc. (Aïmeur & Lafond, 2013). Buying patterns, in which users give away such information as websites visited, assets, liabilities, and stores they regularly shop from. Navigation habits, including websites visited, frequency of the visits, and usernames used on forums. Lifestyle information such as hobbies, social network information, traveling behavior, vacation periods, etc. Sensitive information such as medical or criminal records.
Biological information such as blood group, genetic code, fingerprints, etc.
All of this information can be tied together with enough time. There are
often attacks on people's privacy. Hackers are always on single step behind the
new technology however. If there is a vulnerability hackers will exploit it. All is
not lost since detection methods can catch at least some of the attacks that
happen, some even before they happen.
Detection Methods
While there are numerous reports of data breaches happening every day,
an undetected breach cannot be reported (Curtin & Ayers, 2008). Detecting a
data breach can be difficult. Careful intruders hide or remove evidence of a
breach by altering information such as timestamps, deleting logs, or modifying
applications to avert detection (Casey, 2006). Even though a system or user may
become compromised there are areas in which detection can be accomplished.
Monitoring and analysis tools
There has been substantial development in computer and network
security design in the past few years. This is seen in the new protocols, new
encryption algorithms, new authentication methods, smarter firewalls, etc. (Hunt
& Slay, 2010). The security industry has also seen improvement in computer
forensic tools where the methods of searching for and detection of malicious
activity have become more sophisticated. Security systems have been designed
to detect and provide protection from malware such as viruses, worms, trojans,
spyware, botnets, rootkits, spam and denial of service attacks (Hunt & Slay, 2010). However, it is often difficult to effectively assess damage caused by
malware or system attacks based on the massive amount of logs collected by
these systems.
Traditionally computer forensics was performed by looking at the data on
storage devices. However in recent years there has been a shift in the way
computer and network forensics data is obtained. This is through the live
analysis of network traffic and system logs. Network forensics is concerned with
monitoring network traffic to see if there are any anomalies. An attacker may
have been able to cover their tracks, so traditional computer forensics does not
work as well as network forensics.
Security tools need to monitor and detect attacks and at the same time
forensic tools need to both soundly record traffic and security events while
providing real-time feedback. This is so that an attack can be observed and
monitored, recorded forensically so that it can be preserved for evidence,
tracked, so that a user understands the scope of the events, and limited in its damage so that it is not able to take down a network.
Successful investigation of a data breach relies heavily on logs of the data
breach. Information that is preserved can help an investigation succeed.
Successful intruders are counting on an organization not having forensic analysis and strict logging in place (Casey, 2006). As more
information is collected from systems across an organization, the value of the
evidentiary logs when dealing with a data breach increases. Security vendors
will design their products with forensic principles in place, but it is still up to the
organization to define what critical assets will be looked at for those key logs.
Tools exist out there to collect information into a single database that can be
queried for specific time periods, IP addresses, and other information.
Information can be correlated and normalized. Data that is correlated and
analyzed saves investigations valuable time by showing relevant information.
Automatic aggregation and categorization must take place in order to address
abnormalities more quickly (Bronevetsky, Laguna, de Supinski, & Bagchi, 2012).
System administrators are overloaded with individual messages saying there are problems. The sheer amount of information coming in is hard for anyone to manage; there have to be tools that do this kind of analysis and categorization automatically.
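As a simple illustration of what automatic categorization of this kind might look like, the Python sketch below collapses raw log lines into templates by masking variable fields (IP addresses and numbers) and counts how many messages fall into each template. The regular expressions and sample messages are assumptions made for this example, not part of any of the cited tools.

import re
from collections import Counter

def categorize(line):
    """Reduce a raw log line to a template by masking its variable fields."""
    line = re.sub(r"\b\d{1,3}(?:\.\d{1,3}){3}\b", "<ip>", line)  # IP addresses
    line = re.sub(r"\b\d+\b", "<num>", line)                     # other numbers
    return line.strip()

def aggregate(log_lines):
    """Count how many raw messages collapse into each template."""
    return Counter(categorize(line) for line in log_lines)

sample = [
    "Failed login for user bob from 10.0.0.5 port 51234",
    "Failed login for user bob from 10.0.0.9 port 40211",
    "Disk usage at 91 percent on /var",
]
for template, count in aggregate(sample).most_common():
    print(count, template)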
Framework of Log Management
In order to address the needs of multiple organizations, the National Institute of Standards and Technology (NIST) came up with the Guide to Computer Security Log Management (Kent & Souppaya, 2006). The Guide to Computer Security Log Management from NIST is a guide written and vetted by many organizations and key researchers. This publication and other special publications are meant to describe the highest goal to implement and maintain for the specific area each guide covers. The guide outlines what security logs should be collected and how they should be maintained.
There are many different log sources that provide security event information:
• Antimalware Software - can show what malware was detected, disinfection attempts, updates to the software, file quarantines, etc.
• Intrusion Detection and Intrusion Prevention Systems - can provide logs on suspicious behavior and detected attacks.
• Remote Access Software (VPN) - can provide logs on who logged in and when, who attempted to log in, and where the user logged in from.
• Web Proxies - can provide information on web activities.
• Vulnerability Management Software - can provide logs about what systems are vulnerable to specific exploits and how to fix the vulnerabilities.
• Authentication Servers - can provide user logs that detail what user was logging into what system at what time.
• Routers - can provide logs on specific blocked or allowed activity.
• Firewalls - can provide logs on specific blocked or allowed activity.
• Network Quarantine Servers - can provide logs as to an attempted system connection to an internal resource. This system would provide logs as to the security posture of an authorized system.
• Operating Systems - can provide system event logs and audit records that have detailed information about the system in them.
• Application Logs - other information that can be obtained from application logs includes client requests and server responses, account information, usage information and significant operational actions.
Often organizations are under specific compliance obligations for the use of
logs. These include the Federal Information Security Management Act (FISMA)
of 2002, Gramm-Leach-Bliley Act (GLBA), the Health Insurance Portability and
Accountability Act (HIPAA), Sarbanes-Oxley (SOX), and Payment Card Industry
Data Security Standard (PCI-DSS).
The number of systems in an organization complicates log management in a few different ways. The logs from many different sources can add up to an enormous amount of information. A single event as simple as logging in can cause a massive amount
of data on many different systems. Inconsistent log content is another
complication. Some logs contain information only pertaining to that specific
resource, such as time, resource, destination, source, MAC address, etc. Logs from various sources are also not normalized. Each system may return a timestamp in a different format, and there are flat files versus syslog data versus XML formats. All must be organized and normalized.
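The short Python sketch below illustrates one piece of that normalization problem: parsing timestamps written in a few different styles into a single ISO 8601 form so that events from different sources can be compared. The candidate formats and the assumed year for syslog-style timestamps are assumptions for the example, not an exhaustive treatment.

from datetime import datetime

# A few timestamp styles one might see across flat files, syslog, and XML
# exports; the exact formats are assumptions for illustration.
CANDIDATE_FORMATS = [
    "%Y-%m-%dT%H:%M:%S",      # 2014-03-01T13:45:02 (XML/ISO style)
    "%b %d %H:%M:%S",         # Mar  1 13:45:02     (classic syslog)
    "%m/%d/%Y %I:%M:%S %p",   # 03/01/2014 01:45:02 PM (Windows-style export)
]

def normalize_timestamp(raw, assumed_year=2014):
    """Return the timestamp in ISO 8601 form regardless of the input style."""
    for fmt in CANDIDATE_FORMATS:
        try:
            parsed = datetime.strptime(raw.strip(), fmt)
            if parsed.year == 1900:            # syslog omits the year
                parsed = parsed.replace(year=assumed_year)
            return parsed.isoformat()
        except ValueError:
            continue
    raise ValueError("Unrecognized timestamp format: " + repr(raw))

print(normalize_timestamp("Mar  1 13:45:02"))
print(normalize_timestamp("03/01/2014 01:45:02 PM"))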
Log data provides a massive amount of valuable information and can
sometimes contain restricted information. Log data must be protected because it
is confidential information and may end up allowing an attacker to gain access to
systems.
Log collection is not the only thing that an organization should be doing with
their data. An organization should also analyze the data that is collected from
the logs. Oftentimes system administrators are responsible for looking at the raw log data; however, they do not have the tools necessary to look through the data with ease. Analysis is also often treated as a reactive activity.
NIST recommends creating and maintaining a secure log management infrastructure from which all information is available for analysis (Kent & Souppaya, 2006). Log analysis must be performed on the data that is centrally collected. Event correlation is key when analyzing logs. If a user is seen logging in at one place, recorded by an authentication system such as Active Directory, the same user may also be seen logging into a remote network, such as a VPN.
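A minimal Python sketch of that kind of correlation is shown below. The event structure, field names, and five-minute window are assumptions made for illustration; a SIEM or Splunk search would perform the equivalent matching over real log data.

from datetime import datetime, timedelta

# Simplified, already-normalized events from two hypothetical sources.
ad_logins = [
    {"user": "jsmith", "time": datetime(2014, 3, 1, 13, 45), "src": "10.0.0.8"},
]
vpn_logins = [
    {"user": "jsmith", "time": datetime(2014, 3, 1, 13, 47), "src": "203.0.113.7"},
]

def correlate(ad_events, vpn_events, window=timedelta(minutes=5)):
    """Pair AD and VPN logins by the same user that occur close together."""
    matches = []
    for ad in ad_events:
        for vpn in vpn_events:
            if ad["user"] == vpn["user"] and abs(ad["time"] - vpn["time"]) <= window:
                matches.append((ad, vpn))
    return matches

for ad, vpn in correlate(ad_logins, vpn_logins):
    print(ad["user"], "seen in Active Directory (" + ad["src"] + ")",
          "and on the VPN (" + vpn["src"] + ") within five minutes")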
SIEM
Security Information and Event Management (SIEM) provides a log
management platform for organizations to use. A SIEM supports log collection
by gathering information from log sources via an agent placed on the system or
through a log generating host that forwards logs onto the SIEM. SIEMs can also
provide other features such as analysis and correlation of logs. However, a human still may need to interpret the logs due to the variety of ways a log can be interpreted.
There is a need for context. The meaning of a log often depends on what
other logs are surrounding it and the correlation of other events. Typically a
system administrator can define how a log is placed in that context. SIEMs also
cannot analyze every log and make a determination on what to do. Prioritizing
logs is also key. Different systems may be more important than others. The
combination of several factors and correlation of events might indicate that something other than what the SIEM reports is going on. Entry type,
newness of the log, log source, source or destination of the log or IP address of
the log, time or day of the week and frequency of the entry are all analysis points
that must be taken into consideration for a security event (Kent & Souppaya,
2006).
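One way to combine those analysis points is a simple weighted score, sketched below in Python. The weights, field names, and thresholds are illustrative assumptions rather than values taken from the guide.

# Illustrative weights for the analysis points listed above.
SOURCE_WEIGHTS = {"domain_controller": 3, "ids": 3, "firewall": 2, "workstation": 1}
ENTRY_TYPE_WEIGHTS = {"authentication_failure": 3, "malware_detected": 4, "policy_change": 2}

def priority(entry_type, source, seen_before, after_hours, frequency_per_hour):
    """Score a log entry using entry type, source, newness, time of day, and frequency."""
    score = ENTRY_TYPE_WEIGHTS.get(entry_type, 1) + SOURCE_WEIGHTS.get(source, 1)
    if not seen_before:           # newness of the log entry
        score += 2
    if after_hours:               # time or day of the week
        score += 1
    if frequency_per_hour > 100:  # unusually frequent entries
        score += 2
    return score

print(priority("authentication_failure", "domain_controller",
               seen_before=False, after_hours=True, frequency_per_hour=250))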
SIEM technology supports three critical areas: log management, threat
management and compliance reporting (Hunt & Slay, 2010). SIEM is an ideal
tool to collect log data, and incident responses originating from security devices
at the point at which forensic logging should occur. Information from systems is
sent to the SIEM to aggregate, correlate, and normalize the data. Data from
these systems can be analyzed in conjunction with network behavioral data to
provide a more accurate real-time picture of threats and attacks.
There needs to be consistency in log data. Integrated monitoring, logging
and alerting is meant to accomplish the following: monitoring network status,
generation of alerts and feedback to the user, reporting to system administrators, forensically sound safekeeping of traffic and log data over a period of time, and a comprehensive tool set for both real-time and after-the-event analysis. Information that is fed into the SIEM is used in real time to generate reports and provide feedback to stop or prevent an attack. A SIEM must also
provide a way to transmit the log data from the system securely, provide a chain
of custody for potential evidence, provide a traceback system to assist in
determining the source of an attack, provide reports and automatic alerts in real time, maintain forensically sound traffic and log records, and provide
fine-grained information of points of interest (Hunt & Slay, 2010).
Some SIEM engines support the storage of data for forensic analysis, which addresses the crossover between security and forensics, but few systems take real-time information and adapt it to security. There are limitations
in real-time adaptive feedback, however, which result from network forensic
analysis. Surveillance and vulnerability scanning often end up being just another
log.
CHAPTER 2
THE PROBLEM OF DATA BREACHES, EVENT DETECTION, AND
RESPONSE
The ability to locate information quickly is paramount in information
security. The amount of information collected needs to be filtered correctly so it
is concise and accurate for processing. Security information and event
managers (SIEMs) provide analysis of security events in real-time collected from
various internal systems. Typically a SIEM aggregates data from many sources,
monitors incidents, correlates the information and sends alerts based on that
information. A human is still responsible for looking at the information and deciding whether or not to act on it (Aguirre & Alonso, 2012).
Intrusion prevention systems and intrusion detection systems are key
components in detecting information from systems located within specific
networks. However they can have false-positives adding more investigation work
for administrators and security staff. Collaboration between SIEMs would allow
for information to be correlated between many systems.
Research by Aguirre and Alonso (2012) used several SIEMs based on AlienVault OSSIM to collaborate with each other. Snort was used as the IDS. Each SIEM would feed the other SIEMs information and be able to correlate information based on each sensor's directives. This allows for all systems to
see all alerts. There are standalone systems that perform a variety of analysis
features, but not all the features in one place with feedback and automated
reporting. This system was an excellent start for mass correlation of events and
overall detection. However, it still does not address external events. This system
only detected internal events. What if we thought about this research a different way, utilizing the internet and external organizations as SIEMs themselves reporting back to a centralized internal SIEM? Internal data would not be pushed out to the external organizations, but collected from them and reported on. External events about data that is important to the internal organization would be alerted on proactively: a proactive approach to threat management versus a reactionary one. The research also provides support
for the use of Splunk as a data analysis tool where information can be fed in
quickly, correlated, and analyzed automatically, and real-time feedback can be provided to a user.
CHAPTER 3
SPLUNK
In order to collect, aggregate, normalize, analyze, correlate, and alert
upon large data sets from either an internal or external organization, a platform is
needed to build the infrastructure upon. Research, as seen previously, has
shown Splunk to be able to meet the requirements of today’s vast and ever-expanding machine data. Researchers at Sandia National
Laboratories used Splunk to join information for supercomputers, security and
desktop systems (Stearley, Corwell, & Lord, 2010). When joining information for
security and desktop systems, their analysis was able to span their two different
data centers. Information from those systems included run-time data and local
conditions of the computing clusters. Decoding messages can be extremely time
consuming, but they are essential in diagnosing the overall health of the system.
Overall system health can be impacted by software upgrades, configuration
changes, and running applications. Using Splunk to analyze the data,
administrators were able to isolate and resolve problems more efficiently.
According to the research by Stearley et al. (2010), the typical log file
contains over 11 thousand lines every 5 minutes. Administrators could go through the data searching for information with grep; however, this would be very time consuming. Prior to deploying Splunk, they had coded a 654-line Perl script
to parse through the data to find specific information. Once Splunk was in place,
administrators were just able to add the log location and come up with very
simple informational "lookups" which allowed an administrator to see the data
quickly. When using Splunk, administrators were also able to script alerts that
would allow them to be alerted to fault conditions in the system. Another benefit of using Splunk was that it allowed the administrators to perform deeper analysis of their computing systems more easily, answering questions such as "what is the distribution of alerts versus time, job, user and type?" Splunk makes it easy to search through information and come up with a quick problem resolution based on
the data.
While machine learning offers the promise of more automated solutions to find and correct faults, administrators are still left to analyze logs to determine what actually took place for the initial fault condition. They may not have the entire picture of
the system. Splunk allows various data to be parsed and makes sense of the
data.
In addition to logging the system data, Splunk also logs all searches,
which enables administrators to improve searches over time. While it still takes
some analysis, Splunk solves a majority of the needs of an administrator. The
SAFES app, which has been developed as part of this external monitoring research, is built upon Splunk. There are three reasons why Splunk was chosen
as the platform to build and expand the SAFES app on. Splunk is an excellent
log management tool. Splunk provides a robust application building platform to
customize and analyze the data that is being collected. Splunk also meets the
requirements of much of the research already performed.
Splunk as a Log Management Tool
Splunk was first released in 2005 (Robb, 2006). The initial intent in developing Splunk was to have a platform where data from machines could be searched through. Splunk allows organizations to aggregate structured and unstructured data and search upon that data. Splunk is a great log management tool. With Splunk, an organization can throw almost any data into the program and get useful, meaningful information back out of it, depending on how you search through it. The biggest way that Splunk helps make sense of data for the purposes of SAFES is aggregation of data. Currently the University of
Colorado Colorado Springs’ implementation of Splunk contains data on over
10,000 different log sources. If an organization were looking through over 10,000
different log sources independently for very specific information for correlation of
security events, it would take an indefinite amount of time to search through it
and cull the data down to create meaningful results. Splunk takes the data in,
either structured or unstructured, and indexes it. Types of data that are logged include:
• System event logs
• Security event logs
• Firewalls
• Network routers and switches
• Encryption software
• Antivirus software
• Intrusion detection systems
• Servers
• Email logs
• Web application logs
• Authentication servers
Aggregating the data into one central place gives administrators and security
personnel the ability to search through all logs and data and make sense of it.
The ability to see that a user logged in from China over a VPN connection and then logged into a server, right before brute force attempts on a specific server from the same IP address stopped, is invaluable.
The data that is put into Splunk comes in a variety of different formats:
• Log files
• Email
• RSS feeds
• SNMP traps
• Syslog messages
This data is both structured and unstructured. Structured data is typically
normalized. Normalized data usually contains fields that are predictable: timestamps are understood, and IP addresses and MAC addresses are also understood.
Most of the data collected for the purposes of aggregation, correlation and
analysis is structured data. However, there is also unstructured data that must
be indexed. This data could contain information from firewalls or intrusion
detection systems, where log files may not make sense and data is not
predictable. Logs or data from third party external organizations could contain such unstructured data; however, the data needs to be normalized and indexed
for searching, correlation and analysis. The capability of Splunk to contain this
data is impressive. Equally as impressive is Splunk’s capability to have multiple
applications built on top of it to manage the data.
Splunk as an Application Development Platform
Just searching through log files in a simple to use search interface is not
enough when you need to develop specific ways to look through the data and
display the results easily. Since Splunk’s primary function in the SAFES app is to
store data, administrators and security personnel need to talk the same language
as the data and display relevant data easily.
Splunk gives us the ability to use the base platform as a building block for
custom applications using Splunk’s data. There are several components that are
at the core of the Splunk platform. Search managers, views, and dashboards are
included in the Splunk Web Framework, which is the building tool for Splunk
applications. The Search manager allows Splunk’s core search functionality to
be built into a custom application. It allows an operator to search through, start,
stop, pause, and save searches. Views allow a developer to customize
visualizations within the application so that data can be visually interpreted
quickly. Visualizations include charts, graphs, tables, events viewer, map viewer,
D3 charts, etc. Views also include custom search controls such as search bars,
timeline views and time range views. Form inputs are also included in the views
component. Dashboards make commonly viewed visualizations simpler to find. Dashboards can be updated in real time or based on timeline criteria.
Building on the core of Splunk to show us relevant data is useful and powerful,
however there are times where Splunk needs to be extended to use scripting
languages to build in interactive applications. Django Bindings and SplunkJS
Stack components allow developers to build dynamic applications on top of
Splunk. If SAFES functionality needs to be extended to import and correlate data
in a different way, a user of SAFES can modify the code that is easily managed
by Splunk.
These components allow a developer to use the base functionality of
Splunk as the core of the application and build a robust interface to narrow down
the data that can oftentimes be overwhelming to an operator or security
personnel looking through it. Data usage and relevancy change over
time, so the SAFES app in Splunk can be changed when needed without
purchasing other software.
How Requirements are Met
As previously identified, there are several important requirements that an
application would need to have in order to be effective in aggregating,
normalizing, correlating, and alerting on both internal and external data. The
requirements, and how they are satisfied by the SAFES application in order to
monitor external sources and alert on them, are as follows:

• Accept both structured and unstructured data from multiple sources – Splunk has the ability to accept any type of data source. As previously identified, Splunk can accept many different data sources and aggregate them into one location.
• Parse logs – Through the Splunk search functionality, logs and event data can be parsed many different ways to gather information about the data.
• Filter messages based on terms – Search terms can be included, excluded, or joined to help cull down the amount of irrelevant data.
• Data normalization – Data that is aggregated into Splunk can be normalized through the use of specific "lookups", "fields", and "tags". This allows any data from any source to match fields on other data sources even though data fields do not match up exactly. This functionality allows administrators and security personnel to have one language among all logs.
• Event correlation – If data is seen in one place, it may have been seen in another place. The ability to correlate information across different log sources is also functionality that is built into the core of Splunk.
• Log viewing – Logs can be viewed any number of ways depending on how the user wants the data to be presented.
• Alerting – A powerful alerting engine is built into Splunk. This allows a user to search for specific information, save the search as an alert, and have Splunk send an email, execute a script, or kick off another search if the pattern is seen again. The use of alerting will allow administrators to receive timely, proactive notifications about security events from both internal and external organizations.
The requirements of a system that allows for timely, proactive notifications of
internal and external event data fit what Splunk was designed to do. Using
Splunk as the core of the SAFES application allows an organization to quickly
and simply deploy an external monitoring solution, with minimal programming.
CHAPTER 4
DESIGN OF SAFES
Based on the previously stated work, a system that can automatically pull
in data, aggregate the input, normalize the data, correlate information obtained
from the input sources, analyze information obtained from the correlation of the
data, and alert system administrators or security professionals to a possible data
breach has been designed before. However, the systems discussed previously
only include internal data; they do not include multiple sources and formats of
external data. Data obtained from external sources and correlated with internal
organizational data can give us a more realistic picture of overall internal and
external data security.
The Security Advisories From Events System, or SAFES, application is
designed to take external information from third party websites, data feeds and
email, aggregate, normalize, correlate that information with internal security
events and produce alerts to internal users about internal and external security
events. SAFES helps internal organizations in the following ways, supported by
research:

• Aggregates logs and events. Aggregation aids in quickly searching for multiple sources of events in one place.
• Normalizes data through aggregation. Normalization aids in talking the same language and time between different sources and/or timestamps.
• Analyzes events to cut down on unnecessary or irrelevant events. Analysis aids in removing irrelevant data that could interfere with specific events.
• Provides ability to collect data from multiple external sources. External data collection provides a more complete picture of what internal organizational data might have been leaked by a data breach.
• Alerts on events. Alerting aids in response time of specific security events.
Requirements
The SAFES application requires the following:

• The ability to store events or logs from multiple locations
• Process different types of events
• Process different input sources
• Normalize timestamps and key fields
• Alert on specific data
• Analyze data while combining internal and external data inputs
• Provide historical SAFES alert information
The SAFES app will reside within Splunk. Splunk provides a robust platform, as
discussed in Chapter 3. Splunk supports one of the most important components
of the SAFES app in that it allows organizations to store large amounts of events
and logs from many different locations very easily. Second,
SAFES will not only process external events, but process internal events. This
requirement allows an organization to have one interface or one tool they can
use to process event data. Third, the Splunk platform will allow an organization
to input numerous types of data sources. This includes internal organization
system events from multiple sources, and external third party data sources such
as websites, RSS feeds, and email. Fourth, by utilizing the Splunk platform, we
can normalize logs no matter what timestamp format the event data is in. If an
event or log has a timestamp that Splunk does not recognize, we can easily
configure Splunk to recognize the data source and the timestamp associated with
the data source. Fifth, based upon specific criteria known to the organization, an
organization can utilize the alerting functionality within Splunk to provide
information on external events to security personnel, correlating internal data to
external data. Finally, data that has been alerted on before is stored within
Splunk. If the data has been seen in other external sources, events may not
need to be alerted on again.
Logging Sources
There are several types of events and logs that need to be included in
the SAFES application so that we can see internal events. Other external events
and logs can be added according to the organization's needs. For the purposes
of our deployment of SAFES, the following log sources have been included:
Windows 2012 domain controller security logs. These logs contain
authentication information for resources utilizing a Windows Active Directory
infrastructure. Resources inside a domain that rely on Windows Active Directory
could include email, 802.1x authentication for wireless or wired networks,
desktop systems, servers, LDAP connections, VPN connections, etc. Resources
also could include Kerberos based systems relying on Windows Active Directory.
An example of a Windows Active Directory log is as follows:
02/26/2014 08:45:24 PM
LogName=Security
SourceName=Microsoft Windows security auditing.
EventCode=4624
EventType=0
Type=Information
ComputerName=dc1.test.local
TaskCategory=Logon
OpCode=Info
RecordNumber=144211636
Keywords=Audit Success
Message=An account was successfully logged on.
Subject:
    Security ID:            NT AUTHORITY\SYSTEM
    Account Name:           dc1.test.local$
    Account Domain:         testdomain
    Logon ID:               0x3E7
Logon Type:                 3
Impersonation Level:        Impersonation
New Logon:
    Security ID:            testdomain\test
    Account Name:           test
    Account Domain:         testdomain
    Logon ID:               0x4C34215
    Logon GUID:             {00000000-0000-0000-0000-000000000000}
Process Information:
    Process ID:             0x228
    Process Name:           C:\Windows\System32\lsass.exe
Network Information:
    Workstation Name:       DC1
    Source Network Address: 192.168.0.1
    Source Port:            18920
Detailed Authentication Information:
    Logon Process:          Advapi
    Authentication Package: MICROSOFT_AUTHENTICATION_PACKAGE_V1_0
    Transited Services:
    Package Name (NTLM only):
    Key Length:             0
There are several fields within the event log that are important. The timestamp,
"Account Name", "Workstation Name", "Source Network Address", and
"EventCode" are all necessary pieces of information that SAFES can utilize. They
can help in determining whether credentials exposed in a third-party data breach
were used to log into a system. When combined with intrusion detection logs,
they will also show whether a brute force attack on user credentials was
successful.
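To make the field extraction concrete, the following short Python sketch (not part of the SAFES source; the function name and regular expressions are illustrative assumptions) shows one way the key fields could be pulled out of a raw 4624-style event like the one above. In practice Splunk performs this extraction itself; the sketch only shows what the extraction amounts to.

import re

def parse_windows_event(raw_event):
    """Extract the fields SAFES cares about from a raw Windows security event."""
    fields = {}
    # The timestamp is the first line of the event, e.g. "02/26/2014 08:45:24 PM".
    match = re.match(r"(\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2} [AP]M)", raw_event)
    if match:
        fields["timestamp"] = match.group(1)
    # The event code appears as "EventCode=4624".
    match = re.search(r"EventCode=(\d+)", raw_event)
    if match:
        fields["event_code"] = match.group(1)
    # Remaining fields appear as "Label: value" pairs.  "Account Name" occurs in
    # both the Subject and New Logon sections; this simple sketch keeps only the
    # first occurrence, so a fuller parser would distinguish the two sections.
    for label, key in [("Account Name", "account_name"),
                       ("Workstation Name", "workstation_name"),
                       ("Source Network Address", "source_address")]:
        match = re.search(label + r":\s*(\S+)", raw_event)
        if match:
            fields[key] = match.group(1)
    return fields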
Intrusion detection system logs. Intrusion Detection Systems provide
valuable pieces of information for a variety of different security events within a
network. Information such as malware activity, brute force logon attempts,
privilege escalation attempts, vulnerability exploitation, and more are typically
caught by an internal intrusion detection system. When combining intrusion
detection logs with other events, SAFES can help determine if a system might
have been compromised and malware was installed and active on that system.
For the purposes of this implementation of SAFES, a sample intrusion detection
set of logs could look something like this:
2014-02-27 03:41:12 pid(2230)
Alert Received: 0 1 trojan-activity
test-ids-eth1-2 {2014-02-27 03:41:11} 3 136507 {ET TROJAN Suspicious
User-Agent (MSIE)} 192.168.1.4 131.124.0.67 6 52250 80 1 2003657 12
2280 2280
2014-02-27 03:17:11 pid(2230)
Alert Received: 0 1 trojan-activity
test-ids-eth1-6 {2014-02-27 03:17:10} 7 118282 {MALWARE-CNC
Win.Trojan.Kazy variant outbound connection} 192.168.1.4 94.242.233.162
6 56605 80 1 26777 3 2370 2370
2014-02-27 03:07:54 pid(2230)
Alert Received: 0 1 trojan-activity
test-ids-eth1-4 {2014-02-27 03:07:53} 5 328412 {ET TROJAN Suspicious
User-Agent (Installer)} 192.168.1.4 108.161.189.163 6 58194 80 1
2008184 8 2344 2344
2014-02-27 03:07:04 pid(2230)
Alert Received: 0 1 trojan-activity
test-ids-eth1-1 {2014-02-27 03:07:04} 2 831844 {ET CURRENT_EVENTS
DRIVEBY Redirection - Forum Injection} 192.168.1.4 190.123.47.198 6
54015 80 1 2017453 3 2298 2298
The timestamp, the signature of the IDS rule that was triggered, the source
address, and the destination address are all important fields for this type of log.
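As an illustration only (this helper is hypothetical, not part of SAFES), a few lines of Python can pull those fields out of one of the sample alert lines above, assuming each alert is indexed as a single line:

import re

# The event time and rule description sit inside curly braces; the source and
# destination addresses follow the rule description.
IDS_PATTERN = re.compile(
    r"\{(?P<event_time>[^}]+)\}\s+\d+\s+\d+\s+"
    r"\{(?P<signature>[^}]+)\}\s+"
    r"(?P<src>\d{1,3}(?:\.\d{1,3}){3})\s+"
    r"(?P<dst>\d{1,3}(?:\.\d{1,3}){3})"
)

def parse_ids_alert(line):
    """Return event time, signature, source, and destination from an IDS alert line."""
    match = IDS_PATTERN.search(line)
    return match.groupdict() if match else None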
Microsoft Smart Network Data Services. One of the external logging
sources could be information obtained from Microsoft Smart Network Data
Services. Microsoft allows organizations that own a set of IP addresses to obtain
specific traffic information seen on the Windows Live Hotmail system. This
allows organizations to see what email is coming from what servers inside their
organization to the Windows Live Hotmail infrastructure. The information
contained within the daily log includes the sending IP address, how many RCPT
or DATA commands were received, the complaint rate, and the number of email
trap hits. Additionally, sample HELO and MAIL commands are shown. This
information can help directly identify a host that is sending out spam from an
organization. If a system is infected with malware that sends out spam and it is
not detected by internal methods, the external resource will catch it. The
following is a sample log obtained from Microsoft SNDS:
192.168.1.6   2/25/2014 3:00   < 0.1%   test@test.local   2/26/2014 1:00   646   586   0   exchange.test.local   638   GREEN
The keyword “GREEN” is significant because it identifies if the source IP address
is sending a specific level of spam. If a system is infected with spam sending
malware or has been compromised to send spam, the keyword “GREEN” will
change to “RED” which indicates over 90% of email seen from the source IP has
been identified as spam.
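A minimal Python sketch of how SAFES-style processing could flag these records is shown below. The column position of the filter result in the automated-access CSV is an assumption here, so the index would need to be checked against the actual SNDS export before use.

import csv

def flag_spamming_hosts(snds_csv_path, filter_column=7):
    """Yield (ip, filter_result) for rows whose filter result is not GREEN."""
    with open(snds_csv_path, newline="") as handle:
        for row in csv.reader(handle):
            if len(row) <= filter_column:
                continue  # skip malformed or short rows
            ip, result = row[0].strip(), row[filter_column].strip().upper()
            if result and result != "GREEN":
                yield ip, result  # e.g. ("192.168.1.6", "RED")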
Pastebin.com alerts. As we have seen from previous research
performed, Pastebin.com has been used to leak user credentials from
organizations by attackers. Pastebin.com allows external users to sign up for
their alerts system. A user can sign up their email address to obtain alerts on
three separate keyword searches. Once a paste has been posted to
Pastebin.com that contains one of the keywords, an email alert is sent off
containing the URL to the paste. A sample subject line and body message of the
alert email is as follows:
Subject: Pastebin.com Alerts Notification
Body: Hi testaccount
You are currently subscribed to the Pastebin Alerts service.
We found pastes online that matched your alerts keyword: '192.168.'.
http://pastebin.com/acbxyz
If you want to cancel this alerts service, please login to your
Pastebin account, and remove this keyword from your Alerts page.
Kind regards,
The Pastebin Team
Based on your keyword alerts, leaked user credentials or information on your
network may be contained in the paste.
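A minimal sketch, based on the sample body above, of how the matched keyword and paste URL could be pulled out of such an email (the exact wording of Pastebin's alert emails may change, so the patterns are assumptions):

import re

def parse_pastebin_alert(body):
    """Return the matched keyword and paste URL from a Pastebin alert email body."""
    keyword = re.search(r"alerts keyword:\s*'([^']+)'", body)
    url = re.search(r"(http://pastebin\.com/\S+)", body)
    return {
        "keyword": keyword.group(1) if keyword else None,
        "url": url.group(1) if url else None,
    }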
Shadow Server Foundation events. The Shadow Server Foundation
started in 2004. Its mission is to gather intelligence on malicious activity and
vulnerabilities from across the internet. Its goal is to provide information to the
security community in order to protect systems and assist in research. The
Shadow Server Foundation provides quite a bit of data, both in weekly emails
and daily emails depending on what kind of information is requested. Information
that would be helpful inside an organization would relate directly to that
organization. The Shadow Server Foundation allows organizations, owners of IP
ranges or ASNs, to be alerted any time any one of many different events seen
inside their organization is triggered. The information that Shadow Server
provides is collected from many different networks around the world. In order to
obtain data on an organization, the organization needs to sign up. Once properly
authenticated as the owner of the ASN or IP space, an organizational member
can receive the following information about their networks:

• Detected Botnet Command and Control servers
• Infected systems (drones)
• DDoS attacks (source and victim)
• Scans
• Clickfraud
• Compromised hosts
• Proxies
• Spam relays
• Open DNS Resolvers
• Malicious software droppers and other related information
The information obtained on an organization’s network is aggregated and
sent to the organization every 24 hours if an alert occurs. This information is
meant to assist organizations in their detection and mitigation methods. The
information is extremely helpful if current detection methods in the organization
cannot pick up the malicious traffic. A sample of some of the logs that are sent to
an organization:
Botnet DDoS
"Date","Time","C&C","C&C Port","C&C ASN","C&C Geo","C&C
DNS","Channel","Command","TGT","TGT ASN","TGT Geo","TGT DNS"
"2008-1103","00:00:12","76.76.19.73",1863,13618,"US","unknown.carohosting.net",
"#ha","!alls","98.124.192.1",21740,"US",""
45
Botnet Drone
"timestamp","ip","port","asn","geo","region","city","hostname","type","
infection","url","agent","cc","cc_port","cc_asn","cc_geo","cc_dns","cou
nt","proxy","application","p0f_genre","p0f_detail"
"2011-04-23
00:00:05","210.23.139.130",3218,7543,"AU","VICTORIA","MELBOURNE",,"tcp"
,"sinkhole",,,"74.208.164.166",80,8560,"US",,1,,,"Windows","2000 SP4,
XP SP1+"
Sinkhole HTTP-Drone
"timestamp","ip","asn","geo","url","type","http_agent","tor","src_port"
,"p0f_genre","p0f_detail","hostname","dst_port","http_host","http_refer
er","http_referer_asn","http_referer_geo","dst_ip","dst_asn","dst_geo"
"2010-08-31 00:09:04","202.86.21.11",23456,"AF","GET /search?q=0
HTTP/1.0","downadup","Mozilla/4.0 (compatible; MSIE 6.0; Windows NT
5.1; SV1)",,8726,,,,80,"149.20.56.32",,,,,,
Google Alerts. Google Alerts allows individuals to receive email or RSS
alerts on information when it is indexed on Google.com. Search criteria on news,
discussion boards, blogs, video, books or anything else, can be alerted on. The
RSS feed or email that is received has the title of the information seen, a brief
snippet of information surrounding the keyword query the individual alerted on,
and a link to the article. The following is an example RSS feed:
Figure 1. Example Google Alerts RSS feed entry.
While some of this information may not be useful, it provides insight into
other websites that may contain leaked information pertaining to your
organization.
Many other data sources, both internal and external, can be added and fed
into the SAFES application. However, the above sources provide good
information for automated alerting and analysis.
Data Schemas
Search queries need to be determined for both the Pastebin.com and Google
Alerts cases. The queries, when entered into either the Pastebin.com or Google
Alerts sites, need to be formatted in such a way that they not only provide
relevant information, but can also be utilized by the SAFES application to
determine the confidence or severity of a data breach. Data schemas must be
defined to provide these two types of data, both for the user and for SAFES.
Domain schema. When querying both Pastebin.com and Google Alerts,
it is best to have the organization’s domain as a query term. This validates that
the Pastebin.com paste or Google alert is talking about the keyword or
organization on the paste. For example, the domain “uccs.edu” will alert on a
news article or it will alert on a username seen, such as john.smith@uccs.edu.
The username may be a part of leaked credentials.
IP address schema. IP addresses can have a range from 0.0.0.0 to
255.255.255.255. If the organization's IP address range is
128.198.0.0-128.198.255.255, the individual would have to issue over 65k alerts
to cover the IP address space. However, if this is narrowed down to "128.198.",
we are able to cover all 65k addresses that may end up in an alert. The alert for
the IP address space might be triggered by a compromised host listed on the
Pastebin.com site or by Google Alerts.
Username schema. While not directly able to be alerted on, usernames
typically have a specific pattern. When used with other schemas, they can
provide a good indication of false positives. Typically a username is listed as an
email address on sites such as Pastebin.com or Google Alerts. Usernames in an
organization may only contain up to 8 characters. Therefore, a username seen
as “jsmith12@uccs.edu” would be valid, but “jsmith123345@uccs.edu” would not
be. The username schema is used in combination with confidence levels.
Password schema. Again, while not directly able to be alerted on,
passwords in an organization usually have a minimum set of requirements. If the
passwords that are breached on a third party site are less than an organization’s
set of minimum standards, the organization can be fairly confident that the user
did not re-use their credentials on the third party site which was breached.
Password policies in an organization may have requirements for specific length,
special character use, and digit usage. For example, if an organization has a
password policy requiring a minimum of 8 characters, a digit, and a special
character, the password "Test4me" would not be valid on the organization's
authentication servers, but "Test4me!" would be.
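A minimal Python sketch of the username and password schema checks, using the example policies above (usernames of at most 8 characters, passwords of at least 8 characters containing a digit and a special character); each organization would substitute its own rules:

MAX_USERNAME_LENGTH = 8
SPECIAL_CHARACTERS = set("!@#$%^&*()-_=+[]{};:,.?/")

def matches_username_schema(email_address, domain="uccs.edu"):
    """True if the local part of the address fits the organizational username schema."""
    local_part, _, email_domain = email_address.partition("@")
    return email_domain == domain and 1 <= len(local_part) <= MAX_USERNAME_LENGTH

def matches_password_schema(password):
    """True if the password meets the organization's minimum standards."""
    return (len(password) >= 8
            and any(ch.isdigit() for ch in password)
            and any(ch in SPECIAL_CHARACTERS for ch in password))

# Examples from the text:
#   matches_username_schema("jsmith12@uccs.edu")      -> True
#   matches_username_schema("jsmith123345@uccs.edu")  -> False
#   matches_password_schema("Test4me")                -> False (too short, no special character)
#   matches_password_schema("Test4me!")               -> True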
The use of schemas, both for external organization alerting and internal
data searching, helps narrow down a potential data breach on a third-party site.
Confidence Levels
While alerts from external organizations provide information to an internal
organization, the data obtained may not be relevant to alert on a potential data
breach. Confidence levels must be decided upon by the organization and
modified if data that is alerted upon is considered false positives or not relevant.
False positive information could come from Google Alerts in the form of a news
article being posted about the organization, or a Pastebin.com alert could be a
false positive if the IP schema for "128.198." matches "61.128.198.7".
Non-relevant data could lower the confidence level of the information if an alert
matches the domain schema but does not match the username schema. For
example, an alert generated by SAFES for the Pastebin.com post of
"jsmith12334@uccs.edu" could be considered non-relevant because the
username schema of uccs.edu only allows for 8-character usernames. The
username "jsmith12334@uccs.edu" would have a lower confidence level than
the username "jsmith7@uccs.edu", where the username schema matched the
organization's username schema.
Different confidence levels should also be placed on trusted and
untrusted data sources. Untrusted data sources could be search data from
third-party external organizations that you are querying for information; this
includes Pastebin.com or Google Alerts data, where you might have false
positives or non-relevant information. Trusted data sources have a direct
relationship to the data that the external organization is seeing; these include
Microsoft SNDS and Shadow Server Foundation reports, as well as internal logs.
Correlation between different data sources will also boost confidence
levels. For example, if a Pastebin.com alert comes in for a username or IP
address schema, and the username is shown to have been used recently in
internal logs, the confidence level is boosted to show potential correlation
between security
events. Again, confidence levels must be decided upon by the internal
organization for what data they are searching for. Broad schemas that may
potentially receive many matches may have lower confidence levels than
narrowed schemas.
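The confidence logic described here can be summarized in a few lines of Python. This is a sketch only; the source weights, boost and penalty rules, and level names are illustrative assumptions that each organization would tune for its own data sources.

BASE_CONFIDENCE = {
    "snds": "HIGH",          # trusted: reports directly on our IP space
    "shadowserver": "HIGH",  # trusted: reports directly on our IP space
    "pastebin": "MEDIUM",    # untrusted: keyword match only
    "google_alerts": "LOW",  # untrusted: broad keyword match
}
LEVELS = ["LOW", "MEDIUM", "HIGH", "CRITICAL"]

def score_event(source, schema_matches, correlated_internally):
    """Return a confidence level for an external event.

    schema_matches: set of schema names that matched, e.g. {"domain", "username"}.
    correlated_internally: True if the same indicator appears in internal logs.
    """
    level = LEVELS.index(BASE_CONFIDENCE.get(source, "LOW"))
    if "domain" in schema_matches and "username" not in schema_matches:
        level = max(level - 1, 0)                 # looks non-relevant, lower confidence
    if correlated_internally:
        level = min(level + 1, len(LEVELS) - 1)   # corroborated by internal data
    return LEVELS[level]

# For example, a Pastebin alert that matches both the domain and username schemas,
# with the username also appearing in recent internal logs, would score "HIGH".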
The combination of log sources, data schemas and confidence levels
helps determine the overall threat level of the external event. Administrators or
security personnel can then be alerted on the data obtained from external
organizations and react accordingly.
CHAPTER 5
IMPLEMENTATION OF SAFES
The SAFES app for Splunk allows organizations to monitor external data
sources for threats to their internal network and to proactively protect against
and alert on external security events. Since SAFES was designed to be a
simple-to-use tool, a majority of the programming has already been done by
others. Splunk will be the core aggregation, correlation, normalization, and
alerting tool for internal events. Additional Splunk apps must be installed in order
to ease the work of a user of SAFES.
Splunk Installation
Splunk itself is very simple to install. Splunk has two different versions: an
enterprise version, which is fairly expensive and aimed at operational intelligence
for many resources, and a free version. The free version of Splunk contains most
of the features that the enterprise version does; a few parts are stripped out, but
those are not needed for the basic implementation of SAFES.
Since Splunk is a commercial product, in order to download it, a user must
first register on the Splunk website. Once registered, the user can go to
http://www.splunk.com/download and download Splunk Enterprise, which gives
the user a 30-day trial of Splunk Enterprise; the Enterprise license will turn into
the free version license if a license is not entered. For our Splunk deployment we
will use CentOS 6.5 x64 as the base operating system with X Windows installed,
so the RPM package splunk-6.0.2-196940-linux-2.6-x86_64.rpm is
chosen. Once downloaded from Splunk.com, the following commands will install
Splunk Enterprise on the server.
rpm -i /root/Downloads/splunk-6.0.2-196940-linux-2.6-x86_64.rpm
/opt/splunk/bin/splunk start --accept-license
The default password must be changed upon first login. The URL for the Splunk
web interface is http://127.0.0.1:8000, which will prompt a user to change the
default admin password of “changeme” to something else. Splunk is now fully
operational and ready to accept data.
Third Party Apps
As stated previously, Splunk was chosen as the base system for a number of
different reasons. One of these was the ability of Splunk to be a programming
platform and able to be extended for a variety of different applications. Third
party apps can be downloaded from splunkbase.com. Several apps that are
needed for the SAFES app implementation allow us to extend Splunk in the
following ways:

Input RSS feeds that are not a native function of Splunk inputs

Monitor external email accounts

Automate “lookups”, “searches”, and “fields” that allow us to normalize
data across all inputs

Provide rich GUI interfaces for easy access to important security events
Splunk for IMAP. Splunk for IMAP is needed for two external sources for
SAFES. The first one is Shadow Server Foundation emails. The second is
Pastebin.com alert emails. The Splunk for IMAP app polls an IMAP account at
regular intervals and indexes any email that is pulled in from that account. The
Splunk for IMAP app is located at http://apps.splunk.com/app/27/ and is installed
into Splunk with the following command once downloaded to the server:
/opt/splunk/bin/splunk install app /root/Downloads/splunk-for-imap_120.tgz
Once installed and Splunk restarted, configuration must be completed to connect
the IMAP account used for Shadow Server Foundation and Pastebin.com emails
to Splunk.
The configuration file is located at:
/opt/splunk/etc/apps/imap/default/imap.conf
The following configuration lines will be modified for our test system:
server = mail.alphawebfx.com
user = safes@alphawebfx.com
password = DisqHAWxcwU1
useSSL = false
port = 143
Once saved and Splunk restarted, alerts from both Shadow Server Foundation
and Pastebin.com can be used.
Based on the schema of our test environment, we will add the keywords
"128.198." and "uccs.edu", which will send an email to our IMAP account, and
ultimately to Splunk, when a keyword is matched on Pastebin.
Figure 2. Pastebin.com alerts keyword configuration for the test schema.
Shadow Server Foundation emails must also be added. To sign up for
Shadow Server Foundation alerts, an email containing the full name,
organization, network of responsibility, email address for the reports, phone
number, and contact information for verification must be sent to
request_report@shadowserver.org. Once verified, daily emails will be sent to the
report email address included in the request if any alerts are generated.
RSS Scripted Input. RSS has become a popular internet format for getting
out short informational messages quickly without creating a lot of content on a
page. The RSS Scripted Input indexes the metadata for the RSS feed. The
RSS Scripted Input app utilizes an open source program called feedparser from
www.feedparser.org to parse through the RSS metadata. The RSS Scripted
Input app is located at http://apps.splunk.com/app/278/ and is installed into
Splunk with the following command once downloaded to the server:
/opt/splunk/bin/splunk install app /root/Downloads/rss-scripted-input_20.tgz
Once installed, configuration of the RSS feeds must be completed. The
configuration file is located at /opt/splunk/etc/apps/rss/bin/feeds.txt
Google Alerts will be used for RSS feeds. These alerts can be set up at
http://www.google.com/alerts/manage. As with Pastebin.com, the domain and IP
address schemas will be used: "128.198." and "uccs.edu".
Figure 3. Google Alerts configuration for the "128.198." and "uccs.edu" keywords.
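For illustration only (the RSS Scripted Input app already performs this work), the following few lines show what parsing a Google Alerts feed with feedparser amounts to; the feed URL is a placeholder taken from the Google Alerts manage page:

import feedparser

FEED_URL = "https://www.google.com/alerts/feeds/EXAMPLE-FEED-ID"  # placeholder

def read_alerts(url=FEED_URL):
    """Yield (title, link) for each entry Google has indexed for the keyword."""
    feed = feedparser.parse(url)
    for entry in feed.entries:
        yield entry.title, entry.link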
Other inputs. One more external input is needed based on the previously
discussed SAFES design: Microsoft SNDS alerts. These alerts come in the form
of CSV files that are published at a URL daily. To sign up for Microsoft SNDS,
the organization's IP ranges, ASN, or CIDR notation address must be entered at
https://postmaster.live.com/snds/addnetwork.aspx. Microsoft will send a
verification email to the contact on the IP range, ASN, or CIDR address. Once an
organization is signed up, automated access settings can be enabled.
Automated access gives organizations a link from which the data for their IP
range, ASN, or CIDR address can be obtained. The link downloads a CSV file
daily. In order for the SAFES system to process the data from the CSV file, a
script must be written to parse the data into a log file, and Splunk must then read
that log file. The input for SNDS is located inside the SAFES app at
logs/snds.log.
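A hedged sketch of the kind of script described above follows: it downloads the daily automated-access CSV and appends it to logs/snds.log so that Splunk can index it. The exact automated-access URL, its key parameter, and the log path inside the app are placeholders to be replaced with the link and layout of a real deployment (on the CentOS 6.5 test system, the Python 2 equivalent would use urllib2 instead of urllib.request).

import datetime
import urllib.request

# Placeholders: use the link shown on the SNDS automated access page and the
# log location of the actual SAFES installation.
SNDS_URL = "https://postmaster.live.com/snds/data.aspx?key=YOUR-AUTOMATED-ACCESS-KEY"
LOG_PATH = "/opt/splunk/etc/apps/SAFES/logs/snds.log"

def pull_snds():
    """Fetch today's SNDS CSV and append it, timestamped, to the SAFES log file."""
    data = urllib.request.urlopen(SNDS_URL).read().decode("utf-8", "replace")
    stamp = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    with open(LOG_PATH, "a") as log:
        for line in data.splitlines():
            if line.strip():
                # Prefix each CSV row with a timestamp Splunk can recognize.
                log.write("%s %s\n" % (stamp, line))

if __name__ == "__main__":
    pull_snds()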
Other inputs that are important to SAFES can be added as needed by the
organization to correlate internal events with external sources.
Installation of the SAFES App
All Splunk apps have nearly the same file structure. The app is uploaded
to the /opt/splunk/etc/apps/ directory in our test system. Scripts are located in the
bin/ directory, and configuration files are located inside the local/ directory. The
configuration files contain Splunk-specific programming language that allows
Splunk to process different characteristics of the app.
GUI
While the main goal of SAFES is to provide alerting of external events
based on confidence levels, a dashboard of external events provides data on the
four external data sources programmed into SAFES. Even though some of the
data may not correlate exactly to other events, timely searching of events without
searching through all security events may be critical in an incident response
situation. For this reason, SAFES only has one dashboard.
Figure 4. The SAFES overview dashboard showing the four external data sources.
Confidence Engine and Alerting
The main goal of SAFES is to provide alerting of external events to
organizations. Confidence levels are chosen by the organization based upon the
value of the external data that is being monitored. For example, for UCCS, high
value is placed on Microsoft SNDS and Shadow Server Foundation alerts;
medium value is placed on Pastebin.com alerts, and low value is placed on
Google Alerts. Confidence levels may change as the information provided may
be proven to provide more direct, valuable information. False positive
information coming from external sources can lower confidence in the external
data source. It is up to the organization to determine what value they place on
external data sources.
Internal data, when correlated with external data, may provide critical
confidence for a security event. Not all events can be correlated, however.
Constant analysis must be performed on external sources and internal events to
ensure alerting accuracy and correctness.
Alerts are handled by email through the Splunk alerting system. These
alerts must be modified to the specific needs of the organization. Multiple email
addresses can be used to alert key administrators or security personnel.
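Where email alone is not enough, Splunk's alert actions can also run a script. The following Python sketch shows what such a script might look like; the recipient addresses, SMTP host, and the argument positions that Splunk passes to legacy scripted alerts (the saved-search name and the results file path) are assumptions to verify against the Splunk version in use.

import smtplib
import sys
from email.mime.text import MIMEText

ADMINS = ["security-team@example.edu", "noc@example.edu"]  # hypothetical recipients
SMTP_HOST = "localhost"

def main():
    # Legacy scripted alerts pass several positional arguments; positions assumed here.
    search_name = sys.argv[4] if len(sys.argv) > 4 else "SAFES alert"
    results_file = sys.argv[8] if len(sys.argv) > 8 else "(no results file)"
    message = MIMEText("SAFES triggered the alert '%s'.\nResults file: %s\n"
                       % (search_name, results_file))
    message["Subject"] = "[SAFES] %s" % search_name
    message["From"] = "safes@example.edu"
    message["To"] = ", ".join(ADMINS)
    smtplib.SMTP(SMTP_HOST).sendmail(message["From"], ADMINS, message.as_string())

if __name__ == "__main__":
    main()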
CHAPTER 6
EXPERIMENTS
Even though we have implemented the design of SAFES and the
dashboard can easily show us what data is coming in from external sources, the
SAFES application should be exercised with three test scenarios: botnet activity
on a system, an external third-party data breach that may affect internal users,
and a system being used to send spam outside the organization.
Simulated botnet activity. Botnet activity may not always be picked up
by an internal IDS. When an external organization picks up botnet activity
coming from a system within an organization, this will either allow information
security personnel to confirm an infection or identify an area in which the internal
IDS is not picking up the activity.
The experiment that was carried out was an actual security event that took
place inside the UCCS network in 2013. To recreate the activity, dates and
system names have been changed. Botnet activity for the host 128.198.222.7
started on March 11 at 14:16 MDT, based on alerts from the internal IDS. Since
the Shadow Server Foundation only sends out email once a day to an
organization, the alert that there had been botnet activity on a host was sent at
5:49 the next day. Included in the attachment with the alert from the Shadow
Server Foundation are the key data points: the IP address 128.198.222.7 was
identified as infected with ZeroAccess malware at 16:29 UTC, which was
9:29 MDT. This indicates that the Shadow Server Foundation was able to identify
botnet activity 5 hours before our internal IDS showed it started on the infected
host.
When replaying this data through SAFES, we can instantly identify the “HIGH”
confidence level of the Shadow Server Foundation report. Additionally, since
more than one source is identifying botnet activity on a host, SAFES actually
raises the confidence level to “CRITICAL”.
Simulated external data breach. Data breaches that happen on third-party websites outside the organization are not necessarily a serious threat.
However due to the fact that research has shown password reuse to be high, an
organization should pay close attention to external breaches and user accounts
that are identified with their domain schema. A Pastebin.com paste was made
on March 9th detailing a database breach of a travel site. Usernames and
passwords for the external organization were exposed. SAFES issued an alert
because the domain schema, uccs.edu, was matched. The SAFES alert came in
as “MEDIUM” since only the domain schema was matched. The domain schema
was matched on the email address of xxxx@uccs.edu.
Simulated external spam detection. While many organizations monitor
their own internal mail servers, it may be difficult to monitor the entire IP address
space for outgoing spam. It is trivial to set up a mail server. Additionally,
malware can send spam out of unsuspecting compromised systems.
Microsoft SNDS easily identifies this type of traffic since spam email is usually
sent to tens of thousands of email addresses including many addresses that
Microsoft maintains. Experimental data was taken from a compromised system
at UCCS in March of 2014. A user account was compromised and a script was
uploaded to the user’s directory which allowed the attacker to tunnel a PHP mail
script through the SSH connection, ultimately allowing the attacker to send email
out of the SSH server. The spam messages sent out of the SSH server totaled in
the millions, and because it was a trusted system, outbound port 25 traffic was
allowed. Microsoft picked up roughly 11,000
messages each day until the problem was resolved. This data was reported by
Microsoft SNDS. When this information from Microsoft shows up on the SAFES
app, a “HIGH” confidence level alert is issued because all “RED” alerts that come
from Microsoft are considered “HIGH” confidence.
CONCLUSIONS AND FUTURE WORK
Data breaches occur every day. In 2013, over 311,000 compromised
accounts were available on Pastebin.com (High-Tech Bridge, 2014). This
number is staggering. These accounts came from new breaches and leaks.
What is even more alarming is that over 40% of the accounts that were leaked
were email accounts, which means that the credentials could be used to get into
other systems. The 311,000 accounts were just a small fraction of the accounts
that were actually leaked; the larger data sets were still kept by hackers and
hacktivists. Detecting the information where password or credential reuse could
be an issue is why the SAFES system was designed.
SAFES was designed and implemented so that internal organizations
could utilize the power of the internet to collect information from external
organizations reporting on what data they see coming from their organization.
Future Work
The SAFES app for Splunk was designed for one organization, the
University of Colorado Colorado Springs. The four external sources that SAFES
collects are pertinent to the University, as this information had been collected for
years in separate systems. As more external organizations open their data
collection on the internal organization, other data will be added to the SAFES
system. Additionally, since the SAFES system was built statically inside of
Splunk, a configuration form will be made so that organizations can set up their
own SAFES app without modifying source code. As each log from internal
systems is modified and new log sources are added, normalization of internal
logs may need to be modified as well.
References
Aguirre, I., & Alonso, S. (2012). Improving the automation of security information
management: A collaborative approach. IEEE Security & Privacy, 10(1), 55-59.
Retrieved from http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=
&arnumber=6060795
Aïmeur, E., & Lafond, M. (2013, September). The scourge of internet personal data
collection. In Availability, reliability and security (ARES). Paper presented at the
2013 Eighth International Conference (pp. 821-828). IEEE.
Baun, E. (2012). The digital underworld: Cyber crime and cyber warfare. Humanicus, 7,
1-25. Retrieved from http://www.humanicus.org/global/issues/humanicus-7-2012/humanicus-7-2012-2.pdf
Bronevetsky, G., Laguna, I., de Supinski, B. R., & Bagchi, S. (2012, June). Automatic fault
characterization via abnormality-enhanced classification. In Dependable systems
and networks (DSN). Paper presented at the 2012 42nd Annual IEEE/IFIP
International Conference on (pp. 1-12). IEEE.
Casey, E. (2006). Investigating sophisticated security breaches. Communications of the
ACM, 49(2), 48-55. doi: 10.1145/1113034.1113068
CERT. CERT incident note IN-98.03, password cracking activity. (1998). Retrieved from
the CERT Coordination Center, Carnegie Mellon University:
www.cert.org/incident_notes/IN-98.03.html
Curtin, M., & Ayres, L. T. (2008). Using science to combat data loss: Analyzing breaches
by type and industry. ISJLP, 4, 569. Retrieved from
http://web.interhack.com/publications/interhack-breach-taxonomy.pdf
Finkle, Jim. (2014, February 25). 360 million newly stolen credentials on black market:
Cybersecurity firm. Reuters. Retrieved from http://www.reuters.com/
article/2014/02/25/us-cybercrime-databreach-idUSBREA1O20S20140225
Fisher, J. A. (2013). Secure my data or pay the price: Consumer remedy for the negligent
enablement of data breach. William & Mary Business Law Review, 4(1), 215-238.
Retrieved from http://scholarship.law.wm.edu/wmblr/vol4/iss1/7/
Florencio, D., & Herley, C. (2007, May). A large-scale study of web password habits. In
Proceedings of the 16th International Conference on World Wide Web (pp. 657-666). ACM. doi: 10.1145/1242572.1242661
Franqueira, V. N., van Cleeff, A., van Eck, P., & Wieringa, R. (2010, February). External
insider threat: A real security challenge in enterprise value webs. In Availability,
reliability, and security. Paper presented at the ARES'10 International Conference
(pp. 446-453). IEEE.
Garrison, C. P., & Ncube, M. (2011). A longitudinal analysis of data breaches.
Information Management & Computer Security, 19(4), 216-230. doi:
10.1108/09685221111173049
Hampson, N. C. (2012). Hacktivism: A new breed of protest in a networked world.
Boston College International & Comparative Law Review, 35(2), 511-542.
Retrieved from http://lawdigitalcommons.bc.edu/cgi/viewcontent.cgi?article=1685&context=iclr
High-Tech Bridge. (2014). 300,000 compromised accounts available on Pastebin: Just the
tip of cybercrime iceberg. Retrieved from https://www.htbridge.com
/news/300_000_compromised_accounts_available_on_pastebin.html
Hunt, R., & Slay, J. (2010, August). Achieving critical infrastructure protection through
the interaction of computer security and network forensics. In Privacy, security,
and trust (PST). Paper presented at the Eighth Annual International Conference
(pp. 23-30). IEEE.
Ives, B., Walsh, K. R., & Schneider, H. (2004). The domino effect of password reuse.
Communications of the ACM, 47(4), 75-78. doi: 10.1145/980000/975820
Jackson, Don. (2008). Untorpig [Online posting]. Retrieved from
http://www.secureworks.com/cyber-threat-intelligence/tools/untorpig/
Jenkins, J. L., Grimes, M., Proudfoot, J. G., & Lowry, P. B. (2013). Improving password
cybersecurity through inexpensive and minimally invasive means: Detecting and
deterring password reuse through keystroke-dynamics monitoring and just-in-time fear appeals. Information Technology for Development. Advance online
publication. 1-18. doi: 10.1080/02681102.2013.814040
Kapoor, A., & Nazareth, D. L. (2013). Medical data breaches: What the reported data
illustrates, and implications for transitioning to electronic medical records.
Journal of Applied Security Research, 8(1), 61-79. doi: 10.1080/19361610.2013.738397
Kent, K., & Souppaya, M. (2006). Guide to computer security log management [Special
issue]. NIST Special Publication 800-92.
Krebs, Brian. (2014). Target hackers broke in via HVAC company [Web log post].
Retrieved from https://krebsonsecurity.com/2014/02/target-hackers-broke-in-via-hvac-company/
Mansfield-Devine, S. (2011). Hacktivism: Assessing the damage. Network Security,
2011(8), 5-13. doi: 10.1016/S1353-4858(11)70084-8
Maple, C., & Phillips, A. (2010). UK security breach investigations report: An analysis of
data compromise cases. Retrieved from the University of Bedfordshire
Repository website: http://uobrep.openrepository.com/
uobrep/handle/10547/270605
Matic, S., Fattori, A., Bruschi, D., & Cavallaro, L. (2012). Peering into the muddy waters
of Pastebin. ERCIM News: Special Theme Cybercrime and Privacy Issues, 16.
Retrieved from http://ercim-newsercim.downloadabusy.com/images/
stories/EN90/EN90-web.pdf#page=16
Notoatmodjo, G., & Thomborson, C. (2009, January). Passwords and perceptions. In
Proceedings of the Seventh Australasian Conference on Information Security, Vol.
98 (pp. 71-78). Australian Computer Society, Inc.
Poulsen, Kevin. (2011, June.). LulzSec releases Arizona police documents. Wired.
Retrieved from http://www.wired.com/threatlevel/2011/06/lulzsec-arizona/
Robb, Drew. (2006, August). 2006 Horizon Awards winner: Splunk’s Splunk.
Computerworld. Retrieved from http://www.computerworld.com/s/article
/9002558/Splunk_Inc._s_Splunk_Data_Center_Search_Party
Sherstobitoff, R. (2008). Anatomy of a data breach. Information Security Journal: A
Global Perspective, 17(5-6), 247-252. doi: 10.1080/19393550802529734
Splunk. (2014). Search managers. Retrieved from http://dev.splunk.com/view/SPCAAAEM8
Splunk. (2014). Splunk views. Retrieved from http://dev.splunk.com/view/SP-CAAAEM7
Splunk. (2014). Splunk web framework overview. Retrieved from
http://dev.splunk.com/view/web-framework/SP-CAAAER6
Stearley, J., Corwell, S., & Lord, K. (2010, October). Bridging the gaps: Joining
information sources with Splunk. In Proceedings of the 2010 Workshop on
Managing Systems via Log Analysis and Machine Learning Techniques (p. 8).
USENIX Association.
Stone-Gross, B., Cova, M., Cavallaro, L., Gilbert, B., Szydlowski, M., Kemmerer, R., ...
Vigna, G. (2009, November). Your botnet is my botnet: Analysis of a botnet
takeover. In Proceedings of the 16th ACM Conference on Computer and
Communications Security (pp. 635-647). ACM.
Weir, M., Aggarwal, S., Collins, M., & Stern, H. (2010, October). Testing metrics for
password creation policies by attacking large sets of revealed passwords. In
Proceedings of the 17th ACM Conference on Computer and Communications
Security (pp. 162-175). ACM.
Zhang, L., & McDowell, W. C. (2009). Am I really at risk? Determinants of online users'
intentions to use strong passwords. Journal of Internet Commerce, 8(3-4), 180-197. doi: 10.1080/15332860903467508
APPENDIX A
INSTALLING SAFES FROM START TO FINISH
Note:
This install manual assumes that the following software and versions are what
will be used for installation:
CentOS 6.5 x64
Splunk Enterprise Version 6.0.2
RSS Scripted Input Version 2.0
Splunk for IMAP version 1.20
Prerequisites:
An account must be set up on Splunk.com to download Splunk Enterprise and
third party apps.
Splunk Enterprise and third party apps must be downloaded to the server that will
host Splunk, third party apps and SAFES.
Installation:
mv Downloads/* /usr/local/src/
cd /usr/local/src/
rpm -i splunk-6.0.2-196940-linux-2.6-x86_64.rpm
/opt/splunk/bin/splunk start --accept-license
open Splunk in browser at 127.0.0.1:8000
username is admin
password is changeme
Splunk will then prompt you to change it
Back on terminal:
/opt/splunk/bin/splunk install app rss-scripted-input_20.tgz
This command will prompt you to enter the recently changed admin password
/opt/splunk/bin/splunk install app splunk-for-imap_120.tgz
cp -r /usr/local/src/SAFES/* /opt/splunk/etc/apps/SAFES/
/opt/splunk/bin/splunk restart
cp /opt/splunk/etc/apps/imap/default/imap.conf /opt/splunk/etc/apps/imap/local/imap.conf
/opt/splunk/bin/splunk restart
Manual Configuration:
vi /opt/splunk/etc/apps/rss/bin/feeds.txt
Remove default feeds, and add Google Alerts
vi /opt/splunk/etc/apps/imap/local/imap.conf
Modify configuration settings to match organizations IMAP account tied to
external monitoring accounts
cp -f /opt/splunk/etc/apps/SAFES/imap/getimap.py /opt/splunk/etc/imap/getimap.py
Post installation and configuration:
/opt/splunk/bin/splunk restart
SAFES Overview Dashboard: http://127.0.0.1:8000/en-US/app/SAFES/safes