EXTERNAL MONITORING FOR INTERNAL DATA BREACHES

by

GREGORY WILLIAMS
B.S., Colorado Technical University, 2005

A thesis submitted to the Graduate Faculty of the University of Colorado Colorado Springs in partial fulfillment of the requirements for the degree of Master of Engineering in Information Assurance, Department of Computer Science, 2014

© Copyright by Gregory Williams 2014. All Rights Reserved.

This thesis for the Master of Engineering degree by GREGORY WILLIAMS has been approved for the Department of Computer Science by

_______________________________________ Dr. C. Edward Chow, Chair
_______________________________________ Dr. Chuan Yue
_______________________________________ Dr. Xiaobo Zhou
_________________________ Date

Williams, Gregory (M.E., Information Assurance)
External Monitoring for Internal Data Breaches
Thesis directed by Professor C. Edward Chow

Data breaches and malware are commonplace these days. Even some of the largest organizations have fallen victim to hackers or to inadvertent data exposure. Hackers have posted details of their activities across the internet on sites like Pastebin.com, twitter.com, and others. What happens when a user of a compromised site uses the same credentials at the organization you are hired to protect? Or what if a piece of stealthy malware doesn't trip an intrusion detection system signature? Traditionally, internal information security relies on analyzing internal logs and events from desktops and servers to determine whether or not a malicious security event happened. Credentials of a user in an organization can be stolen via phishing or malware, which typically can be easily detected. However, what if the organization doesn't know that a user reused their organizational username and password on another website? What if that website were compromised? What if a piece of spam-sending malware were so stealthy that the internal organization didn't know about it? How can an organization identify such a security event if it doesn't know its data may have been compromised, because the internal logs say everything is fine? There is a need for external-to-internal notification of these types of security incidents. The SAFES app helps to import and analyze external threat information internally so that organizations are able to react to an external or internal security event more quickly by leveraging external organizations' information about them.

ACKNOWLEDGEMENTS

I would like to thank my family. First my wife, who continually pushes me to excel at what I love to do and who sacrificed her time to watch me pursue my dreams. My daughters, Annie, Emily, and Adalynn, who sacrificed their time with me so I could pursue this. Dr. Chow, who continually asks the questions that make you think. And finally Jerry Wilson, who recognized that I had a passion for this and lets me do what I love to do every day: practice information security.
vi TABLE OF CONTENTS Chapter 1: Introduction………………………………………………………………….1 How we got here – Threat landscape…………………………………………2 Hackers…………………………………………………………………..2 Hacktivism………………………………………………………………..3 Malware and Botnets……………………………………………………6 Users……………………………………………………………………..9 Threat Management…………………..……………………………………….16 Attack Methods and Data Types………………………………….….17 Detection Methods…………………………………………………….20 Monitoring and Analysis Tools……………………………….………20 Framework of Log Management……………………………………..22 SIEM…………………………………………………………………….24 Chapter 2: The Problem of Data Breaches, Detection, and Response……….....27 Chapter 3: Splunk………...……………………………………………………………29 Splunk as a Log Management Tool………………………………………….31 Splunk as an a Application Development Platform………………..……….33 How Requirements are Met…………………..………………………………34 Chapter 4: Design of SAFES…………..…………………………………………….37 Requirements…………………………………………………………………..38 Logging Sources……………………………………………………………….39 Windows 2012 domain controller security logs…………………….39 Intrusion detection system logs……………………………….……..41 vii Microsoft Smart Network Data Services…………………………….42 Pastebin.com Alerts………………………………………...…………42 Shadow Server Foundation Events….…………..…………………..43 Google Alerts…………….…………………………………………….45 Data Schemas……………………………………………..…………………..46 Domain schema………………………………………………………..46 IP address schema…………………………………………………….46 Username schema……………………………...……………………..47 Password schema……………………………………………………..47 Confidence Levels……………………………………………………………..47 Chapter 5: Implementation of SAFES………………………………….……………50 Splunk Installation……………………………………………….…………….50 Third Party apps……………………………………………………………….51 Splunk for IMAP………………………………………………….…….51 RSS Scripted Input…………………………………………………….53 Other Inputs…………………………………………………………………….54 Installation of the SAFES App…………………………………………..……55 GUI………………………………………………………………………………55 Confidence Engine and Alerting……………………………………………..56 Chapter 6: Experiments………………………………………………………………58 Simulated Botnet Activity………………………………………………….….58 Simulated External Data Breach……………………………………………..59 Simulated External Spam Detection………………………………………...59 viii Conclusions and Future Work…………………………………………………..……61 Future Work………………………………………………………………….…61 Bibliography……………………………………………………….……………………63 Appendix A: Installing SAFES from Start to Finish……………………………………..…68 ix TABLES Table 1. Information Collected from the torpig botnet…………………………8 x FIGURES 1. Sample Google Alert…………………………………………………..45 2. Pastebin.com Alert Setup……………………………………………..53 3. Google Alerts Setup…………………………………………………...54 4. SAFES Overview Dashboard…………………………………….…..56 xi 1 CHAPTER 1 INTRODUCTION Data breaches occur every day. They can impact individuals and organizations. They can affect individuals by exposing information that shouldn’t be known to the public or attackers. They can affect organizations as well. Malware installed on a system could allow attackers to steal users’ credentials from inside the organization. Malware can also start sending out spam messages or allow the compromised computer to become part of a larger botnet. Hackers can attack an organization’s assets and expose sensitive data. Often times data breaches are caught by an intrusion detection system (IDS) or by an individual that has been notified that their account has been compromised. However, what if internal detection methods fail? What if an organization doesn’t know they have been compromised because their tools cannot detect it? 
What if another organization has a data breach and a user credentials are reused on the compromised organization’s systems? The third party organization’s data could be at risk of exposure due to valid credentials leaked on a compromised system. Consider if an organization could utilize the third party systems across the internet as a sort of intrusion detection system that processes information collected and reports on possible data breaches as they are happening or in some cases before they happen. The third party organization’s information could be utilized to alert an internal organization about a possible data breach. The 2 information collected is out there but disparate. This thesis will propose and demonstrate a system that can collect third-party internal organizational data, and automatically analyze, correlate, and alert on the third party data so that potential data breaches that occur undetected by the internal organization’s and detected by the external organization’s will be known to the internal organization. How we got here – Threat Landscape Data breaches, data loss, and cybercrime happen frequently and they have been increasing for years. Everything that is connected has the potential of being breached (Baun, 2012). Hackers, hacktivists, malware and the users themselves all contribute to the problem of data breaches. Hackers There are many organizations that track data breaches including the Identity Theft Resource Center (ITRC), the Privacy Rights Clearinghouse, and a number of private firms. Research on data breaches was conducted by Garrison and Ncube in 2010. The research looked at data breaches that occurred between 2005 and 2009, specifically data that had actual number of records that were breached. Based upon their research, data breaches can be broken up into five distinct categories which are: stolen, hacker, insider, exposed and missing (Garrison & Ncube, 2011). The exposed and hacker categories are directly related to the problem that we are most concerned with. The exposed category covers unprotected data that can be found in different mediums. These mediums can include disks, files, hard drives, servers and desktop computers. The data on those mediums contained personal information such as social 3 security numbers, customer records, parents, children, etc. (Garrison & Ncube, 2011). The hacker category covers unauthorized access to a computer system or server. The data also revealed six specific types of organizations: business, education, federal/military, financial, local/state government and media. Their analysis looked to see if there was any specific data leading them to understand if certain categories of breaches or organizations had more breaches than others. What they found was interesting in both categories. Exposed and hacker categories for data breaches covered 466 data breaches with over 2.5 million records breached, which are 49.21% of the incidents and 75.91% of the total number of records breached respectively (Garrison & Ncube, 2011). The exposed category could be reduced significantly if it were not for careless employees or employers (Garrison & Ncube, 2011). During the five years that the study, it is also noted that 48.43% of the data breach records were from hackers compromising a system (Garrison & Ncube, 2011). Additionally, records that were under the exposed category totaled 28% with 26.58 of the total number of records compromised. Nearly 75% of the records that were breached came from the exposed and hacked categories. 
Keep in mind that these numbers were from 2009. Their research did not start to touch on the new trend in hacking, starting in 2010: hacktivism. Hacktivism Hacktivism is more or less a combination of hacking and activism (Hampson, 2012). Hacking is typically done for self-interest, whereas hacktivism is done for social or political goals. The information the hacktivists obtain is 4 typically shared out to the public (Mansdield-Devine, 2011). Hacktivists desire publicity. Typically they will claim that it is for the greater good, and to promote security awareness (Mansdield-Devine, 2011). More so it’s about making a public statement. The group Anonymous has a long history with activism. It was only until recently they started to use their skills for hacking. The first public demonstrations by Anonymous were protests against the Church of Scientology. The next major event was Operation Payback against the music industry for its pursuit of filesharers. Other operations ensued, including hacking (MansdieldDevine, 2011). One of the larger operations to expose data in recent years was the Sony data breach. The breach affected 75 million accounts. Information that was compromised included name, address, country, email address, birthdate, logins (usernames and passwords) and other data that may have been obtained during the compromise including credit card information. The data was apparently leaked by Anonymous (Fisher, 2013). A partial database dump was posted onto Pastebin. What's more concerning is that Sony did not encrypt its' customers information. Usernames and passwords were in plain text (Fisher, 2013). LulzSec, in my opinion, was far more damaging to the public. LulzSec protested only for the "lulz" - the pure joy of mayhem (Mansdield-Devine, 2011). In LulzSec’s 50 days of hacking and hacktivism stunts, organizations like PBS, Sony, CIA, and the Serious Organised Crime Agency (SOA) were compromised or taken down by denial of service attacks. LulzSec also admitted to 5 compromising the security firm HBGary (Mansdield-Devine, 2011). LulzSec’s campaign "Antisec" was geared toward the public awareness of security weaknesses. Public awareness came through the exposure of personal information on many individuals during those 50 days of “Antisec”. This included information such as email addresses, passwords, usernames, social security numbers, sensitive emails, etc. (Mansdield-Devine, 2011). Organizations that are targeted by hacktivists are not the only victims of an attack. The information that is leaked by hacktivists, such as usernames, passwords, email addresses, even physical addresses puts everyone at risk. The attacks on Arizonan law enforcement put police officers at risk (Poulsen, 2011). Information could have been used to pursue revenge. The information that is leaked by hacktivists can be used for other intentions such as revenge. LulzSec actually encouraged their followers to log into the personal accounts of the victims data he leaked and embarrass them (Mansdield-Devine, 2011). Embarrassment because the exposed data came from sometimes not so upstanding websites. Ultimately, hacktivists believe that users will just change their passwords, however it assumes that the victims know that their personal information has been compromised (Mansdield-Devine, 2011). The information that is leaked because of hacktivism may be used to log into other organizations. This is worrisome especially if a user used their same password on multiple sites or worse for an organization’s website. 
Leaked information may be easily obtained from other sites and should be monitored by an organization so that organizational information is not compromised if 6 organizational credentials are compromised. Hacktivists seek to expose information to get the attention of a business. They do this either promote a cause or to point out how weak their security is. In 2011, there were 419 incidents of reported data breaches, involving 23 million records (Fisher, 2013). This information was exposed. Hacking is not only a way to steal information, it is a way to get the business's attention. It is a way of stealing funds, it is a way of making fraudulent transactions. However, from a consumer's standpoint, it doesn't matter if the data was exposed by a hacker or hacktivist, information was exposed. Once the information has been exposed the fear of identity theft is left with the consumer. Consumers’ information can also be exposed by malware. Malware and Botnets The number of hidden and unidentified infections from malware will cause a degree of unknowingness and concern when it comes to the protection of sensitive information (Sherstobitoff, 2008). Banking trojans, for example, are enabling the rise of financial and economic fraud (Sherstobitoff, 2008). Online fraud and phishing campaigns are also on the rise. Electronic records can be hacked or spied upon through malware (Kapoor & Nazareth, 2013). Information shared on Pastebin.com that could be considered sensitive data can include lists of compromised accounts, database dumps, lists of compromised hosts with backdoors, stealer malware dumps and lists of premium accounts (Matic, Fattori, Bruschi & Cavallaro, 2012). Passwords have become compromised because of malware on workstations and on network equipment (Ives, Walsh & Schneider, 7 2004). One category of malware that is of particular interest is botnets. Botnets are a means for cyber criminals to carry out malicious tasks, send spam email, steal personal information and launch denial of service attacks. Researchers at the University of California Santa Barbara were able to take over the torpig botnet for 10 days several years ago. Data collected from the botnet was astounding. During the course of the botnet takeover 1.2 million IP addresses were seen communicating with the command and control servers (Stone-Gross et al., 2009). There were 180 thousand infected machines with 70 GB of information also collected during this time (Stone-Gross et al., 2009). Most of that information was of a personal nature. Due to the way the torpig botnet was set up, information was able to be gathered from a variety of user installed applications, such as email clients, FTP clients, browsers, and system programs (Stone-Gross et al., 2009). Data that was sent through these applications were able to be seen, encrypted and uploaded to the attackers every twenty minutes. However encryption on the algorithm was broken in late 2008, which allowed researchers to analyze the botnet (Jackson, 2008). The torpig botnet also used phishing attacks on its victims to collect information that was not able to be gathered from passive data collection. Phishing attacks are very difficult to detect especially when the botnet has taken over a computer, because a suspicious look website looks legitimate (StoneGross et al., 2009). The botnet communicated information back to the command and control 8 server over an HTTP POST request. 
During the time that the botnet was observed, the following personal data was collected:

Type                 Amount
Mailbox account      54,090
Email                1,258,862
Form data            11,966,532
HTTP account         411,039
FTP account          12,307
POP account          415,206
SMTP account         100,472
Windows password     1,235,122

Table 1. Information Collected from the torpig botnet (Stone-Gross et al., 2009).

Another aspect of the torpig botnet was that there was evidence that different operators used the botnet for different tasks, meaning that the botnet was used as a service for a fee. The torpig botnet also stole financial information, including 8,310 accounts at 410 different financial institutions. Researchers also looked at passwords and provided analysis of what they saw. The botnet saw nearly 300 thousand unique usernames and passwords sent by 52,540 different infected machines (Stone-Gross et al., 2009). During a 65 minute period, the researchers collected 173 thousand passwords. 56 thousand passwords were recovered by a password cracker using permutation, substitution and other simple replacement rules. Another 14 thousand were recovered in 10 minutes when given a wordlist. This means that roughly 70 thousand passwords, about 40% of those collected, were recovered in 75 minutes (Stone-Gross et al., 2009). This information is astounding. Password reuse and weak passwords are among the most troubling scenarios for administrators and security personnel inside an organization. If passwords are reused across different websites or organizations, those credentials can possibly get attackers into other systems in other organizations.

Users

Computer users are also part of how we got here. Users have credentials, and most of the time users are trusted. Users can represent two different types of threats: internal threats and external threats. The external threat is located outside the organization, while the insider threat is located inside the organization. External threats may be actual attackers trying to gain access to systems or user credentials remotely. Insider threats may be users within the organization who may steal data. The third kind of threat proposed by the researchers is called the external-insider threat, where the threat originates from an outsider to the organization who has user credentials that can place internal systems at risk (Franqueira, van Cleeff, van Eck, and Wieringa, 2010). External insiders add unique challenges to security because they are trusted, yet they may have a more lax security posture. Insider attacks tend to be more harmful than outsider attacks (Franqueira et al., 2010). Other research has shown that the number of records breached was higher with insider threats than with external threats. Value-webs, as defined in their research, are cross-organizational cooperative networks that consist of an internal organization having some operational relationship with an external organization (Franqueira et al., 2010). This could be, for example, an HVAC company monitoring heating and cooling information for a large organization's data center. This is the same scenario that maps to the Target corporation data breach in 2013 (Krebs, 2014). Insiders are individuals that are trusted and have some authorized access over the organization's assets. They have legitimate privileges which may be required to perform certain sensitive and authorized tasks. This can represent a problem if authorization or privilege goes unchecked. It can allow insiders to acquire information they wouldn't normally have access to, causing increased risk to the organization.
External insiders are individuals that are not trusted, but have some authorized access into the organizations assets. External insiders needs to have access grated to them to fulfill the organization value-web contract they have with the internal organization. This presents a risk if the privileges allow specialized access. Insider threats can also be classified into specific kinds of actors. Masqueraders are individuals that steal the identity of a legitimate user (Franqueira et al., 2010). Misfeasors are legitimate users that are authorized to use systems to access information but misuse their privilege. Clandestine users are individuals who evade access control and audit mechanisms and aren't identified until they fit the other two classifications. There are challenges, however, in identifying misuse. An organization 11 may not log enough details to have information to tie back events to a specific user. In fact, only 19% of analyzed organizations that had data breaches in 2008 had a unique ID tying them back to an event (Franqueira et al., 2010). In the 81% of the cases, shared system access was used, therefore anyone that knew the credentials for the account that caused the data breach could have been the cause of the breach (Franqueira et al., 2010). This may be from user password reuse. Password reuse is a major problem. Microsoft conducted one of the largest research studies ever conducted between July and October 2006 (Florencio & Herley, 2008). During that time when a user downloaded the Windows Live Toolbar, a portion of users were given the choice to opt-in to the component that measured password habits. The component measured or estimated the quantities of passwords in use by users, the average number of accounts used passwords, how many times users entered passwords in per day, how often passwords were shared between sites and how frequently passwords were forgotten. There are numerous kinds of websites that we must remember passwords for. Passwords are so important that when entered into a web form, they are masked. SSL is also a key component of securing the password so that it cannot be seen from observers on the network (Florencio & Herley, 2008). However, in order to remember passwords, one might think password managers would be in greater use. For most users a small set of passwords is maintained in their memory. 12 For example if a user has 30 different accounts, only 5 to 6 passwords are remembered for those 30 accounts, not 30 different passwords (Florencio & Herley, 2008). Passwords are typically remembered on paper, by memory, password trial and error and by resetting passwords, not by password managers. Phishing has also increased in past years. Keyloggers and malware are also on the rise. These allow attackers to easily steal strong passwords as well as weak ones. The component in the Windows Live Toolbar that was a key component in the study consisted of a module that monitored and recorded Password Re-use Events (PRE's). This module contained an HTML password locator that would scan the document in search of the HTML code inputtype="password" and extracted the HTML value field associated with the inputtype. If a password was found, it was hashed and added to the Protected Password List (PLL) (Florencio & Herley, 2008). Another component was Realtime Password Locator(RPL). The PRL maintained a 16 character FIFO that stored the last 16 keys typed while the browser was in focus. 
This allowed the researchers to show that if a series of characters entered matched an existing hash, they knew they had a password re-use event. The URL was also matched so that duplicate visits to the same site did not produce a Password Reuse Event (PRE). PRE's were only sent to the researcher’s servers if an actual PRE occurred. Unique passwords that were only used once didn't produce a PRE. The research also showed what sites had the most PRE's. The top five were Live.com, google.com, yahoo.com, myspace.com, and passport.net. 13 What's more concerning is the fact that 101 PRE reports were from known phishing sites (Florencio & Herley, 2008). This means that a user would have been successfully phished, allowing the phisher to compromise not only the account of whatever site they might have been spoofing or phishing, but the rest of the sites that the user used the same password for. Users choose weak passwords. The average user has 6.5 passwords, each of which is shared across about 4 websites each (Florencio & Herley, 2008). Users have roughly 25 accounts across the internet that take a password. A user must use around 8 passwords a day. The average number of sites that share the same password is 5.67. Weak passwords are used more often on sites, on average 6.0 and strong passwords are used less often 4.48. During the study the average client uses 7 passwords that are shared and 5 of them have been re-used within 3 days after installing the toolbar (Florencio & Herley, 2008). Password reuse can lead to direct compromise of other websites if an attacker leaks usernames/passwords for a compromised site that the user belongs to. The concern is not so much protecting your internal users from being compromised by one of your servers, it is the fact that user's passwords habits allow an internal attack to potentially go undetected because an attacker uses legitimate credentials from an external resource. If a password is reused on one account, a hacker may be able to pivot to another system or account and use that same password for that person's access. This happens in a corporate environment. As e-commerce becomes more mainstream, accounts belonging to sites that become compromised might be used to compromise other systems. For example, a user might have 14 accounts at Bank of America, AOL and Amazon.com. Users are poorly equipped to deal with today's need to multiple passwords for multiple accounts. Users do not realize that the reuse of their passwords for a high security website are now lowering security as if using a low security website (Ives et al., 2004). Password reuse can allow attackers to gain access to other websites. Users also tend to use short passwords containing only letters that generally tend to be personal in nature. They typically do not update their passwords. One theory behind this is the fact that humans have cognitive limitations which leads a user to not be able to remember a more complex password over time (Zhang & McDowell, 2009). Because of the cognitive limitations users are often less than optimal decision makers when it comes to reasonable thought about risk, especially about passwords. When presented with password creation, users tend to favor quick decisions to save cognitive thought (Zhang & McDowell, 2009). Passwords provide protection to a company intranet, bank accounts, email accounts, etc. Breach of a password can lead to personal data loss. 
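To make the password re-use detection mechanism described above concrete, the following is a minimal, purely illustrative sketch of that kind of client-side matching: hash every password observed in a password field into a protected password list, keep a short FIFO of recent keystrokes, and flag a re-use event when any suffix of that FIFO hashes to a previously stored value. All class and variable names here are hypothetical, and the sketch deliberately omits the URL matching and reporting steps used in the study.

import hashlib
from collections import deque

class PasswordReuseMonitor:
    """Toy model of the toolbar component: a protected password list plus a 16-key FIFO."""

    def __init__(self, fifo_size=16):
        self.protected_hashes = set()        # hashes of passwords seen in password fields
        self.recent_keys = deque(maxlen=fifo_size)

    def _digest(self, text):
        return hashlib.sha1(text.encode("utf-8")).hexdigest()

    def record_password_field(self, value):
        # Called when an HTML input of type "password" is submitted; only the hash is kept.
        self.protected_hashes.add(self._digest(value))

    def record_keystroke(self, key):
        # Called for every key typed while the browser is in focus.
        self.recent_keys.append(key)
        buffer = "".join(self.recent_keys)
        # Check every suffix of the FIFO against the protected password list.
        for start in range(len(buffer)):
            if self._digest(buffer[start:]) in self.protected_hashes:
                return True                  # password re-use event (PRE)
        return False

monitor = PasswordReuseMonitor()
monitor.record_password_field("hunter2")     # password first seen on site A
for ch in "hunter2":                         # the same string typed later on site B
    reused = monitor.record_keystroke(ch)
print("PRE detected:", reused)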
Therefore the study by Zhang and McDowell (2009) about password perceptions suggests that users tend to pick stronger passwords based on perceived risk. However, the users’ attitudes when presented with security practices are negatively related to people’s attitudes to a security policy. The effort and time it takes to create strong passwords and updating them can be associated negatively with their password protection intention. Even though users gain more accounts over time, that doesn't necessarily mean that a user will use a new password for each 15 account. If the password of a company network was compromised, this could lead to data loss specifically confidential and sensitive data. Multiple passwords in use across multiple sites creates more cognitive thought and therefore users reuse passwords out of negativity toward coming up with multiple unique passwords (Zhang & McDowell, 2009). The Protection Motivation Theory (PMT) suggests that perceived severity in the data that is protected by the password, is not related to the password protection intentions. Users who perceive a severe consequence for losing the data to a data breach or compromised password do not necessarily intend to take more effort to protect it. Zhang and McDowell (2009) also suggest that users do not choose strong passwords due to the added response cost of the password. Users typically use passwords for main tasks in their jobs. Using a password to access data adds to their cognitive load, so when a new password is requested, users do not want to add to their cognitive load, so they choose passwords that are familiar and easy to remember, thus, password reuse may be commonplace. Users are the weakest link in password control due to our reuse of passwords. Password policies that prohibit the reuse of passwords are often abused. Security on a system can be compromised even if an attacker knows a single work. An attacker that knows a little information can impersonate a user and gain access to secured information. Social engineering, shoulder surfing, dumpster diving, and phishing are all ways that a user can be targeted to obtain a password for information. Password reuse is almost commonplace for most users since users have 16 multiple accounts. The more accounts we have, the more passwords will be reused. A user’s password reuse will grow as a user creates more accounts. Remembering the passwords was the primary reason why a user reused passwords. Protecting private information was the primary reason for not reusing passwords. Users typically base their complexity of passwords on the data that they are trying to protect. A study conducted by Notoatmodjo and Thomborson (2009) gave a very strong correlation to the number of accounts and password reuse. The R value for the correlation was .799 and the adjusted R value was .790 which is significant (Notoatmodjo & Thomborson, 2009). Users, when approached with the question of why they reused passwords indicated 35% that the password was easy to remember (Notoatmodjo & Thomborson, 2009). Only 19% said the reason for password reuse is that the site does not contain valuable data (Notoatmodjo & Thomborson, 2009). Based on user behavior, 11 out of the 24 participants reused a password even for high importance accounts (Notoatmodjo & Thomborson, 2009). 23 out of 24 reused passwords for low importance accounts. 
Even though perceptions about password security severity increase with the amount of sensitive data stored on a website, that doesn't mean an organization should be protecting the information from external data sources. Threat Management Hackers, hacktivists, malware and botnets, and most importantly users are the actors in how data is breached. In order to start identifying how we can 17 mitigate attacks and identify potential areas of remediation, there has to be an understanding of what mechanisms are allowing data breaches to take place. Attack Methods and Data Types There are many reports on how data is breached. However, most of the information out there is from self-reported incidents, not from the actual companies reporting the data. 7Safe collected and analyzed the information from forensic investigations that they conducted. The data was collected over a period of 18 months. During those 18 months, they analyzed 62 cases of data breaches (Maple & Phillips, 2010). Investigating the data breaches and analyzing the data should help in identifying how future attacks can be prevented. The data breaches investigated came from many different sectors. Business, financial, sports and retail. Retailers store a lot of information on their customers including credit card information. From the year 2000 to 2008, card not present - which is used in e-commerce fraud increased 350% where-as online shopping increased by 1077% (Maple & Phillips, 2010). 69% of the organizations that were breached were retail, the next highest was financial at 7%. 85% of the data compromised was credit card information, followed by sensitive company data, non-payment card information, and intellectual property (Maple & Phillips, 2010). 80% of the data breaches were external, 2% internal and 18% were business partners (Maple & Phillips, 2010). 86% of the attacks involved a web-server that was customer facing (Maple & Phillips, 2010). 62% of the attacks were of average complexity, 11% were simple attacks and 27% were sophisticated attacks that required advanced skill and knowledge of 18 programming and operating systems. Sophisticated attacks typically happen over a long period of time. SQL injection made up 40% of the breaches and another 20% were from SQL injection combined with malware. 10% were strictly malware and 30% were poor server configuration or authentication. Significance is highly placed on SQL injection which accounts for many of the data breaches that occur. Since many organizations have databases that contain sensitive or vast amounts of information, SQL injection accounts for large data breaches (Mansdield-Devine, 2011; Maple & Phillips, 2010; Weir, Aggarwal, Collins, & Stern, 2010). Types of data that are breached are also worrisome. There are several different internet data collection domains which contain specific personal information: Healthcare: Healthcare information has been able to help users interact with their providers. However, much of the information on the healthcare websites is private and sensitive information. Attackers could use this information to steal someone’s identity or worse exploit one's medical weakness (Aïmeur & Lafond, 2013; Kapoor & Nazareth, 2013). E-Commerce: Browsing habits, sites visited, products looked at are often information that is collected on a user. Attackers could use this information to exploit a user’s interest (Aïmeur & Lafond, 2013). E-learning: Students typically share information and that information is accessible to other students. 
Information is stored on the above domains. Information is also collected by not only looking at what accounts users create and the information contained 19 within those accounts, but a user’s habits on the internet. Social media, online data brokers, search engines and geolocation data all contain data that can be collected analyzed and parsed through linking users to what's important to them (Aïmeur & Lafond, 2013). This is a major concern if data from these sources are leaked. Surveillance, interrogation, aggregation, identification, insecurity secondary use, exclusion, breach of confidentiality disclosure, exposure, increased accessibility, blackmail appropriation and distortion are all concerns if data were to be leaked (Aïmeur & Lafond 2013). Users give away a lot of information about themselves: Identifying information such as name, age, gender, address, phone number, mother's maiden name, SSN, income, occupation, etc. (Aïmeur & Lafond, 2013). Buying patterns in which users are giving away such information as websites visited, assets, liabilities, stores they regularly shop from. Navigation habits including websites visited frequency of the visits usernames used on forums. Lifestyle information such as hobbies, social network information, traveling behavior, vacation periods, etc. Sensitive information such as medial or criminal records. Biological information such as blood group, genetic code, fingerprints, etc. All of this information can be tied together with enough time. There are often attacks on people's privacy. Hackers are always on single step behind the new technology however. If there is a vulnerability hackers will exploit it. All is not lost since detection methods can catch at least some of the attacks that happen, some even before they happen. 20 Detection Methods While there are numerous reports of data breaches happening every day, an undetected breach cannot be reported (Curtin & Ayers, 2008). Detecting a data breach can be difficult. Careful intruders hide or remove evidence of a breach by altering information such as timestamps, deleting logs, or modifying applications to avert detection (Casey, 2006). Even though a system or user may become compromised there are areas in which detection can be accomplished. Monitoring and analysis tools There has been substantial development in computer and network security design in the past few years. This is seen in the new protocols, new encryption algorithms, new authentication methods, smarter firewalls, etc. (Hunt & Slay, 2010). The security industry has also seen improvement in computer forensic tools where the methods of searching for and detection of malicious activity have become more sophisticated. Security systems have been designed to detect and provide protection from malware such as viruses, worms, trojans, spyware, botnets, rootkits, spam and denial of device attacks (Hunt & Slay, 2010). However it is often difficult to effectively assess damage caused by malware or system attacks based on the massive amount of logs collected by these systems. Traditionally computer forensics was performed by looking at the data on storage devices. However in recent years there has been a shift in the way computer and network forensics data is obtained. This is through the live analysis of network traffic and system logs. Network forensics is concerned with 21 monitoring network traffic to see if there are any anomalies. 
An attacker may have been able to cover their tracks, so traditional computer forensics does not work as well as network forensics. Security tools need to monitor and detect attacks and at the same time forensic tools need to both soundly record traffic and security events while providing real-time feedback. This is so that an attack can be observed and monitored, recorded forensically so that it can be preserved for evidence, tracked, so that a user understands the scope of the events, and also limit its' damage so that it is not able to take down a network. Successful investigation of a data breach relies heavily on logs of the data breach. Information that is preserved can help an investigation succeed. Successful intruders are counting on an organization to not have forensic analysis in place and strict logging information (Casey, 2006). As more information is collected from systems across an organization, the value of the evidentiary logs when dealing with a data breach increases. Security vendors will design their products with forensic principles in place, but it is still up to the organization to define what critical assets will be looked at for those key logs. Tools exist out there to collect information into a single database that can be queried for specific time periods, IP addresses, and other information. Information can be correlated and normalized. Data that is correlated and analyzed saves investigations valuable time by showing relevant information. Automatic aggregation and categorization must take place in order to address abnormalities more quickly (Bronevetsky, Laguna, de Supinski, & Bagchi, 2012). 22 System administrators are overloaded with single messages saying there are problems. The shear amount of information coming in is hard for anyone to manage, there has to be tools that do this kind of analysis and categorization automatically. Framework of Log Management In order address the needs of multiple organizations, the National Institute of Standards and Technology (NIST) came up with the Guide to Computer Security Log Management (Kent & Souppaya, 2006). The Guide to Computer Security Log Management from NIST is a guide written and vetted by many organizations and key researchers. This publication and other special publications are meant as the highest goal to implement and maintain for the specific area in which the guide talks about. The guide outlines what and how security logs should be collected and maintained. There are many different log sources that provide security event information: Antimalware Software - can show what malware was detected, disinfection attempts, updates to the software, file quarantines, etc. Intrusion Detection and Intrusion Prevention Systems - can provide logs on suspicious behavior and detected attacks. Remove Access Software (VPN) - can provide logs on who logged in and when or who attempted to log in, also where the user logged in from. Web Proxies - can provide information on web activities. Vulnerability Management Software - can provide logs about what systems are vulnerable to specific exploits and how to fix the 23 vulnerabilities. Authentication Servers - can provide user logs that detail what user was logging into what system at what time. Routers - can provide logs on specific blocked or allowed activity. Firewalls - can provide logs on specific blocked or allowed activity. Network Quarantine Servers - can provide logs as to an attempted system connection to an internal resource. 
This system would provide logs as to the security posture of an authorized system. Operating systems - can provide system event logs and audit records that have detailed information about the system in them. Other information that can be obtained by application logs are client requests and server responses, account information, usage information and significant operational actions. Often organizations are under specific compliance obligations for the use of logs. These include the Federal Information Security Management Act (FISMA) of 2002, Gramm-Leach-Bliley Act (GLBA), the Health Insurance Portability and Accountability Act (HIPAA), Sarbanes-Oxley (SOX), and Payment Card Industry Data Security Standard (PCI-DSS). The amount of systems in an organization complicates log management in a few different ways. The amount of logs on many different sources can be a lot of information. A single event as simple as logging in can cause a massive amount of data on many different systems. Inconsistent log content is another complication. Some logs contain information only pertaining to that specific 24 resource, such as time, resource, destination, source, MAC address, etc. Logs from various sources are not normalized as well. Each system may return a timestamp in a different format as well. There are flat files versus syslog data versus xml formats. All must be organized and normalized. Log data provides a massive amount valuable information and can sometimes contain restricted information. Log data must be protected because it is confidential information and may end up allowing an attacker to gain access to systems. Log collection is not the only thing that an organization should be doing with their data. An organization should also analyze the data that is collected from the logs. Often times system administrators are responsible for looking at the raw log data, however they do not have the tools necessary to look through the data with ease. Analysis is also treated as reactive. NIST recommends creating and maintaining a secure log management infrastructure, all information is available to analyze data from (Kent & Souppaya, 2006). Log analysis must be performed on the data that is centrally collected. Event correlation is key when analyzing logs. If a user is seen logging in one place and it is logged in an authentication system, such as Active Directory, the user may also be seen logging into a remote network, such as VPN. SIEM Security Information and Event Management (SIEM) provides a log management platform for organizations to use. A SIEM supports log collection by gathering information from log sources via an agent placed on the system or 25 through a log generating host that forwards logs onto the SIEM. SIEMs can also provide other features such as analysis and correlation of logs. However a human still may need to interpret the logs due to the variety of interpretation of the log. There is a need for context. The meaning of a log often depends on what other logs are surrounding it and the correlation of other events. Typically a system administrator can define how a log is placed in that context. SIEMs also cannot analyze every log and make a determination on what to do. Prioritizing logs is also key. Different systems may be more important than others. The combination of several factors and correlation of events might indicate that something else is going on than what the SIEM says is going on. 
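As an illustration of the normalization and correlation steps described above, the short sketch below takes login records from two hypothetical sources with different timestamp formats and field names, normalizes them into a common schema, and reports users who appear in both the Active Directory log and the VPN log within a short window. The field names, formats, and the 15-minute window are assumptions made for the example, not the layout of any particular product.

from datetime import datetime, timedelta

# Two sources, two layouts: an AD-style record and a syslog-style VPN record (hypothetical formats).
ad_events  = [{"when": "02/26/2014 08:45:24 PM", "Account Name": "test", "EventCode": "4624"}]
vpn_events = [{"ts": "2014-02-26T20:47:02", "user": "test", "action": "vpn-login"}]

def normalize(record, time_field, time_format, user_field, source):
    # Convert each record to a common schema: parsed datetime, lower-cased username, source tag.
    return {
        "time": datetime.strptime(record[time_field], time_format),
        "user": record[user_field].lower(),
        "source": source,
    }

events  = [normalize(e, "when", "%m/%d/%Y %I:%M:%S %p", "Account Name", "ad")  for e in ad_events]
events += [normalize(e, "ts",   "%Y-%m-%dT%H:%M:%S",    "user",         "vpn") for e in vpn_events]

# Correlate: the same user seen in both sources within 15 minutes is worth a closer look.
window = timedelta(minutes=15)
for ad in (e for e in events if e["source"] == "ad"):
    for vpn in (e for e in events if e["source"] == "vpn"):
        if ad["user"] == vpn["user"] and abs(ad["time"] - vpn["time"]) <= window:
            print("correlated logins for", ad["user"], ad["time"], "->", vpn["time"])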
Entry type, newness of the log, log source, source or destination of the log or IP address of the log, time or day of the week and frequency of the entry are all analysis points that must be taken into consideration for a security event (Kent & Souppaya, 2006). SIEM technology supports three critical areas: log management, threat management and compliance reporting (Hunt & Slay, 2010). SIEM is an ideal tool to collect log data, and incident responses originating from security devices at the point at which forensic logging should occur. Information from systems is sent to the SIEM to aggregate, correlate, and normalize the data. Data from these systems can be analyzed in conjunction with network behavioral data to provide a more accurate real-time picture of threats and attacks. There needs to be consistency in log data. Integrated monitoring, logging 26 and alerting is meant to accomplish the following: monitoring network status, generation of alerts and feedback to the user, reporting the system administrators, forensically sound safe keeping of traffic and log data over a period of time and a comprehensive tool set for both real-time and after-the-event analysis. Information that is fed into the SIEM is used real-time to generate reports and provide feedback to stop or prevent and attack. A SIEM must also provide a way to transmit the log data from the system securely, provide a chain of custody for potential evidence, provide a traceback system to assist in determining the source of an attack, provide reports and automatic alerts in realtime as well as maintain a forensically sound traffic and log records, and provide fine-grained information of points of interest (Hunt & Slay, 2010). Some SIEM engines support the storage of data for forensic analysis, which is addresses the cross over between security and forensics, by few systems take real-time information and adapt it to security. There are limitations in real-time adaptive feedback, however, which result from network forensic analysis. Surveillance and vulnerability scanning often end up being just another log. 27 CHAPTER 2 THE PROBLEM OF DATA BREACHES, EVENT DETECTION, AND RESPONSE The ability to locate information quickly is paramount in information security. The amount of information collected needs to be filtered correctly so it is concise and accurate for processing. Security information and event managers (SIEMs) provide analysis of security events in real-time collected from various internal systems. Typically a SIEM aggregates data from many sources, monitors incidents, correlates the information and sends alerts based on that information. A human is still responsible for looking at the information and decides whether or not they should act on it (Aguirre & Alonso, 2012). Intrusion prevention systems and intrusion detection systems are key components in detecting information from systems located within specific networks. However they can have false-positives adding more investigation work for administrators and security staff. Collaboration between SIEMs would allow for information to be correlated between many systems. Research by Aguirre and Alonso (2012) used several SIEMs based off AlienVault OSSIM to collaborate with each other. Snort was used as the IDS. Each SIEM would feed the other SIEMs information and be able to correlate information based off the each sensors directives. This allows for all systems to see all alerts. 
There are standalone systems that perform a variety of analysis 28 features, but not all the features in one place with feedback and automated reporting. This system was an excellent start for mass correlation of events and overall detection. However, it still does not address external events. This system only detected internal events. What if we thought about this research a different way? Utilizing the internet and external organizations as SIEMs themselves reporting back to a centralized internal SIEM. Internal data would not be pushed out to the external organizations, but collected from them and reported on. External events about data that is important to the internal organization would be alerted on proactively. A proactive approach to threat management versus reactionary approach to threat management. The research also provides support for the use of Splunk as a data analysis tool where information can be fed in quickly, correlated, analyzed automatically, and provide real-time feedback to a user. 29 CHAPTER 3 SPLUNK In order to collect, aggregate, normalize, analyze, correlate, and alert upon large data sets from either an internal or external organization, a platform is needed to build the infrastructure upon. Research, as seen previously, has shown Splunk to be able to meet the requirements of today’s vast and ever expanding data and machine knowledge. Researchers at Sandia National Laboratories used Splunk to join information for supercomputers, security and desktop systems (Stearley, Corwell, & Lord, 2010). When joining information for security and desktop systems, their analysis was able to span their two different data centers. Information from those systems included run-time data and local conditions of the computing clusters. Decoding messages can be extremely time consuming, but they are essential in diagnosing the overall health of the system. Overall system health can be impacted by software upgrades, configuration changes, and running applications. Using Splunk to analyze the data, administrators were able to isolate and resolve problems more efficiently. According to the research by Stearley et al. (2010), the typical log file contains over 11 thousand lines every 5 minutes. Administrators could go through the data searching for information from grep, however this would be very time consuming. Prior to deploying Splunk, they had coded a 654-line Perl script to parse through the data to find specific information. Once Splunk was in place, 30 administrators were just able to add the log location and come up with very simple informational "lookups" which allowed an administrator to see the data quickly. When using Splunk, administrators were also able to script alerts that would allow them to be alerted to fault conditions in the system. Another benefit to using Splunk, allowed the administrators the ability to perform deeper analysis on their computing systems easier, such as "what is the distribution of alerts versus time, job, user and type?" Splunk allows the ease searching through information and the ability to come up with a quick problem resolution based on the data. While machine learning offers promise of more automated solutions to find and correct faults, administrators are still left to analyze logs on what actually took place for the initial fault condition. They may not have the entire picture of the system. Splunk allows various data to be parsed and makes sense of the data. 
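As a small illustration of the kind of question quoted above ("what is the distribution of alerts versus time, job, user and type?"), the sketch below tallies already-parsed alert records by user, type, and hour with nothing more than a counter. The record layout is assumed for the example; in a log management tool the equivalent grouping is a single search rather than custom code.

from collections import Counter

# Already-parsed alert records (layout assumed for the example).
alerts = [
    {"hour": 9,  "user": "alice", "type": "login_failure"},
    {"hour": 9,  "user": "bob",   "type": "login_failure"},
    {"hour": 23, "user": "alice", "type": "malware_detected"},
    {"hour": 23, "user": "alice", "type": "login_failure"},
]

by_user_and_type = Counter((a["user"], a["type"]) for a in alerts)
by_hour = Counter(a["hour"] for a in alerts)

for (user, alert_type), count in by_user_and_type.most_common():
    print(user, alert_type, count)
for hour, count in sorted(by_hour.items()):
    print("hour", hour, count)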
In addition to logging the system data, Splunk also logs all searches, which enables administrators to improve searches over time. While it still takes some analysis, Splunk solves a majority of the needs of an administrator. The SAFES app, which has been developed as part of this external monitoring research is built upon Splunk. There are three reasons why Splunk was chosen as the platform to build and expand the SAFES app on. Splunk is an excellent log management tool. Splunk provides a robust application building platform to customize and analyze the data that is being collected. Splunk also meets the requirements of much of the research already performed. 31 Splunk as a Log Management Tool Splunk was first released in 2005 (Robb, 2006). The initial intent on developing Splunk was to have a platform where data from machines could be searched through. Splunk allows organizations to aggregate structured and unstructured data and search upon the data. Splunk is a great log management tool. With Splunk, an organization can almost throw any data into the program and get useful, meaningful information back out of it, depending on how you search through it. The biggest ways that Splunk helps make sense of data for the purposes of SAFES is aggregation of data. Currently the University of Colorado Colorado Springs’ implementation of Splunk contains data on over 10,000 different log sources. If an organization were looking through over 10,000 different log sources independently for very specific information for correlation of security events, it would take an indefinite amount of time to search through it and cull the data down to create meaningful results. Splunk takes the data in, either structured or unstructured and indexes it. Types of data that are logged: System event logs Security event logs Firewalls Network routers and switches Encryption software Antivirus software Intrusion detection systems Servers 32 Email logs Web application logs Authentication servers Aggregating the data into one central place gives administrators and security personnel the ability to search through all logs and data and make sense of it. The ability to see if a user logged in from China from a VPN connection and logged into a server, right before brute force attempts stopped on a specific server from the same IP address is invaluable. The data that is put into Splunk comes from a variety of different formats. Log files Email RSS feeds SNMP traps Syslog messages This data is both structured and unstructured. Structured data is typically normalized. Normalized data usually contains data in fields that are predictable. Timestamps are understood, IP addresses, MAC addresses are also understood. Most of the data collected for the purposes of aggregation, correlation and analysis is structured data. However, there is also unstructured data that must be indexed. This data could contain information from firewalls or intrusion detection systems, where log files may not make sense and data is not predictable. Logs or data from third party external organizations could contain such unstructured data, however the data needs to be normalized, and indexed 33 for searching, correlation and analysis. The capability of Splunk to contain this data is impressive. Equally as impressive is Splunk’s capability to have multiple applications built on top of it to manage the data. 
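Because RSS feeds are one of the input formats listed above, the sketch below shows how an external web feed could be polled by a small script whose output is then indexed by a log management tool such as Splunk. The feed URL is a placeholder, and the key=value output is simply one convenient, easily parsed layout, not a required format; an Atom feed (such as a Google Alerts feed) would use slightly different element names than the RSS 2.0 layout assumed here.

import urllib.request
import xml.etree.ElementTree as ET

FEED_URL = "https://example.org/alerts/feed.rss"   # placeholder feed URL

def fetch_items(url):
    """Download a feed and yield one (published, title, link) tuple per item."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        tree = ET.parse(resp)
    for item in tree.iter("item"):                  # RSS 2.0 layout: channel/item
        yield (
            item.findtext("pubDate", default=""),
            item.findtext("title", default=""),
            item.findtext("link", default=""),
        )

if __name__ == "__main__":
    # Print each entry as a single key=value event; a forwarder or scripted input
    # can pick these lines up from stdout or a file and index them.
    for published, title, link in fetch_items(FEED_URL):
        print('published="%s" title="%s" link="%s"' % (published, title, link))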
Splunk as an a Application Development Platform Just searching through log files in a simple to use search interface is not enough when you need to develop specific ways to look through the data and display the results easily. Since Splunk’s primary function in the SAFES app is to store data, administrators and security personnel need to talk the same language as the data and display relevant data easily. Splunk gives us the ability to use the base platform as a building block for custom applications using Splunk’s data. There are several components that are at the core of the Splunk platform. Search managers, views, and dashboards are included in the Splunk Web Framework, which is the building tool for Splunk applications. The Search manager allows Splunk’s core search functionality to be built into a custom application. It allows an operator to search through, start, stop, pause, and save searches. Views allows a developer to customize visualizations within the application so that data can be visually interpreted quickly. Visualizations include charts, graphs, tables, events viewer, map viewer, D3 charts, etc. Views also include custom search controls such as search bars, timeline views and time range views. Form inputs are also included in the views component. Dashboards allow visualizations that are common to look at simpler to find. Dashboards can be updated in real-time or based on timeline criteria. Building on the core of Splunk to show us relevant data is useful and powerful, 34 however there are times where Splunk needs to be extended to use scripting languages to build in interactive applications. Django Bindings and SplunkJS Stack components allow developers to build dynamic applications on top of Splunk. If SAFES functionality needs to be extended to import and correlate data in a different way, a user of SAFES can modify the code that is easily managed by Splunk. These components allow a developer to use the base functionality of Splunk as the core of the application and build a robust interface to narrow down the data that can often times be overwhelming to an operator or security personnel looking through the data. Data usage and relevancy changes over time, so the SAFES app in Splunk can be changed when needed without purchasing other software. How Requirements are Met As previously identified in there are several important requirements that an application would need to have in order to be effective in aggregating, normalizing, correlating, normalizing, and alerting on both internal and external data. The requirements and how the requirements are satisfied for SAFES application in order to monitor external sources and alert on them are as follows: Accept both structured and unstructured data from multiple sources Splunk, has the ability to accept any type of data source. As previously identified, Splunk can accept many different data sources and aggregate them into one location. 35 Parse logs – Through the Splunk search functionality, logs and event data can be parsed many different ways to gather information about the data. Filter messages based on terms – Search terms can be included, excluded, or joined to help cull down the amount of irrelevant data. Data normalization – Data that comes is aggregated into Splunk can be normalized through the use of specific “lookups”, “fields”, “tags”. This allows any data from any source to match fields on other data sources even though data fields do not match up exactly. 
This functionality allows administrators and security personnel to speak one language across all logs.
Event correlation – If data is seen in one place, it may have been seen in another. The ability to correlate information across different log sources is also functionality built into the core of Splunk.
Log viewing – Logs can be viewed any number of ways, depending on how the user wants the data to be presented.
Alerting – A powerful alerting engine is built into Splunk. It allows a user to search for specific information, save the search as an alert, and have Splunk send an email, execute a script, or kick off another search if the pattern is seen again. The use of alerting allows administrators to receive timely, proactive notifications about security events from both internal and external organizations.

The requirements of a system that provides timely, proactive notifications of internal and external event data fit what Splunk was designed to do. Using Splunk as the core of the SAFES application allows an organization to quickly and simply deploy an external monitoring solution with minimal programming.

CHAPTER 4
DESIGN OF SAFES

Based on the previously discussed work, systems that can automatically pull in data, aggregate the input, normalize the data, correlate information obtained from the input sources, analyze the results of that correlation, and alert system administrators or security professionals to a possible data breach have been designed before. However, those systems only include internal data; they do not include multiple sources and formats of external data. Data obtained from external sources and correlated with internal organizational data can give a more realistic picture of overall internal and external data security. The Security Advisories From Events System, or SAFES, application is designed to take external information from third party websites, data feeds, and email; aggregate, normalize, and correlate that information with internal security events; and produce alerts to internal users about internal and external security events. SAFES helps internal organizations in the following ways, supported by research:

Aggregates logs and events. Aggregation aids in quickly searching multiple sources of events in one place.
Normalizes data through aggregation. Normalization aids in speaking the same language, and using the same time reference, across different sources and timestamps.
Analyzes events to cut down on unnecessary or irrelevant events. Analysis aids in removing irrelevant data that could interfere with specific events.
Provides the ability to collect data from multiple external sources. External data collection provides a more complete picture of what internal organizational data might have been leaked by a data breach.
Alerts on events. Alerting improves response time for specific security events.

Requirements

The SAFES application requires the following:

The ability to store events or logs from multiple locations
Process different types of events
Process different input sources
Normalize timestamps and key fields
Alert on specific data (a minimal sketch of wiring up such an alert follows this list)
Analyze data while combining internal and external data inputs
Provide historical SAFES alert information

The SAFES app resides within Splunk, which provides a robust platform as discussed in chapter 3.
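As one illustration of how the alerting requirement could be wired up programmatically, the following sketch uses the Splunk SDK for Python to register a scheduled saved search that sends email when it matches events. The search string, sourcetype, schedule, thresholds, and email address are assumptions chosen for illustration; the SAFES deployment itself configures its alerts through Splunk's alerting interface as described in chapter 5.

# Hedged sketch: registering a SAFES-style alert as a Splunk saved search.
# All values below (search text, sourcetype, schedule, recipient) are
# placeholders, not the actual SAFES configuration.
import splunklib.client as client

service = client.connect(host="127.0.0.1", port=8089,
                         username="admin", password="changeme")

# Fire when an indexed Pastebin alert email mentions the domain schema.
service.saved_searches.create(
    "safes_pastebin_domain_match",
    'search index=main sourcetype=imap subject="Pastebin.com Alerts*" uccs.edu',
    **{
        "is_scheduled": "1",
        "cron_schedule": "*/15 * * * *",          # check every 15 minutes
        "dispatch.earliest_time": "-15m",
        "dispatch.latest_time": "now",
        "alert_type": "number of events",
        "alert_comparator": "greater than",
        "alert_threshold": "0",
        "actions": "email",
        "action.email.to": "security@example.edu",
    })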
First, Splunk supports one of the most important requirements of the SAFES app in that it allows organizations to store large amounts of events and logs from many different locations very easily. Second, SAFES will process not only external events but also internal events; this allows an organization to have one interface, one tool, for processing event data. Third, the Splunk platform allows an organization to input numerous types of data sources, including internal system events from multiple sources and external third party data sources such as websites, RSS feeds, and email. Fourth, by utilizing the Splunk platform, logs can be normalized no matter what timestamp format the event data is in; if an event or log has a timestamp that Splunk does not recognize, Splunk can easily be configured to recognize the data source and its timestamp. Fifth, based upon specific criteria known to the organization, the alerting functionality within Splunk can be used to provide information on external events to security personnel, correlating internal data with external data. Finally, data that has been alerted on before is stored within Splunk; if the same data has already been seen in other external sources, events may not need to be alerted on again.

Logging Sources

There are several types of events and logs that need to be included in the SAFES application so that internal events are visible. Other external events and logs can be added according to the organization's needs. For the purposes of our deployment of SAFES, the following log sources have been included.

Windows 2012 domain controller security logs. These logs contain authentication information for resources utilizing a Windows Active Directory infrastructure. Resources inside a domain that rely on Windows Active Directory could include email, 802.1x authentication for wireless or wired networks, desktop systems, servers, LDAP connections, VPN connections, etc. Resources could also include Kerberos based systems relying on Windows Active Directory. An example of a Windows Active Directory log is as follows:

02/26/2014 08:45:24 PM
LogName=Security
SourceName=Microsoft Windows security auditing.
EventCode=4624
EventType=0
Type=Information
ComputerName=dc1.test.local
TaskCategory=Logon
OpCode=Info
RecordNumber=144211636
Keywords=Audit Success
Message=An account was successfully logged on.
Subject:
  Security ID: NT AUTHORITY\SYSTEM
  Account Name: dc1.test.local$
  Account Domain: testdomain
  Logon ID: 0x3E7
Logon Type: 3
Impersonation Level: Impersonation
New Logon:
  Security ID: testdomain\test
  Account Name: test
  Account Domain: testdomain
  Logon ID: 0x4C34215
  Logon GUID: {00000000-0000-0000-0000-000000000000}
Process Information:
  Process ID: 0x228
  Process Name: C:\Windows\System32\lsass.exe
Network Information:
  Workstation Name: DC1
  Source Network Address: 192.168.0.1
  Source Port: 18920
Detailed Authentication Information:
  Logon Process: Advapi
  Authentication Package: MICROSOFT_AUTHENTICATION_PACKAGE_V1_0
  Transited Services:
  Package Name (NTLM only):
  Key Length: 0

There are several important fields within this event log. The timestamp, "Account Name", "Workstation Name", "Source Network Address", and "EventCode" are all necessary pieces of information that SAFES can utilize. They can help in correlating whether credentials exposed in a third party data breach were used to log into an internal system.
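As an illustration of how these fields can be pulled out of an event shaped like the sample above, the short Python sketch below extracts the timestamp, account name, workstation name, source address, and event code with regular expressions. It is a standalone parsing example that assumes the raw event text is available as a string; it is not the field extraction mechanism Splunk itself uses.

# Illustrative only: extract key fields from a Windows security event
# formatted like the sample above. Splunk performs its own field
# extraction; this sketch just shows which pieces SAFES cares about.
import re

def parse_windows_logon(event_text):
    fields = {}
    patterns = {
        "timestamp": r"^(\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2} [AP]M)",
        "event_code": r"EventCode=(\d+)",
        "account_name": r"New Logon:.*?Account Name:\s*(\S+)",
        "workstation": r"Workstation Name:\s*(\S+)",
        "source_address": r"Source Network Address:\s*(\S+)",
    }
    for name, pattern in patterns.items():
        match = re.search(pattern, event_text, re.DOTALL | re.MULTILINE)
        if match:
            fields[name] = match.group(1)
    return fields

# Example: parse_windows_logon(raw_event)["source_address"] -> "192.168.0.1"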
When combined with intrusion detection logs, these fields also show whether a brute force attack against user credentials was successful.

Intrusion detection system logs. Intrusion detection systems provide valuable pieces of information for a variety of different security events within a network. Information such as malware activity, brute force logon attempts, privilege escalation attempts, vulnerability exploitation, and more is typically caught by an internal intrusion detection system. By combining intrusion detection logs with other events, SAFES can help determine whether a system might have been compromised and whether malware was installed and active on that system. For the purposes of this implementation of SAFES, a sample set of intrusion detection logs could look something like this:

2014-02-27 03:41:12 pid(2230) Alert Received: 0 1 trojan-activity test-ids-eth1-2 {2014-02-27 03:41:11} 3 136507 {ET TROJAN Suspicious User-Agent (MSIE)} 192.168.1.4 131.124.0.67 6 52250 80 1 2003657 12 2280 2280

2014-02-27 03:17:11 pid(2230) Alert Received: 0 1 trojan-activity test-ids-eth1-6 {2014-02-27 03:17:10} 7 118282 {MALWARE-CNC Win.Trojan.Kazy variant outbound connection} 192.168.1.4 94.242.233.162 6 56605 80 1 26777 3 2370 2370

2014-02-27 03:07:54 pid(2230) Alert Received: 0 1 trojan-activity test-ids-eth1-4 {2014-02-27 03:07:53} 5 328412 {ET TROJAN Suspicious User-Agent (Installer)} 192.168.1.4 108.161.189.163 6 58194 80 1 2008184 8 2344 2344

2014-02-27 03:07:04 pid(2230) Alert Received: 0 1 trojan-activity test-ids-eth1-1 {2014-02-27 03:07:04} 2 831844 {ET CURRENT_EVENTS DRIVEBY Redirection - Forum Injection} 192.168.1.4 190.123.47.198 6 54015 80 1 2017453 3 2298 2298

The timestamp, the signature of the IDS rule that fired, the source address, and the destination address are all important fields for this type of log.

Microsoft Smart Network Data Services. One of the external logging sources can be information obtained from Microsoft Smart Network Data Services (SNDS). Microsoft allows organizations that own a set of IP addresses to obtain specific traffic information seen on the Windows Live Hotmail system. This allows organizations to see what email is coming from which servers inside their organization to the Windows Live Hotmail infrastructure. The daily log contains the sending IP address, how many RCPT or DATA commands were received, the complaint rate, and the number of email trap hits. Sample HELO and MAIL commands are also shown. This information can directly identify a host that is sending spam out of an organization; if a system is infected with malware that sends spam and it is not detected by internal methods, the external resource will catch it. The following is a sample log obtained from Microsoft SNDS:

192.168.1.6 2/25/2014 3:00 < 0.1% test@test.local 2/26/2014 1:00 646 586 0 exchange.test.local 638 GREEN

The keyword "GREEN" is significant because it identifies whether the source IP address is sending a certain level of spam. If a system is infected with spam-sending malware or has been compromised to send spam, "GREEN" will change to "RED", which indicates that over 90% of the email seen from the source IP has been identified as spam.

Pastebin.com alerts. As previous research has shown, Pastebin.com has been used by attackers to leak user credentials from organizations. Pastebin.com allows external users to sign up for its alerts system.
A user can register an email address to receive alerts on three separate keyword searches. Once a paste containing one of the keywords is posted to Pastebin.com, an email alert is sent containing the URL of the paste. A sample subject line and body of the alert email is as follows:

Subject: Pastebin.com Alerts Notification
Body:
Hi testaccount

You are currently subscribed to the Pastebin Alerts service.
We found pastes online that matched your alerts keyword: '192.168.'.

http://pastebin.com/acbxyz

If you want to cancel this alerts service, please login to your Pastebin account, and remove this keyword from your Alerts page.

Kind regards,
The Pastebin Team

Depending on the keyword alerts configured, leaked user credentials or information about the organization's network may be contained in the paste.

Shadow Server Foundation events. The Shadow Server Foundation was started in 2004. Its mission is to gather intelligence on malicious activity and vulnerabilities from across the internet, and its goal is to provide information to the security community in order to protect systems and assist in research. The Shadow Server Foundation provides a large amount of data, in weekly or daily emails depending on what kind of information is requested. The information most helpful to an organization relates directly to that organization. The Shadow Server Foundation allows organizations that own IP ranges or ASNs to be alerted any time one of many different kinds of events is seen inside their networks. The information is collected from many different networks around the world. In order to obtain data on an organization, the organization needs to sign up. Once properly authenticated as the owner of the ASN or IP space, an organizational member can receive the following information about their networks:

Detected botnet command and control servers
Infected systems (drones)
DDoS attacks (source and victim)
Scans
Clickfraud
Compromised hosts
Proxies
Spam relays
Open DNS resolvers
Malicious software droppers and other related information

The information obtained about an organization's network is aggregated and sent to the organization every 24 hours if an alert occurs. This information is meant to assist organizations in their detection and mitigation methods, and it is extremely helpful when current detection methods in the organization cannot pick up the malicious traffic.
A sample of some of the logs that are sent to an organization:

Botnet DDoS
"Date","Time","C&C","C&C Port","C&C ASN","C&C Geo","C&C DNS","Channel","Command","TGT","TGT ASN","TGT Geo","TGT DNS"
"2008-11-03","00:00:12","76.76.19.73",1863,13618,"US","unknown.carohosting.net","#ha","!alls","98.124.192.1",21740,"US",""

Botnet Drone
"timestamp","ip","port","asn","geo","region","city","hostname","type","infection","url","agent","cc","cc_port","cc_asn","cc_geo","cc_dns","count","proxy","application","p0f_genre","p0f_detail"
"2011-04-23 00:00:05","210.23.139.130",3218,7543,"AU","VICTORIA","MELBOURNE",,"tcp","sinkhole",,,"74.208.164.166",80,8560,"US",,1,,,"Windows","2000 SP4, XP SP1+"

Sinkhole HTTP-Drone
"timestamp","ip","asn","geo","url","type","http_agent","tor","src_port","p0f_genre","p0f_detail","hostname","dst_port","http_host","http_referer","http_referer_asn","http_referer_geo","dst_ip","dst_asn","dst_geo"
"2010-08-31 00:09:04","202.86.21.11",23456,"AF","GET /search?q=0 HTTP/1.0","downadup","Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)",,8726,,,,80,"149.20.56.32",,,,,,

Google Alerts. Google Alerts allows individuals to receive email or RSS alerts when information is indexed on Google.com. Search criteria covering news, discussion boards, blogs, video, books, or anything else can be alerted on. The RSS feed or email that is received contains the title of the information seen, a brief snippet of text surrounding the keyword query that was alerted on, and a link to the article. The following is an example RSS feed:

Figure 1. Sample Google Alert.

While some of this information may not be useful, it provides insight into other websites that may hold information pertaining to your organization that may have been leaked. Many other data sources, both internal and external, can be added and fed into the SAFES application; however, the above sources provide good information for automated alerting and analysis.

Data Schemas

Search queries need to be determined for both the Pastebin.com and Google Alerts cases. The queries, when entered into either the Pastebin.com or Google Alerts sites, need to be formatted in such a way that they not only return relevant information but can also be used by the SAFES application to determine the confidence or severity of a data breach. Data schemas must be defined to provide these two types of data, both for the user and for SAFES.

Domain schema. When querying both Pastebin.com and Google Alerts, it is best to use the organization's domain as a query term. This validates that the Pastebin.com paste or Google alert actually concerns the organization. For example, the domain "uccs.edu" will alert on a news article, or it will alert on a username seen in a paste, such as john.smith@uccs.edu. The username may be part of leaked credentials.

IP address schema. IP addresses range from 0.0.0.0 to 255.255.255.255. If the organization's IP address range is 128.198.0.0 - 128.198.255.255, an individual would have to create over 65,000 alerts to cover the address space. However, if the query is narrowed to "128.198.", all 65,000+ addresses that may end up in an alert are covered by a single keyword. An alert for the IP address space might be triggered by a compromised host listed on Pastebin.com or by Google Alerts.

Username schema. While not directly able to be alerted on, usernames typically follow a specific pattern. When used with the other schemas, they provide a good indication of false positives.
Typically a username is listed as an email address on sites such as Pastebin.com or Google Alerts. Usernames in an organization may, for example, contain at most 8 characters. In that case, a username seen as "jsmith12@uccs.edu" would be valid, but "jsmith123345@uccs.edu" would not be. The username schema is used in conjunction with confidence levels.

Password schema. Again, while not directly able to be alerted on, passwords in an organization usually have a minimum set of requirements. If the passwords breached on a third party site fall below an organization's minimum standards, the organization can be fairly confident that the user did not reuse their organizational credentials on the breached third party site. Password policies in an organization may include requirements for length, special character use, and digit usage. For example, if an organization's password policy requires a minimum of 8 characters, a digit, and a special character, the password "Test4me" would not be valid on the organization's authentication servers, but "Test4me!" would be. The use of schemas, both for external alerting and for internal data searching, helps narrow down a possible data breach on a third party site.

Confidence Levels

While alerts from external organizations provide information to an internal organization, the data obtained may not be relevant enough to alert on a potential data breach. Confidence levels must be decided upon by the organization and modified if the data being alerted on turns out to be false positive or not relevant. False positive information could come from Google Alerts in the form of a news article posted about the organization, or a Pastebin.com alert could be a false positive if the IP schema "128.198." matches "61.128.198.7". Non-relevant data could lower the confidence level of the information when the domain schema matches but the username schema does not. For example, an alert generated by SAFES for a Pastebin.com post containing "jsmith12334@uccs.edu" could be considered non-relevant because the username schema for uccs.edu only allows 8-character usernames. The username "jsmith12334@uccs.edu" would therefore have a lower confidence level than "jsmith7@uccs.edu", which matches the organization's username schema.

Different confidence levels should also be placed on trusted and untrusted data sources. Untrusted data sources could be search data from third party external organizations that the organization is merely searching against; this includes Pastebin.com or Google Alerts data, where false positive or non-relevant information is possible. Trusted data sources have a direct relationship to the data that the external organization is seeing; this includes Microsoft SNDS and Shadow Server Foundation reports, as well as internal logs. Correlation between different data sources will also boost confidence levels. For example, if a Pastebin.com alert comes in that matches the username or IP address schema, and the username is shown to have been used recently in internal logs, the confidence level is boosted to reflect the potential correlation between security events. Again, confidence levels must be decided upon by the internal organization for the data it is searching for. Broad schemas that potentially receive many matches may be given lower confidence levels than narrower schemas.
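The following Python sketch illustrates one possible way to encode the schema checks and source weighting just described. The regular expressions, source weights, thresholds, and labels are assumptions chosen for illustration; they are not the scoring rules used by SAFES itself.

# Illustrative confidence scoring for an external alert, combining the
# domain, username, IP, and password schemas with a per-source weight.
# All patterns, weights, and thresholds below are assumptions.
import re

SOURCE_WEIGHT = {          # trusted sources weigh more than untrusted ones
    "snds": 3, "shadowserver": 3, "pastebin": 2, "google_alerts": 1,
}
USERNAME_RE = re.compile(r"^[a-z0-9]{1,8}@uccs\.edu$")     # at most 8 chars before the @
PASSWORD_RE = re.compile(r"^(?=.*\d)(?=.*[^\w\s]).{8,}$")  # 8+ chars, a digit, a special char
IP_PREFIX = "128.198."

def password_could_be_internal(candidate):
    # A leaked password below the internal policy could not have been reused internally.
    return bool(PASSWORD_RE.match(candidate))

def score_alert(source, text, seen_in_internal_logs=False):
    score = SOURCE_WEIGHT.get(source, 1)
    if "uccs.edu" in text:                                   # domain schema match
        score += 1
    if any(USERNAME_RE.match(tok) for tok in text.split()):  # username schema match
        score += 2
    # IP schema match; the leading check avoids hits such as "61.128.198.7"
    if re.search(r"(?:^|[^0-9.])" + re.escape(IP_PREFIX), text):
        score += 1
    if seen_in_internal_logs:                                # correlation with internal data
        score += 3
    if score >= 7:
        return "CRITICAL"
    if score >= 5:
        return "HIGH"
    if score >= 3:
        return "MEDIUM"
    return "LOW"

# Example: score_alert("pastebin", "leaked: jsmith7@uccs.edu / Test4me!")
# returns "HIGH" under these illustrative weights.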
The combination of log sources, data schemas, and confidence levels helps determine the overall threat level of an external event. Administrators or security personnel can then be alerted on the data obtained from external organizations and react accordingly.

CHAPTER 5
IMPLEMENTATION OF SAFES

The SAFES app for Splunk allows organizations to monitor external data sources for threats to the internal network and to proactively protect against and alert on external security events. Since SAFES was designed to be a simple-to-use tool, the majority of the programming has already been done by others. Splunk is the core aggregation, correlation, normalization, and alerting tool for internal events. Additional Splunk apps must be installed to ease the work of a SAFES user.

Splunk Installation

Splunk itself is very simple to install. Splunk comes in two versions. The enterprise version, which is fairly expensive, is aimed at operational intelligence across many resources. The free version contains most of the features of the enterprise version; a few features are stripped out, but these do not matter for the basic implementation of SAFES. Since Splunk is a commercial product, a user must first register on the Splunk website to download it. Once registered, the user can go to http://www.splunk.com/download and download Splunk Enterprise, which includes a 30 day trial of the Enterprise license; the license reverts to the free version if an Enterprise license is not entered. For our Splunk deployment we use CentOS 6.5 x64 as the base operating system with X Windows installed, so the rpm package splunk-6.0.2-196940-linux-2.6-x86_64.rpm is chosen. Once it is downloaded from Splunk.com, the following commands install Splunk Enterprise on the server:

rpm -i /root/Downloads/splunk-6.0.2-196940-linux-2.6-x86_64.rpm
/opt/splunk/bin/splunk start --accept-license

The default password must be changed upon first login. The URL for the Splunk web interface is http://127.0.0.1:8000, which will prompt the user to change the default admin password of "changeme" to something else. Splunk is now fully operational and ready to accept data.

Third Party Apps

As stated previously, Splunk was chosen as the base system for a number of reasons, one of which is its ability to act as a programming platform that can be extended for a variety of applications. Third party apps can be downloaded from splunkbase.com. Several apps needed for the SAFES implementation extend Splunk in the following ways:

Input RSS feeds, which are not a native function of Splunk inputs
Monitor external email accounts
Automate "lookups", "searches", and "fields" that allow data to be normalized across all inputs
Provide rich GUI interfaces for easy access to important security events

Splunk for IMAP. Splunk for IMAP is needed for two external sources in SAFES: Shadow Server Foundation emails and Pastebin.com alert emails. The Splunk for IMAP app polls an IMAP account at regular intervals and indexes any email that is pulled in from that account.
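For illustration only, and not the app's actual implementation, the following minimal Python sketch shows the kind of polling and retrieval the Splunk for IMAP app performs: connect to a mailbox, fetch unread alert messages, and print them so that a log file or indexer could pick them up. The server name, credentials, and folder are placeholder assumptions.

# Minimal IMAP polling sketch, similar in spirit to what the Splunk for
# IMAP app does. Connection details are placeholders.
import imaplib
import email

def fetch_alert_mail(server="mail.example.edu", user="safes@example.edu",
                     password="changeme", folder="INBOX"):
    conn = imaplib.IMAP4(server)           # the test setup described below uses non-SSL port 143
    conn.login(user, password)
    conn.select(folder)
    _, data = conn.search(None, "UNSEEN")  # only messages not yet indexed
    for num in data[0].split():
        _, msg_data = conn.fetch(num, "(RFC822)")
        msg = email.message_from_bytes(msg_data[0][1])
        print(msg["Date"], msg["From"], msg["Subject"])
    conn.logout()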
The Splunk for IMAP app is located at http://apps.splunk.com/app/27/ and, once downloaded to the server, is installed into Splunk with the following command:

/opt/splunk/bin/splunk install app /root/Downloads/splunk-for-imap_120.tgz

Once the app is installed and Splunk restarted, configuration must be completed to connect the IMAP account used for the Shadow Server Foundation and Pastebin.com emails to Splunk. The configuration file is located at:

/opt/splunk/etc/apps/imap/default/imap.conf

The following configuration lines are modified for our test system:

server = mail.alphawebfx.com
user = safes@alphawebfx.com
password = DisqHAWxcwU1
useSSL = false
port = 143

Once the file is saved and Splunk restarted, alerts from both the Shadow Server Foundation and Pastebin.com can be used. Based on the schema of our test environment, we add the keywords "128.198." and "uccs.edu", which will cause Pastebin to send an email to our IMAP account, and ultimately to Splunk, whenever a keyword is matched.

Figure 2. Pastebin.com Alert Setup.

Shadow Server Foundation emails must also be added. To sign up for Shadow Server Foundation alerts, an email containing the full name, organization, network of responsibility, email address for the reports, phone number, and contact information for verification must be sent to request_report@shadowserver.org. Once the request is verified, daily emails are sent to the report email address whenever alerts are generated.

RSS Scripted Input. RSS has become a popular internet format for getting short informational messages out quickly without creating a lot of content on a page. The RSS Scripted Input app indexes the metadata of an RSS feed; it utilizes an open source program called feedparser, from www.feedparser.org, to parse through the RSS metadata. The RSS Scripted Input app is located at http://apps.splunk.com/app/278/ and, once downloaded to the server, is installed into Splunk with the following command:

/opt/splunk/bin/splunk install app /root/Downloads/rss-scripted-input_20.tgz

Once installed, the RSS feeds must be configured. The configuration file is located at /opt/splunk/etc/apps/rss/bin/feeds.txt. Google Alerts will be used for the RSS feeds; these alerts can be set up at http://www.google.com/alerts/manage. As with Pastebin.com, the domain and IP address schemas, "128.198." and "uccs.edu", are used.

Figure 3. Google Alerts Setup.

Other inputs. One more external input is needed based on the SAFES design discussed previously: Microsoft SNDS alerts. These alerts come in the form of CSV files that are published at a URL daily. To sign up for Microsoft SNDS, an IP range, ASN, or CIDR notation address must be entered at https://postmaster.live.com/snds/addnetwork.aspx. Microsoft sends a verification email to the contact for the IP range, ASN, or CIDR address. Once an organization is signed up, automated access settings can be enabled. Automated access gives the organization a link from which the data on its IP range, ASN, or CIDR address can be obtained; the link serves a CSV file daily. In order for SAFES to process the data from the CSV file, a script must be written to parse the data into a log file, which Splunk then reads; a sketch of such a script follows below. The input for SNDS is read inside the SAFES app from logs/snds.log. Other inputs that are important to SAFES can be added as needed by the organization to correlate internal events with external sources.
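The following Python sketch is one way such a script could look: it downloads the daily SNDS CSV from the automated-access URL and appends it to logs/snds.log for Splunk to index. The URL key and file paths are placeholder assumptions; this is not the exact script used in the SAFES deployment.

# Sketch of an SNDS collection script: fetch the daily automated-access CSV
# and append it to the log file that SAFES monitors. The key and paths are
# placeholders; substitute the values issued for your own network.
import urllib.request
from datetime import datetime

SNDS_URL = "https://postmaster.live.com/snds/data.aspx?key=YOUR-AUTOMATED-ACCESS-KEY"
LOG_PATH = "/opt/splunk/etc/apps/SAFES/logs/snds.log"

def collect_snds():
    with urllib.request.urlopen(SNDS_URL) as response:
        csv_data = response.read().decode("utf-8", errors="replace")
    with open(LOG_PATH, "a") as log:
        for line in csv_data.splitlines():
            if line.strip():
                # Prefix each record with the collection time so Splunk has
                # a timestamp even if the CSV fields vary.
                log.write("%s %s\n" % (datetime.utcnow().isoformat(), line))

if __name__ == "__main__":
    collect_snds()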
Installation of the SAFES App

All Splunk apps have nearly the same file structure. The SAFES app is uploaded to the /opt/splunk/etc/apps/ directory on our test system. Scripts are located in the bin/ directory, and configuration files are located inside the local/ directory. The configuration files contain Splunk-specific configuration directives that allow Splunk to process the different characteristics of the app.

GUI

While the main goal of SAFES is to provide alerting on external events based on confidence levels, a dashboard of external events presents data from the four external data sources programmed into SAFES. Even though some of the data may not correlate exactly with other events, being able to quickly search these events without digging through all security events may be critical in an incident response situation. For this reason, SAFES has a single dashboard.

Figure 4. SAFES Overview Dashboard.

Confidence Engine and Alerting

The main goal of SAFES is to provide alerting on external events to organizations. Confidence levels are chosen by the organization based upon the value of the external data that is being monitored. For example, at UCCS, high value is placed on Microsoft SNDS and Shadow Server Foundation alerts, medium value on Pastebin.com alerts, and low value on Google Alerts. Confidence levels may change as a source proves to provide more direct, valuable information, and false positive information coming from an external source can lower confidence in that source. It is up to the organization to determine what value it places on external data sources. Internal data, when correlated with external data, may provide critical confidence for a security event; however, not all events can be correlated. Constant analysis must be performed on external sources and internal events to ensure alerting accuracy and correctness. Alerts are handled by email through the Splunk alerting system. These alerts must be tailored to the specific needs of the organization, and multiple email addresses can be used to alert key administrators or security personnel.

CHAPTER 6
EXPERIMENTS

Even though the design of SAFES has been implemented and the dashboard can easily show what data is coming in from external sources, the SAFES application should be exercised with three test scenarios: botnet activity on a system, an external third party data breach that may affect internal users, and a system being used to send spam outside the organization.

Simulated botnet activity. Botnet activity may not always be picked up by an internal IDS. When an external organization picks up botnet activity coming from a system within an organization, this either allows information security personnel to confirm an infection or identifies an area in which the internal IDS is not picking up the activity. The experiment that was carried out was based on an actual security event that took place inside the UCCS network in 2013; to recreate the activity, dates and system names have been changed. Botnet activity for the host 128.198.222.7 started on March 11 at 14:16 MDT, based on alerts from the internal IDS. Since the Shadow Server Foundation only sends email to an organization once a day, the alert that there had been botnet activity on a host was sent at 5:49 the next day. Included in the attachment with the Shadow Server Foundation alert are the key data points: the IP address 128.198.222.7, identified as infected with ZeroAccess malware at 16:29 UTC, which was 9:29 MDT.
This indicates that the Shadow Server Foundation was able to identify botnet activity roughly 5 hours before our internal IDS showed it starting on the infected host. When replaying this data through SAFES, the "HIGH" confidence level of the Shadow Server Foundation report is identified instantly. Additionally, since more than one source identified botnet activity on the host, SAFES raises the confidence level to "CRITICAL".

Simulated external data breach. Data breaches that happen on third party websites outside the organization are not necessarily a serious threat. However, because research has shown password reuse to be high, an organization should pay close attention to external breaches and user accounts that match its domain schema. A Pastebin.com paste was made on March 9th detailing a database breach of a travel site; usernames and passwords for that external organization were exposed. SAFES issued an alert because the domain schema, uccs.edu, was matched. The SAFES alert came in as "MEDIUM" since only the domain schema was matched, on an email address of the form xxxx@uccs.edu.

Simulated external spam detection. While many organizations monitor their own internal mail servers, it may be difficult to monitor the entire IP address space for outgoing spam. It is trivial to set up a mail server, and malware can also send spam out of unsuspecting compromised systems. Microsoft SNDS easily identifies this type of traffic, since spam email is usually sent to tens of thousands of email addresses, including many addresses that Microsoft maintains. Experimental data was taken from a compromised system at UCCS in March of 2014. A user account was compromised and a script was uploaded to the user's directory, which allowed the attacker to tunnel a PHP mail script through the SSH connection and ultimately send email out of the SSH server. The spam messages sent out of the SSH server totaled in the millions, and because it was a trusted system, port 25 was open for the email to go out. Microsoft picked up roughly 11,000 messages each day until the problem was resolved, and this data was reported by Microsoft SNDS. When this information from Microsoft shows up in the SAFES app, a "HIGH" confidence level alert is issued, because all "RED" alerts that come from Microsoft are considered "HIGH" confidence.

CONCLUSIONS AND FUTURE WORK

Data breaches occur every day. In 2013, over 311,000 compromised accounts were available on Pastebin.com (High-Tech Bridge, 2014). This number is staggering, and these accounts came from new breaches and leaks. What is even more alarming is that over 40% of the leaked accounts were email accounts, which means the credentials could be used to get into other systems. The 311,000 accounts were only a small portion of the accounts actually leaked; the larger data sets were still kept by hackers and hacktivists. Detecting that kind of information, where password or credential reuse could be an issue, is why the SAFES system was designed. SAFES was designed and implemented so that internal organizations could utilize the power of the internet to collect information from external organizations reporting on the data they see coming from the internal organization.

Future Work

The SAFES app for Splunk was designed for one organization, the University of Colorado Colorado Springs.
The four external sources that SAFES collects are pertinent to the University, as this information had been collected for years in separate systems. As more external organizations open up their data collection on the internal organization, other data will be added to the SAFES system. Additionally, since the SAFES system was built statically inside of Splunk, a configuration form will be created so that organizations can set up their own SAFES app without modifying source code. As logs from internal systems are modified and new log sources are added, the normalization of internal logs may need to be modified as well.

References

Aguirre, I., & Alonso, S. (2012). Improving the automation of security information management: A collaborative approach. IEEE Security & Privacy, 10(1), 55-59. Retrieved from http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6060795

Aïmeur, E., & Lafond, M. (2013, September). The scourge of internet personal data collection. In Availability, reliability and security (ARES). Paper presented at the 2013 Eighth International Conference (pp. 821-828). IEEE.

Baun, E. (2012). The digital underworld: Cyber crime and cyber warfare. Humanicus, 7, 1-25. Retrieved from http://www.humanicus.org/global/issues/humanicus-72012/humanicus-7-2012-2.pdf

Bronevetsky, G., Laguna, I., de Supinski, B. R., & Bagchi, S. (2012, June). Automatic fault characterization via abnormality-enhanced classification. In Dependable systems and networks (DSN). Paper presented at the 2012 42nd Annual IEEE/IFIP International Conference (pp. 1-12). IEEE.

Casey, E. (2006). Investigating sophisticated security breaches. Communications of the ACM, 49(2), 48-55. doi: 10.1145/1113034.1113068

CERT. (1998). CERT incident note IN-98.03, password cracking activity. Retrieved from the CERT Coordination Center, Carnegie Mellon University: www.cert.org/incident_notes/IN-98.03.html

Curtin, M., & Ayres, L. T. (2008). Using science to combat data loss: Analyzing breaches by type and industry. ISJLP, 4, 569. Retrieved from http://web.interhack.com/publications/interhack-breach-taxonomy.pdf

Finkle, Jim. (2014, February 25). 360 million newly stolen credentials on black market: Cybersecurity firm. Reuters. Retrieved from http://www.reuters.com/article/2014/02/25/us-cybercrime-databreach-idUSBREA1O20S20140225

Fisher, J. A. (2013). Secure my data or pay the price: Consumer remedy for the negligent enablement of data breach. William & Mary Business Law Review, 4(1), 215-238. Retrieved from http://scholarship.law.wm.edu/wmblr/vol4/iss1/7/

Florencio, D., & Herley, C. (2007, May). A large-scale study of web password habits. In Proceedings of the 16th International Conference on World Wide Web (pp. 657-666). ACM. doi: 10.1145/1242572/1242661

Franqueira, V. N., van Cleeff, A., van Eck, P., & Wieringa, R. (2010, February). External insider threat: A real security challenge in enterprise value webs. In Availability, reliability, and security. Paper presented at the ARES'10 International Conference (pp. 446-453). IEEE.

Garrison, C. P., & Ncube, M. (2011). A longitudinal analysis of data breaches. Information Management & Computer Security, 19(4), 216-230. doi: 10.1108/09685221111173049

Hampson, N. C. (2012). Hacktivism: A new breed of protest in a networked world. Boston College International & Comparative Law Review, 35(2), 511-542. Retrieved from http://lawdigitalcommons.bc.edu/cgi/viewcontent.cgi?article=1685&context=iclr&sei-redir=1&referer=http%3A%2F%2Fscholar.
google.com%2Fscholar%3Fhl%3Den%26q%3DHacktivism%253A%2BA%2Bnew%2Bbreed%2Bof%2Bprotest%2Bin%2Ba%2Bnetworked%2Bworld%26btnG%3D%26as_sdt%3D1%252C6%26as_sdtp%3D#search=%22Hacktivism%3A%20new%20breed%20protest%20networked%20world%22

High-Tech Bridge. (2014). 300,000 compromised accounts available on Pastebin: Just the tip of cybercrime iceberg. Retrieved from https://www.htbridge.com/news/300_000_compromised_accounts_available_on_pastebin.html

Hunt, R., & Slay, J. (2010, August). Achieving critical infrastructure protection through the interaction of computer security and network forensics. In Privacy, security, and trust (PST). Paper presented at the Eighth Annual International Conference (pp. 23-30). IEEE.

Ives, B., Walsh, K. R., & Schneider, H. (2004). The domino effect of password reuse. Communications of the ACM, 47(4), 75-78. doi: 10.1145/980000/975820

Jackson, Don. (2008). Untorpig [Online posting]. Retrieved from http://www.secureworks.com/cyber-threat-intelligence/tools/untorpig/

Jenkins, J. L., Grimes, M., Proudfoot, J. G., & Lowry, P. B. (2013). Improving password cybersecurity through inexpensive and minimally invasive means: Detecting and deterring password reuse through keystroke-dynamics monitoring and just-in-time fear appeals. Information Technology for Development. Advance online publication. 1-18. doi: 10.1080/02681102.2013.814040

Kapoor, A., & Nazareth, D. L. (2013). Medical data breaches: What the reported data illustrates, and implications for transitioning to electronic medical records. Journal of Applied Security Research, 8(1), 61-79. doi: 10.1080/19361610.2013.738397

Kent, K., & Souppaya, M. (2006). Guide to computer security log management [Special issue]. NIST Special Publication 800-92.

Krebs, Brian. (2014). Target hackers broke in via HVAC company [Web log post]. Retrieved from https://krebsonsecurity.com/2014/02/target-hackers-broke-in-via-hvac-company/

Mansfield-Devine, S. (2011). Hacktivism: Assessing the damage. Network Security, 2011(8), 5-13. doi: 10.1016/S1353-4858(11)70084-8

Maple, C., & Phillips, A. (2010). UK security breach investigations report: An analysis of data compromise cases. Retrieved from the University of Bedfordshire Repository website: http://uobrep.openrepository.com/uobrep/handle/10547/270605

Matic, S., Fattori, A., Bruschi, D., & Cavallaro, L. (2012). Peering into the muddy waters of Pastebin. ERCIM News: Special Theme Cybercrime and Privacy Issues, 16. Retrieved from http://ercim-newsercim.downloadabusy.com/images/stories/EN90/EN90-web.pdf#page=16

Notoatmodjo, G., & Thomborson, C. (2009, January). Passwords and perceptions. In Proceedings of the Seventh Australasian Conference on Information Security, Vol. 98 (pp. 71-78). Australian Computer Society, Inc.

Poulsen, Kevin. (2011, June). LulzSec releases Arizona police documents. Wired. Retrieved from http://www.wired.com/threatlevel/2011/06/lulzsec-arizona/

Robb, Drew. (2006, August). 2006 Horizon Awards winner: Splunk's Splunk. Computerworld. Retrieved from http://www.computerworld.com/s/article/9002558/Splunk_Inc._s_Splunk_Data_Center_Search_Party

Sherstobitoff, R. (2008). Anatomy of a data breach. Information Security Journal: A Global Perspective, 17(5-6), 247-252. doi: 10.1080/19393550802529734

Splunk. (2014). Search managers. Retrieved from http://dev.splunk.com/view/SP-CAAAEM8

Splunk. (2014). Splunk views. Retrieved from http://dev.splunk.com/view/SP-CAAAEM7

Splunk. (2014). Splunk web framework overview.
Retrieved from http://dev.splunk.com/view/web-framework/SP-CAAAER6

Stearley, J., Corwell, S., & Lord, K. (2010, October). Bridging the gaps: Joining information sources with Splunk. In Proceedings of the 2010 Workshop on Managing Systems via Log Analysis and Machine Learning Techniques (p. 8). USENIX Association.

Stone-Gross, B., Cova, M., Cavallaro, L., Gilbert, B., Szydlowski, M., Kemmerer, R., ... Vigna, G. (2009, November). Your botnet is my botnet: Analysis of a botnet takeover. In Proceedings of the 16th ACM Conference on Computer and Communications Security (pp. 635-647). ACM.

Weir, M., Aggarwal, S., Collins, M., & Stern, H. (2010, October). Testing metrics for password creation policies by attacking large sets of revealed passwords. In Proceedings of the 17th ACM Conference on Computer and Communications Security (pp. 162-175). ACM.

Zhang, L., & McDowell, W. C. (2009). Am I really at risk? Determinants of online users' intentions to use strong passwords. Journal of Internet Commerce, 8(3-4), 180-197. doi: 10.1080/15332860903467508

APPENDIX A
INSTALLING SAFES FROM START TO FINISH

Note: This install manual assumes that the following software and versions are used for the installation:

CentOS 6.5 x64
Splunk Enterprise version 6.0.2
RSS Scripted Input version 2.0
Splunk for IMAP version 1.20

Prerequisites:

An account must be set up on Splunk.com to download Splunk Enterprise and third party apps.
Splunk Enterprise and the third party apps must be downloaded to the server that will host Splunk, the third party apps, and SAFES.

Installation:

mv Downloads/* /usr/local/src/
cd /usr/local/src/
rpm -i splunk-6.0.2-196940-linux-2.6-x86_64.rpm
/opt/splunk/bin/splunk start --accept-license

Open Splunk in a browser at 127.0.0.1:8000. The username is admin and the password is changeme; Splunk will then prompt you to change it.

Back on the terminal:

/opt/splunk/bin/splunk install app rss-scripted-input_20.tgz
(This command will prompt you for the recently changed admin password.)
/opt/splunk/bin/splunk install app splunk-for-imap_120.tgz
cp -r /usr/local/src/SAFES/* /opt/splunk/etc/apps/SAFES/
/opt/splunk/bin/splunk restart
cp /opt/splunk/etc/apps/imap/default/imap.conf /opt/splunk/etc/apps/imap/local/imap.conf
/opt/splunk/bin/splunk restart

Manual configuration:

vi /opt/splunk/etc/apps/rss/bin/feeds.txt
(Remove the default feeds and add the Google Alerts feeds.)
vi /opt/splunk/etc/apps/imap/local/imap.conf
(Modify the configuration settings to match the organization's IMAP account tied to the external monitoring accounts.)
cp -f /opt/splunk/etc/apps/SAFES/imap/getimap.py /opt/splunk/etc/imap/getimap.py

Post installation and configuration:

/opt/splunk/bin/splunk restart

SAFES Overview Dashboard: http://127.0.0.1:8000/en-US/app/SAFES/safes