Efficient Methods to Store and Query Network Data

Paul Giura

November 2, 2010

EFFICIENT METHODS TO STORE AND QUERY NETWORK DATA

DISSERTATION

Submitted in Partial Fulfillment of the Requirements for the Degree of DOCTOR OF PHILOSOPHY (Computer Science) at the POLYTECHNIC INSTITUTE OF NEW YORK UNIVERSITY by Paul Giura, January 2011

Approved: Department Head
Copy No.

Approved by the Guidance Committee:
Major: Computer Science
Nasir Memon, Professor of Computer Science
Torsten Suel, Associate Professor of Computer Science
Joel Wein, Associate Professor of Computer Science
Hervé Brönnimann, Associate Professor of Computer Science

Microfilm or other copies of this dissertation are obtainable from UNIVERSITY MICROFILMS, 300 N. Zeeb Road, Ann Arbor, Michigan 48106.

Vita

Paul Giura was born in Slatina, Olt, Romania in 1981. He received his B.S. degree from the University of Bucharest, Bucharest, Romania in the summer of 2004. In September 2005 Paul started working towards his M.S./Ph.D. degree as a Research Assistant at the Polytechnic Institute of NYU, and in 2007 he received his M.S. degree. During his first years at NYU-Poly Paul worked to develop new techniques for network payload attribution systems; his research then focused on efficient storage methods for network flow data. In 2007 he was a summer intern at the Ricoh Innovations research labs, where he designed and implemented a prototype system used to establish the trust and integrity of electronic and paper document archives. In 2008 he was a summer intern at AT&T Security, where he worked on the security of Content Distribution Network architectures. His research interests include Network and Internet Security, Systems, Data Storage Technologies and Algorithms for Security.

To my parents, Alecsandru and Florica, my brother, Cristian, and my love, Simona.

Acknowledgements

I am extremely grateful to my advisor Nasir Memon, who was an excellent source of advice for both technical and professional matters. He provided support, encouragement, good company, and lots of great ideas. It is impossible to imagine someone being more selfless, having more of his students' interests in mind, or giving them more freedom. I could not think of a better advisor and mentor for my Ph.D. years. Thanks especially to Hervé Brönnimann for his advice, patience, great ideas and key contribution to my professional development when he served as my initial Ph.D. advisor and later as a valuable member of my defense committee. Thanks to Miroslav Ponec, who worked closely with me on the payload attribution methods (which became Chapter 2 of this dissertation), for his contribution to the subject and his friendship throughout the Ph.D. years and beyond. Many thanks to Joel Wein for his guidance on the payload attribution research, for all his valuable contributions and for serving as a valuable member of my Ph.D. committee. I thank Torsten Suel, from whom I learned a lot about database concepts, for his helpful feedback and for being a valuable member of my defense committee. I would like to thank Kulesh Shanmugasundaram for suggesting to me the problem of finding efficient storage methods for network flow data and for all the helpful discussions we had. I would also like to thank many other excellent professors at the Polytechnic Institute of NYU, from whom I learned a lot. Being a graduate student in a foreign country for the first time, I encountered many obstacles that I was able to overcome because I had great friends by my side.
I would especially like to thank George Marusca and Florin Zidaru for being good friends and for their invaluable help when I first arrived in the United States. Thanks also to Baris Coskun for his friendship, for sharing helpful discussions in difficult moments and for being a great fishing partner. I would like to mention Marin Marcu and Cristian Giurescu for guiding me in the early years of my career, and my undergraduate advisor Ioan Tomescu at the University of Bucharest for advising and encouraging me to pursue my doctorate degree. I would like to thank my parents, Alecsandru Giura and Florica Giura, and my brother Cristian Giura for their encouragement and invaluable support, which always gave me strength to get through this endeavor despite the many lands and seas between us. Finally, I would like to thank Simona Cirstea, who was by my side in difficult moments, inspired me to persist and brought me happiness and love.

Paul Giura, Polytechnic Institute of New York University
November 2, 2010

Abstract

Network data crosses network boundaries in both directions, and many organizations record traces of network connections for monitoring and investigation purposes. With the increase in network traffic and in the sophistication of attacks, there is a need for efficient methods to store and query these data. In this dissertation we propose new efficient methods for storing and querying network payload and flow data that can be used to enhance the performance of monitoring and forensic analysis.

We first address the efficiency of various methods used for payload attribution. Given a history of packet transmissions and an excerpt of a possible packet payload, a Payload Attribution System (PAS) makes it feasible to identify the sources, destinations and the times of appearance on a network of all the packets that contained the specified payload excerpt. A PAS, as one of the core components in a network forensics system, enables investigating cybercrimes on the Internet by, for example, tracing the spread of worms and viruses, identifying who has received a phishing email in an enterprise, or discovering which insider allowed an unauthorized disclosure of sensitive information. Considering the increasing volume of network traffic in today's networks, it is infeasible to effectively store and query all the actual packets for extended periods of time for investigations. In this dissertation we therefore focus on extremely compressed digests of payload data: we analyze the existing approaches and propose several new methods for payload attribution which utilize Rabin fingerprinting, shingling, and winnowing. Our best methods allow building payload attribution systems which provide data reduction ratios greater than 100:1 while supporting efficient queries with very low false positive rates. We demonstrate the properties of the proposed methods and specifically analyze their performance and practicality when used as modules of a network forensics system.

Next, we propose a column oriented storage infrastructure for storing historical network flow data. Transactional row-oriented databases provide satisfactory query performance only for network flow data collected over a period of several hours. In many cases, such as the detection of sophisticated coordinated attacks, it is crucial to query days, weeks or even months worth of disk resident historical data rapidly. For such monitoring and forensics queries, row oriented databases become I/O bound due to long disk access times.
Furthermore, their data insertion rate is proportional to the number of indexes used, and query processing time is increased when it is necessary to load unused attributes along with the used ones. To overcome these problems, in this dissertation we propose a new column oriented storage infrastructure for network flow records and present the performance evaluation of a prototype storage system implementation called NetStore. The system is aware of network data semantics and access patterns, and benefits from a simple column oriented layout without the need to meet general purpose database requirements. We show that NetStore can potentially achieve more than ten times query speedup and ninety times less storage requirements compared to traditional row-stores, while it performs better than existing open source column-stores for network flow data.

Finally, we propose an efficient querying framework to represent, implement and execute forensics and monitoring queries faster on historical network flow data. Using efficient filtering methods, the query processing algorithms can improve query runtime performance by up to an order of magnitude for simple filtering and aggregation queries, and by up to six times for batches of complex queries when compared to naïve approaches. Additionally, we propose a simple SQL extension that implements a subset of standard SQL commands and operators and a small set of features useful for network monitoring and forensics. The presented query processing engine together with the column storage infrastructure create a complete system for storing and querying network flow data efficiently when used for monitoring and forensic analysis.

Contents

Vita
Acknowledgements
Abstract

1 Introduction
1.1 Network Monitoring and Forensics
1.2 Network Data
1.2.1 Packets Payload Data
1.2.2 Network Flow Data
1.3 Challenges and Contributions
1.3.1 Payload Attribution Systems
1.3.2 Network Flow Data Storage Systems
1.4 Summary and Dissertation Outline

2 New Payload Attribution Methods
2.1 Introduction
2.2 Related Work
2.2.1 Bloom Filters
2.2.2 Rabin Fingerprinting
2.2.3 Winnowing
2.2.4 Attribution Systems
2.3 Methods for Payload Attribution
2.3.1 Hierarchical Bloom Filter (HBF)
2.3.2 Fixed Block Shingling (FBS)
2.3.3 Variable Block Shingling (VBS)
2.3.4 Enhanced Variable Block Shingling (EVBS)
2.3.5 Winnowing Block Shingling (WBS)
2.3.6 Variable Hierarchical Bloom Filter (VHBF)
2.3.7 Fixed Doubles (FD)
2.3.8 Variable Doubles (VD)
2.3.9 Enhanced Variable Doubles (EVD)
2.3.10 Multi-Hashing (MH)
2.3.11 Enhanced Multi-Hashing (EMH)
2.3.12 Winnowing Multi-Hashing (WMH)
2.4 Payload Attribution Systems Challenges
2.4.1 Attacks on PAS
2.4.2 Multi-packet queries
2.4.3 Privacy and Simple Access Control
2.4.4 Compression
2.5 Experimental Results
2.5.1 Performance Metrics
2.5.2 Block Size Distribution
2.5.3 Unprocessed Payload
2.5.4 Query Answers
2.6 Conclusion

3 A Storage Infrastructure for Network Flow Data
3.1 Introduction
3.2 Related Work
3.3 Architecture
3.3.1 Network Flow Data
3.3.2 Column Oriented Storage
3.3.3 Compression
3.3.4 Query Processing
3.4 Evaluation
3.4.1 Parameters
3.4.2 Queries
3.4.3 Compression
3.4.4 Comparison With Other Systems
3.5 Conclusion

4 A Querying Framework For Network Monitoring and Forensics
4.1 Introduction
4.2 Query Processing
4.2.1 Simple Queries
4.2.2 Complex Queries
4.3 Query Language
4.3.1 Data Definition Language
4.3.2 Data Manipulation Language
4.3.3 User Input
4.4 Experiments
4.4.1 Simple Queries Performance
4.4.2 Complex Queries Performance
4.5 Related Work
4.5.1 Multi-Query Processing
4.5.2 Network Flow Query Languages
4.6 Conclusion

5 Conclusions and Future Work
5.1 Concluding Remarks
5.2 Future Work

List of Figures

1.1 Network packet.
1.2 Network flow data representation.
1.3 Row-store RDBMS.
1.4 A column-store RDBMS.
2.1 Methods evolution tree.
2.2 Processing with HBF method.
2.3 HBF querying collision.
2.4 Processing with FBS method.
2.5 Collision because of shingling failure.
2.6 Querying with FBS method.
2.7 Excerpt not found with FBS method.
2.8 Processing with VBS method.
2.9 Processing with EVBS method.
2.10 Processing with WBS method.
2.11 Processing with VHBF method.
2.12 Processing with FD method.
2.13 Processing with VD method.
2.14 Processing with EVD method.
2.15 Processing with MH method.
2.16 Processing with WMH method.
2.17 Query using WMH method.
2.18 The distributions of block sizes for VBS method.
2.19 The distributions of block sizes for EVBS method.
2.20 The distributions of block sizes for WBS method.
2.21 All methods evaluation.
3.1 Network flow traffic distribution for one day.
3.2 Network flow traffic distribution for one month.
3.3 NetStore architecture.
3.4 NetStore processing engine.
3.5 Column-store architecture.
3.6 IPs inverted index.
3.7 Insertion rate for different segment sizes.
3.8 Compression ratio with and without aggregation.
3.9 Query time vs. segment size.
3.10 Compression strategies results.
4.1 Simple query execution graph.
4.2 The interactive sequential processing model.
4.3 Batch of queries representation.
4.4 Simple query optimization evaluation.
4.5 The complex interactive query performance compared to the baseline.
4.6 The complex batch query performance compared to the baseline.

List of Tables

2.1 Summary of properties of methods.
2.2 Number of elements inserted for each method.
2.3 (a) False positive rates for data reduction ratio 130:1. (b) Percentage of unprocessed payload.
2.4 False positive rate for data reduction ratio 50:1.
3.1 NetStore flow attributes.
3.2 NetStore properties and rates supported.
3.3 NetStore relative performance compared to PostgreSQL and LucidDB.

List of Algorithms

4.1 ExecuteSimpleQuery
4.2 ExecuteAdaptiveBatchQuery
Chapter 1

Introduction

The number of devices that connect to the Internet and the traffic they generate increase every day, with more and more data being transported across networks. This data crosses private network boundaries in and out, and many organizations record traces of network connections for monitoring and investigation purposes. In general, the collected data is preprocessed and archived using relational databases or some other file organization on permanent storage devices. In both cases, the system used to process and archive the collected network data impacts the query response times as well as the efficiency with which the data can be used for monitoring and forensics. In this dissertation we analyze existing practices for archiving network data and propose new efficient methods that can be used to enhance the performance of monitoring and forensic analysis.

1.1 Network Monitoring and Forensics

Network monitoring is the practice of overseeing the operation of a computer network in order to detect failures of devices, failures of connections or other unexpected network communication behavior. Besides the ability to display traffic summarization per hour, per day or other predefined time window, network security monitoring systems, such as Intrusion Detection Systems (IDS), are increasingly providing the ability to detect complex network behavior patterns, such as worm infections, by using near real time network data. Therefore it is important to have access to the monitored data as quickly as possible in order to reduce the time of network exposure to various security vulnerabilities, or to detect communication malfunctions and resume normal network activities in the shortest time possible.

Network forensics is the capture, recording, and analysis of network events in order to discover the source of security attacks or other problem incidents.
In comparison with network monitoring, forensics tasks require access to larger amounts of network data, collected over longer periods of time that can represent days, weeks, months or even years. This data is processed in several steps in a drill-down manner, narrowing and correlating the subsequent intermediary results throughout the analysis. As such, network forensic systems need the ability to identify the root cause of a security breach starting from a simple evidence point such as an excerpt of a phishing email, an Internet worm signature or a piece of sensitive data disclosed by an insider. The next steps in the investigation may involve checking a suspected host's past network activity, looking up any services run by the host, the protocols used, and the connection records to other hosts that may or may not be compromised. In this case, the systems used for storing and querying network data should take into account the general characteristics of forensic queries and should provide valuable evidence and reasonable response times when using the archived data.

For both monitoring and forensics tasks the amount of data considered for processing and analysis is increasing every day. Existing data storage systems are no longer suitable to store and query network data [19], and new methods are needed to efficiently meet the new requirements. In this dissertation we propose new methods for storing and querying network data. The proposed methods are aware of the network data semantics and potential query workloads, and can be efficiently used for monitoring and forensic analysis.

1.2 Network Data

In this section we present the different network data categories that we consider when designing our methods for storage and querying: payload data and network flow data. Data is transported over the network in packets, and each packet contains both a header and a payload. Payload data represents the actual data carried over the network in the network packets, and network flow data represents the quantitative information about the communication between two endpoints in the network. Our goal is to find efficient storage and querying methods for both of these network data categories.

1.2.1 Packets Payload Data

A network packet's payload contains the actual string of bits transferred over the network. Figure 1.1 shows schematically a network packet with header and payload data. This data can represent various content types such as plaintext, images, video, audio or encrypted data. In general this data is not structured in nature and most of the time it is transported in encoded format, the decoding being done at the application layer.

Figure 1.1: Simple network packet payload data representation.

In many situations payload data is stored along with the header information in network traces for monitoring and investigations. These traces represent data captured from the network and may be stored in full format using lossless compression methods, where all the original data can be reconstructed from the compressed format, or in extremely compressed digest format using lossy compression methods, where some of the original data might be lost but the information retained about the original data is still of considerable utility. Due to the increasing volume of network traffic in today's networks, it is infeasible to effectively store and query all the actual packets payload data for extended periods of time in order to allow analysis of network events for investigative purposes.
Therefore, in this dissertation we focus on the latter case, when payload data is stored using extremely compressed digests of payload, and propose new methods that can be efficiently used by payload attribution systems for investigation tasks. More specifically, we propose various methods to partition the payload into blocks using winnowing [47] and Rabin fingerprinting [43], and then store them in a Bloom filter [6]. We present the details of all the proposed payload attribution methods [40, 41] in Chapter 2.

1.2.2 Network Flow Data

Unlike payload data, the header data is structured and includes specific information from each networking layer. In general, header data contains routing information used by networked devices and hosts' operating systems in order to facilitate the transportation of the communication data flow. Among other identification and control information requested by the lower layer protocols, a network packet header contains information requested by the transport and network layer protocols such as source IP, source port, destination IP, destination port, protocol, etc. In the context of network communication we refer to a flow as a unidirectional data stream between two endpoints, and refer to flow data or a flow record as the quantitative description of a flow. Flow data includes source IP, source port, destination IP, destination port, protocol, number of bytes transported in the flow, start time, end time, etc. A schematic representation of network flow data is presented in Figure 1.2. Since flow data has become ubiquitous in recent years, organizations have developed standards [26] and protocols [57] in order to provide a common framework for using this data. As such, many organizations store and use network flow data for various purposes such as traffic metering, network monitoring, intrusion detection and network forensics.

Figure 1.2: Network flow data representation.

In contrast with packet payload data, network flow data is highly structured. Each attribute of a flow can be stored efficiently using well established representations of common numerical types (IP addresses as 4-byte integers, source port as a 2-byte short, etc) or specific protocol structures (DNS records, HTTP requests, etc). Current systems [18] store historical network flow records using transactional Relational Database Management Systems (RDBMS) or plain files organized in a hierarchy based on capture time [17]. These systems show a performance penalty when storing and querying data spanning long periods of time due to the I/O operations overhead. Chapter 3 of this dissertation presents a novel storage architecture that avoids the limitations of transactional databases and uses a column-oriented approach to store and query network flow records for monitoring and forensic analysis. Moreover, Chapter 4 presents the design of the query processing engine for the column-oriented system and proposes several optimization methods for simple and complex forensics and monitoring queries.

1.3 Challenges and Contributions

With the deployment of new networked devices and Internet based services there is an accelerated increase in both network speeds and the amount of data transported. In this dissertation we propose new methods for coping with these major challenges in order to provide efficient functionality for systems used in network monitoring and forensic analysis.
First, in Chapter 2 we look at the existing methods for payload attribution, observe their limitations and propose new techniques that gradually achieve the desired goals: the highest payload compression ratios with the smallest false positive rates when querying. Second, we examine the existing solutions for storing and querying network flow data; then, based on the data and expected query characteristics, we propose a new storage architecture in Chapter 3 and a query execution framework in Chapter 4 that are better suited to store and query historical network flow data.

1.3.1 Payload Attribution Systems

Payload attribution is the process of identifying the sources and destinations of all packets that appeared on a network and contained a certain excerpt of a payload. A Payload Attribution System (PAS) is a system that can facilitate payload attribution. It can be an extremely valuable tool in helping to determine the perpetrators or the victims of a network event and to analyze security incidents in general.

A payload attribution system performs two separate tasks: payload processing and query processing. In payload processing, the payload of all traffic that passed through the network where the PAS is deployed is examined and some information is saved into permanent storage. This has to be done at line speed, and the underlying raw packet capture component can also perform some filtering of the packets, for example, choosing to process only HTTP traffic. In general, data is stored in archive units, each of which has two timestamps (the start and end of the time interval during which data was collected). For each time interval there is also a need to save a unique flow identifier as a flowID (pairs of source and destination IP addresses, etc) to allow querying later on. This information can alternatively be obtained from various sources such as connection records collected by dedicated network sensors (routers exporting NetFlow [57]), firewalls, intrusion detection systems or other log files. During query processing, given the excerpt and a time interval of interest, the PAS has to retrieve all the corresponding archive units from the storage. For example, when querying the PAS for the excerpt, if the excerpt is found in the archive, the system will try to query successively for all the flowIDs available corresponding to the time interval, and report all matches to the user.

A naïve method to design a simple payload attribution system is to store the payload of all packets. In order to decrease the demand for storage capacity and to provide some privacy guarantees, one can store hashes of payloads instead of the actual payloads. This approach reduces the amount of data per packet (to about 20 bytes by using SHA-1, for example) at the cost of false positives due to hash collisions. To further reduce the required storage space one can insert the payloads into a Bloom filter [6], which is described in Section 2.2.1 of Chapter 2. Essentially, Bloom filters are space-efficient probabilistic data structures supporting membership queries that are used in many network and other applications [10]. An empty Bloom filter is a bit vector of m bits, all set to 0, that uses k different hash functions, each of which maps a key value to one of the m positions in the vector. To insert an element into the Bloom filter, one computes the k hash function values for the element and sets the bits at the corresponding k positions to 1 in the bit vector.
To test whether an element was inserted, one hashes the element with these k hash functions and checks if all corresponding bits are set to 1, in which case the element is said to be found in the filter. The space savings of a Bloom filter are achieved at the cost of introducing false positives. The false positive rate of a Bloom filter depends on the data reduction ratio it provides: the greater the savings in storage, the greater the probability of a query returning a false positive. A useful property of a Bloom filter is that it preserves privacy, because it only allows one to ask whether a particular element was inserted into it; it cannot be coerced into revealing the list of elements stored. Compared to storing hashes directly, the advantage of using Bloom filters is not only the space savings but also the speed of querying. It takes only a short constant time to query the Bloom filter for any packet. However, inserting the entire payload into a Bloom filter does not allow supporting queries for payload excerpts. Instead of inserting the entire payload into the Bloom filter, one can partition it into blocks and insert them individually. In this way the system allows queries for excerpts of the payload by checking if all the blocks of an excerpt are in the Bloom filter. Besides reducing the storage requirements and achieving the lowest false positive rates for individual excerpt blocks, other important challenges are to determine with high accuracy where in the payload each block started (the alignment problem) and whether the resulting blocks appeared consecutively in the same payload (the consecutiveness resolution problem).

In this dissertation we consider the case when a PAS processes a packet's payload by partitioning the payload into blocks and then storing them in a Bloom filter. We describe a whole suite of methods used for payload partitioning in Chapter 2. We show how each payload partitioning method was derived, how it solves the alignment and consecutiveness resolution problems, and how it impacts the PAS performance in terms of the compression ratios and false positive rates achieved.

1.3.2 Network Flow Data Storage Systems

Unlike network packet payload data, network flow data is highly structured and therefore can be stored in structured format in tables using a relational database, as it is done in [18]. In such a case, each flow record is stored as a row in a table, with each flow attribute in a column. Data in a relational database is manipulated and queried using standard SQL commands. An efficient storage and querying infrastructure for network flow records has to cope with two main technical challenges: keep the insertion rate high, and provide fast access to the desired flow records. As shown in [19], the query performance of an RDBMS is influenced by the decision to physically store the data row-by-row in a so called row-store, or column-by-column in a column-store. Column oriented systems are proven to yield better query runtime performance for analytical query workloads [55, 60], and in this section we provide a brief description of each approach as well as a short description of expected queries on network flow data.

Row-Store RDBMS

When using a traditional row-oriented database, for each flow, the relevant attributes are inserted as a row into a table as they are captured from the network, as shown by Figure 1.3. Then, the attributes of each flow are stored sequentially on disk, and flows are indexed using various techniques [18] for the most accessed attributes.
On one hand, such a system has to establish a trade off between the desired insertion rate and the storage and processing overhead incurred by the use of auxiliary indexing data structures. On the other hand, enabling indexing for more attributes ultimately improves query performance, but it also increases the storage requirements and decreases insertion rates. In general, when querying disk resident data, an important problem to overcome is the I/O bottleneck caused by large disk to memory data transfers. With the flow records lying sequentially on disk, at query time all the columns of the table have to be loaded into memory even if only a subset of the attributes are relevant for the query, adding a significant I/O penalty to the overall query processing time by loading unused columns.

Figure 1.3: A row-store RDBMS table representation of the flow data.

Therefore, one potential solution would be to load only data that is relevant to the query. For example, to answer the query "What is the list of all IPs that contacted IP X between dates d1 and d2?", the system should load only the source and destination IPs as well as the timestamps of the flows that fall between dates d1 and d2. The I/O time can also be decreased if the accessed data is compressed, since less data traverses the disk-memory boundary. Further, the overall query response time can be improved if data is processed in compressed format, by saving decompression time. Finally, since the system has to insert records at line speed, all the preprocessing algorithms used, such as compression, sorting or indexing, should add negligible overhead while writing to disk. Even though for small amounts of network flow data the existing transactional database systems might provide satisfactory performance [18], they fall short when inserting and querying network data collected for prolonged periods of time. However, for large amounts of network flow data the above requirements can be met quite well by utilizing a column oriented database, described below.

Column-Store RDBMS

When using a column oriented RDBMS, the flow attributes are also represented as rows in a table at the logical level, but they are stored as columns at the physical level. Each column holds data for a single attribute of the flow and is stored sequentially on disk. A simple graphical representation is shown by Figure 1.4.

Figure 1.4: A column-store RDBMS representation of the flow data; the top table shows the data source of each column.

Such a strategy makes the system I/O efficient for read queries, since only the attributes required by a query need to be read from the disk. Moreover, having data of the same type lying sequentially on disk creates incentives for efficient compression methods that allow processing data in compressed format (for example when using run-length encoding). The performance benefits of column partitioning were previously analyzed in [2, 25], and some of the ideas were confirmed by results in the database academic research community [1, 19, 55, 60] as well as in industry [13, 27, 28, 30, 58]. However, most commercial and open-source column stores were conceived to follow general purpose RDBMS requirements; they do not fully use the semantics of the data carried and do not take advantage of the specific types and data access patterns of network forensic and monitoring queries.
For example, network data is continuously inserted at line speed into the storage systems using append only operations, so there is no need to support all the operations required by transactional workloads. Moreover, forensics and monitoring queries mostly use a time window associated with each query, access large amounts of stored data once, and do not require individual updates or deletes of the stored flow records unless the entire flow archive is deleted. In contrast, existing general purpose column-stores use auxiliary data structures to reconstruct the original logical table in order to support updates and deletes of individual records. By doing so, a significant overhead is added to both insertion and query time. Therefore, the major challenges of building an efficient storage infrastructure for network flow records are to reuse the relevant features of existing systems, to avoid their limitations and to incorporate the knowledge about data insertion and query workloads as early as possible in the design of the system. In Chapter 3 we present the design, implementation details and evaluation of a column-oriented storage infrastructure for network flow records that, unlike the other systems, is intended to provide good performance when using network flow data for monitoring and forensic analysis [19].

Queries on Network Flow Data

Forensics and monitoring queries on historical network flow data are becoming increasingly complex [32]. In general these queries are composed of many simple queries that process both structured data and user input data (for example, lists of malicious hosts, restricted services, restricted port numbers, etc). In the case of forensic analysis, historical network flow data is processed in several steps, queries being executed sequentially in a drill-down manner, narrowing, reusing and correlating the subsequent intermediary results throughout the analysis. Existing SQL based querying technologies do not support unstructured user input, and they store intermediary results in temporary files [17] and materialized views on disk [18] before feeding data to future queries. With this approach the query processing time is increased due to multiple unnecessary I/O operations. In forensic analysis, references between subsequent query results are not trivial to represent and implement using standard SQL syntax and semantics; sophisticated stored procedures and languages are used instead for this task. Moreover, when executing forensic queries sequentially, the query engine has the opportunity to speed up new queries by efficiently reusing the results of predicates already evaluated in earlier queries.

For monitoring, network administrators run many simple queries at once, in batches, in order to detect complex network behavior patterns, such as worm infections [32], and to display traffic summarization per hour, per day or other predefined time window [17, 39]. When queries are submitted in batches, the simple queries can be executed in any order, and some orders may result in better overall runtime performance. Additionally, it is expected that some of the queries in the batch use the same filtering predicates for some attributes (known ports, IPs, etc). In this case the results from evaluating common predicates can be shared by many queries, therefore saving execution time. Moreover, evaluating predicates in a particular order across all the simple queries may result in less evaluation work for later predicates in the execution pipeline.
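To make the predicate-sharing idea concrete, the following sketch is illustrative only; the in-memory column layout, the predicate representation and all names are hypothetical and are not part of the system described in later chapters. It evaluates each distinct filtering predicate once over a column, caches the matching row positions, and lets every query in the batch reuse the cached result.

# Illustrative sketch of sharing predicate results across a batch of simple queries.
# The column layout and predicate encoding below are placeholders, not NetStore's API.

def eval_predicate(column, op, value):
    """Return the set of row positions in `column` that satisfy `op value`."""
    ops = {
        "=":  lambda x: x == value,
        "!=": lambda x: x != value,
        "<":  lambda x: x < value,
        "<=": lambda x: x <= value,
        ">":  lambda x: x > value,
        ">=": lambda x: x >= value,
    }
    test = ops[op]
    return {i for i, x in enumerate(column) if test(x)}

def run_batch(columns, queries):
    """Each query is a list of (attribute, op, value) predicates ANDed together.
    Each distinct predicate is evaluated once and its result is shared by all queries."""
    cache = {}
    results = []
    for predicates in queries:
        matching = None
        for attr, op, value in predicates:
            key = (attr, op, value)
            if key not in cache:              # evaluate each distinct predicate only once
                cache[key] = eval_predicate(columns[attr], op, value)
            rows = cache[key]
            matching = rows if matching is None else matching & rows
            if not matching:                  # early exit: nothing left to filter
                break
        results.append(matching or set())
    return results

# Example: two monitoring queries sharing the destination-port predicate.
columns = {
    "srcIP":   ["10.0.0.1", "10.0.0.2", "10.0.0.1", "10.0.0.3"],
    "dstPort": [80, 22, 80, 443],
    "bytes":   [1200, 300, 52000, 700],
}
batch = [
    [("dstPort", "=", 80)],                          # all HTTP flows
    [("dstPort", "=", 80), ("bytes", ">", 10000)],   # large HTTP flows, reuses the port filter
]
print(run_batch(columns, batch))                     # [{0, 2}, {2}]

In this toy batch the second query starts from the cached result of the shared port predicate, which is exactly the kind of saving the batch execution algorithms in Chapter 4 aim to exploit at a much larger scale.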
In addition to the column oriented storage infrastructure described in Chapter 3, in this dissertation we propose a complementary querying framework for monitoring and forensic queries on historical network flow data in Chapter 4. The proposed querying framework together with the column oriented storage system create a complete system for efficiently storing and querying network flow data.

1.4 Summary and Dissertation Outline

There are four major components of this dissertation, each presented sequentially in one of the next chapters. As such, Chapter 2 defines the problem of payload attribution and provides a comprehensive presentation of multiple payload processing methods that can be used by a payload attribution system. For a better understanding of the motivation for each new method, the methods are introduced gradually, each enhancing the previously described ones with new features proven to eliminate some of their limitations. All the presented methods were tested as modules of a payload attribution system using real data. The experiments show that the best method achieves data reduction ratios greater than 100:1 while supporting efficient queries with very low false positive rates.

Chapter 3 examines the existing systems used for storing network flow data, outlines the requirements of an efficient storage infrastructure for these data and presents the design and implementation of a more efficient storage system. The proposed architecture is column oriented, and each column is partitioned into segments in order to allow data access at a smaller granularity than a column. Each segment has associated with it a small metadata component that stores statistics about the data in the segment. Having data partitioned in this fashion allows the use of compression methods that can be chosen dynamically for each segment, and Chapter 3 presents the compression selection mechanism as well as the compression methods used. Additionally, Chapter 3 introduces a new indexing approach for faster access to records corresponding to the internal hosts of an organization by using the concept of an inverted index. The experiments using sample monitoring and forensics queries show that an implemented instance of the proposed architecture, NetStore, performs better than two of the existing open-source systems, the row-store PostgreSQL and the column-store LucidDB.

Chapter 4 introduces the challenges associated with executing monitoring and forensic queries. First, based on their complexity, the queries are separated into simple and complex. The simple queries are assumed to use only simple filtering (<, ≤, >, ≥, =, ≠, etc) and aggregation (MIN, MAX, COUNT, SUM, AVG, etc) operators, while complex queries can be composed of many simple queries as building blocks. The simple queries that make up a complex query can be executed sequentially or in batches, each approach raising new challenges and opportunities for optimization. Chapter 4 defines the optimization problems for both simple and complex queries and presents efficient algorithms for executing these queries using data stored in a column-store. The experiments show that the proposed optimization methods performed better than the naïve approaches in both cases.
Additionally, Chapter 4 introduces a simple SQL extension for expressing monitoring and forensics queries that need easy access to previous results (for example to refine results in investigations), simpler constructs for loading user data (for example lists of malicious IPs) and for importing network flow data from network sensors.

Lastly, Chapter 5 presents a discussion about the results and implications of the presented methods. It describes the possibility of using the storage and querying methods in a more general context by varying the application domain. Finally, it enumerates the envisioned guidelines for future work, taking into account the increasing popularity of parallel processing, and presents the dissertation conclusions.

In summary, this dissertation presents new methods for enhancing the performance of storage and querying systems for network payload and flow data. It introduces new, more efficient payload attribution methods that can be successfully used to detect excerpts of payload with low false positive rates when using extremely compressed data. It presents the design and implementation details of a column oriented storage infrastructure for network flow records and proposes an efficient querying framework for monitoring and forensics queries using data stored in a column oriented system. The dissertation makes the following main intellectual contributions:

• A detailed description of new methods to partition network packet payloads for payload attribution.
• Implementation, analysis and comparison of payload attribution methods deployed as modules of an existing payload attribution system using real network traffic data.
• Design of an efficient column oriented storage infrastructure that enables quick access to large amounts of historical network flow data for monitoring and forensic analysis.
• Implementation and deployment of NetStore using commodity hardware and open source software, as well as analysis and comparison with other open source storage systems currently used in practice.
• The design and implementation of an efficient query processing engine built on top of a column oriented storage system used for network flow data.
• Query optimization methods for sequential and batch queries used in network monitoring and forensic analysis.
• Design of a simple SQL extension that allows simpler representation of forensic and monitoring queries.

Chapter 2

New Payload Attribution Methods

2.1 Introduction

Cybercrime today is alive and well on the Internet and growing both in scope and sophistication [45]. Given the trends of increasing Internet usage by individuals and companies alike and the numerous opportunities for anonymity and non-accountability of Internet use, we expect this trend to continue for some time. While there is much excellent work going on targeted at preventing cybercrime, unfortunately there is a parallel need to develop good tools to aid law-enforcement or corporate security professionals in investigating committed crimes. Identifying the sources and destinations of all packets that appeared on a network and contained a certain excerpt of a payload, a process called payload attribution, can be an extremely valuable tool in helping to determine the perpetrators or the victims of a network event and to analyze security incidents in general [16, 29, 50, 54].
It is possible to collect full packet traces even with commodity hardware [4], but the storage and analysis of terabytes of such data from today's high-speed networks is extremely cumbersome. Supporting network forensics by simply capturing and logging raw network traffic, however, is infeasible for anything but short periods of history. First, storage requirements limit the time over which the data can be archived (e.g., a 100 Mbit/s WAN can fill up 1 TB in just one day), and it is a common practice to overwrite old data when that limit is reached. Second, string matching over such massive amounts of data is very time-consuming.

Recently, Shanmugasundaram et al. [49] presented an architecture for network forensics in which payload attribution is a key component. They introduced the idea of using Bloom filters to achieve a reduced size digest of the packet history that would support queries about whether any packet containing a certain payload excerpt has been seen; the reduction in data representation comes at the price of a manageable false positive rate in the query results. Subsequently a different group has offered a variant technique for the same problem [14]. Our contribution in this chapter is to present new methods for payload attribution that have substantial performance improvements over these state-of-the-art payload attribution systems. Our approach to payload attribution, which constitutes a crucial component of a network forensic system, can be easily integrated into any existing network monitoring system. The best of our methods allow data reduction ratios greater than 100:1 and achieve very low overall false positive rates. With a data reduction ratio of 100:1 our best method gives no false positive answers for query excerpt sizes of 250 bytes and longer; in contrast, the prior best techniques had a 100% false positive rate at that data reduction ratio and excerpt size. The reduction in storage requirements makes it feasible to archive data taken over an extended time period and query for events in the substantially distant past. Our methods are capable of effectively querying for small excerpts of a payload but can also be extended to handle excerpts that span several packets. The accuracy of attribution increases with the length of the excerpt and the specificity of the query. Further, the collected payload digests can be stored and queried by an untrusted party without disclosing any payload information or the query details.

This chapter is organized as follows. In the next section we review related prior work. In Section 2.3 we provide a detailed design description of our payload attribution techniques, with a particular focus on payload processing and querying. In Section 2.4 we discuss several issues related to the implementation of these techniques in a full payload attribution system. In Section 2.5 we present a performance comparison of the proposed methods and quantitatively measure their effectiveness for multiple workloads. Finally, we conclude in Section 2.6.

2.2 Related Work

When processing a packet payload with the methods described in Section 2.3, the overall approach is to partition the payload into blocks and store them in a Bloom filter. In this section we first give a short description of Bloom filters and introduce Rabin fingerprinting and winnowing, which are techniques for block boundary selection. Thereafter we review the work related to payload attribution systems.
2.2.1 Bloom Filters

Bloom filters [6] are space-efficient probabilistic data structures supporting membership queries and are used in many network and other applications [10]. An empty Bloom filter is a bit vector of m bits, all set to 0, that uses k different hash functions, each of which maps a key value to one of the m positions in the vector. To insert an element into the Bloom filter, we compute the k hash function values and set the bits at the corresponding k positions to 1. To test whether an element was inserted, we hash the element with these k hash functions and check if all corresponding bits are set to 1, in which case we say the element is in the filter. The space savings of a Bloom filter are achieved at the cost of introducing false positives; the greater the savings, the greater the probability of a query returning a false positive. Equation 2.1 gives an approximation of the false positive rate α after n distinct elements were inserted into the Bloom filter [34]:

α = (1 − (1 − 1/m)^{kn})^k ≈ (1 − e^{−kn/m})^k.    (2.1)

Further analysis reveals that an optimal utilization of a Bloom filter is achieved when the number of hash functions, k, equals (ln 2) · (m/n) and the probability of each bit of the Bloom filter being 0 is 1/2. In practice, of course, k has to be an integer, and a smaller k is mostly preferred to reduce the amount of necessary computation. Note also that while we use Bloom filters throughout this chapter, all of our payload attribution techniques can be easily modified to use any data structure which allows insertion and querying for strings, with no changes to the structural design and implementation of the attribution methods.

2.2.2 Rabin Fingerprinting

Fingerprints are short checksums of strings with the property that the probability of two different objects having the same fingerprint is very small. Rabin defined a fingerprinting scheme [43] for binary strings based on polynomials in the following way. We associate a polynomial S(x) of degree N − 1 with coefficients in Z_2 with every binary string S = (s_1, . . . , s_N), for N ≥ 1:

S(x) = s_1 x^{N−1} + s_2 x^{N−2} + · · · + s_N.    (2.2)

Then we take a fixed irreducible polynomial P(x) of degree K over Z_2 and define the fingerprint of S to be the polynomial f(S) = S(x) mod P(x). This scheme, only slightly modified, has found several applications [8], for example, in defining block boundaries for identifying similar files [31] and for web caching [44]. We derive a fingerprinting scheme for payload content based on Rabin's scheme in Section 2.3.3 and use it to pick content-dependent boundaries for a priori unknown substrings of a payload. For details on the applications, properties and implementation issues of Rabin's scheme one can refer to [8].

2.2.3 Winnowing

Winnowing [47] is an efficient fingerprinting algorithm enabling accurate detection of full and partial copies between documents. It works as follows. For each sequence of ν consecutive characters in a document, we compute its hash value and store it in an array. Thus, the first item in the array is a hash of c_1 c_2 . . . c_ν, the second item is a hash of c_2 c_3 . . . c_{ν+1}, etc., where c_i are the characters in the document of size Ω bytes, for i = 1, . . . , Ω. We then slide a window of size w through the array of hashes and select the minimum hash within each window. If more than one hash has the minimum value, we choose the rightmost one. These selected hashes form the fingerprint of the document.
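As an illustration of the two boundary-selection ideas above, the sketch below hashes every ν-gram of a byte string and then selects positions either where the hash equals 0 mod p, in the style of the Rabin-fingerprint-based schemes, or by winnowing, taking the rightmost minimum hash in each window of w consecutive hashes. It is illustrative only: CRC32 stands in for the fingerprint function, and the values of ν, w and p are arbitrary placeholders, not the parameters used by the methods of Section 2.3.

import zlib

def ngram_hashes(data, nu):
    """Hash of every substring of nu consecutive bytes; index i covers data[i:i+nu]."""
    return [zlib.crc32(data[i:i + nu]) for i in range(len(data) - nu + 1)]

def mod_p_boundaries(hashes, p):
    """Rabin-style selection: keep positions whose hash is 0 mod p.
    Gives content-dependent boundaries, but may leave long stretches unselected."""
    return [i for i, h in enumerate(hashes) if h % p == 0]

def winnowing_boundaries(hashes, w):
    """Winnowing: in every window of w consecutive hashes select the minimum,
    taking the rightmost one on ties; guarantees a selection in each window."""
    selected = set()
    for start in range(len(hashes) - w + 1):
        window = hashes[start:start + w]
        m = min(window)
        offset = max(i for i, h in enumerate(window) if h == m)  # rightmost minimum
        selected.add(start + offset)
    return sorted(selected)

payload = b"GET /index.html HTTP/1.1 Host: example.com"
hs = ngram_hashes(payload, nu=4)
print(mod_p_boundaries(hs, p=8))       # positions selected by the 0 mod p rule
print(winnowing_boundaries(hs, w=8))   # positions selected by winnowing

The contrast between the two selectors is the point made next in the text: the mod p rule can go a long way without selecting anything, while winnowing picks at least one position per window of size w.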
It is shown in [47] that the fingerprints selected by winnowing are better for document fingerprinting than the subset of Rabin fingerprints which contains hashes equal to 0 mod p, for some fixed p, because winnowing guarantees that in any window of size w there is at least one hash selected. We will use this idea to select boundaries for blocks in packet payloads in Section 2.3.5.

2.2.4 Attribution Systems

There has been a major research effort over the last several years to design and implement feasible network traffic traceback systems, which identify the machines that directly generated certain malicious traffic and the network path this traffic subsequently followed. These approaches, however, restrict the queries to network floods, connection chains, or, in the best case, the entire payload of a single packet. The Source Path Isolation Engine (SPIE) [53] is a hash-based technique for IP traceback that generates audit trails for traffic within a network. SPIE creates hash-digests of packets based on the packet header and a payload fragment and stores them in a Bloom filter in routers. SPIE uses these audit trails to trace the origin of any single packet delivered by the network in the recent past. The router creates a packet digest for every forwarded packet using the packet's non-mutable header fields and a short prefix of the payload, and stores it in a Bloom filter for a predefined time. Upon detection of a malicious attack by an intrusion detection system, SPIE can be used to trace the packet's attack path back to the source by querying SPIE devices along the path. In many cases, however, an investigator may not have any header information about a packet of interest but may know some excerpt of the payload of the packets she wishes to see. Designing techniques for this problem that achieve significant data reduction (compared to storing raw packets) is a much greater challenge: the entire packet payload is much larger than the information hashed by SPIE, and in addition we need to store information about numerous substrings of the payload to support queries about excerpts.

Shanmugasundaram et al. [49] introduced the Hierarchical Bloom Filter (HBF), a compact hash-based payload digesting data structure, which we describe in Section 2.3.1. A payload attribution system based on an HBF is a key module of a distributed system for network forensics called ForNet [50]. The system has a low memory footprint and achieves a reasonable processing speed at a low false positive rate. It monitors network traffic, creates hash-based digests of payload, and archives them periodically. A user-friendly query mechanism based on XML provides an interface to answer postmortem questions about network traffic.
The RBF tries to solve a problem of finding a correct alignment of blocks in the process of querying an HBF by considering many possible alignments of blocks at once, i.e., RBF is rolling a fixed-size window over the packet payload and recording all the window positions as payload blocks. They report performance similar to the best case performance of the HBF. The design of an HBF is well documented in the literature and currently used in practice. We created our implementation of the HBF as an example of a current payload attribution method and include it in our comparisons in Section 2.5. The RBF’s performance is comparable to that of the HBF and experimental results presented in [14] show that RBF achieves low false positive rates only for small data reduction ratios (about 32:1). 2.3 Methods for Payload Attribution In this section we introduce various data structures for payload attribution. Our primary goal is to find techniques that give the best data reduction for payload fragments of significant size at reasonable computational cost. When viewed through this lens, roughly speaking a technique that we call Winnowing Multi-Hashing (WMH) is the best and substantially outperforms previous methods; a thorough experimental evaluation is presented in Section 2.5. Our exposition of WMH will develop gradually, starting with naı̈ve approaches to the problem, building through previous work (HBF), and introducing a variety of new tech- 21 niques. Our purpose is twofold. First, this exposition should develop a solid intuition for the reader as to the various considerations that were taken into account in developing WMH. Second, and equally important, there are a variety of lenses through which one may consider and evaluate the different techniques. For example, one may want to perform less computation and cannot utilize data aging techniques and as a result opt for a method such as Winnowing Block Shingling (WBS) which is more appropriate than WMH under those circumstances. Additionally, some applications may have specific requirements on the block size and therefore prefer a different method. By carefully developing and evaluating experimentally the different methods, we present the reader with a spectrum of possibilities and a clear understanding of which to use when. As noted earlier, all of these methods follow the general program of dividing packet payloads into blocks and inserting them into a Bloom filter. They differ in how the blocks are chosen, what methods we use to determine which blocks belong to which payload in which order (“consecutiveness resolution”), and miscellaneous other techniques used to improve the number of necessary queries and to reduce the probability of false positives. We first describe the basics of block based payload attribution and the Hierarchical Bloom Filter [49] as the current state-of-the-art method. We then propose several new methods which solve multiple problems in the design of the former methods. A naı̈ve method to design a simple payload attribution system is to store the payload of all packets. In order to decrease the demand for storage capacity and to provide some privacy guarantees, we can store hashes of payloads instead of the actual payloads. This approach reduces the amount of data per packet to about 20 bytes (by using SHA-1, for example) at the cost of false positives due to hash collisions. By storing payloads in a Bloom filter (described in Section 2.2.1), we can further reduce the required space. 
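Since all the methods below reduce to inserting and testing strings in such a filter, a minimal Python sketch of the data structure may help fix ideas; the class name, the SHA-1 based salted hashing and the bit-array layout are our own illustrative assumptions rather than the implementation evaluated in Section 2.5.

    import hashlib

    class BloomFilter:
        """Minimal m-bit Bloom filter with k salted hash functions (illustrative)."""
        def __init__(self, m, k):
            self.m, self.k = m, k
            self.bits = bytearray((m + 7) // 8)

        def _positions(self, item: bytes):
            # Derive k bit positions by hashing the item with k different salts.
            for i in range(self.k):
                digest = hashlib.sha1(i.to_bytes(4, "big") + item).digest()
                yield int.from_bytes(digest, "big") % self.m

        def insert(self, item: bytes):
            for pos in self._positions(item):
                self.bits[pos // 8] |= 1 << (pos % 8)

        def query(self, item: bytes) -> bool:
            # True means "probably inserted" (false positives possible, Eq. 2.1);
            # False is always correct.
            return all(self.bits[pos // 8] & (1 << (pos % 8))
                       for pos in self._positions(item))

For example, a filter of size 31 kB corresponds to m = 8 · 31 · 1024 bits, and k would be chosen close to (ln 2) · (m/n) for the expected number n of inserted blocks.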
The false positive rate of a Bloom filter depends on the data reduction ratio it provides. A Bloom filter preserves privacy because we can only ask whether a particular element was inserted into it; it cannot be coerced into revealing the list of elements stored, and even if we try to query for all possible elements, the result will be useless due to false positives. Compared to storing hashes directly, the advantage of using Bloom filters is not only the space savings but also the speed of querying: it takes only a short constant time to query the Bloom filter for any packet. Inserting the entire payload into a Bloom filter, however, does not allow supporting queries for payload excerpts.

Instead of inserting the entire payload into the Bloom filter we can partition it into blocks and insert them individually. This simple modification allows queries for excerpts of the payload by checking whether all the blocks of an excerpt are in the Bloom filter. Yet, we need to determine whether two blocks appeared consecutively in the same payload, or if their presence is just an artifact of the blocking scheme. The methods presented in this section deal with this problem by using offset numbers or block overlaps. The simplest data structure that uses a Bloom filter and partitions payloads into blocks with offsets is a Block-based Bloom Filter (BBF) [49]. Note that, assuming we do one decomposition of the payload into blocks during payload processing, starting at the beginning of the packet, we will need to query the data structure for multiple starting positions of our excerpt in the payload during the excerpt querying phase, as the excerpt need not start at the beginning of a block. For example, if the payload being partitioned with a block size of 4 bytes was ABCDEFGHIJ, we would insert blocks ABCD and EFGH into the Bloom filter (the remainder IJ is not long enough to form a block and is therefore not processed). Later on, when we query for an excerpt, for example DEFGHI, we would partition the excerpt into blocks (with a block size of 4 bytes, as done previously on the original payload). This would give us just one block to query the Bloom filter for, DEFG. However, because we do not know where the excerpt could be located within the payload, we also need to try partitioning the excerpt from starting position offsets 1 and 2, which gives us blocks EFGH and FGHI, respectively. We are then guaranteed that the Bloom filter answers positively for the correct block EFGH; however, we can also get positive answers for the blocks DEFG and FGHI due to false positives of the Bloom filter. The payload attribution methods presented in this section try to limit or completely eliminate (see Section 2.3.3) this negative effect. Alternative payload processing schemes, such as [14, 21], perform partitioning of the payload at all possible starting offsets during the payload processing phase (which is essentially equivalent to working on all n-grams of the payload), but this incurs a large processing overhead and multiplies the storage requirements.

We also need to set two parameters which determine the time precision of our answers and the smallest query excerpt size. First, we want to be able to attribute to each excerpt for which we query the time when the packet containing it appeared on the network. We solve that by having multiple Bloom filters, one for each time interval. The duration of each interval depends on the number of blocks inserted into the Bloom filter.
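A sketch of the BBF processing and excerpt querying just described; the separator bytes and function names are hypothetical, and flowIDs (introduced below) are omitted:

    def bbf_insert(bf, payload: bytes, s: int):
        # Block-based Bloom Filter: fixed-size blocks tagged with their offset number.
        for off in range(len(payload) // s):
            block = payload[off * s:(off + 1) * s]
            bf.insert(block + b"||" + str(off).encode())

    def bbf_query_excerpt(bf, excerpt: bytes, s: int, max_offset: int) -> bool:
        # The excerpt may start anywhere, so try every alignment of its first block
        # and every possible starting offset number.
        for align in range(s):
            blocks = [excerpt[i:i + s] for i in range(align, len(excerpt) - s + 1, s)]
            if not blocks:
                continue
            for start in range(max_offset):
                if all(bf.query(b + b"||" + str(start + j).encode())
                       for j, b in enumerate(blocks)):
                    return True
        return False

With the ABCDEFGHIJ example above and s = 4, the three alignments of the excerpt DEFGHI produce the single blocks DEFG, EFGH and FGHI that are tested against the filter.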
In order to guarantee an upper bound on the false positive rate, we replace the Bloom filter by a new one and store the previous Bloom filter in a permanent storage after a certain number of elements are inserted into it. There is also an upper bound on the maximum length of one interval to limit the roughness of time determination. Second, we specify the size of blocks. If the chosen block size is too small, we get too many collisions as there are not enough unique patterns and the Bloom filter gets filled quickly by many blocks. If the block size is too large, there is not enough granularity to answer queries for smaller excerpts. We need to distinguish blocks from different packets to be able to answer who has sent/received the packet. The BBF as briefly described above isn’t able to recognize the origins and destinations of packets. In order to work properly as an attribution system over multiple packets a unique flow identifier (flowID) must be associated with each block before inserting into the Bloom filter. A flow identifier can be the concatenation of source and destination IP addresses, optionally with source/destination port numbers. We maintain a list (or a more efficient data structure) of flowIDs for each Bloom filter and our data reduction estimates include the storage required for this list. The connection records (flowIDs) for each Bloom filter (i.e., a time interval) can be alternatively obtained from other modules monitoring the network. The need for testing all the flowIDs in a list significantly increases the number of queries required for the attribution as the flowID of the packet that contained the query excerpt is not known a priori and it leads to higher false positive rate and decreases the total query performance. Therefore, we may either maintain two separate Bloom filters to answer queries, one into which we insert blocks only and one with blocks concatenated with the corresponding flowIDs, or insert both into one larger Bloom filter. The former allows data aging, i.e., for very old data we can delete the first Bloom filter and store only the one with flowIDs at the cost of higher false positive rate and slower querying. Another method to save storage space by reducing size taken by very old data is to take 24 a Bloom filter of size 2b and replace it by a new Bloom filter of size b by computing the logical or operation of the two halves of the original Bloom filter. This halves the amount of data and still allows querying but the false positive rate increases significantly. An alternative construction which allows the determination of source/destination pairs is using separate Bloom filters for each flow. Then instead of using one Bloom filter and inserting blocks concatenated with flowIDs, we just select a Bloom filter for the insertion of blocks based on the flowID. Because we cannot anticipate the number of blocks each flow would contain during a time interval, we use small Bloom filters, flush them to disk more often and use additional compression (such as gzip) on the Bloom filters before saving to disk which helps to significantly reduce storage requirements for very sparse flows. Having multiple small Bloom filters also has some performance advantages compared to one large Bloom filter because of caching; the size of a Bloom filter can be selected to fit into a memory cache block. This technique would most likely use TCP stream reconstruction and makes the packet processing stateful compared to the method using flowIDs. 
It may thus be suitable when there is another module in the system, such as an intrusion detection (or prevention) system, which already does the stream reconstruction, so that a PAS module can be attached to it. If this technique were used, the evaluation of the methods would be heavily dependent on the distribution of payload among the streams. For clarity of explanation, we do not take flowIDs into further consideration in the method descriptions throughout this section.

We have identified several important data structure properties of the methods presented in this chapter; a summary can be found in Table 2.1. Figure 2.1 shows a tree structure representing the evolution of the methods. For example, the VHBF method was derived from the HBF by the use of variable-sized blocks. These properties are thoroughly explained within the description of the method in which they first appear. Their impact on performance is discussed in Section 2.5. There are many possible combinations of the techniques presented in this chapter, and the following list of methods does not cover all of them. For example, a method which builds a hierarchy of blocks with winnowing as the boundary selection technique could be developed. However, the selected subset provides enough detail for a reader to construct and analyze the other alternative methods, and, as we have experimented with them, we believe the presented subset accomplishes the goal of selecting the most suitable one, which is, in the general case, the Winnowing Multi-Hashing technique (Section 2.3.12). In all the methods in this section we can extend the answer from a simple yes/no (meaning that there was/wasn't a packet containing the specified excerpt in a specified time interval and, if yes, also providing the list of flowIDs of the packets that contained the excerpt) to give additional details about which parts of the excerpt (i.e., which blocks) were found and return, for instance, the longest continuous part of the excerpt that was found.

Table 2.1: Summary of properties of methods from Section 2.3. We show how each method selects boundaries of blocks when processing a payload and how it affects the block size, how each method resolves the consecutiveness of blocks, its special characteristics, and finally, whether each method allows false negative and N/A answers to excerpt queries.

Figure 2.1: The evolution tree shows the relationship among the presented methods for payload attribution. Arrow captions describe the modifications made to the parent method.

2.3.1 Hierarchical Bloom Filter (HBF)

This subsection describes the (former) state-of-the-art payload attribution technique, the HBF [49], in detail and extends the description of previous work from Section 2.2. The following eleven subsections, each presenting a new technique, represent our novel contribution. An HBF supports queries for excerpts of a payload by dividing the payload of each packet into a set of blocks of fixed size s bytes, where s is a parameter specified by the system administrator. (The block size used for several years by an HBF-enabled system running in our campus network is 64 and 32 bytes, respectively, depending on whether it is deployed on the main gateway or on smaller local ones; longer blocks allow higher data reduction ratios but lower the querying capability for smaller excerpts.) The blocks of a payload of a single packet form a hierarchy (see Figure 2.2) which is inserted into a Bloom filter with appropriate offset numbers. Thus, besides inserting all blocks of a payload as in the BBF, we insert several super-blocks, i.e., blocks created by the concatenation of 2, 4, 8, etc., subsequent blocks, into the HBF. This produces the same result as having multiple BBFs with block sizes multiplied by powers of two, and a BBF can be viewed as the base level of the hierarchy in an HBF.

Figure 2.2: Processing of a payload consisting of blocks X0 X1 X2 X3 X4 X5 in a Hierarchical Bloom Filter.

When processing a payload, we start at level 0 of the hierarchy by inserting all blocks of size s bytes. In the next level we double the size of a block and insert all blocks of size 2s. In the n-th level we insert blocks of size 2^n · s bytes. We continue until the block size exceeds the payload size. The total number of blocks inserted into an HBF for a payload of size p bytes is Σ_l ⌊p/(2^l · s)⌋, where l is the level index such that 0 ≤ l ≤ ⌊log2(p/s)⌋. Therefore, an HBF needs about twice as much storage space as a BBF to achieve the same theoretical false positive rate of the Bloom filter, because the number of elements inserted into the Bloom filter is twice as high. However, for longer excerpts the hierarchy improves the confidence of the query results because they are assembled from the results for multiple levels. We use one Bloom filter to store blocks from all levels of the hierarchy to improve space utilization, because the number of blocks inserted at different levels depends on the distribution of payload sizes and is therefore dynamic. The utilization of this single Bloom filter is easy to control by limiting the number of inserted elements, and thus we can bound the (theoretical) false positive rate.

Offset numbers are the sequence numbers of blocks within the payload. Offsets are appended to block contents before insertion into an HBF: (content||offset), where 0 ≤ offset ≤ ⌊p/(2^l · s)⌋ − 1, p is the size of the entire payload and l is the level of the hierarchy. Offset numbers are unique within one level of the hierarchy. See the example given in Fig. 2.2. We first insert all blocks of size s with the appropriate offsets: (X0||0), (X1||1), (X2||2), (X3||3), (X4||4). Then we insert the blocks at level 1 of the hierarchy: (X0X1||0), (X2X3||1). And finally the second level: (X0X1X2X3||0). Note that in Figure 2.2 blocks X0 to X4 have size s bytes, but since block X5 is smaller than s it does not form a block and its content is not processed. We analyze the percentage of discarded payload content for each method in Section 2.5.

Figure 2.3: The hierarchy in the HBF does not cover double-blocks at odd offset numbers. In this example, we assume that two payloads X0 X1 X2 X3 and Y0 Y1 Y2 Y3 were processed by the HBF. If we query for an excerpt X1 Y2, we would get a positive answer, which represents an offset collision, because two blocks (X1||1) and (Y2||2) were inserted from different payloads but there was no packet containing X1 Y2.

Offsets do not provide a reliable solution to the problem of detecting whether two blocks appeared consecutively in the same packet. For example, if a BBF processes two packets made up of blocks X0 X1 X2 X3 and Y0 Y1 Y2 Y3 Y4, respectively, and we later query for an excerpt X2 Y3, the BBF will answer that it had seen a packet with a payload containing such an excerpt. We call this event an offset collision.
This happens because of inserting a block X2 with an offset 2 from the first packet and a block Y3 with an offset 3 from the second packet into the BBF. When blocks from different packets are inserted at the appropriate offsets, a BBF can answer as if they occurred inside a single packet. An HBF reduces the false positive rate due to offset collisions and due to the inherent false positives of a Bloom filter by adding supplementary checks when querying for an excerpt composed of multiple blocks. In this example, an HBF would answer correctly that it did not see such excerpt because the check for X2 Y3 in the next level of the hierarchy fails. However, if we query for 29 an excerpt X1 Y2 , both HBF and BBF fail to answer correctly (i.e, they answer positively as if there was a packet containing X1 Y2 ). Figure 2.3 provides an example of how the hierarchy tries to improve the resistance to offset collisions but still fails for two-block strings at odd offsets. We discuss offset collisions in an HBF further in Section 2.3.8. Because in the actual payload attribution system we insert blocks along with their flowIDs, collisions are less common, but they can still occur for payload inside one stream of packets within one time interval as these blocks have the same flowID and are stored in one Bloom filter. Querying an HBF for an excerpt x starts with the same procedure as querying a BBF. First, we have to try all possible offsets, where x could have occurred inside one packet. We also have to try s possible starting positions of the first block inside x since the excerpt may not start exactly on a block boundary of the original payload. To do this, we slide a window of size s starting at each of the first s positions of x and query the HBF for this window (with all possible starting offsets). After a match is found for this first block, the query proceeds to try the next block at the next offset until all blocks of an excerpt at level 0 are matched. An HBF continues by querying the next level for super-blocks of size twice the size of blocks in the previous level. Super-blocks start only at blocks from the previous level which have even offset numbers. We go up in the hierarchy until all queries for all levels succeed. The answer to an excerpt query is positive only if all answers from all levels of the hierarchy were positive. The maximum number of queries to a Bloom filter in an HBF in the worst case is roughly twice the number for a BBF. 2.3.2 Fixed Block Shingling (FBS) In a BBF and an HBF we use offsets to determine whether blocks appeared consecutively inside one packet’s payload. This causes a problem when querying for an excerpt because we do not know where the excerpt starts inside the payload (the starting offset is unknown). We have to try all possible starting offsets, which not only slows down the query process, but also increases the false positive rate because a false positive result may occur for any of these queries. As an alternative to using offsets we can use block overlapping, which we call shingling. In this scheme, the payload of a packet is divided into blocks of size s bytes as in a BBF, 30 but instead of inserting these blocks we insert strings of size s + o bytes (the block plus a part of the next block) into the Bloom filter. Blocks overlap as do shingles on the roof (see Figure 2.4) and the overlapping part assures that it is likely that two blocks appeared consecutively if they share a common part and both of them are in the Bloom filter. 
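A hedged sketch of this insertion step (function name assumed; bf is any Bloom filter offering insert and query, such as the sketch above):

    def fbs_insert(bf, payload: bytes, s: int, o: int):
        # Fixed Block Shingling: boundaries every s bytes, but each inserted string
        # also carries the first o bytes of the next block (the shingle overlap).
        i = 0
        while i + s + o <= len(payload):
            bf.insert(payload[i:i + s + o])
            i += s

The loop emits exactly the number of elements counted next.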
For a payload of size p bytes, the number of elements inserted into the Bloom filter is ⌊(p − o)/s⌋ for a FBS, which is close to the ⌊p/s⌋ for a BBF. However, the maximum number of queries to a Bloom filter in the worst case is about v times smaller than in a BBF, where v is the number of possible starting offsets. Since the value of v can be estimated as the system's maximum transmission unit (MTU) divided by the block size s, this improvement is significant, which is supported by the experimental results presented in Section 2.5.

Figure 2.4: Processing of a payload with a Fixed Block Shingling (FBS) method (parameters: block size s = 8, overlap size o = 2).

The goal of the FBS scheme (of using an overlapping part) is to avoid trying all possible offsets during query processing, as an HBF must, to solve the consecutiveness problem. However, neither technique (shingling or offsets) is guaranteed to answer correctly: an HBF can fail because of offset collisions, and a FBS because multiple blocks can start with the same string, in which case the FBS confuses their positions inside the payload (see Figure 2.5). Thus both can increase the number of false positive answers. For example, the FBS will incorrectly answer that it has seen a string of blocks X0 X1 Y1 Y2 after processing two packets X and Y made of blocks X0 X1 X2 X3 X4 and Y0 Y1 Y2, respectively, where X2 has the same prefix (of size at least o bytes) as Y1.

Figure 2.5: An example of a collision due to a shingling failure. The same prefix prevented a FBS method from determining whether two blocks appeared consecutively within one payload. The FBS method incorrectly treats the string of blocks X0 X1 Y1 Y2 as if it was processed inside one payload.

Querying a Bloom filter in the FBS scheme is similar to querying a BBF except that we do not use any offsets and therefore do not have to try all possible offset positions of the first block of an excerpt. Thus, when querying for an excerpt x, we slide a window of size s + o bytes starting at each of the first s positions of x and query the Bloom filter for this window. When a match is found for this first block, the query proceeds with the next block (including the overlap) until all blocks of the excerpt are matched. Since these blocks overlap, we assume that they occurred consecutively inside one single payload. The answer to an excerpt query is considered positive only if there exists an alignment (i.e., a position of the first block's boundary) for which all tested blocks were found in the Bloom filter. Figures 2.6 and 2.7 show examples of querying in the FBS method. Note that these examples ignore the fact that we also need to determine all flowIDs of the excerpts found; therefore, even after a match is found for some alignment and flowID, we continue to check the other alignments and flowIDs, because multiple packets in multiple flows could contain such an excerpt.

Figure 2.6: An example of querying a FBS method (with a block size of 8 bytes and an overlap size of 2 bytes). Different alignments of the first block of the query excerpt (shown on top) are tested. When a match is found in the Bloom filter for some alignment of the first block, we try subsequent blocks. In this example all blocks for the alignment starting at the third byte are found and therefore the query substring (at the bottom) is reported as found. We assume that the FBS processed the packet in Fig. 2.4 prior to querying.
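The corresponding query procedure, under the same assumptions, might look as follows; it is a sketch of the alignment search illustrated in Figures 2.6 and 2.7 and ignores flowIDs:

    def fbs_query(bf, excerpt: bytes, s: int, o: int) -> bool:
        # Try each of the s possible alignments of the first block; within an
        # alignment, every overlapping window of s + o bytes must be in the filter.
        for align in range(s):
            i, matched = align, 0
            while i + s + o <= len(excerpt):
                if not bf.query(excerpt[i:i + s + o]):
                    matched = -1          # mismatch: abandon this alignment
                    break
                matched += 1
                i += s
            if matched > 0:
                return True               # all tested shingles found for this alignment
        return False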
2.3.3 Variable Block Shingling (VBS)

The use of shingling instead of offsets in a FBS method lets us avoid testing all possible offset numbers of the first block during querying, but we still have to test all possible alignments of the first block inside an excerpt (as shown in Fig. 2.6 and 2.7). Variable Block Shingling (VBS) solves this problem by setting block boundaries based on the payload itself. We slide a window of size k bytes through the whole payload and for each position of the window we compute the value of a function H(c1, . . . , ck) on the byte values of the payload. When H(c1, . . . , ck) mod m is equal to zero, we insert a block boundary immediately after the current position of byte ck. Note that we can choose to put a block boundary before or after any of the bytes ci, 1 ≤ i ≤ k, but this selection has to be fixed. Note also that for use with shingling it is better to put the boundary after the byte ck, so that the overlaps are not restricted to strings having only the special values which satisfy the above condition for boundary insertion (which can increase shingle collisions). When the function H is random and uniform, the parameter m sets the expected size of a block: for random payloads we will get a distribution of block sizes with an average size of m bytes. The drawback of this variable block size technique is that we can get many very small blocks, which can flood the Bloom filter, or some large blocks, which prevent us from querying for smaller excerpts. Therefore, we introduce an enhanced version of this scheme, EVBS, in the next section.

Figure 2.7: An example of querying a FBS method for an excerpt which is not supposed to be found (i.e., no packet containing such a string has been processed). The query processing starts by testing the Bloom filter for the presence of the first block of the query excerpt at different alignment positions. For alignment 2 the first block is found because we assume that the FBS processed the packet in Fig. 2.4 prior to executing this query. The second block for this alignment has also been found, due to a false positive answer of the Bloom filter. The third block for this alignment has not been found and therefore we continue with testing the first block at alignment 3. As there was no alignment for which all blocks were found, we report that the query excerpt was not found.

In order to save computational resources it is convenient to use a function that can reuse the computations performed for the previous position of the window to calculate the new value as we move from bytes c1, . . . , ck to c2, . . . , ck+1. Rabin fingerprints (see Section 2.2.2) have such an iterative property, and we define a fingerprint F of a substring c1 c2 . . . ck, where ci is the value of the i-th byte of the substring of a payload, as:

F(c1, . . . , ck) = (c1 p^(k−1) + c2 p^(k−2) + · · · + ck) mod M,    (2.3)

where p is a fixed prime number and M is a constant. To compute the fingerprint of the substring c2 . . . ck+1, we only need to add the last element and remove the first one:

F(c2, . . . , ck+1) = (p F(c1, . . . , ck) + ck+1 − c1 p^k) mod M.    (2.4)

Because p and k are fixed we can precompute the value of p^(k−1). It is also possible to use Rabin fingerprints as hash functions in the Bloom filter. In our implementation we use a modified scheme [9] to increase randomness without any additional computational cost:

F(c2, . . . , ck+1) = (p (F(c1, . . . , ck) + ck+1 − c1 p^k)) mod M.    (2.5)

Figure 2.8: Processing of a payload with a Variable Block Shingling (VBS) method.

The advantage of picking block boundaries using Rabin functions is that when we get an excerpt of a payload and divide it into blocks using the same Rabin function that we used for splitting during the processing of the payload, we will get exactly the same blocks. Thus, we do not have to try all possible alignments of the first block of a query excerpt as in the previous methods. The rest of this method is similar to the FBS scheme, except that instead of fixed-size blocks we have variable-size blocks depending on the payload. To process a payload we slide a window of size k bytes through the whole payload. For each of its positions we check whether the value of F modulo m is zero, and if so we set a new block boundary. All blocks are inserted with an overlap of o bytes as shown in Figure 2.8. Querying in a VBS method is the simplest of all the methods in this section because there are no offsets and no alignment problems; therefore, this method involves far fewer tests for membership in a Bloom filter. Querying for an excerpt is done in the same way as processing the payload described in the previous paragraph, but instead of inserting blocks we query the Bloom filter for them. Only when all blocks are found in the Bloom filter is the answer positive. The maximum number of queries to a Bloom filter in the worst case is about v · s times smaller than in a BBF, where v is the number of possible starting offsets and s is the number of possible alignments of the first block in a BBF, assuming the average block size in a VBS method to be s.

2.3.4 Enhanced Variable Block Shingling (EVBS)

The enhanced version of the variable block shingling method addresses a problem with block sizes. A VBS can create many small blocks, which can flood the Bloom filter and do not provide enough discriminability, or some large blocks, which can prevent querying for smaller excerpts. In an EVBS we form superblocks composed of the blocks found by the VBS method to achieve better control over the size of blocks.

Figure 2.9: Processing of a payload with an Enhanced Variable Block Shingling (EVBS) method.

To be precise, when processing a payload we slide a window of size k bytes through the entire payload and for each position of the window we compute the value of the fingerprinting function H(c1, . . . , ck) on the byte values of the payload as in the VBS method. When H(c1, . . . , ck) mod m is equal to zero, we insert a block boundary after the current position of byte ck. We take the resulting blocks of an expected size of m bytes, one by one from the start of the payload, and form superblocks, i.e., new non-overlapping blocks made of multiple original blocks, with size at least m′ bytes, where m′ ≥ m. We do this by selecting some of the original block boundaries to be the boundaries of the new superblocks: every boundary that creates a superblock of size greater than or equal to m′ is selected (see Figure 2.9, where the minimum superblock size is m′). Finally, superblocks with an overlap of o bytes into the next superblock are inserted into the Bloom filter. The maximum number of queries to a Bloom filter in the worst case is about the same as for a VBS, assuming the average block sizes for the two methods are the same.

This leads, however, to a problem when querying for an excerpt. If we use the same fingerprinting function H and parameter m we get the same block boundaries in the excerpt as in the original payload, but the starting boundary of the first superblock inside the excerpt is unknown.
Therefore, we have to try all boundaries in the first m′ bytes of an excerpt (or the first boundary that follows, if there is none) as the first boundary of the first superblock. The number of possible boundaries we have to try in an EVBS method (approximately m′/m) is much smaller than the number of possible alignments (i.e., the block size s) in an HBF, for usual parameter values.

2.3.5 Winnowing Block Shingling (WBS)

In the Winnowing Block Shingling method we use the idea of winnowing, described in Section 2.2.3, to select the boundaries of blocks, and shingling to resolve the consecutiveness of blocks. We select a winnowing window size instead of a block size and are guaranteed to have at least one boundary in any window of this size inside the payload. This also sets an upper bound on the block size. We start by computing hash values for each payload byte position. In our implementation this is done by sliding a window of size k bytes through the whole payload and computing, for each position of the window, the value of a fingerprinting function H(c1, . . . , ck) on the byte values of the payload, as in the VBS method. In this way we get an array of hashes, where the i-th element is the hash of bytes ci, . . . , ci+k−1, where ci is the i-th byte of the payload of size p, for i = 1, . . . , (p − k + 1). Then we slide a winnowing window of size w through this array and for each position of the winnowing window we put a boundary immediately before the position of the maximum hash value within this window. If more than one hash has the maximum value, we choose the rightmost one. Bytes between consecutive pairs of boundaries form blocks (plus the beginning of the next block of size o, the overlap) and these are inserted into a Bloom filter; see Figure 2.10. When querying for an excerpt we follow the same process except that we query the Bloom filter for the blocks instead of inserting them. If all blocks are found in the Bloom filter, the answer to the query is positive. The maximum number of queries to a Bloom filter in the worst case is about the same as for a VBS, assuming the average block sizes for the two methods are the same.

Figure 2.10: Processing of a payload with a Winnowing Block Shingling (WBS) method. First, we compute hash values for each payload byte position. Subsequently, boundaries are selected at the positions of the rightmost maximum hash value inside the winnowing window, which we slide through the array of hashes. Bytes between consecutive pairs of boundaries form blocks (plus the overlap).

There is at least one block boundary in any window of size w. Therefore, the longest possible block size is w + 1 + o bytes. This also guarantees that there are always at least two boundaries to form a block in an excerpt of size at least 2w + o bytes.

2.3.6 Variable Hierarchical Bloom Filter (VHBF)

Querying an HBF involves trying s possible starting positions of the first block in an excerpt. In the VHBF method we avoid this by splitting the payload into variable-sized blocks (see Figure 2.11) determined by a fingerprinting function as in Section 2.3.3 (VBS). Building the hierarchy, inserting the blocks, and querying are the same as in the original Hierarchical Bloom Filter; only the definition of block boundaries has changed. Reducing the number of queries by a factor of s (the size of one block in an HBF) helps reduce the resulting false positive rate of this method.
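The content-defined boundary selection shared by VBS, VHBF and the other variable-block methods can be sketched as follows; the function name, the default parameter values and the simplified handling of the payload head and tail are our own illustrative assumptions:

    def vbs_blocks(payload: bytes, k: int, m: int, o: int,
                   p: int = 101, M: int = (1 << 31) - 1):
        # Rolling Rabin fingerprint of the last k bytes, as in Eq. (2.3)/(2.4);
        # a boundary is placed right after byte ck whenever F mod m == 0.
        pk1 = pow(p, k - 1, M)            # precomputed p^(k-1)
        boundaries, f = [], 0
        for i, c in enumerate(payload):
            if i < k:
                f = (f * p + c) % M
            else:
                f = ((f - payload[i - k] * pk1) * p + c) % M
            if i >= k - 1 and f % m == 0:
                boundaries.append(i + 1)  # boundary immediately after this byte
        # Bytes between consecutive boundaries form a block, plus an o-byte overlap.
        return [payload[a:b + o] for a, b in zip(boundaries, boundaries[1:])]

Querying an excerpt with the same parameters reproduces exactly the same boundaries, which is why no alignment search is needed.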
Notice that even if we added overlaps between blocks (i.e., use shingling) we would still need to use offsets, because they work as a way of determining whether to check the next level of the hierarchy during the query phase because we check the next level only for even offset numbers. 38 Figure 2.11: Processing of a payload with a Variable Hierarchical Bloom Filter (VHBF) method. 2.3.7 Fixed Doubles (FD) The method of fixed doubles is designed to address a shortcoming of a hierarchy in an HBF. The hierarchy in an HBF is not complete in a sense that we do not insert all double-blocks (blocks of size 2s) and all quadruple-blocks (4s), and so on, into the Bloom filter. For example, when inserting a packet consisting of blocks S0 S1 S2 S3 S4 into an HBF, we insert blocks S0 ,. . . , S4 , S0 S1 , S2 S3 , and S0 S1 S2 S3 . And if we query for an excerpt of size 2s (or up to size 4s − 2 bytes, Figure 2.3), for example, S1 S2 , this block of size 2s is not found in the HBF (and all other double-blocks at odd offsets) and the false positive rate is worse than the one of a BBF in this case, because we need about two times more space for an HBF compared to a BBF with the same false positive rate of the Bloom filter. The same is true for other levels of the hierarchy. In fact, the probability that this event happens rises exponentially with the level number. As an alternative approach to the hierarchy we insert all double-blocks as shown in Figure 2.12, but do not continue to the next level to not increase the storage requirements. Note that this method is not identical to a FBS scheme with an overlap of size s because in a FD we insert all single blocks and also all double blocks. In this method we neither use shingling nor offsets, because the consecutiveness problem is solved by the level of all double-blocks which overlap with each other, by half of the size 39 with the previous one and the second half overlaps with the next one. Figure 2.12: Processing of a payload with a Fixed Doubles (FD) method. The query mechanism works as follows: We first find the correct alignment of the first block of an excerpt by trying to query the Bloom filter for all windows of size s starting at positions 0 through s − 1. Note that we can get multiple positive answers and in that case we continue the process independently for all of them. Then we split the excerpt into blocks of size s starting at the position found and query for each of them. Finally, we query for all double-blocks and when all answers were positive we claim that the excerpt was found. The FD scheme inserts 2⌊p/s⌋ − 1 blocks into the Bloom filter for a payload of size p bytes, which is approximately the same as an HBF and about two times more than an FBS scheme. The maximum number of queries to a Bloom filter in the worst case is about two times the number for a FBS method. 2.3.8 Variable Doubles (VD) This method is similar to the previous one (FD) but the block boundaries are determined by a fingerprinting function as in a VBS (Section 2.3.3). Hence, we do not have to try finding the correct alignment of the first block of an excerpt when querying and the blocks have variable size. Both the number of blocks inserted into the Bloom filter and the maximum number of queries in the worst case are approximately the same as for the FD scheme. An example is given in Figure 2.13. 40 Figure 2.13: Processing of a payload with a Variable Doubles (VD) method. 
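A sketch of the Fixed Doubles insertion described above, under the same assumed Bloom filter interface; a Variable Doubles variant would simply take its block list from a content-defined splitter (such as the vbs_blocks sketch) instead of using fixed s-byte boundaries:

    def fd_insert(bf, payload: bytes, s: int):
        # Fixed Doubles: insert every block of size s and every double-block of
        # size 2s starting at a block boundary, so consecutive doubles overlap by
        # half their length; no offsets and no shingles are needed.
        n = len(payload) // s
        for i in range(n):
            bf.insert(payload[i * s:(i + 1) * s])
        for i in range(n - 1):
            bf.insert(payload[i * s:(i + 2) * s])

This inserts n + (n − 1) = 2⌊p/s⌋ − 1 elements, matching the count given above.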
Similar to the FD method we use neither shingling nor offsets, because the consecutiveness problem is solved by the level of all double-blocks, which overlap with each other and with the single blocks. During querying we simply divide the query excerpt into blocks by a fingerprinting method and query the Bloom filter for all blocks and for all double-blocks. Finally, if all answers are positive we claim that the excerpt is found.

2.3.9 Enhanced Variable Doubles (EVD)

The Enhanced Variable Doubles method uses the technique from Section 2.3.4 (EVBS) to create an extension of a VD method by forming superblocks of a payload. Then these superblocks are treated the same way as blocks in a VD method. Thus, we insert all superblocks and all doubles of these superblocks into the Bloom filter as shown in Figure 2.14. The number of blocks inserted into the Bloom filter as well as the maximum number of queries in the worst case is similar to that of the VD method (assuming similar average block sizes of both schemes).

Figure 2.14: Processing of a payload with an Enhanced Variable Doubles (EVD) method.

2.3.10 Multi-Hashing (MH)

One would imagine that the technique of VBS would be strengthened by using multiple independent VBS methods because it provides greater flexibility in the choice of parameters, such as the expected block size. We call this technique Multi-Hashing; it uses t independent fingerprinting methods (or fingerprinting functions with different parameters) to set block boundaries as shown in Figure 2.15. It is equivalent to using t independent Variable Block Shingling methods, and the answer to excerpt queries is positive only if all the t methods answer positively. Note that even if we set the overlapping part, i.e., the parameter o, to zero for all instances of the VBS, we would still get a guarantee that the excerpt has appeared on the network as one continuous fragment. Moreover, by using expected block sizes of the instances as multiples of powers of two we can generate a hierarchical structure with the MH method. The expected number of blocks inserted into the Bloom filter for a payload of size p bytes is Σ_{i=1}^{t} ⌊p/mi⌋, where mi is the expected block size for the i-th VBS.

2.3.11 Enhanced Multi-Hashing (EMH)

The enhanced version of Multi-Hashing uses multiple instances of EVBS to increase the certainty of answers to excerpt queries. Blocks inserted by independent instances of EVBS are different and overlap with each other, and therefore improve the robustness of this method. Other aspects than the superblock formation are the same as for the MH method. In our experiments in Section 2.5 we use two independent instances of EVBS with identical parameters and store the data for both in one Bloom filter.

Figure 2.15: Processing of a payload with a Multi-Hashing (MH) method. In this case, the MH uses two independent instances of the Variable Block Shingling method simultaneously to process the payload.

2.3.12 Winnowing Multi-Hashing (WMH)

The WMH method uses multiple instances of WBS (Section 2.3.5) to reduce the probability of false positives for excerpt queries. The WMH not only gives excellent control over the block sizes due to winnowing (see Figure 2.20) but also provides much greater confidence about the consecutiveness of the blocks inside the query excerpt, because of overlaps both inside each instance of WBS and among the blocks of multiple instances.
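A minimal sketch of that combination; the wbs_instances objects and their blocks() helper are hypothetical stand-ins for t independently parameterized WBS splitters:

    def wmh_query(bf, excerpt: bytes, wbs_instances) -> bool:
        # Winnowing Multi-Hashing: every one of the t WBS instances must find all
        # of its blocks (with overlaps) in the Bloom filter for a positive answer.
        for wbs in wbs_instances:
            blocks = wbs.blocks(excerpt)      # hypothetical per-instance splitter
            if not blocks or not all(bf.query(b) for b in blocks):
                return False
        return True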
Both querying and payload processing are done for all t WBS instances, and the final answer to an excerpt query is positive only if all t answers are positive. In our experiments in Section 2.5 we use two instances of WBS with identical winnowing window sizes and store the data from both methods in one Bloom filter. By storing the data of each instance in a separate Bloom filter we can allow data aging, saving space by keeping only some of the Bloom filters for very old data at the cost of higher false positive rates. For an example of processing a payload and querying in a WMH method, see the multi-packet case in Fig. 2.16 and 2.17.

2.4 Payload Attribution Systems Challenges

As mentioned in the introduction, a payload attribution system performs two separate tasks: payload processing and query processing. In payload processing, the payload of all traffic that passes through the network where the PAS is deployed is examined and some information is saved to permanent storage. This has to be done at line speed, and the underlying raw packet capture component can also perform some filtering of the packets, for example, choosing to process only traffic of a particular protocol (HTTP, FTP, SMTP, etc.). Data is stored in archive units, each of which has two timestamps (the start and end of the time interval during which the data was collected). For each time interval we also need to save all flowIDs (e.g., pairs of source and destination IP addresses) to allow querying later on. This information can alternatively be obtained from connection records collected by firewalls, intrusion detection systems or other log files. During query processing, given an excerpt and a time interval of interest, we retrieve all the corresponding archive units from storage. We query each unit for the excerpt and, if we get a positive answer, we successively query for each of the flowIDs appended to the blocks of the excerpt and report all matches to the user.

2.4.1 Attacks on PAS

As with any security system, there are ways an adversary can evade proper attribution. We identify the following types of attacks on a PAS (mostly similar to those in [49]):

Compression and Encryption
If the payload is compressed or encrypted, a PAS allows querying only for the exact compressed or encrypted form.

Fragmentation
An attacker can transform the stream of data into a sequence of packets with payload sizes much smaller than the (average) block size used by the PAS. Methods with variable block sizes, where block boundaries depend on the payload, are harder to beat, but for very small fragments, for example 6 bytes each, the system will not be able to do the attribution correctly. A solution is to make the PAS stateful so that it concatenates the payloads of one data stream prior to processing. However, such a solution would impose additional memory and computational costs, and there are known attacks on stateful IDS systems [23], such as incorrect fragmentation and timing attacks.

Boundary Selection Hacks
For methods with block boundaries depending on the payload, an attacker can try to send special packets whose payload contains too many or no boundaries. The PAS can use different parameters for the boundary selection algorithm for each archive unit so that it would be impossible for an attacker to fool the system. Moreover, winnowing guarantees at least one boundary in each winnowing window.
Hash Collisions Hash collisions are very unpredictable and therefore hard to use by an attacker because we use different salt for the hash computation in each Bloom filter. Stuffing An attacker can inject some characters into the payload which are ignored by applications but in the network layer they change the payload structure. Our methods are robust against stuffing because the attacker has to modify most of the payload to avoid correct attribution as we can match even very small excerpts of payload. Resource Exhaustion Flooding attacks can impair a PAS. However, our methods are more robust to these attacks than raw packet loggers due to the data reduction they provide. Moreover, processing identical payloads repeatedly does not impact the precision of attribution because the insertion into a Bloom filter is an idempotent operation. On the other hand, the list of 45 flowIDs is vulnerable to flooding, for example, when a worm tries to propagate out of the network by trying many random destination addresses. Spoofing Source IP addresses can be spoofed and a PAS is primarily concerned with attributing payload according to what packets have been delivered by the network. The scope of possible spoofing depends on the deployment of the system and filtering applied in affected networks. 2.4.2 Multi-packet queries The methods described in Section 2.3 show how to query for excerpts inside one packet’s payload. Nevertheless, we can extend the querying mechanism to handle strings that span multiple packets. Methods which use offsets have to continue querying for the next block, which was not found with its sequential offset number, with a zero offset instead and try all alignments of that block as well because the fragmentation into packets could leave out some part smaller than the block size at the end of the first packet. This is very inefficient and increases the false positive rate. Moreover, for methods that form a hierarchy of blocks it means that it cannot be fully utilized. The payload attribution system can do TCP stream reconstruction and work on the reconstructed flow to fix it. On the other hand, methods using shingling can be extended without any further changes if we return as an answer to the query the full sequence of blocks found (see a WMH example in Fig. 2.16 and 2.17). 2.4.3 Privacy and Simple Access Control Processing and archiving payload information must comply with the privacy and security policies of the network where they are performed. Furthermore, authorization to use the payload attribution system should be granted only to properly authorized parties and all necessary precautions must be taken to minimize the possibility of a system compromise. 46 Figure 2.16: Processing of payloads of two packets with a Winnowing Multi-Hashing (WMH) method where both packets are processed by two independent instances of the Winnowing Block Shingling (WBS) method simultaneously. Figure 2.17: Querying for an excerpt spanning multiple packets in a Winnowing MultiHashing (WMH) method comprised of two instances of WBS. We assume the WMH method processed the packets in Fig. 2.16 prior to querying. In this case, we see that WMH can easily query for an excerpt spanning two packets and that the blocks found significantly overlap which increases the confidence of the query result. However, there is still a small gap between the two parts because WMH works on individual packets (unless we perform TCP stream reconstruction). The privacy stems from using a Bloom filter to hold the data. 
It is only possible to query the Bloom filter for a specific packet content but it cannot be forced to provide a list of packet data stored inside. Simple access control (i.e., restricting the ability to query the Bloom filter) can be easily achieved as follows. Our methods allow the collected data to be stored and queried by an untrusted party without disclosing any payload information nor giving the query engine any knowledge of the contents of queries. We achieve this by adding a secret salt when computing hashes for insertion and querying the Bloom filter. A different salt is used for each Bloom filter and serves the purpose of a secret key. We can also easily achieve much finer granularity of access control by using different keys for different protocols or subranges of IP address space. Without a key the Bloom filter cannot be queried and the key doesn’t have to be made available to the querier (only the indices of bits for which we want to query are disclosed). Without knowing the key a third party cannot query the 47 Bloom filter. However, additional measures must be taken to enforce that the third party provides correct answers and does not alter the archived data. Also note that this kind of access control is not cryptographically secure and some information leakage can occur. On the other hand, there is no additional computational or storage cost associated with using it and also no need for decrypting the data before querying as is common with standard techniques. A detailed analysis of privacy achieved by using Bloom filters can be found in [5]. 2.4.4 Compression In addition to the inherent data reduction provided by our attribution methods due to the use of Bloom filters, our experiments show that we can achieve another about 20 percent storage savings by compressing the archived data (after careful optimization of parameters [34]), for example by gzip. The results presented in the next section do not include this additional compression. 2.5 Experimental Results In this section we show performance measurements of payload attribution methods described in Section 2.3 and discuss the results from various perspectives. For this purpose we collected a network trace of 4 GB of HTTP traffic from our campus network. For performance evaluation throughout this section we consider processing 3.1 MB segment (about 5000 packets) of the trace as one unit trace collected during one time interval. As discussed earlier, we store all network traffic information in one Bloom filter in memory and we save it to a permanent storage in predefined time intervals or when it becomes full. A new Bloom filter is used for each time interval. The time interval should be short because it determines the time precision at which we can attribute packets, that is, we can determine only in which time interval a packet appeared on the network, not the exact time. The results presented do not depend on the size of the unit because we use a data reduction ratio to set the Bloom filter size (for example 100:1 means a Bloom filter of size 31 kB). Each method uses one Bloom filter of an equal size to store all data. Our results did not show any 48 deviations depending on the selection of the segment within the trace. All methods were tested to select the best combination of parameters for each of them. Results are grouped into subsections by different points of interest. 2.5.1 Performance Metrics To compare payload attribution methods we consider several aspects which are not completely independent. 
The first and most important aspect is the amount of storage space a method needs to allow querying with a false positive rate bounded by a pre-defined value. We provide a detailed comparison and analysis in the following subsections. Second, the methods differ in the number of elements they insert into a Bloom filter when processing packets and also in the number of queries to a Bloom filter performed when querying for an excerpt in the worst case (that is, when the answer is negative). A summary can be found in Table 2.2. Methods which use shingling and a variable block size achieve a significant decrease in the number of queries they have to perform to analyze each excerpt. This is important not only for the computational performance but also for the resulting false positive rate, as each query to the Bloom filter carries a risk of a false positive answer. The boundary selection techniques these methods use are very computationally efficient and can be performed in a single pass through the payload. The implementation can be highly optimized for a particular platform, and part of the processing can also be done by special hardware. Our implementation running on a Linux-based commodity PC (with a kernel modified for fast packet capturing [37]) can smoothly handle 200 Mbps, and the processing can easily be split among multiple machines (e.g., by having each machine process packets according to a hash value of the packet header).

Table 2.2: Comparison of payload attribution methods from Section 2.3 based on the number of elements inserted into a Bloom filter when processing one packet of a fixed size and the number of blocks tested for presence in a Bloom filter when querying for an excerpt of a fixed size in the worst case (i.e., when the answer is negative). Note that the values are approximations and we assume all methods have the same average block size. The variables refer to: n: the number of blocks inserted by a BBF, taken as a base; s: the size of a block in fixed block size methods (BBF, HBF, FBS, FD); v: the number of possible offset numbers; p: the number of alignments tested for enhanced methods (EVBS, EVD, EMH). Note that the actual number of bits tested or set in a Bloom filter depends on the number of hash functions used for each method, and therefore this table presents numbers of blocks.

2.5.2 Block Size Distribution

The graphs in Figures 2.18, 2.19 and 2.20 show the distributions of block sizes for three different methods of block boundary selection. We use a block (or winnowing window) size parameter of 32 bytes, a small block size for an EVBS of 8 bytes, and an overlap of 4 bytes. Both VBS and EVBS show a distribution with an exponential decrease in the number of blocks with an increasing block size, shifted by the overlap size for a VBS or by the block size plus the overlap size for an EVBS. Long tails were cropped for clarity; the longest block was 1029 bytes long. On the other hand, the winnowing method results in a quite uniform distribution where the block sizes are bounded by the winnowing window size plus the overlap. The apparent peaks for the smallest block size in Figures 2.18 and 2.20 are caused by low-entropy payloads, such as long blocks of zeros. The distributions of block sizes obtained by processing random payloads generated with the same payload sizes as in the original trace show the same distributions, just without these peaks. Nevertheless, the huge number of small blocks does not significantly affect the attribution, because inserting a block into the Bloom filter is an idempotent operation.

Figure 2.18: The distribution of block sizes for the VBS method after processing 100000 packets of HTTP traffic.

Figure 2.19: The distribution of block sizes for the EVBS method after processing 100000 packets of HTTP traffic.

Figure 2.20: The distribution of block sizes for the WBS method after processing 100000 packets of HTTP traffic.

2.5.3 Unprocessed Payload

Some fraction of each packet's payload is not processed by the attribution mechanisms presented in Section 2.3. Table 2.3(b) shows how each boundary selection method affects the percentage of unprocessed payload. For methods with a fixed block size, the part of a payload between the last block boundary and the end of the payload is ignored by the payload attribution system. With (enhanced) Rabin fingerprinting and winnowing methods, the part starting at the beginning of the payload and ending at the first block boundary, and the part between the last block boundary and the end of the payload, are not processed. The enhanced version of Rabin fingerprinting achieves much better results because the small block size, which was four times smaller than the superblock size in our test, applies when selecting the first block boundary in a payload. Winnowing performs better than the other methods with a variable block size in terms of unprocessed payload. Note also that a WMH method, even though it uses winnowing for block boundary selection just as a WBS does, has an about t times smaller percentage of unprocessed payload than a WBS, because each of the t instances of WBS within the WMH covers a different part of the payload independently. Moreover, the "inner" part of the payload is covered t times, which makes the method much more resistant to collisions because t collisions have to occur at the same time to produce a false positive answer. For large packets the small percentage of unprocessed payload does not pose a problem; however, for very small packets, for example only 6 bytes long, it means that they are possibly not processed at all. Therefore, we can optionally insert the entire payload of a packet in addition to inserting all blocks and add a special query type to the system to support queries for exact packets. This increases the storage requirements only slightly, because we would insert one additional element per packet into the Bloom filter.

Table 2.3: (a) False positive rates for a data reduction ratio of 130:1 for a WMH method with a winnowing window size of 64 bytes and therefore an average block size of about 32 bytes. The table summarizes answers to 10000 queries for each query excerpt size. All 10000 answers should be NO; YES answers are due to false positives inherent to a Bloom filter. WMH guarantees no N/A answers for these excerpt sizes. (b) The percentage of unprocessed payload of 50000 packets depending on the block boundary selection method used. Details are provided in Sections 2.5.2 and 2.5.3.

Table 2.4: Measurements of false positive rate for a data reduction ratio of 50:1. The table summarizes answers to 10000 excerpt queries using all methods (with block size 32 bytes) described in Section 2.3 for various query excerpt lengths (top row).
These queries were performed after processing a real packet trace, and all methods use the same size of Bloom filter (50 times smaller than the trace size). All 10000 answers should be NO since these excerpts were not present in the trace. YES answers are due to false positives in the Bloom filter, and N/A answers mean that there were no boundaries selected inside the excerpt to form a block for which we can query.

Figure 2.21: The graph shows the number of correct answers to 10000 excerpt queries for a varying length of a query excerpt for each method (with block size or winnowing window size 64 bytes) and data reduction ratio 100:1. This reduction ratio can be further improved to about 120:1 by using a compressed Bloom filter. The WMH method has no false positives for excerpt sizes 250 bytes and longer. The previous state-of-the-art method HBF does not provide any useful answers at this high data reduction ratio for excerpts shorter than about 400 bytes.

2.5.4 Query Answers

To measure and compare the performance of the attribution methods, and in particular to analyze the false positive rate, we processed the trace by each method and queried for random strings which included a small excerpt of size 8 bytes in the middle that did not occur in the trace. In this way we made sure that these query strings did not represent payloads inserted into the Bloom filters. Every method has to answer either YES, NO or answer not available (N/A) to each query. A YES answer means a match was found for the entire query string for at least one of the flowIDs and represents a false positive. A NO answer is the correct answer for the query, and an N/A answer is returned if the blocking mechanism specific to the method did not select even one block inside the query excerpt. The N/A answer can occur, for example, when the query excerpt is smaller than the block size in an HBF, or when there was one or no boundary selected in the excerpt by a VBS method and so there was no block to query for. In Table 2.1 we summarize the possibility of getting N/A and false negative answers for each method. A false negative answer can occur when we query for an excerpt which is larger than the block size but none of the alignments of blocks inside the excerpt fits the alignment that was used to process the payload which contained the excerpt. For example, if we processed a payload ABCDEFGHI by an HBF with block size 4 bytes, we would have blocks ABCD, EFGH, ABCDEFGH, and if we queried for an excerpt BCDEFG, the HBF would answer NO. Note that false negatives can occur only for excerpts smaller than twice the block size and only for methods which involve testing the alignment of blocks. Table 2.4 provides detailed results of 10000 excerpt queries for all methods with the same storage capacity and data reduction ratio 50:1. The WMH method achieves the best results among the listed methods for all excerpt sizes, and for excerpts longer than 200 bytes it has no false positives. The WMH also excels in the number of N/A answers among the methods with a variable block size because it guarantees at least one block boundary in each winnowing window. The results also show that methods with a variable block size are in general better than methods with a fixed block size because there are no problems with finding the right alignment. The enhanced version of Rabin fingerprinting for block boundary selection does not perform better than the original version.
This is mostly because we need to try all alignments of blocks inside superblocks when querying, which increases the false positive rate. These enhanced methods therefore improve the control over the block size, but not necessarily the false positive rate. The graph in Figure 2.21 shows the number of correct answers to 10000 excerpt queries as a function of the length of the query excerpt for each method, with the block size parameter set to 64 bytes and data reduction ratio 100:1. The WMH method outperforms all other methods for all excerpt lengths. On the other hand, the HBF's results are the worst because it can fully utilize the hierarchy only for long excerpts and it has a very high false positive rate for high data reduction ratios due to the use of offsets, problems with block alignments, and the large number of elements inserted into the Bloom filter. As can be observed when comparing the HBF's results to those of an FD method, using double-blocks instead of building a hierarchy is a significant improvement, and an FD, for excerpt sizes of 220 bytes and longer, performs even better than a variable block size version of the hierarchy, a VHBF. It is interesting to see that an HBF outperforms a VHBF for excerpts of length 400 bytes. For excerpts longer than 400 bytes (not shown) they perform about the same and both quickly achieve no false positives. The results of the methods in this graph are clearly separated into two groups by performance: the curves representing the methods which use a variable block size and do not use offset numbers or superblocks (i.e., VBS, VD, WBS, MH, WMH) have a concave shape and in general perform better. For very long excerpts all methods provide highly reliable results. Winnowing Multi-Hashing achieves the best overall performance in all our tests and allows querying for very small excerpts, because the average block size is approximately half of the winnowing window size (plus the overlap size). The average block size was 18.9 bytes for a winnowing window size of 32 bytes and an overlap of 4 bytes. Table 2.3(a) shows the false positive rates for WMH for a data reduction ratio of 130:1. This data reduction ratio means that the total size of the processed payload was 130 times the size of the Bloom filter which is archived to allow querying. The Bloom filter could be additionally compressed to achieve a final compression of about 158:1, but note that the parameters of the Bloom filter (i.e., the relation among the number of hash functions used, the number of elements inserted and the size of the Bloom filter) have to be set in a different way [34]. We have to decide in advance whether we want to use the additional compression of the Bloom filter and, if so, optimize the parameters for it; otherwise the Bloom filter data would have very high entropy and would be hard to compress. The compression is possible because the approximate representation of a set by a standard Bloom filter does not reach the information theoretic lower bound [10]. The winnowing block shingling method (WBS) performs almost as well as WMH (see Fig. 2.21) and requires t times less computation. However, the confidence of its results is lower than with WMH because multi-hashing covers the majority of the query excerpts multiple times; if storage space is needed, a data aging method can be used to downgrade a WMH to a simple WBS later.

2.6 Conclusion

In this chapter, we presented novel methods for payload attribution.
When incorporated into a network forensics system they provide an efficient probabilistic query mechanism to answer queries for excerpts of a payload that passed through the network. Our methods allow data reduction ratios greater than 100:1 while having a very low false positive rate. They allow queries for very small excerpts of a payload and also for excerpts that span multiple packets. The experimental results show that our methods represent a significant improvement in query accuracy and storage space requirements compared to previous attribution techniques. More specifically, we found that winnowing represents the best technique for block boundary selection in payload attribution applications, that shingling as a method for consecutiveness resolution is a clear win over the use of offset numbers, and, finally, that the use of multiple instances of payload attribution methods can provide additional benefits, such as improved false positive rates and a data-aging capability. These techniques combined together form a payload attribution method called Winnowing Multi-Hashing which substantially outperforms previous methods. The experimental results also confirm that, in general, the accuracy of attribution increases with the length and the specificity of a query. Moreover, privacy and simple access control are achieved by the use of Bloom filters and one-way hashing with a secret key. Thus, even if the system is compromised, no raw traffic data is ever exposed, and querying the system is possible only with knowledge of the secret key. We believe that these methods also have a much broader range of applicability in various areas where large amounts of data are being processed.

In addition to network payload data, the systems used for network monitoring and forensics make use of network flow data, which represents structured information about each connection record. This information, including the flowID (with source IP and destination IP), can be stored and queried efficiently by storing the data in tables similar to traditional database systems. However, as shown in [19, 55], when working with large amounts of network flow data, the performance of the storage system may be influenced by how the data is stored on disk, row-by-row or column-by-column, and by the characteristics of the expected query workload. In the next chapter we analyze the existing systems used to store network flow data and propose an efficient storage infrastructure that can be used for monitoring and forensics along with a payload attribution system that incorporates the methods presented in Section 2.3 of this chapter.

Chapter 3
A Storage Infrastructure for Network Flow Data

3.1 Introduction

In the previous chapter we defined the payload attribution problem and presented a suite of payload attribution methods that can be used as modules in a payload attribution system. In order to provide meaningful attribution results a payload attribution system has to use the network flow identifiers (or flowIDs) corresponding to a time interval in the past. However, maintaining the flowIDs in a simple list as described in the previous chapter cannot provide satisfactory runtime performance for queries accessing large amounts of network flow data spanning long periods of time. In such a case more efficient methods are needed to store network flowID information.
Additionally, network monitoring systems that traditionally were designed to detect and flag malicious or suspicious activity in real time are increasingly providing the ability to assist in network forensic investigation and to identify the root cause of a security breach. This may involve checking a suspected host's past network activity, looking up any services run by a host, protocols used, the connection records to other hosts that may or may not be compromised, etc. This new workload requires flexible and fast access to increasing amounts of historical network flow data.

In this chapter we present the design, implementation details and evaluation of a column-oriented storage infrastructure called NetStore, designed to store and analyze very large amounts of network flow data. The proposed system can be used in conjunction with other network monitoring and forensics systems, such as a payload attribution system using the methods presented in the previous chapter, in order to make informed security decisions. Recall from Chapter 1 that we refer to a flow as a unidirectional data stream between two endpoints and to a flow record as a quantitative description of a flow. In general we refer to a flowID as the key that uniquely identifies a flow. In the previous chapter we defined a flowID as being composed only of the source and destination IPs, since only those items were required by the payload attribution system. However, when designing the storage infrastructure for network flow data we consider the flowID as being composed of five attributes: source IP, source port, destination IP, destination port and protocol. We assume that each flow record has an associated start time and end time representing the time interval when the flow was active in the network.

Challenges

Network flow data can grow very large in the number of records and in storage footprint. Figure 3.1 and Figure 3.2 show the network flow distribution of traffic captured from edge routers in a moderate sized campus network for a day and a month, respectively. This network, with about 3,000 hosts, commonly reaches up to 1,300 flows/second, an average of 53 million flows daily and roughly 1.7 billion flows in a month. We consider records with an average size of 200 bytes. Besides CISCO NetFlow data [57] there may be other specific information that a sensor can capture from the network, such as the IP, transport and application header information. Hence, in this example, the storage requirement is roughly 10 GB of data per day, which adds up to at least 310 GB per month. When working with large amounts of disk resident data, the main challenge is no longer to ensure the necessary storage space, but to minimize the time it takes to process and access the data. An efficient storage and querying infrastructure for network records has to cope with two main technical challenges: keep the insertion rate high, and provide fast access to the desired flow records.

Figure 3.1: Network flow traffic distribution for one day. In a typical day the busiest time interval is 1PM - 2PM with 4,381,876 flows, and the slowest time interval is 5AM - 6AM with 978,888 flows.

When using a traditional row-oriented Relational Database Management System (RDBMS), the relevant flow attributes are inserted as a row into a table as they are captured from the network, and are indexed using various techniques [18].
On the one hand, such a system has to establish a trade-off between the desired insertion rate and the storage and processing overhead imposed by the use of auxiliary indexing data structures. On the other hand, enabling indexing for more attributes ultimately improves query performance but also increases the storage requirements and decreases insertion rates. At query time, all the columns of the table have to be loaded in memory even if only a subset of the attributes is relevant for the query, adding a significant I/O penalty to the overall query processing time by loading unused columns. Therefore, when querying disk resident data, an important problem is to overcome the I/O bottleneck specific to large disk-to-memory data transfers. One partial solution is to load only data that is relevant to the query. For example, to answer the query "What is the list of all IPs that contacted IP X between dates d1 and d2?", the system should load only the source and destination IPs as well as the timestamps of the flows that fall between dates d1 and d2. The I/O time can also be decreased if the accessed data is compressed, since less data traverses the disk-to-memory boundary. Further, the overall query response time can be improved if data is processed in compressed format, thus saving the decompression time. Therefore, the system should use the optimal data access and compression strategies and algorithms so as to achieve the best compression ratios while decompression remains very fast or unnecessary at query time. Since the system should insert elements at line speed, all the preprocessing algorithms used should add negligible overhead to the disk writing process. Considering all the above requirements and implications, a column oriented storage architecture seems to be a good fit for storing network flow data captured over prolonged periods of time.

Figure 3.2: Network flow traffic distribution for one month. For a typical month we noticed the slowdown in weekends and the peak traffic on weekdays. Days marked with * correspond to a break week.

Column Stores Overview

As briefly presented in Section 1.3.2 in Chapter 1, the basic idea of column orientation is to store the data by columns rather than by rows, where each column holds data for a single attribute of the flow and is stored sequentially on disk. Such a strategy makes the system I/O efficient for read queries since only the attributes required by a query need to be read from disk. It is widely accepted both in academic communities [1,2,25,55,60] as well as in industry [13,27,30,58] that column-stores provide better performance than row-stores for analytical query workloads. However, most commercial and open-source column stores were conceived to follow general purpose RDBMS requirements, do not fully use the semantics of the data carried, and do not take advantage of the specific types and data access patterns of network forensic and monitoring queries. In this chapter we present the design, implementation details and evaluation of NetStore, a column-oriented storage infrastructure for network records that, unlike the other systems, is intended to provide good performance for network flow data.

Contribution

NetStore is the implementation of a network flow data storage infrastructure that can work jointly with other systems that process massive amounts of data, such as the ones described in [7,12,15,56].
The column based storage is similar to a column oriented database and partitions the network attributes in columns, one column for each attribute. Each column holds the data of the same type and therefore can be heavily compressed. The compression algorithms used depend on a set of features extracted from the data. We provide the compression methods description and selection strategy later in Section 3.3.3. NetStore is designed and implemented to facilitate efficient interaction with various security applications such as Firewalls, Network Intrusion Detection Systems (NIDSs) and other security administrative tools as shown in Figure 3.3. Based on our knowledge, NetStore is the first attempt to consider a column oriented storage design, special tuned for network flow historical data. The key contributions of this chapter include the following: • Simple and efficient column oriented design of NetStore that enables quick access to 63 large amounts of data for monitoring and forensic analysis. • Efficient compression methods and selection strategies to facilitate the best compression for network flow data, that allow accessing and querying data in compressed format. • Implementation and deployment of NetStore using commodity hardware and open source software as well as analysis and comparison with other open source storage systems used currently in practice. The rest of the chapter is organized as follows: we present related work in Section 3.2, NetStore system architecture and the details of each component in Section 3.3. Experimental results and evaluation are presented in Section 3.4 and we conclude the chapter in Section 3.5. 3.2 Related Work The problem of discovering network security incidents has received significant attention over the past years. Most of the work done has focused on near-real time security event detection, by improving existing security mechanisms that monitor traffic at a network perimeter and block known attacks, detect suspicious network behavior such as network scans, or malicious binary transfers [38, 46]. Other systems such as Tribeca [56] and Gigascope [15], use stream databases and process network data as it arrives but do not store the date for retroactive analysis for long periods of time. There has been some work done to store network flow records using a traditional RDBMS such as PostgreSQL [18]. Using this approach, when a NIDS triggers an alarm, the database system builds indexes and materialized views for the attributes that are the subject of the alarm, and could potentially be used by forensics queries in the investigation of the alarm. The system works reasonably well for small networks and is able to help forensic analysis for events that happened over the last few hours. However, queries for traffic spanning more than a few hours become I/O bound and the auxiliary data used to speed up the queries slows down the record insertion process. Therefore, such a solution is not feasible for medium to large networks and not 64 even for small networks in the future, if we consider the accelerated growth of internet traffic. Additionally, a time window of several hours is not a realistic assumption when trying to detect the behavior of a complex botnet engaged in stealthy malicious activity over prolonged periods of time. In the database community, many researchers have proposed the physical organization of database storage by columns in order to cope with poor read query performance of traditional row-based RDBMS [13, 30, 52, 55, 60]. 
As shown in [2, 22, 25, 55], a column store provides many times better performance than a row store for read intensive workloads. In [60] the focus is on optimizing the cache-RAM access time by decompressing data in the cache rather than in the RAM. This system assumes the working columns are RAM resident, and shows a performance penalty if data has to be read from the disk and processed in the same run. The solution in [55] relies on processing parallelism by partitioning data into sets of columns, called projections, indexed and sorted together, independent of other projections. This layout has the benefit of rapid loading of the attributes belonging to the same projection and referred to by the same query without the use of auxiliary data structure for tuple reconstruction. However, when attributes from different projections are accessed, the tuple reconstruction process adds significant overhead to the data access pattern. The system presented in [52] emphasizes the use of an auxiliary metadata layer on top of the column partitioning that is shown to be an efficient alternative to the indexing approach. However, the metadata overhead is sizable and the design does not take into account the correlation between various attributes. Finally, in [25] authors present several factors that should be considered when one has to decide to use a column store versus a row store for a read intensive workload. The relative large number of network flow attributes and the workloads with the predominant set of queries with large selectivity and few predicates favor the use of a column store system for historical network flow records storage. NetStore is a column oriented storage infrastructure that shares some of the features with the other systems, and is designed to provide the best performance for large amounts of disk resident network flow records. It avoids tuple reconstruction overhead by keeping at all times the same order of elements in all columns. It provides fast data insertion and 65 quick querying by dynamically choosing the most suitable compression method available and using a simple and efficient design with a negligible meta data layer overhead. 3.3 Architecture In this section we describe the architecture and the key components of NetStore. We first present the characteristics of network data and query types that guide our design. We then describe the technical design details: how the data is partitioned into columns, how columns are partitioned into segments, what are the compression methods used and how a compression method is selected for each segment. We finally present the metadata associated with each segment, the index nodes, and the internal IPs inverted index structure, as well as the basic set of operators. Figure 3.3: NetStore main components: Processing Engine and Column-Store. 66 3.3.1 Network Flow Data Network flow records and the queries made on them show some special characteristics compared to other time sequential data, and we tried to apply this knowledge as early as possible in the design of the system. First, flow attributes tend to exhibit temporal clustering, that is, the range of values is small within short time intervals. Second, the attributes of the flows with the same source IP and destination IP tend to have the same values (e.g. port numbers, protocols, packets sizes etc.). Third, columns of some attributes can be efficiently encoded when partitioned into time based segments that are encoded independently. 
Finally, most attributes that are of interest for monitoring and forensics can be encoded using basic integer data types. The records insertion operation is represented by bulk loads of time sequential data that will not be updated after writing. Having the attributes stored in the same order across the columns makes the join operation become trivial when attributes from more than one column are used together. Network data analysis does not require fast random access on all the attributes. Most of the monitoring queries need fast sequential access to large number of records and the ability to aggregate and summarize the data over a time window. Forensic queries access specific predictable attributes but collected over longer periods of time. To observe their specific characteristics we first compiled a comprehensive list of forensic and monitoring queries used in practice in various scenarios [17]. Based on the data access pattern, we identified five types among the initial list. Spot queries (S) that target a single key (usually an IP address or port number) and return a list with the values associated with that key. Range queries (R) that return a list with results for multiple keys (usually attributes corresponding to the IPs of a subnet). Aggregation queries (A) that aggregate the data for the entire network and return the result of the aggregation (e.g. traffic sent out for network). Spot Aggregation queries (SA) that aggregate the values found for one key in a single value. Range Aggregation queries (RA) that aggregate data for multiple keys into a single value. Examples of these types of queries expressed in plain words: (S) ”What applications are observed on host X between dates d1 and d2 ?” 67 (R) ”What is the list of destination IPs that have source IPs in a subnet between dates d1 and d2 ?” (A) ”What is the total number of connections for the entire network between dates d1 and d2 ?” (SA) ”What is the number of bytes that host X sent between dates d1 and d2 ?” (RA) ”What is the number of hosts that each of the hosts in a subnet contacted between dates d1 and d2 ?” 3.3.2 Column Oriented Storage Having described the network flow data characteristics and forensics and monitoring queries types on this data, in this section we introduce and present the main components of the column oriented storage architecture: columns, segments, the column index and the internal IPs index. Columns In NetStore, we consider that flow records with n attributes are stored in the logical table with n columns and an increasing number of rows (tuples) one for each flow record. The values of each attribute are stored in one column and have the same data type. By default almost all of the values of a column are not sorted. Having the data sorted in a column might help get better compression and faster retrieval, but changing the initial order of the elements requires the use of auxiliary data structure for tuple reconstruction at query time. We investigated several techniques to ease tuple reconstruction and all methods added much more overhead at query time than the benefit of better compression and faster data access. Therefore, we decided to maintain the same order of elements across columns to avoid any tuple reconstruction penalty when querying. However, since we can afford one column to be sorted without the need to use any reconstruction auxiliary data, we choose to first sort only one column and partially sort the rest of the columns. We call the first sorted column the anchor column. 
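To make this concrete, the sketch below shows one way a buffered batch of records could be sorted by the anchor column while preserving the row correspondence across all columns: a permutation of row indices is ordered by the anchor values and then applied identically to every column. This is a simplified illustration with invented names, not the actual NetStore code, and the secondary ordering of the remaining correlated columns within equal anchor values is omitted for brevity.

    import java.util.Arrays;
    import java.util.Comparator;

    public class AnchorSort {

        // Sorts a buffered batch by the values of the anchor column and applies
        // the same permutation to every column, so the i-th element of each
        // column still belongs to the same flow record after sorting.
        // columns[c][r] holds the value of attribute c for record r.
        public static void sortByAnchor(long[][] columns, int anchor) {
            int rows = columns[anchor].length;
            Integer[] perm = new Integer[rows];
            for (int r = 0; r < rows; r++) perm[r] = r;

            // Order the row indices by the anchor column values.
            Arrays.sort(perm, Comparator.comparingLong(r -> columns[anchor][r]));

            // Apply the same permutation to every column.
            for (int c = 0; c < columns.length; c++) {
                long[] reordered = new long[rows];
                for (int r = 0; r < rows; r++) {
                    reordered[r] = columns[c][perm[r]];
                }
                columns[c] = reordered;
            }
        }
    }

Because every column is permuted by the same index array, no record IDs or auxiliary tuple reconstruction structures are needed at query time.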
Note that after sorting, given our storage architecture, each segment can still be processed independently. The main purpose of the anchor column selection algorithm is to choose the ordering that facilitates the best compression and fast data access. Network flow data exhibits strong correlation between several attributes, and we exploit this characteristic by keeping the strongly correlated columns in consecutive sorting order as much as possible for better compression results. Additionally, based on the data access pattern of previous queries, columns are arranged by taking into account the probability of each column being accessed by future queries. The columns with higher probabilities are arranged at the beginning of the sorting order. As such, we maintain the counting probabilities associated with each of the columns given by the formula P(ci) = ai/t, where ci is the i-th column, ai is the number of queries that accessed ci, and t is the total number of queries.

Segments

Each column is further partitioned into fixed sets of values called segments. Segment partitioning enables physical storage and processing at a smaller granularity than simple column based partitioning. These design decisions provide more flexibility for compression strategies and data access. At query time only the needed segments are read from disk and processed, based on the information collected from the segment metadata structures called index nodes. Each segment has an associated unique identifier called the segment ID. For each column, the segment ID is an auto-incremented number, starting at the installation of the system. The segment sizes depend on the hardware configuration and can be set in such a way as to use most of the available main memory. For better control over the data structures used, the segments have the same number of values across all the columns. In this way there is no need to store a record ID for each value of a segment, and this is one major difference compared to some existing column stores [30]. As we will show in Section 3.4, the performance of the system is related to the segment size used. The larger the segment size, the better the compression performance and query processing times. However, we notice that the record insertion speed decreases with the increase of segment size, so there is a trade-off between the desired query performance and the needed insertion speed. Most of the columns store segments in compressed format, and in a later section we present the compression algorithms used. The column segmentation design is an important difference compared to traditional row oriented systems that process data a tuple at a time, whereas NetStore processes data a segment at a time, which translates to many tuples at a time. Figure 3.4 shows the processing steps for the three processing phases: buffering, segmenting and query processing.

Figure 3.4: NetStore processing phases: buffering, segmenting and query processing.

Column Index

For each column we store the metadata associated with each of the segments in an index node corresponding to the segment. The set of all index nodes for the segments of a column represents the column index. The information in each index node includes statistics about the data and different features that are used in the decision about the compression method to use and the optimal data access, as well as the time interval associated with the segment in the format [min start time, max end time].
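A minimal sketch of what an index node and a column index lookup might look like is given below; the field and method names are ours and only illustrate the idea, and the linear scan stands in for the interval tree used in the implementation.

    import java.util.ArrayList;
    import java.util.List;

    public class ColumnIndex {

        // Per-segment metadata kept in an index node; no data values are stored here.
        public static class IndexNode {
            long segmentId;
            long minStartTime, maxEndTime;   // time interval covered by the segment
            long minValue, maxValue;         // value range statistics
            int distinctValues;
            String compressionMethod;        // method chosen for this segment

            // True if the segment's time interval overlaps the window [from, to].
            boolean overlaps(long from, long to) {
                return minStartTime <= to && maxEndTime >= from;
            }

            // True if the segment may contain the value, based on its range statistics.
            boolean mayContain(long value) {
                return minValue <= value && value <= maxValue;
            }
        }

        private final List<IndexNode> nodes = new ArrayList<>();

        public void add(IndexNode node) {
            nodes.add(node);
        }

        // Returns the IDs of the segments whose time interval overlaps [from, to].
        public List<Long> filterSegments(long from, long to) {
            List<Long> ids = new ArrayList<>();
            for (IndexNode n : nodes) {
                if (n.overlaps(from, to)) {
                    ids.add(n.segmentId);
                }
            }
            return ids;
        }
    }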
Figure 3.5 presents an intuitive representation of the columns, segments and index for each column. Each column index is implemented 70 using a time interval tree. Every query is relative to a time window T. At query time, the index of every column accessed is looked up and only the segments that have the time interval overlapping window T are considered for processing. In the next step, the statistics on segment values are checked to decide if the segment should be loaded in memory and decompressed. This two-phase index processing helps in early filtering out unused data in query processing similar to what is done in [52]. Note that the index nodes do not hold data values, but statistics about the segments such as the minimum and the maximum values, the time interval of the segment, the compression method used, the number of distinct values, etc. Therefore, index usage adds negligible storage and processing overhead. From the list of initial queries we observed that the column for the source IP attribute is most frequently accessed. Therefore, we choose this column as our first sorted anchor column, and used it as a clustered index for each source IP segment. However, for workloads where the predominant query types are spot queries targeting a specific column other than the anchor column, the use of indexes for values inside the column segments is beneficial at a cost of increased storage and slowdown in insertion rate. Thus, this situation can be acceptable for slow networks were the insertion rate requirements are not too high. When the insertion rate is high then it is best not to use any index but rely on the meta-data from the index nodes. Internal IPs Index Besides the column index, NetStore maintains another indexing data structure for the network internal IP addresses called the Internal IPs index. Essentially the IPs index is an inverted index for the internal IPs. That is, for each internal IP address the index stores in a list the absolute positions where the IP address occurs in the column, sourceIP or destIP , as if the column is not partitioned into segments. Figure 3.6 shows an intuitive representation of the IPs index. For each internal IP address the positions list represents an array of increasing integer values that are compressed and stored on disk on a daily basis. Because IP addresses tend to occur in consecutive positions in a column, we chose to compress the positions list by applying run-length-encoding on differences between adjacent values. 71 Figure 3.5: Schematic representation of columns, segments, index nodes and column indexes. 3.3.3 Compression Each of the segments in NetStore is compressed independently. We observed that segments within a column did not have the same distribution due to the temporal variation of network activity in working hours, days, nights, weekends, breaks etc. Hence segments of the same column were best compressed using different methods. We explored different compression methods. We investigated methods that allow data processing in compressed format and do not need decompression of all the segment values if only one value is requested. We also looked at methods that provide fast decompression and reasonable compression ratio and speed. The decision on which compression algorithm to use is done automatically for each segment, and is based on the data features of the segment such as data type, the number of distinct values, range of the values and number of switches between adjacent values. 
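The sketch below illustrates how such per-segment features could be collected in a single pass over the segment values; the class and field names are ours and are only illustrative, not the actual NetStore code.

    import java.util.HashSet;
    import java.util.Set;

    public class SegmentStats {
        public long min = Long.MAX_VALUE;
        public long max = Long.MIN_VALUE;
        public int distinctValues;
        public int switches;   // positions where the value differs from its predecessor

        // Collects the features of a segment in a single pass over its values.
        public static SegmentStats collect(long[] segment) {
            SegmentStats s = new SegmentStats();
            Set<Long> distinct = new HashSet<>();
            for (int i = 0; i < segment.length; i++) {
                long v = segment[i];
                if (v < s.min) s.min = v;
                if (v > s.max) s.max = v;
                distinct.add(v);
                if (i > 0 && v != segment[i - 1]) s.switches++;
            }
            s.distinctValues = distinct.size();
            return s;
        }
    }

In a selection heuristic, a small number of distinct values or few switches would favor dictionary encoding or run-length encoding, while a narrow [min, max] range would favor frame of reference encoding, as described below.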
Figure 3.6: Intuitive representation of the IPs inverted index.

We tested a wide range of compression methods, including some we designed for this purpose and some currently used by similar systems [1, 30, 55, 60], with variations where needed. Below we list the techniques that proved effective in our experiments:

• Run-Length Encoding (RLE): is used for segments that have few distinct repetitive values. If value v appears consecutively r times, and r > 1, we compress it as the pair (v, r). It provides fast compression as well as the ability to process data in compressed format.

• Variable Byte Encoding: is a byte-oriented encoding method used for positive integers. It uses a variable number of bytes to encode each integer value as follows: if value < 128, use one byte (with the highest bit set to 0); if value < 128 * 128, use 2 bytes (the first byte has the highest bit set to 1 and the second to 0); and so on (a sketch of this scheme is given after this list). This method can be used in conjunction with RLE for both values and runs. It provides a reasonable compression ratio and good decompression speed, allowing the decompression of only the requested value without the need to decompress the whole segment.

• Dictionary Encoding: is used for columns with few distinct values, and is sometimes applied before RLE (e.g., to encode the protocol attribute).

• Frame Of Reference: considers the interval bounded by the minimum and maximum values as the frame of reference for the values to be compressed [20]. We use it to compress non-empty timestamp attributes within a segment (e.g., start time, end time) that are integer values representing the number of seconds from the epoch. Typically the time difference between the minimum and maximum timestamp values in a segment is less than a few hours; therefore the difference can be encoded using short values of 2 bytes instead of integers of 4 bytes. It allows processing data in compressed format by decompressing each timestamp value individually, without the need to decompress the whole segment.

• Generic Compression: we use the DEFLATE algorithm from the zlib library, which is a variation of LZ77 [59]. This method provides compression at the binary level and does not allow values to be individually accessed unless the whole segment is decompressed. It is chosen if it enables faster data insertion and access than the value-based methods presented earlier.

• No Compression: is listed as a compression method since it represents the base case for our compression selection algorithm.
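As referenced in the variable byte encoding item above, the following is a minimal sketch of this byte-oriented scheme for non-negative integers; it is our own illustration, not the exact NetStore implementation.

    public class VariableByte {

        // Encodes a non-negative value into out starting at offset and returns the
        // number of bytes written. All bytes except the last have the highest bit
        // set to 1; the last byte has the highest bit set to 0.
        public static int encode(long value, byte[] out, int offset) {
            int numBytes = 1;
            long v = value >>> 7;
            while (v != 0) {               // count the 7-bit groups needed
                numBytes++;
                v >>>= 7;
            }
            for (int i = 0; i < numBytes; i++) {
                int shift = 7 * (numBytes - 1 - i);
                int group = (int) ((value >>> shift) & 0x7F);
                out[offset + i] = (byte) (i < numBytes - 1 ? (group | 0x80) : group);
            }
            return numBytes;
        }

        // Decodes a single value starting at offset; a full implementation would
        // also report how many bytes were consumed.
        public static long decode(byte[] in, int offset) {
            long value = 0;
            int i = offset;
            while ((in[i] & 0x80) != 0) {            // continuation bytes
                value = (value << 7) | (in[i] & 0x7F);
                i++;
            }
            return (value << 7) | (in[i] & 0x7F);    // final byte (highest bit 0)
        }
    }

For example, 127 is encoded in a single byte 0x7F, while 128 is encoded as the two bytes 0x81 0x00, matching the rule described above.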
Compression Methods Selection

The selection of a compression method is done based on the statistics collected in one pass over the data of each segment. As mentioned earlier, the two major requirements of our system are to keep the record insertion rate high and to provide fast data access. Data compression does not always provide better insertion and query performance than no compression, and for this reason we developed a model to decide when compression is suitable and, if so, which method to choose. Essentially, we compute a score for each candidate compression method and we select the one that has the best score. More formally, we assume we have k + 1 compression methods m0, m1, ..., mk, with m0 being the "No Compression" method. We then compute the insertion time as the time to compress and write to disk, and the access time as the time to read from disk and decompress, as functions of each compression method. For value-based compression methods, we estimate the compression, write, read and decompression times based on the statistics collected for each segment. For the generic compression we estimate the parameters based on the average results obtained when processing sample segments. For each segment we evaluate:

insertion(mi) = c(mi) + w(mi),   i = 1, ..., k
access(mi) = r(mi) + d(mi),   i = 1, ..., k

As the base case for each method evaluation we consider the "No Compression" method. We take I0 to represent the time to insert an uncompressed segment, which is given by the writing time alone since no time is spent on compression, and similarly A0 to represent the time to access the segment, which is given by the time to read the segment from disk alone since there is no decompression. Formally, following the above equations we have:

insertion(m0) = w(m0) = I0
access(m0) = r(m0) = A0

We then consider as candidates only the compression methods mi for which both:

insertion(mi) < I0 and access(mi) < A0

Next, among the candidate compression methods we choose the one that provides the lowest access time. Note that we primarily consider the access time as the main differentiating factor, not the insertion time. The disk read is the most frequent and time consuming operation and, for commodity hard drives, it is many times slower than a disk write of a file of the same size. Additionally, the insertion time can be improved by bulk loading or by other means that take into account that the network traffic rate is not steady and varies greatly over time, whereas the access mechanism should provide the same level of performance at all times. The model presented above does not take into account whether the data can be processed in compressed format; the assumption is that decompression is necessary at all times. However, for a more accurate compression method selection we should include, in the access time equation, the probability of a query processing the data in compressed format. Since forensic and monitoring queries are usually predictable, we can assume, without affecting the generality of our system, that we have a total number of t queries, each query qj having a probability of occurrence pj, with the sum of pj over j = 1, ..., t equal to 1. We consider the probability of a segment s being processed in compressed format to be the probability of occurrence of the queries that process the segment in compressed format. Let CF be the set of all the queries that process s in compressed format; we then get:

P(s) = sum of pj over all qj in CF,   where CF = {qj | qj processes s in compressed format}

Now, a more accurate access time equation can be rewritten taking into account the possibility of not decompressing the segment for each access:

access(mi) = r(mi) + d(mi) · (1 − P(s)),   i = 1, ..., k      (3.1)

Note that the compression selection model can accommodate any compression method, not only the ones mentioned in this chapter, and it is also valid in the cases when the probability of processing the data in compressed format is 0.

3.3.4 Query Processing

Figure 3.4 illustrates the NetStore data flow, from network flow record insertion to the query result output. Data is written only once in bulk, and read many times for processing. NetStore does not support transaction processing queries such as record updates or deletes; it is suitable for analytical queries in general and for network forensics and monitoring queries in particular.

Data Insertion

Network data is processed in several phases before being delivered to permanent storage.
First, raw flow data is collected from the network sensors and is then preprocessed. Preprocessing includes the buffering and segmenting phases. Each flow is identified by a 76 flow ID represented by the 5-tuple [sourceIP, sourcePort, destIP, destPort, protocol ]. In the buffering phase, raw network flow information is collected until the buffer is filled. The flow records in the buffer are aggregated and then sorted. As mentioned in Section 3.3.3, the purpose of sorting is twofold: better compression and faster data access. All the columns are sorted following the sorting order determined based on access probabilities and correlation between columns using the first sorted column as anchor. In the segmenting phase, all the columns are partitioned into segments, that is, once the number of flow records reach the buffer capacity the column data in the buffer is considered a full segment and is processed. Each of the segments is then compressed using the appropriate compression method based on the data it carries. The information about the compression method used and statistics about the data is collected and stored in the index node associated with the segment. Note that once the segments are created, the statistics collection and compression of each segment is done independent of the rest of the segments in the same column or in other columns. By doing so, the system takes advantage of the increasing number of cores in a machine and provides good record insertion rates in multi threaded environments. After preprocessing all the data is sent to permanent storage. As monitoring queries tend to access the most recent data, some data is also kept in memory for a predefined length of time. NetStore uses a small active window of size W and all the requests from queries accessing the data in the time interval [NOW - W, NOW] are served from memory, where NOW represents the actual time of the query. Query Execution For flexibility NetStore supports limited SQL syntax and implements a basic set of segment operators related to the query types presented in Section 3.3.1. Each SQL query statement is translated into a statement in terms of the basic set of segment operators. Below we briefly present each general operator: • filter segs (d1 , d2 ): Returns the set with segment IDs of the segments that overlap with the time interval [d1 , d2 ]. This operator is used by all queries. 77 • filter atts(segIDs, pred1 (att1 ), . . . , predk (attk )): Returns the list of pairs (segID, pos list), where pos list represents the intersection of attribute position lists in the corresponding segment with id segID, for which the attribute atti satisfies the predicate predi , with i = 1, . . . , k. • aggregate (segIDs, pred1 (att1 ), . . . , predk (attk )): Returns the result of aggregating values of attribute attk by attk−1 by . . . att1 that satisfy their corresponding predicates predk , . . . , pred1 in segments with ids in segIDs. The aggregation can be summation, counting, min or max. The queries considered in Section 3.4.2 can all be expressed in terms of the above operators. For example the query: ”What is the number of unique hosts that each of the hosts in the network contacted in the interval [d1 , d2 ]?” can be expressed as follows: aggregate(filter segs(d1 , d2 ), sourceIP = 128.238.0.0/16, destIP ). 
After the operator filter segs is applied, only the sourceIP and destIP segments that overlap with the time interval [d1 , d2 ] are considered for processing and their corresponding index nodes are read from disk. Since this is a range aggregation query, all the considered segments will be loaded and processed. If we consider the query ”What is the number of unique hosts that host X contacted in the interval [d1 , d2 ]?” it can be expressed as follows: aggregate(filter segs(d1 , d2 ), sourceIP = X, destIP ). For this query the number of relevant segments can be reduced even more by discarding the ones that do not overlap with the time interval [d1 , d2 ], as well as the ones that don’t hold the value X for sourceIP by checking corresponding index nodes statistics. If the value X represents the IP address of an internal node, then the internal IPs index will be used to retrieve all the positions where the value X occurs in the sourceIP column. Then a count operation is performed of all the unique destIP addresses corresponding to the positions. Note that by using internal IPs index, the data of sourceIP column is not touched. The only information loaded in memory is the positions list of IP X as well as the segments in column destIP that correspond to those positions. 78 3.4 Evaluation In this section we present an evaluation of NetStore. We designed and implemented NetStore using Java programming language on the FreeBSD 7.2-RELEASE platform. For all the experiments we used a single machine with 6 GB DDR2 RAM, two Quad-Core 2.3 Ghz CPUs, 1TB SATA-300 32 MB Buffer 7200 rpm disk with a RAID-Z configuration. We consider this machine representative of what a medium scale enterprise will use as a storage server for network flow records. For experiments we used the network flow data captured over a 24 hour period of one weekday at our campus border router. The size of raw text file data was about 8 GB, 62,397,593 network flow records. For our experiments we considered only 12 attributes for each network flow record, that is only the ones that were meaningful for the queries presented in this chapter. Table 3.1 shows the attributes used as well as the types and the size for each attribute. We compared NetStore’s performance with two open source RDBMS, a row-store, PostgreSQL [42] and a column-store, LucidDB [30]. We chose PostgreSQL over other open source systems because we intended to follow the example in [18] which uses it for similar tasks. Additionally we intended to make use of the partial index support for internal IPs that other systems don’t offer in order to compare the performance of our inverted IPs index. We chose LucidDB as the column-store to compare with as it is, to the best of our knowledge, the only stable open source columnstore that yields good performance for disk resident data and provides reasonable insertion speed. We chose only data captured over one day, with size slightly larger than the available memory, because we wanted to maintain reasonable running times for the other systems that we compared NetStore to. These systems become very slow for larger data sets and performance gap compared to NetStore increases with the size of the data. 3.4.1 Parameters Figure 3.7 shows the influence that the segment size has over the insertion rate. We observe that the insertion rate drops with the increase of segment size. 
This trend is expected and is caused by the delay in the preprocessing phase, mostly because of the sorting of larger segment arrays. Figure 3.8 shows that the segment size also affects the compression ratio of each segment: the larger the segment size, the larger the compression ratio achieved. But a high compression ratio is not a critical requirement. The size of the segments is more critically related to the available memory, the desired insertion rate for the network and the number of attributes used for each record. We set the insertion rate goal at 10,000 records/second, and for this goal we set a segment size of 2 million records given the above hardware specification and record sizes. Table 3.2 shows the insertion performance of NetStore. The numbers presented are computed based on the average bytes per record and average packets per record given the insertion rate of 10,000 records/second. When installed on a machine with the above specification, NetStore can keep up with traffic rates of up to 1.5 Gbit/s for the current experimental implementation. For a constant memory size, this rate decreases with the increase in segment size and the increase in the number of attributes for each flow record.

Column       Type    Bytes
sourceIP     int     4
destIP       int     4
sourcePort   short   2
destPort     short   2
protocol     byte    1
startTime    short   2
endTime      short   2
tcpSyns      byte    1
tcpAcks      byte    1
tcpFins      byte    1
tcpRsts      byte    1
numBytes     int     4

Table 3.1: NetStore flow attributes.

Property                          Value           Unit
records insertion rate            10,000          records/second
number of records                 62,397,594      records
number of bytes transported       1.17            Terabytes
bytes transported per record      20,616.64       Bytes/record
bits rate supported               1.54            Gbit/s
number of packets transported     2,028,392,356   packets
packets transported per record    32.51           packets/record
packets rate supported            325,075.41      packets/second

Table 3.2: NetStore properties and network rates supported based on 24 hour flow records data and the 12 attributes.

Figure 3.7: Insertion rate for different segment sizes.

3.4.2 Queries

Having described the NetStore architecture and its design details, in this section we consider the queries described in [17], taking into account data collected over the 24 hours for the internal network 128.238.0.0/16. We consider both the queries and the methodology in [17] meaningful for how an investigator would perform security analysis on network flow data. We assume all the flow attributes used are inserted into a table flow and we use standard SQL to describe all our examples.

Scanning

A scanning attack refers to the activity of sending a large number of TCP SYN packets to a wide range of IP addresses. Based on the received answers the attacker can determine whether a particular vulnerable service is running on the victim's host. As such, we want to identify any TCP SYN scanning activity initiated by an external host, with no TCP ACK or TCP FIN flags set and targeted against a large number of internal IP destinations, larger than a preset limit. We use the following range aggregation query (Q1):

SELECT sourceIP, destPort, count(distinct destIP), startTime
FROM flow
WHERE sourceIP <> 128.238.0.0/16
  AND destIP = 128.238.0.0/16
  AND protocol = tcp
  AND tcpSyns = 1 AND tcpAcks = 0 AND tcpFins = 0
GROUP BY sourceIP
HAVING count(distinct destIP) > limit;

Figure 3.8: Compression ratio with and without aggregation.

The external IP address 61.139.105.163 was found scanning starting at time t1.
We check whether there were any valid responses after time t1 from the internal hosts, where no packet had the TCP RST flag set, and we use the following query (Q2):

SELECT sourceIP, sourcePort, destIP
FROM flow
WHERE startTime > t1
  AND sourceIP = 128.238.0.0/16
  AND destIP = 61.139.105.163
  AND protocol = tcp
  AND tcpRsts = 0;

Worm Infected Hosts

The internal host with the IP address 128.238.1.100 was discovered to have responded to a scan initiated by a host infected with the Conficker worm, and we want to check whether the internal host is compromised. Typically, after a host is infected, the worm copies itself into memory and begins propagating to random IP addresses across a network by exploiting the same vulnerability. The worm opens a random port and starts scanning random IPs on port 445. We use the following query to check the internal host (Q3):

SELECT sourceIP, destPort, count(distinct destIP)
FROM flow
WHERE startTime > t1
  AND sourceIP = 128.238.1.100
  AND destPort = 445;

SYN Flooding

This is a network-based denial of service attack in which the attacker sends an unusually large number of SYN requests, over a threshold t, to a specific target over a small time window W. To detect such an attack we filter all the incoming traffic and count the number of flows with the TCP SYN bit set and no TCP ACK or TCP FIN for all the internal hosts. We use the following query (Q4):

SELECT destIP, count(distinct sourceIP), startTime
FROM flow
WHERE startTime > 'NOW - W'
  AND destIP = 128.238.0.0/16
  AND protocol = tcp
  AND tcpSyns = 1 AND tcpAcks = 0 AND tcpFins = 0
GROUP BY destIP
HAVING count(sourceIP) > t;

Network Statistics

Besides security analysis, network statistics and performance monitoring is another important use of network flow data. To get this information we use aggregation queries over all the collected data for a large time window, both incoming and outgoing. The aggregation operation can be a summation of the number of bytes or packets, a count of the number of unique hosts contacted, or some other meaningful aggregation statistic. For example, we use the following simple aggregation query to find the number of bytes transported in the last 24 hours (Q5):

SELECT sum(numBytes)
FROM flow
WHERE startTime > 'NOW - 24h';

General Queries

The sample queries described above are complex and belong to more than one of the basic types described in Section 3.3.1. However, each of them can be separated into several basic types such that the result of one query becomes the input for the next one. We built a more general set of queries starting from the ones described above by varying the parameters in such a way as to achieve different levels of data selectivity, from low to high. Then, for each type, we report the average performance over all the queries of that type. Figure 3.9 shows the average running times of the selected queries for increasing segment sizes. We observe that for S type queries that do not use the IPs index (e.g., for attributes other than the internal sourceIP or destIP), the performance decreases when the segment size increases. This is an expected result, since for larger segments more unused data is loaded as part of the segment where the spotted value resides. When using the IPs index, the performance benefit comes from skipping the irrelevant segments whose positions are not found in the positions list. However, for busy internal servers that have corresponding flow records in all the segments, all corresponding attribute segments have to be read, but not the IP segments.
This is an advantage since an IP segment is in general several times larger than the other attribute segments. Hence, except for spot queries that use non-indexed attributes, queries tend to be faster for larger segment sizes.

Figure 3.9: Average query times for different segment sizes and different query types.

3.4.3 Compression

Our goal in using compression is not to achieve the best compression ratio nor the best compression or decompression speed, but to obtain the highest record insertion rate and the best query performance. We evaluated our compression selection model by comparing the performance when using a single method for all the segments in a column with the performance when using the compression selection algorithm for each segment. To select the method for a column, we first compressed all the segments of the column with each of the six methods presented. We then measured the access performance for each column compressed with each method. Finally, we selected as the compression method of a column the method that provided the best access times for the majority of the segments. For the per-segment compression, we activated the method selection mechanism for all columns and then inserted the data, compressing each segment based on the statistics of its own data rather than those of the entire column. In both cases we did not change anything in the statistics collection process, since all the statistics were used in the query process for both approaches. We obtained on average a 10 to 15 percent improvement per query using the segment based compression method selection model, with no penalty for the insertion rate. We consider the overall performance of the compression method selection model satisfactory; its true value resides in the framework implementation, which is limited only by the individual methods used, not by the general model design. If the data changes and other compression methods become more efficient for the new data, only the compression algorithms and the operators that work on the compressed data need to be changed, with the overall architecture remaining the same. Some commercial systems [58] apply, on top of the value-based compressed columns, another layer of general binary compression for increased performance. We investigated the same possibility and compared four different approaches to compression on top of the implemented column oriented architecture: no compression, value-based compression only, binary compression only, and value-based plus binary compression on top of that. For the no compression case, we processed the data using the same indexing structure and column oriented layout but with compression disabled for all the segments. For binary compression only, we compress each segment using the generic binary compression. In the case of value-based compression we compress all the segments with the dynamic selection mechanism enabled, and for the last approach we apply another layer of generic compression on top of the already value-based compressed segments. The results of our experiment for the four cases are shown in Figure 3.10. We can see that compression is a determining factor in the performance metrics. Using value-based compression achieves the best average running time for the queries, while the uncompressed segments scenario yields the worst performance.

Figure 3.10: Average query times for the compression strategies implemented.

We also see that adding another compression layer
Some commercial systems [58] apply, on top of the value-based compressed columns, another layer of general binary compression for increased performance. We investigated the same possibility and compared four different approaches to compression on top of the implemented column oriented architecture: no compression, value-based compression only, binary compression only, and value-based plus binary compression on top of it. For the no compression case, we processed the data using the same indexing structure and column oriented layout but with compression disabled for all the segments. For the binary compression only case we compressed each segment using the generic binary compression. In the case of value-based compression we compressed all the segments with the dynamic selection mechanism enabled, and for the last approach we applied another layer of generic compression on top of the already value-based compressed segments. The results of our experiment for the four cases are shown in Figure 3.10. We can see that compression is a determining factor in the performance metrics. Using value-based compression achieves the best average running time for the queries, while the uncompressed segments scenario yields the worst performance. We also see that adding another compression layer does not help query performance or insertion rate, even though it provides a better compression ratio. However, the general compression method can be used for data aging, to compress and archive older data that is not actively used.

Figure 3.10: Average query times for the compression strategies implemented.

Figure 3.8 shows the compression performance for different segment sizes and how flow aggregation affects the storage footprint. As expected, compression performance is better for larger segment sizes in both cases, with and without aggregation. This is a consequence of the compression methods used: the larger the segment, the longer the runs for columns with few distinct values and the smaller the dictionary size relative to each segment. The overall compression ratio of raw network flow data for a segment size of 2 million records is 4.5 with no aggregation and 8.4 with aggregation enabled. Note that the size of the compressed data also includes the size of both indexing structures: the column indexes and the IPs index.

3.4.4 Comparison With Other Systems
For the comparison we used the same data and performed a system-specific tuning of each system's parameters. To maintain the insertion rate above our target of 10,000 records/second we created three indexes for each of PostgreSQL and LucidDB: one clustered index on startTime and two unclustered indexes, one on the sourceIP and one on the destIP attribute. Although we believe we chose good values for the other tuning parameters we cannot guarantee they are optimal, and we only present the performance we observed. We show the performance obtained using the data and the example queries presented in Section 3.4.2.

          Postgres/NetStore   LucidDB/NetStore
Q1        10.98               5.14
Q2        7.98                1.10
Q3        2.21                2.25
Q4        15.46               2.58
Q5        1.67                1.53
Storage   93.6                6.04

Table 3.3: Relative performance of NetStore versus columns-only PostgreSQL and LucidDB for query running times and total storage needed.

Table 3.3 shows the relative performance of NetStore compared to PostgreSQL for the same data. Since our main goal is to improve disk resident data access, we ran each query once for each system to minimize the use of cached data. The numbers presented show how many times NetStore is better. To maintain a fair overall comparison we created a PostgreSQL table for each column of NetStore. As mentioned in [2], row-stores with a columnar design provide better performance for queries that access a small number of columns, such as the sample queries in Section 3.4.2. We observe that NetStore clearly outperforms PostgreSQL for all the query types, providing the best results for the queries accessing more attributes (e.g. Q1 and Q4), even though PostgreSQL uses about 90 times more disk space including all the auxiliary data. The poor PostgreSQL performance can be explained by the absence of more clustered indexes, the lack of compression, and the unnecessary tuple overhead. Table 3.3 also shows the relative performance compared to LucidDB. We observe that the performance gap is not of the same order of magnitude as that of PostgreSQL, even when more attributes are accessed. However, NetStore performs clearly better while storing about 6 times less data. The performance penalty of LucidDB can be explained by the lack of a column segmentation design and by the early materialization in the processing phase specific to general-purpose column stores. We also noticed that LucidDB achieves a significant performance improvement for subsequent runs of the same query by efficiently using memory resident data.
3.5 Conclusion
With the growth of network traffic, there is an increasing demand for solutions to better manage and take advantage of the wealth of network flow information recorded for monitoring and forensic investigations. The problem is no longer the availability and the storage capacity of the data, but the ability to quickly extract the relevant information about potential malicious activities that can affect network security and resources. In this chapter we have presented the design, implementation and evaluation of a novel working architecture, called NetStore, that is useful in network monitoring tasks and assists in network forensics investigations. The simple column oriented design of NetStore helps reduce query processing time by spending less time on disk I/O and loading only the needed data. The column partitioning facilitates the use of efficient compression methods for network flow attributes that allow data processing in compressed format, therefore boosting query runtime performance. NetStore clearly outperforms existing row-based DBMS systems and provides better results than the general purpose column oriented systems because of simple design decisions tailored for network flow records. Experiments show that NetStore can provide more than ten times faster query response compared to other storage systems while maintaining a much smaller storage size. In future work we seek to explore the use of NetStore for new types of time sequential data, such as host log analysis, and the possibility of releasing it as an open source system. Having described the design and implementation of the column oriented storage infrastructure and the general processing engine, in the next chapter we go a step further towards our goal of improving the runtime performance of queries on network flow data. As such, we analyze the characteristics of monitoring and forensic query workloads, we define two general types of queries based on their complexity, and we present the querying models and optimization methods for both simple and complex query types.

Chapter 4
A Querying Framework For Network Monitoring and Forensics

4.1 Introduction
Forensics and monitoring queries on historical network flow data are becoming increasingly complex [32]. In general these queries are composed of many simple filtering and aggregation queries, executed sequentially or in batches, that process large amounts of network flow data spanning long periods of time. Running complex queries faster increases network analysis capabilities. The column oriented architecture described in the previous chapter, along with the general purpose column stores presented in [55, 60], shows that a column oriented storage system yields better query runtime performance than a transactional row oriented system for analytical query workloads. In this chapter we show that query performance can be further improved by using efficient processing methods for forensic analysis and monitoring queries. In forensic analysis, network flow data is processed in several steps in a drill-down manner, narrowing and correlating the subsequent intermediary results throughout the analysis [17, 32]. For example, suppose a host X is detected as having scanned hundreds of hosts on a particular port. This event may lead the administrator to check host X's past network activity by querying network flow data to find any services run and protocols used, and then the connection records to other hosts.
This investigation process continues by using previous results and issuing new sequential forensic queries until, eventually, the root of the security incident is revealed. In general, to enable the use of previous query results, existing systems store intermediary data in temporary files and materialized views on disk before feeding data to the new queries [17, 18]. Using this approach the query runtime is increased by the I/O operations. Moreover, using standard SQL syntax, references between subsequent query results are not trivial to represent. Instead, sophisticated nested queries, stored procedures and scripting languages are used [17, 32, 39]. In this chapter, we also propose a simple SQL extension to easily express previous query results as well as other features useful when working with network flow data. When executing queries sequentially, the query engine has the opportunity to speed up the new queries by efficiently reusing the results of the predicates already evaluated in the querying session. As such, in Section 4.2.2 we show how forensic queries can be executed more efficiently by reusing previous computation when queries are sequential and share some filters. For monitoring, network administrators run many simple queries in batches in order to detect complex network behavior patterns, such as worm infections [32], and to display traffic summarization per hour, per day or per some other predefined time window [17, 39]. When queries are submitted in batches, the simple queries can be executed in any order, and some orders may result in better overall runtime performance. Additionally, it is expected that some of the queries in the batch will use the same filtering predicates for some attributes (known ports, IPs, etc.). Thus, the results from evaluating common predicates can be shared by many queries, therefore saving execution time. Moreover, evaluating predicates in a particular order across all the simple queries may result in less evaluation work for the future predicates in the execution pipeline. Taking into account the above properties of monitoring queries, in Section 4.2.2 we present an efficient method to execute batch monitoring queries faster using network flow data stored in a column oriented system.

Data Storage
The proposed query processing engine is designed to work with network flow data such as Cisco NetFlow [57]. We assume the data is stored using the column-oriented storage infrastructure whose design and implementation are described in the previous chapter, and we present a brief summary here. The flow data is collected at the edge of an enterprise network and represents all the flow traffic that crosses the network boundary in and out. Each flow contains attributes such as source IP, source port, destination IP, destination port, protocol, start and end time of the flow, etc. Data is assumed to be partitioned into columns at the physical level, one column for each network flow attribute. The set of all the columns that store the attributes of the same set of flows is called a storage. Conceptually, at the logical level a storage is similar to a database table, with the distinction that the flows are stored as they are collected. That is, data is only appended in time sequential order rather than inserted at specific positions in the columns. All the columns are stored in the same order. The attributes at the same positions in different columns represent a logical row or logical tuple. A subset of values of a column represents a segment, and an index node stores the metadata for a segment. The index node contains data about the segment such as the minimum and maximum values, the number of distinct values, the corresponding time window, the encoding method, a small data histogram, etc. All the index nodes of a column represent the column index. At the physical level each column is actually represented by a set of segments and a column index stored in compressed binary files on disk. Additionally, an internal IPs index structure, as presented in [19], is maintained for the internal IP addresses to facilitate faster access to their corresponding records. Essentially, the inverted IPs index maintains, for each internal IP address, the positions in the columns where records of that internal IP address appear.
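To make the storage layout concrete, the following sketch shows one possible shape for the per-segment metadata kept in an index node; the class and field names are illustrative and do not reflect NetStore's internal code.

// Illustrative per-segment metadata kept in an index node; names are assumptions of this sketch.
final class SegmentIndexNode {
    final long minValue;      // minimum value stored in the segment
    final long maxValue;      // maximum value stored in the segment
    final long numValues;     // number of values in the segment
    final long numDistinct;   // number of distinct values in the segment
    final long startTime;     // start of the segment's time window
    final long endTime;       // end of the segment's time window
    final String encoding;    // compression/encoding method used for the segment
    final int[] histogram;    // small histogram of the segment values

    SegmentIndexNode(long minValue, long maxValue, long numValues, long numDistinct,
                     long startTime, long endTime, String encoding, int[] histogram) {
        this.minValue = minValue; this.maxValue = maxValue;
        this.numValues = numValues; this.numDistinct = numDistinct;
        this.startTime = startTime; this.endTime = endTime;
        this.encoding = encoding; this.histogram = histogram;
    }

    // A segment can be skipped for an equality predicate whose value lies outside its range.
    boolean mayContain(long value) {
        return value >= minValue && value <= maxValue;
    }
}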
Contributions
Having described the general context and the requirements for the query processing engine, in this chapter we make the following key contributions:
• The design and implementation of an efficient query processing engine built on top of a column oriented storage system used for network flow data.
• Query optimization methods for sequential and batch queries used in network monitoring and forensic analysis.
• The design of a simple SQL extension that allows a simpler representation of forensic and monitoring queries.
The rest of the chapter is organized as follows: in Section 4.2 we describe the simple and complex queries as well as the processing and optimization methods, in Section 4.3 we present the SQL extension primitives, in Section 4.4 we show the processing experimental results, Section 4.5 discusses related work and in Section 4.6 we conclude.

4.2 Query Processing
The query processing engine is built on top of the column oriented storage system presented in the previous chapter. We are looking to optimize two general types of queries: simple queries and complex queries. A simple query is composed of several filtering predicates and aggregation operators over a subset of the network flow data attributes. A complex query is composed of several simple queries as building blocks that can be executed sequentially, query-at-a-time, with new simple queries using the results of previous simple queries, or in batches, when all the simple queries are sent for execution at the same time. In this section we describe the two query types as well as the proposed processing and optimization methods for each query type.

4.2.1 Simple Queries
A simple query has the following general format:

label: SELECT Ac1, ..., Ack, f1(Aa1), ..., fm(Aam)
FROM flow_records_storage
WHERE σ1(A1) AND ... AND σn(An)
GROUP BY Ac1, ..., Ack;

where Ac1, ..., Ack represent the attributes in the output, Aa1, ..., Aam the attributes that are aggregated in the output, f1, ..., fm the SQL standard aggregation functions (MIN, MAX, SUM, COUNT, AVG, etc.) and σ1, ..., σn the filtering predicates over the accessed attributes. For simplicity we assume that the predicates in the WHERE clause are ANDed. Therefore, a logical tuple that satisfies the boolean expression in the WHERE clause of a simple query has to satisfy all the predicates in the clause. Thus, the predicates can be executed in any order, and the runtime performance is influenced by the order in which the corresponding filters are evaluated.
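For illustration, a simple query of this form could be represented internally along the following lines; the type and field names are hypothetical, and the only property the optimizer relies on is that the filters are implicitly ANDed and can therefore be reordered.

import java.util.List;

// Hypothetical in-memory representation of a simple query: output attributes,
// aggregations, and an ANDed list of filter predicates.
enum FilterOp { LT, LE, GT, GE, EQ, NEQ, IN }

final class FilterPredicate {
    final String attribute;   // e.g. "sourceIP" or "destPort"
    final FilterOp op;
    final long value;         // attribute value in its encoded integer form
    FilterPredicate(String attribute, FilterOp op, long value) {
        this.attribute = attribute; this.op = op; this.value = value;
    }
}

final class SimpleQuery {
    final List<String> outputAttributes;   // A_c1, ..., A_ck
    final List<String> aggregations;       // f_1(A_a1), ..., f_m(A_am)
    final List<FilterPredicate> filters;   // sigma_1, ..., sigma_n, implicitly ANDed
    SimpleQuery(List<String> outputAttributes, List<String> aggregations, List<FilterPredicate> filters) {
        this.outputAttributes = outputAttributes; this.aggregations = aggregations; this.filters = filters;
    }
}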
One way to execute a simple query is to scan the relevant columns individually, evaluate the predicates for each column independently, and in a final step merge the values of all the columns that satisfy the corresponding predicates. This approach, called parallel early materialization, was analyzed in [3]. However, evaluating each predicate independently might waste processing time generating unnecessary intermediary results that are discarded by the final merging operation. Another, more efficient, approach is to evaluate each predicate over each column in order and to send to the next predicate the positions of the values that satisfied the previous predicate. In this way the next predicate is evaluated only for the values at the relevant positions. In the final step the columns in the output are scanned again and only the values at the positions that satisfied all the predicates are merged. This approach, called pipelined late materialization, was also analyzed in [3], and our proposed processing engine uses a variation of this method for simple query processing. Note that we assume that all the columns are partitioned into segments and that corresponding segments from different columns (containing values at the same positions in the columns) fit in main memory. As such, each simple filtering query is evaluated per segment and the intermediary results for all segments are merged in the final step. By doing so, all the working segments are in main memory, yielding faster processing. All the aggregation operations are performed in the last step, after the filtering operations are completed. If the aggregation operations form a large number of groups and the size of the output becomes larger than the available memory, the intermediary results are flushed to disk. However, we found such a case unusual for our test query workload.

Figure 4.1: Simple query representation and execution graph: C1, C3, C4, C6 - the accessed columns; L1, L2, L3 - the position lists of values satisfying the predicates σ1, σ2 and σ3, respectively; π - the output merging operator; U - user input; O - a previous query result.

A simple query execution can be represented as a directed graph, as shown in Figure 4.1. Each predicate is evaluated by a processing node. Except for the first node, each processing node can have as input column disk resident data, user input and temporary data from other queries, as well as a list of positions representing the values that satisfied the previous predicate. The first processing node can have as input only column, user and temporary data.

Simple Query Optimization
It is beneficial to filter out unused data and avoid reading unused segments as early as possible in the processing phase. Since all the aggregation operations are performed using the same (hash-based) methods after the filtering is completed, we omit their execution cost from the formal problem description. Therefore, the optimization problem is to find the filtering order that yields the best running time, that is, the filtering order with the minimum execution cost. This problem is equivalent to the pipelined filters ordering problem studied in the database research community. We use a polynomial time algorithm similar to the one proposed in [24] to solve the instance of the problem in which the data is stored in a column store.
Suppose a simple query is represented as a set of m filters, {F1, ..., Fm}, and for each filter Fk we are given the cost ck of evaluating the filter on a tuple and the probability pk that the tuple satisfies the filter predicate, independent of whether the tuple satisfies the other filter predicates. A simple polynomial time solution to this problem, presented in [24], is to order all the m filters in non-decreasing order of the ratio ck/(1 − pk), also called the rank of the filter. The running time of this simple algorithm is asymptotically bounded by the time of sorting the filters in non-decreasing order of ranks, O(m log m). However, since we expect to have a small number of filters, in practice the running time is determined by the overhead needed to read from disk the parameters used to compute each filter's rank. We use a variation of this algorithm by considering segment processing cost and selectivity, rather than tuple evaluation cost and the probability of the tuple passing the filter. Since segment reading is the most expensive operation, its cost is directly proportional to the on-disk segment size, and data is most of the time processed in compressed format, in our simple query processing model we take the execution cost ck of each filter to be equal to the size si of the segment on disk. The selectivity of a filter over a segment is represented by the fraction of the segment values that satisfy the filter predicate. The selectivity factor (SF) of a filter over a segment is estimated based on the statistics collected in the index node of each segment (minimum value, maximum value and number of distinct values) for each type of filtering operator (<, ≤, >, ≥, ≠, =, IN). By default we consider all the values in a segment to be uniformly distributed and we use simple estimation equations, similar to the ones in [51], to estimate the selectivity factors. Suppose a column has a segment Si with ni elements, size si on disk, minimum value mi, maximum value Mi and di distinct values. Also, suppose a filtering predicate σi is given in the format Ai op v, where Ai is the attribute name, op is one of the filtering operators and v is a value given in the query statement. For a filter with predicate σi we estimate the selectivity factor SF(Si, σi) over segment Si, assuming uniformly distributed values, using the following equations:

SF(S_i, A_i = v) = \begin{cases} 0 & \text{if } v < m_i \text{ or } M_i < v \\ n_i \cdot \frac{1}{d_i} & \text{if } m_i \le v \le M_i \end{cases}

SF(S_i, A_i \ne v) = \begin{cases} n_i & \text{if } v < m_i \text{ or } M_i < v \\ n_i \cdot \frac{d_i - 1}{d_i} & \text{if } m_i \le v \le M_i \end{cases}

SF(S_i, A_i \in V) = \begin{cases} \sum_{v \in V} SF(S_i, A_i = v) & \text{if } V \text{ is enumerated} \\ n_i & \text{if } V \text{ is not known} \end{cases}

SF(S_i, A_i < v) = \begin{cases} 0 & \text{if } v \le m_i \\ n_i \cdot \frac{v - m_i}{M_i - m_i} & \text{if } m_i < v \le M_i \\ n_i & \text{if } M_i < v \end{cases}

SF(S_i, A_i \le v) = \begin{cases} 0 & \text{if } v < m_i \\ n_i \cdot \frac{v - m_i + 1}{M_i - m_i} & \text{if } m_i \le v \le M_i \\ n_i & \text{if } M_i < v \end{cases}

SF(S_i, A_i > v) = \begin{cases} 0 & \text{if } M_i \le v \\ n_i \cdot \frac{M_i - v}{M_i - m_i} & \text{if } m_i \le v < M_i \\ n_i & \text{if } v < m_i \end{cases}

SF(S_i, A_i \ge v) = \begin{cases} 0 & \text{if } M_i < v \\ n_i \cdot \frac{M_i - v + 1}{M_i - m_i} & \text{if } m_i \le v \le M_i \\ n_i & \text{if } v \le m_i \end{cases}

where V is a set of values that can be explicitly enumerated in the query statement or can be loaded from an external source or a previous query result. When V is explicitly enumerated, the SF can be estimated more accurately by adding the selectivity factors of the equality predicates for the elements of V. When V is not known, the selectivity factor is taken to be equal to the segment size. In general, the selectivity factors estimated with these simple equations are quite accurate for uniformly distributed segment values. However, not all network data segments have uniformly distributed values.
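A minimal sketch of this estimation, assuming uniformly distributed values and only the index-node statistics described above, is given below; the class and method names are illustrative, and the estimates are counts of qualifying values (dividing by ni gives the fraction of the segment expected to pass a filter).

// Illustrative selectivity-factor estimation from per-segment statistics, under the
// uniform-distribution assumption; names are assumptions of this sketch.
final class SegmentStats {
    final long n;         // number of values in the segment (n_i)
    final long min;       // minimum value (m_i)
    final long max;       // maximum value (M_i)
    final long distinct;  // number of distinct values (d_i)
    SegmentStats(long n, long min, long max, long distinct) {
        this.n = n; this.min = min; this.max = max; this.distinct = distinct;
    }
}

final class SelectivityEstimator {
    static double eq(SegmentStats s, long v) {                    // A_i = v
        return (v < s.min || v > s.max) ? 0.0 : (double) s.n / s.distinct;
    }
    static double neq(SegmentStats s, long v) {                   // A_i != v
        return (v < s.min || v > s.max) ? s.n : s.n * (s.distinct - 1.0) / s.distinct;
    }
    static double lt(SegmentStats s, long v) {                    // A_i < v
        if (v <= s.min) return 0.0;
        if (v > s.max)  return s.n;
        return s.n * (double) (v - s.min) / (s.max - s.min);
    }
    static double in(SegmentStats s, long[] enumeratedValues) {   // A_i IN V, V enumerated
        double sf = 0.0;
        for (long v : enumeratedValues) sf += eq(s, v);           // sum of equality estimates
        return Math.min(sf, s.n);
    }
}

The remaining comparison operators follow the same pattern as lt.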
For segments whose values are not uniformly distributed, the simple equations do not give accurate estimations, and other auxiliary stored data, such as histograms, can be used at the expense of a decrease in the network flow insertion rate and an increase in storage overhead. For example, NetStore uses small histograms and the inverted IPs index to estimate the selectivity factors for segments. Histograms are mainly used to estimate the counts of busy servers and of common protocols and common port numbers, the rest of the values being pseudo-randomly distributed. We use Algorithm 4.1 to efficiently execute a simple query.

Algorithm 4.1 ExecuteSimpleQuery
INPUT: Q = {F1, ..., Fm}
for all Fi ∈ Q do
    Si ← segment in Fi predicate
    ci ← size of Si on disk
    SFi ← selectivity factor for Fi
    rank(Fi) ← ci / (1 − SFi)
end for
sort {F1, ..., Fm} in non-decreasing order by rank
POS ← nil
for all Fi ∈ Q do
    POS ← positions resulting from evaluating Fi using POS
end for

The expected running time of this simple algorithm is asymptotically bounded by the time of sorting the filters in non-decreasing order of ranks, thus O(m log m). However, since we expect to have a small number of filters, in practice the running time is determined by the overhead needed to read from disk the parameters used to compute each filter's rank. Since the parameters contributing to the rank can change for segments of the same column, the algorithm steps are executed for each set of corresponding segments (e.g. first for segment 1 in columns C1, C2, ..., then for segment 2 in columns C1, C2, ..., and so on). This processing model enables the use of parallelism on a multi-core execution server by processing each set of segments in a single thread and combining the partial results to obtain the final result. However, parallel processing was not included in our current evaluation of the system and is intended as future work. For the rest of this chapter we use segment and column interchangeably, and we say column data when we refer to either columns or segments.
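The rank-based ordering of Algorithm 4.1 can be sketched as follows; the filter type and the evaluation step are assumptions of this sketch rather than NetStore's actual interfaces.

import java.util.Comparator;
import java.util.List;
import java.util.function.UnaryOperator;

// Sketch of Algorithm 4.1: sort the filters of a simple query by rank = c_i / (1 - SF_i)
// and evaluate them as a pipeline of position lists.
final class PipelinedFilter {
    final long cost;                      // c_i: on-disk size of the segment the filter reads
    final double selectivity;             // SF_i as a fraction of values expected to pass
    final UnaryOperator<int[]> evaluate;  // maps incoming positions to positions that also pass
    PipelinedFilter(long cost, double selectivity, UnaryOperator<int[]> evaluate) {
        this.cost = cost; this.selectivity = selectivity; this.evaluate = evaluate;
    }
    double rank() { return cost / (1.0 - Math.min(selectivity, 0.999999)); }
}

final class SimpleQueryExecutor {
    static int[] execute(List<PipelinedFilter> filters, int[] allPositions) {
        filters.sort(Comparator.comparingDouble(PipelinedFilter::rank));  // non-decreasing rank
        int[] positions = allPositions;
        for (PipelinedFilter f : filters) {
            positions = f.evaluate.apply(positions);
            if (positions.length == 0) break;   // nothing left to filter
        }
        return positions;
    }
}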
4.2.2 Complex Queries
Besides the simple queries described in the previous section, network forensics and monitoring tasks use more complex queries. On one hand, investigation queries are interactive, sequential and process data in a drill-down manner. The investigator issues a simple query, analyzes the results and issues other, more specific, subsequent queries reusing some of the data from the previous queries' output [17, 18]. Thus, investigation queries are executed one-at-a-time with possible references to previous query results. We call this type of complex query sequential interactive queries. On the other hand, monitoring tools use queries that seek patterns of network behavior [32], summarize network traffic [39] and access a larger set of common attributes. These queries are sent and executed in batches, many-at-a-time. We call this type of complex query batch queries. Both types of complex queries can be represented using filtering and aggregation simple queries as building blocks, as shown in [32]. To describe the complex queries we again assume the network flow data is stored using the column oriented storage infrastructure described in Chapter 3. We define a complex query CQ as a set of simple queries {Q1, ..., Qk}. When processing a complex query using historical network flow data, the execution is slowed down by the expensive I/O operations needed to access disk resident data and by redundant and unnecessary operations. Suppose each network flow has n attributes and the data is stored using n columns, one column for each attribute. From the data perspective, the input of a simple query Qi is represented by a subset C∗ of the total set of columns C = {C1, ..., Cn}, a subset O∗ of the output column set of the previous i − 1 queries, O = {O1, ..., Oi−1}, and possibly some user input data whose size is assumed to be much smaller than either the disk resident or the intermediary data. We assume the intermediary and user data is memory resident and the data in columns is stored on disk. Therefore, the problem is to find an efficient execution model for the complex query CQ that minimizes the disk access, avoids unnecessary operations, and reuses intermediary data in an efficient way.

Sequential Interactive Queries
This type of complex query is predominant in an investigation scenario, and we consider the whole querying session when designing the processing model. A sequential interactive complex query can be implemented by the automaton in Figure 4.2.

Figure 4.2: The interactive sequential processing model.

The input of the automaton is represented by the sequence of simple queries that make up CQ. The automaton has only three states: the initial state S0, the intermediary state S1 and the final state S2. At each state the disk data represented by C∗ and the intermediary data represented by O∗ are processed. For this model we omit the user data since we assume it has negligible size compared to the other two types of data. The arrows between any two states represent the simple query executed and the data sources for the query. The tuple (Q1, C∗) represents that the machine will execute Q1 using only column data from disk and will generate intermediary data that will be stored at the intermediary state S1. The tuple (Qi, C∗, O∗), with i = 2, ..., k − 1, represents that the machine will execute queries Q2 to Qk−1, can take as input both disk resident and intermediary data, will generate intermediary results and will remain in the intermediary state S1. Finally, the tuple (Qk, C∗, O∗) will execute the last query in the sequence, and the machine will make the transition to the final state and will output the final result of the complex query. Using this processing model the available memory might end up being filled after a number of queries, depending on the output size of each query. In such a case the least recently issued query result is flushed to disk and a reference is kept in main memory. Since the only operations permitted for a simple query are filtering and aggregation, the output size of each simple query is expected to be, in general, much smaller than the input size. Therefore, in the case when a disk resident previous query output is referenced by a new query, the time to read the already computed result from disk is expected to be much smaller than the time of reading the input data plus the query processing time.

Batch Queries
In the case of batch complex query processing all the simple queries are sent for processing at the same time (for example by loading the simple query statements from a file); therefore all the simple queries are known when processing starts. A straightforward way to execute a batch of queries is to use the previous sequential processing model and execute the simple queries one-at-a-time.
However, the execution runtime for the batch of queries can be further improved, compared to the sequential case, by knowing all the query statements from the beginning. All the simple queries in the batch can be executed in any order, and some orders may result in better overall runtime performance. Additionally, for network flow data it is expected that some of the queries in the batch will use the same filtering predicates for some attributes (for example known ports, static internal IPs, known protocols, etc.). Therefore, the results from evaluating common predicates can be shared by many queries, thus saving execution time. Moreover, evaluating predicates in a particular order across all the simple queries may result in less evaluation work for the future predicates in the execution pipeline. Following the execution model for simple queries illustrated by Figure 4.1, Figure 4.3 shows a schematic batch query execution strategy when some filters are shared between the simple queries and the shared filters are executed only once. We simplified the graph and highlighted only the data loaded by the shared filters.

Figure 4.3: Directed acyclic graph representation of a batch of 3 queries with shared filters. Highlighted processing nodes represent the filters shared across the 3 simple queries. The order of execution is represented by the direction of the arrows between processing nodes.

When designing the query processing engine we address all the above opportunities for improvement, and in the remainder of this section we present the optimization methods used for both types of complex queries.

Complex Query Optimization
In the case of a sequential interactive query, some computation time might be saved for new simple queries by reusing the intermediary filter evaluation results of previous simple queries if some common subexpressions have already been computed. For this we maintain the resulting column positions obtained when executing each filter for the whole querying session. Therefore, when a new simple query is issued, all its filters are first checked against the pool of filters already evaluated in the session. If no match is found then the simple query is optimized using the method described in Section 4.2.1 for simple queries and the most efficient filtering order is decided. If a filter is found to have already been evaluated, then it is placed at the beginning of the filtering order by setting its rank to zero. Then the optimized ordering of the remainder of the filters is decided using the optimization method for simple queries. For network forensic investigations we expect to have a small number of sequential interactive simple queries per session, therefore we assume all the filter positions will fit in main memory. However, if the available memory fills up, the least recently used filter result will be flushed to disk and a reference will be kept in main memory.
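A minimal sketch of this per-session reuse is shown below, assuming evaluated position lists are keyed by the text of the filter predicate and evicted in least-recently-used order; in the real system an evicted result is flushed to disk and a reference is kept in memory. The key format and class names are assumptions of this sketch.

import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of a per-session cache of evaluated filter results. A filter found here is given
// rank zero and its cached position list is reused instead of re-reading the segment.
final class SessionFilterCache {
    private final Map<String, int[]> positionsByPredicate;

    SessionFilterCache(int maxEntries) {
        // access-ordered LinkedHashMap provides least-recently-used eviction
        this.positionsByPredicate = new LinkedHashMap<String, int[]>(16, 0.75f, true) {
            @Override protected boolean removeEldestEntry(Map.Entry<String, int[]> eldest) {
                return size() > maxEntries;   // in the real system: flush to disk, keep a reference
            }
        };
    }

    boolean contains(String predicate)               { return positionsByPredicate.containsKey(predicate); }
    int[] positions(String predicate)                { return positionsByPredicate.get(predicate); }
    void remember(String predicate, int[] positions) { positionsByPredicate.put(predicate, positions); }
}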
In the case of batch queries we build the processing model using an approach similar to the one presented in [35] for stream processing. Suppose a complex batch query is composed of a set of simple queries CQ = {Q1, ..., Qk}. Each simple query Qi is represented by a subset of the set of all filters {F1, ..., Fm}. We assume that filters can be shared among queries and that filters are evaluated independently of each other. Each filter Fj is represented by a predicate over a column (segment) and has an associated evaluation cost cj and a selectivity factor SFj. We say that a filter Fi resolves a query Qi if the predicate of Fi evaluates to FALSE for all the values in the column or if Fi is the last filter to be evaluated in Qi. We define a strategy ζ to execute a complex batch query CQ as a sequence of filters ζ = {Fa, ..., Fb}. Let pi be the probability that a filter Fi will be evaluated by a strategy ζ. Thus, the cost of using strategy ζ to resolve all the queries in CQ is given by:

cost(\zeta) = \sum_{i=a}^{b} p_i \cdot c_i

Therefore the problem is to find, among all the possible execution strategies, the execution strategy with the minimum cost. It is shown in [35] that this problem is NP-hard, by a reduction from the set cover problem. However, some approximation algorithms can be used, and we consider two of them: the naïve method and the adaptive method. The naïve way to execute a batch complex query is to first optimize each simple query and then execute the sequence of all the simple queries in order. However, since each filter might be shared by many queries, a more efficient approximate method is to also consider the participation of each filter in the queries and to use a ranking function similar to the one used for simple query processing optimization in Section 4.2.1. As such, for each filter Fi, if qi represents the number of unresolved queries that share filter Fi, ci the cost to evaluate the filter and SFi the selectivity factor of the filter predicate, the ranking function can be defined by:

rank(F_i) = \frac{c_i}{q_i \cdot (1 - SF_i)}

Algorithm 4.2 shows the processing steps for the batch query execution, using the column oriented system described in the previous chapter, NetStore, for storage.

Algorithm 4.2 ExecuteAdaptiveBatchQuery
INPUT: CQ = {Q1, ..., Qk}
create F ← {F1, ..., Fm}
for all Fi ∈ F do
    segi ← segment in Fi predicate
    ci ← size of segi on disk
    SFi ← selectivity factor of Fi
    qi ← number of unresolved queries containing Fi
    rank(Fi) ← ci / (qi · (1 − SFi))
end for
S ← {Q1, ..., Qk}        set of unresolved queries
P(Q1), ..., P(Qk)        result positions
while S ≠ ∅ do
    Fi ← filter with minimum rank(Fi)
    POS ← evaluate Fi
    for all Qj with Fi ∈ Qj do
        P(Qj) ← P(Qj) ∩ POS
    end for
    S ← S − {Qj | Fi resolves Qj}
    F ← F − {Fi}
    recompute rank(Fj) for all Fj ∈ F
end while

Initially all the filters of all the simple queries are added to the set of unevaluated filters F. Then, for each filter, the algorithm estimates the parameters used for the rank computation. To obtain the result, for each query Qi we maintain in the variable P(Qi) a list of the positions corresponding to the values that passed the evaluated filters. When a new filter has to be evaluated, the filter Fi with the lowest rank is selected from the pool of unevaluated filters. After the filter Fi is evaluated, the resulting list of column positions corresponding to the values that passed the filter is intersected with the current list of positions of each query containing Fi. Now the resulting set of filtered values of all the queries resolved by filter Fi can be constructed by scanning the corresponding columns again. At this point each resolved query's execution terminates and its result can be sent to the user. All the filters of the queries that are resolved after the evaluation of filter Fi, and that are not part of any other query, are considered evaluated and are removed from the filters pool.
The processing continues by updating the ranks of the remaining filters and sorting them in non-decreasing order of rank. We call this method the adaptive method, since at each step the rank of each filter is adaptively recomputed, taking the change in filter participation into account. Suppose θ is the number of pairs (Fj, Qi) where Fj appears as a filter of Qi. Then, when evaluating a filter, the elimination of the resolved queries after the filter evaluation is done in O(θ) time, and the total number of rank updates for filters belonging to resolved queries is also O(θ). We maintain the filters in a priority queue sorted by rank, therefore the update of each filter's rank is performed in O(log m) time, where m is the number of filters. Therefore, since we assume the filters are independent, the execution overhead of Algorithm 4.2 is O(θ log m). In [35] the authors show that an equivalent algorithm with k simple queries and m filters achieves an O(log² k · log m) approximation factor. However, in the case of monitoring batch queries executed on network flow data, the number of columns and distinct filters is rather small compared to the expected number of simple queries, therefore we expect the approximation factor to be dominated by the polylogarithmic function in the number of queries rather than in the number of filters.
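The following sketch illustrates the adaptive strategy of Algorithm 4.2: at each step it evaluates the pending filter with the smallest rank, intersects the resulting positions with every query that shares the filter, and retires resolved queries. The class names, the evaluation callback and the data structures are assumptions of this sketch.

import java.util.*;

// Sketch of the adaptive batch strategy; filters are shared between queries by reference
// and the query filter sets are consumed as filters get evaluated.
final class BatchFilter {
    final long cost;           // c_i: on-disk size of the segment the filter reads
    final double selectivity;  // SF_i as a fraction of values expected to pass
    BatchFilter(long cost, double selectivity) { this.cost = cost; this.selectivity = selectivity; }
    double rank(long unresolvedQueriesSharingThis) {   // q_i = 0 yields an infinite rank, never chosen
        return cost / (unresolvedQueriesSharingThis * (1.0 - Math.min(selectivity, 0.999999)));
    }
}

final class AdaptiveBatchExecutor {
    interface Evaluator { Set<Integer> evaluate(BatchFilter f); }   // positions passing the filter

    Map<Integer, Set<Integer>> execute(Map<Integer, Set<BatchFilter>> queries, Evaluator evaluator) {
        Map<Integer, Set<Integer>> result = new HashMap<>();        // query id -> accumulated positions
        Set<BatchFilter> pending = new HashSet<>();
        queries.values().forEach(pending::addAll);

        while (!queries.isEmpty() && !pending.isEmpty()) {
            // participation q_i = number of unresolved queries sharing the filter
            BatchFilter next = Collections.min(pending, Comparator.comparingDouble(
                    (BatchFilter f) -> f.rank(queries.values().stream().filter(q -> q.contains(f)).count())));
            Set<Integer> positions = evaluator.evaluate(next);
            pending.remove(next);

            Iterator<Map.Entry<Integer, Set<BatchFilter>>> it = queries.entrySet().iterator();
            while (it.hasNext()) {
                Map.Entry<Integer, Set<BatchFilter>> query = it.next();
                if (!query.getValue().remove(next)) continue;        // query does not share this filter
                result.merge(query.getKey(), new HashSet<>(positions),
                             (acc, cur) -> { acc.retainAll(cur); return acc; });
                if (positions.isEmpty() || query.getValue().isEmpty()) it.remove();   // query resolved
            }
        }
        return result;
    }
}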
4.3 Query Language
In many situations the limited scope of pure SQL commands is not suitable for network data analysis. Network monitoring and forensic tasks might require the use of sophisticated SQL nested queries, stored procedures and scripting languages that are not trivial to write [17, 32]. In addition to the processing engine and the query optimization methods, in this chapter we propose a simple SQL extension for querying network flow data called NetSQL. The query language extension was designed to easily describe join-free network monitoring and forensic queries based on their characteristics. The set of language operators is a subset of SQL commands plus a small set of added features implemented on top of the column oriented storage system described in the previous chapter. Besides standard SQL commands and operators, we only added features necessary for the efficiency of monitoring and forensics workloads.

4.3.1 Data Definition Language
NetSQL is implemented at the logical level, the physical level being transparent to the user. Taking into account the assumptions in the previous section, in this section we define the primitives of the data definition language. The syntax of the command to create a storage with n columns of various types is as follows:

CREATE STORAGE storage_name (A1 T1, ..., An Tn);

where A1, ..., An represent the attributes of the network flow records that will be stored and T1, ..., Tn their respective types. Since network data is permanently appended at line speed, it is often important to provide a mechanism to send data for storage as soon as it is collected from the network. Unlike standard SQL, NetSQL implements a primitive to import data from a network resource. The following construct is used for this purpose:

LOAD INTO storage_name FROM url;

Note that the data sent from the source can be in either the IPFIX format [26] or CSV text format. The inverted IPs index structure is created using the command:

CREATE IPS INDEX index_name ON ips_column_name USING ip_addr;

where index_name is the name of the inverted IPs index and ips_column_name can be either the source IP attribute or the destination IP attribute, matching a specific IP address expression or being a member of the set represented by ip_addr. For example, the IPv4 address expression can be given in the subnet CIDR format, and the inverted IPs index will be created only for the matching IP addresses. NetSQL also supports an operation for selecting an active storage, similar to the one used to select a table in a relational database, with the following syntax:

SELECT STORAGE storage_name;

Depending on the projected usage scenario other operations can also be designed and implemented. As such, we also implemented commands similar to the SQL commands for exporting a storage, deleting a storage, dropping a storage and dropping an index, with the following syntax:

EXPORT STORAGE storage_name TO csv_file;
DELETE STORAGE storage_name;
DROP STORAGE storage_name;
DROP INDEX index_name;

However, at the time of writing this dissertation NetSQL supports only the operations described in Section 4.3. These commands are used to enable the performance evaluation of the system and the testing of various optimization strategies.

4.3.2 Data Manipulation Language
NetSQL does not use the relational model for data organization; it assumes that at any given time only one storage is active and all the queries refer to that storage. The storage selection is done using the command presented in Section 4.3.1. Suppose that the storage has n attributes; then the following syntax represents the general form of a filtering and aggregation simple query written in NetSQL:

label: SELECT Ac1, ..., Ack, f1(Aa1), ..., fm(Aam)
WHERE σ1(A1) AND ... AND σn(An)
GROUP BY Ac1, ..., Ack;

where Ac1, ..., Ack with k ≤ n represent the attributes in the output, Aa1, ..., Aam with m ≤ n the attributes that are aggregated in the output, f1, ..., fm the aggregation functions applied to the m attributes and σ1, ..., σn the predicates over the attributes considered. The output of a query in NetSQL is represented by a set of columns. In order to facilitate the use of previous query results in subsequent queries, a label is added to each query. When an attribute Ai of the output of a query with label Qi is accessed by some subsequent query Qj, the dot notation Qi.Ai is used in the WHERE clause of the query labeled Qj to represent the attribute values in the output of Qi. NetSQL implements the basic filtering operators <, ≤, ≠, >, ≥ and =, and IN to represent the membership of an element in a set. The basic aggregation operators implemented in NetSQL are MIN, MAX, SUM and COUNT.

4.3.3 User Input
Sometimes, in network data analysis tasks, it is useful to load and use large predefined user inputs, such as lists of malicious hosts or lists of restricted port numbers, that are not part of the initial storage schema. Unlike standard SQL, NetSQL allows the user to input a set with a large number of values that can be used in conjunction with the existing storage data without the need to change the underlying storage and data structures. The following primitive implements this feature:

LOAD INTO U1 T1, ..., Us Ts FROM url;

where U1, ..., Us represent the names of the user input attributes and T1, ..., Ts their types.
By specifying the types at loading time, each column will be encoded using type-specific encoding methods in order to facilitate more efficient operations on the data in encoded format. For example, if the IP addresses are represented in dotted notation as strings in the user source file, they will be encoded into 4-byte integer values in the memory resident data structure. Note that the storage and user data loading routines are very similar in syntax but have different semantics. The main difference is that the storage data is written to disk while the user data is memory resident; it is therefore assumed that the size of the data structure created from the user input will fit in main memory at all times.

4.4 Experiments
We implemented an instance of the proposed querying framework using the Java programming language on the FreeBSD 7.2 platform. For all the experiments we used a single machine with 6 GB of DDR2 RAM, two quad-core 2.3 GHz CPUs and a 1 TB SATA-300, 32 MB buffer, 7200 rpm disk. We used network flow data captured over a 24-hour period of one weekday at our campus border router, representing about 8 GB of raw network flow data. We considered 12 attributes for each network flow record, each of byte, short or integer type, and stored the data using NetStore, the column oriented system described in Chapter 3. We present the experimental setting and the performance evaluation for both simple and complex queries.

4.4.1 Simple Queries Performance
We first generated simple queries with an increasing number of predicates, with selectivities chosen at random from the interval (0, 1). In the first phase, for each simple query we generated all the permutations of the filters in the query. We executed these permutations sequentially, without the filtering order optimization enabled, and recorded the running time of each permutation. Next, we ran the same simple queries with optimization enabled and recorded the running times.

Figure 4.4: The average simple query running times for the permutation, optimized and optimum strategies.

Figure 4.4 shows the average running times of the queries with increasing numbers of predicates for the two approaches, permutation and optimized, compared with the optimum running time. As the figure shows, the optimized simple query processing achieves better performance than the average running times of the permutation queries while maintaining close to optimum performance. This shows that the filtering cost estimation model is quite accurate, within a small predictable error. The small difference from the optimum is due, on one hand, to the assumption that the segment values are pseudo-randomly distributed, which affects the accuracy of the selectivity factor computation, and, on the other hand, to the preprocessing overhead of computing the estimation parameters. Since the number of filters within a simple query is rather small, the preprocessing overhead added by the filtering order optimization is expected to be negligible and much smaller than the absolute query execution time. We believe a more accurate optimal filtering order can be obtained by using more expensive approaches to maintain segment statistics and value distributions (for example full histograms) that would improve the selectivity factor estimation.
4.4.2 Complex Queries Performance
We consider the following parameters for each test: m, the number of filters; k, the number of simple queries; and p, the average participation of each filter (the number of queries that share the same filter). The expected number of filters per query is then q = pm/k. Since we have only 12 flow attributes and we allow at most one attribute per filter, we have q ≤ 12. For both types of complex queries, sequential and batch, we generated the simple queries in a similar way. We first generated the m filters having predicates with random selectivity factors in the interval (0, 1). Even though a larger number of predicates could be used, we set the number of predicates to m = 66 for all the experiments, simply to maintain reasonable running times for all the complex queries. Next we generated k queries with the expected number of filters per query q randomly selected in the interval [1, 12], by increasing the average participation p of each filter and calculating k for each value of q and p using the formula k = mp/q. For both complex query cases, we considered as the baseline the naïve method, in which all the simple queries are optimized and executed in sequential order.

Sequential Interactive Queries
We tested the efficiency of complex interactive queries by generating an increasing number of simple queries with increasing participation p of each filter, each simple query being executed sequentially, one-at-a-time. We then ran the same queries enabling the sharing of filter evaluation results between new queries and previous queries. We call this method of sharing filter evaluation results the sequential method. Figure 4.5 shows the results for m = 66, q randomly selected in the interval [1, 12] and p an increasing integer from 1 to 9. When the average filter participation p was under 2, that is, each filter occurred on average in fewer than two queries (22 queries on the graph), the result of the naïve method is slightly better than that of the sequential method. This is the case because the sequential method basically performed the same strategy as the naïve method, since no filter had been evaluated before. The performance penalty of the sequential method represents the overhead of managing the already computed results. After the point where the filter participation exceeds 2 (more than 22 queries on the graph), the computation savings from reusing the previously evaluated filters are clearly visible. The performance benefit of the sequential method varies up to an order of magnitude compared to the naïve method and is dependent on the filters' selectivity and participation.

Figure 4.5: The complex interactive query performance compared to the baseline.

We observed that, even though after a while all the filters are pre-evaluated with results stored in memory, the performance gap does not keep increasing. This is the case because the sequential method still has to intersect the filtering positions of each pre-evaluated filter. If the number of predicates is small and the processing engine has a large amount of memory available, another level of optimization could store and reuse the most frequent intersections of the filtered positions, but this design was not implemented and is left as future work.
Batch Queries
To test the batch query processing algorithm we again considered the naïve method as the baseline and varied the number of generated queries executed by each method, using m = 66 filters, q random in the interval [1, 12] and p an increasing integer in the interval [1, 21]. We ran each set of queries in batches using the naïve and adaptive methods, executing all the queries at the same time; Figure 4.6 shows the results obtained. Similar to the previous case, the performance difference compared to the naïve method starts to be noticeable once the participation of each filter increases. This is the case because once a shared filter is evaluated, more queries can benefit from its evaluation. However, the performance gap does not increase substantially until the filter participation is larger than 11 (on average 11 queries share at least one filter). This is the case because for fewer shared filters the smaller performance benefit is diminished by the large number of initially unresolved queries and by the filter rank re-computation overhead. Once the filters are evaluated, the resolved queries are no longer taken into account by Algorithm 4.2. For workloads of batch queries with more than 330 simple queries sharing more than 11 filters, the adaptive method performed up to 6 times faster in our experiments. Such workloads are not uncommon for monitoring purposes if we consider the increasing number of devices connected to IP networks. The main limitation of the adaptive batch processing algorithm is that, unlike the naïve method, it does not guarantee the optimum execution time for each individual simple query. However, a hybrid execution strategy that assigns priorities to each simple query can be implemented. The queries with the highest priority can be executed first using the naïve method, their filter evaluation positions can be cached, and the rest of the queries can be executed in a batch. Previously evaluated filter results can then be used in the adaptive algorithm by assigning them the lowest rank. We believe this is another interesting optimization scheme and we plan to pursue it in future work.

Figure 4.6: The complex batch query performance compared to the baseline.

4.5 Related Work
There is a rich body of research concerned with methods of querying network flow records and with the efficiency of concurrent query processing, and we review the techniques related to both areas.

4.5.1 Multi-Query Processing
In [11] the authors propose a method that enables ad-hoc concurrent queries to execute efficiently in a shared resource environment regardless of when they were submitted. This system works well for a data warehouse model that assumes non-blocking queries. However, it is not compatible with blocking forensic queries, nor with the batch query processing used for monitoring. When queries are considered for processing in batches, it is possible to identify common sub-expressions and optimize the execution of the batch of queries by generating plans that share computational cost among simple queries, as described in [48]. However, if we consider each simple query in the batch as a set of pipelined filters, the actual savings of performing a single evaluation of all the common filters can be lost if the optimal filtering order of individual queries is changed. The new filtering order might require the scanning of non-relevant segments, thus wasting processing time.
As such, for batch queries we use the idea of efficient work sharing with care, by assigning a rank to each filter as in [35]. In [35], the authors formalize the problem of optimally executing a collection of queries represented by conjunctions of possibly expensive filters that may be shared across queries. We use a similar algorithm and derive a cost function that contributes to the rank of each filter by taking into account the underlying storage infrastructure described in [19]. Based on the estimated cost, we prioritize the evaluation of expensive filters that are shared by many queries and will lead to more queries being executed faster, thus reducing the overall processing time.

4.5.2 Network Flow Query Languages
The existing query languages for network flow data can be grouped into two main categories: SQL-based and customized. In general the SQL-based languages use a transactional row-oriented database system for storage and standard SQL commands to define and manipulate the network flow data [18, 36]. These systems are able to provide support for basic forensics and monitoring tasks but show a performance penalty when querying large amounts of historical network flow data. Each of the systems described in [15, 56] proposes an extended SQL-based language to support streaming data operations such as selection, projection and aggregation, and ways to query data based on time windows, and our proposed simple language is similar in nature. However, these systems are designed to provide the best performance for live streaming data, not historical disk resident data, and are not suitable for interactive queries in which user input is interleaved at each step. The customized languages implement either filtering methods or a set of data manipulation commands. The filtering methods, such as the ones described in [33], allow users to construct filtering expressions for network traces using IP addresses, port numbers, protocols, etc., and to use the filtered data to generate traffic reports and to perform network analysis. Similar to the streaming SQL-based systems, these filtering methods are concerned with the real time analysis of network traffic and do not try to optimize monitoring and forensic interactive queries on disk resident data. Some systems implement a set of command line tools [17] or a collection of Perl scripts [39] and use their own primitives to define filtering expressions and to perform analysis. These systems have the capability to generate high level traffic reports using live and short-term historical data, but the reports must be expressed in terms of the provided set of primitives and are not trivial to write. In [32], the authors propose a more intuitive framework for designing a stream-based query language for network flow data, but implementation and query runtime performance are not their primary goal, so they are omitted from the analysis. In our framework we propose a simple SQL-based query language extension for network flow data because SQL is a well known standard and offers a rich set of primitives. Additionally, we implement a small set of features that enable a simpler representation of network monitoring and forensic queries without the need to use a complex scripting language.

4.6 Conclusion
In this chapter we presented the design and implementation of a querying framework for monitoring and forensic analysis using network flow data stored on disk in a column oriented storage system.
We showed that the performance of simple monitoring and forensics queries can be improved by up to an order of magnitude compared to the average case. We provided an efficient execution model for sequential interactive queries and showed that batch queries can achieve many times better performance than the naïve case, using an adaptive query processing algorithm, when the number of filters is small and the queries share many filters. Additionally, we presented the primitives of a simple SQL-based query language with a small set of added features that enable a simpler definition and representation of monitoring and forensic queries. To the best of our knowledge, our work is among the first to consider filtering optimization methods for sequential and batch queries on network flow data stored in a column-oriented system. Since the column data is stored in segments that can easily be processed in parallel, in future work we seek to investigate the design and development of efficient parallel processing algorithms in a distributed environment. The implementation of the querying framework presented in this chapter, together with the column oriented storage system described in Chapter 3, forms a complete system for storing and querying network flow data. In the next chapter we present a discussion of the methods described in this dissertation, propose new directions for future work and conclude our presentation.

Chapter 5
Conclusions and Future Work

5.1 Concluding Remarks
In this dissertation we presented new methods to store and query network data that have the potential to enhance the efficiency of network monitoring and forensic analysis. First, we presented novel methods for payload attribution. As part of a network forensics system, the proposed methods provide an efficient probabilistic query mechanism to answer queries for excerpts of a payload that passed through the network. The methods allow data reduction ratios greater than 100:1 while having a very low false positive rate when querying. At the same time, they allow queries for very small excerpts of a payload and also for excerpts that span multiple packets. The experimental results show that the methods achieve a significant improvement in query accuracy and storage space requirements compared to previous attribution techniques. More specifically, the evaluation in Chapter 2 shows that winnowing represents the best technique for block boundary selection in payload attribution applications, that shingling is clearly a more efficient method for consecutiveness resolution than the use of offset numbers and, finally, that the use of multiple instances of payload attribution methods can provide improved false positive rates and data-aging capability. The above payload processing techniques combined form a payload attribution method called Winnowing Multi-Hashing, which substantially outperforms previous methods. The experimental results also show that in general the accuracy of attribution increases with the length and the specificity of a query. Moreover, privacy and simple access control are achieved by the use of Bloom filters and one-way hashing with a secret key. Thus, even if the system is compromised no raw traffic data is ever exposed, and querying the system is possible only with knowledge of the secret key. Second, we presented the challenges and design guidelines of an efficient storage and query infrastructure for network flow data.
We presented the specific design, implementation and evaluation of a novel working architecture, called NetStore, that is useful for network monitoring tasks and assists in network forensics investigations. The simple column oriented design of NetStore helps reduce query processing time by reducing the time spent on disk I/O and by loading only the data needed for processing. Moreover, the column partitioning facilitates the use of efficient compression methods for network flow attributes that allow data processing in compressed format, further boosting query runtime performance. NetStore clearly outperforms existing row-based database systems and provides better results than general purpose column oriented systems because of simple design decisions tailored to network flow records, without using auxiliary data structures for tuple reconstruction. Experiments show that NetStore can provide more than ten times faster query response compared to other storage systems while maintaining a much smaller storage size.

Finally, we presented the design and implementation of a querying framework for monitoring and forensic analysis when using network flow data stored on disk in a column oriented storage system. We showed that the performance of simple monitoring and forensics queries can be improved by up to an order of magnitude compared to the average case if the order of the filtering predicates is chosen with care, by computing a rank for each filter based on data statistics and then executing the filters in increasing order of rank. We provided an efficient execution model for sequential interactive queries and showed that batch queries can potentially achieve many times better performance than the naïve case by using an adaptive query processing algorithm when the number of filters is small and the queries share many filters. We also presented the primitives of a simple SQL-based query language called NetSQL. The proposed language was designed for monitoring and forensic query workloads that use network flow data. In addition to the standard SQL commands, the language implements a small set of added features that enables simpler definition and representation of monitoring and forensic queries. To the best of our knowledge, our work is among the first to consider filtering optimization methods for sequential and batch queries on network flow data stored in a column-oriented system. Moreover, the implementation of the querying framework presented in Chapter 4 together with the column oriented storage system described in Chapter 3 form a complete system for storing and querying network flow data.

5.2 Future Work

The methods for processing, storing and querying network data presented in this dissertation were designed mainly for network forensic investigations and monitoring query workloads. However, we believe that these methods also have a much broader range of applicability in various areas where large amounts of data are processed. Therefore, each of the possible application domains creates interesting open problems that can be pursued in future work. For example, the payload attribution methods presented in Chapter 2 can be used to better decide the similarity score of document contents or of other large binary data files in general. Depending on the underlying data, some of the proposed methods might yield different performance metrics than in the case of payload processing, and the performance rankings under the new application requirements might differ.
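To make the content-partitioning step concrete, the following minimal sketch selects winnowing fingerprints from two byte strings and compares them with a Jaccard-style score. The helper names and the parameter values k and w are illustrative assumptions, not the exact configuration evaluated in Chapter 2; winnowing as defined in [47] additionally records fingerprint positions and breaks ties toward the rightmost minimum, which this sketch omits.

import hashlib

def kgram_hashes(data, k=8):
    # Hash every k-byte substring of the input byte string.
    return [int.from_bytes(hashlib.sha1(data[i:i + k]).digest()[:8], "big")
            for i in range(len(data) - k + 1)]

def winnow(hashes, w=16):
    # Keep the minimum hash of every window of w consecutive k-gram hashes,
    # guaranteeing at least one fingerprint per window and selecting
    # positions based on content rather than fixed offsets.
    if not hashes:
        return set()
    return {min(hashes[i:i + w]) for i in range(max(len(hashes) - w + 1, 1))}

def similarity(a, b, k=8, w=16):
    fa = winnow(kgram_hashes(a, k), w)
    fb = winnow(kgram_hashes(b, k), w)
    return len(fa & fb) / len(fa | fb) if fa | fb else 0.0

The payload attribution methods of Chapter 2 store such fingerprints in Bloom filters rather than comparing fingerprint sets directly; the direct comparison above is the variant that a similarity-scoring application would need.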
It would therefore be interesting to see the performance of each method in partitioning the content when making decisions about the similarity of large chunks of data transported over the network, such as in the case of web caching. Additionally, the storage infrastructure described in Chapter 3 can be used not only for network flow data but also for any kind of time-sequential data that shares the same characteristics and query workloads as network flow data. As such, a similar type of storage infrastructure is expected to yield good performance for most categories of structured time-sequential data, for example log records used for log analysis, or application protocol information (HTTP, DNS, FTP, etc.) used for various network analysis tasks. The interesting problem in this case is to analyze the use of the same storage infrastructure for storing attributes of arbitrary data types, for example strings of variable length or other binary objects that are added to the permanent storage using append-only operations. Moreover, the querying framework described in Chapter 4 can be used for other types of queries than monitoring and forensics. As such, any query workload that contains many queries sharing a large number of predicates, or queries that are executed sequentially, share some filters and use data stored in a column store, can be executed faster using the proposed algorithms.

Given the increasing emphasis on multi-core chip design and the increasing popularity of cloud computing models, another interesting research direction is to find new techniques and more efficient methods and algorithms that exploit parallel processing in multi-threaded and distributed environments. In Chapter 2 we argued that payload data collected from our campus border router is highly compressed into Bloom filters and that querying operations currently provide satisfactory performance for data stored in a single Bloom filter. However, as traffic volumes increase, querying is expected to become a computationally intensive operation when many filters are queried at once. Therefore, an interesting problem is to explore data processing models in a distributed environment in which all the filters are queried in parallel. Similarly, NetStore stores data as independent segments that can be easily processed in parallel. In this case we believe it would be interesting to investigate the design and development of efficient parallel processing algorithms that process the relevant segments independently and merge the resulting values in a final step.

Finally, the methods and systems presented in this dissertation were mainly implemented as prototypes for concept evaluation using network data collected at our university edge routers. Storing network data is becoming more popular and many organizations record it for various purposes. We believe it would be useful to have an efficient system for storing network data deployed in many places, in many organizations, that incorporates some of the techniques presented in this dissertation. Then, all these storage systems could work collaboratively to detect network abuse or other complex network behavior by using historical network data to assess the reputation of each unknown network entity. For example, it would be useful to have a global view of all the historical network data when an a priori unknown host A is trying to establish a connection with an organization's host X.
In this case, the decision of whether the connection between host A and host X should be allowed could be made after querying the external historical network stores of other organizations that may have recorded connections with host A before. Based on this idea, a whole reputation mechanism for the external hosts of an organization could be built.

Bibliography

[1] Daniel Abadi, Samuel Madden, and Miguel Ferreira. Integrating compression and execution in column-oriented database systems. In SIGMOD '06: Proceedings of the 2006 ACM SIGMOD International Conference on Management of Data, pages 671–682, New York, NY, USA, 2006. ACM.
[2] Daniel J. Abadi, Samuel R. Madden, and Nabil Hachem. Column-stores vs. row-stores: how different are they really? In SIGMOD '08: Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, pages 967–980, New York, NY, USA, 2008. ACM.
[3] Daniel J. Abadi, Daniel S. Myers, David J. DeWitt, and Samuel Madden. Materialization strategies in a column-oriented DBMS. In ICDE, pages 466–475, 2007.
[4] E. Anderson and M. Arlitt. Full Packet Capture and Offline Analysis on 1 and 10 Gb Networks. Technical Report HPL-2006-156, 2006.
[5] S. Bellovin and W. Cheswick. Privacy-enhanced searches using encrypted Bloom filters. Cryptology ePrint Archive, Report 2004/022, 2004. Available at http://eprint.iacr.org/.
[6] B. Bloom. Space/time tradeoffs in hash coding with allowable errors. In Communications of the ACM (CACM), pages 422–426, 1970.
[7] Lars Brenna, Alan Demers, Johannes Gehrke, Mingsheng Hong, Joel Ossher, Biswanath Panda, Mirek Riedewald, Mohit Thatte, and Walker White. Cayuga: a high-performance event processing engine. In SIGMOD '07: Proceedings of the 2007 ACM SIGMOD International Conference on Management of Data, pages 1100–1102, New York, NY, USA, 2007. ACM.
[8] A. Broder. Some applications of Rabin's fingerprinting method. In Sequences II: Methods in Communications, Security, and Computer Science, pages 143–152. Springer-Verlag, 1993.
[9] A. Broder. On the resemblance and containment of documents. In Proceedings of the Compression and Complexity of Sequences, 1997.
[10] A. Broder and M. Mitzenmacher. Network Applications of Bloom Filters: A Survey. In Annual Allerton Conference on Communication, Control, and Computing, Urbana-Champaign, Illinois, USA, October 2002.
[11] George Candea, Neoklis Polyzotis, and Radek Vingralek. A scalable, predictable join operator for highly concurrent data warehouses. Proc. VLDB Endow., 2(1):277–288, 2009.
[12] Sirish Chandrasekaran, Owen Cooper, Amol Deshpande, Michael J. Franklin, Joseph M. Hellerstein, Wei Hong, Sailesh Krishnamurthy, Sam Madden, Vijayshankar Raman, Fred Reiss, and Mehul Shah. TelegraphCQ: Continuous dataflow processing for an uncertain world, 2003.
[13] F. Chang, J. Dean, S. Ghemawat, W.C. Hsieh, D.A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R.E. Gruber. Bigtable: A distributed storage system for structured data. In Proceedings of the 7th USENIX Symposium on Operating Systems Design and Implementation (OSDI06), 2006.
[14] C. Y. Cho, S. Y. Lee, C. P. Tan, and Y. T. Tan. Network forensics on packet fingerprints. In 21st IFIP Information Security Conference (SEC 2006), Karlstad, Sweden, 2006.
[15] Chuck Cranor, Theodore Johnson, Oliver Spataschek, and Vladislav Shkapenyuk. Gigascope: a stream database for network applications. In SIGMOD '03: Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, pages 647–651, New York, NY, USA, 2003. ACM.
[16] S. Garfinkel. Network forensics: Tapping the internet. O'Reilly Network, 2002.
[17] Carrie Gates, Michael Collins, Michael Duggan, Andrew Kompanek, and Mark Thomas. More NetFlow tools for performance and security. In LISA '04: Proceedings of the 18th USENIX Conference on System Administration, pages 121–132, Berkeley, CA, USA, 2004. USENIX Association.
[18] Roxana Geambasu, Tanya Bragin, Jaeyeon Jung, and Magdalena Balazinska. On-demand view materialization and indexing for network forensic analysis. In NETB'07: Proceedings of the 3rd USENIX International Workshop on Networking Meets Databases, pages 1–7, Berkeley, CA, USA, 2007. USENIX Association.
[19] Paul Giura and Nasir Memon. NetStore: An efficient storage infrastructure for network forensics and monitoring. In Proceedings of the 13th International Symposium on Recent Advances in Intrusion Detection, Ottawa, Canada, September 2010.
[20] Jonathan Goldstein, Raghu Ramakrishnan, and Uri Shaft. Compressing relations and indexes. In Proceedings of the IEEE International Conference on Data Engineering, pages 370–379, 1998.
[21] Guofei Gu, Phillip Porras, Vinod Yegneswaran, Martin Fong, and Wenke Lee. BotHunter: Detecting Malware Infection Through IDS-Driven Dialog Correlation. In Proceedings of the 16th USENIX Security Symposium, pages 167–182, August 2007.
[22] Alan Halverson, Jennifer L. Beckmann, Jeffrey F. Naughton, and David J. DeWitt. A comparison of C-Store and row-store in a common framework. Technical Report TR1570, University of Wisconsin-Madison, 2006.
[23] M. Handley, C. Kreibich, and V. Paxson. Network Intrusion Detection: Evasion, Traffic Normalization, and End-to-End Protocol Semantics. In Proceedings of the USENIX Security Symposium, Washington, USA, 2001.
[24] Joseph M. Hellerstein and Michael Stonebraker. Predicate migration: Optimizing queries with expensive predicates. In SIGMOD Conference, pages 267–276, 1993.
[25] Allison L. Holloway and David J. DeWitt. Read-optimized databases, in depth. Proc. VLDB Endow., 1(1):502–513, 2008.
[26] IETF. IP Flow Information Export (IPFIX). http://datatracker.ietf.org/wg/ipfix/charter.
[27] Infobright Inc. Infobright. http://www.infobright.com.
[28] Sybase Inc. Sybase IQ. http://www.sybase.com.
[29] N. King and E. Weiss. Network Forensics Analysis Tools (NFATs) reveal insecurities, turn sysadmins into system detectives. Information Security, Feb. 2002. Available at www.infosecuritymag.com/2002/feb/cover.shtml.
[30] LucidEra. LucidDB. http://www.luciddb.org.
[31] U. Manber. Finding similar files in a large file system. In Proceedings of the USENIX Winter 1994 Technical Conference, pages 1–10, San Francisco, CA, USA, 1994.
[32] Vladislav Marinov and Jürgen Schönwälder. Design of a stream-based IP flow record query language. In DSOM '09: Proceedings of the 20th IFIP/IEEE International Workshop on Distributed Systems: Operations and Management, pages 15–28, Berlin, Heidelberg, 2009. Springer-Verlag.
[33] Steven McCanne and Van Jacobson. The BSD packet filter: a new architecture for user-level packet capture. In USENIX'93: Proceedings of the USENIX Winter 1993 Conference, pages 2–2, Berkeley, CA, USA, 1993. USENIX Association.
[34] M. Mitzenmacher. Compressed Bloom Filters. IEEE/ACM Transactions on Networking (TON), 10(5):604–612, 2002.
[35] Kamesh Munagala, Utkarsh Srivastava, and Jennifer Widom. Optimization of continuous queries with shared expensive filters. In PODS '07: Proceedings of the Twenty-Sixth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pages 215–224, New York, NY, USA, 2007. ACM.
[36] Bill Nickless. Combining Cisco NetFlow exports with relational database technology for usage statistics, intrusion detection, and network forensics. In LISA '00: Proceedings of the 14th USENIX Conference on System Administration, pages 285–290, Berkeley, CA, USA, 2000. USENIX Association.
[37] NTOP. PF_RING Linux kernel patch. Available at http://www.ntop.org/PF_RING.html, 2008.
[38] Vern Paxson. Bro: A system for detecting network intruders in real-time. In Computer Networks, pages 2435–2463, 1998.
[39] Dave Plonka. FlowScan: A network traffic flow reporting and visualization tool. In LISA '00: Proceedings of the 14th USENIX Conference on System Administration, pages 305–318, Berkeley, CA, USA, 2000. USENIX Association.
[40] M. Ponec, P. Giura, H. Brönnimann, and J. Wein. Highly Efficient Techniques for Network Forensics. In Proceedings of the 14th ACM Conference on Computer and Communications Security, pages 150–160, Alexandria, Virginia, USA, October 2007.
[41] Miroslav Ponec, Paul Giura, Joel Wein, and Hervé Brönnimann. New payload attribution methods for network forensic investigations. ACM Trans. Inf. Syst. Secur., 13(2):1–32, 2010.
[42] PostgreSQL. PostgreSQL. http://www.postgresql.org.
[43] M. O. Rabin. Fingerprinting by random polynomials. Technical Report 15-81, Harvard University, 1981.
[44] S. Rhea, K. Liang, and E. Brewer. Value-based web caching. In Proceedings of the Twelfth International World Wide Web Conference, May 2003.
[45] Robert Richardson and Sara Peters. 2007 CSI Computer Crime and Security Survey Shows Average Cyber-Losses Jumping After Five-Year Decline. CSI Press Release, September 2007. Available at http://www.gocsi.com/press/20070913.jhtml.
[46] Martin Roesch. Snort - lightweight intrusion detection for networks. In LISA '99: Proceedings of the 13th USENIX Conference on System Administration, pages 229–238, Berkeley, CA, USA, 1999. USENIX Association.
[47] S. Schleimer, D. S. Wilkerson, and A. Aiken. Winnowing: local algorithms for document fingerprinting. In SIGMOD '03: Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, pages 76–85, New York, NY, USA, 2003. ACM Press.
[48] Timos K. Sellis. Multiple-query optimization. ACM Transactions on Database Systems, 13:23–52, 1988.
[49] K. Shanmugasundaram, H. Brönnimann, and N. Memon. Payload Attribution via Hierarchical Bloom Filters. In Proc. of ACM CCS, 2004.
[50] K. Shanmugasundaram, N. Memon, A. Savant, and H. Brönnimann. ForNet: A Distributed Forensics Network. In Proc. of MMM-ACNS Workshop, pages 1–16, 2003.
[51] Abraham Silberschatz, Henry Korth, and S. Sudarshan. Database Systems Concepts. McGraw-Hill, Inc., New York, NY, USA, 2010.
[52] Dominik Ślȩzak, Jakub Wróblewski, Victoria Eastwood, and Piotr Synak. Brighthouse: an analytic data warehouse for ad-hoc queries. Proc. VLDB Endow., 1(2):1337–1345, 2008.
[53] A. C. Snoeren, C. Partridge, L. A. Sanchez, C. E. Jones, F. Tchakountio, S. T. Kent, and W. T. Strayer. Hash-based IP traceback. In ACM SIGCOMM, San Diego, California, USA, August 2001.
[54] S. Staniford-Chen and L.T. Heberlein. Holding intruders accountable on the internet. In Proceedings of the 1995 IEEE Symposium on Security and Privacy, Oakland, 1995.
[55] Mike Stonebraker, Daniel J. Abadi, Adam Batkin, Xuedong Chen, Mitch Cherniack, Miguel Ferreira, Edmond Lau, Amerson Lin, Sam Madden, Elizabeth O'Neil, Pat O'Neil, Alex Rasin, Nga Tran, and Stan Zdonik. C-store: a column-oriented DBMS. In VLDB '05: Proceedings of the 31st International Conference on Very Large Data Bases, pages 553–564. VLDB Endowment, 2005.
[56] Mark Sullivan and Andrew Heybey. Tribeca: A system for managing large databases of network traffic. In USENIX, pages 13–24, 1998.
[57] Cisco Systems. Cisco IOS NetFlow. http://www.cisco.com.
[58] Vertica Systems. Vertica. http://www.vertica.com.
[59] Jacob Ziv and Abraham Lempel. A universal algorithm for sequential data compression. IEEE Transactions on Information Theory, 23:337–343, 1977.
[60] Marcin Zukowski, Peter A. Boncz, Niels Nes, and Sándor Héman. MonetDB/X100 - A DBMS in the CPU cache. IEEE Data Eng. Bull., 28(2):17–22, 2005.