EFFICIENT METHODS TO STORE AND QUERY NETWORK DATA
DISSERTATION
Submitted in Partial Fulfillment
of the Requirements for the
Degree of
DOCTOR OF PHILOSOPHY (Computer Science)
at the
POLYTECHNIC INSTITUTE OF NEW YORK UNIVERSITY
by
Paul Giura
January 2011
Approved:
Department Head
Copy No.
Approved by the Guidance Committee:
Major: Computer Science
Nasir Memon, Professor of Computer Science
Torsten Suel, Associate Professor of Computer Science
Joel Wein, Associate Professor of Computer Science
Hervé Brönnimann, Associate Professor of Computer Science
Microfilm or other copies of this dissertation are obtainable from
UNIVERSITY MICROFILMS
300 N. Zeeb Road
Ann Arbor, Michigan 48106
Vita
Paul Giura was born in Slatina, Olt, Romania in 1981. He received his B.S. degree from
University of Bucharest, Bucharest, Romania in the summer of 2004. In September of 2005
Paul started working towards his M.S./PhD degree as a Research Assistant at Polytechnic
Institute of NYU and in 2007 he received his M.S. degree.
During his first years at NYU-Poly Paul worked to develop new techniques for network
payload attribution systems; his research then focused on efficient storage methods for
network flow data. In 2007 he was a summer intern at Ricoh Innovations research labs where
he designed and implemented a prototype system used to establish trust and integrity of
electronic and paper document archives. In 2008 he was a summer intern at AT&T Security
where he worked on the security of Content Distribution Network architectures. His research
interests include Network and Internet Security, Systems, Data Storage Technologies and
Algorithms for Security.
To my parents, Alecsandru and Florica, my brother, Cristian, and my love, Simona.
Acknowledgements
I am extremely grateful to my advisor Nasir Memon who was an excellent source of
advice for both technical and professional matters. He provided support, encouragement,
good company, and lots of great ideas. It is impossible to imagine someone being more
selfless, having more of his students' interests in mind, or giving them more freedom. I
could not imagine having a better advisor and mentor for my Ph.D. years.
Thanks especially to Hervé Brönnimann for his great advice, patience, great ideas and
his key contribution to my professional development, first as my initial Ph.D.
advisor and later as a valuable member of my defense committee.
Thanks to Miroslav Ponec, who worked closely with me on the payload attribution
methods (which became Chapter 2 of this dissertation), for his contribution to the subject
and his friendship throughout the Ph.D. years and beyond. Many thanks to Joel Wein for his
guidance on the payload attribution methods research, for all his valuable contributions and
for serving as a valuable member of my Ph.D. committee.
I thank Torsten Suel, from whom I learned a lot about database concepts, for his
helpful feedback and for being a valuable member of my defense committee. I would like to
thank Kulesh Shanmugasundaram for suggesting to me the problem of finding efficient storage
methods for network flow data and for all the helpful discussions we had. I would also like
to thank many other excellent professors at Polytechnic Institute of NYU, from whom I
learned a lot.
Being a graduate student in a foreign country for the first time, I encountered many
obstacles that I was able to overcome with great friends by my side. I would especially like to
thank George Marusca and Florin Zidaru for being good friends, and for their invaluable
help when I first arrived in the United States. Thanks also to Baris Coskun for his friendship,
for sharing helpful discussions in difficult moments and for being a great fishing partner.
I would like to mention Marin Marcu and Cristian Giurescu for guiding me in the
early years of my career, and my undergraduate advisor Ioan Tomescu at the University of
Bucharest for advising and encouraging me to pursue my doctorate degree.
I would like to thank my parents Alecsandru Giura and Florica Giura, and my brother
Cristian Giura for their encouragement and invaluable support which always gave me
strength to get through this endeavor despite many lands and seas between us.
Finally, I would like to thank Simona Cirstea who was by my side in difficult moments,
inspired me to persist and brought me happiness and love.
Paul Giura, Polytechnic Institute of New York University
November 2, 2010
Abstract
Network data crosses network boundaries in and out, and many organizations record
traces of network connections for monitoring and investigation purposes. With the increase
in network traffic and the sophistication of attacks, there is a need for efficient methods to
store and query these data. In this dissertation we propose new efficient methods for storing
and querying network payload and flow data that can be used to enhance the performance
of monitoring and forensic analysis.
We first address the efficiency of various methods used for payload attribution. Given
a history of packet transmissions and an excerpt of a possible packet payload, a Payload
Attribution System (PAS) makes it feasible to identify the sources, destinations and the
times of appearance on a network of all the packets that contained the specified payload
excerpt. A PAS, as one of the core components in a network forensics system, enables
investigating cybercrimes on the Internet, by, for example, tracing the spread of worms
and viruses, identifying who has received a phishing email in an enterprise, or discovering
which insider allowed an unauthorized disclosure of sensitive information. Considering the
increasing volume of network traffic in today’s networks it is infeasible to effectively store
and query all the actual packets for extended periods of time for investigations. In this
dissertation we focus on extremely compressed digests of payload data; we analyze the
existing approaches and propose several new methods for payload attribution which utilize
Rabin fingerprinting, shingling, and winnowing. Our best methods allow building payload
attribution systems which provide data reduction ratios greater than 100:1 while supporting
efficient queries with very low false positive rates. We demonstrate the properties of the
proposed methods and specifically analyze their performance and practicality when used as
modules of a network forensics system.
Next, we propose a column oriented storage infrastructure for storing historical network flow data. Transactional row-oriented databases provide satisfactory query
performance for network flow data collected only over a period of several hours. In many
cases, such as the detection of sophisticated coordinated attacks, it is crucial to query days,
weeks or even months worth of disk resident historical data rapidly. For such monitoring
and forensics queries, row oriented databases become I/O bound due to long disk access
times. Furthermore, their data insertion rate decreases with the number of indexes used,
and query processing time is increased when it is necessary to load unused attributes along
with the used ones. To overcome these problems in this dissertation we propose a new column oriented storage infrastructure for network flow records and present the performance
evaluation of a prototype storage system implementation called NetStore. The system is
aware of network data semantics and access patterns, and benefits from the simple column
oriented layout without the need to meet general purpose database requirements. We show
that NetStore can potentially achieve more than ten times query speedup and ninety times
less storage requirements compared to traditional row-stores, while it performs better than
existing open source column-stores for network flow data.
Finally, we propose an efficient querying framework to represent, implement and execute
forensics and monitoring queries faster on historical network flow data. Using efficient filtering methods, the query processing algorithms can improve the query runtime performance
up to an order of magnitude for simple filtering and aggregation queries, and up to six times
for batch complex queries when compared to naïve approaches. Additionally, we propose a
simple SQL extension that implements a subset of standard SQL commands and operators
and a small set of features useful for network monitoring and forensics. The presented query
processing engine together with a column storage infrastructure create a complete system
for storing and querying network flow data efficiently when used for monitoring and forensic
analysis.
Contents

Vita
Acknowledgements
Abstract

1 Introduction
  1.1 Network Monitoring and Forensics
  1.2 Network Data
    1.2.1 Packets Payload Data
    1.2.2 Network Flow Data
  1.3 Challenges and Contributions
    1.3.1 Payload Attribution Systems
    1.3.2 Network Flow Data Storage Systems
  1.4 Summary and Dissertation Outline

2 New Payload Attribution Methods
  2.1 Introduction
  2.2 Related Work
    2.2.1 Bloom Filters
    2.2.2 Rabin Fingerprinting
    2.2.3 Winnowing
    2.2.4 Attribution Systems
  2.3 Methods for Payload Attribution
    2.3.1 Hierarchical Bloom Filter (HBF)
    2.3.2 Fixed Block Shingling (FBS)
    2.3.3 Variable Block Shingling (VBS)
    2.3.4 Enhanced Variable Block Shingling (EVBS)
    2.3.5 Winnowing Block Shingling (WBS)
    2.3.6 Variable Hierarchical Bloom Filter (VHBF)
    2.3.7 Fixed Doubles (FD)
    2.3.8 Variable Doubles (VD)
    2.3.9 Enhanced Variable Doubles (EVD)
    2.3.10 Multi-Hashing (MH)
    2.3.11 Enhanced Multi-Hashing (EMH)
    2.3.12 Winnowing Multi-Hashing (WMH)
  2.4 Payload Attribution Systems Challenges
    2.4.1 Attacks on PAS
    2.4.2 Multi-packet queries
    2.4.3 Privacy and Simple Access Control
    2.4.4 Compression
  2.5 Experimental Results
    2.5.1 Performance Metrics
    2.5.2 Block Size Distribution
    2.5.3 Unprocessed Payload
    2.5.4 Query Answers
  2.6 Conclusion

3 A Storage Infrastructure for Network Flow Data
  3.1 Introduction
  3.2 Related Work
  3.3 Architecture
    3.3.1 Network Flow Data
    3.3.2 Column Oriented Storage
    3.3.3 Compression
    3.3.4 Query Processing
  3.4 Evaluation
    3.4.1 Parameters
    3.4.2 Queries
    3.4.3 Compression
    3.4.4 Comparison With Other Systems
  3.5 Conclusion

4 A Querying Framework For Network Monitoring and Forensics
  4.1 Introduction
  4.2 Query Processing
    4.2.1 Simple Queries
    4.2.2 Complex Queries
  4.3 Query Language
    4.3.1 Data Definition Language
    4.3.2 Data Manipulation Language
    4.3.3 User Input
  4.4 Experiments
    4.4.1 Simple Queries Performance
    4.4.2 Complex Queries Performance
  4.5 Related Work
    4.5.1 Multi-Query Processing
    4.5.2 Network Flow Query Languages
  4.6 Conclusion

5 Conclusions and Future Work
  5.1 Concluding Remarks
  5.2 Future Work

List of Figures

1.1 Network packet.
1.2 Network flow data representation.
1.3 Row-store RDBMS.
1.4 A column-store RDBMS.
2.1 Methods evolution tree.
2.2 Processing with HBF method.
2.3 HBF querying collision.
2.4 Processing with FBS method.
2.5 Collision because of shingling failure.
2.6 Querying with FBS method.
2.7 Excerpt not found with FBS method.
2.8 Processing with VBS method.
2.9 Processing with EVBS method.
2.10 Processing with WBS method.
2.11 Processing with VHBF method.
2.12 Processing with FD method.
2.13 Processing with VD method.
2.14 Processing with EVD method.
2.15 Processing with MH method.
2.16 Processing with WMH method.
2.17 Query using WMH method.
2.18 The distributions of block sizes for VBS method.
2.19 The distributions of block sizes for EVBS method.
2.20 The distributions of block sizes for WBS method.
2.21 All methods evaluation.
3.1 Network flow traffic distribution for one day.
3.2 Network flow traffic distribution for one month.
3.3 NetStore architecture.
3.4 NetStore processing engine.
3.5 Column-store architecture.
3.6 IPs inverted index.
3.7 Insertion rate for different segment sizes.
3.8 Compression ratio with and without aggregation.
3.9 Query time vs. segment size.
3.10 Compression strategies results.
4.1 Simple query execution graph.
4.2 The interactive sequential processing model.
4.3 Batch of queries representation.
4.4 Simple query optimization evaluation.
4.5 The complex interactive query performance compared to the baseline.
4.6 The complex batch query performance compared to the baseline.

List of Tables

2.1 Summary of properties of methods.
2.2 Number of elements inserted for each method.
2.3 (a) False positive rates for data reduction ratio 130:1. (b) Percentage of unprocessed payload.
2.4 False positive rate for data reduction ratio 50:1.
3.1 NetStore flow attributes.
3.2 NetStore properties and rates supported.
3.3 NetStore relative performance compared to PostgreSQL and LucidDB.

List of Algorithms

4.1 ExecuteSimpleQuery
4.2 ExecuteAdaptiveBatchQuery
Chapter 1
Introduction
The number of devices that connect to the Internet and the traffic they generate
increase every day, with more and more data being transported across networks. This data
crosses private network boundaries in and out and many organizations record traces of
network connections for monitoring and investigation purposes. In general, the collected
data is preprocessed and archived using relational databases or some other file organization
on permanent storage devices. In both cases the system used to process and archive the
collected network data impacts the query response times as well as the efficiency of the data
usage for monitoring and forensics. In this dissertation we analyze existing practices for
archiving network data and propose new efficient methods that can be used to enhance the
performance of monitoring and forensic analysis.
1.1
Network Monitoring and Forensics
Network monitoring represents the practice of overseeing the operation of a computer
network in order to detect failures of devices, failure of connections or other unexpected
network communication behavior. Besides the ability to display traffic summarization per
hour, per day or other predefined time window, network security monitoring systems, such as
Intrusion Detection Systems (IDS), are increasingly providing the ability to detect complex
network behavior patterns, such as worm infections, by using near real time network data.
Therefore it is important to have access to the monitored data as quickly as possible in order
to reduce the time of network exposure to various security vulnerabilities or to detect
communication malfunctions and resume the normal network activities in the shortest time
possible.
Network forensics is the capture, recording, and analysis of network events in order
to discover the source of security attacks or other problem incidents. In comparison with
network monitoring, the forensics tasks require access to larger amounts of network data
collected for longer periods of time that can represent days, weeks, months or even years.
This data is processed in several steps in a drill-down manner, narrowing and correlating
the subsequent intermediary results throughout the analysis. As such, network forensic
systems need the ability to identify the root cause of a security breach starting from a
simple evidence point such as an excerpt of a phishing email, an internet worm signature
or a piece of sensitive data disclosed by an insider. The next steps in the investigation
may involve checking a suspected host’s past network activity, looking up any services run
by a host, protocols used, the connection records to other hosts that may or may not be
compromised. In this case, the systems used for storing and querying network data should
take into account the general characteristics of forensic queries and should provide valuable
evidence and reasonable response times when using the archived data.
For both monitoring and forensics tasks the amount of data considered for processing
and analysis is increasing every day. The existing data storage systems are no longer suitable
to store and query network data [19] and new methods are needed to efficiently meet the
new requirements. In this dissertation we propose new methods for storing and querying
network data. The proposed methods are aware of the network data semantics and potential
query workloads, and can be efficiently used for monitoring and forensic analysis.
1.2
Network Data
In this section we present the different network data categories that we consider when
designing our methods for storage and querying: payload data and network flow data. Data
is transported over the network in packets and each packet contains both a header and a
payload. Payload data represents the actual data carried over the network in the network
packets and network flow data represents the quantitative information about the communication between two endpoints in the network. Our goal is to find efficient storage and
querying methods for both of these network data categories.
1.2.1
Packets Payload Data
A network packet's payload contains the actual string of bits transferred over the
network. Figure 1.1 shows schematically a network packet with header and payload data.
This data can represent various content types such as plaintext, images, video, audio or
encrypted data. In general this data is unstructured in nature and most of the time it is
transported in encoded format, the decoding being done at the application layer.
Figure 1.1: Simple network packet payload data representation.
In many situations payload data is stored along with the header information in network
traces for monitoring and investigations. These traces represent data captured from the
network and may be stored in full format using lossless compression methods, where all the
original data can be reconstructed from the compressed format, or in extremely compressed
digest format using lossy compression methods, where some of the original data might be
lost but the information retained about the original data is still of considerable utility. Due
to the increasing volume of network traffic in today's networks it is infeasible to effectively
store and query all the actual packet payload data for extended periods of time in order to
allow analysis of network events for investigative purposes. Therefore, in this dissertation
we focus on the latter case when payload data is stored using extremely compressed digests
of payload and propose new methods that can be efficiently used by the payload attribution
systems for investigation tasks. More specifically, we propose various methods to partition
the payload into blocks using winnowing [47] and Rabin fingerprinting [43], then store
them in a Bloom filter [6]. We present the details of all the payload attribution methods
proposed [40, 41] in Chapter 2.
1.2.2
Network Flow Data
Unlike payload data, the header data is structured and includes specific information
from each networking layer. In general header data contains routing information used by
networked devices and hosts operating systems in order to facilitate the transportation of
the communication data flow. Among other identification and control information requested
by the lower layers protocols, a network packet header contains information requested by
the transport and network layers protocols such as source IP, source port, destination IP,
destination port, protocol, etc.
In the context of network communication we refer to a flow as a unidirectional data
stream between two endpoints, and refer to flow data or flow record as the quantitative
description of a flow. Flow data includes source IP, source port, destination IP, destination
port, protocol, number of bytes transported in the flow, start time, end time, etc. A
schematic representation of the network flow data is presented by Figure 1.2. Since flow
data has become ubiquitous in recent years organizations have developed standards [26]
and protocols [57] in order to provide a common framework for using this data. As such,
many organizations store and use network flow data for various purposes such as for traffic
metering, network monitoring, intrusion detection and network forensics.
Figure 1.2: Network flow data representation.
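As a concrete illustration of such a record, the sketch below models a single flow with the attributes listed above. The field names and types are illustrative assumptions for this dissertation's discussion, not the exact schema used by any particular collector or by NetStore.

```python
from dataclasses import dataclass

@dataclass
class FlowRecord:
    """One unidirectional flow, described quantitatively (illustrative fields only)."""
    src_ip: int       # source IPv4 address packed into a 4-byte integer
    src_port: int     # source port, fits in a 2-byte short
    dst_ip: int       # destination IPv4 address, 4-byte integer
    dst_port: int     # destination port, 2-byte short
    protocol: int     # IP protocol number (e.g., 6 for TCP, 17 for UDP)
    num_bytes: int    # number of bytes transported in the flow
    start_time: int   # flow start timestamp (seconds since epoch)
    end_time: int     # flow end timestamp (seconds since epoch)
```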
In contrast with packet payload data, network flow data is highly structured. Each
attribute of a flow can be stored efficiently using well established representations of common numerical types (IP addresses as a 4-byte integer, source port as a 2-byte short, etc.) or
specific protocol structures (DNS records, HTTP requests, etc.). Current systems [18] store
historical network flow records using transactional Relational Database Management Systems (RDBMS) or plain files organized in a hierarchy based on capture time [17]. These
systems show a performance penalty when storing and querying data spanning long periods of time due to the I/O operations overhead. Chapter 3 of this dissertation presents
a novel storage architecture that avoids the limitations of transactional databases and
uses a column-oriented approach to store and query network flow records for monitoring
and forensic analysis. Moreover, Chapter 4 shows the design of the query processing engine
for the column-oriented system and proposes several optimization methods for simple and
complex, forensics and monitoring queries.
1.3
Challenges and Contributions
With the deployment of new networked devices and Internet based services there is an
accelerated increase in both networks speeds and the amount of data transported. In this
dissertation we propose new methods for coping with these major challenges in order to
provide efficient functionality for systems used in network monitoring and forensic analysis.
First, in Chapter 2 we look at the existing methods for payload attribution, observe their
limitations and propose new techniques that gradually achieve the desired goals: highest
payload compression ratios with the smallest false positives rates when querying. Second,
we examine the existing solutions for storing and querying network flow data, then based
on the data and expected queries characteristics we propose a new storage architecture in
Chapter 3 and a query execution framework in Chapter 4 that are better suited to store
and query historical network flow data.
1.3.1
Payload Attribution Systems
Payload attribution is the process of identifying the sources and destinations of all packets that appeared on a network and contained a certain excerpt of a payload. A Payload
Attribution System (PAS) is a system that can facilitate payload attribution. It can be an
extremely valuable tool in helping to determine the perpetrators or the victims of a network
event and to analyze security incidents in general.
A payload attribution system performs two separate tasks: payload processing and
query processing. In payload processing, the payload of all traffic that passed through
the network where the PAS is deployed is examined and some information is saved into
permanent storage. This has to be done at line speed and the underlying raw packet
capture component can also perform some filtering of the packets, for example, choosing to
process only HTTP traffic.
In general, data is stored in archive units, each of which has two timestamps (start
and end of the time interval during which data was collected). For each time interval there
is also a need to save a unique flow identifier, a flowID (pairs of source and destination IP
addresses, etc) to allow querying later on. This information can be alternatively obtained
from various sources such as connection records collected by dedicated network sensors
(routers exporting NetFlow [57]), firewalls, intrusion detection systems or other log files.
During query processing, given the excerpt and a time interval of interest, the PAS has to
retrieve all the corresponding archive units from storage. For example, when querying
the PAS for the excerpt, if the excerpt is found in the archive, the system will try to query
successively for all the flowIDs available corresponding to the time interval, and report all
matches to the user.
A naïve method to design a simple payload attribution system is to store the payload
of all packets. In order to decrease the demand for storage capacity and to provide some
privacy guarantees, one can store hashes of payloads instead of the actual payloads. This
approach reduces the amount of data per packet (to about 20 bytes by using SHA-1, for
example) at the cost of false positives due to hash collisions.
To further reduce the required storage space one can insert the payloads in a Bloom
filter [6] which is described in Section 2.2.1 of Chapter 2. Essentially, Bloom filters are
space-efficient probabilistic data structures supporting membership queries that are used in
many network and other applications [10]. An empty Bloom filter is a bit vector of m bits,
all set to 0, that uses k different hash functions, each of which maps a key value to one of
the m positions in the vector. To insert an element into the Bloom filter, one computes the
k hash function values for the element and sets the bits at the corresponding k positions to
1 in the bit vector. To test whether an element was inserted, one hashes the element with
these k hash functions and checks if all corresponding bits are set to 1, in which case the
element is said to be found in the filter.
The space savings of a Bloom filter is achieved at the cost of introducing false positives.
The false positive rate of a Bloom filter depends on the data reduction ratio it provides:
the greater the savings in storage, the greater the probability of a query returning a false
positive. A useful property of a Bloom filter is that it preserves privacy because it allows
only to ask whether a particular element was inserted into it, but it cannot be coerced into
revealing the list of elements stored. Compared to storing hashes directly, the advantage of
using Bloom filters is not only the space savings but also the speed of querying. It takes
only a short constant time to query the Bloom filter for any packet.
However, inserting the entire payload into a Bloom filter does not allow supporting
queries for payload excerpts. Instead of inserting the entire payload into the Bloom filter
one can partition it into blocks and insert them individually. In this way the system allows
queries for excerpts of the payload by checking if all the blocks of an excerpt are in the Bloom
filter. Besides reducing the storage requirements and achieving the lowest false positive rates
for individual excerpt blocks, other important challenges are to determine with high
accuracy where in the payload each block started (the alignment problem) and whether the
resulting blocks appeared consecutively in the same payload (the consecutiveness resolution
problem).
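To make the block-based query idea concrete, here is a minimal sketch of excerpt querying by block membership. It uses fixed-size blocks and a plain Python set as a stand-in for the Bloom filter; the block size, helper names, and fixed alignment are illustrative assumptions, and the sketch deliberately ignores the alignment and consecutiveness problems discussed above.

```python
BLOCK_SIZE = 64  # illustrative block size in bytes

def blocks(payload: bytes, block_size: int = BLOCK_SIZE):
    """Partition a payload into fixed-size, fixed-alignment blocks."""
    return [payload[i:i + block_size] for i in range(0, len(payload), block_size)]

def process_payload(digest: set, payload: bytes) -> None:
    """Insert each payload block into the digest (a set standing in for a Bloom filter)."""
    for block in blocks(payload):
        digest.add(block)

def query_excerpt(digest: set, excerpt: bytes) -> bool:
    """Report a possible match if every block of the excerpt is present in the digest.

    With a real Bloom filter, True may be a false positive while False is exact. This
    naive version assumes the excerpt starts on a block boundary (the alignment
    problem) and does not check that its blocks appeared consecutively in one
    payload (the consecutiveness resolution problem).
    """
    return all(block in digest for block in blocks(excerpt))
```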
In this dissertation we consider the case when a PAS processes a packet's payload by
partitioning the payload into blocks and then stores them in a Bloom filter. We describe
a whole suite of methods used for payload partitioning in Chapter 2. We show how each
payload partitioning method was derived, how it solves the alignment and consecutiveness
resolution problems, and how it impacts the PAS performance in terms of compression ratio
and false positive rates achieved.
1.3.2
Network Flow Data Storage Systems
Unlike network packet payload data, network flow data is highly structured and therefore can be stored in structured format in tables using a relational database as it is done
in [18]. In such a case, each flow record is stored as a row in a table having each flow attribute
in a column. Data in a relational database is manipulated and queried using standard SQL
commands. An efficient storage and querying infrastructure for network flow records has to
cope with two main technical challenges: keep the insertion rate high, and provide fast access to the desired flow records. As shown in [19] the query performance of using a RDBMS
is influenced by the decision to physically store the data row-by-row in a so called row-store,
or column-by-column in a column-store. The column oriented systems are proven to yield
better query runtime performance for analytical query workloads [55,60] and in this section
we provide a brief description of each approach as well as a short description of expected
queries on network flow data.
Row-Store RDBMS
When using a traditional row-oriented database, for each flow, the relevant attributes are
inserted as a row into a table as they are captured from the network, as shown in Figure 1.3.
Then, each flow's attributes are stored sequentially on disk, and flows are indexed using various
techniques [18] for the most accessed attributes. On one hand, such a system has to establish
a trade-off between the desired insertion rate and the storage and processing overhead
employed by the use of auxiliary indexing data structures. On the other hand, enabling
indexing for more attributes ultimately improves query performance but also increases the
storage requirements and decreases insertion rates. In general, when querying disk resident
data, an important problem to overcome is the I/O bottleneck caused by large disk to
memory data transfers. Having the flow records lying sequentially on disk, at query time, all
the columns of the table have to be loaded in memory even if only a subset of the attributes
are relevant for the query, adding a significant I/O penalty for the overall query processing
time by loading unused columns.
Figure 1.3: A row-store RDBMS table representation of the flow data.
Therefore, one potential solution would be to load only data that is relevant to the
query. For example, to answer the query "What is the list of all IPs that contacted IP
X between dates d1 and d2?", the system should load only the source and destination
IPs as well as the timestamps of the flows that fall between dates d1 and d2. The I/O
time can also be decreased if the accessed data is compressed since less data traverses the
disk-memory boundary. Further, the overall query response time can be improved if data is
processed in compressed format by saving decompression time. Finally, since the system has
to insert records at line speed, all the preprocessing algorithms used, such as compression,
sorting or indexing should add negligible overhead while writing to disk. Even though
for small amounts of network flow data the existing transactional database systems might
provide satisfactory performance [18], they fall short when inserting and querying network
data collected for prolonged periods of time. However, for large amounts of network flow
data the above requirements can be met quite well by utilizing a column oriented database
described below.
Column-Store RDBMS
When using a column oriented RDBMS, the flow attributes are also represented as rows
in a table at the logical level but they are stored as columns at the physical level. Each
column holds data for a single attribute of the flow and is stored sequentially on disk. A
simple graphical representation is shown in Figure 1.4.
Figure 1.4: A column-store RDBMS representation of the flow data, the top table shows
the data source of each column.
Such a strategy makes the system I/O efficient for read queries since only the required
attributes related to a query can be read from the disk. Moreover, having data of the same
type lying sequentially on disk creates opportunities for efficient compression methods that allow
processing data in compressed format (for example when using run-length encoding).
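The following sketch illustrates the contrast under discussion: the same flow records laid out row by row versus column by column, with a run-length-encoded protocol column and a query that touches only the columns it needs. The in-memory lists stand in for on-disk column files and all names are illustrative assumptions; this is not the NetStore implementation.

```python
# Row layout: one tuple per flow; a query must read every attribute of each row.
rows = [
    # (src_ip, dst_ip, protocol, bytes, start_time)
    ("10.0.0.1", "10.0.0.9", "TCP", 1500, 100),
    ("10.0.0.2", "10.0.0.9", "TCP", 700, 101),
    ("10.0.0.3", "10.0.0.8", "UDP", 80, 102),
]

# Column layout: one sequence per attribute; a query reads only the columns it uses.
columns = {
    "src_ip": ["10.0.0.1", "10.0.0.2", "10.0.0.3"],
    "dst_ip": ["10.0.0.9", "10.0.0.9", "10.0.0.8"],
    "protocol": [("TCP", 2), ("UDP", 1)],  # run-length encoded: (value, run length)
    "start_time": [100, 101, 102],
}

def ips_that_contacted(columns, target_ip, t1, t2):
    """Source IPs that contacted target_ip in [t1, t2], touching only three columns."""
    result = set()
    for i, ts in enumerate(columns["start_time"]):
        if t1 <= ts <= t2 and columns["dst_ip"][i] == target_ip:
            result.add(columns["src_ip"][i])
    return result

print(ips_that_contacted(columns, "10.0.0.9", 100, 102))  # {'10.0.0.1', '10.0.0.2'}
```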
The performance benefits of column partitioning were previously analyzed in [2, 25],
and some of the ideas were confirmed by the results in the databases academic research
community [1, 19, 55, 60] as well as in industry [13, 27, 28, 30, 58].
However, most commercial and open-source column stores were conceived to follow
general purpose RDBMS requirements, and do not fully use the semantics of the data
carried and do not take advantage of the specific types and data access patterns of network
forensic and monitoring queries. For example, network data is continuously inserted at
line speed in the storage systems using append-only operations, so there is no need to
support all the operations required by the transactional workloads. Moreover, forensics
and monitoring queries mostly use a time window associated with each query, access large
amounts of stored data once and don’t require individual updates or deletes for the stored
flow records unless all the flow archive is deleted. In contrast, the existing general purpose
column-stores use auxiliary data structures to reconstruct the original logical table in order
to support updates and deletes of individual records. By doing so, a significant overhead
is added for both insertion and query time. Therefore the major challenges of building an
efficient storage infrastructure for network flow records are to reuse the relevant features
of existing systems, to avoid their limitations and to incorporate the knowledge about new
data insertion and query workloads as early as possible in the design of the system.
In Chapter 3 we present the design, implementation details and the evaluation of a
column-oriented storage infrastructure for network records that, unlike the other systems, is
intended to provide good performance when using network records flow data for monitoring
and forensic analysis [19].
Queries on Network Flow Data
Forensics and monitoring queries on historical network flow data are becoming increasingly complex [32]. In general these queries are composed of many simple queries that
process both structured and user input data (for example lists of malicious hosts, restricted services, restricted port numbers, etc.). In the case of forensic analysis historical
network flow data is processed in several steps, queries being executed sequentially in a
drill-down manner, narrowing, reusing and correlating the subsequent intermediary results
throughout the analysis.
Existing SQL based querying technologies do not support unstructured user input and
store intermediary results in temporary files [17] and materialized views on disks [18] before
feeding data to future queries. Using this approach the query processing time is increased
due to multiple unnecessary I/O operations. In forensic analysis, references between subsequent query results are not trivial to represent and implement using standard SQL syntax
and semantics. For this task sophisticated stored procedures and languages are used instead. Moreover, when executing forensic queries sequentially, the query engine has the
opportunity to speedup the new queries by efficiently reusing the results of the already
evaluated predicates from the old queries.
For monitoring, network administrators run many simple queries at once, in batches,
in order to detect complex network behavior patterns, such as worm infections [32], and
display traffic summarization per hour, per day or other predefined time window [17, 39].
By submitting queries in batches all the simple queries can be executed in any order, and
some orders may result in better overall runtime performance. Additionally, it is expected
that some of the queries in the batch will use the same filtering predicates for some attributes
(known ports, IPs, etc). In this case the results from evaluating common predicates can be
shared by many queries, therefore saving execution time. Moreover, evaluating predicates
in a particular order across all the simple queries may result in less evaluation work for the
future predicates in the execution pipeline.
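As a rough illustration of predicate sharing, the sketch below caches the row set selected by each filtering predicate so that queries in a batch that reuse a predicate do not re-scan the column. The data layout and function names are illustrative assumptions and are far simpler than the batch execution algorithms presented in Chapter 4.

```python
def eval_predicate(column, op, value):
    """Return the set of row indices in `column` satisfying the predicate."""
    ops = {"=": lambda a, b: a == b, ">": lambda a, b: a > b, "<": lambda a, b: a < b}
    return {i for i, v in enumerate(column) if ops[op](v, value)}

def run_batch(columns, queries):
    """Run a batch of conjunctive filter queries, sharing common predicate results."""
    cache = {}  # (attribute, op, value) -> set of matching row indices
    results = []
    for predicates in queries:  # each query is a list of (attribute, op, value) predicates
        matching = None
        for pred in predicates:
            if pred not in cache:  # evaluate each distinct predicate only once per batch
                attr, op, value = pred
                cache[pred] = eval_predicate(columns[attr], op, value)
            matching = cache[pred] if matching is None else matching & cache[pred]
        results.append(matching or set())
    return results

columns = {"dst_port": [80, 22, 80, 443], "proto": ["TCP", "TCP", "UDP", "TCP"]}
batch = [
    [("dst_port", "=", 80), ("proto", "=", "TCP")],
    [("dst_port", "=", 80)],  # reuses the cached dst_port predicate
]
print(run_batch(columns, batch))  # [{0}, {0, 2}]
```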
In addition to the column oriented storage infrastructure described in Chapter 3, in this
dissertation we propose a complementary querying framework for historical network flow
data for monitoring and forensic queries in Chapter 4. The proposed querying framework
together with the column oriented storage system create a complete system for efficiently
storing and querying network flow data.
1.4
Summary and Dissertation Outline
There are four major components of this dissertation each presented sequentially in one
of the next chapters. As such, Chapter 2 defines the problem of payload attribution and
provides a comprehensive presentation of multiple payload processing methods that can be
used by a payload attribution system. For better understanding of the motivation for a new
method, the description of each method is introduced gradually by enhancing the previously
described methods with new features proven to eliminate some of the previous methods'
limitations. All the presented methods were tested as modules of a payload attribution
system using real data. The experiments show that the best method achieves data reduction
ratios greater than 100:1 while supporting efficient queries with very low false positive rates.
Chapter 3 examines the existing systems used for storing network flow data, outlines
the requirements of an efficient storage infrastructure for these data and presents the design
and implementation of a more efficient storage system. The proposed architecture is column
oriented and each column is partitioned into segments in order to allow data access at a smaller
granularity than a column. Each segment has associated with it a small metadata component
that stores statistics about data in each segment. Having data partitioned in this fashion
allows the use of compression methods that can be chosen dynamically for each segment
and Chapter 3 presents the compression selection mechanism as well as the compression
methods used. Additionally, Chapter 3 introduces a new indexing approach for faster
access to records corresponding to internal hosts of an organization by using the concept
of inverted index. The experiments using sample monitoring and forensics queries show
that an implemented instance of the proposed architecture, NetStore, performs better than
two of the existing open-source systems, the row-store PostgreSQL and the column-store
LucidDB.
Chapter 4 introduces the challenges associated with executing monitoring and forensic
queries. First, based on their complexity the queries are separated into simple and complex. The simple queries are assumed to use only simple filtering (<, ≤, >, ≥, =, ≠, etc.)
and aggregation (MIN, MAX, COUNT, SUM, AVG, etc.) operators while complex queries
can be composed of many simple queries as building blocks. The simple queries that make
up a complex query can be executed sequentially or in batches, each approach raising new
challenges and opportunities for optimization. Chapter 4 defines the optimization problems for both simple and complex queries and presents efficient algorithms for executing
these queries using data stored in a column-store. The experiments show that the proposed
optimization methods performed better than the naïve approaches in both cases. Additionally, Chapter 4 introduces a simple SQL extension for expressing monitoring and forensics
queries that need easy access to previous results (for example to refine results in investigations), simpler constructs to allow loading user data (for example lists of malicious IPs)
and importing network flow data from network sensors.
Lastly, Chapter 5 presents a discussion about the results and implications of the presented methods. It describes the possibility of using the storage and querying methods in
a more general context by varying the application domain. Finally, it enumerates the envisioned guidelines for future work taking into account the increasing popularity of parallel
processing and presents the dissertation conclusions.
In summary, this dissertation presents new methods for enhancing the performance of
storage and querying systems for network payload and flow data. It introduces new more
efficient payload attribution methods that can be successfully used to detect excerpts of
payload when using extremely compressed data with low false positive rates. It presents the
design and implementation details of a column oriented storage infrastructure for network
flow records and proposes an efficient querying framework for monitoring and forensics
queries using data stored in a column oriented system.
The dissertation makes the following main intellectual contributions:
• A detailed description of new methods to partition network packets payload used for
payload attribution.
• Implementation, analysis and comparison of payload attribution methods deployed as
modules of an existing payload attribution system using real network traffic data.
• Design of an efficient column oriented storage infrastructure that enables quick access
to large amounts of historical network flow data for monitoring and forensic analysis.
• Implementation and deployment of NetStore using commodity hardware and open
source software as well as analysis and comparison with other open source storage
systems used currently in practice.
• The design and implementation of an efficient query processing engine built on top of
a column oriented storage system used for network flow data.
• Query optimization methods for sequential and batch queries used in network monitoring and forensic analysis.
• Design of a simple SQL extension that allows simpler representation of forensic and
monitoring queries.
Chapter 2
New Payload Attribution Methods
2.1
Introduction
Cybercrime today is alive and well on the Internet and growing both in scope and sophistication [45]. Given the trends of increasing Internet usage by individuals and companies
alike and the numerous opportunities for anonymity and non-accountability of Internet use,
we expect this trend to continue for some time. While there is much excellent work going on
targeted at preventing cybercrime, unfortunately there is the parallel need to develop good
tools to aid law-enforcement or corporate security professionals in investigating committed
crimes. Identifying the sources and destinations of all packets that appeared on a network
and contained a certain excerpt of a payload, a process called payload attribution, can be an
extremely valuable tool in helping to determine the perpetrators or the victims of a network
event and to analyze security incidents in general [16, 29, 50, 54].
It is possible to collect full packet traces even with commodity hardware [4] but the storage and analysis of terabytes of such data from today’s high-speed networks is extremely
cumbersome. Supporting network forensics by simply capturing and logging raw network
traffic, however, is infeasible for anything but short periods of history. First, storage requirements limit the time over which the data can be archived (e.g., a 100 Mbit/s WAN
can fill up 1 TB in just one day) and it is a common practice to overwrite old data when
that limit is reached. Second, string matching over such massive amounts of data is very
time-consuming.
Recently, Shanmugasundaram et al. [49] presented an architecture for network forensics
in which payload attribution is a key component. They introduced the idea of using
Bloom filters to achieve a reduced size digest of the packet history that would support
queries about whether any packet containing a certain payload excerpt has been seen; the
reduction in data representation comes at the price of a manageable false positive rate in
the query results. Subsequently a different group has offered a variant technique for the
same problem [14].
Our contribution in this chapter is to present new methods for payload attribution that
have substantial performance improvements over these state-of-the-art payload attribution
systems. Our approach to payload attribution, which constitutes a crucial component of
a network forensic system, can be easily integrated into any existing network monitoring
system. The best of our methods allow data reduction ratios greater than 100:1 and achieve
very low overall false positive rates. With a data reduction ratio of 100:1 our best method
gives no false positive answers for query excerpt sizes of 250 bytes and longer; in contrast, the
prior best techniques had 100% false positive rate at that data reduction ratio and excerpt
size. The reduction in storage requirements makes it feasible to archive data taken over an
extended time period and query for events in a substantially distant past. Our methods
are capable of effectively querying for small excerpts of a payload but can also be extended
to handle excerpts that span several packets. The accuracy of attribution increases with
the length of the excerpt and the specificity of the query. Further, the collected payload
digests can be stored and queried by an untrusted party without disclosing any payload
information nor the query details.
This chapter is organized as follows. In the next section we review related prior work. In
Section 2.3 we provide a detailed design description of our payload attribution techniques,
with a particular focus on payload processing and querying. In Section 2.4 we discuss
several issues related to the implementation of these techniques in a full payload attribution
system. In Section 2.5 we present a performance comparison of the proposed methods and
quantitatively measure their effectiveness for multiple workloads. Finally, we conclude in
Section 2.6.
2.2
Related Work
When processing a packet payload by the methods described in Section 2.3, the overall
approach is to partition the payload into blocks and store them in a Bloom filter. In this
section we first give a short description of Bloom filters and introduce Rabin fingerprinting
and winnowing, which are techniques for block boundary selection. Thereafter we review
the work related to payload attribution systems.
2.2.1
Bloom Filters
Bloom filters [6] are space-efficient probabilistic data structures supporting membership
queries and are used in many network and other applications [10]. An empty Bloom filter
is a bit vector of m bits, all set to 0, that uses k different hash functions, each of which
maps a key value to one of the m positions in the vector. To insert an element into the
Bloom filter, we compute the k hash function values and set the bits at the corresponding
k positions to 1. To test whether an element was inserted, we hash the element with these
k hash functions and check if all corresponding bits are set to 1, in which case we say
the element is in the filter. The space savings of a Bloom filter is achieved at the cost of
introducing false positives; the greater the savings, the greater the probability of a query
returning a false positive. Equation 2.1 gives an approximation of the false positive rate α,
after n distinct elements were inserted into the Bloom filter [34]. More analysis reveals that
an optimal utilization of a Bloom filter is achieved when the number of hash functions, k,
equals (ln 2) · (m/n) and the probability of each bit of the Bloom filter being 0 is 1/2. In
practice, of course, k has to be an integer and smaller k is mostly preferred to reduce the
amount of necessary computation. Note also that while we use Bloom filters throughout
this chapter, all of our payload attribution techniques can be easily modified to use any data
structure which allows insertion and querying for strings with no changes to the structural
design and implementation of the attribution methods.
\alpha = \left(1 - \left(1 - \frac{1}{m}\right)^{kn}\right)^{k} \approx \left(1 - e^{-kn/m}\right)^{k} \qquad (2.1)
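To make the description above concrete, here is a minimal Bloom filter sketch with m bits and k hash functions, together with the false positive estimate of Equation 2.1. The double-hashing construction of the k hash functions is an implementation convenience assumed here, not something prescribed by the methods in this chapter.

```python
import hashlib
import math

class BloomFilter:
    def __init__(self, m: int, k: int):
        self.m, self.k = m, k
        self.bits = bytearray(m)  # one byte per bit, for simplicity

    def _positions(self, element: bytes):
        # Derive k positions from two independent digests (double hashing).
        h1 = int.from_bytes(hashlib.sha1(element).digest()[:8], "big")
        h2 = int.from_bytes(hashlib.md5(element).digest()[:8], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def insert(self, element: bytes) -> None:
        for p in self._positions(element):
            self.bits[p] = 1

    def query(self, element: bytes) -> bool:
        # True means "possibly inserted" (may be a false positive); False is exact.
        return all(self.bits[p] for p in self._positions(element))

def false_positive_rate(m: int, n: int, k: int) -> float:
    """Approximation from Equation 2.1: alpha ~ (1 - e^(-kn/m))^k."""
    return (1.0 - math.exp(-k * n / m)) ** k

bf = BloomFilter(m=8 * 1024, k=6)
bf.insert(b"GET /index.html")
print(bf.query(b"GET /index.html"), bf.query(b"something else"))
print(round(false_positive_rate(m=8 * 1024, n=1000, k=6), 4))
```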
2.2.2
Rabin Fingerprinting
Fingerprints are short checksums of strings with the property that the probability of two
different objects having the same fingerprint is very small. Rabin defined a fingerprinting
scheme [43] for binary strings based on polynomials in the following way. We associate
a polynomial S(x) of degree N − 1 with coefficients in Z_2 with every binary string
S = (s_1, ..., s_N), for N ≥ 1:

S(x) = s_1 x^{N-1} + s_2 x^{N-2} + \cdots + s_N. \qquad (2.2)
Then we take a fixed irreducible polynomial P(x) of degree K over Z_2 and define the
fingerprint of S to be the polynomial f(S) = S(x) mod P(x).
This scheme, only slightly modified, has found several applications [8], for example, in
defining block boundaries for identifying similar files [31] and for web caching [44]. We
derive a fingerprinting scheme for payload content based on Rabin's scheme in Section 2.3.3 and use it to pick content-dependent boundaries for a priori unknown substrings
of a payload. For details on the applications, properties and implementation issues of
Rabin's scheme one can refer to [8].
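The sketch below shows how a rolling Rabin-style fingerprint can pick content-dependent block boundaries: a boundary is declared wherever the fingerprint of the last few bytes is 0 modulo a chosen divisor. It uses a simple polynomial rolling hash over bytes rather than Rabin's irreducible-polynomial arithmetic, and the window size and divisor are illustrative parameters, so treat it as an approximation of the idea rather than the scheme of Section 2.3.3.

```python
def content_defined_boundaries(payload: bytes, window: int = 16, divisor: int = 64,
                               base: int = 257, modulus: int = (1 << 31) - 1):
    """Return offsets where a block boundary is placed (rolling hash == 0 mod divisor)."""
    boundaries = []
    fp = 0
    # Precompute base^(window-1) for removing the outgoing byte from the rolling hash.
    pow_out = pow(base, window - 1, modulus)
    for i, byte in enumerate(payload):
        fp = (fp * base + byte) % modulus
        if i >= window:
            # Slide the window: drop the contribution of the byte leaving the window.
            fp = (fp - payload[i - window] * pow_out * base) % modulus
        if i + 1 >= window and fp % divisor == 0:
            boundaries.append(i + 1)  # a block ends right after position i
    return boundaries

data = bytes(range(256)) * 8  # stand-in for a packet payload
print(content_defined_boundaries(data)[:5])
```

Because the boundaries depend only on local content, the same substring produces the same block cuts regardless of where it appears in a payload, which is exactly what the variable-block methods later in this chapter rely on.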
2.2.3
Winnowing
Winnowing [47] is an efficient fingerprinting algorithm enabling accurate detection of
full and partial copies between documents. It works as follows: For each sequence of ν
consecutive characters in a document, we compute its hash value and store it in an array.
Thus, the first item in the array is a hash of c_1 c_2 ... c_ν, the second item is a hash of
c_2 c_3 ... c_{ν+1}, etc., where c_i are the characters in the document of size Ω bytes, for i =
1, ..., Ω. We then slide a window of size w through the array of hashes and select the
minimum hash within each window. If there are more hashes with the minimum value, we
choose the rightmost one. These selected hashes form the fingerprint of the document. It is
shown in [47] that fingerprints selected by winnowing are better for document fingerprinting
than the subset of Rabin fingerprints which contains hashes equal to 0 mod p, for some
fixed p, because winnowing guarantees that in any window of size w there is at least one
hash selected. We will use this idea to select boundaries for blocks in packet payloads in
Section 2.3.5.
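As a rough sketch of the procedure just described, the code below hashes every ν-gram of a document, slides a window of w consecutive hashes, and keeps the minimum hash of each window (the rightmost one on ties). The hash function and parameter values are illustrative assumptions; the methods of Section 2.3.5 apply the same selection rule to packet payloads.

```python
import hashlib

def winnow(document: bytes, nu: int = 4, w: int = 8):
    """Return the set of (position, hash) fingerprints selected by winnowing."""
    # Hash every nu-gram of the document.
    grams = [int.from_bytes(hashlib.sha1(document[i:i + nu]).digest()[:4], "big")
             for i in range(len(document) - nu + 1)]
    fingerprints = set()
    # Slide a window of w consecutive hashes and keep the minimum of each window,
    # choosing the rightmost occurrence when the minimum appears more than once.
    for start in range(len(grams) - w + 1):
        window = grams[start:start + w]
        best = min(window)
        pos = start + max(i for i, h in enumerate(window) if h == best)
        fingerprints.add((pos, best))
    return fingerprints

print(sorted(winnow(b"the quick brown fox jumps over the lazy dog"))[:3])
```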
2.2.4
Attribution Systems
There has been a major research effort over the last several years to design and implement feasible network traffic traceback systems, which identify the machines that directly
generated certain malicious traffic and the network path this traffic subsequently followed.
These approaches, however, restrict the queries to network floods, connection chains, or the
entire payload of a single packet in the best case.
The Source Path Isolation Engine (SPIE) [53] is a hash-based technique for IP traceback
that generates audit trails for traffic within a network. SPIE creates hash-digests of packets
based on the packet header and a payload fragment and stores them in a Bloom filter in
routers. SPIE uses these audit trails to trace the origin of any single packet delivered by the
network in the recent past. The router creates a packet digest for every forwarded packet
using the packet’s non-mutable header fields and a short prefix of the payload, and stores it
in a Bloom filter for a predefined time. Upon detection of a malicious attack by an intrusion
detection system, SPIE can be used to trace the packet’s attack path back to the source by
querying SPIE devices along the path.
In many cases, an investigator may not have any header information about a packet of
interest but may know some excerpt of the payload of the packets she wishes to see. Designing techniques for this problem that achieve significant data reduction (compared to storing
raw packets) is a much greater challenge; the entire packet payload is much larger than
the information hashed by SPIE; in addition we need to store information about numerous
substrings of the payload to support queries about excerpts. Shanmugasundaram et al. [49]
introduced the Hierarchical Bloom Filter (HBF), a compact hash-based payload digesting
data structure, which we describe in Section 2.3.1. A payload attribution system based on
an HBF is a key module for a distributed system for network forensics called ForNet [50].
The system has both low memory footprint and achieves a reasonable processing speed at
a low false positive rate. It monitors network traffic, creates hash-based digests of payload,
and archives them periodically. A user-friendly query mechanism based on XML provides
an interface to answer postmortem questions about network traffic. SPIE and HBF are
both digesting schemes, but while SPIE is a packet digesting scheme, HBF is a payload
digesting scheme. With an HBF module running in ForNet (or a module using any of our
methods presented in this chapter), one can query for substrings of the payload (called
excerpts throughout this dissertation).
Recently, another group suggested an alternative approach to the payload attribution
problem, the Rolling Bloom Filter (RBF) [14], which uses packet content fingerprints based
on a generalization of the Rabin-Karp string-matching algorithm. Instead of aggregating
queries in a hierarchy as an HBF, they aggregate query results linearly from multiple Bloom
filters. The RBF tries to solve the problem of finding a correct alignment of blocks when
querying an HBF by considering many possible alignments at once, i.e., the RBF rolls a
fixed-size window over the packet payload and records all the window positions as payload
blocks. They report performance similar to the best case performance
of the HBF.
The design of an HBF is well documented in the literature and currently used in practice.
We created our implementation of the HBF as an example of a current payload attribution method and include it in our comparisons in Section 2.5. The RBF’s performance is
comparable to that of the HBF and experimental results presented in [14] show that RBF
achieves low false positive rates only for small data reduction ratios (about 32:1).
2.3 Methods for Payload Attribution
In this section we introduce various data structures for payload attribution. Our primary goal is to find techniques that give the best data reduction for payload fragments of
significant size at reasonable computational cost. When viewed through this lens, roughly
speaking a technique that we call Winnowing Multi-Hashing (WMH) is the best and substantially outperforms previous methods; a thorough experimental evaluation is presented
in Section 2.5.
Our exposition of WMH will develop gradually, starting with naïve approaches to the
problem, building through previous work (HBF), and introducing a variety of new techniques.
Our purpose is twofold. First, this exposition should develop a solid intuition
for the reader as to the various considerations that were taken into account in developing
WMH. Second, and equally important, there are a variety of lenses through which one may
consider and evaluate the different techniques. For example, one may need to perform less
computation and be unable to utilize data aging techniques, and as a result opt for a method
such as Winnowing Block Shingling (WBS), which is more appropriate than WMH under
those circumstances. Additionally, some applications may have specific requirements on the
block size and therefore prefer a different method. By carefully developing and evaluating
experimentally the different methods, we present the reader with a spectrum of possibilities
and a clear understanding of which to use when.
As noted earlier, all of these methods follow the general program of dividing packet
payloads into blocks and inserting them into a Bloom filter. They differ in how the blocks
are chosen, what methods we use to determine which blocks belong to which payload in
which order (“consecutiveness resolution”), and miscellaneous other techniques used to
reduce the number of necessary queries and the probability of false positives.
We first describe the basics of block based payload attribution and the Hierarchical Bloom
Filter [49] as the current state-of-the-art method. We then propose several new methods
which solve multiple problems in the design of the former methods.
A naïve method to design a simple payload attribution system is to store the payload
of all packets. In order to decrease the demand for storage capacity and to provide some
privacy guarantees, we can store hashes of payloads instead of the actual payloads. This
approach reduces the amount of data per packet to about 20 bytes (by using SHA-1, for
example) at the cost of false positives due to hash collisions. By storing payloads in a
Bloom filter (described in Section 2.2.1), we can further reduce the required space. The
false positive rate of a Bloom filter depends on the data reduction ratio it provides. A
Bloom filter preserves privacy because we can only ask whether a particular element was
inserted into it, but it cannot be coerced into revealing the list of elements stored; even if
we try to query for all possible elements, the result will be useless due to false positives.
Compared to storing hashes directly, the advantage of using Bloom filters is not only the
space savings but also the speed of querying. It takes only a short constant time to query
the Bloom filter for any packet.
Inserting the entire payload into a Bloom filter, however, does not allow supporting
queries for payload excerpts. Instead of inserting the entire payload into the Bloom filter
we can partition it into blocks and insert them individually. This simple modification can
allow queries for excerpts of the payload by checking if all the blocks of an excerpt are in
the Bloom filter. Yet, we need to determine whether two blocks appeared consecutively
in the same payload, or if their presence is just an artifact of the blocking scheme. The
methods presented in this section deal with this problem by using offset numbers or block
overlaps. The simplest data structure that uses a Bloom filter and partitions payloads into
blocks with offsets is a Block-based Bloom Filter (BBF) [49].
Note that, assuming that we do one decomposition of the payload into blocks during
payload processing, starting at the beginning of the packet, we will need to query the data
structure for multiple starting positions of our excerpt in the payload during the excerpt
querying phase, as the excerpt need not start at the beginning of a block. For example, if the
payload being partitioned with a block size of 4 bytes was ABCDEFGHIJ, we would insert
blocks ABCD and EFGH into the Bloom filter (the remainder IJ is not long enough to form
a block and is therefore not processed). Later on, when we query for an excerpt, for example,
DEFGHI, we would partition the excerpt into blocks (with a block size of 4 bytes as done
previously on the original payload). This would give us just one block to query the Bloom
filter for, DEFG. However, because we do not know where the excerpt could be located
within the payload, we also need to try partitioning the excerpt from starting position offsets
1 and 2, which gives us blocks EFGH and FGHI, respectively. We are then guaranteed that
the Bloom filter answers positively for the correct block EFGH; however, we can also get
positive answers for blocks DEFG and FGHI due to false positives of the Bloom filter. The
payload attribution methods presented in this section try to limit or completely eliminate
(see Section 2.3.3) this negative effect. Alternative payload processing schemes, such as [14, 21],
perform partitioning of the payload at all possible starting offsets during the payload
processing phase (which is basically similar to working on all n-grams of the payload), but
this incurs a large overhead in processing speed and also multiplies the storage requirements.
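The ABCDEFGHIJ example above can be made concrete with the following sketch, in which a plain Python set stands in for the Bloom filter so that the blocking and alignment logic stays visible; a real BBF would insert the same (block, offset) combinations into a Bloom filter, and the assumed bound on offsets corresponds roughly to the MTU divided by s.

```python
def bbf_insert(store: set, payload: bytes, s: int) -> None:
    for offset in range(len(payload) // s):
        store.add((payload[offset * s:(offset + 1) * s], offset))   # remainder dropped

def bbf_query(store: set, excerpt: bytes, s: int, max_offset: int = 375) -> bool:
    # max_offset is an assumed bound, roughly MTU / s for 1500-byte packets.
    for align in range(s):                                # s possible alignments
        chunk = excerpt[align:]
        blocks = [chunk[i * s:(i + 1) * s] for i in range(len(chunk) // s)]
        if not blocks:
            continue
        for start in range(max_offset):                   # all possible starting offsets
            if all((b, start + i) in store for i, b in enumerate(blocks)):
                return True
    return False

store = set()
bbf_insert(store, b"ABCDEFGHIJ", s=4)      # inserts (ABCD, 0) and (EFGH, 1)
print(bbf_query(store, b"DEFGHI", s=4))    # True: EFGH found at alignment 1, offset 1
```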
We also need to set two parameters which determine the time precision of our answers
and the smallest query excerpt size. First, we want to be able to attribute to each excerpt
for which we query the time when the packet containing it appeared on the network. We
solve that by having multiple Bloom filters, one for each time interval. The duration of
each interval depends on the number of blocks inserted into the Bloom filter. In order to
guarantee an upper bound on the false positive rate, we replace the Bloom filter by a new
one and store the previous Bloom filter in a permanent storage after a certain number of
elements are inserted into it. There is also an upper bound on the maximum length of one
interval to limit the coarseness of the time determination. Second, we specify the size of blocks.
If the chosen block size is too small, we get too many collisions as there are not enough
unique patterns and the Bloom filter gets filled quickly by many blocks. If the block size is
too large, there is not enough granularity to answer queries for smaller excerpts.
We need to distinguish blocks from different packets to be able to answer who has
sent/received the packet. The BBF as briefly described above is not able to recognize the
origins and destinations of packets. In order to work properly as an attribution system over
multiple packets, a unique flow identifier (flowID) must be associated with each block before
insertion into the Bloom filter. A flow identifier can be the concatenation of source and destination IP addresses, optionally with source/destination port numbers. We maintain a list
(or a more efficient data structure) of flowIDs for each Bloom filter and our data reduction
estimates include the storage required for this list. The connection records (flowIDs) for
each Bloom filter (i.e., a time interval) can be alternatively obtained from other modules
monitoring the network. The need to test all the flowIDs in the list significantly increases
the number of queries required for the attribution, since the flowID of the packet that contained the query excerpt is not known a priori; this leads to a higher false positive rate and
decreases the overall query performance. Therefore, we may either maintain two separate
Bloom filters to answer queries, one into which we insert blocks only and one with blocks
concatenated with the corresponding flowIDs, or insert both into one larger Bloom filter.
The former allows data aging, i.e., for very old data we can delete the first Bloom filter and
store only the one with flowIDs at the cost of higher false positive rate and slower querying.
Another method to save storage space by reducing the size taken by very old data is to take
a Bloom filter of size 2b and replace it with a new Bloom filter of size b obtained by computing
the logical OR of the two halves of the original Bloom filter. This halves the amount
of data and still allows querying, but the false positive rate increases significantly.
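A minimal sketch of this folding step, assuming the filter is kept as one byte per bit purely for clarity, is:

```python
def fold_bloom_filter(bits: bytearray) -> bytearray:
    """Replace an aging Bloom filter of size 2b by one of size b: OR the two halves.
    After folding, a query reduces each bit index modulo the new size b."""
    b = len(bits) // 2
    return bytearray(bits[i] | bits[b + i] for i in range(b))
```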
An alternative construction which allows the determination of source/destination pairs
is using separate Bloom filters for each flow. Then instead of using one Bloom filter and
inserting blocks concatenated with flowIDs, we just select a Bloom filter for the insertion of
blocks based on the flowID. Because we cannot anticipate the number of blocks each flow
would contain during a time interval, we use small Bloom filters, flush them to disk more
often and use additional compression (such as gzip) on the Bloom filters before saving to
disk which helps to significantly reduce storage requirements for very sparse flows. Having
multiple small Bloom filters also has some performance advantages compared to one large
Bloom filter because of caching; the size of a Bloom filter can be selected to fit into a
memory cache block. This technique would most likely use TCP stream reconstruction
and make the packet processing stateful compared to the method using flowIDs. It may
thus be suitable when there is another module in the system, such as an intrusion detection
(or prevention) system, which already does the stream reconstruction and to which a PAS
module can be attached. If this technique were used, the evaluation of the methods would be
extremely dependent on the distribution of payload among the streams. We do not take
flowIDs into further consideration in method descriptions throughout this section for clarity
of explanation.
We have identified several important data structure properties of methods presented in
this chapter and a summary can be found in Table 2.1. Figure 2.1 shows a tree structure
representing the evolution of methods. For example, a VHBF method was derived from
HBF by the use of variable-sized blocks. These properties are thoroughly
explained within the description of the method in which they first appear. Their impact on
performance is discussed in Section 2.5.
There are many possible combinations of the techniques presented in this chapter and
the following list of methods is not a complete enumeration of all combinations. For example, a method
which builds a hierarchy of blocks with winnowing as a boundary selection technique can be
developed. However, the selected subset provides enough details for a reader to construct
and analyze the other alternative methods; having experimented with them, we believe
the presented subset is sufficient for selecting the most suitable one, which is, in the general
case, the Winnowing Multi-Hashing technique (Section 2.3.12).
In all the methods in this section we can extend the answer from the simple yes/no
(meaning that there was/wasn’t a packet containing the specified excerpt in a specified
time interval and if yes providing also the list of flowIDs of packets that contained the
excerpt) to give additional details about which parts of the excerpt were found (i.e., blocks)
and return, for instance, the longest continuous part of the excerpt that was found.
Table 2.1: Summary of properties of methods from Section 2.3. We show how each method
selects boundaries of blocks when processing a payload and how it affects the block size, how
each method resolves the consecutiveness of blocks, its special characteristics, and finally,
whether each method allows false negative and N/A answers to excerpt queries.
2.3.1 Hierarchical Bloom Filter (HBF)
This subsection describes the (former) state-of-the-art payload attribution technique,
called an HBF [49], in detail and extends the description of previous work from Section 2.2.
The following eleven subsections, each showing a new technique, represent our novel contribution.
An HBF supports queries for excerpts of a payload by dividing the payload of each
packet into a set of blocks of fixed size s bytes (where s is a parameter specified by the
system administrator1 ). The blocks of a payload of a single packet form a hierarchy (see
Figure 2.2) which is inserted into a Bloom filter with appropriate offset numbers. Thus,
besides inserting all blocks of a payload as in the BBF, we insert several super-blocks, i.e.,
blocks created by the concatenation of 2, 4, 8, etc., subsequent blocks into the HBF. This
produces the same result as having multiple BBFs with block sizes multiplied by powers of
two. And a BBF can be looked upon as the base level of the hierarchy in an HBF.
1 The block size used for several years by an HBF-enabled system running in our campus network is
64 and 32 bytes, respectively, depending on whether deployed on the main gateway or smaller local ones.
Longer blocks allow higher data reduction ratios but lower the querying capability for smaller excerpts.
Figure 2.1: The evolution tree shows the relationship among presented methods for payload
attribution. Arrow captions describe the modifications made to the parent method.
Figure 2.2: Processing of a payload consisting of blocks X0 X1 X2 X3 X4 X5 in a Hierarchical
Bloom Filter.
When processing a payload, we start at the level 0 of the hierarchy by inserting all
blocks of size s bytes. In the next level we double the size of a block and insert all blocks of
size 2s. In the n-th level we insert blocks of size 2^n s bytes. We continue until the block size
exceeds the payload size. The total number of blocks inserted into an HBF for a payload of
size p bytes is Σ_l ⌊p/(2^l s)⌋, where l is the level index s.t. 0 ≤ l ≤ ⌊log_2(p/s)⌋. Therefore, an
HBF needs about two times as much storage space compared to a BBF to achieve the same
theoretical false positive rate of a Bloom filter, because the number of elements inserted
into the Bloom filter is twice as high. However, for longer excerpts the hierarchy improves
the confidence of the query results because they are assembled from the results for multiple
levels.
We use one Bloom filter to store blocks from all levels of the hierarchy to improve
space utilization because the number of blocks inserted into Bloom filters at different levels
depends on the distribution of payload sizes and is therefore dynamic. The utilization of
this single Bloom filter is easy to control by limiting the number of inserted elements, thus
we can limit the (theoretical) false positive rate.
Offset numbers are the sequence numbers of blocks within the payload. Offsets are
appended to block contents before insertion into an HBF: (content||offset), where 0 ≤
offset ≤ ⌊p/(2^l s)⌋ − 1, p is the size of the entire payload and l is the level of hierarchy.
Offset numbers are unique within one level of the hierarchy. See the example given in
Fig. 2.2. We first insert all blocks of size s with the appropriate offsets: (X0 ||0), (X1 ||1),
(X2 ||2), (X3 ||3), (X4 ||4). Then we insert blocks at level 1 of the hierarchy: (X0 X1 ||0),
(X2 X3 ||1). And finally the second level: (X0 X1 X2 X3 ||0).
Figure 2.3: The hierarchy in the HBF does not cover double-blocks at odd offset numbers.
In this example, we assume that two payloads X0 X1 X2 X3 and Y0 Y1 Y2 Y3 were processed by
the HBF. If we query for an excerpt X1 Y2 , we would get a positive answer which represents
an offset collision, because there were two blocks (X1 ||1) and (Y2 ||2) inserted from different
payloads but there was no packet containing X1 Y2 .
Note that in Figure 2.2 blocks X0 to X4 have size s bytes, but since block X5 has size
smaller than s it does not form a block and its content is not being processed. We analyze
the percentage of discarded payload content for each method in Section 2.5.
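The enumeration of inserted elements for the example of Figure 2.2 can be sketched as follows; this is a simplified illustration that ignores flowIDs and simply returns each block together with its level and offset, which in the actual HBF are appended to the block content before insertion:

```python
def hbf_elements(payload: bytes, s: int):
    """Enumerate what an HBF inserts: at level l, blocks of size 2^l * s together with
    their offset at that level (the real HBF appends the offset, and the flowID, to the
    block content before hashing it into the Bloom filter)."""
    elements, level = [], 0
    while (1 << level) * s <= len(payload):
        size = (1 << level) * s
        for offset in range(len(payload) // size):
            elements.append((payload[offset * size:(offset + 1) * size], level, offset))
        level += 1
    return elements

# Mirroring Figure 2.2: five full blocks X0..X4 plus a too-short remainder.
for content, level, offset in hbf_elements(b"X0X1X2X3X4r", s=2):
    print(level, offset, content)
```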
Offsets do not provide a reliable solution to the problem of detecting whether two blocks
appeared in the same packet consecutively. For example, in a BBF if we process two packets
made up of blocks X0 X1 X2 X3 and Y0 Y1 Y2 Y3 Y4 , respectively, and later query for an excerpt
X2 Y3 , the BBF will answer that it had seen a packet with a payload containing such an
excerpt. We call this event an offset collision. This happens because of inserting a block
X2 with an offset 2 from the first packet and a block Y3 with an offset 3 from the second
packet into the BBF. When blocks from different packets are inserted at the appropriate
offsets, a BBF can answer as if they occurred inside a single packet. An HBF reduces the
false positive rate due to offset collisions and due to the inherent false positives of a Bloom
filter by adding supplementary checks when querying for an excerpt composed of multiple
blocks. In this example, an HBF would answer correctly that it did not see such excerpt
because the check for X2 Y3 in the next level of the hierarchy fails. However, if we query for
an excerpt X1 Y2 , both HBF and BBF fail to answer correctly (i.e., they answer positively as
if there was a packet containing X1 Y2 ). Figure 2.3 provides an example of how the hierarchy
tries to improve the resistance to offset collisions but still fails for two-block strings at odd
offsets. We discuss offset collisions in an HBF further in Section 2.3.8. Because in the
actual payload attribution system we insert blocks along with their flowIDs, collisions are
less common, but they can still occur for payload inside one stream of packets within one
time interval as these blocks have the same flowID and are stored in one Bloom filter.
Querying an HBF for an excerpt x starts with the same procedure as querying a BBF.
First, we have to try all possible offsets, where x could have occurred inside one packet.
We also have to try s possible starting positions of the first block inside x since the excerpt
may not start exactly on a block boundary of the original payload. To do this, we slide a
window of size s starting at each of the first s positions of x and query the HBF for this
window (with all possible starting offsets). After a match is found for this first block, the
query proceeds to try the next block at the next offset until all blocks of an excerpt at level
0 are matched. An HBF continues by querying the next level for super-blocks of size twice
the size of blocks in the previous level. Super-blocks start only at blocks from the previous
level which have even offset numbers. We go up in the hierarchy until all queries for all
levels succeed. The answer to an excerpt query is positive only if all answers from all levels
of the hierarchy were positive. The maximum number of queries to a Bloom filter in an
HBF in the worst case is roughly twice the number for a BBF.
2.3.2 Fixed Block Shingling (FBS)
In a BBF and an HBF we use offsets to determine whether blocks appeared consecutively
inside one packet’s payload. This causes a problem when querying for an excerpt because
we do not know where the excerpt starts inside the payload (the starting offset is unknown).
We have to try all possible starting offsets, which not only slows down the query process,
but also increases the false positive rate because a false positive result may occur for any
of these queries.
As an alternative to using offsets we can use block overlapping, which we call shingling.
In this scheme, the payload of a packet is divided into blocks of size s bytes as in a BBF,
but instead of inserting these blocks we insert strings of size s + o bytes (the block plus
a part of the next block) into the Bloom filter. Blocks overlap as do shingles on the roof
(see Figure 2.4) and the overlapping part assures that it is likely that two blocks appeared
consecutively if they share a common part and both of them are in the Bloom filter. For a
payload of size p bytes, the number of elements inserted into the Bloom filter is ⌊(p − o)/s⌋
for a FBS which is close to ⌊p/s⌋ for a BBF. However, the maximum number of queries to a
Bloom filter in the worst case is about v times smaller than in a BBF, where v is the number
of possible starting offsets. Since the value of v can be estimated as the system’s maximum
transmission unit (MTU) divided by the block size s, this improvement is significant, which
is supported by the experimental results presented in Section 2.5.
Figure 2.4: Processing of a payload with a Fixed Block Shingling (FBS) method (parameters: block size s = 8, overlap size o = 2).
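The FBS blocking step of Figure 2.4 can be sketched as follows (an illustrative fragment using the figure's parameters, not the implementation evaluated in Section 2.5):

```python
def fbs_blocks(payload: bytes, s: int, o: int):
    """Fixed Block Shingling: blocks of size s, each extended by the first o bytes of
    the following block, so that consecutive inserted strings overlap like shingles."""
    blocks, i = [], 0
    while i + s + o <= len(payload):
        blocks.append(payload[i:i + s + o])
        i += s
    return blocks

print(fbs_blocks(b"ABCDEFGHIJKLMNOPQRST", s=8, o=2))
# [b'ABCDEFGHIJ', b'IJKLMNOPQR'] : roughly (p - o) / s elements for a payload of p bytes
```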
The goal of the FBS scheme (of using an overlapping part) is to avoid trying all possible
offsets during query processing in an HBF to solve the consecutiveness problem. However,
both these techniques (shingling and offsets) are not guaranteed to answer correctly, for an
HBF because of offset collisions and for FBS because multiple blocks can start with the
same string and a FBS then confuses their position inside the payload (see Figure 2.5).
Thus both can increase the number of false positive answers. For example, the FBS will
incorrectly answer that it has seen a string of blocks X0 X1 Y1 Y2 after processing two packets
X and Y made of blocks X0 X1 X2 X3 X4 and Y0 Y1 Y2 , respectively, where X2 has the same
prefix (of size at least o bytes) as Y1 .
Figure 2.5: An example of a collision due to a shingling failure. The same prefix prevented
a FBS method from determining whether two blocks appeared consecutively within one
payload. The FBS method incorrectly treats the string of blocks X0 X1 Y1 Y2 as if it was
processed inside one payload.
Querying a Bloom filter in the FBS scheme is similar to querying a BBF except that we
do not use any offsets and therefore do not have to try all possible offset positions of the
first block of an excerpt. Thus when querying for an excerpt x we slide a window of size
s + o bytes starting at each of the first s positions of x and query the Bloom filter for this
window. When a match is found for this first block, the query can proceed with the next
block (including the overlap) until all blocks of an excerpt are matched. Since these blocks
overlap we assume that they occurred consecutively inside one single payload. The answer
to an excerpt query is considered to be positive only if there exists an alignment (i.e., a
position of the first block’s boundary) for which all tested blocks were found in the Bloom
filter. Figures 2.6 and 2.7 show examples of querying in the FBS method. Note that these
examples ignore that we need to determine all flowIDs of the excerpts found. Therefore
even after a match was found for some alignment and a flowID we shall continue to check
other alignments and flowIDs because multiple packets in multiple flows could contain such
an excerpt.
Figure 2.6: An example of querying a FBS method (with a block size 8 bytes and an overlap
size 2 bytes). Different alignments of the first block of the query excerpt (shown on top)
are tested. When a match is found in the Bloom filter for some alignment of the first block
we try subsequent blocks. In this example all blocks for the alignment starting at the third
byte are found and therefore the query substring (at the bottom) is reported as found. We
assume that the FBS processed the packet in Fig. 2.4 prior to querying.
2.3.3 Variable Block Shingling (VBS)
The use of shingling instead of offsets in a FBS method lets us avoid testing all possible
offset numbers of the first block during querying, but we still have to test all possible
alignments of the first block inside an excerpt (as shown in Fig. 2.6 and 2.7). A Variable
Block Shingling (VBS) solves this problem by setting block boundaries based on the payload
itself.
We slide a window of size k bytes through the whole payload and for each position of
the window we compute a value of function H(c1 , . . . , ck ) on the byte values of the payload.
When H(c1 , . . . , ck ) mod m is equal to zero, we insert a block boundary immediately after
the current position of byte ck . Note that we can choose to put a block boundary before
or after any of the bytes ci , 1 ≤ i ≤ k, but this selection has to be fixed. Note that for
use with shingling it is better to put the boundary after the byte ck such that the overlaps
are not restricted to strings having only special values which satisfy the above condition for
boundary insertion (which can increase shingle collisions). When the function H is random
and uniform, the parameter m sets the expected size of a block. For random payloads
we will get a distribution of block sizes with an average size of m bytes. This variable block
size technique’s drawback is that we can get many very small blocks, which can flood the
Bloom filter, or some large blocks, which prevent us from querying for smaller excerpts.
Therefore, we introduce an enhanced version of this scheme, EVBS, in the next section.
Figure 2.7: An example of querying a FBS method for an excerpt which is not supposed to
be found (i.e., no packet containing such string has been processed). The query processing
starts by testing the Bloom filter for the presence of the first block of the query excerpt at
different alignment positions. For alignment 2 the first block is found because we assume
that the FBS processed the packet in Fig. 2.4 prior to executing this query. The second
block for this alignment has been found too due to a false positive answer of a Bloom filter.
The third block for this alignment has not been found and therefore we continue by testing the
first block at alignment 3. As there was no alignment for which all blocks were found, we
report that the query excerpt was not found.
In order to save computational resources it is convenient to use a function that can use
the computations performed for the previous positions of the window to calculate a new
value as we move from bytes c1 , . . . , ck to c2 , . . . , ck+1 . Rabin fingerprints (see Section 2.2.2)
have such iterative property and we define a fingerprint F of a substring c1 c2 . . . ck , where
ci is the value of the i-th byte of the substring of a payload, as:
F (c1 , . . . , ck ) = (c1 pk−1 + c2 pk−2 + · · · + ck ) mod M,
(2.3)
where p is a fixed prime number and M is a constant. To compute the fingerprint of substring
c2 . . . ck+1, we need only to add the last element and remove the first one:
F(c2, . . . , ck+1) = (p F(c1, . . . , ck) + ck+1 − c1 p^k) mod M.        (2.4)
Because p and k are fixed, we can precompute the value of p^{k−1}. It is also possible to
use Rabin fingerprints as hash functions in the Bloom filter. In our implementation we use
a modified scheme [9] to increase randomness without any additional computational costs:
F(c2, . . . , ck+1) = (p (F(c1, . . . , ck) + ck+1 − c1 p^k)) mod M.        (2.5)
Figure 2.8: Processing of a payload with a Variable Block Shingling (VBS) method.
The advantage of picking block boundaries using Rabin functions is that when we get
an excerpt of a payload and divide it into blocks using the same Rabin function that we
used for splitting during the processing of the payload, we will get exactly the same blocks.
Thus, we do not have to try all possible alignments of the first block of a query excerpt as
in previous methods.
The rest of this method is similar to the FBS scheme where instead of using fixed-size
blocks we have variable-size blocks depending on the payload. To process a payload we
slide a window of size k bytes through the whole payload. For each of its positions we check
whether the value of F modulo m is zero and, if so, we set a new block boundary. All blocks
are inserted with the overlap of o bytes as shown in Figure 2.8.
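The boundary selection just described can be sketched as follows; the prime p = 31 and modulus M = 2^31 − 1 are illustrative choices, and the sketch uses the basic update of Equation (2.4) rather than the modified scheme of Equation (2.5):

```python
def vbs_boundaries(payload: bytes, k: int, m: int, p: int = 31, M: int = 2**31 - 1):
    """Content-defined cut points: slide a k-byte window, maintain the fingerprint with
    the iterative update of Equation (2.4), and cut right after byte ck whenever
    F mod m == 0.  p and M are illustrative choices of the prime and the modulus."""
    if len(payload) < k:
        return []
    pk = pow(p, k, M)                      # p^k mod M, precomputed once
    F = 0
    for c in payload[:k]:                  # fingerprint of the first window (Equation 2.3)
        F = (F * p + c) % M
    cuts = []
    for i in range(k, len(payload) + 1):
        if F % m == 0:
            cuts.append(i)                 # boundary immediately after the window's last byte
        if i < len(payload):
            F = (F * p + payload[i] - payload[i - k] * pk) % M   # slide by one byte
    return cuts

# Blocks are the byte ranges between consecutive cuts, plus an overlap of o bytes;
# as noted in Section 2.5.3, payload before the first and after the last cut is not processed.
payload = b"some example payload for content-defined chunking " * 4
cuts = vbs_boundaries(payload, k=4, m=16)
o = 4
blocks = [payload[a:b + o] for a, b in zip(cuts, cuts[1:])]
```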
Querying in a VBS method is the simplest of all methods in this section because there
are no offsets and no alignment problems. Therefore, this method involves much fewer
tests for membership in a Bloom filter. Querying for an excerpt is done in the same way
as processing the payload in the previous paragraph, but instead of inserting blocks we query the
Bloom filter for them. The answer is positive only when all blocks are found in the Bloom filter. The
maximum number of queries to a Bloom filter in the worst case is about v · s times smaller
than in a BBF, where v is the number of possible starting offsets and s is the number of
possible alignments of the first block in a BBF, while assuming the average block size in a
VBS method to be s.
2.3.4 Enhanced Variable Block Shingling (EVBS)
The enhanced version of the variable block shingling method tries to solve a problem
with block sizes. A VBS can create many small blocks, which can flood the Bloom filter and
do not provide enough discriminability, or some large blocks, which can prevent querying
for smaller excerpts. In an EVBS we form superblocks composed of blocks found by a
VBS method to achieve better control over the size of blocks.
Figure 2.9: Processing of a payload with an Enhanced Variable Block Shingling (EVBS)
method.
To be precise, when processing a payload we slide a window of size k bytes through the
entire payload and for each position of the window we compute the value of the fingerprinting
function H(c1 , . . . , ck ) on the byte values of the payload as in the VBS method. When
H(c1 , . . . , ck ) mod m is equal to zero, we insert a block boundary after the current position
of byte ck . We take the resulting blocks of an expected size m bytes, one by one from the
start of the payload, and form superblocks, i.e., new non-overlapping blocks made of multiple
original blocks, with the size at least m′ bytes, where m′ ≥ m. We do this by selecting
some of the original block boundaries to be the boundaries of the new superblocks. Every
boundary that creates a superblock of size greater or equal to m′ is selected (Figure 2.9,
where the minimum superblock size is m′ ). Finally, superblocks with an overlap to the
next superblock of size o bytes are inserted into the Bloom filter. The maximum number
of queries to a Bloom filter in the worst case is about the same as for a VBS, assuming the
average block sizes for the two methods are the same.
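The superblock formation can be sketched as a simple pass over the VBS cut points; for simplicity the sketch measures the first superblock from the beginning of the payload:

```python
def evbs_superblock_cuts(vbs_cuts, m_prime: int):
    """Keep only those VBS cut points that close a superblock of at least m_prime bytes.
    'vbs_cuts' are boundary positions produced by a VBS-style function."""
    selected, last = [], 0
    for cut in vbs_cuts:
        if cut - last >= m_prime:
            selected.append(cut)
            last = cut
    return selected
```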
This leads, however, to a problem when querying for an excerpt. If we use the same
fingerprinting function H and parameter m we get the same block boundaries in the excerpt
as in the original payload, but the starting boundary of the first superblock inside the excerpt
is unknown. Therefore, we have to try all boundaries in the first m′ bytes of an excerpt (or
the first that follows if there was none) to form the first boundary of the first superblock.
The number of possible boundaries we have to try in an EVBS method (approximately
m′ /m) is much smaller than the number of possible alignments (i.e., the block size s) in an
HBF, for usual parameter values.
2.3.5 Winnowing Block Shingling (WBS)
In a Winnowing Block Shingling method we use the idea of winnowing, described in
Section 2.2.3, to select boundaries of blocks and shingling to resolve the consecutiveness of
blocks. We select the winnowing window size instead of a block size and we are guaranteed
to have at least one boundary in any window of this size inside the payload. This also sets
an upper bound on the block size.
We start by computing hash values for each payload byte position. In our implementation this is done by sliding a window of size k bytes through the whole payload and for
each position of the window we compute the value of a fingerprinting function H(c1 , . . . , ck )
on the byte values of the payload as in the VBS method. In this way we get an array of
hashes, where the i-th element is the hash of bytes ci , . . . , ci+k−1 , where ci is the i-th byte
of the payload of size p, for i = 1, . . . , (p − k + 1). Then we slide a winnowing window of size
w through this array and for each position of the winnowing window we put a boundary
immediately before the position of the maximum hash value within this window. If more than
one hash has the maximum value, we choose the rightmost one. Bytes between consecutive
pairs of boundaries form blocks (plus the beginning of size o of the next block, the overlap)
and they are inserted into a Bloom filter. See Figure 2.10.
When querying for an excerpt we do the same process except that we query the Bloom
filter for blocks instead of inserting them. If all were found in the Bloom filter the answer
to the query is positive. The maximum number of queries to a Bloom filter in the worst
case is about the same as for a VBS, assuming the average block sizes for the two methods
are the same.
Figure 2.10: Processing of a payload with a Winnowing Block Shingling (WBS) method.
First, we compute hash values for each payload byte position. Subsequently boundaries are
selected to be at the positions of the rightmost maximum hash value inside the winnowing
window which we slide through the array of hashes. Bytes between consecutive pairs of
boundaries form blocks (plus the overlap).
There is at least one block boundary in any window of size w. Therefore the longest
possible block size is w + 1 + o bytes. This also guarantees that there are always at least
two boundaries to form a block in an excerpt of size at least 2w + o bytes.
2.3.6 Variable Hierarchical Bloom Filter (VHBF)
Querying an HBF involves trying s possible starting positions of the first block in an
excerpt. In a VHBF method we avoid this by splitting the payload into variable-sized
blocks (see Figure 2.11) determined by a fingerprinting function as in Section 2.3.3 (VBS).
Building the hierarchy, the insertion of blocks and querying are the same as in the original
Hierarchical Bloom Filter; only the block boundary definition has changed. Reducing the
number of queries by a factor of s (the size of one block in an HBF) helps reduce the resulting
false positive rate of this method.
Notice that even if we added overlaps between blocks (i.e., used shingling) we would still
need to use offsets, because they serve to determine whether to check the next
level of the hierarchy during the query phase: the next level is checked only for even
offset numbers.
Figure 2.11: Processing of a payload with a Variable Hierarchical Bloom Filter (VHBF)
method.
2.3.7 Fixed Doubles (FD)
The method of fixed doubles is designed to address a shortcoming of a hierarchy in
an HBF. The hierarchy in an HBF is not complete in the sense that we do not insert all
double-blocks (blocks of size 2s) and all quadruple-blocks (4s), and so on, into the Bloom
filter. For example, when inserting a packet consisting of blocks S0 S1 S2 S3 S4 into an HBF,
we insert blocks S0 ,. . . , S4 , S0 S1 , S2 S3 , and S0 S1 S2 S3 . And if we query for an excerpt of
size 2s (or up to size 4s − 2 bytes, Figure 2.3), for example, S1 S2 , this block of size 2s is not
found in the HBF (nor are any other double-blocks at odd offsets), and the false positive rate
in this case is worse than that of a BBF, because an HBF needs about two times more space
than a BBF to achieve the same false positive rate of the Bloom filter. The same
is true for other levels of the hierarchy. In fact, the probability that this event happens
rises exponentially with the level number. As an alternative approach to the hierarchy we
insert all double-blocks as shown in Figure 2.12, but do not continue to the next level so as not
to increase the storage requirements. Note that this method is not identical to a FBS scheme
with an overlap of size s because in a FD we insert all single blocks and also all double
blocks.
In this method we use neither shingling nor offsets, because the consecutiveness problem
is solved by the level of double-blocks, which overlap with each other: each double-block shares
its first half with the previous double-block and its second half with the next one.
Figure 2.12: Processing of a payload with a Fixed Doubles (FD) method.
The query mechanism works as follows: We first find the correct alignment of the first
block of an excerpt by trying to query the Bloom filter for all windows of size s starting at
positions 0 through s − 1. Note that we can get multiple positive answers and in that case
we continue the process independently for all of them. Then we split the excerpt into blocks
of size s starting at the position found and query for each of them. Finally, we query for all
double-blocks, and if all answers are positive we claim that the excerpt was found.
The FD scheme inserts 2⌊p/s⌋ − 1 blocks into the Bloom filter for a payload of size p
bytes, which is approximately the same as an HBF and about two times more than an FBS
scheme. The maximum number of queries to a Bloom filter in the worst case is about two
times the number for a FBS method.
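A sketch of the FD enumeration (ignoring flowIDs) illustrates the 2⌊p/s⌋ − 1 element count:

```python
def fd_elements(payload: bytes, s: int):
    """Fixed Doubles: all single blocks of size s plus all double-blocks of size 2s;
    each double-block shares one half with its neighbours, which resolves consecutiveness."""
    n = len(payload) // s
    singles = [payload[i * s:(i + 1) * s] for i in range(n)]
    doubles = [payload[i * s:(i + 2) * s] for i in range(n - 1)]
    return singles + doubles

print(len(fd_elements(bytes(40), s=8)))   # 2 * (40 // 8) - 1 = 9 elements
```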
2.3.8 Variable Doubles (VD)
This method is similar to the previous one (FD) but the block boundaries are determined
by a fingerprinting function as in a VBS (Section 2.3.3). Hence, we do not have to try finding
the correct alignment of the first block of an excerpt when querying and the blocks have
variable size. Both the number of blocks inserted into the Bloom filter and the maximum
number of queries in the worst case are approximately the same as for the FD scheme. An
example is given in Figure 2.13.
Figure 2.13: Processing of a payload with a Variable Doubles (VD) method.
Similar to the FD method we use neither shingling nor offsets, because the consecutiveness problem is solved by the level of all double-blocks which overlap with each other and
with the single blocks.
During querying we simply divide the query excerpt into blocks by a fingerprinting
method and query the Bloom filter for all blocks and for all double-blocks. Finally, if all
answers are positive we claim that the excerpt is found.
2.3.9 Enhanced Variable Doubles (EVD)
The Enhanced Variable Doubles method uses the technique from Section 2.3.4 (EVBS)
to create an extension of a VD method by forming superblocks of a payload. Then these
superblocks are treated the same way as blocks in a VD method. Thus, we insert all
superblocks and all doubles of these superblocks into the Bloom filter as shown in Figure 2.14. The number of blocks inserted into the Bloom filter as well as the maximum
number of queries in the worst case is similar to that of the VD method (assuming similar
average block sizes of both schemes).
Figure 2.14: Processing of a payload with an Enhanced Variable Doubles (EVD) method.
2.3.10 Multi-Hashing (MH)
One would expect the technique of VBS to be strengthened by using multiple
independent VBS methods, because this provides greater flexibility in the choice of parameters, such as the expected block size. We call this technique Multi-Hashing; it uses t
independent fingerprinting methods (or fingerprinting functions with different parameters)
to set block boundaries as shown in Figure 2.15. It is equivalent to using t independent
Variable Block Shingling methods and the answer to excerpt queries is positive only if all
the t methods answer positively.
Note that even if we set the overlapping part, i.e., the parameter o, to zero for all
instances of the VBS, we would still get a guarantee that the excerpt has appeared on the
network as one continuous fragment.
Moreover, by using expected block sizes of the instances as multiples of powers of two
we can generate a hierarchical structure with the MH method.
The expected number of blocks inserted into the Bloom filter for a payload of size p
bytes is Σ_{i=1}^{t} ⌊p/mi⌋, where mi is the expected block size for the i-th VBS.
2.3.11 Enhanced Multi-Hashing (EMH)
The enhanced version of Multi-Hashing uses multiple instances of EVBS to increase
the certainty of answers to excerpt queries. Blocks inserted by independent instances of
EVBS are different and overlap with each other, and therefore improve the robustness of
this method. Aspects other than the superblock formation are the same as for the MH
method. In our experiments in Section 2.5 we use two independent instances of EVBS with
identical parameters and store the data for both in one Bloom filter.
Figure 2.15: Processing of a payload with a Multi-Hashing (MH) method. In this case, the
MH uses two independent instances of the Variable Block Shingling method simultaneously
to process the payload.
2.3.12 Winnowing Multi-Hashing (WMH)
The WMH method uses multiple instances of WBS (Section 2.3.5) to reduce the probability of false positives for excerpt queries. The WMH gives not only excellent control over
the block sizes due to winnowing (see Figure 2.20) but also provides much greater confidence
about the consecutiveness of the blocks inside the query excerpt because of overlaps both
inside each instance of WBS and among the blocks of multiple instances. Both querying
and payload processing are done for all t WBS instances and the final answer to an excerpt
query is positive only if all t answers are positive.
In our experiments in Section 2.5 we use two instances of WBS with identical winnowing
window size and store data from both methods in one Bloom filter. By storing data of each
instance in a separate Bloom filter we can allow data aging to save space by keeping only
some of the Bloom filters for very old data at the cost of higher false positive rates.
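A compact sketch of the two-instance setup is given below; plain sets stand in for the per-instance Bloom filters, SHA-1 with a per-instance salt stands in for the fingerprinting function, and the parameter values are illustrative only:

```python
import hashlib

def wbs_blocks(payload: bytes, k: int, w: int, o: int, salt: bytes):
    """One WBS instance: hash every k-gram, slide a winnowing window over w hashes,
    cut at the rightmost maximum in each window, keep an o-byte overlap per block."""
    if len(payload) < k + w:
        return []
    hashes = [int.from_bytes(hashlib.sha1(salt + payload[i:i + k]).digest()[:4], "big")
              for i in range(len(payload) - k + 1)]
    cuts = set()
    for start in range(len(hashes) - w + 1):
        window = hashes[start:start + w]
        top = max(window)
        cuts.add(start + max(i for i, h in enumerate(window) if h == top))
    cuts = sorted(cuts)
    return [payload[a:b + o] for a, b in zip(cuts, cuts[1:])]

# Two instances (t = 2) with different salts; plain sets stand in for the Bloom filters.
SALTS = [b"wbs-instance-0", b"wbs-instance-1"]
stores = [set(), set()]
packet = b"GET /index.html HTTP/1.1\r\nHost: example.com\r\n\r\n" * 4
for store, salt in zip(stores, SALTS):
    store.update(wbs_blocks(packet, k=4, w=16, o=4, salt=salt))

def wmh_query(excerpt: bytes) -> bool:
    """Positive only if every instance finds all blocks of the excerpt; as in the text,
    the excerpt should be at least about 2w + o bytes so that it contains two boundaries."""
    return all(all(b in store for b in wbs_blocks(excerpt, k=4, w=16, o=4, salt=salt))
               for store, salt in zip(stores, SALTS))
```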
For an example of processing a payload and querying in a WMH method, see the multi-packet case in Fig. 2.16 and 2.17.
2.4 Payload Attribution Systems Challenges
As mentioned in the introduction, a payload attribution system performs two separate tasks: payload processing and query processing. In payload processing, the payload
of all traffic that passed through the network where the PAS is deployed is examined and
some information is saved into permanent storage. This has to be done at line speed and the
underlying raw packet capture component can also perform some filtering of the packets,
for example, choosing to process only traffic of a particular protocol (HTTP, FTP, SMTP,
etc).
Data is stored in archive units, each of which has two timestamps (start and end of
the time interval during which we collected the data). For each time interval we also need
to save all flowIDs (e.g., pairs of source and destination IP addresses) to allow querying
later on. This information can be alternatively obtained from connection records collected
by firewalls, intrusion detection systems or other log files.
During query processing, given the excerpt and a time interval of interest, we have to
retrieve all the corresponding archive units from the storage. We query each unit for the
excerpt and if we get a positive answer we try to query successively for each of the flowIDs
appended to the blocks of the excerpt and report all matches to the user.
2.4.1 Attacks on PAS
As with any security system, there are ways an adversary can evade proper attribution.
We identify the following types of attacks on a PAS (mostly similar to those in [49]):
Compression and Encryption
If the payload is compressed or encrypted, a PAS can support queries only for the exact
compressed or encrypted form.
Fragmentation
An attacker can transform the stream of data into a sequence of packets with payload
sizes much smaller than the (average) block size we use in the PAS. Methods with variable
block sizes where block boundaries depend on the payload are harder to beat, but, for very
small fragments, for example 6 bytes each, the system will not be able to do the attribution
correctly. A solution is to make the PAS stateful so that it concatenates payloads of one
data stream prior to processing. However, such a solution would impose additional memory
and computational costs and there are known attacks on stateful IDS systems [23], such as
incorrect fragmentation and timing attacks.
Boundary Selection Hacks
For methods with block boundaries depending on the payload, an attacker can try to
send specially crafted packets whose payload contains either too many or no boundaries. The
PAS can use different parameters for the boundary selection algorithm in each archive unit
so that it would be impossible for an attacker to fool the system. Moreover, winnowing
guarantees at least one boundary in each winnowing window.
Hash Collisions
Hash collisions are very unpredictable and therefore hard to use by an attacker because
we use a different salt for the hash computation in each Bloom filter.
Stuffing
An attacker can inject characters into the payload that are ignored by applications but change the payload structure at the network layer. Our methods are robust
against stuffing because the attacker has to modify most of the payload to avoid correct
attribution as we can match even very small excerpts of payload.
Resource Exhaustion
Flooding attacks can impair a PAS. However, our methods are more robust to these
attacks than raw packet loggers due to the data reduction they provide. Moreover, processing identical payloads repeatedly does not impact the precision of attribution because
the insertion into a Bloom filter is an idempotent operation. On the other hand, the list of
flowIDs is vulnerable to flooding, for example, when a worm tries to propagate out of the
network by trying many random destination addresses.
Spoofing
Source IP addresses can be spoofed and a PAS is primarily concerned with attributing
payload according to what packets have been delivered by the network. The scope of
possible spoofing depends on the deployment of the system and filtering applied in affected
networks.
2.4.2 Multi-packet queries
The methods described in Section 2.3 show how to query for excerpts inside one packet’s
payload. Nevertheless, we can extend the querying mechanism to handle strings that span
multiple packets.
Methods which use offsets have to continue querying for the next block (the one that was
not found with its sequential offset number) with a zero offset instead, and they must try all
alignments of that block as well, because the fragmentation into packets could leave a part
smaller than the block size at the end of the first packet. This is very inefficient and increases the
false positive rate. Moreover, for methods that form a hierarchy of blocks it means that the
hierarchy cannot be fully utilized. The payload attribution system can perform TCP stream
reconstruction and work on the reconstructed flow to avoid this problem.
On the other hand, methods using shingling can be extended without any further changes
if we return as an answer to the query the full sequence of blocks found (see a WMH example
in Fig. 2.16 and 2.17).
2.4.3 Privacy and Simple Access Control
Processing and archiving payload information must comply with the privacy and security
policies of the network where they are performed. Furthermore, authorization to use the
payload attribution system should be granted only to properly authorized parties and all
necessary precautions must be taken to minimize the possibility of a system compromise.
Figure 2.16: Processing of payloads of two packets with a Winnowing Multi-Hashing
(WMH) method where both packets are processed by two independent instances of the
Winnowing Block Shingling (WBS) method simultaneously.
Figure 2.17: Querying for an excerpt spanning multiple packets in a Winnowing Multi-Hashing (WMH) method comprised of two instances of WBS. We assume the WMH method
processed the packets in Fig. 2.16 prior to querying. In this case, we see that WMH can
easily query for an excerpt spanning two packets and that the blocks found significantly
overlap which increases the confidence of the query result. However, there is still a small
gap between the two parts because WMH works on individual packets (unless we perform
TCP stream reconstruction).
The privacy stems from using a Bloom filter to hold the data. It is only possible to
query the Bloom filter for a specific packet content but it cannot be forced to provide a list
of packet data stored inside. Simple access control (i.e., restricting the ability to query the
Bloom filter) can be easily achieved as follows. Our methods allow the collected data to be
stored and queried by an untrusted party without disclosing any payload information or
giving the query engine any knowledge of the contents of queries. We achieve this by adding
a secret salt when computing hashes for insertion and querying the Bloom filter. A different
salt is used for each Bloom filter and serves the purpose of a secret key. We can also easily
achieve much finer granularity of access control by using different keys for different protocols
or subranges of IP address space. Without a key the Bloom filter cannot be queried and
the key does not have to be made available to the querier (only the indices of bits for which
we want to query are disclosed). Without knowing the key a third party cannot query the
Bloom filter. However, additional measures must be taken to enforce that the third party
provides correct answers and does not alter the archived data. Also note that this kind of
access control is not cryptographically secure and some information leakage can occur. On
the other hand, there is no additional computational or storage cost associated with using
it and also no need for decrypting the data before querying as is common with standard
techniques. A detailed analysis of privacy achieved by using Bloom filters can be found
in [5].
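A minimal sketch of this keyed index derivation is given below; HMAC-SHA256 is used here as one possible salted hash, which is our choice for the sketch rather than a detail fixed by the system:

```python
import hashlib, hmac

def bit_indices(item: bytes, salt: bytes, m_bits: int, k: int):
    """Derive the k Bloom filter bit positions for 'item' from a secret per-filter salt.
    Only these indices are disclosed to an untrusted query engine; without the salt the
    filter cannot be meaningfully queried."""
    return [int.from_bytes(hmac.new(salt, bytes([i]) + item, hashlib.sha256).digest()[:8],
                           "big") % m_bits
            for i in range(k)]

# The key holder turns each block of a query excerpt into bit indices; the untrusted
# store only has to report whether all of those bits are set.
print(bit_indices(b"example-block", salt=b"per-filter-secret", m_bits=31 * 1024 * 8, k=7))
```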
2.4.4 Compression
In addition to the inherent data reduction provided by our attribution methods due
to the use of Bloom filters, our experiments show that we can achieve an additional storage
saving of about 20 percent by compressing the archived data (after careful optimization of
parameters [34]), for example by gzip. The results presented in the next section do not
include this additional compression.
2.5 Experimental Results
In this section we show performance measurements of payload attribution methods
described in Section 2.3 and discuss the results from various perspectives. For this purpose
we collected a network trace of 4 GB of HTTP traffic from our campus network. For
performance evaluation throughout this section we consider processing 3.1 MB segment
(about 5000 packets) of the trace as one unit trace collected during one time interval. As
discussed earlier, we store all network traffic information in one Bloom filter in memory and
we save it to a permanent storage in predefined time intervals or when it becomes full. A
new Bloom filter is used for each time interval. The time interval should be short because it
determines the time precision at which we can attribute packets, that is, we can determine
only in which time interval a packet appeared on the network, not the exact time. The
results presented do not depend on the size of the unit because we use a data reduction
ratio to set the Bloom filter size (for example 100:1 means a Bloom filter of size 31 kB). Each
method uses one Bloom filter of an equal size to store all data. Our results did not show any
deviations depending on the selection of the segment within the trace. All methods were
tested to select the best combination of parameters for each of them. Results are grouped
into subsections by different points of interest.
2.5.1 Performance Metrics
To compare payload attribution methods we consider several aspects which are not
completely independent. The first and most important aspect is the amount of storage
space a method needs to allow querying with a false positive rate bounded by a pre-defined
value. We provide a detailed comparison and analysis in the following subsections. Second,
the methods differ in the number of elements they insert into a Bloom filter when processing
packets and also in the number of queries to a Bloom filter performed when querying for an
excerpt in the worst case (that is when the answer is negative). A summary can be found in
Table 2.2. Methods which use shingling and variable block sizes achieve a significant decrease
in the number of queries they have to perform to analyze each excerpt. This is important not
only for the computational performance but also for the resulting false positive rate, as each
query to the Bloom filter carries a risk of a false positive answer. The boundary selection
techniques these methods use are very computationally efficient and can be performed in
a single pass through the payload. The implementation can be highly optimized for a
particular platform and some parts of the processing can also be done by special hardware.
Our implementation running on a Linux-based commodity PC (with a kernel modified for
fast packet capturing [37]) can smoothly handle 200 Mbps and the processing can be easily
split among multiple machines (e.g., by having each machine process packets according to
a hash value of the packet header).
2.5.2 Block Size Distribution
The graphs in Figure 2.18, Figure 2.19 and Figure 2.20 show the distributions of block
sizes for three different methods of block boundary selection. We use a block (or winnowing
window) size parameter of 32 bytes, a small block size for an EVBS of 8 bytes, and an
overlap of 4 bytes. Both VBS and EVBS show a distribution with an exponential decrease
in the number of blocks with an increasing block size, shifted by the overlap size for a VBS
or the block size plus the overlap size for an EVBS. Long tails were cropped for clarity and
the longest block was 1029 bytes long.
Table 2.2: Comparison of payload attribution methods from Section 2.3 based on the number of elements inserted into a Bloom filter when processing one packet of a fixed size and
the number of blocks tested for presence in a Bloom filter when querying for an excerpt of
a fixed size in the worst case (i.e., when the answer is negative). Note that the values are
approximations and we assume all methods have the same average block size. The variables
refer to: n: the number of blocks inserted by a BBF taken as a base, s: the size of a block in
fixed block size methods (BBF, HBF, FBS, FD), v: the number of possible offset numbers,
p: the number of alignments tested for enhanced methods (EVBS, EVD, EMH). Note that
the actual number of bits tested or set in a Bloom filter depends on the number of hash
functions used for each method and therefore this table presents numbers of blocks.
Figure 2.18: The distributions of block sizes for the VBS method after processing 100000
packets of HTTP traffic.
On the other hand, the winnowing method results in a fairly uniform distribution where the
block sizes are bounded by the winnowing window size plus the overlap. The apparent peaks
for the smallest block size in Figures 2.18 and 2.20 are caused by low-entropy payloads, such
as long blocks of zeros. The distributions of block sizes obtained by processing random
Figure 2.19: The distributions of block sizes for EVBS method after processing 100000
packets of HTTP traffic
Figure 2.20: The distributions of block sizes for WBS method after processing 100000
packets of HTTP traffic
payloads generated with the same payload sizes as in the original trace show the same
distributions, just without these peaks. Nevertheless, the large number of small blocks does
not significantly affect the attribution because inserting a block into the Bloom filter is an
idempotent operation.
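To illustrate why winnowing bounds the block sizes, the following sketch selects boundaries at the position of the minimum hash within each window of w consecutive k-gram hashes (ties broken toward the right). This is a minimal illustration under our own assumptions about hashing and tie-breaking, not the exact boundary selection code of Section 2.3; it only shows that at least one boundary falls in every window of w positions, so no block can grow beyond roughly the window size:

    import java.util.ArrayList;
    import java.util.List;

    /** Minimal sketch of winnowing-style block boundary selection. */
    public class WinnowingBoundaries {

        // Simple hash of a k-byte gram starting at pos (illustrative choice).
        static long gramHash(byte[] data, int pos, int k) {
            long h = 1125899906842597L;
            for (int i = 0; i < k && pos + i < data.length; i++) {
                h = 31 * h + (data[pos + i] & 0xff);
            }
            return h;
        }

        /** Returns the selected boundary positions in the payload. */
        static List<Integer> selectBoundaries(byte[] payload, int w, int k) {
            List<Integer> boundaries = new ArrayList<>();
            int n = payload.length;
            if (n < w) return boundaries;          // too short: no boundary selected
            long[] hashes = new long[n];
            for (int i = 0; i < n; i++) hashes[i] = gramHash(payload, i, k);

            int lastSelected = -1;
            for (int start = 0; start + w <= n; start++) {   // slide the window
                int minPos = start;
                for (int j = start; j < start + w; j++) {
                    if (hashes[j] <= hashes[minPos]) minPos = j;  // rightmost minimum
                }
                if (minPos != lastSelected) {      // report each boundary once
                    boundaries.add(minPos);
                    lastSelected = minPos;
                }
            }
            return boundaries;
        }

        public static void main(String[] args) {
            byte[] payload = "GET /index.html HTTP/1.1 Host: example.com".getBytes();
            System.out.println(selectBoundaries(payload, 8, 4));
        }
    }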
2.5.3 Unprocessed Payload
Some fraction of each packet’s payload is not processed by the attribution mechanisms
presented in Section 2.3. Table 2.3(b) shows how each boundary selection method affects
the percentage of unprocessed payload. For methods with a fixed block size, the part of
a payload between the last block’s boundary and the end of the payload is ignored by
the payload attribution system. With the (enhanced) Rabin fingerprinting and winnowing
methods, the part starting at the beginning of the payload and ending at the first block
boundary, as well as the part between the last block boundary and the end of the payload, are
not processed. The enhanced version of Rabin fingerprinting achieves much better results
because the small block size, which was four times smaller than the superblock size in our
test, applies when selecting the first block boundary in a payload.
Winnowing performs better than the other methods with a variable block size in terms of
unprocessed payload. Note also that a WMH method, even though it uses winnowing for
block boundary selection just as a WBS does, has a percentage of unprocessed payload about
t times smaller than a WBS, because each of the t instances of a WBS within the
WMH covers a different part of the payload independently. Moreover, the “inner” part of
the payload is covered t times, which makes the method much more resistant to collisions
because t collisions have to occur at the same time to produce a false positive answer.
For large packets the small percentage of unprocessed payload does not pose a problem;
however, very small packets, for example ones only 6 bytes long, may
not be processed at all. Therefore we can optionally insert the entire payload of a
packet in addition to inserting all blocks, and add a special query type to the system to
support queries for exact packets. This increases the storage requirements only slightly
because we would insert one additional element per packet into the Bloom filter.
Table 2.3: (a) False positive rates for data reduction ratio 130:1 for a WMH method with
a winnowing window size 64 bytes and therefore an average block size about 32 bytes. The
table summarizes answers to 10000 queries for each query excerpt size. All 10000 answers
should be NO. YES answers are due to false positives inherent to a Bloom filter. WMH
guarantees no N/A answers for these excerpt sizes. (b) The percentage of unprocessed
payload of 50000 packets depending on the block boundary selection method used. Details
are provided in Sections 2.5.2 and 2.5.3.
Table 2.4: Measurements of false positive rate for data reduction ratio 50:1. The table
summarizes answers to 10000 excerpt queries using all methods (with block size 32 bytes)
described in Section 2.3 for various query excerpt lengths (top row). These queries were
performed after processing a real packet trace and all methods use the same size of a Bloom
filter (50 times smaller than the trace size). All 10000 answers should be NO since these
excerpts were not present in the trace. YES answers are due to false positives in the Bloom
filter, and N/A answers mean that there were no boundaries selected inside the excerpt to
form a block for which we can query.
Figure 2.21: The graph shows the number of correct answers to 10000 excerpt queries for
a varying length of a query excerpt for each method (with block size or winnowing window
size 64 bytes) and data reduction ratio 100:1. This reduction ratio can be further improved
to about 120:1 by using a compressed Bloom filter. The WMH method has no false positives
for excerpt sizes 250 bytes and longer. The previous state-of-the-art method HBF does not
provide any useful answers at this high data reduction ratio for excerpts shorter than about
400 bytes.
2.5.4 Query Answers
To measure and compare the performance of attribution methods, and in particular
to analyze the false positive rate, we processed the trace by each method and queried for
random strings which included a small excerpt of size 8 bytes in the middle that did not
occur in the trace. In this way we made sure that these query strings did not represent
payloads inserted into the Bloom filters. Every method has to answer YES, NO, or
not available (N/A) to each query. A YES answer means a match was found for
the entire query string for at least one of the flowIDs and represents a false positive. A NO
answer is the correct answer for the query, and an N/A answer is returned if the blocking
mechanism specific to the method did not select even one block inside the query excerpt.
An N/A answer can occur, for example, when the query excerpt is smaller than the block
size in an HBF, or when there was one or no boundary selected in the excerpt by a VBS
method and so there was no block to query for.
In Table 2.1 we summarize the possibility of getting N/A and false negative answers for
each method. A false negative answer can occur when we query for an excerpt whose
size is greater than the block size but none of the alignments of blocks inside the excerpt fit
the alignment that has been used to process the payload which contained the excerpt. For
example, if we processed a payload ABCDEFGHI by an HBF with block size 4 bytes,
we would have blocks ABCD, EFGH and ABCDEFGH, and if we queried for an excerpt
BCDEFG, the HBF would answer NO. Note that false negatives can occur only for excerpts
smaller than twice the block size and only for methods which involve testing the alignment
of blocks.
Table 2.4 provides detailed results of 10000 excerpt queries for all methods with the
same storage capacity and data reduction ratio 50:1. The WMH method achieves the best
results among the listed methods for all excerpt sizes and has no false positives for excerpts
longer than 200 bytes. WMH also has the fewest N/A answers among the
methods with a variable block size because it guarantees at least one block boundary in
each winnowing window. The results also show that methods with a variable block size
are in general better than methods with a fixed block size because there are no problems
with finding the right alignment. The enhanced version of Rabin fingerprinting for
block boundary selection does not perform better than the original version. This is mostly
because we need to try all alignments of blocks inside superblocks when querying, which
increases the false positive rate. The enhanced methods therefore improve block size
control but not necessarily the false positive rate.
The graph in Figure 2.21 shows the number of correct answers to 10000 excerpt queries
as a function of the length of a query excerpt for each method, with the block size parameter
set to 64 bytes and data reduction ratio 100:1. The WMH method outperforms all other
methods for all excerpt lengths. On the other hand, the HBF’s results are the worst because
it can fully utilize the hierarchy only for long excerpts and it has a very high false positive
rate at high data reduction ratios due to the use of offsets, problems with block alignments,
and the large number of elements inserted into the Bloom filter. As can be observed when
comparing the HBF’s results to those of an FD method, using double-blocks instead of building
a hierarchy is a significant improvement, and an FD, for excerpt sizes of 220 bytes and longer,
performs even better than the variable block size version of the hierarchy, a VHBF. It is
interesting to see that an HBF outperforms a VHBF for excerpts of length 400
bytes. For excerpts longer than 400 bytes (not shown) they perform about the same and
both quickly achieve no false positives. The results of the methods in this graph are clearly
separated into two groups by performance; the curves representing the methods which use
a variable block size and do not use offset numbers or superblocks (i.e., VBS, VD, WBS,
MH, WMH) have a concave shape and in general perform better. For very long excerpts all
methods provide highly reliable results.
The Winnowing Multi-Hashing method achieves the best overall performance in all our tests
and allows querying for very small excerpts because the average block size is approximately
half of the winnowing window size (plus the overlap size). The average block size was 18.9
bytes for a winnowing window size of 32 bytes and an overlap of 4 bytes. Table 2.3(a) shows the
false positive rates for WMH for a data reduction ratio of 130:1. This data reduction ratio
means that the total size of the processed payload was 130 times the size of the Bloom filter
which is archived to allow querying. The Bloom filter could be additionally compressed
to achieve a final compression of about 158:1, but note that the parameters of the Bloom
filter (i.e., the relation among the number of hash functions used, the number of elements
inserted and the size of the Bloom filter) then have to be set in a different way [34]. We have to
decide in advance whether we want to use the additional compression of the Bloom filter
and, if so, optimize the parameters for it; otherwise the Bloom filter data has very
high entropy and is hard to compress. The compression is possible because the approximate
representation of a set by a standard Bloom filter does not reach the information-theoretic
lower bound [10].
The winnowing block shingling method (WBS) performs almost as well as WMH (see
Fig. 2.21) and requires t times less computation. However, the confidence of its results
is lower than with WMH, because multi-hashing covers the majority of each query excerpt
multiple times; if storage space is needed, a data aging method can be used to downgrade a
WMH to a simple WBS later.
2.6 Conclusion
In this chapter, we presented novel methods for payload attribution. When incorporated
into a network forensics system, they provide an efficient probabilistic query mechanism to
answer queries for excerpts of a payload that passed through the network. Our methods
allow data reduction ratios greater than 100:1 while maintaining a very low false positive
rate. They allow queries for very small excerpts of a payload and also for excerpts that
span multiple packets. The experimental results show that our methods represent a significant
improvement in query accuracy and storage space requirements compared to previous
attribution techniques. More specifically, we found that winnowing is the best technique
for block boundary selection in payload attribution applications, that shingling
as a method for consecutiveness resolution is a clear win over the use of offset numbers, and,
finally, that the use of multiple instances of payload attribution methods can provide additional
benefits, such as improved false positive rates and data-aging capability. These techniques
combined form a payload attribution method called Winnowing Multi-Hashing,
which substantially outperforms previous methods. The experimental results also show
that, in general, the accuracy of attribution increases with the length and the specificity of a
query. Moreover, privacy and simple access control are achieved by the use of Bloom filters
and one-way hashing with a secret key. Thus, even if the system is compromised, no raw
traffic data is ever exposed, and querying the system is possible only with knowledge of
the secret key. We believe that these methods also have a much broader range of applicability
in various areas where large amounts of data are being processed.
In addition to network payload data, the systems used for network monitoring and
forensics make use of network flow data, which represents structured information about each
connection record. This information, including the flowID (with source IP and destination IP),
can be stored and queried efficiently by keeping the data in tables similar to traditional
database systems. However, as shown in [19, 55], when working with large amounts of
network flow data, the performance of the storage system may be influenced by how the
data is stored on disk, row-by-row or column-by-column, and by the characteristics of the
expected query workload. In the next chapter we analyze the existing systems used to
store network flow data and propose an efficient storage infrastructure that can be used
for monitoring and forensics along with a payload attribution system that incorporates the
methods presented in Section 2.3 of this chapter.
Chapter 3
A Storage Infrastructure for Network Flow Data
3.1 Introduction
In the previous chapter we defined the payload attribution problem and presented a
suite of payload attribution methods that can be used as modules in a payload attribution
system. In order to provide meaningful attribution results, a payload attribution system has
to use the network flow identifiers (or flowIDs) corresponding to a time interval in the past.
However, maintaining the flowIDs in a simple list, as described in the previous chapter, cannot
provide satisfactory runtime performance for queries accessing large amounts of network
flow data spanning long periods of time. In such cases more efficient methods are
needed to store network flowID information.
Additionally, network monitoring systems, which traditionally were designed to detect and
flag malicious or suspicious activity in real time, are increasingly providing the ability to
assist in network forensic investigations and to identify the root cause of a security breach.
This may involve checking a suspected host’s past network activity, looking up any services
run by a host, the protocols used, the connection records to other hosts that may or may not
be compromised, etc. This new workload requires flexible and fast access to increasing
amounts of historical network flow data.
In this chapter we present the design, implementation details and the evaluation of a
column-oriented storage infrastructure called NetStore, designed to store and analyze very
large amounts of network flow data. The proposed system can be used in conjunction with
other network monitoring and forensics systems, such as a payload attribution system using
the methods presented in the previous chapter, in order to make informed security decisions.
Recall from Chapter 1 that we refer to a flow as a unidirectional data stream between
two endpoints and to a flow record as a quantitative description of a flow. In general we
refer to a flowID as the key that uniquely identifies a flow. In the previous chapter we defined
a flowID as being composed only of the source and destination IPs, since only those items
were required by the payload attribution system. However, when designing the storage
infrastructure for network flow data we consider the flowID as being composed of five
attributes: source IP, source port, destination IP, destination port and protocol. We assume
that each flow record has an associated start time and end time representing the time
interval when the flow was active in the network.
Challenges
Network flow data can grow very large in the number of records and in storage footprint.
Figure 3.1 and Figure 3.2 show the network flow distribution of traffic captured from edge
routers in a moderately sized campus network for a day and a month, respectively. This
network, with about 3,000 hosts, commonly reaches up to 1,300 flows/second, an average of 53
million flows daily and roughly 1.7 billion flows in a month. We consider records with an
average size of 200 bytes. Besides CISCO NetFlow data [57] there may be other specific
information that a sensor can capture from the network, such as IP, transport and
application header information. Hence, in this example, the storage requirement is roughly
10 GB of data per day, which adds up to at least 310 GB per month. When working with
large amounts of disk-resident data, the main challenge is no longer to ensure the necessary
storage space, but to minimize the time it takes to process and access the data.
An efficient storage and querying infrastructure for network records has to cope with two
main technical challenges: keeping the insertion rate high and providing fast access to the desired
flow records. When using a traditional row-oriented Relational Database Management
Figure 3.1: Network flow traffic distribution for one day. In a typical day the busiest time
interval is 1PM - 2PM with 4,381,876 flows, and the slowest time interval is 5AM - 6AM
with 978,888 flows.
System (RDBMS), the relevant flow attributes are inserted as a row into a table as they
are captured from the network, and are indexed using various techniques [18]. On the
one hand, such a system has to establish a trade-off between the desired insertion rate
and the storage and processing overhead incurred by the use of auxiliary indexing data
structures. On the other hand, enabling indexing for more attributes ultimately improves
query performance but also increases the storage requirements and decreases insertion rates.
At query time, all the columns of the table have to be loaded into memory even if only a
subset of the attributes is relevant for the query, and loading the unused columns adds a
significant I/O penalty to the overall query processing time.
Therefore, when querying disk-resident data, an important problem is to overcome the
I/O bottleneck specific to large disk-to-memory data transfers. One partial solution is to
load only the data that is relevant to the query. For example, to answer the query “What is the
Figure 3.2: Network flow traffic distribution for one month. For a typical month we noticed the slowdown on weekends and the peak traffic on weekdays. Days marked with *
correspond to a break week.
list of all IPs that contacted IP X between dates d1 and d2?”, the system should load only
the source and destination IPs, as well as the timestamps, of the flows that fall between dates
d1 and d2. The I/O time can also be decreased if the accessed data is compressed, since
less data traverses the disk-to-memory boundary. Further, the overall query response time
can be improved if data is processed in compressed format, thus saving the decompression
time. Therefore, the system should use data access and compression strategies and
algorithms that achieve the best compression ratios while keeping decompression
very fast or unnecessary at query time. Since the system should insert elements
at line speed, all the preprocessing algorithms used should add negligible overhead to the
disk writing process. Considering all the above requirements and implications, a column-oriented
storage architecture seems to be a good fit for storing network flow data captured
over prolonged periods of time.
Column Stores Overview
As briefly presented in Section 1.3.2 in Chapter 1, the basic idea of column orientation is
to store the data by columns rather than by rows, where each column holds data for a single
attribute of the flow and is stored sequentially on disk. Such a strategy makes the system
I/O efficient for read queries since only the attributes required by a query need to be read
from disk. It is widely accepted both in academic communities [1,2,25,55,60] as well as
in industry [13,27,30,58] that column-stores provide better performance than row-stores for
analytical query workloads. However, most commercial and open-source column stores
were conceived to follow general-purpose RDBMS requirements; they do not fully use the
semantics of the data carried and do not take advantage of the specific types and data
access patterns of network forensic and monitoring queries. In this chapter we present the
design, implementation details and the evaluation of NetStore, a column-oriented storage
infrastructure for network records that, unlike these other systems, is intended to provide
good performance specifically for network flow record data.
Contribution
NetStore is the implementation of a network flow data storage infrastructure that can
work jointly with other systems that process massive amounts of data, such as the ones
described in [7,12,15,56]. The column-based storage is similar to a column-oriented database
and partitions the network attributes into columns, one column for each attribute. Each
column holds data of the same type and therefore can be heavily compressed. The
compression algorithms used depend on a set of features extracted from the data. We
describe the compression methods and the selection strategy later in Section 3.3.3.
NetStore is designed and implemented to facilitate efficient interaction with various security
applications such as firewalls, Network Intrusion Detection Systems (NIDSs) and other
security administration tools, as shown in Figure 3.3. To the best of our knowledge, NetStore is
the first column-oriented storage design specially tuned for historical network
flow data. The key contributions of this chapter include the following:
• A simple and efficient column-oriented design of NetStore that enables quick access to
large amounts of data for monitoring and forensic analysis.
• Efficient compression methods and selection strategies that facilitate the best compression for network flow data and allow accessing and querying data in compressed
format.
• Implementation and deployment of NetStore using commodity hardware and open
source software, as well as analysis and comparison with other open source storage
systems currently used in practice.
The rest of the chapter is organized as follows: we present related work in Section 3.2,
and the NetStore system architecture and the details of each component in Section 3.3. Experimental
results and evaluation are presented in Section 3.4, and we conclude the chapter in
Section 3.5.
3.2 Related Work
The problem of discovering network security incidents has received significant attention
over the past years. Most of the work has focused on near-real-time security event
detection, by improving existing security mechanisms that monitor traffic at the network
perimeter and block known attacks, detect suspicious network behavior such as network
scans, or detect malicious binary transfers [38, 46]. Other systems, such as Tribeca [56] and
Gigascope [15], use stream databases and process network data as it arrives, but do not store the
data for retroactive analysis over long periods of time. There has been some work on
storing network flow records using a traditional RDBMS such as PostgreSQL [18]. In this
approach, when a NIDS triggers an alarm, the database system builds indexes and materialized
views for the attributes that are the subject of the alarm and could potentially be
used by forensic queries in the investigation of the alarm. The system works reasonably
well for small networks and is able to help forensic analysis for events that happened over
the last few hours. However, queries for traffic spanning more than a few hours become I/O
bound, and the auxiliary data used to speed up the queries slows down the record insertion
process. Therefore, such a solution is not feasible for medium to large networks, and not
even for small networks in the future, if we consider the accelerated growth of internet
traffic. Additionally, a time window of several hours is not a realistic assumption when
trying to detect the behavior of a complex botnet engaged in stealthy malicious activity
over prolonged periods of time.
In the database community, many researchers have proposed the physical organization
of database storage by columns in order to cope with poor read query performance of
traditional row-based RDBMS [13, 30, 52, 55, 60]. As shown in [2, 22, 25, 55], a column store
provides many times better performance than a row store for read intensive workloads.
In [60] the focus is on optimizing the cache-RAM access time by decompressing data in
the cache rather than in the RAM. This system assumes the working columns are RAM
resident, and shows a performance penalty if data has to be read from the disk and processed
in the same run. The solution in [55] relies on processing parallelism by partitioning data
into sets of columns, called projections, indexed and sorted together, independent of other
projections. This layout has the benefit of rapid loading of the attributes belonging to
the same projection and referred to by the same query without the use of auxiliary data
structure for tuple reconstruction. However, when attributes from different projections
are accessed, the tuple reconstruction process adds significant overhead to the data access
pattern. The system presented in [52] emphasizes the use of an auxiliary metadata layer on
top of the column partitioning that is shown to be an efficient alternative to the indexing
approach. However, the metadata overhead is sizable and the design does not take into
account the correlation between various attributes.
Finally, in [25] the authors present several factors that should be considered when deciding
between a column store and a row store for a read-intensive workload. The relatively
large number of network flow attributes, and workloads dominated by queries with
large selectivity and few predicates, favor the use of a column store for
historical network flow record storage.
NetStore is a column-oriented storage infrastructure that shares some features
with these other systems and is designed to provide the best performance for large amounts
of disk-resident network flow records. It avoids tuple reconstruction overhead by keeping
the same order of elements in all columns at all times. It provides fast data insertion and
quick querying by dynamically choosing the most suitable compression method available
and by using a simple and efficient design with a negligible metadata layer overhead.
3.3 Architecture
In this section we describe the architecture and the key components of NetStore. We
first present the characteristics of the network data and the query types that guide our design.
We then describe the technical design details: how the data is partitioned into columns,
how columns are partitioned into segments, which compression methods are used and
how a compression method is selected for each segment. Finally, we present the metadata
associated with each segment (the index nodes) and the internal IPs inverted index structure,
as well as the basic set of operators.
Figure 3.3: NetStore main components: Processing Engine and Column-Store.
3.3.1 Network Flow Data
Network flow records and the queries made on them show some special characteristics
compared to other time-sequential data, and we tried to apply this knowledge as early
as possible in the design of the system. First, flow attributes tend to exhibit temporal
clustering, that is, the range of values is small within short time intervals. Second, the
attributes of flows with the same source IP and destination IP tend to have the same
values (e.g. port numbers, protocols, packet sizes, etc.). Third, columns of some attributes
can be efficiently encoded when partitioned into time-based segments that are encoded
independently. Finally, most attributes that are of interest for monitoring and forensics can
be encoded using basic integer data types.
Record insertion is represented by bulk loads of time-sequential data that
will not be updated after writing. Having the attributes stored in the same order across
the columns makes the join operation trivial when attributes from more than one
column are used together. Network data analysis does not require fast random access on all
the attributes. Most monitoring queries need fast sequential access to a large number of
records and the ability to aggregate and summarize the data over a time window. Forensic
queries access specific, predictable attributes, but collected over longer periods of time. To
observe their specific characteristics we first compiled a comprehensive list of forensic and
monitoring queries used in practice in various scenarios [17]. Based on the data access
pattern, we identified five types among the initial list: Spot queries (S), which target a single
key (usually an IP address or port number) and return a list with the values associated
with that key; Range queries (R), which return a list with results for multiple keys (usually
attributes corresponding to the IPs of a subnet); Aggregation queries (A), which aggregate
the data for the entire network and return the result of the aggregation (e.g. the traffic sent out
of the network); Spot Aggregation queries (SA), which aggregate the values found for one key into
a single value; and Range Aggregation queries (RA), which aggregate data for multiple keys into
a single value. Examples of these types of queries, expressed in plain words:
(S) “What applications are observed on host X between dates d1 and d2?”
(R) “What is the list of destination IPs that have source IPs in a subnet between dates d1 and d2?”
(A) “What is the total number of connections for the entire network between dates d1 and d2?”
(SA) “What is the number of bytes that host X sent between dates d1 and d2?”
(RA) “What is the number of hosts that each of the hosts in a subnet contacted between dates d1 and d2?”
3.3.2 Column Oriented Storage
Having described the network flow data characteristics and the types of forensic and monitoring
queries on this data, in this section we introduce the main components
of the column-oriented storage architecture: columns, segments, the column index and the
internal IPs index.
Columns
In NetStore, we consider that flow records with n attributes are stored in a logical
table with n columns and an increasing number of rows (tuples), one for each flow record.
The values of each attribute are stored in one column and have the same data type.
By default, the values of a column are not sorted. Having the data in a column sorted
might help achieve better compression and faster retrieval, but changing the initial
order of the elements requires the use of an auxiliary data structure for tuple reconstruction at
query time. We investigated several techniques to ease tuple reconstruction, and all of them
added more overhead at query time than the benefit gained from better compression and faster
data access. Therefore, we decided to maintain the same order of elements across columns
to avoid any tuple reconstruction penalty when querying. However, since we can afford one
column to be sorted without the need for any auxiliary reconstruction data, we choose
to fully sort only one column and partially sort the rest of the columns. We call the first
sorted column the anchor column. Note that after sorting, given our storage architecture,
each segment can still be processed independently.
The main purpose of the anchor column selection algorithm is to choose the ordering
that facilitates the best compression and fast data access. Network flow data exhibits strong
correlation between several attributes, and we exploit this characteristic by keeping the
strongly correlated columns in consecutive sorting order as much as possible, for better
compression results. Additionally, based on the data access pattern of previous queries, columns
are arranged by taking into account the probability of each column being accessed by future
queries. The columns with higher probabilities are arranged at the beginning of the sorting
order. As such, we maintain the counting probabilities associated with each of the columns,
given by P(ci) = ai/t, where ci is the i-th column, ai is the number of queries that
accessed ci and t is the total number of queries.
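The counting probabilities can be maintained with a simple counter per column; the sketch below (an illustration under our own assumptions, ignoring the correlation-based part of the ordering) records which columns each query touches and orders the columns by decreasing access probability P(ci) = ai/t:

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.stream.Collectors;

    /** Sketch of per-column access counting used to guide the column ordering. */
    public class ColumnAccessStats {
        private final Map<String, Long> accessCounts = new HashMap<>();
        private long totalQueries = 0;

        /** Record one query and the set of columns it touched. */
        public void recordQuery(List<String> accessedColumns) {
            totalQueries++;
            for (String c : accessedColumns) {
                accessCounts.merge(c, 1L, Long::sum);
            }
        }

        /** P(c) = a_c / t; zero if no queries have been recorded yet. */
        public double probability(String column) {
            if (totalQueries == 0) return 0.0;
            return accessCounts.getOrDefault(column, 0L) / (double) totalQueries;
        }

        /** Columns ordered by decreasing access probability (candidate sort order). */
        public List<String> columnsByProbability() {
            return accessCounts.keySet().stream()
                    .sorted((a, b) -> Double.compare(probability(b), probability(a)))
                    .collect(Collectors.toList());
        }

        public static void main(String[] args) {
            ColumnAccessStats stats = new ColumnAccessStats();
            stats.recordQuery(List.of("sourceIP", "destIP", "startTime"));
            stats.recordQuery(List.of("sourceIP", "destPort"));
            System.out.println(stats.columnsByProbability());
        }
    }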
Segments
Each column is further partitioned into fixed sets of values called segments. Segment
partitioning enables physical storage and processing at a smaller granularity than simple
column-based partitioning. These design decisions provide more flexibility for compression
strategies and data access. At query time only the needed segments are read from disk and
processed, based on the information collected from the segments’ metadata structures, called
index nodes. Each segment has an associated unique identifier called the segment ID. For each
column, the segment ID is an auto-incremented number, starting at the installation of
the system. The segment sizes depend on the hardware configuration and can be set
in such a way as to use most of the available main memory.
For better control over the data structures used, the segments have the same number of
values across all the columns. In this way there is no need to store a record ID for each
value of a segment, which is one major difference compared to some existing column
stores [30]. As we will show in Section 3.4, the performance of the system is related to the
segment size used. The larger the segment size, the better the compression performance
and query processing times. However, we notice that the record insertion speed decreases as
the segment size increases, so there is a trade-off between the desired query performance
and the needed insertion speed. Most of the columns store segments in compressed format,
and in a later section we present the compression algorithms used. Column segmentation
design is an important difference compared to traditional row-oriented systems that process
data one tuple at a time, whereas NetStore processes data one segment at a time, which translates
to many tuples at a time. Figure 3.4 shows the processing steps for the three processing
phases: buffering, segmenting and query processing.
Figure 3.4: NetStore processing phases: buffering, segmenting and query processing.
Column Index
For each column we store the metadata associated with each of the segments in an index
node corresponding to the segment. The set of all index nodes for the segments of a column
represents the column index. The information in each index node includes statistics about
the data and different features that are used in the decision about which compression method to
use and how to best access the data, as well as the time interval associated with the segment, in
the format [min start time, max end time]. Figure 3.5 presents an intuitive representation
of the columns, segments and index for each column. Each column index is implemented
using a time interval tree. Every query is relative to a time window T. At query time,
the index of every column accessed is looked up and only the segments whose time
interval overlaps window T are considered for processing. In the next step, the statistics
on segment values are checked to decide whether the segment should be loaded into memory and
decompressed. This two-phase index processing helps filter out unused data early in
query processing, similar to what is done in [52]. Note that the index nodes do not hold data
values, but statistics about the segments such as the minimum and maximum values,
the time interval of the segment, the compression method used, the number of distinct
values, etc. Therefore, index usage adds negligible storage and processing overhead.
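The sketch below illustrates this two-phase filtering under assumed field names; the real column index is kept in a time interval tree, whereas the sketch simply scans a list of index nodes and keeps only the segments whose time interval overlaps the query window and whose [min, max] statistics could contain the requested value:

    import java.util.ArrayList;
    import java.util.List;

    /** Sketch of two-phase segment filtering using per-segment index nodes. */
    public class ColumnIndexSketch {

        static class IndexNode {
            final int segmentId;
            final long minStartTime, maxEndTime;   // time interval of the segment
            final long minValue, maxValue;         // value statistics for the column

            IndexNode(int segmentId, long minStartTime, long maxEndTime,
                      long minValue, long maxValue) {
                this.segmentId = segmentId;
                this.minStartTime = minStartTime;
                this.maxEndTime = maxEndTime;
                this.minValue = minValue;
                this.maxValue = maxValue;
            }
        }

        /** Phase 1: keep segments whose time interval overlaps [t1, t2].
         *  Phase 2: of those, keep segments whose [min, max] range can contain v. */
        static List<Integer> candidateSegments(List<IndexNode> index,
                                               long t1, long t2, long v) {
            List<Integer> result = new ArrayList<>();
            for (IndexNode node : index) {
                boolean timeOverlap = node.minStartTime <= t2 && node.maxEndTime >= t1;
                boolean valuePossible = node.minValue <= v && v <= node.maxValue;
                if (timeOverlap && valuePossible) {
                    result.add(node.segmentId);
                }
            }
            return result;
        }

        public static void main(String[] args) {
            List<IndexNode> idx = List.of(
                    new IndexNode(0, 1000, 2000, 10, 500),
                    new IndexNode(1, 2000, 3000, 200, 900));
            System.out.println(candidateSegments(idx, 1500, 2500, 250)); // [0, 1]
        }
    }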
From the list of initial queries we observed that the column for the source IP attribute
is the most frequently accessed. Therefore, we chose this column as our first sorted anchor
column and use it as a clustered index for each source IP segment. However, for workloads
where the predominant query types are spot queries targeting a specific column other than
the anchor column, the use of indexes for values inside the column segments is beneficial,
at the cost of increased storage and a slowdown in insertion rate. This situation can be
acceptable for slow networks where the insertion rate requirements are not too high. When
the insertion rate is high, it is best not to use any such index but to rely on the metadata
from the index nodes.
Internal IPs Index
Besides the column index, NetStore maintains another indexing data structure for the
network-internal IP addresses, called the Internal IPs index. Essentially, the IPs index is an
inverted index for the internal IPs. That is, for each internal IP address the index stores
in a list the absolute positions where the IP address occurs in the column, sourceIP or
destIP, as if the column were not partitioned into segments. Figure 3.6 shows an intuitive
representation of the IPs index. For each internal IP address the positions list is
an array of increasing integer values that are compressed and stored on disk on a daily
basis. Because IP addresses tend to occur at consecutive positions in a column, we chose to
compress the positions list by applying run-length encoding to the differences between adjacent
values.
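As an illustration, the sketch below encodes a sorted positions list as (delta, run-length) pairs and decodes it back; the exact on-disk encoding used by NetStore may differ, so this is only a sketch of the delta-plus-RLE idea described above:

    import java.util.ArrayList;
    import java.util.List;

    /** Sketch of positions-list compression: delta encoding followed by RLE. */
    public class PositionsListCodec {

        /** Encodes a sorted positions list as (delta, runLength) pairs. */
        static List<long[]> encode(List<Long> positions) {
            List<long[]> encoded = new ArrayList<>();
            long prev = 0;
            long currentDelta = Long.MIN_VALUE;
            long run = 0;
            for (long p : positions) {
                long delta = p - prev;             // difference to the previous position
                prev = p;
                if (delta == currentDelta) {
                    run++;                         // extend the current run of equal deltas
                } else {
                    if (run > 0) encoded.add(new long[]{currentDelta, run});
                    currentDelta = delta;
                    run = 1;
                }
            }
            if (run > 0) encoded.add(new long[]{currentDelta, run});
            return encoded;
        }

        /** Restores the original positions list. */
        static List<Long> decode(List<long[]> encoded) {
            List<Long> positions = new ArrayList<>();
            long prev = 0;
            for (long[] pair : encoded) {
                for (long i = 0; i < pair[1]; i++) {
                    prev += pair[0];
                    positions.add(prev);
                }
            }
            return positions;
        }

        public static void main(String[] args) {
            List<Long> positions = List.of(100L, 101L, 102L, 103L, 250L, 251L);
            List<long[]> enc = encode(positions);            // (100,1) (1,3) (147,1) (1,1)
            System.out.println(decode(enc).equals(positions)); // true
        }
    }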
Figure 3.5: Schematic representation of columns, segments, index nodes and column indexes.
3.3.3 Compression
Each of the segments in NetStore is compressed independently. We observed that segments
within a column did not have the same distribution, due to the temporal variation of
network activity across working hours, days, nights, weekends, breaks, etc. Hence segments of the
same column were often best compressed using different methods. We explored different compression
methods: methods that allow data processing in compressed format
and do not need decompression of all the segment values if only one value is requested,
as well as methods that provide fast decompression and a reasonable compression
ratio and speed. The decision on which compression algorithm to use is made automatically
for each segment, based on data features of the segment such as the data type, the
number of distinct values, the range of the values and the number of switches between adjacent
values. We tested a wide range of compression methods, including some we designed for
this purpose and some currently used by similar systems [1, 30, 55, 60], with variations where
needed. Below we list the techniques that emerged as effective based on our experimentation:
Figure 3.6: Intuitive representation of the IPs inverted index.
• Run-Length Encoding (RLE): is used for segments that have few distinct repetitive values. If value v appears consecutively r times, and r > 1, we compress it as
the pair (v, r). It provides fast compression as well as the ability to process data in
compressed format.
• Variable Byte Encoding: is a byte-oriented encoding method used for positive
integers. It uses a variable number of bytes to encode each integer value as follows: if
value < 128 use one byte (set the highest bit to 0), for value < 128 × 128 use 2 bytes (the first
byte has the highest bit set to 1 and the second to 0), and so on (a short sketch of this
encoding follows this list). This method can be used in
conjunction with RLE for both values and runs. It provides a reasonable compression
ratio and good decompression speed, allowing the decompression of only the requested
value without the need to decompress the whole segment.
• Dictionary Encoding: is used for columns with few distinct values, sometimes
before RLE is applied (e.g. to encode the “protocol” attribute).
• Frame Of Reference: considers the interval bounded by the minimum and maximum values as the frame of reference for the values to be compressed [20]. We use it
to compress non-empty timestamp attributes within a segment (e.g. start time, end
time, etc.) that are integer values representing the number of seconds since the epoch.
Typically the time difference between the minimum and maximum timestamp values in
a segment is less than a few hours, therefore the difference can be encoded
using 2-byte short values instead of 4-byte integers. It allows processing data
in compressed format by decompressing each timestamp value individually, without
the need to decompress the whole segment.
• Generic Compression: we use the DEFLATE algorithm from the zlib library, which
is a variation of LZ77 [59]. This method provides compression at the binary level,
and does not allow values to be individually accessed unless the whole segment is
decompressed. It is chosen if it enables faster data insertion and access than the
value-based methods presented earlier.
• No Compression: is listed as a compression method since it will represent the base
case for our compression selection algorithm.
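As an example of the value-based methods above, the following sketch implements the byte-oriented variable-length encoding: values below 128 take one byte, values below 128 × 128 take two bytes, and so on, with the high bit of every byte except the last set to 1. The exact byte layout used by NetStore is not specified here, so this is our own illustrative variant:

    import java.io.ByteArrayOutputStream;

    /** Sketch of variable byte encoding for non-negative integers. */
    public class VariableByteCodec {

        /** Appends the encoding of a non-negative value to the output stream. */
        static void encode(long value, ByteArrayOutputStream out) {
            // Collect 7-bit groups from least significant to most significant.
            byte[] tmp = new byte[10];
            int n = 0;
            do {
                tmp[n++] = (byte) (value & 0x7f);
                value >>>= 7;
            } while (value != 0);
            // Emit the most significant group first; set the high bit on all but the last byte.
            for (int i = n - 1; i >= 0; i--) {
                int b = tmp[i] & 0x7f;
                if (i > 0) b |= 0x80;
                out.write(b);
            }
        }

        /** Decodes a single value starting at offset; returns {value, nextOffset}. */
        static long[] decode(byte[] data, int offset) {
            long value = 0;
            int i = offset;
            while (true) {
                int b = data[i++] & 0xff;
                value = (value << 7) | (b & 0x7f);
                if ((b & 0x80) == 0) break;        // last byte of this value
            }
            return new long[]{value, i};
        }

        public static void main(String[] args) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            encode(127, out);   // fits in 1 byte
            encode(300, out);   // needs 2 bytes
            byte[] bytes = out.toByteArray();
            long[] first = decode(bytes, 0);
            long[] second = decode(bytes, (int) first[1]);
            System.out.println(first[0] + " " + second[0]);   // 127 300
        }
    }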
Compression Methods Selection
The selection of a compression method is based on the statistics collected in one
pass over the data of each segment. As mentioned earlier, the two major requirements
of our system are to keep record insertion rates high and to provide fast data access.
Data compression does not always provide better insertion and better query performance
compared to “No Compression”, so we developed a model to decide when
compression is suitable and, if so, which method to choose. Essentially, we compute a score
for each candidate compression method and select the one that has the best score. More
formally, we assume we have k + 1 compression methods m0, m1, ..., mk, with m0 being the
“No Compression” method. We then compute the insertion time, as the time to compress
and write to disk, and the access time, as the time to read from disk and decompress, as functions of each
compression method. For value-based compression methods, we estimate the compression,
write, read and decompression times based on the statistics collected for each segment. For
the generic compression we estimate the parameters based on the average results obtained
when processing sample segments. For each segment we evaluate:
insertion(mi) = c(mi) + w(mi),    i = 1, ..., k
access(mi) = r(mi) + d(mi),    i = 1, ..., k
As the base case for each method’s evaluation we consider the “No Compression” method.
We take I0 to represent the time to insert an uncompressed segment, which reduces to
the writing time since no time is spent on compression, and, similarly, A0 to
represent the time to access the segment, which reduces to the time to read the
segment from disk since there is no decompression. Formally, following the above equations
we have:
insertion(m0) = w(m0) = I0
access(m0) = r(m0) = A0
We then choose the candidate compression methods mi only if we have both:
insertion(mi) < I0
access(mi) < A0
Next, among the candidate compression methods we choose the one that provides the
lowest access time. Note that we consider the access time, not the insertion time, as the
main differentiating factor. Disk reads are the most frequent and time-consuming
operations, and on commodity hard drives a disk read is many times slower than a disk write
of a file of the same size. Additionally, insertion time can be improved by bulk loading or by other means
that take into account that the network traffic rate is not steady and varies greatly over
time, whereas the access mechanism should provide the same level of performance at all
times.
The model presented above does not take into account whether the data can be processed
in compressed format; it assumes that decompression is necessary at all times.
However, for a more accurate compression method selection we should include in the access time equation the probability of a query processing the data in compressed format. Since
forensic and monitoring queries are usually predictable, we can assume, without affecting
the generality of our system, that we have a total number of t queries, each query qj having
probability of occurrence pj, with Σ_{j=1}^{t} pj = 1. We consider the probability of a segment
s being processed in compressed format as the probability of occurrence of the queries that
process the segment in compressed format. Let CF be the set of all the queries that process
s in compressed format; we then get:

P(s) = Σ_{qj ∈ CF} pj,    where CF = {qj | qj processes s in compressed format}

Now, a more accurate access time equation can be rewritten taking into account the possibility of not decompressing the segment for each access:

access(mi) = r(mi) + d(mi) · (1 − P(s)),    i = 1, ..., k    (3.1)
Note that the compression selection model can accommodate any compression method, not only
the ones mentioned in this chapter, and is also valid in the case when the probability of
processing the data in compressed format is 0.
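The selection procedure can be summarized by the short sketch below (our own illustration, with made-up cost numbers): among the methods whose estimated insertion and access times both beat the uncompressed baseline I0 and A0, it picks the one with the lowest access time, discounting decompression by the probability P(s) from Equation 3.1:

    import java.util.List;

    /** Sketch of the compression-method selection model. */
    public class CompressionSelector {

        static class MethodEstimate {
            final String name;
            final double compress, write, read, decompress;   // estimated times

            MethodEstimate(String name, double compress, double write,
                           double read, double decompress) {
                this.name = name;
                this.compress = compress;
                this.write = write;
                this.read = read;
                this.decompress = decompress;
            }

            double insertion()        { return compress + write; }
            double access(double pS)  { return read + decompress * (1.0 - pS); }
        }

        /** Returns the chosen method name, or "NoCompression" if no candidate qualifies. */
        static String select(List<MethodEstimate> methods, double i0, double a0, double pS) {
            String best = "NoCompression";
            double bestAccess = a0;
            for (MethodEstimate m : methods) {
                // Candidate only if it beats the uncompressed baseline on both criteria.
                if (m.insertion() < i0 && m.access(pS) < a0 && m.access(pS) < bestAccess) {
                    best = m.name;
                    bestAccess = m.access(pS);
                }
            }
            return best;
        }

        public static void main(String[] args) {
            List<MethodEstimate> methods = List.of(
                    new MethodEstimate("RLE",          0.5, 2.0, 3.0, 0.5),
                    new MethodEstimate("VariableByte", 0.8, 2.5, 3.5, 1.0),
                    new MethodEstimate("Deflate",      3.0, 1.5, 2.5, 4.0));
            // Baseline: uncompressed write 6.0, uncompressed read 9.0 (larger on disk).
            System.out.println(select(methods, 6.0, 9.0, 0.3));   // RLE
        }
    }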
3.3.4 Query Processing
Figure 3.4 illustrates the NetStore data flow, from network flow record insertion to the
query result output. Data is written only once in bulk, and read many times for processing.
NetStore does not support transaction processing queries such as record updates or deletes;
it is suitable for analytical queries in general and for network forensics and monitoring queries
in particular.
Data Insertion
Network data is processed in several phases before being delivered to permanent storage. First, raw flow data is collected from the network sensors and is then preprocessed.
Preprocessing includes the buffering and segmenting phases. Each flow is identified by a
flowID represented by the 5-tuple [sourceIP, sourcePort, destIP, destPort, protocol]. In the
buffering phase, raw network flow information is collected until the buffer is filled. The flow
records in the buffer are aggregated and then sorted. As mentioned in Section 3.3.3, the
purpose of sorting is twofold: better compression and faster data access. All the columns
are sorted following the sorting order determined based on access probabilities and the correlation
between columns, using the first sorted column as the anchor. In the segmenting phase,
all the columns are partitioned into segments; that is, once the number of flow records
reaches the buffer capacity, the column data in the buffer is considered a full segment and
is processed. Each of the segments is then compressed using the appropriate compression
method based on the data it carries. The information about the compression method used
and statistics about the data are collected and stored in the index node associated with the
segment. Note that once the segments are created, the statistics collection and compression
of each segment is done independently of the rest of the segments in the same column or in
other columns. By doing so, the system takes advantage of the increasing number of cores
in a machine and provides good record insertion rates in multi-threaded environments.
After preprocessing, all the data is sent to permanent storage. As monitoring queries
tend to access the most recent data, some data is also kept in memory for a predefined
length of time. NetStore uses a small active window of size W, and all requests from
queries accessing data in the time interval [NOW - W, NOW] are served from memory,
where NOW represents the actual time of the query.
Query Execution
For flexibility, NetStore supports a limited SQL syntax and implements a basic set of
segment operators related to the query types presented in Section 3.3.1. Each SQL query
statement is translated into a statement in terms of this basic set of segment operators.
Below we briefly present each general operator:
• filter_segs(d1, d2): Returns the set with the segment IDs of the segments that overlap
with the time interval [d1, d2]. This operator is used by all queries.
• filter_atts(segIDs, pred1(att1), ..., predk(attk)): Returns the list of pairs (segID, pos_list),
where pos_list represents the intersection of the attribute position lists in the corresponding segment with id segID, for which the attribute atti satisfies the predicate predi,
with i = 1, ..., k.
• aggregate(segIDs, pred1(att1), ..., predk(attk)): Returns the result of aggregating
the values of attribute attk by attk−1 by ... by att1 that satisfy their corresponding predicates
predk, ..., pred1 in segments with ids in segIDs. The aggregation can be summation,
counting, min or max.
The queries considered in Section 3.4.2 can all be expressed in terms of the above
operators. For example, the query “What is the number of unique hosts that each of the
hosts in the network contacted in the interval [d1, d2]?” can be expressed as follows:

aggregate(filter_segs(d1, d2), sourceIP = 128.238.0.0/16, destIP).

After the operator filter_segs is applied, only the sourceIP and destIP segments that
overlap with the time interval [d1, d2] are considered for processing and their corresponding
index nodes are read from disk. Since this is a range aggregation query, all the considered
segments will be loaded and processed. If we consider the query “What is the number of
unique hosts that host X contacted in the interval [d1, d2]?”, it can be expressed as follows:

aggregate(filter_segs(d1, d2), sourceIP = X, destIP).
For this query the number of relevant segments can be reduced even further by discarding
the ones that do not overlap with the time interval [d1, d2], as well as the ones that do not
hold the value X for sourceIP, by checking the corresponding index node statistics. If the value
X represents the IP address of an internal node, then the internal IPs index is used to
retrieve all the positions where the value X occurs in the sourceIP column. Then a count
operation is performed over all the unique destIP addresses corresponding to those positions.
Note that by using the internal IPs index, the data of the sourceIP column is not touched. The
only information loaded into memory is the positions list of IP X as well as the segments in
the destIP column that correspond to those positions.
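The sketch below illustrates this access path under simplified assumptions (segments are passed in already decompressed, and time filtering is omitted): the positions list of X determines which destIP segments and which entries inside them are touched, and the unique destination IPs are counted without ever reading the sourceIP column:

    import java.util.HashSet;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    /** Sketch of a spot aggregation answered through the internal IPs index. */
    public class SpotAggregationSketch {

        static long countUniqueContacts(List<Long> positionsOfX,
                                        Map<Long, int[]> destIpSegments,  // segID -> decompressed values
                                        int segmentSize) {
            Set<Integer> uniqueDestIPs = new HashSet<>();
            for (long pos : positionsOfX) {
                long segId = pos / segmentSize;          // which destIP segment holds this row
                int offset = (int) (pos % segmentSize);  // row position inside that segment
                int[] segment = destIpSegments.get(segId);
                if (segment != null) {
                    uniqueDestIPs.add(segment[offset]);
                }
            }
            return uniqueDestIPs.size();
        }

        public static void main(String[] args) {
            // Two toy destIP segments of size 4, destination IPs stored as ints.
            Map<Long, int[]> segments = Map.of(
                    0L, new int[]{101, 102, 103, 101},
                    1L, new int[]{104, 102, 105, 106});
            List<Long> positionsOfX = List.of(0L, 3L, 5L);   // rows where sourceIP == X
            System.out.println(countUniqueContacts(positionsOfX, segments, 4));  // 2
        }
    }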
3.4 Evaluation
In this section we present an evaluation of NetStore. We designed and implemented
NetStore using the Java programming language on the FreeBSD 7.2-RELEASE platform. For
all the experiments we used a single machine with 6 GB of DDR2 RAM, two quad-core 2.3
GHz CPUs, and a 1 TB SATA-300 7200 rpm disk with a 32 MB buffer in a RAID-Z configuration. We
consider this machine representative of what a medium-scale enterprise would use as a storage
server for network flow records.
For the experiments we used network flow data captured over a 24-hour period of one
weekday at our campus border router. The raw text data was about 8 GB in size and contained
62,397,593 network flow records. For our experiments we considered only 12 attributes
for each network flow record, that is, only the ones that were meaningful for the queries
presented in this chapter. Table 3.1 shows the attributes used, as well as the type and
size of each attribute. We compared NetStore’s performance with two open source
RDBMSs: a row-store, PostgreSQL [42], and a column-store, LucidDB [30].
We chose PostgreSQL over other open source systems because we intended to follow the
example in [18], which uses it for similar tasks. Additionally, we intended to make use of the
partial index support for internal IPs, which other systems don’t offer, in order to compare
against the performance of our inverted IPs index. We chose LucidDB as the column-store to
compare with as it is, to the best of our knowledge, the only stable open source column-store
that yields good performance for disk-resident data and provides reasonable insertion
speed. We chose only data captured over one day, with a size slightly larger than the available
memory, because we wanted to maintain reasonable running times for the other systems
that we compared NetStore to. These systems become very slow for larger data sets, and
the performance gap compared to NetStore increases with the size of the data.
3.4.1 Parameters
Figure 3.7 shows the influence of the segment size on the insertion rate. We observe
that the insertion rate drops as the segment size increases. This trend is expected and is
caused by the delay in the preprocessing phase, mostly due to the sorting of larger segment
arrays.
Column        Type    Bytes
sourceIP      int     4
destIP        int     4
sourcePort    short   2
destPort      short   2
protocol      byte    1
startTime     short   2
endTime       short   2
tcpSyns       byte    1
tcpAcks       byte    1
tcpFins       byte    1
tcpRsts       byte    1
numBytes      int     4

Table 3.1: NetStore flow attributes.
Property                          Value           Unit
records insertion rate            10,000          records/second
number of records                 62,397,594      records
number of bytes transported       1.17            Terabytes
bytes transported per record      20,616.64       Bytes/record
bits rate supported               1.54            Gbit/s
number of packets transported     2,028,392,356   packets
packets transported per record    32.51           packets/record
packets rate supported            325,075.41      packets/second

Table 3.2: NetStore properties and network rates supported based on 24 hour flow records
data and the 12 attributes.
Figure 3.8 shows that the segment size also affects the compression ratio of each segment:
the larger the segment size, the larger the compression ratio achieved. However, a high compression
ratio is not a critical requirement. The size of the segments is more critically related to the
available memory, the desired insertion rate for the network and the number of attributes
used for each record. We set the insertion rate goal at 10,000 records/second, and for this
goal we set a segment size of 2 million records, given the above hardware specification and
record sizes.
Table 3.2 shows the insertion performance of NetStore. The numbers presented are computed based on the average bytes per record and the average packets per record, given the insertion
rate of 10,000 records/second. When installed on a machine with the above specification,
NetStore can keep up with traffic rates of up to 1.5 Gbit/s with the current experimental
implementation. For a constant memory size, this rate decreases with the increase in segment
size and the increase in the number of attributes for each flow record.
Figure 3.7: Insertion rate for different segment sizes.
3.4.2 Queries
Having described the NetStore architecture and its design details, in this section we
consider the queries described in [17], but taking into account data collected over the 24
hours for the internal network 128.238.0.0/16. We consider both the queries and the methodology
in [17] representative of how an investigator would perform security analysis on network flow
data. We assume all the flow attributes used are inserted into a table flow, and we use
standard SQL to describe all our examples.
Scanning
A scanning attack refers to the activity of sending a large number of TCP SYN packets to
a wide range of IP addresses. Based on the received answers the attacker can determine whether a
Figure 3.8: Compression ratio with and without aggregation.
particular vulnerable service is running on the victim’s host. As such, we want to identify
any TCP SYN scanning activity initiated by an external host, with no TCP ACK or TCP
FIN flags set, targeted against a number of internal IP destinations larger than a
preset limit. We use the following range aggregation query (Q1):
SELECT sourceIP, destPort, count(distinct destIP), startTime
FROM flow
WHERE sourceIP <> 128.238.0.0/16 AND destIP = 128.238.0.0/16
AND protocol = tcp AND tcpSyns = 1 AND tcpAcks = 0 AND tcpFins = 0
GROUP BY sourceIP
HAVING count(distinct destIP) > limit;
External IP address 61.139.105.163 was found scanning starting at time t1 . We check if
there were any valid responses after time t1 from the internal hosts, where no packet had
the TCP RST flag set, and we use the following query (Q2):
SELECT sourceIP, sourcePort, destIP
FROM flow
WHERE startTime > t1 AND sourceIP = 128.238.0.0/16
AND destIP = 61.139.105.163 AND protocol = tcp AND tcpRsts = 0;
Worm Infected Hosts
An internal host with the IP address 128.238.1.100 was discovered to have responded
to a scan initiated by a host infected with the Conficker worm, and we want to check whether
the internal host is compromised. Typically, after a host is infected, the worm copies itself
into memory and begins propagating to random IP addresses across the network by exploiting
the same vulnerability. The worm opens a random port and starts scanning random IPs on
port 445. We use the following query to check the internal host (Q3):
SELECT sourceIP, destPort, count(distinct destIP)
FROM flow
WHERE startTime > t1 AND sourceIP = 128.238.1.100 AND destPort = 445;
SYN Flooding
SYN flooding is a network-based denial-of-service attack in which the attacker sends an
unusually large number of SYN requests, above a threshold t, to a specific target within a
small time window W. To detect such an attack we filter all the incoming traffic and count,
for each internal host, the number of flows with the TCP SYN bit set and no TCP ACK or
TCP FIN. We use the following query (Q4):
SELECT destIP, count(distinct sourceIP), startTime
FROM flow
WHERE startTime > ’NOW - W’ AND destIP = 128.238.0.0/16
AND protocol = tcp AND tcpSyns = 1 AND tcpAcks = 0 AND tcpFins = 0
GROUP BY destIP
HAVING count(distinct sourceIP) > t;
Network Statistics
Besides security analysis, network statistics and performance monitoring are another important use of network flow data. To get this information we use aggregation queries
over all the collected data for a large time window, both incoming and outgoing. The aggregation
operation can be a summation of the number of bytes or packets, a count of the unique hosts contacted,
or some other meaningful aggregate statistic. For example, we use the following simple
aggregation query to find the number of bytes transported in the last 24 hours (Q5):
SELECT sum(numBytes)
FROM flow
WHERE startTime > ’NOW - 24h’;
General Queries
The sample queries described above are complex and belong to more than one of the basic
types described in Section 3.3.1. However, each of them can be decomposed into several basic
queries such that the result of one query becomes the input of the next one. We built a more
general set of queries starting from the ones described above by varying the parameters in
such a way as to achieve different levels of data selectivity, from low to high. Then, for each
type, we report the average performance over all the queries of that type.
Figure 3.9 shows the average running times of the selected queries for increasing segment
sizes. We observe that for S type queries that do not use the IPs index (e.g., queries on attributes
other than the internal sourceIP or destIP), the performance decreases when the segment size
increases. This is an expected result, since for larger segments more unused data is
loaded as part of the segment in which the spotted value resides. When the IPs index is used,
the performance benefit comes from skipping the irrelevant segments, whose positions are
not found in the positions list. For busy internal servers that have corresponding
flow records in all the segments, all the corresponding attribute segments have to be read,
but not the IP segments themselves. This is still an advantage, since an IP segment is in general
several times larger than the other attribute segments. Hence, except for spot queries that use
non-indexed attributes, queries tend to be faster for larger segment sizes.
Figure 3.9: Average query times for different segment sizes and different query types.
3.4.3
Compression
Our goal in using compression is not to achieve the best compression ratio, nor the best
compression or decompression speed, but to obtain the highest record insertion rate and
the best query performance. We evaluated our compression selection model by comparing the
performance when using a single method for all the segments in a column with the
performance when using the compression selection algorithm for each segment. To select
the method for a column, we first compressed all the segments of the column with each of the six
methods presented. We then measured the access performance for each column compressed
with each method. Finally, we selected as the compression method of a column the method
that provided the best access times for the majority of its segments.
For the variable segment compression, we activated the method selection mechanism
for all columns and then inserted the data, compressing each segment based on the
statistics of its own data rather than those of the entire column. In both cases we did not change
anything in the statistics collection process, since all the statistics were used in the query
process for both approaches. We obtained on average a 10 to 15 percent improvement per
query using the segment-based compression method selection model, with no penalty for
the insertion rate. We consider the overall performance of the compression method
selection model satisfactory, and its true value resides in the framework implementation,
which is limited only by the individual methods used, not by the general model design. If
the data changes and other compression methods become more efficient for the new data, only
the compression algorithms and the operators that work on the compressed data need to be
changed, with the overall architecture remaining the same.
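The per-segment selection step can be sketched as follows. This is a minimal illustration only: the class names, the candidate methods and the thresholds are assumptions made for exposition and do not reproduce the actual NetStore heuristics, which choose among the six implemented methods using the measured access performance and the segment statistics described above.

// Hypothetical sketch of per-segment compression method selection.
// Method names and the scoring rules are illustrative, not NetStore's actual code.
enum Method { RUN_LENGTH, DICTIONARY, FRAME_OF_REFERENCE, VARIABLE_BYTE }

final class SegmentStats {
    final int count;          // number of values in the segment
    final int distinct;       // number of distinct values
    final long min, max;      // value range
    SegmentStats(int count, int distinct, long min, long max) {
        this.count = count; this.distinct = distinct; this.min = min; this.max = max;
    }
}

final class CompressionSelector {
    // Pick a method from the statistics kept in the segment's index node.
    static Method select(SegmentStats s) {
        if (s.distinct == 1) return Method.RUN_LENGTH;              // constant column segment
        if (s.distinct <= s.count / 100) return Method.DICTIONARY;  // few distinct values
        long range = s.max - s.min;
        if (range > 0 && range < (1L << 16)) return Method.FRAME_OF_REFERENCE; // narrow value range
        return Method.VARIABLE_BYTE;                                 // default fallback
    }

    public static void main(String[] args) {
        System.out.println(select(new SegmentStats(2_000_000, 12, 0, 65_535)));               // DICTIONARY
        System.out.println(select(new SegmentStats(2_000_000, 1_500_000, 0, 4_294_967_295L))); // VARIABLE_BYTE
    }
}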
Some commercial systems [58] apply on top of the value-based compressed columns
another layer of general binary compression for increased performance. We investigated
the same possibility and compared four different approaches to compression on top of the
implemented column oriented architecture: no compression, value-based compression only,
binary compression only, and value-based plus binary compression on top of that. For the
no compression case, we processed the data using the same indexing structure and column
oriented layout but with compression disabled for all the segments. For binary
compression only, we compressed each segment using the generic binary compression. In the
case of value-based compression, we compressed all the segments with the dynamic selection
mechanism enabled, and for the last approach we applied another layer of generic compression
on top of the already value-based compressed segments.
The results of our experiment for the four cases are shown in Figure 3.10. We can see that
compression is a determining factor for the performance metrics. Using value-based compression
achieves the best average running time for the queries, while the uncompressed segments
scenario yields the worst performance. We also see that adding another compression layer
helps neither query performance nor the insertion rate, even though it provides a better
compression ratio. However, the general compression method can be used for data aging,
to compress and archive older data that is no longer actively used.

Figure 3.10: Average query times for the compression strategies implemented.
Figure 3.8 shows the compression performance for different segment sizes and how flow
aggregation affects the storage footprint. As expected, compression performance is better for
larger segment sizes in both cases, with and without aggregation. This is a consequence
of the compression methods used: the larger the segment, the longer the runs for columns
with few distinct values and the smaller the relative dictionary overhead for each segment. The overall
compression ratio of raw network flow data for a segment size of 2 million records is 4.5
with no aggregation and 8.4 with aggregation enabled. Note that the size of the compressed
data also includes the size of both indexing structures: the column indexes and the IPs index.
3.4.4
Comparison With Other Systems
For comparison we used the same data and performed system-specific tuning of each
system's parameters. To maintain the insertion rate above our target of 10,000
records/second we created three indexes for each of PostgreSQL and LucidDB: one clustered
index on startTime and two un-clustered indexes, one on the sourceIP and one on the destIP
attribute. Although we believe we chose good values for the other tuning parameters, we
cannot guarantee they are optimal and we only report the performance we observed. We
show the performance using the data and the example queries presented in Section 3.4.2.
                     Q1      Q2     Q3     Q4      Q5     Storage
Postgres/NetStore   10.98    7.98   2.21   15.46   1.67   93.6
LucidDB/NetStore     5.14    1.10   2.25    2.58   1.53    6.04

Table 3.3: Relative performance of NetStore versus columns only PostgreSQL and LucidDB
for query running times and total storage needed.
Table 3.3 shows the relative performance of NetStore compared to PostgreSQL for the
same data. Since our main goal is to improve disk resident data access, we ran each query
only once on each system to minimize the use of cached data. The numbers presented show how
many times better NetStore performs.
To maintain a fair overall comparison we created a PostgreSQL table for each column of
NetStore. As mentioned in [2], row-stores with a columnar design provide better performance
for queries that access a small number of columns, such as the sample queries in Section 3.4.2.
We observe that NetStore clearly outperforms PostgreSQL for all the query types, providing
the best results for queries accessing more attributes (e.g. Q1 and Q4), even though PostgreSQL
uses about 90 times more disk space including all the auxiliary data. The poor PostgreSQL
performance can be explained by the absence of more clustered indexes, the lack of compression,
and the unnecessary tuple overhead.
Table 3.3 also shows the relative performance compared to LucidDB. We observe that the
performance gap is not of the same order of magnitude as that of PostgreSQL, even
when more attributes are accessed. Nevertheless, NetStore clearly performs better while storing
about 6 times less data. The performance penalty of LucidDB can be explained by the lack
of a column segmentation design and by the early materialization in the processing phase specific
to general-purpose column stores. However, we noticed that LucidDB achieves a significant
performance improvement for subsequent runs of the same query by efficiently using
memory resident data.
3.5
Conclusion
With the growth of network traffic, there is an increasing demand for solutions to better
manage and take advantage of the wealth of network flow information recorded for monitoring and forensic investigations. The problem is no longer the availability or the storage
capacity of the data, but the ability to quickly extract the relevant information about potential malicious activities that can affect network security and resources. In this chapter
we have presented the design, implementation and evaluation of a novel working architecture, called NetStore, that is useful for network monitoring tasks and assists in network
forensics investigations.
The simple column oriented design of NetStore helps reduce query processing time
by spending less time on disk I/O and loading only the needed data. The column partitioning
facilitates the use of efficient compression methods for network flow attributes that allow
data processing in compressed format, thereby boosting query runtime performance. NetStore clearly outperforms existing row-based DBMS systems and provides better results
than general purpose column oriented systems because of simple design decisions tailored to network flow records. Experiments show that NetStore can provide more than
ten times faster query response compared to other storage systems while maintaining a much
smaller storage size. In future work we seek to explore the use of NetStore for new types of
time sequential data, such as host logs, and the possibility of releasing it as an open
source system.
Having described the design and implementation of the column oriented storage infrastructure and the general processing engine, in the next chapter we go a step
further towards our goal of improving the runtime performance of queries on network flow
data. As such, we analyze the characteristics of monitoring and forensic query workloads,
define two general types of queries based on their complexity, and present the querying
models and optimization methods for both simple and complex query types.
Chapter 4
A Querying Framework For Network
Monitoring and Forensics
4.1
Introduction
Forensics and monitoring queries on historical network flow data are becoming increasingly complex [32]. In general these queries are composed of many simple filtering and
aggregation queries, executed sequentially or in batches, that process large amounts of network flow data spanning long periods of time.
Running complex queries faster increases network analysis capabilities. The column
oriented architecture described in the previous chapter, along with the general purpose column stores presented in [55, 60], shows that a column oriented storage system yields better query
runtime performance than a transactional row oriented system for analytical query workloads. In this chapter we show that query performance can be further improved by using
efficient processing methods for forensic analysis and monitoring queries.
In forensic analysis, network flow data is processed in several steps in a drill-down manner,
narrowing and correlating the subsequent intermediary results throughout the analysis [17,
32]. For example, suppose a host X is detected as having been scanning hundreds of hosts
on a particular port. This event may lead the administrator to check host X's past network
activity by querying network flow data to find any services run and protocols used, and then the
connection records to other hosts. This investigation process continues by using previous
results and issuing new sequential forensic queries until, eventually, the root of the security
incident is revealed.
In general, to enable the use of previous query results, existing systems store intermediary data in temporary files and materialized views on disk before feeding the data to
the new queries [17, 18]. Using this approach the query runtime is increased by the I/O
operations. Moreover, using standard SQL syntax, references between subsequent query
results are not trivial to represent. Instead, sophisticated nested queries, stored procedures
and scripting languages are used [17, 32, 39]. In this chapter, we also propose a simple SQL
extension to easily express references to previous query results as well as other features useful when
working with network flow data. When executing queries sequentially, the query engine has
the opportunity to speed up the new queries by efficiently reusing the results of the predicates
already evaluated in the querying session. As such, in Section 4.2.2 we show how forensic
queries can be executed more efficiently by reusing previous computation when queries are
sequential and share some filters.
For monitoring, network administrators run many simple queries in batches in order to
detect complex network behavior patterns, such as worm infections [32], and to display traffic
summaries per hour, per day or other predefined time windows [17, 39]. When queries are
submitted in batches, the simple queries can be executed in any order, and some orders may
result in better overall runtime performance. Additionally, some of the
queries in the batch are expected to use the same filtering predicates for some attributes (known ports,
IPs, etc.). Thus, the results from evaluating common predicates can be shared by many
queries, saving execution time. Moreover, evaluating predicates in a particular
order across all the simple queries may result in less evaluation work for future predicates
in the execution pipeline. Taking into account the above properties of monitoring queries,
in Section 4.2.2 we present an efficient method to execute batch monitoring queries faster
using network flow data stored in a column oriented system.
Data Storage
The proposed query processing engine is designed to work with network flow data such
as CISCO Netflow [57]. We assume the data is stored using the column-oriented storage
infrastructure whose design and implementation were described in the previous chapter; we
present a brief summary here. The flow data is collected at the edge of an enterprise
network and represents all the flow traffic that crosses the network boundary in and out.
Each flow contains attributes such as source IP, source port, destination IP, destination port,
protocol, start and end time of the flow, etc. Data is assumed to be partitioned into columns
at the physical level, one column for each network flow attribute. The set of all the columns
that store attributes of the same set of flows is represented by a storage. Conceptually,
at logical level a storage is similar to a database table with the distinction that the flows
are stored as they are collected. That is, data is only appended in time sequential order
rather than inserted at specific positions in the columns. All the columns are stored in the
same order. All the attributes at the same positions in different columns represent a logical
row or logical tuple. A subset of values for a column represents a segment and an index
node stores the metadata for a segment. The index node contains data about the segment
such as minimum and maximum values, number of distinct values, the corresponding time
window, the encoding method, small data histogram, etc. All the index nodes of the column
represent the column index. At the physical level each column is actually represented by a
set of segments and a column index stored in compressed binary files on disk. Additionally
an internal IPs index structure, as presented in [19], is maintained for the internal IP
addresses to facilitate faster access to their corresponding records. Essentially, the inverted
IPs index maintains for each internal IP address the positions in the column where records
of the internal IP address appear.
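The storage layout just summarized can be pictured with the following minimal Java sketch; the class and field names are illustrative assumptions, not the actual NetStore types.

// Illustrative sketch of the per-segment metadata ("index node"), the column index
// and the inverted IPs index described above. Names are assumptions for exposition.
import java.util.*;

final class IndexNode {
    long minValue, maxValue;      // value range of the segment
    int distinctValues;           // number of distinct values
    long startTime, endTime;      // time window covered by the segment
    String encoding;              // compression/encoding method used
    int[] histogram;              // small histogram of the values
}

final class ColumnIndex {
    final List<IndexNode> nodes = new ArrayList<>();   // one node per segment, in append order
}

final class InvertedIpsIndex {
    // For each internal IP, the positions in the column where its records appear.
    private final Map<Integer, List<Long>> positions = new HashMap<>();

    void add(int internalIp, long position) {
        positions.computeIfAbsent(internalIp, k -> new ArrayList<>()).add(position);
    }

    List<Long> lookup(int internalIp) {
        return positions.getOrDefault(internalIp, Collections.emptyList());
    }

    public static void main(String[] args) {
        InvertedIpsIndex index = new InvertedIpsIndex();
        index.add(0x80EE0164, 42L);                    // 128.238.1.100 seen at position 42
        System.out.println(index.lookup(0x80EE0164));  // [42]
    }
}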
Contributions
Having described the general context and the requirements for the query processing
engine, in this chapter we make the following key contributions:
• The design and implementation of an efficient query processing engine built on top of
a column oriented storage system used for network flow data.
• Query optimization methods for sequential and batch queries used in network monitoring and forensic analysis.
• Design of a simple SQL extension that allows simpler representation of forensic and
monitoring queries.
For the rest of the chapter we organize the presentation as follows: in Section 4.2 we
describe the simple and complex queries as well as the processing and optimization methods,
in Section 4.3 we present the SQL extension primitives, in Section 4.4 we show the processing
experimental results, Section 4.5 discusses related work and in Section 4.6 we conclude.
4.2
Query Processing
The query processing engine is built on top of the column oriented storage system presented in the previous chapter. We aim to optimize two general types of queries: simple
queries and complex queries. A simple query is composed of several filtering predicates and
aggregation operators over a subset of the network flow attributes. A complex query
uses several simple queries as building blocks that can be executed either sequentially,
query-at-a-time, with new simple queries using the results of previous simple queries, or in
batches, when all the simple queries are sent for execution at the same time. In this section we describe the two query types as well as the proposed processing and optimization
methods for each.
4.2.1
Simple Queries
A simple query has the following general format:
SELECT Ac1 , . . . , Ack , f1 (Aa1 ), . . . , fm (Aam )
FROM f low records storage
WHERE σ1 (A1 ) AND . . . AND σn (An )
GROUP BY Ac1 , . . . , Ack ;
where Ac1 , . . . , Ack represent the attributes in the output, Aa1 , . . ., Aam the attributes that
are aggregated in the output, f1 , . . . , fm the SQL standard aggregation functions (MIN,
MAX, SUM, COUNT, AVG, etc) and σ1 , . . . , σn the filtering predicates over the accessed
attributes.
For simplicity we assume that the predicates in the WHERE clause are ANDed. Therefore, a logical tuple that satisfies the boolean expression in the WHERE clause of a simple
query has to satisfy all the predicates in the clause. Thus, the predicates can be executed in
any order and the runtime performance is influenced by the order in which corresponding
filters are evaluated.
One way to execute a simple query is to scan relevant columns individually, evaluate
the predicates for each column independently and in the final step merge the values of all
the columns that satisfy the corresponding predicates. This approach, called parallel early
materialization, was analyzed in [3]. However, evaluating each predicate independently
might result in wasting processing time to generate unnecessary intermediary results that
might be discarded by the final merging operation.
Another more efficient approach is to evaluate each predicate over each column in order
and send to the next predicate the positions of the values that satisfy the previous predicate.
In this way the next predicate is evaluated only for the values at relevant positions. In
the final step the columns in the output are scanned again and only the values at the
positions that satisfied all the predicates are merged. This approach, called pipelined late
materialization, was also analyzed in [3], and our proposed processing engine uses a variation
of this method for simple query processing.
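The following short Java sketch illustrates the pipelined late materialization idea with position lists over in-memory segments; the representation of columns and predicates is simplified and is an assumption for exposition, not the actual implementation.

// A minimal sketch of pipelined late materialization: each predicate is evaluated
// only at the positions that survived the previous predicate.
import java.util.*;
import java.util.function.LongPredicate;

final class LateMaterialization {
    static List<Integer> evaluate(long[][] columns, LongPredicate[] predicates) {
        // Start with all positions of the segment as candidates.
        List<Integer> positions = new ArrayList<>();
        for (int i = 0; i < columns[0].length; i++) positions.add(i);
        // Evaluate each predicate only at the surviving positions.
        for (int p = 0; p < predicates.length; p++) {
            List<Integer> next = new ArrayList<>();
            for (int pos : positions) {
                if (predicates[p].test(columns[p][pos])) next.add(pos);
            }
            positions = next;
        }
        return positions;   // positions satisfying all predicates; output columns are merged from these
    }

    public static void main(String[] args) {
        long[][] cols = { {6, 80, 443, 80}, {1, 1, 0, 1} };   // e.g. destPort and tcpSyns segments
        LongPredicate[] preds = { v -> v == 80, v -> v == 1 };
        System.out.println(evaluate(cols, preds));            // [1, 3]
    }
}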
Note that we assume that all the columns are partitioned into segments and corresponding segments from different columns (containing values at the same position in the columns)
fit in main memory. As such, each simple filtering query is evaluated per segment and the
intermediary results for all segments are merged in the final step. By doing so, all the
working segments are in main memory yielding faster processing.
All the aggregation operations are performed in the last step, after the filtering operations
are completed. If the aggregation operations form a large number of groups and the size of
the output becomes larger than the available memory, the intermediary results are flushed
to disk. However, we found such cases unusual for our test query workload.
Figure 4.1: Simple query representation and execution graph: C1 , C3 , C4 , C6 - the accessed
columns; L1 , L2 , L3 - the positions lists of values satisfying the predicates σ1 , σ2 and σ3
respectively; π - the output merging operator; U - user input; O - a previous query result.
A simple query execution can be represented as a directed graph as shown in Figure 4.1.
Each predicate is evaluated by a processing node. Except for the first node, each
processing node can take as input disk resident column data, user input and temporary
data from other queries, as well as a list of positions representing the values satisfying the
previous predicate. The first processing node can take as input only column, user and
temporary data.
Simple Query Optimization
It is beneficial to filter out unused data and to avoid reading unused segments as early as
possible in the processing phase. Since all the aggregation operations are performed using
the same hash-based methods after the filtering is completed, we omit their execution cost
from the formal problem description. Therefore, the optimization problem is to find the
filtering order that yields the best running time, that is, the filtering order with the
minimum execution cost. This problem is equivalent to the pipelined filter ordering problem studied in the database research community. We use a polynomial algorithm
similar to the one proposed in [24] to solve the instance of the problem in which data is stored
in a column-store.
Suppose a simple query is represented as a set of m filters, {F1 , . . . , Fm }, and for each
filter Fk we are given the cost ck of evaluating the filter on a tuple and the probability pk that
the tuple satisfies the filter predicate, independently of whether the tuple satisfies the other filter
predicates. A simple polynomial time solution to this problem, presented in [24], is to order
the m filters in non-decreasing order of the ratio ck /(1 − pk ), also called the rank of the
filter.
The running time of this simple algorithm is asymptotically bounded by the time of
sorting the filters in non-decreasing order of ranks, O(m log m). However, since we expect
to have a small number of filters, practically the running time is determined by the overhead
needed to read from disk the parameters used to compute each filter’s rank. We use a
variation of this algorithm by considering segment processing cost and selectivity, rather
than tuple evaluation cost and probability of the tuple passing the filter.
Since segment reading is the most expensive operation, its cost is directly proportional
to the segment size on disk, and data is most of the time processed in compressed format.
Therefore, in our simple query processing model we consider the execution cost ci of each
filter to be equal to the size si of its segment on disk.
The selectivity of a filter over a segment is represented by the fraction of the segment
values that satisfy the filter predicate. The selectivity factor (SF ) of a filter over a segment
is estimated based on the statistics collected in the index node of each segment (minimum
value, maximum value and number of distinct values) for each type of filtering operator (<,
≤, >, ≥, ≠, =, IN). By default we consider all the values in a segment to be uniformly
distributed and we use simple estimation equations, similar to the ones in [51], to estimate
the selectivity factors.
Suppose a column has a segment Si with ni elements, size si on disk, minimum value mi ,
maximum value Mi and di distinct values. Also, suppose a filtering predicate σi is given in
the format Ai op v, where Ai is the attribute name, op is one of the filtering operators and
v is a value given in the query statement. For a filter with predicate σi we estimate the
selectivity factor SF (Si , σi ) over segment Si , assuming uniformly distributed values, using
one of the following simple equations:
SF(S_i, A_i = v) =
\begin{cases}
0 & \text{if } v < m_i \text{ or } M_i < v \\
n_i \times \frac{1}{d_i} & \text{if } m_i \le v \le M_i
\end{cases}

SF(S_i, A_i \ne v) =
\begin{cases}
n_i & \text{if } v < m_i \text{ or } M_i < v \\
n_i \times \frac{d_i - 1}{d_i} & \text{if } m_i \le v \le M_i
\end{cases}

SF(S_i, A_i \ \mathrm{IN}\ V) =
\begin{cases}
n_i & \text{if } V \text{ is undefined} \\
\sum_{v \in V} SF(S_i, A_i = v) & \text{if } V \text{ is defined}
\end{cases}

SF(S_i, A_i < v) =
\begin{cases}
0 & \text{if } v \le m_i \\
n_i & \text{if } M_i < v \\
n_i \times \frac{v - m_i}{M_i - m_i} & \text{if } m_i < v \le M_i
\end{cases}

SF(S_i, A_i \le v) =
\begin{cases}
0 & \text{if } v < m_i \\
n_i & \text{if } M_i \le v \\
n_i \times \frac{v - m_i + 1}{M_i - m_i} & \text{if } m_i \le v < M_i
\end{cases}

SF(S_i, A_i > v) =
\begin{cases}
0 & \text{if } M_i \le v \\
n_i & \text{if } v < m_i \\
n_i \times \frac{M_i - v}{M_i - m_i} & \text{if } m_i \le v < M_i
\end{cases}

SF(S_i, A_i \ge v) =
\begin{cases}
0 & \text{if } M_i < v \\
n_i & \text{if } v \le m_i \\
n_i \times \frac{M_i - v + 1}{M_i - m_i} & \text{if } m_i < v \le M_i
\end{cases}
where V is a set of values that can be explicitly enumerated in the query statement or
can be loaded from some external source or a previous query result. When V is
explicitly enumerated, the SF can be estimated more accurately by adding the selectivity
factors of the equality predicates for the elements of V . When V is not known,
the selectivity factor is taken to be equal to the segment size.
In general, the selectivity factors estimated with the previous simple equations are quite
accurate for uniformly distributed segment values. However, not all network data segments
have uniformly distributed values. For those segments, the simple equations do not give
accurate estimates, and other auxiliary stored data such as histograms can be used,
at the expense of a decrease in the network flow insertion rate and an increase in storage
overhead. For example, NetStore uses small histograms and the inverted IPs index to estimate
the selectivity factors for segments. Histograms are mainly used to estimate the counts of
busy servers, common protocols and common port numbers, the rest of the values being
pseudo-randomly distributed.
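For illustration, a small Java sketch of the uniform-distribution estimation for two of the operators is given below; the class and method names are assumptions used only for exposition.

// Illustrative selectivity factor estimation from segment index-node statistics,
// following the uniform-distribution equations above for "=" and "<" only.
final class SelectivityEstimator {
    // Estimated number of segment values satisfying A = v.
    static double equals(long n, long d, long min, long max, long v) {
        if (v < min || v > max) return 0;
        return (double) n / d;
    }

    // Estimated number of segment values satisfying A < v.
    static double lessThan(long n, long min, long max, long v) {
        if (v <= min) return 0;
        if (v > max) return n;
        return n * ((double) (v - min) / (max - min));
    }

    public static void main(String[] args) {
        // Segment with 2,000,000 values in [0, 65535] and 1,000 distinct values.
        System.out.println(equals(2_000_000, 1_000, 0, 65_535, 443));   // 2000.0
        System.out.println(lessThan(2_000_000, 0, 65_535, 1_024));      // ~31250.5
    }
}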
We use Algorithm 4.1 to efficiently execute a simple query.
Algorithm 4.1 ExecuteSimpleQuery
INPUT: Q = {F1 , . . . , Fm }
for all Fi ∈ Q do
    Si ← segment in Fi predicate
    ci ← size of Si on disk
    SFi ← selectivity factor for Fi
    rank(Fi ) ← ci /(1 − SFi )
end for
sort {F1 , . . . , Fm } in non-decreasing order by rank
POS ← nil
for all Fi ∈ Q do
    POS ← positions of evaluating Fi using POS
end for
The expected running time of this simple algorithm is asymptotically bounded by the
time of sorting the filters in non-decreasing order of ranks, thus O(m log m). However, since
we expect to have a small number of filters, practically the running time is determined by
the overhead needed to read from disk the parameters used to compute each filter’s rank.
Since the parameters contributing to the rank can change from segment to segment within
the same column, the algorithm steps are executed for each set of corresponding segments
(e.g. first for segment 1 in columns C1 , C2 , . . . , second for segment 2 in columns C1 , C2 , . . . ,
and so on). This processing model enables the use of parallelism on a multi-core
execution server, by processing each set of segments in a separate thread and combining the
partial results to obtain the final result. However, parallel processing was not included
in our current evaluation of the system and is intended as future work. For the rest of this
chapter we use the terms segment and column interchangeably and write column data when
referring to either.
4.2.2
Complex Queries
Besides the simple queries described in the previous section, network forensics and monitoring tasks use more complex queries.
On one hand, investigation queries are interactive, sequential and process data in a drill-down manner. The investigator issues a simple query, analyzes the results and
issues other, more specific, subsequent queries reusing some of the data from the previous
queries' output [17, 18]. Thus, investigation queries are executed one-at-a-time with possible references to previous query results. We call this type of complex query sequential
interactive queries. On the other hand, monitoring tools use queries that seek patterns
of network behavior [32], summarize network traffic [39] and access a larger set of common
attributes. These queries are sent and executed in batches, many-at-a-time. We call this
type of complex query batch queries. Both types of complex queries can be represented
using simple filtering and aggregation queries as building blocks, as shown in [32].
To describe the complex queries we again assume the network flow data is stored using
the column oriented storage infrastructure described in Chapter 3. We define a complex
query CQ as a set of simple queries {Q1 , . . . , Qk }. When processing a complex query using
historical network flow data, the execution is slowed down by the expensive I/O operations
performed when accessing disk resident data and by redundant and unnecessary operations. Suppose each network flow has n attributes and the data is stored using n columns,
one column for each attribute. From a data perspective, the input of a simple query Qi is
represented by a subset C ∗ of the total set of columns C = {C1 , . . . , Cn }, a subset O∗
of the output column set of the previous i − 1 queries, O = {O1 , . . . , Oi−1 }, and possibly some
user input data whose size is assumed to be much smaller than either the disk resident or the
intermediary data. We assume the intermediary and user data are memory resident and the
data in the columns is stored on disk. Therefore, the problem is to find an efficient execution
model for the complex query CQ that minimizes disk access, avoids unnecessary
operations, and reuses intermediary data in an efficient way.
Sequential Interactive Queries
This type of complex query is predominant in an investigation scenario and we consider
the whole querying session when designing the processing model. As such, a sequential
interactive complex query can be implemented by the automaton in Figure 4.2.
Figure 4.2: The interactive sequential processing model.
The input of the automaton is represented by the sequence of simple queries that make
up CQ. The automaton has only three states, initial state S0 , intermediary state S1 and final
state S2 . At each state, the disk data represented by C ∗ and the intermediary
data represented by O∗ are processed. For this model we omit the user data since we assume it has
negligible size compared to the other two types of data. The arrows between any two states
represent the simple query executed and the data source for the query. The tuple (Q1 , C ∗ )
represents that the machine will execute Q1 using only column data from disk and will
generate intermediary data that will be stored at the intermediary state S1 . The tuple
(Qi , C ∗ , O∗ ) with i = 2, . . . , k − 1 represents that the machine will execute queries Q2 to
Qk−1 , can take as input both disk resident and intermediary data, will generate intermediary
results and will remain in the intermediary state S1 . Finally, the tuple (Qk , C ∗ , O∗ ) will
execute the last query in the sequence and the machine will make the transition to the final
state and will output the final result of the complex query.
Using this processing model, the available memory might fill up after a number of queries, depending on the output size of each query. In such a case the least recently
issued query result is flushed to disk and a reference is kept in main memory.
Since the only operations permitted for a simple query are filtering and aggregation, the
output size of each simple query is in general expected to be much smaller than the input
size. Therefore, when a disk resident previous query output is referenced by a
new query, the time to read the already computed result from disk is expected to be much
smaller than the time of reading the input data plus the query processing time.
Batch Queries
In the case of batch complex query processing, all the simple queries are sent for
processing at the same time (for example by loading the simple query statements from a
file), therefore all the simple queries are known when processing starts.
A straightforward way to execute a batch of queries is to use the previous sequential
processing model and execute the simple queries one-at-a-time. However, the execution runtime
for the batch of queries can be significantly improved compared to the sequential case
by exploiting the fact that all the query statements are known from the beginning. As such, all the simple queries in
the batch can be executed in any order, and some orders may result in better overall runtime
performance. Additionally, for network flow data some of the queries in
the batch are expected to use the same filtering predicates for some attributes (for example known ports,
static internal IPs, known protocols, etc.). Therefore, the results from evaluating common
predicates can be shared by many queries, thus saving execution time. Moreover, evaluating
predicates in a particular order across all the simple queries may result in less evaluation
work for future predicates in the execution pipeline. Following the execution model for
simple queries illustrated by Figure 4.1, Figure 4.3 shows a schematic batch query execution
strategy when some filters are shared between the simple queries and the shared filters are
executed only once. We simplified the graph and highlighted only the data loaded by the
shared filters.
Figure 4.3: Directed Acyclic Graph representation of the batch of 3 queries with shared
filters. Highlighted processing nodes represent the shared filters across the 3 simple queries.
Order of execution is represented by the direction of the arrows between processing nodes.
When designing the query processing engine we address all the above opportunities for
improvement and in the next section we present the optimization methods used for both
types of complex queries.
Complex Query Optimization
In the case of a sequential interactive query, some computation time can be saved
for new simple queries by reusing the intermediary filter evaluation results of previous simple
queries if there are common subexpressions already computed. For this purpose we maintain, for
the whole querying session, the resulting column positions obtained when executing each filter.
Therefore, when a new simple query is issued, all the filters of the query are first checked
against the pool of filters already evaluated in the session. If no match is found, the simple
query is optimized using the method described in Section 4.2.1 for simple queries and the
most efficient filtering order is decided. If a filter is found to be already evaluated, it is
placed at the beginning of the filtering order by assigning it a rank of zero. Then the
optimized ordering for the remainder of the filters is decided using the optimization method
for simple queries.
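A minimal Java sketch of this session-level reuse is shown below, assuming the evaluated filter results are keyed by their predicate text; the names and the representation are illustrative only.

// Sketch of reusing previously evaluated filter results within a querying session.
import java.util.*;

final class SessionFilterCache {
    private final Map<String, List<Long>> evaluated = new HashMap<>();  // predicate -> positions

    boolean contains(String predicate) { return evaluated.containsKey(predicate); }

    List<Long> get(String predicate) { return evaluated.get(predicate); }

    void put(String predicate, List<Long> positions) { evaluated.put(predicate, positions); }

    // Rank used when ordering the filters of a new simple query:
    // a cached filter gets rank zero so its (already computed) result is applied first.
    double rank(String predicate, double cost, double selectivity) {
        if (contains(predicate)) return 0.0;
        return cost / (1.0 - selectivity);
    }
}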
For network forensic investigations we expect a small number of sequential
interactive simple queries per session, therefore we assume all the filter positions will fit in
main memory. However, if the available memory fills up, the least recently used filter result
is flushed to disk and a reference is kept in main memory.
In the case of batch queries we build the processing model using a similar approach as
the one presented in [35] for stream processing. Suppose a complex batch query is composed
of a set of simple queries CQ = {Q1 , . . . , Qk }. Each simple query Qi is represented by a
subset of the set of all filters {F1 , . . . , Fm }. We assume that filters can be shared among
queries and filters are evaluated independent of each other. Each filter Fj is represented by
a predicate over a column (segment) and has associated a cost of being evaluated cj and a
selectivity factor SFj .
We say that a filter Fi resolves a query Qi if the predicate of Fi evaluates to FALSE for
all the values in the column or if Fi is the last query to be evaluated in Qi . We define a
strategy ζ to execute a complex batch query CQ as a sequence of filters ζ = {Fa , . . . , Fb }.
Let pi be the probability that a filter Fi will be evaluated by a strategy ζ. Thus, the cost
of using strategy ζ to resolve all the queries in CQ is given by:
cost(\zeta) = \sum_{i=a}^{b} p_i \cdot c_i
Therefore, the problem is to find, among all the possible execution strategies, the strategy with the minimum cost. It is shown in [35] that this problem is NP-hard,
by a reduction from the set cover problem. However, approximation
algorithms can be used instead, and we consider two of them: the naïve method and the adaptive
method.
The naïve way to execute a complex batch query is to first optimize each simple query
and then execute all the simple queries in sequence. However, since each filter might
be shared by many queries, a more efficient approximate method is to also consider the
participation of each filter in the queries and to use a ranking function similar to the one used
for simple query processing optimization in Section 4.2.1. As such, for each filter Fi , if qi
represents the number of unresolved queries that share filter Fi , ci the cost of evaluating the
filter and SFi the selectivity factor of the filter predicate, the ranking function can be defined
as:
rank(F_i) = \frac{c_i}{q_i \cdot (1 - SF_i)}
Algorithm 4.2 shows the processing steps for batch query execution using as storage
the column oriented system described in the previous chapter, NetStore.
Algorithm 4.2 ExecuteAdaptiveBatchQuery
INPUT: CQ = {Q1 , . . . , Qk }
create F ← {F1 , . . . , Fm }
for all Fi ∈ F do
    segi ← segment in Fi predicate
    ci ← size of segi on disk
    SFi ← selectivity factor of Fi
    qi ← number of unresolved queries containing Fi
    rank(Fi ) ← ci /(qi · (1 − SFi ))
end for
S ← {Q1 , . . . , Qk } set of unresolved queries
P (Q1 ), . . . , P (Qk ) result positions
while S ≠ ∅ do
    Fi ← filter with minimum rank(Fi )
    POS ← evaluate Fi
    for all Qj with Fi ∈ Qj do P (Qj ) ← P (Qj ) ∩ POS
    S ← S − {Qi | Fi resolves Qi }
    F ← F − {Fi }
    recompute rank(Fj ) for all Fj ∈ F
end while
Initially, all the filters of all the simple queries are added to the set of unevaluated
filters F . Then, for each filter, the algorithm estimates the parameters used for the rank
computation. To obtain the result, for each query Qi we maintain in the variable P (Qi ) a
list of positions corresponding to the values that passed the evaluated filters. At each step,
the filter Fi with the lowest rank is selected for evaluation from the pool of unevaluated
filters. After the filter Fi is evaluated, the list of resulting column positions corresponding
to the values that passed the filter is intersected with the current list of positions of each
query containing Fi . The resulting set of filtered values of all the queries resolved by filter
Fi can then be constructed by scanning the corresponding columns again. At this point
each resolved query terminates and its result can be sent to the user. All the filters that
belong to queries resolved after the evaluation of filter Fi and are not part of any other
query are considered evaluated and are removed from the filter pool. The processing
continues by updating the ranks of the remaining filters and sorting them in non-decreasing
order of rank. We call this method the adaptive method since at each step the rank of each
filter is adaptively recomputed, taking the change in filter participation into account.
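The following simplified Java sketch follows the structure of Algorithm 4.2: it repeatedly evaluates the pending filter with the minimum rank ci /(qi (1 − SFi )), intersects the resulting positions into every unresolved query that shares the filter, and recomputes the ranks as the participation changes. The data structures and the evaluator interface are assumptions made for the sketch; it omits the final column re-scan and the priority queue used by the actual implementation.

// Simplified adaptive batch execution sketch (illustrative only).
import java.util.*;
import java.util.function.Supplier;

final class BatchFilter {
    final String name;
    final double cost, selectivity;
    final Supplier<Set<Long>> evaluator;   // evaluates the predicate, returns matching positions
    BatchFilter(String name, double cost, double selectivity, Supplier<Set<Long>> evaluator) {
        this.name = name; this.cost = cost; this.selectivity = selectivity; this.evaluator = evaluator;
    }
}

final class AdaptiveBatchExecutor {
    // queries: for each query, the set of filters it is composed of (sets are consumed by this sketch).
    static Map<Integer, Set<Long>> execute(List<Set<BatchFilter>> queries, Set<Long> allPositions) {
        Map<Integer, Set<Long>> result = new HashMap<>();
        Set<Integer> unresolved = new HashSet<>();
        Set<BatchFilter> pending = new HashSet<>();
        for (int q = 0; q < queries.size(); q++) {
            unresolved.add(q);
            result.put(q, new HashSet<>(allPositions));
            pending.addAll(queries.get(q));
        }
        while (!unresolved.isEmpty() && !pending.isEmpty()) {
            // Recompute ranks each round, because filter participation changes.
            BatchFilter best = null;
            double bestRank = Double.MAX_VALUE;
            for (BatchFilter f : pending) {
                int q = participation(f, queries, unresolved);
                if (q == 0) continue;
                double rank = f.cost / (q * (1.0 - f.selectivity));
                if (rank < bestRank) { bestRank = rank; best = f; }
            }
            if (best == null) break;
            Set<Long> positions = best.evaluator.get();
            for (int q : new ArrayList<>(unresolved)) {
                if (queries.get(q).contains(best)) {
                    result.get(q).retainAll(positions);      // intersect position lists
                    queries.get(q).remove(best);
                    if (queries.get(q).isEmpty() || result.get(q).isEmpty()) unresolved.remove(q);
                }
            }
            pending.remove(best);
        }
        return result;   // per-query positions of the values that passed all filters
    }

    private static int participation(BatchFilter f, List<Set<BatchFilter>> queries, Set<Integer> unresolved) {
        int count = 0;
        for (int q : unresolved) if (queries.get(q).contains(f)) count++;
        return count;
    }
}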
Suppose θ is the number of pairs (Fj , Qi ) in which Fj appears as a filter of Qi . Then,
when evaluating a filter, the elimination of the queries resolved by that evaluation is
done in O(θ) time, and the total number of rank updates for filters belonging to resolved
queries is also O(θ). We maintain the filters in a priority queue sorted by
rank, therefore each filter rank update is performed in O(log m) time, where m is
the number of filters. Therefore, since we assume the filters are independent, the execution
overhead of Algorithm 4.2 is O(θ log m).
In [35] the authors show that an equivalent algorithm with k simple queries and m filters
achieves an O(log² k · log m) approximation factor. However, in the case of monitoring batch
queries executed on network flow data, the number of columns and distinct filters is rather
small compared to the expected number of simple queries, therefore we expect the
approximation factor of the processing time to be dominated by the polylogarithmic function
of the number of queries rather than of the number of filters.
4.3
Query Language
In many situations the limited scope of pure SQL commands is not suitable for
network data analysis. Network monitoring and forensic tasks might require the use of
sophisticated nested SQL queries, stored procedures and scripting languages that are not
trivial to express [17, 32].
In addition to the processing engine and query optimization methods, in this chapter
we propose a simple SQL extension for querying network flow data, called NetSQL. The
query language extension was designed to easily describe join-free network monitoring and
forensic queries based on their characteristics. The set of language operators is a subset
of the SQL commands plus a small set of added features implemented on top of the column
oriented storage system described in the previous chapter. Besides standard SQL commands
and operators, we only added features necessary for the efficiency of monitoring and forensics
workloads.
4.3.1
Data Definition Language
NetSQL is implemented at the logical level, the physical level being transparent to the
user. Taking into account the assumptions in the previous section, in this section we define the
primitives of the data definition language. The syntax of the command to create
a storage with n columns of various types is as follows:
CREATE STORAGE storage name (A1 T1 , . . . , An Tn );
where A1 , . . . , An represent the attributes of the network flow records that will be stored
and T1 , . . . , Tn their respective types. Since network data is permanently appended at line
speed it is often important to provide a mechanism to send data for storage as soon as it
is collected from the network. Unlike standard SQL, NetSQL implements a primitive to
import data from a network resource. The following construct is used for this purpose:
LOAD INTO storage name FROM url ;
Note that the data sent from the source can be in either the IPFIX format [26] or CSV text
format. The inverted IPs index structure is created using the following command:
CREATE IPS INDEX index name
ON ips column name
USING ip addr ;
where index name is the name of the inverted IPs index and ips column name is either the
source IP or the destination IP attribute, restricted to the addresses matching a specific IP
address expression or belonging to a set represented by ip addr. For example, the IPv4
address expression can be given in subnet CIDR format, and the inverted IPs index will be
created only for the matching IP addresses. NetSQL also supports an operation for selecting
the active storage, similar to the one used to select a table in a relational database, with
the following syntax:
Depending on the projected usage scenario, other operations can also be designed and
implemented. As such, we also implemented commands, similar to the corresponding SQL
commands, for exporting a storage, deleting a storage, dropping a storage and dropping an
index, with the following syntax:
EXPORT STORAGE storage name TO csv file;
DELETE STORAGE storage name;
DROP STORAGE storage name;
DROP INDEX index name;
However, at the time of writing this dissertation NetSQL supports only the operations
described in Section 4.3. These commands are used to enable the performance evaluation
of the system and the testing of various optimization strategies.
4.3.2
Data Manipulation Language
NetSQL does not follow the relational model for data organization; it assumes that
at any given time only one storage is active and that all the queries refer to that storage. The
storage selection is done using the command presented in Section 4.3.1. Supposing that the
storage has n attributes, the following syntax represents the general form of a filtering
and aggregation simple query written in NetSQL:
label:
SELECT Ac1 , . . . , Ack , f1 (Aa1 ), . . . , fm (Aam )
WHERE σ1 (A1 ) AND . . . AND σn (An )
GROUP BY Ac1 , . . . , Ack ;
where Ac1 , . . . , Ack with k ≤ n represent the attributes in the output, Aa1 , . . . , Aam with
m ≤ n the attributes that are aggregated in the output, f1 , . . . , fm the aggregation functions
applied to the m attributes and σ1 , . . . , σn the predicates over all the attributes considered.
The output of a query in NetSQL is represented by a set of columns. In order to facilitate the
use of previous query results in subsequent queries a label is added to each of the queries.
When an attribute Ai of the output of a query with label Qi is accessed by some other
subsequent query Qj the dot notation Qi .Ai is used in the WHERE clause of the query
labeled Qj to represent the attribute values in the output of Qi . NetSQL implements the
basic filtering operators <, ≤, ≠, >, ≥, = and IN (membership of an element in a set).
The basic aggregation operators implemented in NetSQL are MIN, MAX, SUM and COUNT.
4.3.3
User Input
Sometimes, in network data analysis tasks, it is useful to load and use large predefined
user inputs, such as lists of malicious hosts or lists of restricted port numbers, that are
not part of the initial storage schema. Unlike standard SQL, NetSQL allows the user to
input a set with a large number of values that can be used in conjunction with the existing
storage data, without the need to change the underlying storage and data structures. The
following primitive implements this feature:
LOAD INTO U1 T1 , . . . , Us Ts FROM url ;
where U1 , . . . , Us represent the names of the user input attributes and T1 , . . . , Ts their types.
By specifying the types at loading time, each column is encoded using type-specific encoding
methods in order to facilitate more efficient operations on the data in encoded
format. For example, if IP addresses are represented in dotted notation as strings in
the user source file, they are encoded into 4-byte integer values in the memory resident
data structure.
Note that the storage and user data loading routines are very similar in syntax but have
different semantics. The main difference is that the storage data is written to disk while the
user data is memory resident; therefore it is assumed that the data structure created from
the user input fits in main memory at all times.
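For example, the dotted-notation encoding mentioned above can be done as in the following small Java sketch (illustrative only, not the actual loader code):

// Encode an IPv4 address from dotted notation into a 4-byte integer.
final class IpEncoder {
    static int encode(String dotted) {
        String[] parts = dotted.trim().split("\\.");
        int value = 0;
        for (String part : parts) {
            value = (value << 8) | Integer.parseInt(part);   // pack each octet into one byte
        }
        return value;
    }

    public static void main(String[] args) {
        System.out.println(Integer.toUnsignedString(encode("128.238.1.100")));  // 2163081572
    }
}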
4.4
Experiments
We implemented an instance of the presented querying framework using the Java programming language on the FreeBSD 7.2 platform. For all the experiments we used a single
machine with 6 GB of DDR2 RAM, two quad-core 2.3 GHz CPUs and a 1 TB SATA-300,
32 MB buffer, 7200 rpm disk. We used the network flow data captured over a 24 hour period
of one weekday at our campus border router, representing about 8 GB of raw network flow
data. We considered 12 attributes for each network flow record, each of byte, short or
integer type, and stored the data using NetStore, the column oriented system described in
Chapter 3. We present the experimental setting and the performance evaluation for both simple
and complex queries.
4.4.1
Simple Queries Performance
We first generated simple queries with an increasing number of predicates, with selectivities
chosen at random from the interval (0, 1). In the first phase, for each simple query we generated all the permutations of the filters in the query. We executed the simple query
permutations sequentially, without the filtering order optimization enabled, and recorded the
running time for each permutation. Next, we ran the same simple queries with the optimization
enabled and recorded the running times.
Figure 4.4: The average simple queries running times for permutation, optimized and optimum strategies.
Figure 4.4 shows the average running times of the queries with an increasing number of
predicates for the two approaches, permutation and optimized, compared with the optimum
running time. As the figure shows, the optimized simple query processing achieves better
performance than the average running times of the permutation queries, while maintaining
close to optimum performance. This shows that the filtering cost estimation model is quite
accurate, within a small predictable error. The small difference from the optimum is due,
on one hand, to the assumption that the segment values are pseudorandomly distributed,
which affects the accuracy of the selectivity factor computation, and on the other hand to the
preprocessing overhead of computing the estimation parameters. Since the number of filters
within a simple query is rather small, the preprocessing overhead added by the filtering order
optimization is expected to be negligible and much smaller than the absolute query execution
time. We believe a more accurate optimal filtering order can be obtained by using more
expensive approaches to maintain segment statistics and value distributions (for example,
full histograms) that would help improve the selectivity factor estimation.
4.4.2
Complex Queries Performance
We consider the following parameters for each test: m, the number of filters; k, the number
of simple queries; and p, the average participation of each filter (the number of queries that
share the same filter). Then the expected number of filters per query is q = pm/k. Since we
have only 12 flow attributes and we allow at most one attribute per filter, we have q ≤ 12.
For both types of complex queries, sequential and batch, we generated the simple queries
in a similar way. We first generated the m filters, having predicates with random selectivity
factors in the interval (0,1). Even though a larger number of predicates can be used, we set the
number of predicates to m = 66 for all the experiments, simply to maintain reasonable running
times for all the complex queries. Next we generated k queries with the expected number
of filters per query q randomly selected in the interval [1,12], by increasing the average
participation of each filter p and calculating k for each value of q and p using the formula
k = mp/q.
For both complex query cases, we considered as the baseline the naïve method, in which all
the simple queries are optimized and executed in sequential order.
Sequential Interactive Queries
We tested the efficiency of the complex interactive queries by generating an increasing
number of simple queries with increasing participation p of each filter, each simple query
being executed sequentially, one-at-a-time. We then ran the same queries with filter
evaluation result sharing between new queries and previous queries enabled. We call this method of
sharing filter evaluation results the sequential method.
Figure 4.5 shows the results for m = 66, q randomly selected in the interval [1,12] and p
an increasing integer from 1 to 9. When the average filter participation p was under 2, that
is, when each filter occurred on average in fewer than two queries (22 queries on the
graph), the result of the naïve method is slightly better than that of the sequential method. This is
the case because the sequential method basically performed the same strategy as the naïve
method, since no filter had been evaluated before. The performance penalty of the sequential method
represents the overhead of managing the already computed results. After the point where
the filter participation is larger than 2 (more than 22 queries on the graph),
the computation savings from reusing the previously evaluated filters are clearly visible. The
performance benefit of the sequential method reaches up to an order of magnitude compared
to the naïve method and depends on the filters' selectivity and participation.
Figure 4.5: The complex interactive query performance compared to the baseline.
We observed that, even though after a while all the filters are pre-evaluated, with their results
stored in memory, the performance gap does not keep increasing. This is the case because the
sequential method still has to intersect the filtering positions of each pre-evaluated
filter. If the number of predicates is small and the processing engine has a large amount of
memory available, another level of optimization could be implemented that stores
and reuses the most frequent intersections of the filtered positions, but this design was not
implemented and is left as future work.
Batch Queries
To test the batch query processing algorithm we again considered the naïve method as
the baseline and varied the number of generated queries executed by each method, using
m = 66 filters, q random in the interval [1,12] and p an increasing integer in the interval [1, 21]. We
ran each set of queries as a batch using the naïve and adaptive methods, executing
all the queries at the same time; Figure 4.6 shows the results obtained.
Similar to the previous case, the performance difference compared to the naïve method
starts to be noticeable once the participation of each filter increases. This is the case because
once a shared filter is evaluated, more queries can benefit from its evaluation. However,
the performance gap does not increase substantially until the filter participation is larger than
11 (on average 11 queries share at least one filter). This is because, for lower participation,
the smaller performance benefit is diminished by the large number of initially unresolved
queries and by the filter rank re-computation overhead. Once their filters are evaluated, the
resolved queries are no longer taken into account by Algorithm 4.2. However, for workloads
of batch queries with more than 330 simple queries sharing more than 11 filters, the
adaptive method performed up to 6 times faster in our experiments. Such workloads
are not uncommon for monitoring purposes if we consider the increasing number of devices
connected to IP networks.
The main limitation of the adaptive batch processing algorithm is that, unlike the naïve
method, it does not guarantee the optimum execution time for each individual simple query.
However, a hybrid execution strategy that assigns priorities to the simple queries can be
implemented. As such, the queries with the highest priority can be executed first using
the naïve method, their filter evaluation positions can be cached, and the rest of the queries
can be executed as a batch. Previously evaluated filter results can then be used in the
adaptive algorithm by assigning them the lowest rank. We believe this is another interesting
optimization scheme and we plan to pursue it in future work.

Figure 4.6: The complex batch query performance compared to the baseline.
4.5
Related Work
There is a rich body of research concerned with methods of querying network flow records
and with the efficiency of concurrent query processing; we review the techniques related to both
areas.
4.5.1 Multi-Query Processing
In [11] the authors propose a method that enables ad-hoc concurrent queries to execute
efficiently in a shared resource environment regardless of when they were submitted. This
system works well for a data warehouse model that assumes non-blocking queries. However,
it is not compatible with blocking forensic queries, nor with the batch query
processing used for monitoring.
When queries are considered for processing in batches, it is possible to identify common
sub-expressions and optimize the execution of the batch by generating plans that
share computational cost among simple queries, as described in [48]. However, if we consider
each simple query in the batch as a set of pipelined filters, the actual savings of performing
a single evaluation of all the common filters can be lost if the optimal filtering order of
individual queries is changed. The new filtering order might require scanning irrelevant
segments, thus wasting processing time. As such, for batch queries we use the idea
of efficient work sharing with care by assigning a rank to each filter, as in [35]. In [35],
the authors formalize the problem of optimally executing a collection of queries represented
by conjunctions of possibly expensive filters that may be shared across queries. We use a
similar algorithm and derive a cost function that contributes to the rank of each filter by taking
into account the underlying storage infrastructure described in [19]. Based on the estimated
cost, we prioritize the evaluation of expensive filters that are shared by many queries and
that will lead to more queries being executed faster, thus reducing the overall processing time.
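To make the shape of such a shared-filter strategy concrete, the sketch below shows one possible adaptive evaluation loop in the spirit of Algorithm 4.2 and [35]. The rank function, evaluate_filter and the query interface (filters, apply, is_resolved) are simplified placeholders, not the exact cost model or implementation used in this chapter.

    def adaptive_batch(queries, rank, evaluate_filter):
        # queries: objects exposing a set attribute 'filters', a method
        # apply(filter_id, positions) that narrows the query's candidate
        # positions, and is_resolved(). rank(filter_id, participation) returns
        # a rank; filters with lower rank are evaluated earlier.
        # evaluate_filter(filter_id) scans the stored segments and returns the
        # matching positions for that filter.
        unresolved = set(queries)
        pending = {f for q in queries for f in q.filters}
        while unresolved and pending:
            # Participation is recomputed only over still-unresolved queries,
            # so resolved queries no longer influence the ranking.
            participation = {f: sum(1 for q in unresolved if f in q.filters)
                             for f in pending}
            best = min(pending, key=lambda f: rank(f, participation[f]))
            pending.discard(best)
            positions = evaluate_filter(best)
            for q in list(unresolved):
                if best in q.filters:
                    q.apply(best, positions)
                    if q.is_resolved():
                        unresolved.discard(q)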
4.5.2 Network Flow Query Languages
The existing query languages for network flow data can be grouped into two main
categories: SQL-based and customized.
In general, the SQL-based languages use a transactional row-oriented database system
for storage and standard SQL commands to define and manipulate the network flow
data [18, 36]. These systems are able to support basic forensics and monitoring
tasks but incur a performance penalty when querying large amounts of historical network flow
data.
Each of the systems described in [15, 56] proposes an extended SQL-based language that
supports streaming data operations such as selection, projection and aggregation, as well as
ways to query data based on time windows, and our proposed simple language is similar in
nature. However, these systems are designed to provide the best performance for live streaming
data, not for historical disk-resident data, and are not suitable for interactive queries in which
user input is interleaved at each step.
The customized languages implement either filtering methods or a set of data manipulation
commands. The filtering methods, such as the ones described in [33], allow users to
construct filtering expressions for network traces using IP addresses, port numbers, protocols,
etc., and use the filtered data to generate traffic reports and to perform network analysis.
Similar to the streaming SQL-based systems, these filtering methods are concerned
with real-time analysis of the network traffic and do not attempt to optimize monitoring and
forensic interactive queries on disk-resident data.
Some systems implement a set of command line tools [17] or a collection of Perl
scripts [39], and use their own primitives to define filtering expressions and to perform
analysis. These systems can generate high-level traffic reports using
live and short-term historical data, but the reports must be expressed in terms of the
provided set of primitives and are not trivial to represent. In [32], the authors propose a more
intuitive framework for designing a stream-based query language for network flow data, but
the implementation and the query runtime performance are not the primary goal, so they are
omitted from the analysis.
In our framework we propose a simple SQL-based query language extension for network
flow data because SQL is a well-known standard and offers a rich set of primitives.
Additionally, we implement a small set of features that enable simpler representation of network
monitoring and forensic queries without the need to use a complex scripting language.
4.6 Conclusion
In this chapter we presented the design and implementation of a querying framework
for monitoring and forensic analysis using network flow data stored on disk in a
column-oriented storage system. We showed that the performance of simple monitoring and
forensic queries can be improved by up to an order of magnitude compared to the average case.
We provided an efficient execution model for sequential interactive queries and showed that
batch queries can potentially achieve many times better performance than the naïve case,
using an adaptive query processing algorithm when the number of filters is small and the
queries share many filters.
Additionally, we presented the primitives of a simple SQL-based query language with
a small set of added features that enables simpler definition and representation of the
monitoring and forensic queries. To the best of our knowledge, our work is among the first
to consider filtering optimization methods for sequential and batch queries on network flow
data stored in a column-oriented system. Since the column data is stored in segments that
can be easily processed in parallel, in future work we seek to investigate the design and
development of efficient parallel processing algorithms in a distributed environment.
The implementation of the querying framework presented in this chapter, together with
the column-oriented storage system described in Chapter 3, forms a complete system for
storing and querying network flow data.
In the next chapter we discuss the methods described in this dissertation, propose new
directions for future work and conclude our presentation.
Chapter 5
Conclusions and Future Work
5.1 Concluding Remarks
In this dissertation, we presented new methods to store and query network data that
have the potential to enhance the efficiency of network monitoring and forensic analysis.
First, we presented novel methods for payload attribution. As part of a network forensics
system, the proposed methods provide an efficient probabilistic query mechanism to answer
queries for excerpts of a payload that passed through the network. The methods allow data
reduction ratios greater than 100:1 while having a very low false positive rate when querying.
At the same time, they allow queries for very small excerpts of a payload and also for excerpts
that span multiple packets. The experimental results show that the methods achieve a
significant improvement in query accuracy and storage space requirements compared to
previous attribution techniques. More specifically, the evaluation in Chapter 2 shows that
the winnowing method is the best technique for block boundary selection in payload
attribution applications, that shingling is clearly a more efficient method for consecutiveness
resolution than the use of offset numbers and, finally, that the use of multiple instances of
payload attribution methods can provide improved false positive rates and data-aging capability.
The above payload processing techniques combined together form a payload attribution
method called Winnowing Multi-Hashing, which substantially outperforms previous methods.
The experimental results also show that, in general, the accuracy of attribution increases
with the length and the specificity of a query. Moreover, privacy and simple access control
are achieved by the use of Bloom filters and one-way hashing with a secret key. Thus, even
if the system is compromised no raw traffic data is ever exposed and querying the system
is possible only with knowledge of the secret key.
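A minimal sketch of this keyed, one-way querying idea is given below. It is not the exact construction evaluated in Chapter 2; the filter size, the number of hash functions and the key handling are illustrative assumptions only.

    import hashlib
    import hmac

    M = 1 << 20                            # number of bits in the filter (assumed)
    K = 4                                  # number of hash functions (assumed)
    bloom = bytearray(M // 8)
    SECRET_KEY = b"site-specific secret"   # known only to authorized queriers

    def _bit_positions(excerpt: bytes):
        # Derive K bit positions with a keyed one-way hash, so neither
        # insertion nor querying is possible without the secret key.
        for i in range(K):
            digest = hmac.new(SECRET_KEY, bytes([i]) + excerpt, hashlib.sha256).digest()
            yield int.from_bytes(digest[:8], "big") % M

    def insert(excerpt: bytes):
        for pos in _bit_positions(excerpt):
            bloom[pos // 8] |= 1 << (pos % 8)

    def query(excerpt: bytes) -> bool:
        # May return false positives, never false negatives.
        return all(bloom[pos // 8] & (1 << (pos % 8)) for pos in _bit_positions(excerpt))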
Second, we presented the challenges and design guidelines of an efficient storage
and query infrastructure for network flow data. We presented the specific design, implementation
and evaluation of a novel working architecture, called NetStore, that is useful in
network monitoring tasks and assists in network forensics investigations. The simple column-oriented
design of NetStore helps reduce query processing time by reducing the time
spent on disk I/O and loading only the data needed for processing. Moreover, the column
partitioning facilitates the use of efficient compression methods for network flow attributes
that allow data processing in compressed format, further boosting query runtime performance.
NetStore clearly outperforms existing row-based database systems and
provides better results than general purpose column-oriented systems because of simple
design decisions tailored for network flow records that avoid auxiliary data structures
for tuple reconstruction. Experiments show that NetStore can provide more than ten times
faster query response compared to other storage systems while maintaining a much smaller
storage size.
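As a toy illustration of processing data in compressed format (not NetStore's actual encoding), a run-length encoded column segment can be filtered without decompression by emitting matching record positions directly from the runs:

    def positions_matching(runs, predicate):
        # runs: list of (value, run_length) pairs for one column segment.
        # Yields the record positions whose value satisfies the predicate,
        # touching each run once instead of each individual record.
        pos = 0
        for value, count in runs:
            if predicate(value):
                yield from range(pos, pos + count)
            pos += count

    # Example: a ports column stored as runs; positions where port == 80.
    # list(positions_matching([(80, 3), (443, 2), (80, 1)], lambda v: v == 80))
    # -> [0, 1, 2, 5]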
Finally, we presented the design and implementation of a querying framework for monitoring
and forensic analysis using network flow data stored on disk in a column-oriented storage
system. We showed that the performance of simple monitoring and forensic queries can be
improved by up to an order of magnitude compared to the average case if the
order of the filtering predicates is chosen with care, by computing a rank for each filter based
on data statistics and then executing the filters in increasing order of rank. We provided an
efficient execution model for sequential interactive queries and showed that batch queries
can potentially achieve many times better performance than the naïve case by using an
adaptive query processing algorithm when the number of filters is small and the queries
share many filters. We also presented the primitives of a simple SQL-based query language
called NetSQL. The proposed language was designed for monitoring and forensic query
workloads that use network flow data. In addition to the useful standard SQL commands,
the language implements a small set of added features that enables simpler definition and
representation of the monitoring and forensic queries. To the best of our knowledge, our
work is among the first to consider filtering optimization methods for sequential and batch
queries on network flow data stored in a column-oriented system. Moreover, the implementation
of the querying framework presented in Chapter 4, together with the column-oriented
storage system described in Chapter 3, forms a complete system for storing and querying
network flow data.
5.2 Future Work
The methods for processing, storing and querying network data presented in this dissertation
were designed mainly for network forensic investigations and monitoring query workloads.
However, we believe that these methods have a much broader range of
applicability in various areas where large amounts of data are processed. Each of the
possible application domains creates interesting open problems that can be pursued in
future work.
For example, the payload attribution methods presented in Chapter 2 can be used to
better decide the similarity score of document contents or of other large binary data files in
general. Depending on the underlying data, some of the proposed methods might yield different
performance metrics than in the case of payload processing, and the performance rankings
for the new application requirements might differ. As such, it would be interesting
to see the performance of each method in partitioning the content when making decisions
about the similarity of large chunks of data transported over the network, such as in the
case of web caching.
Additionally, the storage infrastructure described in Chapter 3 can be used not only
for network flow data but also for any kind of time-sequential data that shares the same
characteristics and query workloads as network flow data. As such, a similar type of storage
infrastructure is expected to yield good performance for most categories of structured time-sequential
data, for example log records used for log analysis, or application protocol
information (HTTP, DNS, FTP, etc.) used for various network analysis tasks. The interesting
problem in this case is to analyze the use of the same storage infrastructure for storing
attributes of arbitrary data types, for example strings of variable length or other binary
objects that are added to the permanent storage using append-only operations.
Moreover, the querying framework described in Chapter 4 can be used for other types
of queries than monitoring and forensics. Any query workload that contains many
queries sharing a large number of predicates, or queries that are executed sequentially, share
some filters and use data stored in a column store, can be executed faster using the
proposed algorithms.
Given the increasing emphasis on multi-core chip design and the increasing popularity of
cloud computing models, another interesting research direction is to find new techniques and
more efficient methods and algorithms that exploit parallel processing in a multi-threaded
and distributed environment. In Chapter 2 we argued that the payload data collected from
our campus border router is highly compressed in Bloom filters and that querying operations
currently provide satisfactory performance for data stored in a single Bloom filter. However,
with the increase in traffic volumes, querying is expected to become a computationally
intensive operation when many filters are queried at once. Therefore, an interesting problem
is to explore data processing models in a distributed environment where all the filters are
queried in parallel. Similarly, NetStore uses data stored as independent segments that can be
easily processed in parallel. In this case we believe it would be interesting to investigate the
design and development of efficient parallel processing algorithms that process the relevant
segments independently and merge the resulting values in the final step.
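A small sketch of this direction is given below; it is not an implemented part of NetStore, and process_segment and merge_results are hypothetical placeholders for the per-segment work and the final merge step.

    from concurrent.futures import ProcessPoolExecutor

    def query_segments_in_parallel(relevant_segments, process_segment, merge_results):
        # Each relevant column segment is processed independently, then the
        # partial results are combined in a single final merge step.
        with ProcessPoolExecutor() as pool:
            partial_results = list(pool.map(process_segment, relevant_segments))
        return merge_results(partial_results)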
Finally, the methods and systems presented in this dissertation were mainly implemented
as prototypes for concept evaluation using network data collected at our university
edge routers. Storing network data is becoming more popular and many organizations use
it for various purposes. We believe it would be useful to have an efficient system for storing
network data deployed in many places, in many organizations, that incorporates some of the
techniques presented in this dissertation. All these storage systems could then work
collaboratively to detect network abuse or other complex network behavior by using historical
network data to assess the reputation of each unknown network entity. For example, it would
be useful to have a global view of all the historical network data when an a priori unknown
host A is trying to establish a connection with an organization's host X. In this case,
the decision of whether the connection between host A and host X should be allowed
could be made after querying the historical network data stores of other organizations
that may have experienced connections with host A before. Based on this idea, a whole
reputation mechanism for the external hosts of an organization could be built.
Bibliography
[1] Daniel Abadi, Samuel Madden, and Miguel Ferreira. Integrating compression and
execution in column-oriented database systems. In SIGMOD ’06: Proceedings of the
2006 ACM SIGMOD International Conference on Management of Data, pages 671–
682, New York, NY, USA, 2006. ACM.
[2] Daniel J. Abadi, Samuel R. Madden, and Nabil Hachem. Column-stores vs. row-stores: how different are they really? In SIGMOD ’08: Proceedings of the 2008 ACM
SIGMOD International Conference on Management of Data, pages 967–980, New York,
NY, USA, 2008. ACM.
[3] Daniel J. Abadi, Daniel S. Myers, David J. DeWitt, and Samuel Madden. Materialization strategies in a column-oriented DBMS. In ICDE, pages 466–475, 2007.
[4] E. Anderson and M. Arlitt. Full Packet Capture and Offline Analysis on 1 and 10 Gb
Networks. Technical Report HPL-2006-156, 2006.
[5] S. Bellovin and W. Cheswick. Privacy-enhanced searches using encrypted Bloom filters.
Cryptology ePrint Archive, Report 2004/022, 2004. Available at http://eprint.iacr.org/.
[6] B. Bloom. Space/time tradeoffs in hash coding with allowable errors. In Communications of the ACM (CACM), pages 422–426, 1970.
[7] Lars Brenna, Alan Demers, Johannes Gehrke, Mingsheng Hong, Joel Ossher,
Biswanath Panda, Mirek Riedewald, Mohit Thatte, and Walker White. Cayuga: a
high-performance event processing engine. In SIGMOD ’07: Proceedings of the 2007
ACM SIGMOD International Conference on Management of Data, pages 1100–1102,
New York, NY, USA, 2007. ACM.
[8] A. Broder. Some applications of Rabin’s fingerprinting method. In Sequences II: Methods in Communications, Security, and Computer Science, pages 143–152. Springer-Verlag, 1993.
[9] A. Broder. On the resemblance and containment of documents. In Proceedings of the
Compression and Complexity of Sequences, 1997.
[10] A. Broder and M. Mitzenmacher. Network Applications of Bloom Filters: A Survey.
In Annual Allerton Conference on Communication, Control, and Computing, Urbana-Champaign, Illinois, USA, October 2002.
[11] George Candea, Neoklis Polyzotis, and Radek Vingralek. A scalable, predictable join
operator for highly concurrent data warehouses. Proc. VLDB Endow., 2(1):277–288,
2009.
[12] Sirish Chandrasekaran, Sirish Ch, Owen Cooper, Amol Deshpande, Michael J. Franklin,
Joseph M. Hellerstein, Wei Hong, Sailesh Krishnamurthy, Sam Madden, Vijayshankar
Raman, Fred Reiss, and Mehul Shah. Telegraphcq: Continuous dataflow processing
for an uncertain world, 2003.
[13] F. Chang, J. Dean, S. Ghemawat, W.C. Hsieh, D.A. Wallach, M. Burrows, T. Chandra,
A. Fikes, and R.E. Gruber. Bigtable: A distributed storage system for structured
data. In Proceedings of the 7th USENIX Symposium on Operating Systems Design and
Implementation (OSDI06), 2006.
[14] C. Y. Cho, S. Y. Lee, C. P. Tan, and Y. T. Tan. Network forensics on packet fingerprints. In 21st IFIP Information Security Conference (SEC 2006), Karlstad, Sweden,
2006.
[15] Chuck Cranor, Theodore Johnson, Oliver Spataschek, and Vladislav Shkapenyuk. Gigascope: a stream database for network applications. In SIGMOD ’03: Proceedings
of the 2003 ACM SIGMOD International Conference on Management of Data, pages
647–651, New York, NY, USA, 2003. ACM.
[16] S. Garfinkel. Network forensics: Tapping the internet. O’Reilly Network, 2002.
[17] Carrie Gates, Michael Collins, Michael Duggan, Andrew Kompanek, and Mark
Thomas. More NetFlow tools for performance and security. In LISA ’04: Proceedings
of the 18th USENIX Conference on System Administration, pages 121–132, Berkeley,
CA, USA, 2004. USENIX Association.
[18] Roxana Geambasu, Tanya Bragin, Jaeyeon Jung, and Magdalena Balazinska. On-demand view materialization and indexing for network forensic analysis. In NETB’07:
Proceedings of the 3rd USENIX International Workshop on Networking Meets
Databases, pages 1–7, Berkeley, CA, USA, 2007. USENIX Association.
[19] Paul Giura and Nasir Memon. Netstore: An efficient storage infrastructure for network
forensics and monitoring. In Proceedings of the 13th International Symposium on Recent
Advances in Intrusion Detection, Ottawa, Canada, September 2010.
[20] Jonathan Goldstein, Raghu Ramakrishnan, and Uri Shaft. Compressing relations and
indexes. In Proceedings of the IEEE International Conference on Data Engineering,
pages 370–379, 1998.
[21] Guofei Gu, Phillip Porras, Vinod Yegneswaran, Martin Fong, and Wenke Lee. BotHunter: Detecting Malware Infection Through IDS-Driven Dialog Correlation. In Proceedings of the 16th USENIX Security Symposium, pages 167–182, August 2007.
[22] Alan Halverson, Jennifer L. Beckmann, Jeffrey F. Naughton, and David J. DeWitt.
A comparison of c-store and row-store in a common framework. Technical Report
TR1570, University of Wisconsin-Madison, 2006.
[23] M. Handley, C. Kreibich, and V. Paxson. Network Intrusion Detection: Evasion, Traffic
Normalization, and End-to-End Protocol Semantics. In Proceedings of the USENIX
Security Symposium, Washington, USA, 2001.
[24] Joseph M. Hellerstein and Michael Stonebraker. Predicate migration: Optimizing
queries with expensive predicates. In SIGMOD Conference, pages 267–276, 1993.
[25] Allison L. Holloway and David J. DeWitt. Read-optimized databases, in depth. Proc.
VLDB Endow., 1(1):502–513, 2008.
[26] IETF. IP flow information export (IPFIX). http://datatracker.ietf.org/wg/ipfix/charter.
[27] Infobright Inc. Infobright. http://www.infobright.com.
[28] Sybase Inc. Sybase IQ. http://www.sybase.com.
[29] N. King and E. Weiss. Network Forensics Analysis Tools (NFATs) reveal insecurities,
turn sysadmins into system detectives. Information Security, Feb. 2002. Available at
www.infosecuritymag.com/2002/feb/cover.shtml.
[30] LucidEra. LucidDB. http://www.luciddb.org.
[31] U. Manber. Finding similar files in a large file system. In Proceedings of the USENIX
Winter 1994 Technical Conference, pages 1–10, San Fransisco, CA, USA, 1994.
[32] Vladislav Marinov and Jürgen Schönwälder. Design of a stream-based ip flow record
query language. In DSOM ’09: Proceedings of the 20th IFIP/IEEE International
Workshop on Distributed Systems: Operations and Management, pages 15–28, Berlin,
Heidelberg, 2009. Springer-Verlag.
[33] Steven McCanne and Van Jacobson. The bsd packet filter: a new architecture for
user-level packet capture. In USENIX’93: Proceedings of the USENIX Winter 1993
Conference, pages 2–2, Berkeley, CA, USA, 1993. USENIX Association.
[34] M. Mitzenmacher. Compressed Bloom Filters. IEEE/ACM Transactions on Networking (TON), 10(5):604–612, 2002.
[35] Kamesh Munagala, Utkarsh Srivastava, and Jennifer Widom. Optimization of continuous queries with shared expensive filters. In PODS ’07: Proceedings of the twenty-sixth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pages 215–224, New York, NY, USA, 2007. ACM.
[36] Bill Nickless. Combining Cisco NetFlow exports with relational database technology for
usage statistics, intrusion detection, and network forensics. In LISA ’00: Proceedings
of the 14th USENIX Conference on System Administration, pages 285–290, Berkeley,
CA, USA, 2000. USENIX Association.
[37] NTOP. PF_RING Linux kernel patch. Available at http://www.ntop.org/PF_RING.html, 2008.
[38] Vern Paxson. Bro: A system for detecting network intruders in real-time. In Computer
Networks, pages 2435–2463, 1998.
[39] Dave Plonka. Flowscan: A network traffic flow reporting and visualization tool. In
LISA ’00: Proceedings of the 14th USENIX Conference on System Administration,
pages 305–318, Berkeley, CA, USA, 2000. USENIX Association.
[40] M. Ponec, P. Giura, H. Brönnimann, and J. Wein. Highly Efficient Techniques for
Network Forensics. In Proceedings of the 14th ACM Conference on Computer and
Communications Security, pages 150–160, Alexandria, Virginia, USA, October 2007.
[41] Miroslav Ponec, Paul Giura, Joel Wein, and Hervé Brönnimann. New payload attribution methods for network forensic investigations. ACM Trans. Inf. Syst. Secur.,
13(2):1–32, 2010.
[42] PostgreSQL. PostgreSQL. http://www.postgresql.org.
[43] M. O. Rabin. Fingerprinting by random polynomials. Technical report 15-81, Harvard
University, 1981.
[44] S. Rhea, K. Liang, and E. Brewer. Value-based web caching. In Proceedings of the
Twelfth International World Wide Web Conference, May 2003.
[45] Robert Richardson and Sara Peters. 2007 CSI Computer Crime and Security Survey
Shows Average Cyber-Losses Jumping After Five-Year Decline. CSI Press Release,
September 2007. Available at http://www.gocsi.com/press/20070913.jhtml.
[46] Martin Roesch. Snort - lightweight intrusion detection for networks. In LISA ’99:
Proceedings of the 13th USENIX Conference on System Administration, pages 229–238,
Berkeley, CA, USA, 1999. USENIX Association.
[47] S. Schleimer, D. S. Wilkerson, and A. Aiken. Winnowing: local algorithms for document
fingerprinting. In SIGMOD ’03: Proceedings of the 2003 ACM SIGMOD International
Conference on Management of Data, pages 76–85, New York, NY, USA, 2003. ACM
Press.
[48] Timos K. Sellis. Multiple-query optimization. ACM Transactions on Database Systems,
13:23–52, 1988.
[49] K. Shanmugasundaram, H. Brönnimann, and N. Memon. Payload Attribution via
Hierarchical Bloom Filters. In Proc. of ACM CCS, 2004.
[50] K. Shanmugasundaram, N. Memon, A. Savant, and H. Brönnimann. ForNet: A Distributed Forensics Network. In Proc. of MMM-ACNS Workshop, pages 1–16, 2003.
[51] Abraham Silberschatz, Henry Korth, and S. Sudarshan. Database Systems Concepts.
McGraw-Hill, Inc., New York, NY, USA, 2010.
[52] Dominik Ślȩzak, Jakub Wróblewski, Victoria Eastwood, and Piotr Synak. Brighthouse:
an analytic data warehouse for ad-hoc queries. Proc. VLDB Endow., 1(2):1337–1345,
2008.
[53] A. C. Snoeren, C. Partridge, L. A. Sanchez, C. E. Jones, F. Tchakountio, S. T. Kent,
and W. T. Strayer. Hash-based IP traceback. In ACM SIGCOMM, San Diego, California, USA, August 2001.
[54] S. Staniford-Chen and L.T. Heberlein. Holding intruders accountable on the internet.
In Proceedings of the 1995 IEEE Symposium on Security and Privacy, Oakland, 1995.
[55] Mike Stonebraker, Daniel J. Abadi, Adam Batkin, Xuedong Chen, Mitch Cherniack,
Miguel Ferreira, Edmond Lau, Amerson Lin, Sam Madden, Elizabeth O’Neil, Pat
O’Neil, Alex Rasin, Nga Tran, and Stan Zdonik. C-store: a column-oriented DBMS.
In VLDB ’05: Proceedings of the 31st International Conference on Very Large Data
Bases, pages 553–564. VLDB Endowment, 2005.
[56] Mark Sullivan and Andrew Heybey. Tribeca: A system for managing large databases
of network traffic. In USENIX, pages 13–24, 1998.
[57] Cisco Systems. Cisco IOS NetFlow. http://www.cisco.com.
[58] Vertica Systems. Vertica. http://www.vertica.com.
[59] Jacob Ziv and Abraham Lempel. A universal algorithm for sequential data compression.
IEEE Transactions on Information Theory, 23:337–343, 1977.
[60] Marcin Zukowski, Peter A. Boncz, Niels Nes, and Sándor Héman. MonetDB/X100 - a DBMS in the CPU cache. IEEE Data Eng. Bull., 28(2):17–22, 2005.