A New Methodology for Packet Trace Classification and Compression based on Semantic Traffic Characterization

by Raimir Holanda Filho

Submitted to the Computer Architecture Department in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy in Computer Science at the Technical University of Catalonia

September 2005

Table of Contents

1. Introduction
   1.1. The problem
   1.2. Overview of thesis
   1.3. Contribution of our work
2. Traffic Modeling and Characterization
   2.1. Classical traffic characteristics
   2.2. Traffic modeling
   2.3. Data collection
3. Semantic Traffic Characterization
   3.1. Semantic Characterization
4. Flow Clustering
   4.1. Introduction
   4.2. Methodology
   4.3. Clustering
   4.4. Conclusions
5. Entropy of TCP/IP Header Fields
   5.1. Introduction
   5.2. Packet Level Entropy
   5.3. Flow Level Entropy
   5.4. Trace Compression Bound
   5.5. Conclusions
6. Lossless Compression Method
   6.1. Generic Compression
   6.2. TCP/IP Header Compression
   6.3. Proposed Header Trace Compression
   6.4. Decompression Algorithm
   6.5. Compression Ratio
7. Lossy Compression Method
   7.1. Packet Trace Compression
   7.2. Decompression Algorithm
   7.3. Compression Ratio
   7.4. Comparative Packet Trace Characteristics
   7.5. Memory Performance Validation
8. Trace Classification
   8.1. Packet Classification
   8.2. Flow Classification
   8.3. Packet Trace Classifier
9. Conclusions
Bibliography

List of Figures

Figure 1.1 Relation between the main contributions
Figure 3.1 RedIRIS topology
Figure 3.2 TSH header data format
Figure 3.3 Flow mapping
Figure 4.1 The asymptotic equipartition property
Figure 5.1 Flow clustering methodology
Figure 5.2 Number of clusters - RedIRIS
Figure 5.3 Number of clusters for ATM OC-3 traces
Figure 5.4 Selected fields used for flow clustering
Figure 5.5 Number of clusters for m=2 packets
Figure 5.6 Number of clusters for m=3 packets
Figure 5.7 Number of clusters for m=4 packets
Figure 5.8 Number of clusters for m=5 packets
Figure 5.9 Number of clusters for m=6 packets
Figure 5.10 Number of clusters for m=7 packets
Figure 5.11 Relation between clusters and flows
Figure 6.1 Temporary data structure
Figure 6.2 Compression model
Figure 6.3 Small flow compressed data format
Figure 6.4 Web compression model
Figure 6.5 Small Web flow compressed data format
Figure 6.6 Packet compressed data format
Figure 6.7 Large flow compressed data format
Figure 6.8 Decompression model
Figure 6.9 Flow clustering vs. Huffman based behavior
Figure 6.10 Compression techniques comparison (lossless)
Figure 7.1 Compression techniques comparison (lossy)
Figure 7.2 R/S plotting
Figure 7.3 Unique addresses set
Figure 7.4 Temporal locality
Figure 7.5 as a function of prefix length
Figure 7.6 Multifractal spectrum
Figure 7.7 Memory access
Figure 7.8 Cache miss rate
Figure 8.1 Conventional routing procedure
Figure 8.2 IP switching
Figure 8.3 Packet distribution
Figure 8.4 Number of clusters
Figure 8.5 RedIRIS trace - 3D shaping
Figure 8.6 Memphis trace - 3D shaping
Figure 8.7 Columbia trace - 3D shaping
Figure 8.8 Flow clustering spectrum

List of Tables

Table 4.1 IP version elements
Table 4.2 IP version entropy
Table 4.3 IHL elements
Table 4.4 IHL entropy
Table 4.5 TOS elements
Table 4.6 TOS entropy
Table 4.7 Length elements
Table 4.8 Length entropy
Table 4.9 Flags elements
Table 4.10 Flags entropy
Table 4.11 Fragment offset elements
Table 4.12 Fragment offset entropy
Table 4.13 Protocol elements
Table 4.14 Protocol entropy
Table 4.15 Data offset elements
Table 4.16 Data offset entropy
Table 4.17 Control bits elements
Table 4.18 Control bits entropy
Table 4.19 Source port entropy
Table 4.20 Destination port entropy
Table 4.21 Source address entropy
Table 4.22 Destination address entropy
Table 4.23 Summary
Table 4.24 Version,IHL joint probability
Table 4.25 Version,IHL joint entropy
Table 4.26 Version,IHL,Flags joint probability
Table 4.27 Version,IHL,Flags joint entropy
Table 4.28 Version,IHL,Flags,FragOff joint probability
Table 4.29 Version,IHL,Flags,FragOff joint entropy
Table 4.30 Version,IHL,Flags,FragOff,Protocol joint probability
Table 4.31 Version,IHL,Flags,FragOff,Protocol joint entropy
Table 4.32 Version,IHL,Flags,FragOff,Protocol,TOS joint probability
Table 4.33 Version,IHL,Flags,FragOff,Protocol,TOS joint entropy
Table 4.34 Version,IHL,Flags,FragOff,Protocol,TOS,DataOff joint probability
Table 4.35 Version,IHL,Flags,FragOff,Protocol,TOS,DataOff joint entropy
Table 4.36 Version,IHL,Flags,FragOff,Protocol,TOS,DataOff,Control bits joint probability
Table 4.37 Version,IHL,Flags,FragOff,Protocol,TOS,DataOff,Control bits joint entropy
Table 4.38 Version,IHL,Flags,FragOff,Protocol,TOS,DataOff,Control bits,Length joint probability
Table 4.39 Version,IHL,Flags,FragOff,Protocol,TOS,DataOff,Control bits,Length joint entropy
Table 4.40 Independent random variables
Table 4.41 Summary
Table 4.42 AEP
Table 5.1 Flow probability distribution
Table 5.2 Number of clusters for m=2 packets
Table 5.3 Clusters description (m=2)
Table 5.4 Flow Entropy (m=2)
Table 5.5 Number of clusters for m=3 packets
Table 5.6 Clusters description (m=3)
Table 5.7 Flow Entropy (m=3)
Table 5.8 Number of clusters for m=4 packets
Table 5.9 Clusters description (m=4)
Table 5.10 Flow Entropy (m=4)
Table 5.11 Number of clusters for m=5 packets
Table 5.12 Clusters description (m=5)
Table 5.13 Flow Entropy (m=5)
Table 5.14 Number of clusters for m=6 packets
Table 5.15 Number of clusters for m=7 packets
Table 5.16 Cluster and entropy behavior for m-packet flows
Table 6.1 Set of different values
Table 6.2 Huffman encoding
Table 7.1 Hurst parameter estimators
Table 8.1 Packet classification example

List of Symbols

X          Random variable
X_t        Time series
mu         Mean
E[.]       Expectation
sigma^2    Variance
r(k)       Autocorrelation function
gamma(k)   Autocovariance function
H          Hurst parameter
           Lattice box counting dimension
m-packet   Flow with m packets
           Packet header of the i-th packet of a flow consisting of m packets
           Selected header field of a packet header
           Mapping function
           Mapped value of a header field
           Vector of mapped values
           Flow of m packets
           Numerical representation of the m packets
           Header field variation
H(X)       Entropy
H(Y|X)     Conditional entropy
H(X,Y)     Joint entropy
D(p||q)    Relative entropy
I(X;Y)     Mutual information

Acknowledgments

Abstract

Internet traffic measurement has been a subject of interest as long as there has been Internet traffic. Nowadays, packet traces are collected and used for performance evaluation purposes in many systems. For instance, we can use packet traces to evaluate basic packet forwarding or complex functions such as quality of service (QoS) processing, packet classification, security, billing and accounting of modern network devices. Moreover, the need for network analysis has increased, and realistic models and methodologies for understanding network behavior play an even more essential role in facilitating the evolution toward future gigabit/sec speeds. In this context, we have developed a novel flow characterization approach that incorporates semantic characteristics of flows.
We understand by semantic characterization the joint analysis of traffic characteristics including the inter-packet time and some of the most important fields (source and destination addresses, port numbers, packet length, TCP flags, etc.) of the TCP/IP header content. Firstly, using clustering techniques, we demonstrated that, behind the great number of flows in a high-speed link, there is not much variety among them. In addition, traces captured from different links showed similar variety. Using the evidence that many flows can be grouped into few clusters, we have implemented a template of flows. This template consists of a dataset storing the most common classes of flows. These results confirmed that by exploiting TCP flow properties we can obtain high compression rates for TCP/IP header traces.

The analysis, carried out using concepts provided by information theory, gave the proposed methodology a more formal grounding. We calculated the entropy at the header-field and flow levels. Moreover, we demonstrated that the methodology could be used to develop header trace compression and trace classification.

The proposed compression method is focused on the problem of compressing huge packet header traces. Here, we propose two compression methods: lossless and lossy. The lossless compression is a combined method based on TCP flow clustering for small flows and Huffman encoding for large flows. With our proposed method, storage size requirements for .tsh packet header traces are reduced to 16% of their original size. Other known methods have compression ratios bounded at 50% and 32%. The lossy compression is a packet trace compression method based on the most representative flow clusters, on self-similar inter-packet time properties and on fractal IP address generation. With this proposed method, storage size requirements are reduced to 3% of the original size. Although this method does not define a completely lossless compressed data format, it preserves important statistical properties present in the original trace, such as self-similarity, spatial and temporal locality, and IP address structure. Furthermore, a memory performance evaluation was carried out with four types of traces, and the outcomes of the memory access and cache miss ratio measurements demonstrated a similar behavior between our decompressed trace and the original trace.

Finally, the proposed trace classification can be used to identify how similar traces collected from different links are, and which types of applications are present in a trace. Using traces with different properties is strongly recommended for evaluation purposes and extended validation of many systems. Our approach to trace classification consists of identifying semantically, for each trace, its typical flows.

Chapter 1

Introduction

This chapter describes the problem addressed by this thesis and presents our main contributions toward solving it.

1.1. The problem

The Internet is a global internetwork, sharing information among millions of computers throughout the world. Internet users send packets of information from one machine to another, carrying traffic from a variety of data, video and voice applications. Since the appearance of the world-wide web (WWW) and, more recently, of P2P and real-time applications, Internet traffic has continued to grow exponentially. This rapid growth and the proliferation of new applications have combined to change the character of the Internet in recent years.
The volume of traffic and the high capacity of the links have made traffic characterization and analysis more difficult and yet more critical, turning both into a challenging endeavor. However, we have seen that, nowadays, the variety of tools used on the Internet is not very large and is concentrated in a few programs. For instance, almost all operating systems are Windows or Linux based, TCP is the dominant protocol, there are few TCP versions, and Web and P2P are the most common applications. Moreover, since many people use the same search engines (Google, Google Scholar, etc.), users tend to show similar behavior when using the Internet. Hence, the need for realistic models and methodologies for understanding network behavior plays an even more essential role in facilitating the evolution toward future gigabit/sec speeds. From our point of view, it seems inevitable that simulation and empirical techniques to describe traffic behavior will play a larger role than traditional mathematical techniques have played in the past.

Internet traffic measurement has been a subject of interest as long as there has been Internet traffic. Important results have been obtained from experimental analysis of packet traces. For instance, studies in the 1990s of traffic measurements from a variety of different networks provided ample evidence that actual network traffic is self-similar or fractal in nature, i.e., bursty over a wide range of time scales. This observation is in contrast to common modeling choices in engineering theory, where exponential assumptions are still used to reproduce the bursty behavior. Moreover, understanding the nature of network traffic is critical in order to properly design and implement network devices and network interconnections. Our work is based on empirical investigations of high-speed Internet links which aggregate substantial amounts of traffic.

Nowadays, packet traces are collected and used for performance evaluation purposes in many systems. For instance, we can use packet traces to evaluate anything from basic packet forwarding to complex functions such as quality of service (QoS) processing, packet classification, security, billing and accounting in modern network devices. High performance of these network devices is necessary to support these new functionalities. For this reason, several techniques have been proposed for high-speed network devices. For example, MPLS (Multi Protocol Label Switching) combines the flexibility of layer-3 routing with the high capacity of layer-2 switching. However, its performance can be strongly affected by the chosen set of control parameters. To choose an appropriate set of control parameters, we need to know the characteristics of the Internet traffic. Moreover, these network devices must not only achieve high performance, but they must also have the flexibility to deal with the large and ever-increasing demands for new and more complex network functionalities. To execute these tasks, general-purpose processors or specialized processors known as Network Processors [41] are normally used; otherwise, the basic forwarding-plane functions are performed in software. In both cases, the performance of these systems depends not only on parameters such as packet length or inter-packet time, but also on some semantic properties of flows, such as spatial and temporal locality of IP addresses, IP address structure, TCP flag sequences, type of service, etc.
A critical requirement for the performance evaluation and design of those network elements is the availability of realistic traffic traces. Nowadays, there is a growing interest in capturing Internet traffic in pursuit of insights into its evolution. A popular scheme to obtain real traces for extended periods of time is to collect them from routers [20]. There are, however, several reasons that in many cases make it difficult to gain access to them. Firstly, Internet providers are usually reluctant to make public real traces captured on their networks. Moreover, when these traffic traces are made public [89], they are delivered after some transformations, such as sanitization [91], which modify some basic semantic properties (such as IP address structure). Secondly, other problems arise due to the increasing speed of Internet routers. Hardware for collecting traces at high speed (e.g., at link rates of 2.5 Gbps, 10 Gbps or even 40 Gbps) is usually expensive. Moreover, with the increase of link rates, the required storage for packet traces of meaningful duration becomes too large.

1.2. Overview of thesis

This chapter presents the thesis problem and the main contributions to the field of research on packet trace characterization, compression and classification.

Chapter 2 presents the related work on traffic modeling and classical traffic characteristics, describing in more detail the self-similar traffic characteristics and the fractal properties of the IP address structure. It also describes the set of traces used in our analysis.

Chapter 3 presents our flow characterization approach, which incorporates semantic characteristics of flows.

Chapter 4 demonstrates that, behind the great number of flows in a high-speed link, there is not much variety among them. The evidence that Internet traffic shows a small variety of flows has guided us to group the flows into a set of clusters.

Using concepts provided by information theory, we calculate in Chapter 5 the entropy at the header-field and flow levels. Furthermore, we demonstrate that those outcomes can be used to develop header trace compression and trace classification.

In Chapter 6 we address the problem of compressing huge packet traces. We propose a novel lossless packet header compression, focused not on the problem of reducing transmission bandwidth or latency, but on the problem of saving storage space. With our proposed method, storage size requirements are reduced to 16% of the original size.

Chapter 7 studies the properties of a new lossy packet trace compression method. The compression ratio that we achieve is around 3%, reducing the file size, for instance, from 100 MB to 3 MB. Although this method does not define a lossless compressed data format, it preserves important statistical properties present in the original trace, such as self-similarity, spatial and temporal locality, and IP address structure. Furthermore, memory performance studies are presented with the Radix Tree algorithm executing on a trace generated by our method. To support these studies, measurements of memory accesses and cache miss ratio were taken.

Our approach to trace classification is presented in Chapter 8. It consists of identifying, for each trace, its typical flows.

Finally, in Chapter 9 we summarize our results and mention open areas for continued research in this area.
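To make concrete what calculating entropy at the header-field level involves, the following minimal sketch (our own illustration, not the tooling used in this thesis; the packet values are invented) computes the empirical entropy of a single TCP/IP header field over a trace:

    # Minimal sketch: empirical entropy, in bits, of one header field,
    # following H(X) = -sum p(x) log2 p(x). Values below are illustrative.
    from collections import Counter
    from math import log2

    def field_entropy(values):
        """Empirical entropy of a sequence of header-field values."""
        counts = Counter(values)
        total = len(values)
        return -sum((c / total) * log2(c / total) for c in counts.values())

    # e.g. the IP "protocol" field of five packets (6 = TCP, 17 = UDP)
    print(field_entropy([6, 6, 6, 17, 6]))  # low entropy: a predictable field

A nearly constant field, such as IP version, yields entropy near zero and is therefore highly compressible, which is the intuition behind the compression bounds developed in Chapter 5.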
1.3. Contribution of our work

The core contributions of our work encompass four areas:

- Semantic traffic characterization;
- Lossless packet trace compression;
- Lossy packet trace compression;
- Packet trace classification.

The first component deals with packet traffic characterization. It differs from previous studies in its focus on semantic characterization. We understand by semantic characterization the analysis of traffic characteristics including the inter-packet time and some of the most important fields (source and destination addresses, port numbers, packet length, TCP flags, etc.) of the TCP/IP header content. Many published papers show important characteristics of the traffic, such as inter-packet time, traffic intensity, packet length, etc.; but, in our opinion, more semantic aspects of flows are required for a useful traffic characterization. We propose, for Internet traffic, a novel semantic characterization. We demonstrate that, behind the great number of flows in a high-speed link, there is not much variety among them, and clearly they can be grouped into a set of clusters [61]. In this analysis, we have assumed the most common case of storing TSH (Time Sequence Header) packet header files [112].

Using the evidence that many flows can be grouped into few clusters, we have constructed a new lossless packet header trace compressor. We propose a novel packet header compressor, focused not on the problem of reducing transmission bandwidth or latency, but on the problem of saving storage space. In our case, we take advantage of knowing in advance all the packets in a flow to be compressed, and the compression ratio that we achieve is around 16%, reducing the file size, for instance, from 100 MB to 16 MB [59]. To reach this performance, the method uses two classes of algorithms: the first class fits small flows well, and the second fits large flows. Analysis using both classes has demonstrated that the best combined performance is reached when we consider small flows ranging from 1 to 7 packets per flow. The final method presented is lossless in the sense that for some fields the decompression algorithm regenerates exactly the original value, while for others, those for which the initial values are random, such as the initial TCP sequence number, the values are shifted, as if we were capturing the trace at another execution time. Evidently, these changes do not affect, in most cases, the analyses made on the decompressed file. Other known methods have compression ratios bounded at 50% and 32%.

Then, in order to reach a higher compression ratio, the third component proposes a lossy compression method [62]. For some specific research purposes, a lossy method that preserves important statistical properties present in the original traces can be more appropriate. The compression ratio that we achieve with this method is around 3%.

The last component proposes, for Internet traffic, a new approach to classifying traffic based on the flow clustering spectrum. Using a three-step methodology, the proposed packet trace classification identifies how similar traces collected from different links are [60]. Figure 1.1 illustrates how these four components are interconnected.
Figure 1.1: Relation between the main contributions

Chapter 2

Traffic modeling and characterization

This chapter is devoted to presenting the classical traffic characteristics and the related work on traffic modeling, describing in more detail the self-similar traffic characteristics and the fractal properties of the IP address structure. Moreover, we describe the set of traces used in our analysis.

2.1. Classical traffic characteristics

The complexity of Internet traffic requires that we characterize it as a function of multiple dimensions to understand the network mechanisms. Those traffic characteristics are strongly influenced by a set of factors such as delay, jitter, loss, throughput, utilization, reachability, availability, burstiness, and length. Below, we describe each of these factors in more detail.

2.1.1. Delay and jitter

Delay and jitter are typically end-to-end performance notions. Delay includes transfer delay, caused by the intermediate switching nodes and end hosts. Many real-time multimedia applications may require predictable delay, notably inconsistent with the datagram best-effort architecture of the Internet. In addition, many continuous-media applications rely on synchronization between audio and video streams. Thus the variance in delay is also an important Internet property [26]. Studies have provided evidence for the variance and asymmetry of delay on wide-area Internet infrastructures [85] [25]. Floyd and Jacobson [44] [45] analyze traffic phase effects in packet-switched gateways and end systems, their potential damaging effects on Internet performance, and suggest possible ways to mitigate the systematic tendency of routing protocols to synchronize in large systems. Zhang et al. [123] study the phase effects of congestion control algorithms and other aspects of TCP implementations [84].

2.1.2. Loss

Multimedia applications may require low or predictable delay but not necessarily completely lossless service, while other applications require transmission guarantees but not strict delay bounds. In addition to the effect of loss on protocol performance and network dynamics [115] [104] [4] [85], studies have also investigated the potential impact of loss on charging policies [103] [76]. Related to loss are the reliability and availability of given links or nodes in the network.

2.1.3. Throughput

According to Jain [67], throughput is defined as the rate (requests per unit of time) at which requests can be serviced by the system, and the related metrics include:

- nominal capacity or bandwidth: achievable throughput under ideal workload conditions;
- usable capacity: achievable throughput under actual workload conditions;
- efficiency: the ratio of maximum achievable throughput (usable capacity) to nominal capacity.

2.1.4. Utilization

The metric that typically comes to mind in describing a network, at least for a network operator, is utilization. Utilization metrics can reflect any measured granularity, and statistics of their distribution, including mean, variance, and percentile statistics, can reveal trends over both short and long intervals. Related to utilization is the congestion of the network, or contention for either bandwidth or switching resources. Measurements of congestion include distributions of queue length or available buffers in nodes.
Several studies of local area environments focus on short-term utilization characteristics [14] [79]. Longer-term utilization metrics would include traffic volume growth over several years on a backbone infrastructure [23]. An Internet service provider will tend to pay attention to utilization metrics as indicators of how close its network is to saturation, so it can plan for upgrades.

2.1.5. Reachability

As networks increase their range of possible destinations, so does the size of routing tables, and thus the cost of maintaining them in switching nodes and the cost of searching them when forwarding datagrams. Metrics such as the size of these routing tables, or the number of IP network numbers to which an Internet component can route traffic, are indicators of network reachability.

2.1.6. Reliability

The reliability of a system is usually measured by the probability of errors or by the mean time between errors.

2.1.7. Availability

The availability of a system is defined as the fraction of the time the system is available to service users' requests. The time during which the system is not available is called downtime; the time during which the system is available is called uptime. Often the mean uptime, better known as the Mean Time To Failure (MTTF), is a better indicator.

2.1.8. Burstiness

The statement that traffic is bursty means that the traffic is characterized by short periods of activity separated by long idle periods. There are two implications of the bursty nature of data traffic. First, the fact that the long-term average usage by a single source is small suggests that the dedication of facilities to a single source is not economical and that some sort of sharing is appropriate. The second aspect of burstiness is that sources transmit at a relatively high instantaneous rate. This is, in effect, a requirement on the form that this sharing of facilities can take [55]. Burstiness metrics fall into two categories: those that measure inter-arrival processes, e.g., the time between packet arrivals, and those that measure arrival processes, e.g., the number of packets per time interval.

2.1.9. Length

Payload is the amount of information carried in a packet. What constitutes information depends on the layer; e.g., the payload of an IP packet would be the contents of the packet following the IP header. A loose definition of payload sometimes includes the entire packet, headers included. Packet payload is one indicator of protocol efficiency, although a more accurate analysis of efficiency will also reflect end-to-end behavior, including acknowledgments, retransmissions, and update strategies of different protocols. The differences in payload per application are also visible at the aggregate level. Evidence is presented in [23] and [15] for significantly different distributions of traffic by packets and bytes across individual networks. The disparity between the number of packets and the number of bytes sent by networks indicates a definite difference in workload profiles, where specific networks, likely those with major data repositories, source mostly large packet sizes into the backbone.

2.2. Traffic Modeling

In recent years, many researchers have developed mathematical models which have provided a great deal of insight into the design and performance of network systems. The fundamental aim of this section is to present these models and to give an overview of the state of the art in traffic modeling.
The work of Erlang [17] [37] [38] in the context of telephone switching systems constitutes the pioneering work on traffic modeling. Erlang found that, given a sufficiently large population, the random rate of calls can be described by a Poisson process, and the duration of a call was found to have an exponential distribution. An important result obtained by Erlang is the probability of a busy signal as a function of the traffic level and the number of lines that have been provided.

In the generic queueing theory model, the theory attempts to find probabilistic descriptions of quantities such as the sizes of waiting lines, the delay experienced by an arrival, and the availability of a service facility [55]. In the voice telephone network, for instance, demands for service take the form of subscribers initiating a call. In this application the server is the telephone line. The analog of the service time of a customer is the duration of the call. In connection with queues, a convenient notation has been developed. In its simplest form it is written A/R/S, where A designates the arrival process, R the service required by an arriving customer, and S the number of servers. For example, M/M/1 indicates a Poisson arrival of customers, an exponentially distributed service discipline and a single server. The M may be taken to stand for memoryless (or Markovian) in both the arrival process and the service time.

Although queueing theory was developed for voice traffic, it is also applicable to computer communications. The generation of data messages at a computer corresponds to the initiation of voice calls to the network. The time required to transmit the message over a communications facility corresponds to the required time of service. Measurements of traffic in data systems have shown that some message generation can be modeled as a Poisson process. In particular, it was shown that the Poisson arrival process is a special case of the pure birth process. This led directly to the consideration of birth-death processes, which model certain queueing systems in which customers having exponentially distributed service requirements arrive at a service facility at a Poisson rate.

Consider a network of communication links and assume that there are several packet streams, each following a unique path that consists of a sequence of links through the network. Let $x_s$, in packets/sec, be the arrival rate of packet stream $s$. Then the total arrival rate at link $(i,j)$ is

$$\lambda_{ij} = \sum_{\substack{\text{all packet streams } s \\ \text{crossing link } (i,j)}} x_s \qquad (2.1)$$

The preceding network model is well suited for virtual circuit networks, with each packet stream modeling a separate virtual circuit. For datagram networks, it is necessary to use a more general model that allows bifurcation of the traffic of a packet stream, where there may be several paths followed by the packets of a stream. Let $x_s$ denote the arrival rate of packet stream $s$, and let $f_{ij}(s)$ denote the fraction of the packets of stream $s$ that go through link $(i,j)$. Then the total arrival rate at link $(i,j)$ is

$$\lambda_{ij} = \sum_{\text{all packet streams } s} f_{ij}(s)\, x_s \qquad (2.2)$$
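As a concrete illustration of the bookkeeping behind equation (2.2), the following minimal sketch (our own illustration; the stream rates, link names and routing fractions are hypothetical) computes the total arrival rate on a link from per-stream rates and routing fractions:

    # Sketch of equation (2.2): total arrival rate on link (i, j) as the
    # sum, over packet streams s, of the stream rate x_s weighted by the
    # fraction f_ij(s) of stream s routed through (i, j).

    def link_arrival_rate(link, stream_rates, fractions):
        """stream_rates: {stream: packets/sec}; fractions: {(link, stream): f}."""
        return sum(rate * fractions.get((link, s), 0.0)
                   for s, rate in stream_rates.items())

    x = {"s1": 100.0, "s2": 40.0}                  # packets/sec per stream
    f = {(("A", "B"), "s1"): 1.0,                  # s1 fully uses link A->B
         (("A", "B"), "s2"): 0.5}                  # s2 bifurcates: half via A->B
    print(link_arrival_rate(("A", "B"), x, f))     # 100*1.0 + 40*0.5 = 120.0

Equation (2.1) is the special case in which every fraction is either 0 or 1, i.e., each stream follows a single path.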
It follows from the special case of two queues that, even if the packet streams are Poisson with independent packet lengths at their point of entry into the network, this property is lost after the first transmission line. To resolve this dilemma, it was suggested by Kleinrock [72] [71] that merging several packet streams on a transmission line has an effect similar to restoring the independence of interarrival times and packet lengths. It was concluded that it is often appropriate to adopt an M/M/1 queueing model for each communication link regardless of the interaction of the traffic on this link with traffic on other links. This is known as the Kleinrock independence approximation and seems to be a reasonably good approximation for systems involving Poisson stream arrivals at the entry points, packet lengths that are nearly exponentially distributed, a densely connected network and a moderate-to-heavy traffic load.

Queueing models were also used extensively for ATM traffic. An interesting modeling approach was to decompose the problem by time scales. This approach was first introduced by Hui [63]. The decomposition is based on the qualitatively different nature of the system at three different time scales: call scale, burst scale and cell scale. At the cell time scale, the traffic consists of discrete entities, the cells, produced by each source at a rate which is often orders of magnitude lower than the transmission rate of the output link. At the burst time scale the finer cell scale is ignored and the input process is characterized by its instantaneous rate. Consequently, fluid flow models appear as a natural modeling tool [97]. At the cell level, the mechanisms that handle individual cells were studied. Basically, two classes of cell-level models were used: those based on renewal processes and those based on Markov modulated processes. At the burst level, several models consider traffic as a continuous fluid:

- Renewal rate process
- On/Off process and their superpositions
- Poisson burst process
- Gaussian traffic modeling
- Markov modulated rate process

Traffic measurement studies, however, have demonstrated that data traffic characteristics differ significantly from Poisson features. These studies may be classified as belonging either to local area networks (LANs) or wide area networks (WANs). Jain et al. [68] studied the traffic on a token ring network and showed that successive packet arrivals were neither Poisson nor compound Poisson. An alternative model was proposed using the concept of a packet train, which represents a cluster of arrivals characterized by a fixed range of inter-arrival times. This model considers a track between two nodes A and B. All packets on the track are flowing either from A to B or from B to A. A train consists of packets flowing on this track with the intercar time between them being smaller than a specified maximum, referred to as the maximum allowed intercar gap (MAIG). If no packets are seen on the track for MAIG time units, the previous train is declared to have ended and the next packet is declared to be the locomotive (first car) of the next train. The intertrain time is defined as the time between the last packet of a train and the locomotive of the next train (figure 2.1).

Figure 2.1: Packet Trains
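The train-grouping rule just described is straightforward to express in code. The following minimal sketch (our own illustration; the timestamps and the MAIG value are made up) splits a sequence of packet arrival times into trains:

    # Sketch of packet-train grouping: packets belong to one train while
    # successive arrivals are within the maximum allowed intercar gap
    # (MAIG); a larger gap ends the train and the next packet becomes the
    # locomotive of a new one.

    def split_into_trains(arrival_times, maig):
        trains, current = [], [arrival_times[0]]
        for t_prev, t in zip(arrival_times, arrival_times[1:]):
            if t - t_prev <= maig:
                current.append(t)        # still within the same train
            else:
                trains.append(current)   # gap > MAIG ends the train
                current = [t]            # this packet is the new locomotive
        trains.append(current)
        return trains

    print(split_into_trains([0.0, 0.1, 0.2, 5.0, 5.1, 9.0], maig=1.0))
    # -> [[0.0, 0.1, 0.2], [5.0, 5.1], [9.0]]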
More recent studies show that it is not accurate to use a Poisson process to model data traffic. A pioneering work by Leland et al. [79] showed that the LAN traffic generated by Ethernet-connected workstations, file servers and personal computers exhibits the same degree of correlation when aggregated using window sizes increasing from seconds to hours. This work proposed that Ethernet LAN traffic is statistically self-similar, and that major components of Ethernet LAN traffic, such as external LAN traffic or external TCP traffic, share the same self-similar characteristics as the overall LAN traffic. Since this work, many researchers have shown features of data traffic that exhibit self-similar, long-range dependent (LRD) properties. A stochastic process is said to have LRD if it is wide-sense stationary and the sum of its autocorrelation values diverges.

More recent investigations have proposed other notions of burstiness, such as focusing on the arrival process, as a counting process, rather than on the interarrival process. Willinger et al. [118] [79] have studied the packet arrival process and the correlation of packet arrivals in local environments. In their study of several Ethernet environments [79] they present evidence that Ethernet traffic is self-similar, implying no natural burst length. Such traffic exhibits the same correlation structure at various aggregation granularities. They conclude that the empirical data demand a reexamination of currently considered formal models for packet traffic, e.g., pure Poisson or Poisson-related models such as Poisson-batch or Markov-modulated Poisson processes [56], packet train models [68], and fluid flow models [8]. In particular, their evidence indicates that Poisson modeling assumptions are false for environments that aggregate much traffic. Contrary to the Poisson assumption, the traffic profile of their measured environments becomes burstier rather than smoother as the number of active sources increases; the Poisson assumption appears to hold only during low-traffic periods with mostly machine-generated router-to-router traffic.

While these studies focus on LAN, specifically Ethernet, traffic, Paxson and Floyd [94] also comment on the potential self-similarity of wide-area traffic. They also note that packet inter-arrivals are not exponentially distributed [68] [30] [45]. Paxson et al. [95] evaluated 21 WAN traces in their traffic analysis research. They considered both the Poisson process model and new models to characterize FTP and TELNET traffic, and found that in some cases commonly used Poisson models result in a serious underestimation of the burstiness of TCP traffic, which exists over a wide range of time scales. For interactive TELNET traffic, the exponentially distributed inter-arrivals commonly used to model packet arrivals generated by the user side of a TELNET connection grievously underestimated the variability of these connections. For applications such as SMTP and NNTP, connection arrivals are not well modeled as Poisson, since both types of connections are machine-initiated and can be timer-driven. For large bulk transfers, exemplified by FTP, the traffic structure deviates significantly from the Poisson model. Paxson et al. also offered results that suggest self-similar properties of WAN traffic.

The degree of self-similarity present in a process is measured in terms of the Hurst parameter [79]. This Hurst parameter has been proposed as a measure of the burstiness of the traffic. The persistence of traffic burstiness is one of the main causes of congestion, resulting in packet loss and delays. Further study by Willinger et al. [120] provided a plausible physical explanation for the occurrence of self-similarity in high-speed network traffic.
The superposition of many ON/OFF sources with strictly alternating ON and OFF periods, and whose ON periods or OFF periods exhibit the Noah effect (i.e., have infinite variance), can produce aggregate network traffic that exhibits the Joseph effect (i.e., is self-similar or LRD). This was offered as a physical explanation for the presence of self-similar traffic patterns in modern high-speed network traffic that is consistent with traffic measurements at the source level.

Crovella et al. [28] showed that Web traffic is self-similar, and that the self-similarity is in all likelihood attributable to the heavy-tailed distributions of the transmission times of documents and of the silent times between document requests. The data traces used in the Crovella study were recorded by Cunha et al. [29].

Over the last years, researchers have pointed out evidence that Internet traffic shows long-range dependence, scaling phenomena and heavy-tailed behavior. Furthermore, networking series, such as the aggregate number of packets and bytes over time, have been shown to exhibit correlations over large time scales and self-similar scaling properties. These results invalidated the assumptions traditionally used in modeling and simulations, namely that packet arrivals are Poisson and that packet sizes and interarrival times are mutually independent. However, the results presented in [70] showed that, unlike in the older data sets, current network traffic can be well represented by the Poisson model at sub-second time scales: up to sub-second time scales, traffic is characterized by a stationary Poisson model.

2.2.1. Traffic Flow Properties

The efforts to model and characterize computer network traffic have focused on the temporal statistics of packet arrivals and on the packet size distribution. These models were used for link and buffer size dimensioning, respectively. However, in recent years we have seen a strong tendency to use a flow-based approach to model Internet traffic. Claffy, Braun and Polyzos [22] have presented a parameterizable methodology for profiling Internet traffic flows at a variety of granularities. Barakat et al. [11] have presented a traffic model at the flow level based on a Poisson shot-noise process. In this model, a flow is a generic notion that must be able to capture the characteristics of any kind of data stream. Moreover, many published papers have studied traffic characteristics such as flow size, flow lifetime, IP address locality, and IP address structure. For instance, in [51] the flow size distribution is studied, introducing a flow classification based on the number of bytes, i.e., mice or elephants. In [19] flows are classified by lifetime, demonstrating that most flows are very short. Kohler, Li, Paxson, and Shenker [75] have investigated the structure of addresses contained in IP traffic. All these studies show important characteristics of the traffic but, in our opinion, more semantic aspects of flows are required for a useful traffic characterization.

In terms of synthetic generation, Barford and Crovella [12] created a Web workload generator which mimics a set of real users accessing a server. The tool, called SURGE, generates references matching empirical measurements of server file size distribution, request size distribution, relative file popularity, embedded file references, temporal locality of reference and idle periods of individual users.
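The heavy-tailed ingredient common to the ON/OFF and workload models above is easy to reproduce. The following minimal sketch (our own illustration, with arbitrary parameters, not fitted to any dataset) draws ON-period lengths, or equivalently file sizes, from a Pareto distribution whose shape parameter alpha < 2 yields infinite variance (the Noah effect):

    # Sketch of heavy-tailed sampling: inverse-transform draws from a
    # Pareto(alpha, x_min) distribution. With alpha < 2 the variance is
    # infinite, the key property behind self-similar aggregate traffic.
    import random

    def pareto_sample(alpha, x_min, n, seed=42):
        rng = random.Random(seed)
        samples = []
        for _ in range(n):
            u = 1.0 - rng.random()                 # u in (0, 1]
            samples.append(x_min * u ** (-1.0 / alpha))
        return samples

    on_periods = pareto_sample(alpha=1.2, x_min=1.0, n=10)
    print(on_periods)  # occasional very large values dominate the total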
The work of Aida and Abe [6] investigates the stochastic properties of packet destinations and proposes an address generation algorithm which is applicable for describing various Internet access patterns.

2.2.2. Popularity

Popularity has been extensively studied, mainly for Web reference streams. Web reference streams show highly skewed popularity distributions, which are usually characterized by the term Zipf's Law [49] [29] [7] [16]. Zipf's Law was originally applied to the relationship between a word's popularity in terms of rank and its frequency of use. It states that if one ranks the popularity of words used in a given text (with rank denoted by $\rho$) by their frequency of use (denoted by $P$), then

$$P \sim \frac{k}{\rho} \qquad (2.3)$$

The practical implication of Zipf-like distributions for reference streams is that most references are concentrated among a small fraction of all of the objects referenced. Based on the near-ubiquity of Zipf-like distributions in the Web, many authors have captured popularity skew [69] [16] [12]. Highly popular documents tend to be requested frequently, and thus will exhibit shorter inter-request times; less popular documents, on the other hand, tend to be requested infrequently, and thus will exhibit longer inter-request times. The relationship between popularity and temporal locality was shown in [69]: that is, Zipf's Law results in strong temporal locality [46].

Designers of computer systems incorporated the notion of memory reference locality into system design years ago, largely through the use of virtual memory and memory caches [58]. In deriving metrics for network traffic locality, Jain [66] draws a comparison to memory reference locality, which is either spatial or temporal. Spatial locality refers to the likelihood of reference to memory locations near previously referenced locations. Temporal locality refers to the likelihood of future references to the same location. In network traffic, the concentration of references to a small fraction of addresses and the persistence of references to recently used addresses are analogous concepts [66].

Jain presents data using three measures of locality: the network traffic income distribution, which can reflect long- or short-term locality; and two metrics that only apply to short-term locality assessment: the average working set size as a function of packet window size, and the stack depth probability distribution. The income distribution measures what percentage of communicating network entities is responsible for what percentage of traffic on the network. Changes in the working set of source or destination IP networks are indicators of source-based and destination-based favoritism. To measure the working set, one plots the number of unique address references as a function of the number of total address references. The stack level probability distribution measures the likelihood of reference to a network address as a function of the previous reference to that address.

Gulati et al. [50] offer four locality metrics, two of which overlap with those of Jain: persistence; address reuse, which is similar to persistence with the requirement for consecutive reference loosened; concentration; and reference density. Reference density reflects the number of communicating entities responsible for a given percentile of the network traffic. Many previous studies have established the existence of network traffic locality, in particular short-term traffic locality in specific network environments for selected granularities of network traffic flow.
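Jain's working-set measure described above reduces to counting unique addresses as references accumulate. The following minimal sketch (our own illustration; the address stream is invented) computes that curve:

    # Sketch of the working-set measure: the number of unique addresses
    # seen as a function of the total number of address references.

    def working_set_curve(references):
        seen, curve = set(), []
        for addr in references:
            seen.add(addr)
            curve.append(len(seen))   # unique addresses after each reference
        return curve

    stream = ["10.0.0.1", "10.0.0.2", "10.0.0.1", "10.0.0.3", "10.0.0.1"]
    print(working_set_curve(stream))  # -> [1, 2, 2, 3, 3]

A curve that flattens quickly indicates strong locality: most traffic concentrates on a small set of addresses.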
Jain originally established the packet train model to study locality behavior on a local area network [68]. Other studies, though not focused on packet trains in particular, also find evidence for locality even in networks of wider geographic scope [24] [57] [93] [39] [30] [23] [105]. Others [2] [1] [83] have extended the packet train model to the transport and application layers, defining a train as a quadruple of source/destination address pairs in conjunction with port numbers.

Just as program and data caching policies can exploit memory reference locality in a virtual memory system, router designers can exploit traffic locality with analogous schemes such as caching network addresses and specialized flow information in switching nodes. [43] and [66] simulate caching algorithms on traffic traces taken from LAN gateways. Feldmeier [43] estimated the potential benefit of caching on the performance of gateway routing tables. Using measured traffic from gateways at MIT, he simulated a variety of fully associative cache replacement algorithms (LRU, FIFO, and random) to determine cache performance metrics such as hit ratio and inter-fault distance. His data indicated that the probability of reference to a destination address versus the time of previous reference to that address decreases monotonically for up to 50 previous references, implying that an LRU cache management procedure is optimal for caches of 50 slots or less. His conservative conclusion was that caching could reduce gateway routing table lookup time by up to 65%. In addition to caching destination addresses, his simulations indicate benefits from caching source addresses as well.

Jain [66] also performed trace-driven cache simulations. Simulating MIN (optimal), LRU, FIFO, and random replacement algorithms, he found significantly different locality behavior between interactive and non-interactive traffic. The interactive traffic did not follow the LRU stack model while the non-interactive traffic did. In particular, the periodic nature of certain protocols may make caches ineffective unless they are sufficiently large. Such environments may require larger or multiple caches, or new cache replacement/fetch algorithms.

Estrin and Mitzel [39] also explore locality in their investigation of lookup overhead in routers. They use data collected at border routers and transit networks to estimate the number of active conversations at a router, which reflects the storage requirements for the associated conversation state table. They find that maintaining fine-grained traffic state may be possible at the network periphery, but deeper within the network a coarser granularity may be necessary. They also use the traces to perform simulations of an LRU cache for different conversation granularities, and find that improvements in state lookup time are possible with a small cache, even without special hardware.

Gulati et al. [50] have explored LAN cache performance when caching source addresses, destination addresses, and both source and destination addresses. In their measurement study of LAN traffic they find that it is more important to cache destination rather than source addresses, especially for caches with more than 15 entries. One reason is that many source hosts send very few packets, and thus the cost of caching the source address is greater than the benefit. Another reason is that source addresses are poor predictors of destination address references in the future.
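A trace-driven LRU cache simulation of the kind performed by Feldmeier and Jain can be sketched in a few lines. The following illustration is ours (the reference stream and cache size are arbitrary) and reports the hit ratio of an LRU cache replayed over an address stream:

    # Sketch of a trace-driven LRU cache simulation: replay an address
    # stream through a fixed-size cache and report the hit ratio.
    from collections import OrderedDict

    def lru_hit_ratio(references, cache_size):
        cache, hits = OrderedDict(), 0
        for addr in references:
            if addr in cache:
                hits += 1
                cache.move_to_end(addr)           # mark as most recently used
            else:
                if len(cache) >= cache_size:
                    cache.popitem(last=False)     # evict least recently used
                cache[addr] = True
        return hits / len(references)

    stream = ["a", "b", "a", "c", "a", "b", "d", "a"]
    print(lru_hit_ratio(stream, cache_size=2))    # -> 0.25

Swapping the eviction rule (FIFO, random) in place of move_to_end/popitem gives the other replacement policies those studies compared.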
Regarding locality in WWW traffic, Almeida et al. [7] proposed models for both temporal and spatial locality of reference in streams of requests arriving at Web servers. They showed that simple models based on document popularity are insufficient for capturing either temporal or spatial locality. Moreover, they showed that temporal locality can be characterized by the marginal distribution of the stack distance trace, and that spatial locality can be characterized using the notion of self-similarity.

2.2.3. Long range dependence and self-similarity

Traffic ranging from Ethernet [79] to wide-area [95] and Web traffic [28] has been characterized by statistical self-similarity. Some have modeled such traffic quite successfully with ON/OFF traffic sources [119], while others have tried to fit a Markov-modulated model [102]. [92] discusses some implications of self-similarity on network performance, as well as the impact of network handling mechanisms on traffic characteristics. An important idea was raised in [6]: the heavy-tailed characteristics of the sizes of the objects transferred over the network could suffice to generate self-similarity. It has been shown that heavy-tailed file transfer durations and file sizes can lead to high variability. Taqqu et al. [110] proved that aggregate World Wide Web traffic as found on Internet links can be modeled by superposing many ON/OFF traffic sources whose ON and OFF periods are drawn from heavy-tailed distributions. This method afforded the generation of traffic traces for simulation in linear time. Traffic generated by one of the ON/OFF traffic sources in the above-mentioned model is representative of a single Web user [79]. Deng [36] proposed an ON/OFF traffic model for simulating the traffic generated by an individual browsing the Web. He derived distributions for the parameters of the model by analyzing datasets measured on a corporate network, using probability plots to gauge the goodness-of-fit of analytic distributions to the datasets. The model had the advantage of being simple and of generating self-similar traffic, due to the heavy-tailed nature of the ON and OFF distributions. In the study of wide area networks (WANs), Klivansky et al. [73] examined packet traffic from geographically dispersed locations on the NSFNET T3 backbone, and indicated that packet-level traffic over NSFNET core switches exhibits LRD. Their key conclusion is that LRD in TCP traffic is primarily caused by the joint distributions of two TCP conversation parameters: the number of packets per one-way conversation and the conversation duration.

In the remainder of this section, we describe the main concepts related to self-similarity. We start by defining the concept of a stationary ergodic process. A process $X$ is stationary if its behavior or structure is invariant with respect to shifts in time; $X$ is strictly stationary if $(X(t_1), X(t_2), \ldots, X(t_n))$ and $(X(t_1+\tau), X(t_2+\tau), \ldots, X(t_n+\tau))$ possess the same joint distribution for all $\tau$, all $n$, and all $t_1, \ldots, t_n$. A process is said to be stationary ergodic when it is possible to estimate the process statistics (mean, variance, autocorrelation function, etc.) from the observed values of a single time series. For a stationary time series $X = (X_i ;\, i = 1, 2, 3, \ldots)$, we define the $m$-aggregated time series $X^{(m)} = (X^{(m)}_k ;\, k = 1, 2, 3, \ldots)$ by averaging the original time series over non-overlapping, adjacent blocks of size $m$. This may be expressed as
$$ X^{(m)}_k = \frac{1}{m} \sum_{i=(k-1)m+1}^{km} X_i \qquad (2.4) $$

One way of viewing the aggregated time series is as a technique for compressing the time scale. If the statistics of the process (mean, variance, correlation, etc.) are preserved under this compression, then we are dealing with a self-similar process. Some of the properties of self-similar processes are most clearly stated in terms of the autocorrelation $r(k)$, defined as:

$$ r(k) = \frac{E[(X_t - \mu)(X_{t+k} - \mu)]}{E[(X_t - \mu)^2]} \qquad (2.5) $$

However, other properties are best expressed in terms of the auto-covariance:

$$ \gamma(k) = E[(X_t - \mu)(X_{t+k} - \mu)] \qquad (2.6) $$

where $\mu = E[X_t]$. When a given pattern is reproduced exactly at different scales, it might be termed exact self-similarity; such exact self-similarity can be constructed for a deterministic time series. A process is said to be exactly self-similar with parameter $\beta$ if for all $m = 1, 2, \ldots$ we have:

$$ \mathrm{Var}(X^{(m)}) = \frac{\mathrm{Var}(X)}{m^{\beta}} \qquad (2.7) $$
$$ r^{(m)}(k) = r(k), \quad k \geq 0 \qquad (2.8) $$

The parameter $\beta$ is related to the Hurst parameter, defined as $H = 1 - \beta/2$. For a stationary, ergodic process, $\beta = 1$ and the variance of the time average decays to zero at the rate $1/m$. For a self-similar process, the variance of the time average decays more slowly. A process $X$ is said to be asymptotically self-similar if, for all large enough $m$,

$$ \mathrm{Var}(X^{(m)}) = \frac{\mathrm{Var}(X)}{m^{\beta}} \qquad (2.9) $$
$$ r^{(m)}(k) \to r(k) \text{ as } m \to \infty, \quad k \geq 0 \qquad (2.10) $$

Thus, with this definition of self-similarity, the autocorrelation of the aggregated process has the same form as that of the original process.

One of the most significant properties of self-similar processes is long-range dependence. This property is defined in terms of the behavior of the auto-covariance $\gamma(k)$ as $k$ increases. In general, a short-range dependent process satisfies the condition that its auto-covariance decays at least as fast as exponentially:

$$ \gamma(k) \sim c \, a^{|k|}, \quad 0 < a < 1 \qquad (2.11) $$

from which we can also observe that

$$ \sum_k \gamma(k) < \infty \qquad (2.12) $$

In contrast, a long-range dependent process has a hyperbolically decaying auto-covariance:

$$ \gamma(k) \sim c \, |k|^{-\beta}, \quad 0 < \beta < 1 \qquad (2.13) $$

where $\beta$ is related to the Hurst parameter as defined earlier. In this case,

$$ \sum_k \gamma(k) = \infty \qquad (2.14) $$

One of the attractive features of using self-similar models for time series is that the degree of self-similarity of a series is expressed using only a single parameter, which expresses the speed of decay of the series' autocorrelation function. For historical reasons, the parameter used is the Hurst parameter $H = 1 - \beta/2$. Thus, for self-similar series with long-range dependence, $1/2 < H < 1$. As $H \to 1$, the degree of both self-similarity and long-range dependence increases.

A number of approaches have been taken to determine whether a given time series of actual data is self-similar and, if so, to estimate the self-similarity parameter $H$. Below, we summarize some of the more common approaches. The first method, the variance-time plot, relies on the slowly decaying variance of a self-similar series. The variance of $X^{(m)}$ is plotted against $m$ on a log-log plot; a straight line with slope $-\beta$ greater than $-1$ is indicative of self-similarity. The second method, the R/S plot, uses the fact that, for a self-similar dataset, the rescaled range (the R/S statistic) grows according to a power law with exponent $H$ as a function of the number of points $n$ included. Thus the plot of R/S against $n$ on a log-log plot has a slope which is an estimate of $H$.
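Before turning to the remaining estimators, a minimal sketch of the variance-time method; numpy, the block sizes, and the least-squares fit are implementation choices rather than the thesis's procedure:

```python
import numpy as np

def hurst_variance_time(x: np.ndarray, block_sizes=(1, 2, 4, 8, 16, 32, 64)) -> float:
    """Variance-time estimate of the Hurst parameter: regress
    log Var(X^(m)) on log m; the slope is -beta and H = 1 - beta/2."""
    log_m, log_var = [], []
    for m in block_sizes:
        n_blocks = len(x) // m
        if n_blocks < 2:
            continue
        # m-aggregated series: averages over non-overlapping blocks, eq (2.4)
        agg = x[: n_blocks * m].reshape(n_blocks, m).mean(axis=1)
        log_m.append(np.log(m))
        log_var.append(np.log(agg.var()))
    slope, _ = np.polyfit(log_m, log_var, 1)  # slope = -beta
    return 1.0 - (-slope) / 2.0               # H = 1 - beta/2

# Sanity check on i.i.d. noise, where H should be near 0.5.
rng = np.random.default_rng(0)
print(hurst_variance_time(rng.standard_normal(100_000)))
```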
The third approach, the periodogram method, uses the slope of the power spectrum of the series as the frequency approaches zero. On a log-log plot, the periodogram is, close to the origin, a straight line with slope $1 - 2H$ (that is, $\beta - 1$).

The last method is called the Whittle estimator. The two forms that are most commonly used to calculate it are Fractional Gaussian Noise (FGN) with parameter $1/2 < H < 1$, and Fractional ARIMA(p, d, q) with $0 < d < 1/2$ [13] [18]. These two models differ in their assumptions about the short-range dependence in the datasets: FGN assumes no short-range dependence, while fractional ARIMA can assume a fixed degree of short-range dependence.

2.2.4. IP Address Multifractality

The presence of multifractality in IP addresses was originally described in [75]. According to the authors, an address structure can be viewed as a subset of the unit interval $[0, 1)$, where the subinterval $[a \cdot 2^{-32}, (a+1) \cdot 2^{-32})$ corresponds to address $a$. Considered this way, an address structure might resemble a Cantor-dust-like fractal [81] [96]. The lattice box-counting fractal dimension metric naturally fits address structures and prefix aggregation. Lattice box counting measures, for every $p$, the number of dyadic intervals of length $2^{-p}$ required to cover the relevant dust. These dyadic intervals correspond to the first $p$ bits of the IP addresses (p-aggregates). Given a trace, let $N_p$ be the number of p-aggregates that contain at least one address present in the trace ($0 \leq p \leq 32$). Furthermore, since each p-aggregate contains, and is covered by, exactly two disjoint (p+1)-aggregates, we know that $N_p \leq N_{p+1} \leq 2 N_p$. Using this notation, the lattice box-counting dimension is defined as

$$ D = \lim_{p \to \infty} \frac{\log_2 N_p}{p} \qquad (2.15) $$

If address structures were fractal, $\log_2 N_p$ would appear as a straight line with slope $D$ when plotted as a function of $p$; a computational sketch is given at the end of this section.

Adaptations of the well-known Cantor dust construction can generate address structures with any fractal dimension. The original Cantor construction can be extended, for instance, to a multifractal Cantor measure [54] [101]. Begin by assigning a unit of mass to the unit interval $I$. Then split the interval into three parts, where the middle part takes up a fraction $\lambda$ of the whole interval; call these parts $I_0$, $I_1$, and $I_2$. Then throw away the middle part $I_1$, giving it none of the parent interval's mass. The other subintervals are assigned masses $m_0$ and $m_2 = 1 - m_0$. Recursing on the nonempty subintervals $I_0$ and $I_2$ generates four nonempty subintervals $I_{00}$, $I_{02}$, $I_{20}$, and $I_{22}$ with respective masses $m_0 m_0$, $m_0 m_2$, $m_2 m_0$, and $m_2 m_2$. Continuing the procedure defines a sequence of measures on the intervals $I_{\epsilon_1 \cdots \epsilon_p}$ with mass $m_{\epsilon_1} \cdots m_{\epsilon_p}$ (each $\epsilon_i$ is 0, 1, or 2). To create an address structure from this measure, we choose a number of addresses so that the probability of selecting address $a$ equals the mass of the corresponding interval. If $m_0$ and $m_2$ differ, the measure is multifractal.
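A minimal sketch of the lattice box-counting count $N_p$ underlying equation (2.15); the toy addresses are illustrative assumptions:

```python
from typing import Iterable, List

def p_aggregate_counts(addresses: Iterable[int], max_p: int = 32) -> List[int]:
    """For p = 0..max_p, count N_p: the number of distinct p-bit prefixes
    (p-aggregates) covering the 32-bit addresses seen in a trace."""
    addrs = set(addresses)
    counts = []
    for p in range(max_p + 1):
        prefixes = {a >> (32 - p) for a in addrs}
        counts.append(len(prefixes))
    return counts

# Toy trace: three addresses, two of them sharing a /24.
trace = [0x0A000001, 0x0A000002, 0xC0A80101]  # 10.0.0.1, 10.0.0.2, 192.168.1.1
n_p = p_aggregate_counts(trace)
print(n_p[0], n_p[24], n_p[32])  # 1, 2, 3
# If the structure were fractal, log2(N_p) plotted against p would be a
# straight line whose slope estimates the dimension D of equation (2.15).
```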
2.3. Data collection

Network traffic measurement provides a means to understand what is and what is not working properly on a local-area or wide-area network. Using specialized network measurement hardware or software, a network researcher can collect detailed information about the transmission of packets on the network, including their time structure and contents. With detailed packet-level measurements, and some knowledge of the Internet protocol stack, it is possible to obtain significant information about the structure of an Internet application or the behavior of an Internet user.

According to [117], there are four main reasons why network traffic measurement is a useful methodology:

- Network troubleshooting: Computer networks are not infallible. Often, a single malfunctioning piece of equipment can disrupt the operation of an entire network, or at least degrade performance significantly.

- Protocol debugging: Developers often want to test new versions of network applications and protocols. Network traffic measurement provides a means to ensure the correct operation of the new protocol or application, its conformance to required standards, and its backward compatibility with previous versions.

- Workload characterization: Network traffic measurements can be used as input to the workload characterization process, which analyzes empirical data (often using statistical techniques) to extract salient and representative properties describing a network application or protocol. Knowledge of the workload characteristics can then lead to the design of better protocols and networks for supporting the application.

- Performance evaluation: Finally, network traffic measurements can be used to determine how well a given protocol or application is performing in the Internet. Detailed analysis of network measurements can help identify performance bottlenecks.

In general, the tools for network traffic measurement can be classified in the following ways:

- Hardware-based versus software-based measurement tools. The primary categorization among network measurement tools is hardware-based versus software-based. Hardware-based tools are often referred to as network traffic analyzers: special-purpose equipment designed expressly for the collection and analysis of network data. Software-based measurement tools typically rely on kernel-level modifications. One widely used utility is tcpdump [111], a user-level tool for TCP/IP packet capture. In general, the software-based approach is much less expensive than the hardware-based approach, but may not offer the same functionality and performance as a dedicated network traffic analyzer. Another software-based approach to workload analysis relies on the access logs recorded by Web servers and Web proxies on the Internet. These logs record each client request for Web site content, including the time of day, client IP address, URL requested, and document size. Post-processing of such access logs provides useful insight into Web server workloads [9], without the need to collect detailed network-level packet traces.

- Passive versus active measurement approaches. A passive network monitor is used to observe and record the packet traffic on an operational network, without injecting any traffic of its own onto the network; that is, the measurement device is non-intrusive. An active network measurement approach uses packets generated by a measurement device to probe the Internet and measure its characteristics. Examples of this approach include the ping utility for estimating network latency to a particular destination on the Internet, the traceroute utility for determining Internet routing paths, and the pathchar tool for estimating link capacities and latencies along an Internet path.

- On-line versus off-line traffic analysis. Some network traffic analyzers support real-time collection and analysis of network data, often with graphical displays for on-line visualization of live traffic data.
Other network measurement devices are intended only for real-time collection and storage of traffic data; analysis is postponed to an off-line stage. The tcpdump utility falls into this category. Once the traffic data is collected and stored, a researcher can perform as many analyses as desired in the post-processing phase.

However, reliable and representative measurements of wide-area Internet traffic are difficult to obtain. Basically, these difficulties are related to the following problems:

- Firstly, real traces are often difficult to obtain, mostly because of security or privacy concerns. For this reason, publicly available collections of traces are often sanitized by hiding the real source and destination addresses of all packets. Although such sanitized traces may still be useful for many studies (e.g. those dealing with the inter-arrival time or length distribution of packets), they are completely useless in those cases where the actual IP addresses are needed, e.g. to investigate the behavior of caching algorithms.

- Secondly, with the increasing speed of Internet routers, it becomes more and more expensive to collect and store full traces of a meaningful duration without affecting the router's performance. The situation begins to resemble the problems of collecting memory reference strings of programs in execution: the complete reference string of a sizable CPU-bound program cannot be collected and stored in real time without drastically impairing its execution time.

- Thirdly, Internet traffic patterns are likely to undergo significant changes due to new applications and user activity patterns that cannot be anticipated at present. For example, Web transactions became one of the major components of Internet traffic almost overnight. Such changes will render real trace collections obsolete.

The off-line analyses carried out in this thesis are based on traces captured from many sites and are mainly devoted to workload characterization. One of them is a trace of the OC-3 link (155 Mbps) that connects the Scientific Ring of Catalonia to RedIRIS (the Spanish National Research Network) [98], collected by a hardware-based measurement tool in passive mode. This non-sanitized trace is a collection of packets flowing in one direction of the link, containing for each packet a timestamp and the first 40 bytes of the packet. For our analysis, we have used only the output link. The Scientific Ring (see Figure 2.2) is a high-performance communication network created in 1993; nowadays it joins more than forty research institutions.

Figure 2.2: RedIRIS Topology

Furthermore, we have surveyed the publicly available archive of traces collected and maintained by the National Laboratory for Applied Network Research (NLANR) [89]. We downloaded traces obtained from the following sites [90]:

- Colorado State University (COS)
- Front Range GigaPOP (FRG)
- University of Buffalo (BUF)
- Columbia University (BWY)

In all cases, the traces are stored using the TSH packet header format. For .tsh files the record size is 44 bytes: 8 bytes of timestamp and interface identifier, 20 bytes of IP header, and 16 bytes of TCP header (see Figure 2.3). No IP or TCP options are included, and the packet payload is not stored.

Figure 2.3: TSH header data format (one word of timestamp seconds; one word carrying the interface identifier and the timestamp microseconds; the 20-byte IP header; and the first 16 bytes of the TCP header, up to and including the Window field)
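A sketch of how such 44-byte TSH records could be parsed; the exact packing of the interface byte together with the microseconds in the second word is our assumption about the layout, not a documented guarantee:

```python
import struct

RECORD = 44  # bytes per TSH record: 8 timestamp/interface + 20 IP + 16 TCP

def read_tsh(path):
    """Yield (seconds, microseconds, interface, src, dst, sport, dport)
    tuples from a TSH trace. Assumes the first word is the timestamp in
    seconds and the second packs the interface in its high byte and the
    microseconds in the low 24 bits, big-endian throughout."""
    with open(path, "rb") as f:
        while True:
            rec = f.read(RECORD)
            if len(rec) < RECORD:
                break
            secs, word2 = struct.unpack(">II", rec[:8])
            iface, usecs = word2 >> 24, word2 & 0x00FFFFFF
            # The IP header occupies bytes 8..27; the source and destination
            # addresses are the last two 32-bit words of the 20-byte header.
            src, dst = struct.unpack(">II", rec[20:28])
            # The TCP header starts at byte 28; the ports are its first fields.
            sport, dport = struct.unpack(">HH", rec[28:32])
            yield secs, usecs, iface, src, dst, sport, dport
```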
Chapter 3

Semantic traffic characterization

In this chapter, we present a novel flow characterization approach that incorporates semantic characteristics of flows. By semantic characterization we understand the joint analysis of traffic characteristics including some of the most important fields of the TCP/IP header contents (source and destination address, port numbers, packet length, TCP flags, etc.) [61].

3.1. Semantic Characterization

Let us define a packet flow as a sequence of packets in which each packet has the same value for the 5-tuple of source and destination IP address, protocol number, and source and destination port number, and such that the time between two consecutive packets does not exceed a threshold. In our case, we have adopted a threshold of 5 sec.

First of all, our analysis intends to classify the header fields according to how they change for packets belonging to the same flow. Let $F_j(i)$ be a header field of the i-th packet of a flow. We also define $\Delta F_j(i) = F_j(i+1) - F_j(i)$.

For the first packet of a flow, $F_j(1)$ can be classified as F(1)-random, F(1)-predictable, or F(1)-not predictable:

- F(1)-random fields: fields whose initial values could or should be chosen at random: Identification, Sequence Number, and Acknowledgment Number. The Identification field is primarily used for uniquely identifying the fragments of an original IP datagram, and many operating systems assign a sequential number to each packet; hence, assigning a random number to the first fragment does not constitute a problem. Equally, the Sequence Number and Acknowledgment Number fields are not affected if we assign random values to the first packet of each flow.

- F(1)-predictable fields: fields whose value is usually known or at least predictable: Interface, Version, IHL, Type of Service, Flags, Fragment Offset, Protocol, Data Offset, Reserved, and Control Bits. The fields in this group preserve a high level of similarity among different flows.

- F(1)-not predictable fields: fields whose value cannot be predicted and has a specific meaning: Timestamp, TTL, Header Checksum, Total Length, Source Address, Source Port, Destination Address, Destination Port, and Window. This group embraces the fields whose values are very hard to guess for the first packet of each flow. For instance, we cannot know in advance when each flow will start; hence, it is impossible to guess the value of the timestamp field of the first packet of each flow. The TTL field is modified during Internet header processing; depending on the number of hops previously visited, its value varies broadly across flows. The total length carried by each packet, as well as the window field, also shows large variation. Finally, the source and destination addresses represent a set of addresses that is impossible to know in advance.

Moreover, according to the $\Delta F_j(i)$ behavior, the fields of the i-th packet of a flow were classified as $\Delta F(i) = 0$, $\Delta F(i)$-predictable, or $\Delta F(i)$-not predictable.
- $\Delta F(i) = 0$ fields: header fields whose values are likely to stay constant over the life of a connection: Version, Type of Service, Protocol, Source Address, Destination Address, Source Port, and Destination Port. Hence, for each flow, we only need to store the data from the first packet.

- $\Delta F(i)$-predictable fields: fields whose $\Delta F$ values are predictable, can be calculated from the information stored in another field, or follow sequential increments: Interface, IHL, Identification, Flags, Fragment Offset, Time to Live, Sequence Number, Acknowledgment Number, Data Offset, Reserved, and Control Bits.

- $\Delta F(i)$-not predictable fields: fields that are likely to change over the life of the conversation and, furthermore, cannot be calculated: Timestamp, Total Length, Header Checksum, and Window.

Taking into account the joint behavior of $F(1)$ and $\Delta F(i)$, we have created four categories of fields. In the first category are placed the fields whose $F(1)$ values are predictable and whose $\Delta F$ values are constant or predictable through a flow:

((F(1)-Not Random) AND (F(1)-predictable)) AND (($\Delta F(i) = 0$) OR ($\Delta F(i)$-predictable))

The fields that satisfy these constraints are: Interface, Version, IHL, Type of Service, Flags, Fragment Offset, Protocol, Data Offset, Reserved, and Control Bits. This set of fields shows high similarity within consecutive packets belonging to the same flow, and in particular between m-packet flows (flows with m packets).

In the second category are included the fields whose $F(1)$ values are not predictable and whose $\Delta F$ values are constant or predictable:

((F(1)-Not Random) AND (F(1)-Not predictable)) AND (($\Delta F(i) = 0$) OR ($\Delta F(i)$-predictable))

According to these constraints, we have the following fields: TTL, Source Address, Source Port, Destination Address, and Destination Port. For these fields, storage needs are restricted to the first packet of each flow.

The third category incorporates the fields that are hard to predict or calculate and to which we cannot assign random values:

($\Delta F(i)$-Not predictable)

In this case, storage needs extend over all packets. These fields are: Timestamp, Total Length, Header Checksum, and Window.

Finally, the last category groups the fields whose initial value $F(1)$ is random and whose increments $\Delta F$ can be calculated: Identification, Sequence Number, and Acknowledgment Number.
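As an illustration of the flow definition adopted in Section 3.1, a minimal sketch that groups packets into flows by their 5-tuple with a 5-second inactivity threshold; the dict-based packet representation is an assumption for the example:

```python
TIMEOUT = 5.0  # seconds between consecutive packets before a flow is cut

def split_flows(packets):
    """Group packets into flows keyed by the 5-tuple, starting a new flow
    whenever the gap since the previous packet of that key exceeds TIMEOUT.
    Each packet is assumed to be a dict with the fields used below."""
    flows = []       # finished and in-progress flows
    active = {}      # 5-tuple -> index into flows
    last_seen = {}   # 5-tuple -> timestamp of the previous packet
    for pkt in packets:
        key = (pkt["src"], pkt["dst"], pkt["proto"], pkt["sport"], pkt["dport"])
        if key in active and pkt["ts"] - last_seen[key] <= TIMEOUT:
            flows[active[key]].append(pkt)
        else:
            active[key] = len(flows)
            flows.append([pkt])
        last_seen[key] = pkt["ts"]
    return flows
```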
3.1.1 Flow Mapping

For a better representation of the header fields, as well as to understand their behavior, we have developed a header field mapping. In this mapping, the values of some header fields are simply copied from the packets; for others, the mapped value represents the increment or decrement between consecutive packets of a flow; finally, for fields whose distribution of values is highly skewed, we replace the original value by a transformation or function of the values. Below, we describe this mapping (see Figure 3.1).

Let $P_i^m$ be the packet header of the i-th packet of a flow consisting of $m$ packets, and let $P_i^m(j)$ be a selected header field of $P_i^m$. For each field $P_i^m(j)$, a function $f_j$ performs a mapping into an integer value $F_i^m(j)$:

$$ F_i^m(j) = f_j(P_i^m(j)) \qquad (3.1) $$

For each packet, let

$$ F_i^m = (F_i^m(1), F_i^m(2), \ldots) \qquad (3.2) $$

denote the vector of integers corresponding to the selected fields. For the complete flow we can define:

$$ P^m = (P_1^m, P_2^m, \ldots, P_m^m) \qquad (3.3) $$

and

$$ F^m = (F_1^m, F_2^m, \ldots, F_m^m). \qquad (3.4) $$

Figure 3.1: Flow mapping (each selected field $P_i^m(j)$ of the i-th packet is mapped by $f_j$ to an integer $F_i^m(j)$)

Note that the vector $F^m$ can be viewed as a numerical representation of the $m$ packet headers, as we substitute the selected packet header fields by integers. Examples of header fields that are simply copied are: Version, IHL, Type of Service, Flags, Fragment Offset, Data Offset, and Control Bits. However, for some fields, such as Identification and Time to Live, we map the values as the increment or decrement between consecutive packets. On the other hand, as we have said earlier, if the distribution of a parameter is highly skewed, one should consider replacing the parameter by a transformation or function of it, for instance replacing the timestamp and the packet size by more appropriate values:

- Packet size: Observing Internet packets, we have seen a high predominance of small packets (acknowledgement packets) and of packets with sizes near 1,500 bytes (packets carrying data). As a consequence of this observation, we map the packet size to one of three classes:

$$ F_i^m(j) = \begin{cases} 0 & \text{if the packet is small (ACK-sized)} \\ 1 & \text{if the packet has an intermediate size} \\ 2 & \text{if the packet size is near the 1,500-byte maximum} \end{cases} $$

- Inter-packet time within a flow: Analysing sets of flows with the same number of packets, we have seen that, for small flows, the inter-packet time between consecutive packets is very similar across many flows. Basically, these inter-packet times are either very small or near the Round Trip Time (RTT). This behavior is related to the TCP properties. The sequence of figures 3.2 to 3.6 shows, for a set of 1,100 flows with 6 packets each, how similar the inter-packet times between consecutive packets of a flow are.

Figures 3.2 to 3.6: Inter-packet time between consecutive packets (1st-2nd through 5th-6th) for 1,100 flows with 6 packets; the distributions concentrate around small, medium, and large RTT values (x-axis: time in seconds; y-axis: number of flows).

If a packet to be transmitted waits for a packet sent by the opposite node, we call it a dependent packet; otherwise, if a packet is sent immediately after the previous one, we call it a non-dependent packet. For instance, in the TCP three-way handshake, when a node sends a SYN control flag, it waits for a SYN+ACK control flag from the opposite node; this waiting time corresponds to the RTT. In this sense, we associate the inter-packet time with acknowledgement dependence. Hence:

$$ F_i^m(j) = \begin{cases} 0 & \text{if the packet is dependent} \\ 1 & \text{if the packet is non-dependent} \end{cases} $$
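A minimal sketch of these two transformations; only the qualitative classes are fixed by the text, so the numeric boundaries (100 and 1,400 bytes, half an RTT) are illustrative assumptions:

```python
def map_packet_size(size_bytes: int) -> int:
    """Three-class packet-size mapping: ACK-sized, intermediate, near-MTU.
    The class boundaries used here are guesses, not the thesis's values."""
    if size_bytes < 100:
        return 0
    if size_bytes < 1400:
        return 1
    return 2

def map_dependence(gap_seconds: float, rtt_estimate: float) -> int:
    """Label a packet 0 (dependent: it waited roughly one RTT for the peer)
    or 1 (non-dependent: sent back-to-back). The threshold is an assumption."""
    return 0 if gap_seconds >= rtt_estimate / 2 else 1
```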
Chapter 4

Flow Clustering

This chapter is dedicated to exploring the similarity between Internet flows using the semantic traffic characterization proposed in the previous chapter. To do so, we have broken the trace down into flows with the same number of packets and calculated the number of clusters. Beyond calculating the number of clusters, we have studied the popularity of each cluster. For compression purposes, for instance, the ratio between the number of clusters and the number of flows must be as small as possible.

4.1. Introduction

Internet traffic is composed of a very large number of very short flows and a few very long flows. The terminology of mice and elephants provides a useful metaphor for understanding this characteristic of Internet traffic: there are relatively few elephants and a large number of mice [51]. Table 4.1 shows, for each of the m-packet flow classes (column 1), the probability distribution. The data indicate that 90% of the flows have fewer than 21 packets (column 2). Similar probability distributions were shown in [22] [10]. Moreover, in the same table we see that 77% of the packets (column 3) are carried by only a small number of flows (the elephants), while the remaining large number of flows carry few packets (the mice). These outcomes are similar to those described in [51]. Figure 4.1 shows the cumulative distribution of flow packet volume. The curve indicates that the 90th percentile of the flows corresponds to 20 packets or less, and that the tail of the distribution behaves as a heavy-tailed distribution, decaying very slowly.

Table 4.1: Number of packets per flow
Packets per flow | Flow probability | Flow volume (packets)
1   | 0.095546 | 0.004193
2   | 0.070476 | 0.006184
3   | 0.063517 | 0.008361
4   | 0.052245 | 0.009169
5   | 0.195584 | 0.042908
6   | 0.163134 | 0.042947
7   | 0.077876 | 0.023919
8   | 0.039831 | 0.013981
9   | 0.028339 | 0.011191
10  | 0.021941 | 0.009627
11  | 0.016285 | 0.007860
12  | 0.012876 | 0.006779
13  | 0.010409 | 0.005937
14  | 0.008343 | 0.005125
15  | 0.007762 | 0.005108
16  | 0.006719 | 0.004717
17  | 0.007220 | 0.005386
18  | 0.005656 | 0.004467
19  | 0.005214 | 0.004347
20  | 0.003911 | 0.003432
>20 | 0.107118 | 0.774358

Figure 4.1: Cumulative distribution of the number of packets in flows (packets/flow on a logarithmic scale)

4.2. Methodology

Using the flow characterization described in Chapter 3, in a high-speed link we can potentially find a large variety of flows. However, from our studies we have seen that the flows are not very different from each other. To study the variety among flows, we have used an approach based on clustering, a classical technique used for workload characterization [67]. The basic idea of clustering is to partition the components into groups so that the members of a group are as similar as possible and different groups are as dissimilar as possible. Statistically, this implies that the intra-group variance should be as small as possible and the inter-group variance as large as possible. In recent years, a number of clustering techniques have been described in the literature. These techniques fall into two classes: hierarchical and nonhierarchical.
In nonhierarchical approaches, one starts with an arbitrary set of clusters, and the members of the clusters are moved until the intra-group variance is minimal. There are two kinds of hierarchical approaches: agglomerative and divisive. In the agglomerative hierarchical approach, given $n$ components, one starts with $n$ clusters (each cluster having one component). Then neighboring clusters are merged successively until the desired number of clusters is obtained. In the divisive hierarchical approach, on the other hand, one starts with one cluster (of $n$ components) and then divides the cluster successively into two, three, and so on, until the desired number of clusters is obtained. A popular clustering technique is the minimum spanning tree method.

Generally, a measured trace consists of a large number of flows. For analysis purposes, it is useful to classify these flows into a small number of classes or clusters such that the components within a cluster are very similar to each other. Later, one member from each cluster may be selected to represent the class. Clustering analysis basically consists of mapping each component into an n-dimensional space, where $n$ is the number of parameters, and identifying the components that are close to each other. The closeness between two components is measured by defining a distance metric. The Euclidean distance is the most commonly used distance metric and is defined as:

$$ d = \left( \sum_{j=1}^{n} (x_j - y_j)^2 \right)^{1/2} \qquad (4.1) $$

Figure 4.2 depicts the proposed clustering methodology. Starting from a real trace, we break it down into flows with the same number of packets. Using the mapping described in Chapter 3 (see Figure 3.1), from a set of vectors $F^m$ we calculate the Euclidean distance between the vectors, and the results are stored in a distance matrix of flows. Initially, each vector $F^m$ represents a cluster. Evidently, a distance of 0 means that two vectors are exactly identical. Later, we search for the smallest element of the distance matrix. Let $d_{jk}$, the distance between clusters $j$ and $k$, be the smallest. We merge clusters $j$ and $k$, and also merge any other cluster pairs that have the same distance. We have used the Minimum Spanning Tree hierarchical clustering technique [67], which starts with $n$ clusters of one component each and successively joins the nearest clusters until a specified distance between the clusters is reached. For each $m$, we apply the clustering method separately. Finally, the resulting clusters are joined and Templates of Flows are generated.

Figure 4.2: Flow Clustering Methodology (the RedIRIS trace is split into flows with 2, 3, ..., m packets; for each class, the Euclidean distances between the flow vectors are stored in a distance matrix, and the resulting clusters are joined into a Template of Flows)
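A minimal sketch of the merging step described above; it uses a simple single-linkage agglomeration at a fixed inter-cluster distance threshold as a stand-in for the Minimum Spanning Tree technique of [67]:

```python
import numpy as np

def cluster_flows(vectors: np.ndarray, max_distance: float = 0.0):
    """Treat each flow vector as its own cluster, then repeatedly merge the
    closest pair of clusters while their distance does not exceed
    max_distance (0 merges only identical vectors)."""
    clusters = [[i] for i in range(len(vectors))]

    def dist(a, b):  # single-linkage distance between two clusters
        return min(np.linalg.norm(vectors[i] - vectors[j]) for i in a for j in b)

    while len(clusters) > 1:
        pairs = [(dist(a, b), ia, ib)
                 for ia, a in enumerate(clusters)
                 for ib, b in enumerate(clusters) if ia < ib]
        d, ia, ib = min(pairs)
        if d > max_distance:
            break
        clusters[ia] = clusters[ia] + clusters[ib]
        del clusters[ib]
    return clusters

# Three identical toy flow vectors and one outlier give two clusters.
v = np.array([[1, 4, 5, 2, 0], [1, 4, 5, 2, 0], [1, 4, 5, 2, 0], [9, 9, 9, 9, 9]])
print(cluster_flows(v, max_distance=0.0))  # [[0, 1, 2], [3]]
```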
4.3. Clustering

We have selected 9 fields to study their diversity among flows in Internet links. The shaded boxes in Figure 4.3 depict those selected fields. We use the definition of a flow as a sequence of packets in which each packet has the same value for the 5-tuple of source and destination address, source and destination port, and protocol, and such that the time between two consecutive packets does not exceed a threshold of 5 sec.

Figure 4.3: Selected fields used for flow clustering (shaded fields of the TSH header layout)

We started our analysis using the RedIRIS trace, converting each flow into a vector $F^m$ and calculating its distance to the previously read flows. The vector contains the mapped values of the selected fields described above; for the TTL field, however, we used the increment or decrement between consecutive packets. For each $m$, we show how the number of clusters increases as new flows are read. Eventually we reach a point where the number of clusters approaches a limit, and newly read flows almost always fit one of the previous clusters. Below, we describe the set of clusters for each m-packet flow class:

- Flows with 2 packets per flow: Table 4.2 shows, from 5 to 243 flows, how the number of clusters increases. After reading only 243 flows, the number of clusters tends to increase smoothly (see Figure 4.4), and the number of clusters (15) represents only 6% of the number of flows. In Table 4.3 we describe some clusters found among flows with 2 packets. As we see, 87% of the flows are concentrated in a single cluster. Data points with extreme parameter values are called outliers, particularly if they lie far from the majority of the other points. Since those outlying components do not account for a significant portion of the flows, their exclusion would not significantly affect the final results of the clustering.

Table 4.2: Number of clusters for m=2 packets
Number of flows | Number of clusters | Percentage
5   | 2  | 0.40
47  | 6  | 0.13
87  | 7  | 0.08
201 | 14 | 0.07
243 | 15 | 0.06

Figure 4.4: Number of clusters, m=2

- Flows with 3 packets per flow: Table 4.4 shows, from 6 to 648 flows, how the number of clusters increases. After reading 648 flows, the number of clusters tends to increase smoothly (see Figure 4.5), and the number of clusters (69) represents only 10% of the number of flows. In Table 4.5 we describe some clusters found among flows with 3 packets. As we see, approximately 82% of the flows are concentrated in only four clusters.

- Flows with 4 packets per flow: Table 4.6 shows, from 24 to 1,740 flows, how the number of clusters increases. After reading 1,740 flows, the number of clusters tends to increase smoothly (see Figure 4.6), and the number of clusters (103) represents only 6% of the number of flows. In Table 4.7 we describe some clusters found among flows with 4 packets. As we see, approximately 81% of the flows are concentrated in only five clusters.
Table 4.3: Flow distribution over clusters (m=2 packets)
Cluster | Packet 1 | Packet 2 | Prob
01 | 1:4:5:2:0:0:6:2:0   | 1:4:5:0:0:0:5:4:0  | 0.029586
02 | 1:4:5:2:0:0:7:2:0   | 1:4:5:0:0:0:5:4:0  | 0.875740
03 | 1:4:5:2:0:0:7:2:0   | 1:4:5:2:0:0:5:17:0 | 0.017751
04 | 1:4:5:2:0:0:11:2:0  | 1:4:5:2:0:0:5:4:0  | 0.005917
05 | 1:4:5:2:0:0:7:2:0   | 1:4:5:2:0:0:5:4:0  | 0.035503
06 | 1:4:5:2:0:0:7:18:0  | 1:4:5:2:0:0:5:17:0 | 0.011834
07 | 1:4:5:2:0:0:7:18:0  | 1:4:5:2:0:0:5:4:0  | 0.005917
08 | 1:4:5:2:0:0:6:2:0   | 1:4:5:2:0:0:5:4:0  | 0.011834
09 | 1:4:5:2:0:0:11:18:0 | 1:4:5:2:0:0:8:17:0 | 0.005917

Table 4.4: Number of clusters for m=3 packets
Number of flows | Number of clusters | Percentage
6   | 5  | 0.83
40  | 10 | 0.27
133 | 25 | 0.19
265 | 45 | 0.17
362 | 54 | 0.14
499 | 65 | 0.13
648 | 69 | 0.10

Figure 4.5: Number of clusters, m=3

Table 4.5: Flows with 3 pkts per flow
Cluster | Packet 1 | Packet 2 | Packet 3 | Prob
1  | 1:4:5:2:0:0:7:18:0 | 1:4:5:2:0:0:5:24:0   | 1:4:5:2:0:0:5:17:0 | 0.076923
2  | 1:4:5:2:0:0:7:2:0  | 1:4:5:2:0:0:5:16:0   | 1:4:5:2:0:0:5:17:0 | 0.564103
3  | 1:4:5:2:0:0:6:18:0 | 1:4:5:2:0:0:5:16:0   | 1:4:5:2:0:0:5:17:0 | 0.102564
4  | 1:4:5:2:0:0:6:2:0  | 1:4:5:0:0:0:5:16:-67 | 1:4:5:0:0:0:5:4:0  | 0.025641
5  | 1:4:5:2:0:0:7:18:0 | 1:4:5:2:0:0:5:16:0   | 1:4:5:2:0:0:5:20:0 | 0.076923
6  | 1:4:5:2:0:0:7:18:0 | 1:4:5:2:0:0:5:16:0   | 1:4:5:2:0:0:5:17:0 | 0.051282
7  | 1:4:5:2:0:0:7:2:0  | 1:4:5:2:0:0:8:16:0   | 1:4:5:2:0:0:5:17:0 | 0.025641
8  | 1:4:5:2:0:0:7:2:0  | 1:4:5:2:0:0:7:2:0    | 1:4:5:0:0:0:5:4:0  | 0.025641
9  | 1:4:5:2:0:0:8:18:0 | 1:4:5:2:0:0:5:16:0   | 1:4:5:2:0:0:5:17:0 | 0.025641
10 | 1:4:5:2:0:0:6:2:0  | 1:4:5:2:0:0:5:24:0   | 1:4:5:2:0:0:5:25:0 | 0.025641

Table 4.6: Number of clusters for m=4 packets
Number of flows | Number of clusters | Percentage
24   | 9   | 0.37
56   | 11  | 0.19
201  | 35  | 0.17
461  | 51  | 0.11
873  | 76  | 0.08
1740 | 103 | 0.06

Figure 4.6: Number of clusters, m=4

Table 4.7: Flows with 4 pkts per flow
Cluster | Packet 1 | Packet 2 | Packet 3 | Packet 4 | Prob
1  | 1:4:5:2:0:0:7:2:0   | 1:4:5:0:0:0:5:16:-66 | 1:4:5:2:0:0:5:16:0 | 1:4:5:2:0:0:5:17:0 | 0.018182
2  | 1:4:5:2:0:0:10:18:0 | 1:4:5:2:0:0:8:16:0   | 1:4:5:2:0:0:8:24:0 | 1:4:5:2:0:0:8:17:0 | 0.163636
3  | 1:4:5:2:0:0:7:2:0   | 1:4:5:0:0:0:5:16:-67 | 1:4:5:2:0:0:5:16:0 | 1:4:5:2:0:0:5:17:0 | 0.018182
4  | 1:4:5:2:0:0:7:18:0  | 1:4:5:2:0:0:5:16:0   | 1:4:5:2:0:0:5:24:2 | 1:4:5:2:0:0:5:17:0 | 0.218182
5  | 1:4:5:2:0:0:7:18:0  | 1:4:5:2:0:0:5:16:0   | 1:4:5:2:0:0:5:24:0 | 1:4:5:2:0:0:5:17:0 | 0.072727
6  | 1:4:5:2:0:0:7:2:0   | 1:4:5:2:0:0:5:16:0   | 1:4:5:2:0:0:5:24:0 | 1:4:5:2:0:0:5:17:0 | 0.218182
7  | 1:4:5:0:0:0:7:18:0  | 1:4:5:0:0:0:5:24:0   | 1:4:5:0:0:0:5:16:0 | 1:4:5:0:0:0:5:17:0 | 0.018182
8  | 1:4:5:2:0:0:7:2:0   | 1:4:5:2:0:0:5:16:0   | 1:4:5:2:0:0:5:24:0 | 1:4:5:2:0:0:5:4:0  | 0.127273
9  | 1:4:5:2:0:0:6:2:0   | 1:4:5:2:0:0:5:24:0   | 1:4:5:2:0:0:5:16:0 | 1:4:5:2:0:0:5:20:0 | 0.109091
10 | 1:4:5:2:0:0:7:18:0  | 1:4:5:2:0:0:5:16:2   | 1:4:5:2:0:0:5:16:2 | 1:4:5:2:0:0:5:17:0 | 0.018182
11 | 1:4:5:2:0:0:7:18:0  | 1:4:5:2:0:0:5:16:0   | 1:4:5:2:0:0:5:16:0 | 1:4:5:2:0:0:5:25:0 | 0.018182

To improve the accuracy of our outcomes, we extended our analysis to other networks. We demonstrated that the flow clustering observed in the RedIRIS trace can also be seen in other traces. In Figure 4.7 we show, for $m = 4$ packets and with an inter-cluster distance equal to zero, the outcomes from four ATM OC-3 traces downloaded from the NLANR web site. The plotted curves are from Colorado State University (COS), Front Range GigaPOP (FRG), University of Buffalo (BUF), and Columbia University (BWY). Furthermore, in Figure 4.7, we see the behavior of the Joined Trace (upper curve). This trace was obtained by joining the four downloaded NLANR traces.
From Figure 4.7, we obtain two important conclusions: (i) the traces from NLANR show the same behavior as the RedIRIS trace, i.e. we can obtain a small number of clusters to represent a packet trace; (ii) the number of clusters in the joined trace is less than the sum of the clusters of the four individual traces. This implies that the types of flows are basically the same in all traces. Similar behavior was obtained for different values of $m$.

Figure 4.7: Number of clusters for the four ATM OC-3 NLANR traces (COS, BUF, BWY, FRG) and the joined trace (upper curve), for flows with m=4 packets

- Flows with 5 packets per flow: Table 4.8 shows, from 30 to 3,162 flows, how the number of clusters increases. After reading 3,162 flows, the number of clusters tends to increase smoothly (see Figure 4.8), and the number of clusters (142) represents only 4% of the number of flows. In Table 4.9 we describe some clusters found among flows with 5 packets. As we see, approximately 88% of the flows are concentrated in only four clusters.

Table 4.8: Number of clusters for m=5 packets
Number of flows | Number of clusters | Percentage
30   | 5   | 0.16
62   | 7   | 0.11
207  | 19  | 0.09
709  | 63  | 0.08
1589 | 99  | 0.06
3162 | 142 | 0.04

Figure 4.8: Number of clusters, m=5

- Flows with 6 packets per flow: Table 4.10 shows, from 52 to 1,174 flows, how the number of clusters increases. After reading 1,174 flows, the number of clusters tends to increase smoothly (see Figure 4.9), and the number of clusters (92) represents only 8% of the number of flows.

- Flows with 7 packets per flow: Table 4.11 shows, from 5 to 2,106 flows, how the number of clusters increases. After reading 2,106 flows, the number of clusters tends to increase smoothly (see Figure 4.10), and the number of clusters (252) represents 11% of the number of flows.
Table 4.9: Flows with 5 pkts per flow
Cluster | Packet 1 | Packet 2 | Packet 3 | Packet 4 | Packet 5 | Prob
1  | 1:4:5:2:0:0:7:2:0    | 1:4:5:2:0:0:5:16:0   | 1:4:5:2:0:0:5:24:0  | 1:4:5:2:0:0:5:16:0  | 1:4:5:2:0:0:5:17:0  | 0.600
2  | 1:4:5:2:0:0:10:2:0   | 1:4:5:2:0:0:8:16:0   | 1:4:5:2:0:0:8:24:0  | 1:4:5:2:0:0:8:16:0  | 1:4:5:2:0:0:8:17:0  | 0.110
3  | 1:4:5:2:0:0:7:2:0    | 1:4:5:2:0:0:5:16:0   | 1:4:5:2:0:0:5:24:0  | 1:4:5:2:0:0:5:16:0  | 1:4:5:2:0:0:5:4:0   | 0.085
4  | 1:4:5:2:0:0:6:2:0    | 1:4:5:2:0:0:5:16:0   | 1:4:5:2:0:0:5:24:0  | 1:4:5:2:0:0:5:16:0  | 1:4:5:2:0:0:5:17:0  | 0.085
5  | 1:4:5:2:0:0:7:2:0    | 1:4:5:0:0:0:5:16:-67 | 1:4:5:2:0:0:5:16:0  | 1:4:5:2:0:0:5:24:0  | 1:4:5:2:0:0:5:4:0   | 0.010
6  | 1:4:5:2:0:0:10:18:0  | 1:4:5:2:0:0:8:16:0   | 1:4:5:2:0:0:8:24:0  | 1:4:5:2:0:0:8:24:0  | 1:4:5:2:0:0:8:25:0  | 0.015
7  | 1:4:5:2:0:0:7:18:0   | 1:4:5:2:0:0:5:16:0   | 1:4:5:2:0:0:5:24:0  | 1:4:5:2:0:0:5:24:0  | 1:4:5:2:0:0:5:17:0  | 0.005
8  | 1:4:5:2:0:16:6:2:0   | 1:4:5:0:0:0:5:16:-66 | 1:4:5:2:0:16:5:16:0 | 1:4:5:2:0:16:5:24:0 | 1:4:5:2:0:16:5:17:0 | 0.005
9  | 1:4:5:2:0:0:7:18:0   | 1:4:5:2:0:0:5:24:0   | 1:4:5:2:0:0:5:16:0  | 1:4:5:2:0:0:5:24:0  | 1:4:5:2:0:0:5:17:0  | 0.015
10 | 1:4:5:2:0:16:6:2:0   | 1:4:5:0:0:0:5:16:-67 | 1:4:5:2:0:16:5:16:0 | 1:4:5:2:0:16:5:24:0 | 1:4:5:2:0:16:5:17:0 | 0.010
11 | 1:4:5:2:0:0:7:2:0    | 1:4:5:2:0:0:5:16:0   | 1:4:5:2:0:0:5:24:2  | 1:4:5:2:0:0:5:16:0  | 1:4:5:2:0:0:5:17:0  | 0.020
12 | 1:4:5:2:0:0:6:2:0    | 1:4:5:0:0:0:5:16:-66 | 1:4:5:2:0:0:5:16:0  | 1:4:5:2:0:0:5:24:0  | 1:4:5:2:0:0:5:17:0  | 0.005
13 | 1:4:5:2:0:0:7:2:0    | 1:4:5:2:0:0:5:16:0   | 1:4:5:2:0:0:5:24:0  | 1:4:5:2:0:0:5:24:0  | 1:4:5:2:0:0:5:4:0   | 0.005
14 | 1:4:5:2:0:0:7:18:0   | 1:4:5:2:0:0:5:24:0   | 1:4:5:2:0:0:5:24:0  | 1:4:5:2:0:0:5:24:0  | 1:4:5:2:0:0:5:17:0  | 0.005
15 | 1:4:5:2:0:0:10:18:0  | 1:4:5:2:0:0:8:16:2   | 1:4:5:2:0:0:8:16:2  | 1:4:5:2:0:0:8:24:2  | 1:4:5:2:0:0:8:17:0  | 0.005
16 | 1:4:5:2:0:0:7:2:0    | 1:4:5:2:0:0:5:24:0   | 1:4:5:2:0:0:5:16:0  | 1:4:5:2:0:0:5:16:0  | 1:4:5:2:0:0:5:17:0  | 0.005
17 | 1:4:5:2:0:0:7:2:0    | 1:4:5:0:0:0:5:16:-67 | 1:4:5:2:0:0:5:16:0  | 1:4:5:2:0:0:5:24:0  | 1:4:5:2:0:0:5:17:0  | 0.005
18 | 1:4:5:2:0:0:7:2:0    | 1:4:5:2:0:0:5:16:0   | 1:4:5:2:0:0:5:24:0  | 1:4:5:2:0:0:5:24:0  | 1:4:5:2:0:0:5:17:0  | 0.005
19 | 1:4:5:2:0:0:7:2:0    | 1:4:5:0:0:0:5:16:-66 | 1:4:5:2:0:0:5:24:0  | 1:4:5:2:0:0:5:16:0  | 1:4:5:2:0:0:5:17:0  | 0.005

Table 4.10: Number of clusters for m=6 packets
Number of flows | Number of clusters | Percentage
52   | 18 | 0.34
157  | 29 | 0.18
518  | 59 | 0.11
1174 | 92 | 0.08

Table 4.11: Number of clusters for m=7 packets
Number of flows | Number of clusters | Percentage
5    | 5   | 1
146  | 50  | 0.34
396  | 95  | 0.23
949  | 163 | 0.17
2106 | 252 | 0.11

Figure 4.9: Number of clusters, m=6

Figure 4.10: Number of clusters, m=7

4.4. Conclusions

From the previous analysis, we have concluded that Internet flows are not very different from each other and that we can group a large number of them into few clusters. Table 4.12 extends this analysis, showing the relation between the number of clusters and the number of flows for each class of m-packet flows (m ranging from 2 to 20). As we see in Figure 4.11, for $m$ ranging from 2 to 7, this relation is highly favorable, which means that we can represent many flows with few clusters, reaching a high compression ratio. However, for $m$ greater than 7, this relation is impaired. This impaired relation does not mean that clusters do not exist for large flows, but rather that we would need an extremely large trace to find them, which is not practical for our purposes.
Table 4.12: Percentage of clusters for m-packet flows
Number of pkts per flow | Clusters/Flows (%)
2  | 6%
3  | 10%
4  | 6%
5  | 4%
6  | 8%
7  | 11%
8  | 24%
9  | 30%
10 | 39%
11 | 47%
12 | 59%
13 | 57%
14 | 54%
15 | 58%
16 | 66%
17 | 52%
18 | 66%
19 | 64%
20 | 79%

An extensive analysis carried out with many traces has produced the following conclusions:

- For small flows, behind the great number of flows in a high-speed link there is not much variety among them, and they can clearly be grouped into a set of clusters;
- For each subset of m-packet clusters, the TCP/IP flows are not equally distributed among the clusters, with a strong predominance of a few clusters;
- The same types of clusters are present in different traces;
- The evidence that Internet flows can be grouped into a small set of clusters led us to create templates of flows;
- Those templates constitute an efficient mechanism for packet trace compression and classification.

Figure 4.11: Relation between clusters and flows (percentage of clusters versus m-packet flows, m = 2 to 20)

Chapter 5

Entropy of TCP/IP Header Fields

In this chapter we discuss the definition of entropy and its applicability to TCP/IP header fields. The analyses were carried out at the packet and flow levels. The entropy of a random variable is a measure of the uncertainty of the random variable; it is a measure of the amount of information required on average to describe it. For data compression, the entropy of a random variable is a lower bound on the average length of the shortest description of the random variable, and it is used to establish the fundamental limit for the compression of information. Data compression can be achieved by assigning short descriptions to the most frequent outcomes of the data source, and necessarily longer descriptions to the less frequent outcomes. Thus the entropy is the data compression limit, as well as the number of bits needed in random number representation; codes achieving this limit turn out to be optimal.

5.1. Introduction

The entropy $H(X)$ of a random variable $X$ with probability mass function $p(x)$ is defined by

$$ H(X) = -\sum_x p(x) \log_2 p(x) \qquad (5.1) $$

Note that entropy is a functional of the distribution of $X$; it does not depend on the actual values taken by the random variable, but only on the probabilities. Using logarithms to base 2, the entropy is measured in bits. It is the average number of bits required to describe the random variable. The entropy of $X$ can also be interpreted as the expected value of $\log \frac{1}{p(X)}$, where $X$ is drawn according to the probability mass function $p(x)$. Thus:

$$ H(X) = E \log \frac{1}{p(X)} \qquad (5.2) $$
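A minimal sketch of equation (5.1) computed over an observed sequence of values, as done for the header fields later in this chapter; the toy field values are illustrative:

```python
import math
from collections import Counter

def entropy_bits(values) -> float:
    """Empirical entropy H(X) = -sum p(x) log2 p(x) of a sequence of
    observed values, in bits per symbol."""
    counts = Counter(values)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# A field taking one value 90% of the time compresses far below its width.
field = ["A"] * 90 + ["B"] * 5 + ["C"] * 5
print(round(entropy_bits(field), 3))  # ~0.569 bits
```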
5.1.1 Conditional Entropy

The conditional entropy is defined as the entropy of a random variable, given another random variable. The conditional entropy of a random variable given another is defined as the expected value of the entropies of the conditional distributions, averaged over the conditioning random variable:

$$ H(Y|X) = \sum_x p(x) \, H(Y|X = x) \qquad (5.3) $$
$$ H(Y|X) = -\sum_x p(x) \sum_y p(y|x) \log p(y|x) \qquad (5.4) $$
$$ H(Y|X) = -\sum_x \sum_y p(x, y) \log p(y|x) \qquad (5.5) $$
$$ H(Y|X) = -E \log p(Y|X) \qquad (5.6) $$

5.1.2 Joint Entropy

The joint entropy $H(X, Y)$ of a pair of discrete random variables $(X, Y)$ with joint distribution $p(x, y)$ is defined as:

$$ H(X, Y) = -\sum_x \sum_y p(x, y) \log p(x, y) \qquad (5.7) $$

which can also be expressed as

$$ H(X, Y) = -E \log p(X, Y) \qquad (5.8) $$

Also, the entropy of a pair of random variables can be written as the entropy of one plus the conditional entropy of the other:

$$ H(X, Y) = -\sum_x \sum_y p(x, y) \log p(x, y) \qquad (5.9) $$
$$ H(X, Y) = -\sum_x \sum_y p(x, y) \log \left[ p(x) \, p(y|x) \right] \qquad (5.10) $$
$$ H(X, Y) = -\sum_x \sum_y p(x, y) \log p(x) - \sum_x \sum_y p(x, y) \log p(y|x) \qquad (5.11) $$
$$ H(X, Y) = -\sum_x p(x) \log p(x) - \sum_x \sum_y p(x, y) \log p(y|x) \qquad (5.12) $$
$$ H(X, Y) = H(X) + H(Y|X) \qquad (5.13) $$

5.1.3 Relative Entropy

The relative entropy is a measure of the distance between two distributions. In statistics, it arises as the expected logarithm of the likelihood ratio. The relative entropy $D(p \| q)$ is a measure of the inefficiency of assuming that the distribution is $q$ when the true distribution is $p$. For example, if we knew the true distribution of the random variable, then we could construct a code with average description length $H(p)$. If, instead, we used the code for a distribution $q$, we would need $H(p) + D(p \| q)$ bits on the average to describe the random variable. The relative entropy between two probability mass functions $p(x)$ and $q(x)$ is defined as:

$$ D(p \| q) = \sum_x p(x) \log \frac{p(x)}{q(x)} \qquad (5.14) $$
$$ D(p \| q) = E_p \log \frac{p(X)}{q(X)} \qquad (5.15) $$

5.1.4 Mutual Information

The mutual information is a measure of the amount of information that one random variable contains about another random variable; it is the reduction in the uncertainty of one random variable due to the knowledge of the other. Consider two random variables $X$ and $Y$ with joint probability mass function $p(x, y)$ and marginal probability mass functions $p(x)$ and $p(y)$. The mutual information $I(X; Y)$ is the relative entropy between the joint distribution and the product distribution $p(x) p(y)$, i.e.,

$$ I(X; Y) = \sum_x \sum_y p(x, y) \log \frac{p(x, y)}{p(x) \, p(y)} \qquad (5.16) $$
$$ I(X; Y) = \sum_x \sum_y p(x, y) \log \frac{p(x|y)}{p(x)} \qquad (5.17) $$
$$ I(X; Y) = -\sum_x \sum_y p(x, y) \log p(x) + \sum_x \sum_y p(x, y) \log p(x|y) \qquad (5.18) $$
$$ I(X; Y) = -\sum_x p(x) \log p(x) - \left( -\sum_x \sum_y p(x, y) \log p(x|y) \right) \qquad (5.19) $$
$$ I(X; Y) = H(X) - H(X|Y) \qquad (5.20) $$
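A quick numeric check of the chain rule (5.13), computing $H(Y|X)$ from the conditional distributions as in (5.3), on a toy joint sample:

```python
import math
from collections import Counter, defaultdict

def entropy(counts):
    n = sum(counts.values())
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Toy joint sample to check H(X,Y) = H(X) + H(Y|X), eq (5.13).
xy = [(0, 0), (0, 1), (0, 1), (1, 0), (1, 0), (2, 1)]
h_joint = entropy(Counter(xy))
h_x = entropy(Counter(x for x, _ in xy))

# H(Y|X) = sum_x p(x) H(Y | X=x), eq (5.3), from the conditional counts.
by_x = defaultdict(Counter)
for x, y in xy:
    by_x[x][y] += 1
h_y_given_x = sum(sum(c.values()) / len(xy) * entropy(c) for c in by_x.values())

print(round(h_joint, 6) == round(h_x + h_y_given_x, 6))  # True
```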
5.1.5 Asymptotic Equipartition Property

In information theory, the analog of the law of large numbers is the Asymptotic Equipartition Property (AEP). It is a direct consequence of the weak law of large numbers. The law of large numbers for independent, identically distributed (i.i.d.) random variables states that $\frac{1}{n}\sum_{i=1}^{n} X_i$ is close to its expected value for large values of $n$. The AEP states that $\frac{1}{n} \log \frac{1}{p(X_1, X_2, \ldots, X_n)}$ is close to the entropy $H$, where $X_1, X_2, \ldots, X_n$ are i.i.d. random variables and $p(X_1, X_2, \ldots, X_n)$ is the probability of observing the sequence $X_1, X_2, \ldots, X_n$. Thus the probability assigned to an observed sequence will be close to $2^{-nH}$. The asymptotic equipartition property can be formalized in the following way:

$$ -\frac{1}{n} \log p(X_1, X_2, \ldots, X_n) = -\frac{1}{n} \sum_i \log p(X_i) \qquad (5.21) $$
$$ -\frac{1}{n} \log p(X_1, X_2, \ldots, X_n) \to -E \log p(X) \qquad (5.22) $$
$$ -\frac{1}{n} \log p(X_1, X_2, \ldots, X_n) \to H(X) \qquad (5.23) $$

This enables us to divide the set of all sequences into two sets: the typical set, where the sample entropy is close to the true entropy, and the non-typical set, which contains the other sequences (see Figure 5.1). Any property that is proved for the typical sequences will then be true with high probability and will determine the average behavior of a large sample.

Let $X_1, X_2, \ldots, X_n$ be independent and identically distributed random variables with probability mass function $p(x)$. We wish to find short descriptions for such sequences of random variables. First, we divide all sequences into two sets: the typical set and its complement. The typical set $A_\epsilon^{(n)}$ with respect to $p(x)$ is the set of sequences $(x_1, x_2, \ldots, x_n)$ with the following property:

$$ 2^{-n(H(X)+\epsilon)} \leq p(x_1, x_2, \ldots, x_n) \leq 2^{-n(H(X)-\epsilon)} \qquad (5.24) $$

We can represent each sequence of the typical set by giving the index of the sequence in the set. Since there are at most $2^{n(H+\epsilon)}$ sequences in the typical set, the index requires no more than $n(H+\epsilon)+1$ bits. For $n$ sufficiently large, $\epsilon$ small, and using the notation $x^n$ to denote a sequence $x_1, x_2, \ldots, x_n$ and $l(x^n)$ to denote the length of the codeword corresponding to $x^n$, we can represent sequences $x^n$ using $nH(X)$ bits on the average.

Figure 5.1: The Asymptotic Equipartition Property (the set of all sequences divided into the typical set and the non-typical set)

5.2. Packet Level Entropy

For a sequence of packets in an Internet trace, we can assume that the sequence of header field values constitutes a stationary ergodic process. Moreover, given the high aggregation level of high-speed Internet links, we can also assume that the header field values of consecutive packets are independent. In this case, the joint entropy is:

$$ H(F_i, F_{i+1}) = H(F_i) + H(F_{i+1} | F_i) \qquad (5.25) $$
$$ H(F_i, F_{i+1}) = H(F_i) + H(F_{i+1}) \qquad (5.26) $$
$$ H(F_i, F_{i+1}) = 2 \cdot H(F) \qquad (5.27) $$

where $H(F_i)$ is the header field entropy of packet $i$. For a sequence of $n$ packets:

$$ H(F_i, F_{i+1}, \ldots, F_n) = H(F_i) + H(F_{i+1} | F_i) + \cdots + H(F_n | F_{n-1}, \ldots, F_1) \qquad (5.28) $$
$$ H(F_i, F_{i+1}, \ldots, F_n) = n \cdot H(F) \qquad (5.29) $$

5.2.1 Header Field Entropy

Using a header field approach, we have calculated the entropy of each of the TSH header fields of the RedIRIS trace. For our analysis, we used 1,000,000 packets. The summation of the corresponding code sizes gives us the average length that establishes the limit for the compression of packet headers. We have used the following header fields: Timestamp, Interface, Version, Internet Header Length (IHL), Type of Service (TOS), Total Length, Identification, Flags, Fragment Offset, Time to Live, Protocol, Header Checksum, Source Address, Destination Address, Source Port, Destination Port, Sequence Number, Acknowledgment Number, Data Offset, Control Bits, and Window. Below, we give a brief description of these header fields (obtained from [99] and [100]) and show the calculated entropy; a computational sketch of the per-field calculation comes first.
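A minimal sketch of this per-field computation; the record representation (dicts keyed by field name, as the reader sketched in Section 2.3 could produce) is an assumption:

```python
import math
from collections import Counter

def field_entropies(records, fields):
    """Compute the empirical entropy of each selected header field over a
    trace; `records` is an iterable of dicts keyed by field name.
    Returns the entropy of each field in bits per packet."""
    counters = {f: Counter() for f in fields}
    n = 0
    for rec in records:
        n += 1
        for f in fields:
            counters[f][rec[f]] += 1
    return {f: -sum((c / n) * math.log2(c / n) for c in counters[f].values())
            for f in fields}

# Summing the per-field entropies bounds the compressed header size per
# packet, under the independence assumption of equation (5.29).
recs = [{"ttl": 64, "proto": 6}, {"ttl": 63, "proto": 6}, {"ttl": 64, "proto": 17}]
print(field_entropies(recs, ["ttl", "proto"]))
```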
- Timestamp: The timestamp is composed of two fields: a first that records the time in seconds (32 bits) and a second that records the microseconds (24 bits). The timestamp shows a highly skewed distribution. However, we did not calculate the entropy of the timestamp values themselves, but of the inter-packet times between consecutive packets. In this case, the number of bits needed to represent this header field is significantly lower. This does not pose any problem, because the start time is a random value and the most important quantity is the inter-arrival time. Table 5.1 shows some inter-packet times and their associated frequencies. As we can see, the inter-packet times of 8 μs and 7 μs are very common, with frequencies of 17% and 12% respectively. Table 5.2 shows the entropy of the timestamp header field.

Table 5.1: Timestamp Header Field
Inter-packet time (sec) | Probability p(x)
0.000055 | 0.002054
0.000010 | 0.003099
0.000034 | 0.002352
0.000023 | 0.005603
0.000011 | 0.001727
0.000048 | 0.001218
0.000049 | 0.002505
0.000009 | 0.012568
0.000047 | 0.002521
0.000046 | 0.003484
0.000032 | 0.002680
0.000019 | 0.005069
0.000016 | 0.000705
0.000048 | 0.000647
0.000024 | 0.001683
0.000008 | 0.171074
0.000007 | 0.124183
0.000016 | 0.002643
0.000032 | 0.003537
0.000012 | 0.002018
0.000007 | 0.048100
0.000033 | 0.005350
0.000009 | 0.012205
0.000033 | 0.001073
0.000014 | 0.006262
0.000034 | 0.000593
0.000013 | 0.007582
0.000031 | 0.003082
0.000011 | 0.000644
0.000012 | 0.001001
0.000030 | 0.001514
0.000018 | 0.007563
0.000028 | 0.004963
0.000035 | 0.001344
0.000026 | 0.002027
...

Table 5.2: Timestamp Header Field
Original Size (bits) | Entropy H(x)
56 | 6.432200

- Interface: In the RedIRIS trace, packets are recorded in only one direction; in fact, the only interface used is the output link. Table 5.3 lists the single element present, and Table 5.4 shows the original size in bits and the entropy.

Table 5.3: Interface Header Field
Element | Probability p(x)
1 | 1.00

Table 5.4: Interface Header Field
Original Size (bits) | Entropy H(x)
8 | 0.00

- Version: The Version field indicates the format of the Internet header. Table 5.5 shows the only IP version present in the trace. Table 5.6 shows the size in bits of the original field and the calculated entropy.

Table 5.5: IP Version Header Field
Element | Probability p(x)
4 | 1.00

- Internet Header Length: Internet Header Length is the length of the Internet header in 32-bit words, and thus points to the beginning of the data. Note that the minimum value for a correct header is 5. Table 5.7 shows the only IHL value present in the trace. Table 5.8 shows the original size in bits and the calculated entropy.

- Type of Service: The Type of Service provides an indication of the abstract parameters of the quality of service desired. These parameters are to be used to guide the selection of the actual service parameters when transmitting a datagram through a particular network. Several networks offer service precedence, which somehow treats high-precedence traffic as more important than other traffic (generally by accepting only traffic above a certain precedence at times of high load). The major choice is a three-way tradeoff between low delay, high reliability, and high throughput. Table 5.9 shows the different elements present in the trace. Table 5.10 shows the original size in bits and the calculated entropy.

Table 5.6: IP Version Header Field
Original Size (bits) | Entropy H(x)
4 | 0.00

Table 5.7: IHL Field
Element | Probability p(x)
5 | 1.00

Table 5.8: IHL Header Field
Original Size (bits) | Entropy H(x)
4 | 0.00

Table 5.9: TOS Header Field
Element | Probability p(x)
0   | 0.980928
28  | 0.000176
16  | 0.014960
160 | 0.000484
21  | 0.000880
27  | 0.000088
128 | 0.000880
8   | 0.000176
7   | 0.000264
18  | 0.000176
192 | 0.000176
12  | 0.000132
136 | 0.000044
64  | 0.000088
24  | 0.000132
3   | 0.000044
...
Table 5.10: TOS Header Field
Original Size (bits)   Entropy H(x)
8                      0.179016

- Total Length: Total Length is the length of the datagram, measured in octets, including the Internet header and data. This field allows the length of a datagram to be up to 65,535 octets. Such long datagrams are impractical for most hosts and networks. All hosts must be prepared to accept datagrams of up to 576 octets (whether they arrive whole or in fragments). It is recommended that hosts only send datagrams larger than 576 octets if they have assurance that the destination is prepared to accept the larger datagrams. The number 576 is selected to allow a reasonably sized data block to be transmitted in addition to the required header information: for example, this size allows a data block of 512 octets plus 64 header octets to fit in a datagram. The maximal Internet header is 60 octets, and a typical Internet header is 20 octets, allowing a margin for the headers of higher-level protocols. Table 5.11 shows the different elements present in the trace. Table 5.12 shows the size in bits of the original field and the calculated entropy.

Table 5.11: Total Length Header Field
Elements   Probability p(x)
1440       0.006618
40         0.398533
985        0.000281
552        0.004167
64         0.005891
1500       0.185472
78         0.022422
52         0.068102
1374       0.000024
1480       0.010455
886        0.000041
48         0.030769
...        ...

Table 5.12: Total Length Header Field
Original Size (bits)   Entropy H(x)
16                     4.492404

- Identification: An identifying value assigned by the sender to aid in assembling the fragments of a datagram. Each flow assigns a random value to its first packet; hence, this field shows a highly skewed distribution of values, and we assumed that compressing it is not possible, so no compression was applied to it (see Table 5.13).

Table 5.13: Identification Header Field
Original Size (bits)   Entropy H(x)
16                     16

- Flags: This field has only three bits: bit 0 (reserved, must be zero), bit 1 (DF: 0 = May Fragment, 1 = Don't Fragment), and bit 2 (MF: 0 = Last Fragment, 1 = More Fragments). Table 5.14 shows the different elements present in the trace. Table 5.15 shows the size in bits of the original field and the calculated entropy.

Table 5.14: Flags Header Field
Elements   Probability p(x)
2          0.936070
0          0.063710
1          0.000220

Table 5.15: Flags Header Field
Original Size (bits)   Entropy H(x)
3                      0.345932

- Fragment Offset: This 13-bit field indicates where in the datagram this fragment belongs. The fragment offset is measured in units of 8 octets (64 bits); the first fragment has offset zero. Table 5.16 shows the different elements present in the trace. Table 5.17 shows the size in bits of the original field and the calculated entropy.

- Time to Live: This field indicates the maximum time the datagram is allowed to remain in the Internet system; if it contains the value zero, the datagram must be destroyed. The field is modified during Internet header processing. The time is measured in units of seconds, but since every module that processes a datagram must decrease the TTL by at least one, even if it processes the datagram in less than a second, the TTL must be thought of only as an upper bound on the time a datagram may exist. The intention is to cause undeliverable datagrams to be discarded and to bound the maximum datagram lifetime. Table 5.18 shows the most important elements present in the trace, and Table 5.19 shows the size in bits of the original field and the calculated entropy.
- Protocol: This field indicates the next-level protocol used in the data portion of the Internet datagram. Table 5.20 shows the different elements present in the trace. Table 5.21 shows the size in bits of the original field and the calculated entropy.

Table 5.16: Fragment Offset Header Field
Elements   Probability p(x)
0          0.999780
185        0.000088
1295       0.000044
1639       0.000088

Table 5.17: Fragment Offset Header Field
Original Size (bits)   Entropy H(x)
13                     0.003325

Table 5.18: Time to Live Header Field
Elements   Probability p(x)
61         0.075678
124        0.115364
123        0.259768
125        0.227341
126        0.113956
...        ...

Table 5.19: Time to Live Header Field
Original Size (bits)   Entropy H(x)
8                      3.206642

Table 5.20: Protocol Header Field
Elements   Probability p(x)
6          0.944122
17         0.048530
1          0.007260
50         0.000044
0          0.000044

Table 5.21: Protocol Header Field
Original Size (bits)   Entropy H(x)
8                      0.343014

- Header Checksum: A checksum on the header only. The checksum field is the 16-bit one's complement of the one's complement sum of all 16-bit words in the header. Since the checksum is simple to recompute, we have reserved only one bit to indicate whether the stored checksum is correct or not (see Table 5.22).

Table 5.22: Header Checksum Header Field
Original Size (bits)   Entropy H(x)
16                     1

- Source Address: The source address. Table 5.23 shows the size in bits of the original field and the calculated entropy.

Table 5.23: Source Address Header Field
Original Size (bits)   Entropy H(x)
32                     8.667664

- Destination Address: The destination address. Table 5.24 shows the size in bits of the original field and the calculated entropy.

Table 5.24: Destination Address Header Field
Original Size (bits)   Entropy H(x)
32                     10.258050

- Source Port: The source port number. Table 5.25 shows the size in bits of the original field and the calculated entropy.

- Destination Port: The destination port number. Table 5.26 shows the size in bits of the original field and the calculated entropy.

- Sequence Number: The sequence number of the first data octet in this segment (except when SYN is present; if SYN is present, the sequence number is the initial sequence number (ISN) and the first data octet is ISN+1). As with the Identification header field, each flow assigns a random value to its first packet, so the distribution of values is highly skewed and compression is not possible (see Table 5.27).

- Acknowledgment Number: If the ACK control bit is set, this field contains the value of the next sequence number the sender of the segment is expecting to receive; once a connection is established, it is always sent. As with the Sequence Number field, the distribution is highly skewed and compression is likewise not possible (see Table 5.28).

Table 5.25: Source Port Header Field
Original Size (bits)   Entropy H(x)
16                     9.002667

Table 5.26: Destination Port Header Field
Original Size (bits)   Entropy H(x)
16                     6.713252

Table 5.27: Sequence Number Header Field
Original Size (bits)   Entropy H(x)
32                     32

Table 5.28: Acknowledgment Number Header Field
Original Size (bits)   Entropy H(x)
32                     32

- Data Offset: The number of 32-bit words in the TCP header; this indicates where the data begins. The TCP header (even one including options) is an integral number of 32 bits long. Table 5.29 shows the different elements present in the trace. Table 5.30 shows the size in bits of the original field and the calculated entropy.
Table 5.29: Data Offset Header Field
Elements   Probability p(x)
8          0.102957
5          0.797254
11         0.005412
0          0.045715
7          0.028467
10         0.006380
6          0.006864
2          0.000396
13         0.000660
3          0.000528
15         0.000924
14         0.002772
9          0.000088
4          0.000572
1          0.000264
12         0.000748

Table 5.30: Data Offset Header Field
Original Size (bits)   Entropy H(x)
4                      1.152861

- Reserved: Reserved for future use; must be zero. Consequently, we do not need to store it.

- Control Bits: This field records the TCP flags URG (Urgent Pointer field significant), ACK (Acknowledgment field significant), PSH (Push Function), RST (Reset the connection), SYN (Synchronize sequence numbers), and FIN (No more data from sender). Table 5.31 shows the different elements present in the trace. Table 5.32 shows the size in bits of the original field and the calculated entropy.

- Window: The number of data octets, beginning with the one indicated in the acknowledgment field, that the sender of this segment is willing to accept. This field shows a large number of possible values. Table 5.33 shows the different elements present in the trace. Table 5.34 shows the size in bits of the original field and the calculated entropy.

Table 5.31: Control Bits Header Field
Elements   Probability p(x)
16         0.621172
24         0.252288
2          0.032119
1          0.036475
17         0.021867
...        ...

Table 5.32: Control Bits Header Field
Original Size (bits)   Entropy H(x)
6                      1.694607

Table 5.33: Window Header Field
Elements   Probability p(x)
33312      0.001364
16093      0.000044
16459      0.000220
16040      0.000484
8192       0.013332
8760       0.119720
37376      0.010164
0          0.057814
65535      0.034187
17254      0.000044
17232      0.001364
16448      0.003476
17520      0.112020
63898      0.000220
63002      0.000968
16532      0.000132
31680      0.003520
24820      0.015224
...        ...

Table 5.34: Window Header Field
Original Size (bits)   Entropy H(x)
16                     7.324844

Table 5.35 summarizes the entropies calculated above. Using a header-field compression approach, the compression ratio is limited to 40% (141/352 bits), which means that a 100MB packet header file can be reduced to at most 40MB. Clearly, this compression bound is not satisfactory, and other approaches must be evaluated to reach higher compression ratios.

Table 5.35: Summary
Header Field            Size (bits)   Entropy H(x)
Timestamp               56            6.432200
Interface               8             0.000000
Version                 4             0.000000
IHL                     4             0.000000
Type of Service         8             0.179016
Total Length            16            4.492404
Identification          16            16.000000
Flags                   3             0.345932
Fragment Offset         13            0.003325
Time to Live            8             3.206642
Protocol                8             0.343014
Header Checksum         16            1.000000
Source Address          32            8.667664
Destination Address     32            10.258050
Source Port             16            9.002667
Destination Port        16            6.713252
Sequence Number         32            32.000000
Acknowledgment Number   32            32.000000
Data Offset             4             1.152861
Reserved                6             0.000000
Control Bits            6             1.694607
Window                  16            7.324844
Total                   352           140.816478

5.2.2 Joint Fields Entropy

In this section we are interested in evaluating the behavior of the entropy when we aggregate header fields. For each aggregation level, we present two tables: the first shows the probability associated with each joint value combination, and the second shows the entropy of the joint header fields.
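Conceptually, the joint entropy of aggregated fields is obtained by treating each packet's tuple of selected field values as a single symbol and applying the same entropy computation as before. A minimal sketch (the field names and packets-as-dicts representation are assumptions of this example, not the thesis's code):

```python
import math
from collections import Counter

def entropy(symbols):
    """Shannon entropy of an observed sample (as in Section 5.2.1)."""
    counts = Counter(symbols)
    n = sum(counts.values())
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def joint_entropy(packets, fields):
    """Entropy of aggregated header fields: every packet contributes a
    single symbol, namely the tuple of its values for the chosen fields."""
    return entropy(tuple(pkt[f] for f in fields) for pkt in packets)

# aggregation level 2, for example:
# joint_entropy(packets, ['version', 'ihl', 'flags'])
```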
- Aggregation level 1: Version+IHL (Tables 5.36 and 5.37).

Table 5.36: Joint Version,IHL
Joint Elements   Joint Probability p(x)
p(4,5)           1.00

Table 5.37: Joint Version,IHL
Joint Size (bits)   Entropy H(x,y)
8                   0.00

- Aggregation level 2: Version+IHL+Flags (Tables 5.38 and 5.39).

Table 5.38: Joint Version,IHL,Flags
Joint Elements   Joint Probability p(x)
p(4,5,2)         0.936070
p(4,5,0)         0.063710
p(4,5,1)         0.000220

Table 5.39: Joint Version,IHL,Flags
Joint Size (bits)   Entropy H(x,y)
11                  0.344969

- Aggregation level 3: Version+IHL+Flags+Fragment Offset (Tables 5.40 and 5.41).

Table 5.40: Joint Version,IHL,Flags,Fragment Offset
Joint Elements   Joint Probability p(x)
p(4,5,2,0)       0.936070
p(4,5,0,0)       0.063622
p(4,5,1,0)       0.000088
p(4,5,1,185)     0.000088
p(4,5,1,1295)    0.000044
p(4,5,0,1639)    0.000088

Table 5.41: Joint Version,IHL,Flags,Fragment Offset
Joint Size (bits)   Entropy H(x,y)
24                  0.346266

- Aggregation level 4: Version+IHL+Flags+FragOff+Protocol (Tables 5.42 and 5.43).

Table 5.42: Joint Version,IHL,Flags,FragOff,Protocol
Joint Elements    Joint Probability p(x)
p(4,5,2,0,6)      0.928326
p(4,5,0,0,6)      0.015795
p(4,5,2,0,17)     0.006688
p(4,5,0,0,17)     0.041843
p(4,5,2,0,1)      0.001056
p(4,5,0,0,1)      0.005896
p(4,5,1,0,1)      0.000088
p(4,5,1,185,1)    0.000088
p(4,5,1,1295,1)   0.000044
p(4,5,0,1639,1)   0.000088
p(4,5,0,0,50)     0.000044
p(4,5,0,0,0)      0.000044

Table 5.43: Joint Version,IHL,Flags,FragOff,Protocol
Joint Size (bits)   Entropy H(x,y)
32                  0.493611

- Aggregation level 5: Version+IHL+Flags+FragOff+Protocol+Type of Service (Tables 5.44 and 5.45).

Table 5.44: Joint Version,IHL,Flags,FragOff,Protocol,TOS
Joint Elements      Joint Probability p(x)
p(4,5,2,0,6,0)      0.910815
p(4,5,2,0,6,28)     0.000176
p(4,5,0,0,17,0)     0.041843
p(4,5,0,0,6,0)      0.015136
p(4,5,2,0,6,16)     0.014960
p(4,5,2,0,17,0)     0.006688
p(4,5,0,0,1,0)      0.005368
p(4,5,2,0,6,160)    0.000484
p(4,5,2,0,6,21)     0.000880
p(4,5,2,0,6,27)     0.000088
p(4,5,0,0,6,128)    0.000660
p(4,5,0,0,1,128)    0.000088
p(4,5,2,0,6,8)      0.000176
p(4,5,2,0,1,0)      0.001056
p(4,5,0,0,1,7)      0.000264
p(4,5,2,0,6,18)     0.000176
p(4,5,1,185,1,0)    0.000088
p(4,5,0,0,1,192)    0.000176
p(4,5,1,1295,1,0)   0.000044
p(4,5,0,1639,1,0)   0.000088
p(4,5,2,0,6,128)    0.000132
p(4,5,0,0,50,0)     0.000044
p(4,5,2,0,6,12)     0.000132
p(4,5,1,0,1,0)      0.000088
p(4,5,2,0,6,136)    0.000044
p(4,5,2,0,6,64)     0.000088
p(4,5,2,0,6,24)     0.000132
p(4,5,2,0,6,3)      0.000044
p(4,5,0,0,0,0)      0.000044

Table 5.45: Joint Version,IHL,Flags,FragOff,Protocol,TOS
Joint Size (bits)   Entropy H(x,y)
40                  0.644337

- Aggregation level 6: Version+IHL+Flags+FragOff+Protocol+TOS+DataOffset (Tables 5.46 and 5.47).

Table 5.46: Joint Version,IHL,Flags,FragOff,Protocol,TOS,DataOffset
Joint Elements        Joint Probability p(x)
p(4,5,2,0,6,0,8)      0.100493
p(4,5,2,0,6,0,5)      0.766543
p(4,5,2,0,6,28,11)    0.000044
p(4,5,0,0,17,0,0)     0.033791
p(4,5,2,0,6,0,7)      0.026927
p(4,5,2,0,6,0,10)     0.005808
p(4,5,0,0,6,0,5)      0.014696
p(4,5,2,0,6,0,11)     0.004532
p(4,5,2,0,6,0,6)      0.005500
p(4,5,0,0,17,0,8)     0.000528
p(4,5,2,0,6,16,5)     0.014168
p(4,5,0,0,17,0,2)     0.000352
p(4,5,0,0,17,0,13)    0.000132
p(4,5,2,0,17,0,0)     0.006512
p(4,5,0,0,1,0,0)      0.004752
p(4,5,0,0,17,0,3)     0.000396
p(4,5,0,0,17,0,15)    0.000352
p(4,5,0,0,17,0,14)    0.002684
p(4,5,2,0,6,160,5)    0.000484
p(4,5,2,0,6,28,5)     0.000132
p(4,5,2,0,6,21,8)     0.000616
p(4,5,0,0,17,0,5)     0.000220
p(4,5,0,0,17,0,9)     0.000088
p(4,5,2,0,6,27,5)     0.000088
p(4,5,0,0,6,128,5)    0.000660
p(4,5,0,0,1,128,4)    0.000088
...                   ...
Table 5.47: Joint Version,IHL,Flags,FragOff,Protocol,TOS,DataOffset
Joint Size (bits)   Entropy H(x,y)
44                  1.495013

- Aggregation level 7: Version+IHL+Flags+FragOff+Protocol+TOS+DataOffset+Control Bits (Tables 5.48 and 5.49).

Table 5.48: Joint Version,IHL,Flags,FragOff,Protocol,TOS,DataOffset,Control Bits
Joint Elements          Joint Probability p(x)
p(4,5,2,0,6,0,8,16)     0.082453
p(4,5,2,0,6,0,5,16)     0.509988
p(4,5,2,0,6,0,5,24)     0.231389
p(4,5,2,0,6,28,11,2)    0.000044
p(4,5,0,0,17,0,0,1)     0.030007
p(4,5,2,0,6,0,7,2)      0.023803
p(4,5,2,0,6,0,10,2)     0.002772
p(4,5,0,0,6,0,5,16)     0.009460
p(4,5,0,0,17,0,0,7)     0.000044
p(4,5,2,0,6,0,5,17)     0.018479
p(4,5,2,0,6,0,11,16)    0.004268
p(4,5,2,0,6,0,6,18)     0.001012
p(4,5,2,0,6,0,7,18)     0.003124
p(4,5,2,0,6,0,8,24)     0.014696
p(4,5,0,0,17,0,8,20)    0.000088
p(4,5,2,0,6,16,5,16)    0.009460
p(4,5,0,0,17,0,2,13)    0.000044
p(4,5,0,0,17,0,13,31)   0.000044
p(4,5,2,0,6,16,5,24)    0.004356
p(4,5,0,0,6,0,5,4)      0.002816
p(4,5,2,0,17,0,0,1)     0.006424
p(4,5,2,0,6,0,5,4)      0.005632
p(4,5,0,0,1,0,0,0)      0.004708
p(4,5,2,0,6,0,8,17)     0.002728
p(4,5,0,0,17,0,0,14)    0.000044
...                     ...

Table 5.49: Joint Version,IHL,Flags,FragOff,Protocol,TOS,DataOffset,Control Bits
Joint Size (bits)   Entropy H(x,y)
50                  2.541000

- Aggregation level 8: Version+IHL+Flags+FragOff+Protocol+TOS+DataOffset+Control Bits+Length (Tables 5.50 and 5.51).

Table 5.50: Joint Version,IHL,Flags,FragOff,Protocol,TOS,DataOffset,Control Bits,Length
Joint Elements             Joint Probability p(x)
p(4,5,2,0,6,0,8,16,1440)   0.001364
p(4,5,2,0,6,0,5,16,40)     0.364088
p(4,5,2,0,6,0,5,24,985)    0.000088
p(4,5,2,0,6,0,5,16,552)    0.001496
p(4,5,2,0,6,28,11,2,64)    0.000044
p(4,5,2,0,6,0,5,16,1500)   0.105729
p(4,5,0,0,17,0,0,1,78)     0.024243
p(4,5,2,0,6,0,5,24,1500)   0.057506
p(4,5,2,0,6,0,5,24,52)     0.000836
p(4,5,2,0,6,0,5,24,1374)   0.000044
p(4,5,2,0,6,0,5,24,1480)   0.007172
p(4,5,2,0,6,0,5,24,886)    0.000044
p(4,5,2,0,6,0,7,2,48)      0.023803
p(4,5,2,0,6,0,5,24,378)    0.000308
p(4,5,2,0,6,0,5,24,256)    0.000088
p(4,5,2,0,6,0,5,24,582)    0.001452
p(4,5,2,0,6,0,5,24,498)    0.000132
p(4,5,2,0,6,0,10,2,60)     0.002772
p(4,5,2,0,6,0,5,24,1440)   0.002112
p(4,5,0,0,6,0,5,16,40)     0.007700
p(4,5,2,0,6,0,5,24,281)    0.000088
p(4,5,0,0,17,0,0,1,153)    0.000088
p(4,5,2,0,6,0,5,24,396)    0.000132
...                        ...

Table 5.51: Joint Version,IHL,Flags,FragOff,Protocol,TOS,DataOffset,Control Bits,Length
Joint Size (bits)   Entropy H(x,y)
66                  5.228122

Using a joint header approach, the compression ratio improves by only 1%, reaching a limit of 39%: a 100MB packet header file can be reduced to at most 39MB. This new approach does not yield a large gain over the previous method, so we must evaluate yet another approach.

Table 5.52: Summary
Header Field            Size (bits)   Entropy H
Joint fields            66            5.228122
Interface               8             0.000000
Timestamp               56            6.432200
Identification          16            16.000000
Time to Live            8             3.206642
Header Checksum         16            1.000000
Source Address          32            8.667664
Destination Address     32            10.258050
Source Port             16            9.002667
Destination Port        16            6.713252
Sequence Number         32            32.000000
Acknowledgment Number   32            32.000000
Reserved                6             0.000000
Window                  16            7.324844
Total                   352           137.833441

5.3. Flow Level Entropy

We have seen that, for a sequence of packets in an Internet trace, we assumed the packets to be independent. However, this is not true for a sequence of packets within the same flow, where some header fields show a strong dependence, such as the IP addresses and TCP port numbers. For a flow approach, then, we have three scenarios.
5.3.1 First Scenario

From the chain rule for entropy we have that if $(X_1, X_2, \ldots, X_n)$ is drawn according to $p(x_1, x_2, \ldots, x_n)$, then:

$$H(X_1, X_2, \ldots, X_n) = \sum_{i=1}^{n} H(X_i \mid X_{i-1}, \ldots, X_1) \quad (5.30)$$

In the first scenario, the value of the header field does not change along a flow, so for a flow with $n$ packets:

$$H(F_2 \mid F_1) = 0,\;\; H(F_3 \mid F_2, F_1) = 0,\;\; \ldots,\;\; H(F_n \mid F_{n-1}, \ldots, F_1) = 0 \quad (5.31)$$

and consequently:

$$H(F_1, F_2, \ldots, F_n) = H(F_1) \quad (5.32)$$

This scenario embraces the following fields:

- Interface
- Version
- IHL
- Type of Service
- Flags
- Fragment Offset
- Protocol

5.3.2 Second Scenario

In the second scenario, we have that:

$$H(F_2 \mid F_1) \neq 0,\;\; H(F_3 \mid F_2, F_1) \neq 0,\;\; \ldots,\;\; H(F_n \mid F_{n-1}, \ldots, F_1) \neq 0 \quad (5.33)$$

but these conditional entropies are all close to zero. According to the chain rule:

$$H(F_i, F_{i+1}) = H(F_i) + \underbrace{H(F_{i+1} \mid F_i)}_{\approx\, 0} \quad (5.34)$$

$$H(F_i, F_{i+1}) \approx H(F) \quad (5.35)$$

where $H(F_i)$ is the entropy of the header field in packet $i$. For a flow with $n$ packets:

$$H(F_i, F_{i+1}, \ldots, F_n) = H(F_i) + \underbrace{H(F_{i+1} \mid F_i)}_{\approx\, 0} + \cdots + \underbrace{H(F_n \mid F_{n-1}, \ldots, F_1)}_{\approx\, 0} \quad (5.36)$$

$$H(F_1, F_2, \ldots, F_n) \approx H(F) \quad (5.37)$$

This scenario embraces the following fields:

- Source Address
- Destination Address
- Source Port
- Destination Port
- Time to Live

To calculate the conditional entropy of a header field, we normalize the value in every packet relative to the first packet of its flow. Table 5.53 shows the normalized values for flows with 5 packets; from these we obtain the conditional entropies $H(F_2 \mid F_1)$, $H(F_3 \mid F_2, F_1)$, $H(F_4 \mid F_3, F_2, F_1)$ and $H(F_5 \mid F_4, F_3, F_2, F_1)$, all of them close to zero.

Table 5.53: Flows with 5 pkts per flow
Flow   Pkt 1   Pkt 2   Pkt 3   Pkt 4   Pkt 5
1      0       0       0       0       0
2      0       0       0       0       0
3      0       0       0       0       0
...    ...     ...     ...     ...     ...
28     0       0       -67     0       0
29     0       0       0       0       0
...    ...     ...     ...     ...     ...

- Data Offset: Table 5.54 shows the normalized values for flows with 5 packets, again yielding conditional entropies close to zero.

Table 5.54: Flows with 5 pkts per flow
Flow   Pkt 1   Pkt 2   Pkt 3   Pkt 4   Pkt 5
1      0       -2      -2      -2      -2
2      0       -2      -2      -2      -2
3      0       -2      -2      -2      -2
...    ...     ...     ...     ...     ...
21     0       -1      -1      -1      -1
22     0       -2      -2      -2      -2
...    ...     ...     ...     ...     ...

- Control Bits: Table 5.55 shows the normalized values for flows with 5 packets and the corresponding conditional entropies.

Table 5.55: Flows with 5 pkts per flow
Flow   Pkt 1   Pkt 2   Pkt 3   Pkt 4   Pkt 5
1      0       14      22      14      15
2      0       14      22      14      15
3      0       14      22      14      15
...    ...     ...     ...     ...     ...
20     0       14      22      14      2
...    ...     ...     ...     ...     ...
28     0       14      14      22      2
...    ...     ...     ...     ...     ...
44     0       -2      6       6       7
...    ...     ...     ...     ...     ...

5.3.3 Third Scenario

In the last scenario, the sequence of header field values within a flow behaves like the values in a sequence of consecutive, independent packets. Hence, in this case:

$$H(F_2 \mid F_1) \neq 0,\;\; H(F_3 \mid F_2, F_1) \neq 0,\;\; \ldots,\;\; H(F_n \mid F_{n-1}, \ldots, F_1) \neq 0 \quad (5.38)$$

and the joint entropy is:

$$H(F_i, F_{i+1}) = H(F_i) + H(F_{i+1} \mid F_i) \quad (5.39)$$

$$H(F_i, F_{i+1}) = H(F_i) + H(F_{i+1}) \quad (5.40)$$

$$H(F_i, F_{i+1}) = 2H(F) \quad (5.41)$$

For a flow with $n$ packets:

$$H(F_i, F_{i+1}, \ldots, F_n) = H(F_i) + H(F_{i+1} \mid F_i) + \cdots + H(F_n \mid F_{n-1}, \ldots, F_1) \quad (5.42)$$

$$H(F_1, F_2, \ldots, F_n) = nH(F) \quad (5.43)$$

This third scenario embraces the following fields:

- Timestamp
- Total Length
- Identification
- Sequence Number
- Acknowledgment Number
- Window
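The conditional entropies that distinguish the three scenarios can be estimated directly from the normalized per-packet values using the identity $H(Y \mid X) = H(X, Y) - H(X)$. A minimal sketch (the flows5 structure, a list of per-flow value sequences, is a hypothetical input of this example):

```python
import math
from collections import Counter

def entropy(symbols):
    """Shannon entropy of an observed sample (as in Section 5.2.1)."""
    counts = Counter(symbols)
    n = sum(counts.values())
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def conditional_entropy(pairs):
    """H(Y | X) = H(X, Y) - H(X), estimated from observed (x, y) pairs.
    Higher-order terms follow the same pattern, e.g.
    H(F3 | F2, F1) = H(F1, F2, F3) - H(F1, F2)."""
    pairs = list(pairs)
    return entropy(pairs) - entropy(x for x, _ in pairs)

# e.g. H(F2 | F1) for one header field over all 5-packet flows, with each
# value already normalized against the first packet of its flow:
# h = conditional_entropy((f[0], f[1]) for f in flows5)
```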
5.4. Trace Compression Bound

From the three scenarios shown in the last section, we define the entropy $H_m$ of an $m$-packet flow as:

$$H_m = \sum_{F \in S_1} H(F) + \sum_{F \in S_2} H(F) + m \sum_{F \in S_3} H(F) \quad (5.44)$$

where $S_1$, $S_2$ and $S_3$ denote the sets of header fields of the first, second and third scenarios, whose summations are shown in Tables 5.56, 5.57 and 5.58. In Table 5.58 we consider the entropy of the Sequence Number and Acknowledgment Number header fields to be equal to zero, because they can be deduced from the Total Length header field.

Table 5.56: Entropy 1st Scenario
Header Field      Entropy H(x)
Interface         0.000000
Version           0.000000
IHL               0.000000
Type of Service   0.179016
Flags             0.345932
Fragment Offset   0.003325
Protocol          0.343014
Total             0.871287

Table 5.57: Entropy 2nd Scenario
Header Field          Entropy H(x)
Time to Live          3.206642
Source Address        8.667664
Destination Address   10.258050
Source Port           9.002667
Destination Port      6.713252
Data Offset           1.152861
Control Bits          1.694607
Total                 40.695743

Table 5.58: Entropy 3rd Scenario
Header Field            Entropy H(x)
Timestamp               6.432200
Total Length            4.492404
Identification          16.000000
Sequence Number         0.000000
Acknowledgment Number   0.000000
Window                  7.324844
Total                   34.249448

Hence, the final expression for $H_m$ is:

$$H_m = 0.871287 + 40.695743 + 34.249448\,m \quad (5.45)$$

and the maximum compression ratio is:

$$R = \sum_{m} P_m \frac{H_m}{352\,m} \quad (5.46)$$

where $P_m$ is the flow probability distribution for $m$-packet flows.

5.5. Conclusions

Applying the equations shown previously, we deduce that the compression bound for a TCP/IP header trace is around 13%. Table 5.59 shows the associated entropy terms for $m$ ranging from 1 to 20. For large flows ($m > 20$) we have assumed a mean value for the number of packets; this value was $m = 150$.

Below, we summarize the main conclusions derived from this chapter:

- For some TCP/IP header fields, we found a relatively low entropy, which means that some of their values occur with high probability;
- Selecting only the header fields with low entropy and grouping them, we calculated the entropy at packet level; we have seen that this new arrangement does not significantly impair the entropy;
- We have also seen that by breaking a trace down into flows with the same number of packets and considering each different flow as a random variable, we obtain a higher compression ratio;
- The compression bound for TCP/IP header traces is around 13%.

Table 5.59: Flows distribution
m-packet flow   P_m           H_m/(352·m)   P_m · H_m/(352·m)
1               0.095545617   0.215387651   0.020579346
2               0.070475923   0.156343609   0.011018460
3               0.063516576   0.136662262   0.008680319
4               0.052245242   0.126821589   0.006625825
5               0.195583723   0.120917185   0.023649433
6               0.163133511   0.116980915   0.019083507
7               0.077876497   0.114169294   0.008891105
8               0.039830729   0.112060578   0.004463455
9               0.028338782   0.110420466   0.003129182
10              0.021940996   0.109108376   0.002393946
11              0.016285273   0.108034848   0.001759377
12              0.012875795   0.107140242   0.001379516
13              0.010408937   0.106383267   0.001107337
14              0.008343194   0.105734431   0.000882163
15              0.007761577   0.105172107   0.000816301
16              0.006718678   0.104680073   0.000703312
17              0.007220072   0.104245926   0.000752663
18              0.005655723   0.103860017   0.000587403
19              0.005214496   0.103514730   0.000539777
20              0.003910872   0.103203972   0.000403618
150             0.107117787   0.098086822   0.010506843
Total                                       0.127952888
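The bound in Table 5.59 can be reproduced numerically from equations (5.45) and (5.46); a minimal sketch (the flow distribution dict is abbreviated here, the full set of values is in Table 5.59):

```python
# totals of Tables 5.56-5.58 and the 352-bit TSH header size
S1, S2, S3 = 0.871287, 40.695743, 34.249448
HEADER_BITS = 352

def h_m(m):
    """Entropy of an m-packet flow, equation (5.45)."""
    return S1 + S2 + S3 * m

# flow distribution P_m (abbreviated; Table 5.59 covers m = 1..20 plus
# the m = 150 aggregate that stands in for all larger flows)
P = {1: 0.095545617, 2: 0.070475923, 5: 0.195583723,
     6: 0.163133511, 150: 0.107117787}

bound = sum(pm * h_m(m) / (HEADER_BITS * m) for m, pm in P.items())
# with the complete distribution this sum evaluates to 0.127952888,
# i.e. the ~13% compression bound quoted above
```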
Chapter 6
Lossless Compression Method

The main reason why header compression can be done at all is the fact that there is significant redundancy between header fields, both within consecutive packets belonging to the same flow and in particular between flows. The big gain of our proposed method comes from the observation that, for a set of selected header fields, the flows traveling over an Internet link are very similar. By using a set of pre-computed templates of flows together with Huffman encoding, the header size can be significantly reduced. Hence, we have embarked upon the development of a new header compression scheme for packet header files that drastically reduces storage requirements. This chapter provides the details of how the method works, focusing on the fact that the decompressed header is functionally identical to the original header.

6.1 Generic Compression

Content compression can be as simple as removing all extra space characters, inserting a single repeat character to indicate a string of repeated characters, or substituting smaller bit strings for frequently occurring characters. The compression is performed by algorithms that determine how to compress and decompress. Some of the most popular compression algorithms are Huffman coding [74], LZ77 [124], and deflate [32]. These specifications define lossless compressed data formats.

Huffman encoding belongs to a family of algorithms with variable codeword length: individual values are replaced by bit sequences (messages) of distinct lengths, so values that appear frequently in a packet header are given a short sequence, while values that are seldom used get a longer bit sequence. To achieve this, the following basic restrictions are imposed on the encoding algorithm:

- No two messages consist of identical arrangements of bits.
- The message codes are constructed in such a way that no additional indication is necessary to specify where a message code begins and ends, once the starting point of a sequence of messages is known.

According to [27], an optimal (shortest expected length) prefix code for a given distribution can be constructed by the Huffman algorithm; it was proved that no other code for the same alphabet can have a lower expected length than the code constructed by this algorithm. Huffman coding is thus a form of prefix coding produced by a special algorithm. The Huffman compression algorithm assumes that data files contain some values that occur more frequently than other values in the same file. This is very much the case, for instance, for text files and TCP/IP header traces. The algorithm builds a frequency table for each value within a file, and from the frequency table it builds the Huffman tree. The purpose of the tree is to associate each value with a bit string of variable length: the more frequently used values get shorter bit strings, while the less frequent values get longer bit strings. Thus, the data file may be compressed.

The tree structure contains nodes, each of which holds a value, its frequency, a pointer to a parent node, and pointers to the left and right child nodes. At first there are no parent nodes. The tree grows by making successive passes through the existing nodes. Each pass searches for the two nodes that have not grown a parent node and that have the two lowest frequency counts. When the algorithm finds those two nodes, it allocates a new node, assigns it as the parent of the two nodes, and gives the new node a frequency count that is the sum of the two child nodes. The next iteration ignores those two child nodes but includes the new parent node. The passes continue until only one node with no parent remains; that node is the root of the tree.
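As an illustration of this construction (a minimal sketch, not the thesis's implementation), the merge-and-walk procedure can be written as:

```python
import heapq
from collections import Counter

def huffman_codes(values):
    """Build a Huffman code table by repeatedly merging the two nodes
    with the lowest frequency counts, as described above."""
    freq = Counter(values)
    # heap entries: (frequency, tie_breaker, tree), where a tree is
    # ('leaf', value) or ('node', left_subtree, right_subtree)
    heap = [(f, i, ('leaf', v)) for i, (v, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)   # two lowest-frequency nodes
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, tie, ('node', left, right)))
        tie += 1
    codes = {}
    def assign(tree, prefix):
        if tree[0] == 'node':               # internal node: recurse
            assign(tree[1], prefix + '1')   # 1 for the left branch
            assign(tree[2], prefix + '0')   # 0 for the right branch
        else:                               # leaf: emit the codeword
            codes[tree[1]] = prefix or '0'
    assign(heap[0][2], '')
    return codes

# e.g. huffman_codes([40, 40, 40, 1500, 1500, 52])
# yields {40: '1', 1500: '00', 52: '01'}: the frequent value gets 1 bit
```

Prefix-freeness is guaranteed because values sit only at the leaves of the tree, which is exactly the second restriction listed above.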
To compress the file, the Huffman algorithm reads the file a second time, converting each value into the bit string assigned to it by the Huffman tree and then writing the bit string to a new file. Compression involves traversing the tree, beginning at the leaf node for the value to be compressed and navigating to the root. This navigation iteratively selects the parent of the current node and checks whether the current node is the right or left child of the parent, thus determining whether the next bit is a one (1) or a zero (0). The assignment of the 1 bit to the left branch and the 0 bit to the right is arbitrary.

The decompression routine reverses the process by reading in the stored frequency table (presumably stored in the compressed file as a header) that was used in compressing the file. With the frequency table, the decompressor can rebuild the Huffman tree and, from that, expand all the bit strings stored in the compressed file back to their original values. To do that, the algorithm reads the file one bit at a time: beginning at the root node of the Huffman tree, and depending on the value of the bit, it takes the right or left branch of the tree and then reads another bit. When the selected node is a leaf (it has no right and left child nodes), the algorithm writes the value to the decompressed file and goes back to the root node for the next bit.

LZ77 compression works by finding sequences of data that are repeated. The term sliding window means that, at any given point in the data, there is a record of the characters that went before. A 32K sliding window means that the compressor (and decompressor) have a record of what the last 32,768 characters were. When the next sequence of characters to be compressed is identical to one that can be found within the sliding window, it is replaced by two numbers: a distance, representing how far back into the window the sequence starts, and a length, representing the number of characters for which the sequence is identical. In the Deflate compressor, the LZ77 and Huffman algorithms work together.
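A toy version of this window search can be sketched as below; it is illustrative only (real deflate implementations use hash chains rather than this brute-force scan):

```python
def lz77_compress(data, window=32768, max_len=258):
    """Toy LZ77 encoder: emits (distance, length, next_symbol) triples,
    replacing repeated sequences found inside the sliding window."""
    out, i = [], 0
    while i < len(data):
        best_len, best_dist = 0, 0
        for j in range(max(0, i - window), i):   # scan the window
            length = 0
            while (length < max_len and i + length < len(data)
                   and data[j + length] == data[i + length]):
                length += 1
            if length > best_len:                # keep the longest match
                best_len, best_dist = length, i - j
        nxt = data[i + best_len] if i + best_len < len(data) else None
        out.append((best_dist, best_len, nxt))
        i += best_len + 1
    return out

# e.g. lz77_compress("abracadabra abracadabra") replaces the second
# occurrence of the phrase with a single (distance, length, ...) triple
```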
6.2 TCP/IP Header Compression

The previous methods do not take into account the specific properties of the data to be compressed. The following methods were developed for saving transmission bandwidth on channels such as wireless and slow point-to-point links, and they are based on the fact that, in TCP connections, the content of many TCP/IP header fields of consecutive packets of a flow can usually be predicted.

The original scheme proposed for TCP/IP header compression, in the context of transmission of Internet traffic through low-speed serial links, is Van Jacobson's header compression algorithm [65]. The main goal behind the [65] header compression scheme is to improve the line efficiency of a serial link. Increasing the line efficiency also allows for better allocation in asymmetric bandwidth allocation schemes: since allocation is often dependent on the amount of data to be sent or received, the smaller headers cause less fluctuation in the amount of actual data traversing the link. In this particular header compression scheme, the TCP and IP headers are compressed together and not individually.

The header compression scheme in [35] aims to satisfy several goals, including better response time, line efficiency, loss rate, and bit overhead. [35] is similar to [65] in regard to TCP, but includes support for other features and protocols, such as TCP options, ECN, IPv6, and UDP. The specification also allows extension to multicast, multi-access links, and other compression schemes that ride over UDP.

The goal of [21] is to provide a means of reducing the cost of headers when using the Real-Time Transport Protocol (RTP), which is often used for applications such as audio and video transport. However, instead of compressing the RTP header alone, greater efficiency is obtained by compressing the combined RTP/UDP/IP headers together. Another important goal is that the implementations of the compression and decompression code need to be simple, as a single processor may need to handle many interfaces. Finally, this compression scheme is not intended to work in conjunction with RTCP (the Real-Time Transport Control Protocol), as the required additional complexity is undesirable.

The incentive for the robust header compression scheme [40] arose from links with significant error rates, long round-trip times, and limited bandwidth; the goal is thus to design highly robust and efficient header compression schemes on top of a highly extensible framework. Consequently, [40] represents the underlying framework on which compression for other protocols is built, and both UDP and RTP are covered by the scheme. Since then, specifications for the compression of a number of other protocols have been written. Degermark proposed additional compression algorithms for UDP/IP and TCP/IPv6 [34]. Also for wireless environments, another scheme that makes use of the similarity of consecutive flows from or to a given mobile terminal is described in [116].

6.3 Proposed Header Trace Compression

Our proposed compression method addresses the context of saving storage space for potentially huge packet traces. The advantage of our proposal is that the trace file, and consequently the complete flows, are known in advance. Thus, we exploit the properties of TCP/IP flows to predict some header fields, and Huffman encoding to obtain an optimal compression ratio. We make use of the similarity of consecutive header fields in a packet flow to compress these headers. Here, we use the definition of flow presented earlier: a sequence of packets with the same IP 5-tuple and such that the time between two consecutive packets does not exceed 5 sec. Within a flow, most of the header fields of consecutive packets are identical or change in a very predictable manner. For this set of fields, storage is required only once per flow. For another set, however, the header fields assume different values along the same flow.

A phase previous to the compression itself was the step of determining the clusters of TCP/IP flows and building a Huffman tree for them. For practical reasons, the clusters are limited to flows with at most 7 packets. The clusters were obtained from traces collected at RedIRIS and from traces downloaded from NLANR (Figure 6.1). This set of clusters is not expected to change significantly in size for different traces.

[Figure 6.1: Cluster Generation (the RedIRIS trace and NLANR traces feed a cluster analyzer that produces the clusters of flows)]

Table 6.1 shows the calculated entropy for the set of clusters, grouped by the number of packets per flow.

Table 6.1: Cluster Entropy
m-packet   Entropy H   Code Size (bits)
2          0.940182    1
3          2.269575    3
4          2.913547    3
5          2.206767    3
6          3.317451    4
7          4.785971    5

Our proposed compression proceeds in two steps. In the first step, the algorithm traverses the whole trace file, building the header field frequency tables and examining the presence of new clusters. The second step is the compression itself.
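Both steps rely on the same flow bookkeeping. A minimal sketch of grouping packets into flows under the 5-tuple and 5-second definition above (the dict-based packet shape is an assumption of this example, and the real method additionally closes a flow on a Fin/Rst TCP flag, omitted here for brevity):

```python
FLOW_TIMEOUT = 5.0   # seconds, per the flow definition above

def group_flows(packets):
    """Group packets into flows keyed by the IP 5-tuple; a silence longer
    than FLOW_TIMEOUT between consecutive packets of the same 5-tuple
    starts a new flow. Packets are assumed sorted by timestamp."""
    active, finished = {}, []
    for pkt in packets:
        key = (pkt['src'], pkt['dst'], pkt['sport'], pkt['dport'], pkt['proto'])
        flow = active.get(key)
        if flow is not None and pkt['ts'] - flow[-1]['ts'] > FLOW_TIMEOUT:
            finished.append(flow)          # timeout: the old flow is complete
            flow = None
        if flow is None:
            flow = []
            active[key] = flow
        flow.append(pkt)
    finished.extend(active.values())       # flush flows still open at the end
    return finished
```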
Moreover, we have applied different approaches for small flows and large flows. In both cases, the compression is carried out at flow level, but for small flows we apply the clustering techniques described in Chapter 4.

6.3.1 First Step: Header Frequency

The first step starts by reading the trace and inspecting the value of each header field of every packet. Here, the compression algorithm creates one dataset and updates a second one (Figure 6.2). The first dataset (Header Frequency Table) stores the frequency of header fields such as source address, destination address, source port, destination port, and protocol: for each packet, the algorithm reads the value of these headers and, after reading the last packet, calculates the frequency of each value. The second dataset (Clustering Frequency Table) stores the flow clusters. This dataset shows small variability, because the most common flows were stored previously.

For each packet, the algorithm inspects the 5-tuple of fields (source and destination IP address, source and destination port number, and protocol number) to identify each new connection. Whenever a packet carrying a new flow is found, a new node is inserted at the end of a temporary data structure (Figure 6.3). This temporary data structure is implemented as a linked list, and it stores the packet headers of n connections. When a Fin/Rst TCP flag arises in a packet, or the time between two consecutive packets exceeds 5 sec, the flow is flagged as completed. After that, the algorithm examines the number of nodes associated with this flow. If the number of packets is greater than seven, the flow is immediately removed from the temporary data structure. Otherwise, if the number of packets is smaller than 8, the algorithm searches the Clustering Frequency Table for an identical flow; if no hit is found, the flow is added to the dataset and a longer message length is assigned to it. It is important to note that this dataset is not inserted in the compressed file; moreover, it is shared among many compressed files.

[Figure 6.2: Compression first step (the compressor reads the .tsh packet headers and updates the Clustering Frequency Table, the Header Frequency Table, and the temporary data structure)]

6.3.2 Second Step: Compression

The second step is the compression itself. The method starts again by inspecting the 5-tuple of fields (source and destination IP address, source and destination port number, and protocol number) to identify each new connection. Whenever a packet carrying a new flow is found, a new node is inserted at the end of another temporary data structure (a linked list). When the flow placed at the head of the temporary linked list reaches completed status, the compressor examines the number of nodes associated with this flow to see whether it is a small or a large flow.

In the case of small flows, and for a first set of fields, the algorithm searches for an identical sequence of packet characteristics in the Clusters of Flows dataset. For the remaining header fields, the algorithm looks up the corresponding code (Figure 6.4). After the template search, the compressor starts writing into the compressed header file. For many fields (see Figure 4.3), the storage is reduced to a template identifier, which is the most important contribution of our proposed method.
However, for other fields, whose prediction is not possible, the carried information needs to be stored. Here, it is important to consider that, for some of these fields, whose value is likely to stay constant over the life of a flow, storage is required only once per flow; for the remaining fields, storage is required for each packet.

[Figure 6.3: Temporary data structure (a linked list of flows, each holding the packets seen so far)]

[Figure 6.4: Compression second step (the compressor reads the .tsh packet headers, the Clustering Frequency Table, the Header Frequency Table, and the temporary data structure)]

Figure 6.5 shows the compressed data format for small flows. The first field (1 bit) is a flag identifying the type of flow: small or large. The following five fields store the respective codes for source and destination address, source and destination port, and protocol. The next field stores the flow clustering code, which represents the following fields: Interface, Version, IHL, Type of Service, Flags, Fragment Offset, Data Offset, and Control Bits. The next five fields store, for each packet, the codes for inter-packet time, Identification, Length, Window, and checksum.

[Figure 6.5: Compressed data format for small flows (flow control bit FC, address, port and protocol codes, initial TTL, flow clustering code, then per-packet time, Identification, Length, Window and checksum codes)]

As already mentioned, for large flows we have used a different approach. In this class of flows, when a large flow reaches completed status, each packet is inspected by the compressor in order to determine the corresponding header field codes (Figure 6.6).

[Figure 6.6: Compression of large flows (the compressor reads the .tsh packet headers, the Header Frequency Table and the temporary data structure, and writes the compressed header file)]

Figure 6.7 shows the compressed data format for large flows. The first field (1 bit) is a flag identifying the type of flow: small or large. The following five fields store the respective codes for source and destination address, source and destination port, and protocol. For each packet, the next fields store the following information: the packet control bit (PC), which indicates whether it is the last packet; the joint header code (Interface, Version, IHL, Type of Service, Flags, Fragment Offset, Data Offset, Control Bits); and the inter-packet time, Identification, Length, Window, and checksum codes.

[Figure 6.7: Compressed data format for large flows (flow control bit FC, address, port and protocol codes, initial TTL, then per-packet control bit PC, joint header code, time, Identification, Length, Window and checksum codes)]
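To make the layout concrete, the small-flow record of Figure 6.5 could be assembled as below. This is a hedged sketch: the field names, the flow/packet dict shape, and the per-field Huffman code tables are all assumptions of this example, not the thesis's data structures:

```python
def encode_small_flow(flow, codes):
    """Assemble the bit string of one small-flow record in the Figure 6.5
    layout; `codes` maps field names to Huffman code tables, each table
    mapping a field value to a '0'/'1' codeword string."""
    bits = ['0']                                  # FC flag: 0 = small flow
    bits += [codes['saddr'][flow['saddr']],
             codes['daddr'][flow['daddr']],
             codes['sport'][flow['sport']],
             codes['dport'][flow['dport']],
             codes['proto'][flow['proto']],
             codes['cluster'][flow['cluster_id']]]  # flow clustering code
    for pkt in flow['packets']:                   # per-packet codes
        bits += [codes['time'][pkt['ipt']],
                 codes['ident'][pkt['ident']],
                 codes['length'][pkt['length']],
                 codes['window'][pkt['window']],
                 '1' if pkt['cksum_ok'] else '0'] # 1-bit checksum flag
    return ''.join(bits)
```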
6.4 Decompression Algorithm

Processing at the decompressor is much simpler than at the compressor, because all decisions have already been made and the decompressor simply does what the compressor has told it to do. To perform its functions, the decompression algorithm sets up a temporary linked list to store the decompressed packet headers of n connections. It works by reading the Clustering Frequency Table and Header Frequency Table datasets (Figure 6.8). These two datasets store the necessary information to reproduce the header fields of all original packets.

[Figure 6.8: Decompression model (the decompressor reads the compressed header file together with the Clustering Frequency Table and Header Frequency Table and rebuilds the .tsh header trace through the temporary data structure)]

The decompression algorithm starts by assigning a random timestamp to the first flow. For the following flows, the Time Code field (see Figures 6.5 and 6.7) indicates where each packet starts. The FC (Flow Control) field indicates how to decompress the flow (small or large).

The first method refers to small flows. For each flow, the algorithm reads the following information: source address, destination address, source port, destination port, protocol, initial TTL value, and the template identifier. We assign initial random values to the Sequence Number and Acknowledgment Number header fields. After the template has been identified, the algorithm decodes the following fields: Interface, Version, IHL, Type of Service, Flags, Fragment Offset, TTL, Data Offset, Reserved, and Control Bits. The Timestamp, Identification, Total Length, and Window fields are restored packet by packet (see Figure 6.5). The Sequence Number and Acknowledgment Number values are reconstructed based on the stored Total Length field.

The second method refers to large flows. It is similar to the method just described for small flows, with the small difference that some header field information is gathered directly from the Huffman tree, and not from the templates of flows.

Once all header information of a packet has been consumed, its checksum is recalculated and stored in the IP checksum field; depending on the header checksum flag, the value is written correctly or incorrectly. For each decompressed packet, the algorithm inserts a new node into the temporary linked list, sorted by timestamp. After decoding the last packet of a flow, the algorithm continues the process by reading the next record in the compressed dataset. Meanwhile, all nodes in the linked list are checked: for the nodes whose timestamp is smaller than the start point of the new flow, the packet headers are written to the decompressed file.

6.5 Compression Ratio

The expected message length $\bar{L}$ can be calculated in order to measure the efficiency of the algorithm. Let $N$ be the number of messages, $p_i$ the probability of the $i$-th value, and $l_i$ the length of a message (the number of bits assigned to it). Then:

$$\bar{L} = \sum_{i=1}^{N} p_i\, l_i \quad (6.1)$$

Moreover, the expected length of any instantaneous code for a random variable $X$ is greater than or equal to its entropy $H(X)$, i.e.,

$$\bar{L} \geq H(X) \quad (6.2)$$

This section addresses the problem of performance analysis of the proposed compression method. The analysis was carried out by comparing the compression ratios of different compression methods. Using the compression algorithm described in the previous section, Figure 6.9 shows, for each m-packet flow (X axis), the corresponding compression ratio (Y axis). As we can see in Figure 6.9, the compression ratio starts at 23% for 1-packet flows, decays quickly to 13%, and then decays smoothly until it reaches 11% in the case of very large flows. Around 7-packet flows, we notice the change of compression method from small to large flows. The curve in Figure 6.10 depicts the cumulative compression ratio as m (packets per flow) ranges from 1 to large values; the summation of all terms gives us the compression ratio of the whole trace.
As we can see from Figure 6.10, this value is around 16%, which means that our method compresses a 100MB header trace file to 16MB. It is important to note that some data structures with information related to this method are also needed, for instance the Flow Clustering dataset; however, we do not take them into account because they stay almost constant.

[Figure 6.9: Flow compression bound (compression ratio versus packets per flow)]

[Figure 6.10: Trace compression bound (cumulative compression ratio of the proposed algorithm versus packets per flow)]

We studied the efficiency of the proposed compression method by comparing it against the GZIP [47] and Van Jacobson methods. The GZIP application, and also the ZIP and ZLIB [48] applications, use the deflate algorithm. The measurements were taken from a TSH (Time Sequence Header) header trace file [89], [98]. The compressed file size obtained using the GZIP application is around 50% of the original TSH file size.

For the Van Jacobson method, the header size of a compressed datagram ranges from 3 to 16 bytes. However, we must modify the original method slightly, because the number of active flows is much larger on a high-speed Internet link than on the low-speed serial links for which Van Jacobson compression was originally proposed. Hence, we must increase the number of bytes needed to store the flow identifier (we have increased it from 1 byte to 3 bytes). Moreover, we assume that a timestamp (3 bytes) is added to each header. As a result, we assume that the minimal encoded header becomes 8 bytes in the best case and 21 bytes in the worst case. Taking into account only the best case, and considering the changes explained above, the compression ratio for m-packet flows using the Van Jacobson method is bounded by:

$$\rho_{VJ}(m) = \frac{33 + 8(m-1)}{44\,m} \quad (6.3)$$

obtaining thus a compression ratio given by:

$$R_{VJ} = \sum_{m} P_m\, \rho_{VJ}(m) \quad (6.4)$$

Using this approach, we conclude that the compression ratio of the Van Jacobson method reaches 32% in the best case. The performance of the three compression methods under analysis (our proposed method, the GZIP method, and the Van Jacobson method) is depicted in Figure 6.11: for different trace collection sizes (X axis), we show the corresponding storage needs (Y axis).

[Figure 6.11: Compression techniques comparison (compressed file size in MBytes versus uncompressed file size in MBytes for the GZIP, Van Jacobson and proposed methods)]

Chapter 7
Lossy Compression Method

In this chapter we present a new lossy header trace compression method based on clustering of TCP flows. With a flow characterization approach that incorporates a set of packet characteristics such as inter-packet time, payload size, and TCP structures, we demonstrated that, behind the great number of flows on a high-speed link, there is not much variety among them, and they can clearly be grouped into a set of clusters. Using templates of flows, we developed an efficient method to compress a header trace file. With this proposed method, storage size requirements are reduced to 3% of the original size. Although this specification does not define a lossless compressed data format, it preserves important statistical properties present in the original trace, such as self-similarity, spatial and temporal locality, and IP address structure.
Furthermore, in order to validate the decompressed trace, measurements were taken of memory accesses and cache miss ratios. The results showed that, for specific purposes, our proposed method provides a good solution for header trace compression.

7.1 Packet Trace Compression

Following the analysis presented in [61], we have seen in several traces that some types of flows are extremely popular, and most of them are short-lived with a small number of packets [51], [19]. As with the lossless compression method, we make use of the similarity of consecutive header fields in a packet flow to compress these headers. Here, we also use the definition of flow presented earlier: a sequence of packets with the same IP 5-tuple such that the time between two consecutive packets does not exceed 5 sec.

A phase previous to the compression itself was the step of building the templates of flows. These templates are based on the most common types of clusters found in many traces. Here, we have not limited the clusters to small flows: flows of all lengths are stored. Besides the header fields used earlier to calculate the clusters, we have added the following header fields:
In the case that, a match is not possible, the algorithm searches for the most similar, calculating the euclidian distance between them. Moreover, the algorithm looks into the trace for a previous identical pair of source and destination address. After that, we update the compressed dataset and remove all nodes of this flow from the linked list. For each flow, we store in the compressed dataset the following data: 108 Inter-Flow time (7 bits): Distance between consecutive flows H Number of packets (17 bits): Number of packets into the flow H Cluster Identifier (14 bits): a pointer to the Template dataset H H Distance to same pair (10 bits): Distance in terms of number of flows to a previously flows with identical pair of source and destination IP address. As we can see, independently of the flow length, we need of 48 bits to represent each flow. 7.2 Decompression algorithm The real question is, given a compressed trace, how to decompress fast. The decompression algorithm is implemented in two steps: a multifractal IP address generator which mimics the IP address structure of real traces, and the algorithm that restores the compressed files. Kohler, Paxson, and Shenker [75] have demonstrated that real address structure look broadly self-similar: meaningful structure appears at all magnifications levels. The multifractal address generator is based on a multiplicative process. Following Evertsz and Mandelbrot [42], a process that fragments a set into smaller and smaller components according to some rule, and at the same time fragments the measure or mass associated with these components according to some other rule is a multiplicative process or cascade. The more formal mathematical construction of a cascade starts with assigning aq unit of mass the unit interval B M 9sk] , where the subinterval ¯ M °)m ± 9Y6 uÌk]cmA ± q tocorresponds to ad dress . The construction starts splitting the interval into three parts where the B]½ middle part takes up a fraction ¼ of the whole interval.These parts are called , Bn , and BWq . Then throw away the middle part BAn , giving it none of the parent in½ and " q M kx " ½ . terval’s mass. The other subintervals are assigned masses " Y B ½ W B q Recursing on the nonempty subintervals and generates nonempty subinq½ " four qq W B ½ ½ W B ½ q W B q ½ W B q q ½ q q ½ " " " " " tervals , , , and with respective masses , , , and . Continuing the procedure defines a set of addresses that exhibit a wide spectrum of local scaling behaviors. Using the previously calculated Number of IP address, Number of source/destination address pair, client/server distribution, and TCP port frequency, we generated a sequence of an anonymized 4-tuple. The decompression algorithm sets up a linked list to store temporarily the sequence of decompressed packets. It works reading the compressed datasets. As we have seen, the compressed dataset stores, for each flow, the time-stamp of the first packet, the number of packets, a pointer to the template dataset, and the distance to a previous flow with identical source and destination pair. The template dataset stores the necessary information to reproduce important packet flow characteristics such as: inter packet time, TCP flag sequence and packet size. 109 The algorithm starts reading the compressed dataset. Note that this dataset is sorted by the time-stamp data field. For each flow, a TTL field is generated using the distribution calculated previously. Reading the cluster identifier, the algorithm identify the correspondent template. 
The algorithm starts by reading the compressed dataset; note that this dataset is sorted by the timestamp field. For each flow, a TTL field is generated using the distribution calculated previously. Reading the cluster identifier, the algorithm identifies the corresponding template, then walks through the template values, decoding the inter-packet time, Interface, Version, IHL, TOS, Length, Flags, Fragment Offset, Protocol, Data Offset, and Control Flags. For each decompressed packet, the algorithm inserts a new node into the linked list, sorted by timestamp. Furthermore, the source and destination IP addresses and the source and destination port numbers are assigned: if the record points to a previous address, we simply copy the same 4-tuple; if not, we take a new 4-tuple from the list generated previously. The Sequence and Acknowledgment numbers are generated based on the Total Length. For the Identification header field, we generate an initial random value and sequential values for the following packets. After reading the last value of the template, the algorithm continues the procedure by reading the next record in the compressed dataset. At this moment, all nodes in the linked list with a timestamp smaller than the current value are written to the decompressed file.

7.3 Compression Ratio

To study the efficiency of the proposed compression method, we compare the compression ratios of different compression methods for large packet traces. The measurements were taken from a TSH (Time Sequence Header) header trace file, and the compression methods evaluated were GZIP [47], the Van Jacobson method, and the method proposed in [61]. In the proposed compression method, the 48 bits (6 bytes) stored with the first packet of a flow are sufficient to represent a whole flow of m packets. There are some data structures with information related to the clusters of flows that are also needed; however, these additional data structures are almost constant with respect to the packet trace length. The compression ratio for m-packet flows is then given by:

$$\rho(m) = \frac{6}{44\,m} \quad (7.1)$$

obtaining thus a compression ratio of:

$$R = \sum_{m} P_m\, \rho(m) \quad (7.2)$$

Table 7.1 shows the corresponding compression ratio for each of the m-packet flows (third column), and Figure 7.1 depicts them graphically. Multiplying each m-packet compression ratio (third column) by its frequency (second column), we obtain the relative trace compression ratio (fourth column). The summation of all components produces the total trace compression ratio. As we can see, this value is around 3%. The curve in Figure 7.2 shows how it accumulates.
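The total in Table 7.1 below can be reproduced directly from equations (7.1) and (7.2); a minimal sketch, with the flow distribution abbreviated:

```python
# P_m from Table 7.1 (abbreviated; the full table covers m = 1..20 plus
# the m = 150 aggregate that stands in for all larger flows)
P = {1: 0.095545617, 2: 0.070475923, 5: 0.195583723,
     6: 0.163133511, 150: 0.107117787}

def rho(m):
    """Equation (7.1): one 48-bit (6-byte) record per flow against
    m original 44-byte TSH records."""
    return 6 / (44 * m)

total = sum(pm * rho(m) for m, pm in P.items())
# with the complete distribution the sum is 0.0354, the ~3% figure above
```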
Table 7.1: Flows distribution

 m-packet flow   f_m            c_m            f_m x c_m
 1               0.095545617    0.136363636    0.013028948
 2               0.070475923    0.068181818    0.004805177
 3               0.063516576    0.045454545    0.002887117
 4               0.052245242    0.034090909    0.001781088
 5               0.195583723    0.027272727    0.005334102
 6               0.163133511    0.022727273    0.003707580
 7               0.077876497    0.019480519    0.001517075
 8               0.039830729    0.017045455    0.000678933
 9               0.028338782    0.015151515    0.000429375
 10              0.021940996    0.013636364    0.000299195
 11              0.016285273    0.012396694    0.000201884
 12              0.012875795    0.011363636    0.000146316
 13              0.010408937    0.010489510    0.000109185
 14              0.008343194    0.009740260    0.000081264
 15              0.007761577    0.009090909    0.000070559
 16              0.006718678    0.008522727    0.000057261
 17              0.007220072    0.008021390    0.000057915
 18              0.005655723    0.007575758    0.000042846
 19              0.005214496    0.007177033    0.000037424
 20              0.003910872    0.006818182    0.000026665
 150             0.107117787    0.000909091    0.000097379
 Total                                         0.03539729

[Figure 7.1: Flow compression ratio per m-packet flow (x-axis: packets/flow, log scale; y-axis: compression ratio).]

[Figure 7.2: Cumulative compression ratio of the proposed compression algorithm (x-axis: packets/flow, log scale).]

The GZIP application, and also ZIP and ZLIB [48], use the deflation algorithm. For different TSH file sizes, the compressed file size obtained using the GZIP application is 50% of the original TSH file size (see Figure 7.3). For the Van Jacobson method, we have seen in the last chapter that the compression ratio is around 30%. Figure 7.3 shows the file size of the original trace and the corresponding file sizes for the three compression methods under analysis.

[Figure 7.3: File size comparison; compressed file size (MBytes) vs. uncompressed file size (MBytes) for the GZIP, VJ, and proposed methods.]

7.4 Comparative Packet Trace Characteristics

In this section we validate how effective our lossy compression method is. We have compared packet trace properties of the original trace against the decompressed trace. The results demonstrate that the decompressed trace reproduces good approximations of the original trace. The properties under analysis were: self-similarity, spatial and temporal locality, and IP address structure.

7.4.1 Self-Similarity

To capture long-range dependence we employed the statistics of self-similarity. Self-similarity means that the statistical properties of a stochastic process do not change across aggregation levels; that is, the process looks the same as one zooms in and out in time [121]. The Hurst parameter $H$ expresses the degree of self-similarity; larger values indicate stronger self-similarity. If $H \in (0.5, 1]$, the process is called long-range dependent (LRD). There are three methods to determine self-similarity and estimate the parameter $H$ of a given set of data:

- variance-time plots;
- the R/S (rescaled adjusted range) statistic;
- frequency-domain methods: the periodogram and Whittle's estimator.

To estimate the Hurst parameter, we have used the R/S plot method. This method provides a single estimate of $H$ and results in an X-Y plot. Linearity in the plot is evidence of long-range dependence in the underlying series, and the slope of the line in each case can be used to estimate $H$. Plots of the graphical estimators for the decompressed and the original RedIRIS trace are shown in Figure 7.4. On the upper curves, the slopes estimate $H = 0.71$ and $H = 0.73$ for the decompressed and original traces, respectively.
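As a concrete reference, a minimal R/S estimator can be sketched as follows, assuming dyadic block sizes and a least-squares fit of log(R/S) against log(d); the estimator actually used for the plots may differ in detail.

    import numpy as np

    def hurst_rs(x, min_chunk=8):
        # For several block sizes d, average the rescaled range R/S
        # over non-overlapping blocks, then fit the slope of
        # log10(R/S) against log10(d); the slope estimates H.
        x = np.asarray(x, dtype=float)
        n, d = len(x), min_chunk
        sizes, rs_vals = [], []
        while d <= n // 2:
            rs = []
            for start in range(0, n - d + 1, d):
                block = x[start:start + d]
                dev = np.cumsum(block - block.mean())
                r, s = dev.max() - dev.min(), block.std()
                if s > 0:
                    rs.append(r / s)
            if rs:
                sizes.append(d)
                rs_vals.append(np.mean(rs))
            d *= 2
        slope, _ = np.polyfit(np.log10(sizes), np.log10(rs_vals), 1)
        return slope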
In addition to these two traces, we also estimated the Hurst parameter for a trace with exponential inter-packet times. The graphical result for this trace yields $H \approx 0.5$, indicative of no long-range dependence in this series. These results are shown in Table 7.4.1.

Table 7.4.1: Hurst parameter estimators

 Decompressed   RedIRIS   Exponential
 0.71           0.73      0.52

[Figure 7.4: R/S plots, log10(R/S) vs. log10(d), for the original RedIRIS, decompressed, and exponential traces.]

7.4.2 Spatial Locality

The existence of spatial locality of reference can be established by comparing the total number of unique sequences observed in the decompressed trace with the total number of unique sequences found in the original trace and in a trace with a random permutation of addresses [7]. Notice that a random permutation destroys any spatial locality of reference that may exist in a trace by uncorrelating the sequence of references. If references are indeed correlated, then one would expect the total number of unique sequences observed in a trace to be less than the total number of unique sequences observed in a random trace.

Figure 7.5 shows the total number of unique destination addresses (Y axis) for trace length $N$ (X axis). Three cases are shown. The top curve shows the total number of addresses observed in a random trace. The middle and bottom curves show the total number of unique addresses observed in our decompressed trace and in the original trace, respectively. The three curves show an increase in the total number of unique addresses as $N$ increases, with the randomized trace showing the steepest increase and the other two curves showing a similar, slower increase. These results confirm a good approximation between the decompressed and original traces.

[Figure 7.5: Unique destination addresses observed vs. number of packets (k) for the random, decompressed, and original RedIRIS traces.]

7.4.3 Temporal Locality

Temporal locality implies that recently accessed addresses are more likely to be referenced in the near future [7]. The cache miss ratio is one of the most important parameters in characterizing temporal locality. In our experiments, we used the decompressed trace, the original trace, and a random trace to drive a routing table lookup algorithm, observing the cache miss ratio for different cache sizes. Note that in this simulation study we are concerned with the data cache only. This makes sense from the viewpoint of IP address caching at a router, where the program itself is short and can be entirely stored in any reasonable instruction cache, while the implementation of the data structures for the address cache is a primary concern.

Figure 7.6 shows how temporal locality (measured by miss rate) varies for the destination IP address across the traces. We see that the three curves show a decrease in the miss rate as cache size increases, with the randomized trace showing the slowest decrease and the decompressed and original traces showing similar decreases. The miss ratios of the random trace (top curve) are consistently higher than those of both the decompressed and the original traces, confirming again a good approximation between both traces.
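This experiment can be approximated with a simple simulation; the sketch below assumes a fully associative LRU cache of destination addresses, since the exact cache organization is not specified here.

    from collections import OrderedDict

    def miss_ratio(addresses, cache_size):
        # Replay an address trace against an LRU cache and return the
        # miss ratio, as in the temporal locality experiment above.
        cache = OrderedDict()
        misses = 0
        for a in addresses:
            if a in cache:
                cache.move_to_end(a)           # refresh recency
            else:
                misses += 1
                cache[a] = None
                if len(cache) > cache_size:
                    cache.popitem(last=False)  # evict least recent
        return misses / len(addresses)

A trace with strong temporal locality yields low miss ratios even for small cache sizes, which is exactly the separation visible in Figure 7.6.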
7.4.4 IP Address Structure

This section compares the structural characteristics of destination IP addresses seen in the RedIRIS and decompressed traces. These characteristics may have implications for algorithms that deal with IP address aggregates, such as routing lookup and congestion control. The authors of [75] investigated how a conglomerate's packets are distributed among its component addresses, and how those addresses aggregate.

As we have seen in Chapter 2, if address structures were fractal, $N_p$ would appear as a straight line with slope $D$ when plotted on a log scale as a function of the prefix length $p$. Figure 7.7 shows, for the RedIRIS trace, a log plot of $N_p$ as a function of $p$; we found that, for a reasonable middle region of prefix lengths, the $N_p$ curves do appear linear on a log-scale plot. In this case, the fractal dimension $D$ is equal to 0.625. For the decompressed trace (Figure 7.7), the fractal dimension obtained is sufficiently close to the fractal dimension of the original trace.

To test whether a data set is consistent with the properties of multifractals, we use the histogram method to examine its multifractal spectrum [96]. Figure 7.8 plots the spectra of the original and decompressed traces. Again, we see that both are very similar.

[Figure 7.6: Temporal locality; miss rate vs. cache size (k) for the random, decompressed, and original RedIRIS traces.]

[Figure 7.7: Np as a function of prefix length p for the original RedIRIS and decompressed traces.]

[Figure 7.8: Multifractal spectrum (vs. scaling exponent) of the original and decompressed traces.]

7.5 Memory Performance Validation

The compression method studied in this chapter achieves a large compression rate. However, it is not able to recover the exact compressed packet trace. In this section we study whether the recovered trace is suitable for studies focusing on memory access characteristics. The results presented in this section do not cover the entire set of possible network benchmarks, but they clearly show that the recovered trace exhibits behavior close to that of the uncompressed packet trace.

We have applied three benchmark programs taken from the Netbench [82] and CommBench [122] suites. The selected programs were: Route (Netbench), NAT (Netbench), and RTR (CommBench). All the selected programs use Radix Tree Routing inside their algorithms. The Radix Tree is a binary tree which, starting at the root, stores the prefix address and mask seen so far. As one moves down the tree, more bits are matched along one branch; if they do not match, the other branch holds the required entry. This data structure yields efficient average forwarding-table lookup times, on the order of ln(number of entries), which for large routing tables is quite a gain. The value returned from looking up an entry is typically the next-hop IP router.

The Radix Tree code was instrumented using the ATOM tool [108]. In order to delimit the processing of packets, checkpoints were placed at the beginning and at the end of the packet processing. The instrumented code records the number of memory accesses performed for each packet. At the end of the traffic trace processing, a list including the total number of memory accesses per packet is generated.
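To make the measured quantity concrete, the following is a minimal sketch of a longest-prefix lookup in a plain binary trie (not the path-compressed radix tree used by the benchmarks); counting visited nodes gives a rough proxy for the per-packet memory accesses that the instrumentation records.

    class TrieNode:
        __slots__ = ("children", "next_hop")

        def __init__(self):
            self.children = [None, None]  # 0-branch, 1-branch
            self.next_hop = None          # set when a prefix ends here

    def insert(root, prefix, length, next_hop):
        # `prefix` is a 32-bit int with the prefix in its high-order bits.
        node = root
        for i in range(length):
            bit = (prefix >> (31 - i)) & 1
            if node.children[bit] is None:
                node.children[bit] = TrieNode()
            node = node.children[bit]
        node.next_hop = next_hop

    def lookup(root, addr):
        # Longest-prefix match; returns (next hop, nodes visited).
        node, best, visits = root, None, 0
        for i in range(32):
            node = node.children[(addr >> (31 - i)) & 1]
            if node is None:
                break
            visits += 1
            if node.next_hop is not None:
                best = node.next_hop
        return best, visits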
7.5.1 Memory Access Measurements

In our experiments, we used four different traces. The first trace is a subset of the original RedIRIS trace, containing only Web flows; henceforth, we will call this trace the Original trace. The second one is the decompressed trace, obtained by applying our proposed compression/decompression method to the Original trace. A third trace was generated by assigning random destination IP addresses while maintaining the same temporal distribution as the Original trace. Finally, for the last trace, the IP addresses were generated by a multiplicative process and replayed using an LRU stack model with an exponential inter-packet time distribution.

Figure 7.9 plots the cumulative traffic (Y axis) against the number of memory accesses (X axis) when executing the Radix Tree Routing algorithm for the four traces. We observe that the Original and the Decompressed traces show similar behavior, while the other traces depict different shapes. We can see, for instance, that approximately 55% of the traffic from the Original and Decompressed traces performs between 53 and 67 memory accesses. In contrast, the random trace shows only 30% of its traffic in the range of 53 to 62 memory accesses, and the fractal trace presents approximately 27% of its traffic for this same range. Furthermore, we observe that for the Original and Decompressed traces, the range of 53 to 92 accesses corresponds to 60% of the traffic, whereas the random trace has 70% of its traffic executing 53 to 88 memory accesses, and the fractal trace 37% of its traffic executing 53 to 96 memory accesses. These divergences are due to the fact that the number of visited nodes differs.

[Figure 7.9: Memory accesses for the traces: original RedIRIS, the random address trace, the fractal address trace, and the decompressed trace generated by our method.]

7.5.2 Cache Miss Rate

In Figure 7.10, for the same Radix Tree algorithm, we show the cumulative traffic (Y axis) against the cache miss rate (X axis). Here, again, we observe a strong similarity between the Original and the Decompressed traces; in this case, the fractal trace behaves similarly as well, while the random trace shows no concordance with the Original trace. In the graph we can see that about 60% of the packets from the Original and Decompressed traces show a cache miss rate lower than 5%, corresponding to sequences of packets with very similar behavior. In contrast, for this same ratio, we obtain only around 10% of the packets from the Random trace. For a cache miss ratio ranging from 5% to 10%, we observe the inverse behavior, with 50% of the packets from the random trace within this range and only 10% of the packets from the Original and Decompressed traces. In our opinion, the differences between the Original and random traces are due to the fact that in one trace memory needs to be released, whereas in the other trace memory is still available.

[Figure 7.10: Cache miss rate for the traces: original RedIRIS, the random address trace, the fractal address trace, and the decompressed trace generated by our method.]
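The curves of Figure 7.9 can be derived directly from the per-packet access counts produced by the instrumentation; a minimal sketch (the function name is ours):

    from collections import Counter

    def cumulative_traffic(per_packet_accesses):
        # Turn instrumented per-packet memory-access counts into
        # (accesses, cumulative % of traffic) points, as in Figure 7.9.
        counts = Counter(per_packet_accesses)
        total = len(per_packet_accesses)
        curve, running = [], 0
        for accs in sorted(counts):
            running += counts[accs]
            curve.append((accs, 100.0 * running / total))
        return curve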
Chapter 8

Trace Classification

Until recently, routing a packet involved determining its outgoing link based on the destination address and then transferring the packet data to the appropriate link interface. Destination-based packet forwarding treats all packets going to the same destination address identically, providing only best-effort service (servicing packets in a first-come-first-served manner). However, routers are now called upon to provide additional functionalities such as security, billing, accounting, and different qualities of service for different applications. To provide these functionalities, complex packet analyses are performed as part of packet processing, involving multiple fields in the packet. Thus, a better understanding and classification of the TCP/IP packet header fields constitutes an important activity.

8.1. Packet Classification

Traditional routers do not provide service differentiation because they treat all traffic going to a particular Internet address in the same way. However, users are demanding a more discriminating form of router forwarding, called service differentiation. The process of mapping packets to different service classes is referred to as packet classification. Packet classification is important for applications such as firewalls, intrusion detection, differentiated services, VPN implementations, QoS applications, and server load balancing (as shown in Table 8.1).

Table 8.1: Packet classification examples

 Layer   Application
 Two     Switching, MPLS
 Three   Forwarding
 Four    Flow identification, IntServ
 Four    Filtering, DiffServ
 Seven   Load balancing
 Seven   Intrusion detection

Packet classification routers have a database of rules. The rules are explicitly ordered by a network manager (or protocol) that creates the rule database. Thus, when a packet arrives at a router, the router must find a rule that matches the packet headers; if more than one match is found, the first matching rule is applied. Packet classification involves selecting header fields from packets, such as source and destination addresses, source and destination port numbers, protocol, or even parts of a URL, and then finding the best packet classification rule (also called filtering rule or filter) to determine the action to be taken on the packet. Each packet classification rule consists of a prefix (or range of values) for each possible header field, which matches a subset of packets. The requirements for packet classification can vary widely depending on the application and on where packet classification is performed in the network; a sketch of the basic first-match semantics follows this list.

- Resource limitations: Packet classification solutions can trade off the time to perform the classification per packet against the memory used. At large corporate campuses, access speeds may range from medium speeds of T3 and OC-3 to top speeds of OC-12 and above. At inter-ISP boundaries, the access speeds will be OC-12, OC-48, and above. Residential customers have access speeds of T1 (DSL) or less. Solutions should achieve the required target access speed while minimizing the amount of memory used.

- Number of rules to be supported: Packet classification applications differ in the number of rules that are specified. Today, typical firewalls may specify a few hundred rules, while an access/backbone router may have a few hundred thousand rules; these rules are expected to scale up with enhanced services and router throughput and may reach millions of rules.

- Number of fields used: Packet classification applications differ in the number of fields (dimensions) of the IP header used for classification. Current routers use one field (the destination IP address). Firewalls and other access-list applications may use a few more fields [53].

- Nature of rules: Current routers use rules with a prefix mask on destination IP addresses. However, more general masks, such as arbitrary ranges, are expected to become permissible. Packet classification solutions need to accommodate such general specifications.

- Updating the set of rules: The number of changes to the rules, either due to a route or policy change, is moderate to small compared to the number of packets that an application, e.g., a router, needs to classify in the same time period. Packet classification solutions must adapt gracefully and quickly to such updates without sacrificing access performance. Rebuilding major parts of the data structure for every update is prohibitive.

- Worst case vs. average case: There is a widely held view that, for the access time performance of packet classification, one must focus on the worst case rather than the average case [77].
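The first-match semantics just described reduce to a linear scan over the ordered rule database; the sketch below illustrates this with a hypothetical two-rule database (the field layout and rule contents are ours, chosen only for illustration).

    from dataclasses import dataclass
    from ipaddress import ip_address, ip_network

    @dataclass
    class Rule:
        src: str       # source prefix, e.g. "10.0.0.0/8"
        dst: str       # destination prefix
        proto: int     # IP protocol number; -1 acts as a wildcard
        dports: range  # destination port range
        action: str

    def classify(rules, src, dst, proto, dport):
        # Rules are ordered by the network manager; the first rule
        # whose fields all match determines the action.
        for r in rules:
            if (ip_address(src) in ip_network(r.src)
                    and ip_address(dst) in ip_network(r.dst)
                    and r.proto in (-1, proto)
                    and dport in r.dports):
                return r.action
        return None  # no rule matched; a default action would apply

    # Hypothetical database: deny Web traffic from 10/8, permit the rest.
    rules = [
        Rule("10.0.0.0/8", "0.0.0.0/0", 6, range(80, 81), "deny"),
        Rule("0.0.0.0/0", "0.0.0.0/0", -1, range(0, 65536), "permit"),
    ]
    print(classify(rules, "10.1.2.3", "192.0.2.1", 6, 80))  # -> deny

Linear search is the simplest correct scheme; the papers cited below study data structures that classify much faster while respecting the same first-match semantics.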
Many of these requirements have been articulated in the extensive collection of papers that have addressed the packet classification problem [3], [52], [53], [77], [78], [106], [107].

8.2. Flow Classification

Although the proposed solutions vary greatly, two different approaches exist to provide QoS (Quality of Service): one speeds up the forwarding or the route look-up procedure, and the other has routers classify packets based on the information carried in the packet headers, optimizing the amount of work that needs to be done when routing decisions are made [86].

The first set of solutions either speeds up the forwarding procedure by implementing the forwarding engine in hardware instead of software, or optimizes the route look-up procedure through algorithm development. These solutions are referred to as Gigabit-routing solutions, and they tend to move the forwarding procedure to specialized components, speeding up the process of sending packets to the Gigabit/s level [87] [114] [109]. Various algorithms for speeding up the route look-up process have also emerged and gained wide interest; these algorithms usually utilize different and improved binary search methods [33] [64]. However, with the development of new broadband technologies, and thus more bandwidth available, the limits of hardware-based routing or fast route look-ups will eventually be reached.

The second set of solutions aims to decrease the workload of routers by assigning long-lived packet flows to dedicated connections, providing the flows with the possibility of a better service level than can be realized over the default (routed) connection. This approach consists of optimizing the amount of work that needs to be done when routing decisions are made [86] and is known as integrated Internet routing and switching, or the IP switching solution [87] [5] [31] [88] [113]. Routing decisions can be made for only a fraction of the total packets in a flow, thus reducing the total workload of the router. Doing just one route look-up for a series of packets and then forwarding the rest of the packets at OSI layer 2 is an interesting approach compared to the burden of doing the route look-up and forwarding at OSI layer 3 as many times as there are packets in the traffic flow.

Flow classification is one of the key issues in IP switching. An IP switch must perform flow classification to decide whether a flow should be switched directly or forwarded hop-by-hop by the routing software. This is implemented by inspecting the values of the packet header fields and making a decision based upon a local policy. One flow classification policy, the protocol-based policy [80], is to simply classify flows by protocol. With this policy, all TCP flows are selected for switching, while all UDP flows are forwarded hop-by-hop by the routing software. The argument is that connection-oriented services last longer and have more packets to send over a short time than connectionless services. Similarly, flows can also be classified by application, such as ftp, smtp, and http [80]. Only those applications that tend to generate long-lived flows containing a large number of packets are selected for switching.
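A policy of this kind reduces to a simple predicate over a flow's protocol and application port; a minimal sketch, with an illustrative port set (the chosen ports are ours, not values prescribed by [80]):

    def select_for_switching(proto: int, app_port: int) -> bool:
        # Protocol/application-based flow classification: switch flows
        # of connection-oriented, long-lived applications directly;
        # forward everything else hop-by-hop in software.
        LONG_LIVED_PORTS = {20, 21, 25, 80}  # ftp-data, ftp, smtp, http
        TCP = 6
        return proto == TCP and app_port in LONG_LIVED_PORTS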
8.3. Packet Trace Classifier

Our work is related to the second set of solutions. In this chapter we describe a trace classifier. The idea of this trace classifier is to analyze how similar traces collected from different links are, and to provide information about the traces that can be helpful for implementing packet classification rules or for choosing the most appropriate flow classification scheme to be used in traffic-controlled IP switching. To be able to do that, we have developed a visual tool based on a spectrum of colors to easily compare different traces. After describing the trace classification method, we discuss how it can be used in both packet and flow classification.

In this section we propose a three-step methodology to identify how similar traces collected from different places are and how different applications (e.g., Web, P2P, FTP, etc.) are distributed within the trace. The idea behind this proposed classification is to offer an easy and efficient method for selecting different trace characteristics to be used for performance evaluation purposes. We have applied our methodology to traces captured from an OC-3 link (155 Mbps) that connects the Scientific Ring of Catalonia to RedIRIS (the Spanish National Research Network) [98], which comprises about 250 institutions. This non-sanitized trace is a collection of packets flowing in one direction of the link, containing a timestamp and the first 40 bytes of each packet. For our analysis, we have used only the output link. Furthermore, we surveyed traces downloaded from the NLANR Web site [89].

The first step evaluates how packets are distributed among the m-packet flows. Figure 8.1 shows, for four different traces, the percentage of packets (Y axis) placed in different m-packet flows (X axis). The first trace was collected in 1993 and has a high predominance of FTP traffic. The RedIRIS and Memphis University traces have a predominance of Web traffic, but with the presence of P2P traffic. The last trace, captured at Columbia University, shows a high predominance of Web traffic. Based on this first step alone, we can see that two of them (RedIRIS and Memphis) show similar behavior while the others have different distributions. From this first step, we concluded that the trace collected in 1993 is very different from the others, which led us to exclude it from further analysis.

[Figure 8.1: Packet distribution; percentage of packets vs. m-packet flows for the 1993, RedIRIS, Memphis University, and Columbia University traces.]

Using the flow characterization described in section 2, on a high-speed link we can potentially find a large variety of flows. However, looking into the flows, we can see that they are not very different from each other. The second step studies the variety among flows. To be able to do that, we have used an approach based on clustering, a classical technique for workload characterization [67]. The basic idea of clustering is to partition the components into groups so that the members of a group are as similar as possible and different groups are as dissimilar as possible. From the set of flow vectors, we calculate the Euclidean distance between them, and the results are stored in a distance matrix of flows. Initially, each vector represents a cluster. Evidently, a distance of 0 means that two vectors are exactly identical. For each $m$ ranging from 2 to 13, we apply the clustering method separately.
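A sketch of this step, assuming complete-linkage agglomerative clustering over the Euclidean distance matrix (the linkage criterion is our assumption, not fixed by the text):

    import numpy as np
    from scipy.cluster.hierarchy import fcluster, linkage
    from scipy.spatial.distance import pdist

    def cluster_flows(vectors, threshold):
        # vectors: one row per m-packet flow vector.
        # Pairwise Euclidean distances (the "distance matrix of flows").
        dists = pdist(np.asarray(vectors, dtype=float), metric="euclidean")
        # Agglomerative merging: each vector starts as its own cluster,
        # and clusters closer than `threshold` are fused together.
        tree = linkage(dists, method="complete")
        return fcluster(tree, t=threshold, criterion="distance")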
Figure 8.2 shows, for three traces, the number of different clusters (Y axis) for each of the m-packet flows (X axis).

[Figure 8.2: Number of clusters vs. m-packet flows for the RedIRIS, Memphis University, and Columbia University traces.]

Taking the percentage of packets and the number of clusters per m-packet flow, and applying a triangle-based cubic interpolation to create uniformly spaced grid data, we display each trace (RedIRIS, Memphis University, Columbia University) as a surface plot (Figures 8.3, 8.4, and 8.5). Looking at the shape of each figure, we can clearly see that the RedIRIS and the Memphis traces show some similarities and that the Columbia trace shape is totally different.

The third step analyzes, for each of the m-packet flows, how the packets are distributed among their clusters. Using the spectrum-of-colors technique, we have represented in Figure 8.6 the spectrum for $m = 5$. From each trace, we selected the most representative clusters. In Figure 8.6, each bar represents a trace, and each color on the bar represents the percentage of flows that fit that cluster. Plotting the spectra of the three traces under analysis, we can see that the spectra of the RedIRIS and Memphis traces are similar, while the Columbia trace shows a different spectrum. After these three steps, we are able to identify with a high level of precision how semantically similar different traces are. Ongoing work is devoted to identifying the types of applications present in each analyzed trace (e.g., Web, P2P, etc.).

[Figure 8.3: RedIRIS trace - 3D shaping (number of clusters vs. percentage of packets and m-packet flows).]

[Figure 8.4: Memphis trace - 3D shaping.]

[Figure 8.5: Columbia trace - 3D shaping.]

[Figure 8.6: Flow clustering spectrum for the RedIRIS, Memphis University, and Columbia University traces.]
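The per-cluster shares plotted in the spectrum can be computed directly from the cluster labels of the previous step; a minimal sketch (the function name is ours):

    from collections import Counter

    def cluster_spectrum(labels):
        # Percentage of flows falling into each cluster, largest first;
        # these shares drive the color bars of Figure 8.6.
        counts = Counter(labels)
        total = sum(counts.values())
        return [100.0 * c / total for _, c in counts.most_common()]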
Chapter 9

Conclusions

In this work, there are five main contributions:

1) A novel traffic characterization that incorporates semantic characteristics of flows. We understand by semantic characterization the analysis of the traffic characteristics of the TCP/IP header contents. We have demonstrated that behind the great number of flows in a high-speed link there is not much variety among them, and they can clearly be grouped into a set of clusters. For flows with 5 packets, for instance, which represent 20% of the total, we can group them into only 142 types of flows, with 88% of the flows grouped into only four clusters. The evidence that Internet flows can be grouped into a small set of clusters led us to create templates of flows and an efficient method to compress and classify header packet traces.

2) Using the concepts of entropy, we have studied the compression bound for packet header traces. We have seen that for some TCP/IP header fields the entropy is very low. Using a flow-level approach and supported by the chain rule for entropy, we have seen that the compression bound for TCP/IP header traces is around 13%.

3) A lossless packet header compression method based on TCP flow clustering and Huffman encoding. The combined algorithm applies the flow clustering technique to small flows and Huffman encoding to large flows. This approach has significantly improved the compression ratio. We have seen that the flow clustering technique fits small flows well, where many flows are grouped into few templates; in these circumstances, the number of templates remains constant or shows small variations. The technique is based on semantic similarities among flows and TCP/IP functionalities. For large flows, we adopt another approach, based on Huffman encoding, which exploits the similarity between packets during the life of a connection. With our proposed method, the compression ratio achieved for .tsh packet header traces is around 16%, reducing the file size, for instance, from 100 MB to 16 MB. The compression proposed here is more efficient than the other methods evaluated and is simple to implement: the other known methods have their compression ratios bounded at 50% (GZIP) and 32% (Van Jacobson method), pointing out the effectiveness of our method.

4) A lossy compression method. In order to reach higher compression ratios, we have also proposed a methodology for lossy compression. With our proposed method, storage size requirements are reduced to 3% of the original size. Although this is a lossy compression method, analysis of the decompressed trace has demonstrated that, for a set of statistical properties, it represents a good approximation of the original traces. Furthermore, a memory performance evaluation was carried out with four types of traces, and the outcomes of the memory access and cache miss ratio measurements demonstrated the effectiveness of our proposed compression method.

5) A packet trace classifier. We proposed a three-step methodology to identify how similar traces collected from different places are and how different applications (e.g., Web, P2P, FTP, etc.) are distributed within the trace.

As future work, we believe that using the flow clustering methodology presented here, together with other properties such as packet loss, retransmissions, etc., we can build a synthetic traffic generator.

Bibliography

[1] M. Acharya and B. Bhalla. A flow model for computer network traffic using real-time measurements. In Second International Conference on Telecommunications Systems, Modeling and Analysis, March 1994.
[2] M. Acharya, R. Newman-Wolfe, H. Latchman, R. Chow, and B. Bhalla. Real-time hierarchical traffic characterization of a campus area network. In Proceedings of the Sixth International Conference on Modelling Techniques and Tools for Computer Performance Evaluation, 1992.
[3] H. Adiseshu, S. Suri, and G. Parulkar. Packet filter management for layer 4 switching. 1998.
[4] A. Agrawala and D. Sanghi. Network dynamics: an experimental study of the Internet. In Proceedings of Globecom'92, December 1992.
[5] H. Ahmed, R. Callon, A. Malis, and J. Moy. IP switching for scalable IP services. Proceedings of the IEEE, 85:1984–1997, December 1997.
[6] M. Aida and T. Abe. Pseudo-Address Generation Algorithm of Packet Destinations for Internet Performance Simulation. IEEE INFOCOM 2001, 2001.
[7] V. Almeida, A. Bestavros, M. Crovella, and A. Oliveira. Characterizing reference locality in the WWW. In Proceedings of the Fourth International Conference on Parallel and Distributed Information Systems (PDIS96), December 1996.
[8] D. Anick, D. Mitra, and M. Sondhi. Stochastic theory of a data-handling system with multiple sources. Bell System Technical Journal, 1984.
[9] M. Arlitt and C. Williamson. Internet Web Servers: Workload Characterization and Performance Implications. IEEE/ACM Transactions on Networking, 5:815–826, October 1997.
[10] S. Ata, M. Murata, and H. Miyahara. Analysis of Network Traffic and Its Application to Design of High-Speed Routers. IEICE Trans. Inf. and Syst., 5, May 2000.
[11] C. Barakat, P. Thiran, G. Iannaccone, C. Diot, and P. Owezarski. Modeling Internet Backbone Traffic at the Flow Level. IEEE Transactions on Signal Processing - Special Issue on Networking, 51, August 2003.
[12] P. Barford and M. Crovella. Generating Representative Web Workloads for Network and Server Performance Evaluation. In Proceedings of the 1998 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, pages 151–160, July 1998.
[13] J. Beran. Statistics for Long-Memory Processes. Chapman and Hall, New York, NY, 1994.
[14] D. Boggs, J. Mogul, and C. Kent. Measured capacity of an Ethernet: Myths and reality. In Proceedings of ACM SIGCOMM'88, pages 222–234, August 1988.
[15] H. Braun and K. Claffy. Network analysis in support of Internet policy requirements. In Proceedings of INET'93, June 1993.
[16] L. Breslau, P. Cao, L. Fan, G. Phillips, and S. Shenker. Web Caching and Zipf-like Distributions: Evidence and Implications. In Proceedings of IEEE Infocom, April 1999.
[17] E. Brockmeyer, H.L. Halstrom, and A. Jensen. The Life and Works of A. K. Erlang. Transactions of the Danish Academy of Technical Science ATS, 2, 1948.
[18] P. Brockwell and R. Davis. Time Series: Theory and Methods. Springer Series in Statistics. Springer-Verlag, second edition, 1991.
[19] N. Brownlee and K. Claffy. Understanding Internet traffic streams: Dragonflies and tortoises. IEEE Communications Magazine, 40:110–117, October 2002.
[20] CAIDA. The Cooperative Association for Internet Data Analysis. http://www.caida.org.
[21] S. Casner and V. Jacobson. Compressing IP/UDP/RTP Headers for Low-Speed Serial Links. Internet Engineering Task Force, RFC-2508, February 1999.
[22] K. Claffy, H. Braun, and G. Polyzos. A Parameterizable Methodology for Internet Traffic Flow Profiling. IEEE Journal on Selected Areas in Communications, 13, October 1995.
[23] K. Claffy, G. Polyzos, and H. Braun. Traffic characteristics of the T1 NSFNET backbone. In Proceedings of IEEE Infocom 93, pages 885–892, 1993.
[24] K. Claffy, G. Polyzos, and H.-W. Braun. Internet traffic flow profiling. UCSD TR-CS93-328, SDCS GA-A21526, November 1993.
[25] K. Claffy, G. Polyzos, and H. Braun. Measurement considerations for assessing unidirectional latencies. Internetworking: Research and Experience, vol. 4, pp. 121–132, September 1993.
[26] Kimberly C. Claffy. Internet Traffic Characterization. Ph.D. Thesis, University of California, San Diego, 1994.
[27] T.M. Cover and J.A. Thomas. Elements of Information Theory. John Wiley & Sons, Inc.
[28] M. Crovella and A. Bestavros. Self-Similarity in World Wide Web Traffic: Evidence and Possible Causes. In SIGMETRICS96, pages 160–169, May 1996.
[29] C. Cunha, A. Bestavros, and M. Crovella. Characteristics of WWW Client-based Traces. Technical Report BU-CS-95-010, Boston University, Computer Science Department, July 1995.
[30] P. Danzig, S. Jamin, R. Caceres, D.J. Mitzel, and D. Estrin. An empirical workload model for driving wide-area TCP/IP network simulation. Internetworking: Research and Experience, vol. 3, no. 1, 1991.
[31] B. Davie, P. Doolan, and Y. Rekhter. Switching in IP Networks - IP Switching, Tag Switching and Related Technologies. Morgan Kaufmann Publishers, 1998.
[32] DEFLATE. Compressed data format specification. Available at ftp://ds.internic.net/rfc/rfc1951.txt.
[33] M. Degermark, A. Brodnik, and S. Pink. Small Forwarding Tables for Fast Routing Lookups. Lulea University of Technology, 1997.
[34] M. Degermark, M. Engan, B. Nordgren, and S. Pink. Low-loss TCP/IP Header Compression for Wireless Networks. Proc. MOBICOM, November 1996.
[35] M. Degermark, B. Nordgren, and S. Pink. IP Header Compression. Internet Engineering Task Force, RFC-2507, February 1999.
[36] S. Deng. Empirical Model of WWW Document Arrivals at Access Link. IEEE International Conference on Communication, June 1996.
[37] A.K. Erlang. The Theory of Probabilities and Telephone Conversations. Nyt Tidsskrift Matematik, 20:33–39, 1909.
[38] A.K. Erlang. Solution of Some Problems in the Theory of Probabilities of Significance in Automatic Telephone Exchanges. Electrical Engineering Journal, 10:189–197, 1917.
[39] D. Estrin and D. Mitzel. An assessment of state and lookup overhead in routers. In Proceedings of IEEE Infocom 92, pages 2332–2342, 1992.
[40] C. Bormann et al. RObust Header Compression (ROHC): Framework and four profiles: RTP, UDP, ESP, and uncompressed. Request for Comments 3095, July 2001.
[41] F. Arts et al. Network processor requirements and benchmarking. Computer Networks Journal, Special Issue: Network Processors, 41, April 2003.
[42] C.J.G. Evertsz and B.B. Mandelbrot. Multifractal measures. In H.-O. Peitgen, H. Jurgens, and D. Saupe, editors, Chaos and Fractals: New Frontiers in Science, Springer-Verlag, New York, 1992.
[43] D. Feldmeier. Improving gateway performance with a routing table cache. In Proceedings of IEEE Infocom 88, pages 298–307, March 1988.
[44] S. Floyd and V. Jacobson. On traffic phase effects in packet-switched gateways. Internetworking: Research and Experience, vol. 3, pp. 115–156, September 1992.
[45] S. Floyd and V. Jacobson. The synchronization of periodic routing messages. In Proceedings of ACM SIGCOMM'93, pages 33–44, September 1993.
[46] R. Fonseca, V. Almeida, M. Crovella, and B. Abrahao. On the Intrinsic Locality Properties of Web Reference Streams. BUCS-TR-2002-022, August 2002.
[47] J.-L. Gailly and M. Adler. GZIP documentation and sources. ftp://prep.ai.mit.edu/pub/gnu/.
[48] J.-L. Gailly and M. Adler. ZLIB documentation and sources. ftp://ftp.uu.net/pub/archiving/zip/doc/.
[49] S. Glassman. A caching relay for the World Wide Web. In Proceedings of the First International World Wide Web Conference, pages 69–76, 1994.
[50] N. Gulati, C. Williamson, and R. Bunt. Local area network locality: Characterization and application. In Proceedings of the First International Conference on LAN Interconnection, pages 233–250, October 1993.
[51] L. Guo and I. Matta. The war between mice and elephants. Technical Report BU-CS-2001-05, Boston University, Computer Science Department, May 2001.
[52] P. Gupta, S. Lin, and N. McKeown. Routing lookups in hardware at memory access speeds. Proc. IEEE INFOCOM, San Francisco, California, page 1241, 1998.
[53] P. Gupta and N. McKeown. Packet classification on multiple fields. ACM Computer Communication Review, 1999.
[54] D. Harte. Multifractals: Theory and Applications. Chapman and Hall, 2001.
[55] J. F. Hayes. Modeling and Analysis of Computer Communications Networks. Plenum Publishing Corporation, New York, NY, 1984.
[56] H. Heffes and D. Lucantoni. A Markov modulated characterization of packetized voice and data traffic and related statistical multiplexer performance. IEEE Journal on Selected Areas in Communications, vol. 4, pp. 856–868, April 1986.
[57] S. Heimlich. Traffic characterization on the NSFNET National Backbone. In Proceedings of the 1990 Winter USENIX Conference, December 1988.
[58] J. Hennessy and D. Patterson. Computer Architecture: A Quantitative Approach. Morgan Kaufmann Publishers, 1990.
[59] R. Holanda and J. Garcia. A Lossless Compression Method for Internet Packet Headers. EuroNGI Conference on Next Generation Internet Networks - Traffic Engineering, Rome, Italy, April 2005.
[60] R. Holanda and J. Garcia. A new methodology for packet trace classification and compression based on semantic traffic characterization. ITC - 19th International Teletraffic Congress, Beijing, China, August 2005.
[61] R. Holanda, J. Garcia, and V. Almeida. Flow Clustering: a New Approach to Semantic Traffic Characterization. 12th Conference on Measuring, Modelling, and Evaluation of Computer and Communication Systems, Dresden, Germany, September 2004.
[62] R. Holanda, J. Verdu, J. Garcia, and M. Valero. Performance Analysis of a New Packet Trace Compressor based on TCP Flow Clustering. ISPASS 2005 - IEEE International Symposium on Performance Analysis of Systems and Software, Austin, Texas, USA, March 2005.
[63] J.Y. Hui. Resource allocation for broadband networks. IEEE J. Selected Areas in Comm., 6:1598–1608, 1988.
[64] M. Waldvogel, J. Turner, G. Varghese, and B. Plattner. Scalable High Speed IP Routing Lookups. In Proceedings of ACM SIGCOMM 97, pages 25–36, 1997.
[65] Van Jacobson. Compressing TCP/IP Headers for Low-Speed Serial Links. RFC-1144, February 1990.
[66] R. Jain. Characteristics of destination address locality in computer networks: a comparison of caching schemes. Computer Networks and ISDN Systems, vol. 18, pp. 243–254, May 1990.
[67] R. Jain. The Art of Computer Systems Performance Analysis. John Wiley & Sons, 1991.
[68] R. Jain and S. Routhier. Packet trains: measurements and a new model for computer network traffic. IEEE Journal on Selected Areas in Communications, pages 986–995, September 1986.
[69] S. Jin and A. Bestavros. Sources and Characteristics of Web Temporal Locality. In Proceedings of the 8th MASCOTS, IEEE Computer Society Press, August 2000.
[70] T. Karagiannis, M. Molle, M. Faloutsos, and A. Broido. A Nonstationary Poisson View of Internet Traffic. IEEE INFOCOM, 2004.
[71] L. Kleinrock. Queueing Systems, Volume 1: Theory. Wiley, 1975.
[72] L. Kleinrock and R. Gail. Queueing Systems: Problems and Solutions. Wiley Interscience, 1996.
[73] S. Klivansky, A. Mukherjee, and C. Song. On Long-Range Dependence in NSFNET Traffic. http://www.cc.gatech.edu/, December 1994.
[74] D.E. Knuth. Dynamic Huffman coding. Journal of Algorithms, pages 163–180, June 1985.
[75] E. Kohler, J. Li, V. Paxson, and S. Shenker. Observed structure of addresses in IP traffic. In Proceedings of the SIGCOMM Internet Measurement Workshop, November 2002.
[76] B. Kumar. Effect of packet losses on end-user cost in internetworks with usage based charging. Computer Communications Review, August 1993.
[77] T.V. Lakshman and D. Stiliadis. High-speed policy-based packet forwarding using efficient multi-dimensional range matching. ACM Computer Communication Review, 28:203–214, September 1998.
[78] B. Lampson, V. Srinivasan, and G. Varghese. IP lookups using multiway and multicolumn search. IEEE INFOCOM, page 1248, 1998.
[79] W. Leland, M. Taqqu, W. Willinger, and D. Wilson. On the self-similar nature of Ethernet traffic. In Proceedings of ACM SIGCOMM 93, September 1993.
[80] S. Lin and N. McKeown. A Simulation Study of IP Switching. ACM SIGCOMM'97, pages 15–24, 1997.
[81] B. B. Mandelbrot. Fractals: Form, Chance and Dimension. San Francisco, CA, 1977.
[82] G. Memik, W.H. Mangione-Smith, and W. Hu. NetBench: A benchmarking suite for network processors. IEEE International Conference on Computer-Aided Design (ICCAD), 2001.
[83] J. Mogul. Network locality at the scale of processes. In Proceedings of ACM SIGCOMM'91, pages 273–285, September 1991.
[84] J. Mogul. Observing TCP dynamics in real networks. In Proceedings of ACM SIGCOMM'92, pages 305–317, August 1992.
[85] A. Mukherjee. On the dynamics and significance of low frequency components of Internet load. Technical Report, December 1992.
[86] P. Newman, G. Minshall, T. Lyon, and L. Huston. IP switching and gigabit routers. IEEE Communications Magazine, pages 64–69, January 1997.
[87] P. Newman, G. Minshall, T. Lyon, and L. Huston. IP switching and gigabit routers. IEEE Communications Magazine, pages 64–69, January 1997.
[88] P. Newman, G. Minshall, and T.L. Lyon. IP Switching - ATM under IP. IEEE/ACM Transactions on Networking, 6:117–129, April 1998.
[89] NLANR. National Laboratory for Applied Network Research. http://www.nlanr.net.
[90] NLANR. Passive Measurement and Analysis: Site configuration and status. http://pma.nlanr.net/PMA/Sites/.
[91] R. Pang and V. Paxson. A High-Level Programming Environment for Packet Trace Anonymization and Transformation. In Proceedings of the ACM SIGCOMM Conference, August 2003.
[92] K. Park, G. Kim, and M. Crovella. On the effect of traffic self-similarity on network performance. In Proceedings of the SPIE International Conference on Performance and Control of Network Systems, November 1997.
[93] V. Paxson. Growth trends of wide-area TCP conversations. IEEE Network, 1994.
[94] V. Paxson and S. Floyd. Wide-area traffic: the failure of Poisson modeling. Technical Report, Lawrence Berkeley Laboratory, February 1994.
[95] V. Paxson and S. Floyd. Wide Area Traffic: The Failure of Poisson Modeling. IEEE/ACM Transactions on Networking, vol. 3:226–244, June 1995.
[96] H. O. Peitgen, H. Jurgens, and D. Saupe. Chaos and Fractals. Springer-Verlag, 1992.
[97] COST 242 Project. Broadband Network Teletraffic - Final Report of Action. Springer, 1996.
[98] RedIRIS. Spanish National Research Network. http://www.rediris.es.
[99] RFC 791. Internet Protocol. DARPA Internet Program Protocol Specification, September 1981.
[100] RFC 793. Transmission Control Protocol. DARPA Internet Program Protocol Specification, September 1981.
[101] R. Riedi. Introduction to multifractals. Technical Report, Rice University, October 1999.
[102] S. Robert and J. Le Boudec. New models for self-similar traffic. Performance Evaluation, pages 57–68, 1997.
[103] V. Rutenburg and R. G. Ogier. Fair charging policies and minimum-expected-cost routing in internets with packet loss. In Proceedings of IEEE Infocom 91, pages 279–288, April 1991.
[104] D. Sanghi, A. Agrawala, O. Gudmundson, and B. Jain. Experimental assessment of end-to-end behavior on the Internet. In Proceedings of IEEE Infocom 93, pages 867–874, March 1993.
[105] A. Schmidt and R. Campbell. Internet protocol traffic analysis with applications for ATM switch design. Computer Communications Review, vol. 23, pp. 39–52, April 1993.
[106] V. Srinivasan, S. Suri, and G. Varghese. Packet classification using tuple space search. ACM SIGCOMM'99, September 1999.
[107] V. Srinivasan, G. Varghese, S. Suri, and M. Waldvogel. Fast and scalable layer four switching. ACM SIGCOMM'98, September 1998.
[108] A. Srivastava and A. Eustace. ATOM - A system for building customized program analysis tools. Programming Language Design and Implementation (PLDI), pages 196–205, June 1994.
[109] A. Tantawy, O. Koufopavlou, M. Zitterbart, and J. Abler. On the Design of a Multigigabit IP Router. Journal of High Speed Networks, 1994.
[110] M. Taqqu, W. Willinger, and R. Sherman. Proof of a Fundamental Result in Self-Similar Traffic Modeling. ACM Computer Communication Review, 1997.
[111] tcpdump. The tcpdump program. ftp://ftp.ee.lbl.gov/tcpdump.tar.Z.
[112] TSH. TSH format. http://pma.nlanr.net/Traces/tsh.format.html.
[113] A. Viswanathan, N. Feldman, Z. Wang, and R. Callon. Evolution of multiprotocol label switching. IEEE Communications Magazine, 36:165–173, May 1998.
[114] R.J. Walsh and C.M. Ozveren. The GIGAswitch control processor. IEEE Network, pages 36–43, February 1995.
[115] Z. Wang and J. Crowcroft. Eliminating periodic packet losses in the 4.3-Tahoe BSD TCP congestion control algorithm. Computer Communications Review, April 1992.
[116] C. Westphal. Improvements on IP Header Compression. GLOBECOM 2003 - IEEE Global Telecommunications Conference, 22:676–681, December 2003.
[117] C. Williamson. Internet Traffic Measurements. IEEE Internet Computing, 5:70–74, November 2001.
[118] W. Willinger. Variable-bit-rate video traffic and long-range dependence. IEEE Transactions on Communication, 1994.
[119] W. Willinger, V. Paxson, and M. Taqqu. Self-similarity and heavy tails: Structural modeling of network traffic. In A Practical Guide to Heavy Tails: Statistical Techniques and Applications, Birkhauser Verlag, 1998.
[120] W. Willinger, M. Taqqu, R. Sherman, and D. Wilson. Self-similarity through high-variability: Statistical analysis of Ethernet LAN traffic at the source level. IEEE/ACM Transactions on Networking, 5(1):71–86, February 1997.
[121] W. Willinger, M.S. Taqqu, and A. Erramilli. A bibliographical guide to self-similar traffic and performance modeling for modern high-speed networks. In Stochastic Networks: Theory and Applications, Royal Statistical Society Lecture Notes Series, Oxford University Press, 4:339–366, 1996.
[122] T. Wolf and M.A. Franklin. CommBench - a telecommunications benchmark for network processors. In Proc. of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2000.
[123] L. Zhang, S. Shenker, and D. Clark. Observations on the dynamics of a congestion control algorithm: The effects of two-way traffic. In Proceedings of ACM SIGCOMM'91, pages 133–148, September 1991.
[124] J. Ziv and A. Lempel. A universal algorithm for sequential data compression. IEEE Transactions on Information Theory, 23:337–343, 1977.