Mosharaf Chowdhury, PhD Student in CS

My current high-level research interests include data center networks and large-scale data-parallel systems. Recently, I have worked on a project on improving the performance and scalability of packet classification in wide-area and data center networks. I have also worked on network virtualization that allows multiple heterogeneous virtual networks to coexist on a shared physical substrate. Please visit http://www.mosharaf.com/ for more information on my research. I want to take away lessons on macro- and micro-level parallelism from this course and possibly apply them in my research. In the following, I summarize a recent work on botnet detection using large-scale data-parallel systems (MapReduce [2] and Dryad/DryadLINQ [3][4]). While previous incarnations of this assignment resulted in people discussing how MapReduce (or its open-source implementation, Hadoop) works, this one is about how MapReduce-like large data-parallel systems can be used in practice for useful purposes.

BotGraph: Large Scale Spamming Botnet Detection [1]

Motivation

Analyzing large volumes of logged data to identify abnormal patterns is one of the biggest and most frequently faced challenges in the network security community. This paper presents BotGraph, a mechanism based on random graph theory that detects botnet spamming attacks on major web email providers from millions of log entries, and describes its implementation as a data-parallel system. The authors posit that even though individually detecting bot-users (false email accounts/users created by botnets) is difficult, bot-users do exhibit aggregate behavior (e.g., they share IP addresses when they log in and send emails). BotGraph detects abnormal sharing of IP addresses among bot-users to separate them from human users.

Core Concept

BotGraph has two major components: aggressive sign-up detection, which identifies sudden increases of sign-up activity from the same IP address, and stealthy bot detection, which detects both the sharing of one IP address by many bot-users and the sharing of many IP addresses by a single bot-user by creating a user-user graph. In this paper, the authors treat multiple IP addresses from the same autonomous system (AS) as one shared IP address. It is also assumed that legitimate groups of normal users do not use the same set of IP addresses from different ASes. To identify bot-user groups, the authors create a user-user graph in which each user is a vertex and each edge between two vertices carries a weight based on a similarity metric of the two vertices. The authors' assumption is that the bot-users will form a giant connected component in this graph (since they share IP addresses), which collectively distinguishes them from the normal users (who form much smaller connected components). They show that there exists some threshold on edge weights which, if decreased, suddenly results in large components. Using results from random graph theory, they prove that if IP addresses are shared, then the giant component will appear with high probability. The detection algorithm works iteratively: it starts with a small threshold value to create a large component, and then recursively increases the threshold to extract connected sub-components, until the sizes of the connected components fall below some constant. The resulting output is a hierarchical tree.
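To make the recursive extraction concrete, below is a minimal single-machine sketch of this thresholding procedure. The function names (connected_components, extract_tree), the parameters (step, min_size), and the representation of the graph as a dictionary of weighted edges are hypothetical choices for illustration, not the authors' implementation, which operates on a far larger graph.

```python
from collections import defaultdict

def connected_components(nodes, edges, threshold):
    """Connected components of the subgraph that keeps only edges with weight >= threshold."""
    adj = defaultdict(list)
    for (u, v), w in edges.items():
        if w >= threshold:
            adj[u].append(v)
            adj[v].append(u)
    seen, components = set(), []
    for start in nodes:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            n = stack.pop()
            if n in seen:
                continue
            seen.add(n)
            comp.add(n)
            stack.extend(adj[n])
        components.append(comp)
    return components

def extract_tree(nodes, edges, threshold, step=1, min_size=100):
    """Recursively raise the edge-weight threshold to split large components into
    sub-components, producing a hierarchical tree of candidate bot-user groups."""
    tree = []
    for comp in connected_components(nodes, edges, threshold):
        if len(comp) <= min_size:
            tree.append((threshold, comp, []))   # small enough: stop recursing
        else:
            sub_edges = {(u, v): w for (u, v), w in edges.items()
                         if u in comp and v in comp}
            children = extract_tree(comp, sub_edges, threshold + step, step, min_size)
            tree.append((threshold, comp, children))
    return tree
```

Each tree node records the threshold at which a component was extracted together with its members; the subsequent pruning and consolidation phases described next would then walk this tree.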
The algorithm then uses statistical measurements (histograms) to separate bot-user group components from those formed by legitimate users. After this pruning stage, BotGraph goes through another phase that traverses the hierarchical tree to consolidate bot-user group information.

Data-parallelism in Action

The biggest challenge in applying BotGraph is the complexity of constructing the graph, which can contain millions of vertices. The authors propose two approaches: one that uses the currently popular large-scale data-parallel processing framework MapReduce, and one based on selective filtering that requires two interfaces beyond the basic Map and Reduce and is implemented on Dryad/DryadLINQ.

Method 1. Simple Data-parallelism

In this approach, the authors partition the data by IP address and then leverage the well-known Map and Reduce operations to convert graph construction into a data-parallel application in a straightforward manner. As illustrated in Figure 1, the input dataset is partitioned by the user-login IP address (Step 1). During the Map phase (Steps 2 and 3), for any two users Ui and Uj sharing the same IP-day pair, where the IP address is from autonomous system ASk, the system outputs an edge with weight one, e = (Ui, Uj, ASk). Only edges pertaining to different ASes need to be returned (Step 3). After the Map phase, all the generated edges (from all partitions) serve as inputs to the Reduce phase. In particular, all edges are hash-partitioned to a set of processing nodes for weight aggregation, using the (Ui, Uj) tuples as hash keys (Step 4). After aggregation, the outputs of the Reduce phase are graph edges with aggregated weights; a minimal sketch of both processing flows is given after the description of Method 2 below.

Figure 1: Process Flow of Method 1 [1]

Method 2. Selective Filtering

In this approach, the authors partition the inputs by user ID. Any two users located in the same partition can have their lists of IP-day pairs compared directly to compute their edge weight. For two users whose records are located in different partitions, however, one user's records must be shipped to the other user's partition before computing their edge weight, which results in huge communication costs. To reduce this cost, the authors selectively filter out the records that do not share any IP-day keys. Figure 2 shows the processing flow of generating user-user graph edges with this optimization. For each partition pi, the system computes a local summary si to represent the union of all the IP-day keys involved in that partition (Step 2). Each local summary si is then distributed across all nodes for selecting the relevant input records (Step 3). At each partition pj (j != i), upon receiving si, pj returns all the login records of users who share IP-day keys with si. This step can be further optimized based on the edge threshold w: if a user in pj shares fewer than w IP-day keys with the summary si, this user cannot generate an edge with weight at least w. Thus only the login records of users who share at least w IP-day keys with si are selected and sent to partition pi (Step 4). To ensure that the selected user records are shipped to the right original partition, an additional label denoting the partition ID is added to each original record (Step 7). Finally, after partition pi receives the records from partition pj, it joins these remote records with its local records to generate graph edges (Steps 8 and 9).
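As referenced above, here is a minimal, single-machine sketch of Method 1's Map and Reduce steps, assuming the input has already been partitioned by login IP address. The function names (map_partition, reduce_edges, run_method1), the record layout (user_id, ip_day, as_number), and the choice of counting distinct ASes as the aggregated edge weight are assumptions made for illustration, not the authors' exact implementation.

```python
from collections import defaultdict
from itertools import combinations

def map_partition(login_records):
    """Map phase over one partition of login records (partitioned by login IP).
    Each record is a (user_id, ip_day, as_number) tuple. For every pair of users
    sharing the same IP-day pair, emit an edge of weight one keyed by the user pair."""
    by_ip_day = defaultdict(set)
    for user, ip_day, asn in login_records:
        by_ip_day[(ip_day, asn)].add(user)
    for (ip_day, asn), users in by_ip_day.items():
        for u_i, u_j in combinations(sorted(users), 2):
            yield (u_i, u_j), asn          # an edge of weight one, tagged with its AS

def reduce_edges(user_pair, as_numbers):
    """Reduce phase: edges are hash-partitioned on the (Ui, Uj) key and their weights
    aggregated; here the weight counts distinct ASes (an assumed similarity metric)."""
    u_i, u_j = user_pair
    return u_i, u_j, len(set(as_numbers))

def run_method1(partitions):
    """Toy driver standing in for the MapReduce shuffle between Map and Reduce."""
    shuffled = defaultdict(list)
    for records in partitions:
        for pair, asn in map_partition(records):
            shuffled[pair].append(asn)
    return [reduce_edges(pair, asns) for pair, asns in shuffled.items()]
```

In a real deployment, each map_partition call runs on a different machine and the shuffled dictionary is replaced by the framework's hash partitioning, which is exactly the cross-node traffic of weight-one edges that the paper measures for this method.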
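Method 2 can be sketched in the same style, with the input now partitioned by user ID and each user's history represented as a set of IP-day keys. Again, the names (local_summary, select_records, join_and_emit_edges) are hypothetical, and the edge weight is simplified here to the number of shared IP-day keys.

```python
from itertools import combinations

def local_summary(partition):
    """Step 2: the summary s_i, i.e., the union of all IP-day keys in partition p_i."""
    summary = set()
    for ip_day_keys in partition.values():
        summary |= ip_day_keys
    return summary

def select_records(remote_partition, summary, w):
    """Steps 3-4, run on every other partition p_j after s_i is broadcast: keep only
    users sharing at least w IP-day keys with s_i; other users cannot contribute an
    edge of weight >= w, so their records are never shipped to p_i."""
    return {user: keys for user, keys in remote_partition.items()
            if len(keys & summary) >= w}

def join_and_emit_edges(local_partition, shipped_records, w):
    """Steps 8-9, run on p_i: join the shipped records with the local ones and emit
    edges whose weight (here, the number of shared IP-day keys) is at least w."""
    merged = {**local_partition, **shipped_records}
    edges = []
    for u_i, u_j in combinations(sorted(merged), 2):
        if u_i not in local_partition and u_j not in local_partition:
            continue  # pairs of purely remote users are handled by their own partition
        weight = len(merged[u_i] & merged[u_j])
        if weight >= w:
            edges.append((u_i, u_j, weight))
    return edges
```

The broadcast of s_i to all partitions and the join of shipped records with local records correspond to the two operators beyond Map and Reduce discussed next.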
Other than Map and Reduce, this method requires two additional programming interfaces: an operation to join two heterogeneous data streams and an operation to broadcast a data stream (both implemented using Dryad/DryadLINQ).

Figure 2: Process Flow of Method 2 [1]

The main difference between the two data processing flows is that Method 1 generates edges of weight one and sends them across the network in the Reduce phase, while Method 2 directly computes edges with weight w or more, at the cost of building a local summary and transferring the selected records across partitions. However, the cross-node communication in Method 1 is constant, whereas for Method 2, as the number of computers increases, both the aggregated local summary size and the number of user records to be shipped grow, resulting in a larger communication overhead.

Results

The authors found Method 2 to be the more resource-efficient, scalable, and faster of the two approaches. Using a 221.5 GB workload, Method 1 created the user-user graph in just over 6 hours with 12.0 TB of communication overhead, whereas Method 2 finished in 95 minutes with only 1.7 TB of overhead. Evaluations using the data-parallel implementations show that BotGraph could identify a large number of previously unknown bot-user accounts with low false positive rates.

References

[1] Y. Zhao et al., “BotGraph: Large Scale Spamming Botnet Detection,” NSDI’09.
[2] J. Dean and S. Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters,” OSDI’04.
[3] M. Isard et al., “Dryad: Distributed Data-parallel Programs from Sequential Building Blocks,” EuroSys’07.
[4] Y. Yu et al., “DryadLINQ: A System for General-purpose Distributed Data-parallel Computing Using a High-level Language,” OSDI’08.