Automated Signature Extraction for High Volume Attacks Yehuda Afek Anat Bremler-Barr Shir Landau Feibish This work is part of the Kabarnit–Cyber Consortium (2012-2014) under Magnet program, funded by the chief scientist in the Israeli ministry of Industry, Trade and Labor. This research was also partly supported by European Research Council (ERC) Starting Grant no. 259085. Current DDoS Attack Zombies on innocent computers Infrastructurelevel DDoS attacks Bandwidth-level DDoS attacks 2 Server-level DDoS attacks High volume attacks - Current Defense Many different types of attackers: Defense Line1 Remaining attacks: Botnets (millions of computers) Hard to identify behaviorally, under the radar screen Zero-day – no known signatures Defense Line 2 access SYN cookies, control list Challengefiltering response 3 Defense Line 3 behavioral analysis … Defense Line n Call for HELP!! Signature based DDoS Attack Detection Unknown (zero-day) attacks: Some hope: Attack tools usually leave some unique footprint (repeating pattern) Example in packet: Connection: KEEP-ALIVE Today: Find signatures manually (human eye) Our goal: Find it automatically Signatures used by anti-DDoS devices and firewalls to stop attack 4 Mitigation in minutes, good enough for these types of attacks Signatures also used in NIDS/IPS (Snort, Bro, etc.) Worm detection (automated extraction) Previous work: Worm behavior (address dispersion, suspicious code, etc.) Fixed-length signatures Non-scalable Notable works: 5 Kephart et al ‘94 Honeycomb [Kreibich et al ’04] Earlybird [Singh et al ‘04] Autograph[Kim et al ’04] Hancock[Griffin et al ’09] System Overview Peace time traffic sample Attack time traffic sample Signature Extraction Attack signatures e.g. Connection: KEEP-ALIVE Our Challenge: Automatically find signatures that appear frequently only during attack Where: Input collection: 6 In mitigation box (DDoS Guard/firewall/anti-DDoS etc.) In the cloud – collect data from several collectors. Signature Extraction - High Level Signature Extraction Peace time traffic sample Attack time traffic sample 7 Find frequent strings in peace time traffic Find frequent strings in attack time traffic Take only strings found in attack and not in peace Attack signatures e.g. Connection: KEEP-ALIVE Our Goal Automatically find signatures that appear frequently only during attack Requirements: Find minimal set of signatures 1. Allow signatures of varying lengths Don’t include signatures found in legitimate traffic 2. 3. Minimum false positives Minimize space and time usage 4. 8 Some filtering devices have limited capacity Large amounts of data Quick response Finding Frequent Strings in Traffic Input: Sequence of packets Output: Strings that appear frequently in packets Peace time traffic sample Attack time traffic sample 9 Find frequent strings in attack time traffic Take only strings found in attack and not in peace Attack signatures e.g. Connection: KEEP-ALIVE Common Stringology solution: use suffix trees/arrays Find frequent strings in peace time traffic too much space Our solution uses heavy hitters Heavy Hitters (Frequent Items) Input: N values, integer v Output: v values each appearing at least N/v times Approximate solution: Uses O(v) space! One pass over input! Known counter based HH Algorithms: 10 Misra & Gries 1982 Lossy Counting – Monku and Motwani 2002 Space saving - Metwally et al 2005 – currently using Space saving Heavy Hitters [Metwally et al 2005] Algorithm: Maintain v values, and their counters. Input 10 22 value counter 30 10 1 10 22 1 35 30 1 50 11 Space saving Heavy Hitters [Metwally et al 2005] Algorithm: Maintain v values, and their counters. If next value x is one of the v, increment its counter. Input 10 22 value counter 30 10 2 10 22 1 35 30 1 50 12 Space saving Heavy Hitters [Metwally et al 2005] Algorithm: Maintain v values, and their counters. If next value x is one of the v, increment its counter. Else take item with minimal counter c: Replace value with x New counter is c+1 Input 10 Error rate: N/v 22 value counter 30 10 2 10 35 2 35 30 1 50 13 Our Solution Heavy hitters usually done on numbers… how do we use it for text? k-grams: strings of length exactly k Trivial idea: For each packet: Take all k-grams (sliding window) Do Heavy hitters on them b1=abca b2 = bcab k-grams b3 = cabc Fixed length not good enough Either too short: cuts up longer signatures 14 abcabcadefgfsdghjghnfdghfgsdhfjs Substring pollution - Too many heavy hitters for one signature Or too long : noisy signatures Our Solution: Double Heavy Hitters Double Heavy Hitters algorithm: two separate instances of heavy hitters Heavy Hitters 1: Find heavy hitters of k-grams Heavy Hitters 2: Find heavy hitters of varying-length strings created during run of Heavy Hitters 1 Input to Heavy Hitters 1: k-grams …. k k k Heavy Hitters 1 k k 15 k k Input to Heavy Hitters 2: strings string string Heavy Hitters 2 string string k string Output is output of Heavy Hitters 2 Double Heavy Hitters Algorithm While processing k-grams in Heavy Hitters1 Find max run of k-grams: Already in Heavy Hitters 1 Counters of consecutive k-grams maintain predefined ratio Create string Insert into Heavy Hitters 2 k-grams: abca Is already in Heavy Hitters 1? N bcab cabc abca bcab cabc abcd N N Y Y Y N abcabc Check ratio 16 bcda N cdab dabc abca N N Y abca Double Heavy Hitters Algorithm Example: bi K-gram b1 abca b2 bcab Heavy Hitters 1 Heavy Hitters 2 b3 cabc K-gram counter string counter b4 abca abca 1 NULL 0 b5 bcab bcab 1 NULL 0 b6 cabc cabc 1 NULL 0 b7 abcd 17 Input: abcabcabcd Double Heavy Hitters Algorithm Example: bi K-gram b1 abca b2 bcab Heavy Hitters 1 Heavy Hitters 2 b3 cabc K-gram counter string counter b4 abca abca 2 NULL 0 b5 bcab bcab 1 NULL 0 b6 cabc cabc 1 NULL 0 b7 abcd 18 Input: abcabcabcd String = abca Double Heavy Hitters Algorithm Example: bi K-gram b1 abca b2 bcab Heavy Hitters 1 Heavy Hitters 2 b3 cabc K-gram counter string counter b4 abca abca 2 NULL 0 b5 bcab bcab 2 NULL 0 b6 cabc cabc 1 NULL 0 b7 abcd 19 Input: abcabcabcd String = abcab Double Heavy Hitters Algorithm Example: bi K-gram b1 abca b2 bcab Heavy Hitters 1 Heavy Hitters 2 b3 cabc K-gram counter string counter b4 abca abca 2 NULL 0 b5 bcab bcab 2 NULL 0 b6 cabc cabc 2 NULL 0 b7 abcd 20 Input: abcabcabcd String = abcabc Double Heavy Hitters Algorithm Example: bi K-gram b1 abca b2 bcab Heavy Hitters 1 Heavy Hitters 2 b3 cabc K-gram counter string b4 abca abcd 3 abcabc 1 b5 bcab bcab 2 NULL 0 b6 cabc cabc 2 NULL 0 b7 abcd 21 Input: abcabcabcd String = abcabc counter Heavy Hitters on text – improving the estimation Problem: substrings in heavy hitters Only longest run is in input to HH2 Correct the count: After run of algorithm For all strings s in Heavy Hitters 2: Heavy Hitters 2 string counter wonder 200 woman 300 wonderwoman 100 Find other strings which contain s and add their counters to s’s counter Heavy Hitters 2 22 string counter Real counter wonder 200 300 woman 300 400 wonderwoman 100 100 Double Heavy Hitters Algorithm Analysis Input: Error bounds: For HH1 with v items: N/v For HH2 with v items: C/v We Prove: 23 Input to HH1: N k-grams Input to HH2: C consecutive grams C ≤ N/(k + 1) Overall: Error bound of the Double Heavy Hitters algorithm Signature Extraction - High Level Signature Extraction Peace time traffic sample Attack time traffic sample Find frequent strings in peace time traffic Find frequent strings in attack time traffic Take only strings found in attack and not in peace Attack signatures e.g. Connection: keep-ALIVE Formalize with thresholds 24 Chose Signatures Create signatures that never appear in legitimate traffic Thresholds: Attack-high Peace-low Peace-high Delta Strings in attack with frequency > Attack-High 25 Chose Signatures Create signatures that never appear in legitimate traffic Thresholds: Attack-high Peace-low Peace-high Delta Strings in attack with frequency > Attack-High Signatures 26 Strings in peace time False positives Chose Signatures Create signatures that rarely appear in legitimate traffic Thresholds: Attack-high Peace-low Peace-high Delta Strings in attack with frequency > Attack-High Signatures 27 Strings in peace with frequency > Peace-Low False positives Chose Signatures Create signatures that may appear in legitimate traffic, but appear in attack traffic much more Thresholds: Attack-high Peace-low Peace-high Delta Strings in attack with frequency > Attack-High frequency > Peace-high frequency > Peace-Low Signatures Signatures only if attack frequency at least delta more than peace frequency 28 False positives Use peace traffic to create filters Use our Double Heavy Hitters algorithm on peace time traffic: 100% Peace time traffic packets payload: White list abcabcadefgfsdghjghnfdghfg...... b1=abca b2 = bcab b3 = cabc Double Heavy Hitters Algorithm frequency > Peace-high …… Peace-high Output values Maybe white list 50% frequency > Peace-high frequency > Peace-Low Peace-low Not white list 0% 29 Extracting Attack Signatures Now use Double Heavy Hitters algorithm on attack time traffic with filters Attack traffic packets payload: hagdhdadjashdklahdjkasfjasbfjabfhfgahfvhsbdfjkasnkiaywtqyeffcgfacsdxasdbas b1=hagd b2 = agdh Heavy Hitters 1 string Modified DHH 30 b3 = gdhd …… White list: discard if contained in whitelist string Heavy Hitters 2 Output values Signatures Maybe white list: frequency > AttackHigh Evaluations Overall eleven tests: Ten real attack captures 5 captures of peacetime traffic 5 synthetic peacetime captures One Synthetic attack in real peace time traffic Compare to human expert 31 Sample Signatures Could not be identified manually Extra newline between header fields Use of upper-case characters, where usually lower Use of a rarely used HTTP field Use of rare user agent. 32 Results – Accuracy of Double Heavy Hitters estimation Algorithm (DHH) Actual Count (frequency) 100 90 Percent 80 70 60 50 40 30 20 10 0 1 9 13 17 21 Signatures 25 29 33 37 Graph of frequency of signatures 33 5 RED – Actual count (frequency) in attack traffic BLUE – Algorithm (DHH) estimation of frequency of signatures Results - Attack Rate Estimation Tests with real peace time traffic Tests with synthetic peace time traffic 100 90 Attack rate 80 70 60 50 40 30 20 10 0 1 2 3 4 5 6 Test Number 34 7 8 9 10 11 Human Expert Our Algorithm Results – Recall and Precision Estimation Precision: relevant packets from all identified Tests with real peace time traffic Tests with synthetic peace time traffic 100 Recall: identified packets from all relevant 90 80 Average: 99.96 Worst case: 99.8 Percent 70 60 50 40 30 20 10 0 1 2 3 4 5 6 Test Number 35 7 8 9 10 11 Peacetime based Attack based Future Work Identify signatures always found in same packets Good synthetic peace-time traffic, global white-list Support regular expression signatures 36 37