Learning Communication Rules
Srikanth Kandula, Ranveer Chandra, and Dina Katabi

Network Admins Are Groping in the Dark
Focus on Traffic Volume
• TCP=80%, HTTP=30%
• Adapt report categories (e.g., AutoFocus)
  – Much traffic from ports 500-600
But, What's Going On?
• Traffic follows plan?
• Misconfigurations
• Suspicious Traffic

Besides focusing on volume, learn the rules underlying the traffic
• (Active) user browsing the web, reading/sending mail
• (Automatic) SMS scan on a network, Outlook refresh

Rule X ⇒ Y
[Figure: flowX and flowY (http, DNS) on a timeline t]
Whenever flowY happens, flowX is likely to occur

If you could learn such rules directly from a trace,
• Infer the actual behavior of applications
  – AFS root servers direct traffic to volume servers evenly
  – Mail to the incoming MX is forwarded on to group MXes
• Notice misconfigurations and badness
  – These clients should not be talking on known command-control ports
  – This server should not be responding to DHCP requests
  – This mail server should not attempt connections to non-existent MXes
Report all significant rules with no specific knowledge about a trace

Mining for Rules is Hard
• How to define significance?
  – When is a group of flows interesting enough to report?
• Avoid observer bias, but cannot evaluate everything
  – Focus on one server, and you miss what you are not looking for
• Be practical: deal with noise, search quickly

eXpose
1. A scoring function for significance
2. Heuristics that bias search toward high hit-rate
3. Empirical validation on enterprise traces

Overview
[Figure: Packet Trace → Activity Matrix (rows time1 … timeR, columns flow1 … flowK) → Rules]
• Packet trace to Activity Matrix
  o Rows are 1s windows; columns are flows
  o Is the flow active in [time_{i-1}, time_i)? (at least one packet)
• Association rule mining (X, Y are r.v. for columns)
• Need not worry about interleaving
• Dependencies are at these time-scales (an RTT, a server response)
All windows in the [.25s, 2s] range yield similar rules

Which Rules are Significant? (X ⇒ Y)
• High Joint Probability?
  o X, Y may occur very often individually (e.g., breeze, sun shining)
• High Conditional Probability?
  o Say Y occurs only when X does, but both are rare (lottery, buy a jet)
• We use mutual information (combines the two)

$$\mathrm{Score}(X \Rightarrow Y) = P(X \wedge Y)\,\log\frac{P(Y \mid X)}{P(Y)} + P(X \wedge \neg Y)\,\log\frac{P(\neg Y \mid X)}{P(\neg Y)}$$

* Measures fraction of change in Y due to X
  Score=0, if Y is independent of X
  Score=Max, if Y is fully dependent on X
* Trades off dependency & frequency
* Encodes directionality
[Figure: Kerberos and Reservation flows]

Modifying Scores for Networking
• Negative Correlation
  – Flows with little overlap
  – P(Y|X) ≪ 1 leads to high score
• Long Running Flows
  – Large downloads, ssh/remote desktop
  – Trivial overlaps with long flow, P(Y|X) ≈ 1
  – Distinguish new vs. present
  – Present rules reported only if small mismatch in freq.
• Too Many Possibilities
  – Bias, focus on pairs with at least one common IP
  – Miss rules, but hit-rate up 1000x and costs down 10x

Generics
- Miss, if no client accesses server often
+ Rules that abstract away parts of a flow
[Figure: example generics (Database, Kerberos, Reservation)]
  Client : Server, Server : Database becomes * : Server, Server : Database
    * (any client)
  Client : Rsrv., Client : Kerberos becomes * : Rsrv., * : Kerberos
    * (any client, but same on both sides)
To do this automatically,
• what to abstract? (IP addresses at non-server port)
• which pairs to consider for rule?
  – flows match IP, generics match abstracted IP

Mining for Rules
Techniques extend to arbitrary sized rules
• X ⇒ Y: O(f^2)
• X_1, X_2, …, X_n ⇒ Y: O(f^{n+1})
Instead,
1. Focus on pair-wise rules (simpler is likelier)
2.
Group similar rules
   – Eliminate weak rules between strongly connected groups
   – Transitive closure to read off clusters
[Figure: rule graph with edges weighted by rule score; Recursive Spectral Partitioning (VKV'00)]
Rule mining digests 10^5 to 10^6 flows into 10^2 to 10^3 rule clusters

Recap: eXpose Mines for Rules
[Figure: Packet Trace → Activity Matrix (rows time1 … timeR, columns flow1 … flowK, entries present | new) → Rules (over flowi.new, flowj.present, ...) → Rule Clusters]

Contributions
Learn all significant rules without prior knowledge
o Scoring function for rule significance
o Avoids observer bias, yet stays feasible by focusing on high hit-rate
o Algorithms to mine and prune

Related Work
• Semi-Automated Discovery of App. Session Structure (KJPK'06)
• Sherlock (Diagnosing Performance Problems, BCGKMZ'07)
• AutoFocus (ESV'03)
• BLINC (KPF'05)
• Stepping Stones (ZP'00)

Results

Evaluation Setup
[Figure: trace vantage points: CSAIL's access link, access link of conf. LANs, before CSAIL's servers, inside Microsoft]
• Traces at access and internal server-facing links
  – Packet headers, connection records (Bro), some anonymized
• Operational network with 10^3 clients, diverse traffic mix
• Corroborated on test-bed traffic & vetted by admins
• Ran eXpose on a 2.4GHz x86 with 8GB RAM

Rules Discovered by eXpose
• Dependencies for Major Applications
  – web, e-mail, file-servers, IM, print, video broadcast
• Configuration Errors & Other Badness
  – IDENT, legacy emails, ssh scans, wingate
• Rules for stuff we didn't know before
  – Nagios, LLMNR, iTunes
Black box: little prior knowledge about servers, applications, or users. Can evolve.

Example rule clusters:

email @ microsoft
  Client.* – PFS1.X
  Client.* – DC.88
  Client.* – PFS2.X
  Client.* – Mail.X
  Client.* – Mail.135
  Client.* – Proxy.80

afs @ csail
  C.7001 – *.*
  C.7001 – AFS2.7000
  C.7001 – Root.7003
  C.7001 – AFS1.7000
  AFS1.7000 – Root.7002

web @ microsoft
  Proxy2.80 – *.*
  Proxy3.80 – *.*
  Proxy1.80 – *.*
  Proxy4.80 – *.*

smtp + IDENT @ csail
  Client.113 – MailServer.*
  Client.* – MailServer.25

Legacy email ids @ csail
  UnivMail.* – Old1.25
  UnivMail.* – Old3.25
  UnivMail.* – Old2.25

Nagios monitors @ csail
  Nagios.7001 – AFS2.7000
  Nagios.* – Mail1.25
  Nagios.7001 – AFS1.7000
  Nagios.* – Mail2.25

Link-local multicast name resolution (LLMNR) @ hotspots
  H.137 – Wins.137
  H.* – Multicast.5355
  H.* – DNS.53

Correctness & Completeness
• False Positives
  – 13% of
rule-clusters in CSAIL trace, we couldn't explain
• False Negatives
  – Main CSAIL Web Server (too many different activities)
  – Dependencies on Personal Web Pages (too little traffic)
  – PlanetLab Traffic (punted)
• Other Limitations
  – IPSec, Anonymized, Cover Traffic
• Extensions
  – Rules repeat over time, and across traces
  – Application whitelisting, Customize Generics

Time to Mine for Rules
[Figure: time to mine for rules per trace; # flows (x 10^6): .6, .2, .6, .9, 2.8]
At CSAIL's access link, high fan-out with many distinct flows

Stream Mining Appears Feasible!
[Figure: Packet Trace → eXpose → Rules for frequently reoccurring flow sets]

Learn all significant rules with no specific knowledge
o Avoids observer bias, but feasible by focusing on high hit-rate
o Scoring function for rule significance
o Algorithms to mine and prune
Empirical validation on enterprise traces
• found configurations & protocols that we didn't know existed
• learnt rules for actual behavior of applications
• found config. errors, bot scans, infected machines
http://research.microsoft.com/~srikanth

Backup

Expanding Search Space (# of flows)…
[Figure: # of discovered rules vs. rule score (Modified JMeasure), by # top active flows]
… exposes few significant rules!

Expanding Search Space (# of flows)…
[Figure: time to mine rules (s) and memory footprint (million rules) vs. # top active flows]
… exposes few rules & costs a lot in time, memory

Varying Size of Time Windows
[Figure: # of discovered rules vs. rule score (Modified JMeasure)]
All window sizes in [.25s, 2s] produce similar rules!

[Figure: joint probability for all rules X ⇒ Y; axes Prob. (X) and Prob. (Y)]
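
Sketch: activity matrix and pair-wise score

The Overview and "Which Rules are Significant?" slides describe turning a packet trace into a 0/1 activity matrix and scoring pair-wise rules X ⇒ Y. Below is a minimal sketch of that pipeline, not the eXpose implementation. The function names, the `(timestamp, flow_id)` input format, and the base-2 logarithm are assumptions made for illustration. It uses the plain two-term score and none of the networking-specific modifications (new vs. present, common-IP bias, generics).

```python
import math
from collections import defaultdict
from itertools import permutations


def activity_matrix(packets, window=1.0):
    """Map each flow to the set of window indices in which it sent >= 1 packet.

    `packets` is an iterable of (timestamp_seconds, flow_id); this input
    format is an assumption for the sketch.
    """
    active = defaultdict(set)
    for ts, flow in packets:
        active[flow].add(int(ts // window))
    return active


def score(x_windows, y_windows, n_windows):
    """Two-term score for X => Y over n_windows rows (log base 2 assumed)."""
    n_x = len(x_windows)
    if n_x == 0 or n_windows == 0:
        return 0.0
    n_xy = len(x_windows & y_windows)   # windows where X and Y are both active
    n_x_not_y = n_x - n_xy              # windows where X is active and Y is not
    p_x = n_x / n_windows
    p_y = len(y_windows) / n_windows
    total = 0.0
    # Each term is P(X, y) * log2(P(y | X) / P(y)); empty terms contribute 0.
    for count, marginal in ((n_xy, p_y), (n_x_not_y, 1.0 - p_y)):
        if count > 0:
            joint = count / n_windows
            total += joint * math.log2((joint / p_x) / marginal)
    return total


def mine_pairs(packets, window=1.0):
    """Score every ordered flow pair; returns (score, x, y), highest first."""
    active = activity_matrix(packets, window)
    if not active:
        return []
    indices = set().union(*active.values())
    n_windows = max(indices) - min(indices) + 1
    rules = [(score(active[x], active[y], n_windows), x, y)
             for x, y in permutations(active, 2)]
    return sorted(rules, key=lambda r: r[0], reverse=True)
```

With four windows, two flows that are each active in the same two windows score 0.5, and two flows whose activity is statistically independent score 0. Two flows that never overlap also score 0.5 under this unmodified form, which is the negative-correlation case that "Modifying Scores for Networking" sets out to handle.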