Real-time, High Throughput Traffic Analytics with Elasticsearch

Chris Bradley
christopher.bradley@verizondigitalmedia.com
February 04, 2016

Confidential and proprietary materials for authorized Verizon personnel and outside agencies only. Use, disclosure or distribution of this material is not permitted to any unauthorized persons or third parties except by written agreement.

Disclaimers

Portions of the data we use to make analytical decisions have been removed. The removal of this data does not impact the functionality of the system.
• This redaction is done as a security measure, as our network is under constant threat.
• The overall system functionality is presented here intact; only a small set of data properties used to make analytical decisions has been removed.
• File paths, network ports, data sampling rates, and specific packet properties are types of data that have been generalized or redacted.

Software Versions

We are running recent versions of software, but not bleeding edge. The software discussed herein refers to the following versions.
• Elasticsearch – 1.3.1
• Logstash – 1.4.6
• Rsyslog – 8.12.0
• Ubuntu – 14.04 LTS
• Python – 2.7.5

Disclaimers

One last thing…
We are still using facets instead of aggregations in Elasticsearch. This is a result of the system being built before aggregations existed. We will be switching, but facets still function today (although they are deprecated). Facets and aggregations serve very similar functions; aggregations are just more efficient.

Terminology

Terminology used throughout this presentation
• Anycast – Global load balancing based on shared IP address assignment.
• CDN – Content delivery network.
• GSLB – Global server load balancing.
• OSI Layer 3 – The network layer of the OSI model. In this case IPv4 or IPv6.
• OSI Layer 4 – The transport layer of the OSI model. In this case TCP, UDP, or ICMP.
• Signature – A tuple of packet properties that are common during an attack.
• Spoof – The action of masking or faking IP addresses as part of an attack.
• VDMS – Verizon Digital Media Services.

Overview

The VDMS network is under constant attack. Over the past years we have developed an application (Stonefish) to identify threats in real time and mitigate them quickly.
• Time to identification and time to resolution are independent metrics that are critical to the performance of our CDN.
The purpose of the Stonefish application is to identify threats at OSI layers 3 and 4 and take action to filter this malicious traffic from our network.

CDNs and the DDOS problem

For CDNs, performance and uptime are the most critical metrics.
The VDMS mission is to be the fastest, most performant CDN on the planet.
• We are incredibly sensitive to anything that impacts performance.
• DDOS attacks are a serious threat to customer retention.
Predicting and preventing every single attack is impossible.
• The original design goal of the system was to obtain the largest network coverage and protection that was realistically achievable.
• The nature of DDOS attacks makes them relatively easy to initiate but difficult to track to their origination point(s).

Stonefish Origin Story

Stonefish originated in 2011. At that time EdgeCast was suffering from large-scale, multi-day DDOS attacks.
• Tracking attack signatures was a manual process.
• Rules to filter signatures were written by hand.
• These rules were then distributed globally to servers and routers.
This was not an effective means of dealing with attacks.
• Time to identification and mitigation was not acceptable.
• The level of effort and engineering resources required were significant.
• CDN performance was impacted for unacceptable time periods while this manual process occurred.

Stonefish Design Goals

The concept was a system that automatically identifies and stops attacks.
• Identify an attack on our network anywhere in the world in 60 seconds.
• Mitigate 99% of attacks on our network within 60 seconds of identification.
• Monitor connection state information for millions of connections in near real time.
• Block the most common attacks without human intervention.

VDMS Simplified POP View

• Routing – Internet peering and uplinks, ACLs only
• Firewall – Stateful inspection
• Load Balance – Intelligent control plane
• Edge – App layer

Stonefish Ecosystem

• Sample Data (IPTables)
• Transmit Data (Rsyslog)
• Ingest Data (Logstash)
• Store Data (Elasticsearch)
• Process Data (Stonefish)
• Push Rules (Stonefish)

Cluster Hardware

The ‘heart’ of Stonefish is an Elasticsearch cluster.
This is a fully collapsible cluster that utilizes the VDMS Anycast architecture to provide 99.999% uptime.
Cluster Composition
• 240 Intel Xeon cores
• 640GB RAM
• 120TB SSD storage
• Redundant 10Gb Ethernet links to each server in the cluster

Collection and Storage

Data collection occurs at the routing and load balancing layers
• The load balancing layer runs Linux
  – The kernel logs sampled packet data via IPTables.
• Rsyslog sends JSON-formatted packet data to an Anycast address.
• Logstash receives the JSON packet data and ingests it into Elasticsearch.
All cluster servers are horizontally scalable
• All servers in the cluster run an identical stack.
• One Anycast address is used to access all servers in the cluster.
• Any server in the cluster can ingest packet data.

Data flow from edge to cluster

• Inbound traffic (SYN, DNS, ICMP, UDP)
• Load balancers
• Kernel packet logging
• Rsyslog JSON formatting
• Rsyslog TCP TLS transmission
• Stonefish Cluster – Logstash ingestion
• Stonefish Cluster – Elasticsearch storage
• Stonefish Threat Detection
• Stonefish Dashboard

Linux Kernel Packet Logging

The kernel provides packet logging via the IPTables user space tools.

Example Log Rules

    iptables -t filter -A log_chain -p tcp -m tcp --tcp-flags SYN SYN -m statistic --mode random --probability 0.0001 -j LOG --log-prefix "ipt-type1: " --log-level 7
    iptables -t filter -A log_chain -p icmp -m statistic --mode random --probability 0.0001 -j LOG --log-prefix "ipt-type2: " --log-level 7
    iptables -t filter -A log_chain -p udp --dport 53 -m statistic --mode random --probability 0.0001 -j LOG --log-prefix "ipt-type3: " --log-level 7
    iptables -t filter -A log_chain -p udp -m statistic --mode random --probability 0.0001 -j LOG --log-prefix "ipt-type4: " --log-level 7

Output

    Chain log_chain (1 references)
     pkts bytes target prot opt in out source     destination
        0     0 LOG    tcp  --  *  *   0.0.0.0/0  0.0.0.0/0    tcp flags:0x02/0x02 statistic mode random probability 0.0001 LOG flags 0 level 7 prefix "ipt-type1: "
        0     0 LOG    icmp --  *  *   0.0.0.0/0  0.0.0.0/0    statistic mode random probability 0.0001 LOG flags 0 level 7 prefix "ipt-type2: "
        0     0 LOG    udp  --  *  *   0.0.0.0/0  0.0.0.0/0    udp dpt:53 statistic mode random probability 0.0001 LOG flags 0 level 7 prefix "ipt-type3: "
        0     0 LOG    udp  --  *  *   0.0.0.0/0  0.0.0.0/0    statistic mode random probability 0.0001 LOG flags 0 level 7 prefix "ipt-type4: "

Result

    [11507705.594434] ipt-type2: IN=bond0 OUT= MAC=ec:f4:bb:e7:13:a8:78:19:f7:38:0f:f0:08:00 SRC=10.0.0.1 DST=10.1.0.1 LEN=68 TOS=0x00 PREC=0x00 TTL=61 ID=32341 PROTO=ICMP TYPE=3 CODE=1 [SRC=93.184.221.200 DST=87.209.64.100 LEN=40 TOS=0x00 PREC=0x00 TTL=58 ID=5286 DF PROTO=TCP SPT=443 DPT=56484 WINDOW=290 RES=0x00 ACK FIN URGP=0 ] MARK=0x8
    [11507804.647848] ipt-type1: IN=bond0 OUT= MAC=ec:f4:bb:e7:13:a8:78:19:f7:38:0f:f0:08:00 SRC=10.0.0.2 DST=10.1.0.1 LEN=60 TOS=0x00 PREC=0x00 TTL=58 ID=28666 DF PROTO=TCP SPT=59999 DPT=80 WINDOW=29200 RES=0x00 SYN URGP=0 MARK=0x7
    [11507804.942317] ipt-type1: IN=bond0 OUT= MAC=ec:f4:bb:e7:13:a8:78:19:f7:38:0f:f0:08:00 SRC=10.0.0.3 DST=10.1.0.1 LEN=60 TOS=0x00 PREC=0x00 TTL=56 ID=53421 DF PROTO=TCP SPT=59761 DPT=80 WINDOW=29200 RES=0x00 SYN URGP=0 MARK=0x7
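The kernel log lines above are flat KEY=VALUE text with a few bare flag tokens (DF, SYN, and so on). As a minimal sketch of how such a line breaks down into fields, the following Python is offered for illustration only; it is not part of Stonefish, and the function name parse_packet_log and the layout of the returned dictionary are assumptions. It drops the bracketed inner header that ICMP error messages embed, so only the outer packet's properties are returned.

```python
import re

# "[uptime] prefix: rest of line", e.g. "[11507804.647848] ipt-type1: IN=bond0 ..."
_HEADER = re.compile(r"^\[\s*(?P<uptime>[0-9.]+)\]\s+(?P<prefix>[\w-]+):\s+(?P<body>.*)$")
# KEY=VALUE tokens such as SRC=10.0.0.2 (the value may be empty, as in "OUT=")
_KEY_VALUE = re.compile(r"^([A-Z]+)=(.*)$")
# Inner packet header quoted inside ICMP error messages: "[SRC=... ]"
_EMBEDDED = re.compile(r"\[[^\]]*\]")


def parse_packet_log(line):
    """Split one prefixed kernel packet log line into fields and flags."""
    match = _HEADER.match(line.strip())
    if match is None:
        raise ValueError("not a prefixed kernel packet log line")

    # Remove the embedded header so its SRC/DST do not overwrite the outer ones.
    body = _EMBEDDED.sub("", match.group("body"))

    fields, flags = {}, []
    for token in body.split():
        key_value = _KEY_VALUE.match(token)
        if key_value:
            fields[key_value.group(1)] = key_value.group(2)
        else:
            flags.append(token)  # bare tokens such as DF or SYN

    return {
        "prefix": match.group("prefix"),
        "uptime": float(match.group("uptime")),
        "fields": fields,
        "flags": flags,
    }
```

Called on the second sample line above, this returns the prefix "ipt-type1", a fields dictionary containing entries such as SRC, DST, DPT and MARK, and the flags DF and SYN.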
Rsyslog data transmission

Rsyslog templating is not user friendly, but it is powerful and fast.

Example /etc/rsyslog.d/10-ipt-log-transmit.conf

    $template msg_parse, "{\"@timestamp\": \"%TIMESTAMP:::date-rfc3339%\", \"@hostname\": \"%hostname%\", \"@fields\": {\"field1\": \"%msg:R,ERE,1,BLANK,0:FLD1=([0-9?.]+)--end%\", \"field2\": \"%msg:R,ERE,1,BLANK,0:FLD2=([0-9?.]+)--end%\", \"field3\": \"%msg:R,ERE,1,BLANK,0:FLD3=([0-9]+)--end%\", \"field4\": \"%msg:R,ERE,1,BLANK,0:FLD4=([0-9]+)--end%\", \"field5\": \"%msg:R,ERE,1,BLANK,0:FLD5=([0-9]+)--end%\", \"field6\": \"%msg:R,ERE,1,BLANK,0:FLD6=([0-9]+)--end%\", \"field7\": \"%msg:R,ERE,1,BLANK,0:FLD8=(0x[0-9a-f]+)--end%\"}, \"@tags\": [\"ipt-type1\"]}\n"
    $MainMsgQueueSize 1000
    $MainMsgQueueDiscardMark 800
    $MaxMessageSize 1000k
    $WorkDirectory /path/to/disk_assisted_queue_dir/daq
    $ActionQueueFileName packetlog
    $ActionQueueMaxDiskSpace 10g
    $ActionQueueSaveOnShutdown on
    $ActionQueueType LinkedList
    $ActionResumeRetryCount -1
    $SystemLogRateLimitInterval 0
    $DefaultNetstreamDriverCAFile /path/to/tls/cert_file/cert.crt
    $ActionSendStreamDriver gtls
    $ActionSendStreamDriverMode 1
    $ActionSendStreamDriverAuthMode x509/name
    $ActionSendStreamDriverPermittedPeer *.domainname.tld
    :msg, contains, "ipt-type1" @@anycast.ip.address:9999;msg_parse
    & stop

Rsyslog result

Rsyslog converts each kernel log line to JSON and puts it on the wire.
Kernel packet log

    [11507804.647848] ipt-type1: IN=bond0 OUT= MAC=ec:f4:bb:e7:13:a8:78:19:f7:38:0f:f0:08:00 SRC=10.0.0.1 DST=10.1.0.1 LEN=60 TOS=0x00 PREC=0x00 TTL=58 ID=28666 DF PROTO=TCP SPT=59999 DPT=80 WINDOW=29200 RES=0x00 SYN URGP=0 MARK=0x7

Rsyslog JSON output

    {
      "_source": {
        "@timestamp": "2015-09-01T01:54:51.813+00:00",
        "@srvtype": "server_type",
        "@source_host": "hostname",
        "@fields": {
          "src": "10.0.0.1",
          "dst": "10.1.0.1",
          "field1": "99",
          "field2": "99",
          "field3": "99",
          "field4": "99",
          "mark": "0x7"
        },
        "@tags": [ "ipt_type1" ],
        "@version": "1",
        "type": "ipt-type1",
        "host": "127.0.0.1"
      }
    }

Rsyslog data transmission alternatives

There are many alternatives in production use today that are more user friendly.
• Logstash has egress functionality.
• Lumberjack is a lightweight Logstash.
• Custom code could be written to transport messages.
• Publish-subscribe messaging systems could be used (NSQ, Kafka, RabbitMQ, etc.).
• Many GitHub projects exist, written in Python, Go, or your language of choice.
If we were building from the ground up today, Rsyslog might not have been selected.

Logstash Ingestion

Configuration is broken into two parts: input and output.
Logstash is highly configurable and powerful; we use it in its most basic form.
The configuration file uses Logstash's own configuration syntax rather than JSON or YAML.
    input {
      tcp {
        port => 9999
        type => 'custom_type_here'
        ssl_enable => true
        ssl_cacert => '/path/to/certificate_authority/ca.crt'
        ssl_cert => '/path/to/server/cert/server.crt'
        ssl_key => '/path/to/server/key/server.key'
        codec => json { charset => "UTF-8" }
      }
    }

    output {
      if "ipt-type1" in [@tags] {
        elasticsearch_http {
          index => 'index-type1-%{+YYYY.MM.dd}'
          host => 'your.es.host.ip.or.localhost'
          template => '/path/to/your/elasticsearch/index/template/file.tmpl'
          template_name => 'custom_template_name'
          template_overwrite => true
        }
      }
    }

Elasticsearch and Data Storage

Stonefish randomly samples n% of all inbound traffic
• Each protocol type has its data stored in its own Elasticsearch index, with a new index created each day.
• 100s of millions of records per protocol type per day are archived.
• Data set granularity ranges from seconds to months.
Packet records in Elasticsearch are treated as immutable.
• Packet records are never modified.
• Some updates and upserts are done to threat data for detected attacks.

Elasticsearch configuration and performance

One of the most important pieces of performance tuning.
Elasticsearch is a Java application, so configuring the Java environment has an enormous impact on performance.
The original Oracle JVM outperformed distribution-specific bundles in all of our testing.
• A few key environment variables
  – JAVA_OPTS="-Xms30g -Xmx30g"
  – ES_HEAP_SIZE=30g
• Async replication allows for faster ingestion.
• Bulk ingestion can increase storage performance when dealing with large volumes of data.

Elasticsearch and security

Later versions of Elasticsearch support authentication and encryption natively.
An Nginx SSL reverse proxy allows for many authentication schemes and encrypted communications.
• Nginx can provide discrete URL-based authentication schemes
  – Auth tokens for writing data from servers
  – LDAP/AD/Kerberos for secure user access

Stonefish brings it all together

Visualizing the data

Custom dashboard based on Flask, Angular, and D3

The result – Seeing attacks in real-time
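The bulk ingestion point from the Elasticsearch configuration notes can be made concrete with a small sketch. The Elasticsearch bulk API accepts a newline-delimited body in which each document is preceded by an action line. The Python below only builds such a body; it does not send it. It is an illustration under stated assumptions rather than Stonefish code: the function names, the default index prefix and document type (borrowed from the Logstash example), and the choice to take the index date straight from a UTC @timestamp string are all hypothetical.

```python
import json
from datetime import datetime


def daily_index(prefix, ts):
    """Name of the per-day index, mirroring the index-type1-%{+YYYY.MM.dd} pattern."""
    return "{}-{:%Y.%m.%d}".format(prefix, ts)


def build_bulk_body(records, prefix="index-type1", doc_type="ipt-type1"):
    """Build a newline-delimited bulk request body from packet records.

    Each record must carry an "@timestamp" string that starts with
    YYYY-MM-DDTHH:MM:SS (assumed here to already be in UTC).
    """
    lines = []
    for record in records:
        ts = datetime.strptime(record["@timestamp"][:19], "%Y-%m-%dT%H:%M:%S")
        action = {"index": {"_index": daily_index(prefix, ts), "_type": doc_type}}
        lines.append(json.dumps(action, sort_keys=True))  # action line
        lines.append(json.dumps(record, sort_keys=True))  # document source line
    # Every line, including the last one, must end with a newline.
    return "".join(line + "\n" for line in lines)
```

The returned string would be sent as the body of a POST to the cluster's _bulk endpoint. Batching many sampled packets into one request avoids a round trip per record, which is where the ingestion gain comes from.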
Normal traffic is consistent
Histograms show anomalies clearly
Threats are abrupt changes to traffic pattern

Elasticsearch does the heavy lifting

Elasticsearch facets allow us to query packet data for anomalies.
• This allows us to write query logic based on the type of data we want to see. We do not have to write ‘fuzzy’ logic or custom algorithms to determine where anomalies lie.
• Elasticsearch finds common packet properties based on threshold information supplied in the queries.
• Elasticsearch’s built-in trend data is also used to determine the pattern a particular signature is following (trending up, trending down, etc.).
Building the above functionality in house would have greatly complicated our task and extended our time to deployment.

Data Analysis

Elasticsearch data is continually analyzed for changes to packet metrics.
• This is done via custom Python code.
• The VDMS software retrieves scores for time intervals and compares them to previous intervals for anomalous or out-of-bounds changes.
• Each application type has custom query and detection logic for the most accurate identification possible.
Many factors determine whether traffic is a threat.
• Does this traffic exceed a minimum threshold for this POP?
• Are there signatures that exceed a minimum threshold for this POP?
• If the answer to either of these questions is yes, generate signatures and score them.

Drilling into the data

Elasticsearch facets are used as part of query and filter logic.
• Aggregations build scores on combinations of packet header properties.
• Signatures are weighted against one another to determine the percentage of new connections globally and local to a POP.
• Comparisons to previous data are made to confirm normal or abnormal behavior.

Acting on the data

Anomalous traffic patterns detected
• What is the percentage of total POP and global traffic?
• Does this traffic exceed a minimum threshold for us to take action?
• Is this anomalous traffic valid traffic from a customer event in progress?
  – This decision is key, as large customers can spike heavily.
• Is the signature specific enough to filter only attack traffic?
  – Custom signature scoring algorithm.
• Mitigation rules are globally distributed within 60 seconds when thresholds are exceeded.
• The system automatically removes mitigation rules a predetermined amount of time after an attack is no longer observed.

Signature scoring

A custom algorithm generates a score for each signature.
• The score gives us an indication of how specific or generic a signature is.
• Generic signatures can filter valid traffic; the more specific the signature, the better and the higher the score.
• Scores range from 1 to 10.
• Scores greater than 5 are very specific and safe to deploy.
• Lower-scored signatures require engineering review to determine whether they should be deployed.

Threat storage

A separate Elasticsearch index is used to store threats and their signatures.
• Attack start and stop times
• The Anycast VIP being attacked
• The protocol(s) used in the attack
• The signatures associated with the attack
• Rate of attack traffic
This provides us with a historical view of all attacks targeted at our network.

Threat Visibility

Elasticsearch allows for historical reporting

Storing threats separately allows for fast and easy reporting.

A single month view in 2015

    Total attacks             107
    Average Per Day           3
    Average Attack Size       336,212/sec
    Median Attack Size        211,700/sec
    Max Attack Size           3,752,166/sec
    Average POP Traffic %     22%
    Median POP Traffic %      17%
    Max POP Traffic %         81%
    Average Attack Duration   0:21:08
    Median Attack Duration    0:23:00
    Max Attack Duration       13:45:00
    Total Time Under Attack   52:42:00
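A report like the monthly view above is a straightforward reduction over stored threat records. The sketch below shows one way such figures could be derived in Python once the records have been fetched. It is illustrative only: the record keys start, stop (epoch seconds) and peak_rate (packets per second) are hypothetical names, not the actual Stonefish schema, and it relies on the statistics module, which requires Python 3.4 or later.

```python
from statistics import mean, median


def fmt_duration(seconds):
    """Render a number of seconds as H:MM:SS."""
    hours, remainder = divmod(int(round(seconds)), 3600)
    minutes, secs = divmod(remainder, 60)
    return "{}:{:02d}:{:02d}".format(hours, minutes, secs)


def summarize(threats):
    """Reduce a non-empty list of threat records to summary figures."""
    if not threats:
        raise ValueError("at least one threat record is required")
    rates = [t["peak_rate"] for t in threats]
    durations = [t["stop"] - t["start"] for t in threats]
    return {
        "total": len(threats),
        "average_size": mean(rates),
        "median_size": median(rates),
        "max_size": max(rates),
        "average_duration": fmt_duration(mean(durations)),
        "median_duration": fmt_duration(median(durations)),
        "max_duration": fmt_duration(max(durations)),
        "total_time": fmt_duration(sum(durations)),
    }
```

Because threats live in their own index, a month of attacks is a small result set, so a client-side reduction like this stays cheap even though the packet indices hold hundreds of millions of records per day.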
Success Stories

The system has been largely successful
• Thousands of attacks have been identified and stopped by Stonefish.
• Minimal performance impact has been experienced as a result of DDOS attacks.
• No news is good news
  – VDMS has not been in the news regarding DDOS or outages.
  – Over the past years many networks have been impacted by sizable sustained attacks.
• Stonefish identified its first attack 30 seconds after going into production.
• The system has had 100% uptime for over 24 months.
• Stonefish is protecting over 13Tbps of network capacity across 100+ POPs.
• 5% of the total Internet is powered by VDMS.

Additional information

Follow up
• Chris Bradley
• christopher.bradley@verizondigitalmedia.com
Is this something you would like to work on?
http://jobs.verizondigitalmedia.com

Thank you.