Real-time, High Throughput Traffic Analytics with Elasticsearch

Chris Bradley
February 04, 2016
Confidential and proprietary materials for authorized Verizon personnel and outside agencies only. Use, disclosure or
distribution of this material is not permitted to any unauthorized persons or third parties except by written agreement.
Portions of the data we use to make analytical decisions have been removed.
The removal of this data does not impact the functionality of the system.
•  This redaction is done as a security measure as our network is under constant threat.
•  The overall system functionality is present here intact and only a small set of data
properties used to make analytical decisions have been removed.
•  File paths, network ports, data sampling rates, and specific packet properties are
types of data that have been generalized or redacted.
Software Versions
We are running recent versions of software but not bleeding edge.
Software discussed herein references the following versions.
Elasticsearch – 1.3.1
Logstash – 1.4.6
Rsyslog – 8.12.0
Ubuntu – 14.04 LTS
Python – 2.7.5
One last thing…
We are still using facets instead of aggregations in Elasticsearch. This is a
result of the system being built before aggregations existed.
We will be switching, but facets still function today. (They are deprecated.)
Facets and aggregations serve very similar functions; aggregations are just
more efficient.
Terminology used throughout this presentation
•  Anycast – Global load balancing based on shared IP address assignment.
•  CDN – Content delivery network.
•  GSLB – Global server load balancing.
•  OSI Layer 3 – The network layer of the OSI model. In this case IPv4 or IPv6.
•  OSI Layer 4 – The transport layer of the OSI model. In this case TCP, UDP,
or ICMP.
•  Signature – A tuple of packet properties that are common during an attack.
•  Spoof – The action of masking or faking IP addresses as part of an attack.
•  VDMS – Verizon Digital Media Services.
The VDMS network is under constant attack.
Over the past several years we have developed an application (Stonefish) to identify
threats in real time and mitigate them quickly.
•  Time to identification and time to resolution are independent metrics that are
critical to the performance of our CDN.
The purpose of the Stonefish application is to identify threats at OSI layer 3
and 4 and take action to filter this malicious traffic from our network.
CDNs and the DDOS problem
For CDNs, performance and uptime are the most critical metrics.
The VDMS mission is to be the fastest, most performant CDN on the planet.
•  We are incredibly sensitive to anything that impacts performance.
•  DDOS attacks are a serious threat to customer retention.
Predicting and preventing every single attack is impossible.
•  The design goal of the system was originally to obtain the largest network
coverage and protection in a realistic manner.
•  The nature of DDOS attacks makes them relatively easy to initiate but
difficult to track to origination point(s).
Stonefish Origin Story
Stonefish originated in 2011.
At that time EdgeCast was suffering from large-scale, multi-day DDOS attacks.
•  Tracking attack signatures was a manual process.
•  Rules to filter those signatures were written by hand.
•  The rules were then distributed globally to servers and routers.
This was not an effective means for dealing with attacks.
•  Time to identification and mitigation was not acceptable.
•  The level of effort and engineering resources required to do this were
•  CDN performance was impacted for unacceptable time periods while this
manual process occurred.
Stonefish Design Goals
The concept was a system that automatically identifies and stops attacks.
•  Identify an attack on our network anywhere in the world in 60 seconds.
•  Mitigate 99% of attacks on our network within 60 seconds of identification.
•  Monitor connection state information for millions of connections in near
real time.
•  Block the most common attacks without human intervention.
VDMS Simplified POP View
[Figure: simplified POP diagram; surviving labels: "ACLs only", "App layer"]
Stonefish Ecosystem
[Figure: Stonefish ecosystem diagram; surviving stage labels: Sample Data, Transmit Data, Ingest Data, Store Data, Process Data, Push Rules]
Cluster Hardware
The ‘heart’ of Stonefish is an Elasticsearch cluster
This is a fully collapsible cluster that utilizes the VDMS Anycast architecture to
provide 99.999% uptime.
Cluster Composition
•  240 Intel Xeon cores
•  640GB RAM
•  120TB SSD storage
•  Redundant 10Gb Ethernet links to each server in the cluster
Collection and Storage
Data collection occurs at routing and load balancing layers
•  Load balancing layer runs Linux
–  The kernel logs sampled packet data via iptables.
•  Rsyslog sends JSON formatted packet data to Anycast address.
•  Logstash receives JSON packet data and ingests to Elasticsearch.
All cluster servers are horizontally scalable
•  All servers in the cluster run an identical stack.
•  One Anycast address is used to access all servers in the cluster.
•  Any server in the cluster can ingest packet data.
Data flow from the edge, in order
•  Inbound traffic
•  Load balancers
•  Kernel packet logging
•  Rsyslog JSON formatting
•  Rsyslog TCP TLS transmission
•  Stonefish Cluster – Logstash ingestion
•  Stonefish Cluster – Elasticsearch storage
•  Stonefish threat detection
•  Stonefish dashboard
Linux Kernel Packet Logging
The kernel provides packet logging via the iptables user-space tools.
Example Log Rules
iptables -t filter -A log_chain -p tcp -m tcp --tcp-flags SYN SYN -m statistic --mode random --probability 0.0001 -j LOG --log-prefix "ipt-type1: " --log-level 7
iptables -t filter -A log_chain -p icmp -m statistic --mode random --probability 0.0001 -j LOG --log-prefix "ipt-type2: " --log-level 7
iptables -t filter -A log_chain -p udp --dport 53 -m statistic --mode random --probability 0.0001 -j LOG --log-prefix "ipt-type3: " --log-level 7
iptables -t filter -A log_chain -p udp -m statistic --mode random --probability 0.0001 -j LOG --log-prefix "ipt-type4: " --log-level 7
Chain log_chain (1 references)
pkts bytes target prot opt in
tcp -- * tcp flags:0x02/0x02 statistic mode random probability 0.0001 LOG flags 0 level 7 prefix "ipt-type1: "
icmp -- * statistic mode random probability 0.0001 LOG flags 0 level 7 prefix "ipt-type2: "
udp -- * udp dpt:53 statistic mode random probability 0.0001 LOG flags 0 level 7 prefix "ipt-type3: "
udp -- * statistic mode random probability 0.0001 LOG flags 0 level 7 prefix "ipt-type4: "
[11507705.594434] ipt-type2: IN=bond0 OUT= MAC=ec:f4:bb:e7:13:a8:78:19:f7:38:0f:f0:08:00 SRC= DST= LEN=68 TOS=0x00 PREC=0x00 TTL=61 ID=32341
[11507804.647848] ipt-type1: IN=bond0 OUT= MAC=ec:f4:bb:e7:13:a8:78:19:f7:38:0f:f0:08:00 SRC= DST= LEN=60 TOS=0x00 PREC=0x00 TTL=58 ID=28666 DF PROTO=TCP SPT=59999 DPT=80 WINDOW=29200 RES=0x00 SYN URGP=0 MARK=0x7
[11507804.942317] ipt-type1: IN=bond0 OUT= MAC=ec:f4:bb:e7:13:a8:78:19:f7:38:0f:f0:08:00 SRC= DST= LEN=60 TOS=0x00 PREC=0x00 TTL=56 ID=53421 DF PROTO=TCP SPT=59761 DPT=80 WINDOW=29200 RES=0x00 SYN URGP=0 MARK=0x7
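Because each rule logs a packet with a fixed probability, a sampled count can be scaled back to an estimate of the real packet volume. A minimal Python sketch of that arithmetic follows. It assumes the 0.0001 probability from the example rules (production sampling rates are redacted), and the helper names `estimate_packets` and `estimate_rate` are hypothetical, not part of Stonefish.

```python
def estimate_packets(sampled_count, probability):
    """Scale a sampled packet count back to an estimated true count.

    With random sampling at probability p, each logged packet stands in
    for roughly 1/p real packets.
    """
    if not 0 < probability <= 1:
        raise ValueError("probability must be in (0, 1]")
    return sampled_count / probability


def estimate_rate(sampled_count, probability, window_seconds):
    """Estimated packets per second over a time window."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    return estimate_packets(sampled_count, probability) / window_seconds


# Example: 50 logged SYN packets in a 10 second window at p = 0.0001
# (illustrative numbers only).
estimated_total = estimate_packets(50, 0.0001)
estimated_pps = estimate_rate(50, 0.0001, 10)
```

The estimate is statistical: the lower the sampled count, the wider the uncertainty around it.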
Rsyslog data transmission
Rsyslog templating is not user friendly, but it is powerful and fast.
Example /etc/rsyslog.d/10-ipt-log-transmit.conf
$template msg_parse, "{\"@timestamp\": \"%TIMESTAMP:::date-rfc3339%\", \"@hostname\": \"%hostname%\", \"@fields\":
{\"field1\": \"%msg:R,ERE,1,BLANK,0:FLD1=([0-9?.]+)--end%\", \"field2\": \"%msg:R,ERE,1,BLANK,0:FLD2=([0-9?.]+)--end%\",
\"field3\": \"%msg:R,ERE,1,BLANK,0:FLD3=([0-9]+)--end%\", \"field4\": \"%msg:R,ERE,1,BLANK,0:FLD4=([0-9]+)--end%\",
\"field5\": \"%msg:R,ERE,1,BLANK,0:FLD5=([0-9]+)--end%\", \"field6\": \"%msg:R,ERE,1,BLANK,0:FLD6=([0-9]+)--end%\",
\"field7\": \"%msg:R,ERE,1,BLANK,0:FLD8=(0x[0-9a-f]+)--end%\"}, \"@tags\": [\"ipt-type1\"]}\n"
$MainMsgQueueSize 1000
$MainMsgQueueDiscardMark 800
$MaxMessageSize 1000k
$WorkDirectory /path/to/disk_assisted_queue_dir/daq
$ActionQueueFileName packetlog
$ActionQueueMaxDiskSpace 10g
$ActionQueueSaveOnShutdown on
$ActionQueueType LinkedList
$ActionResumeRetryCount -1
$SystemLogRateLimitInterval 0
$DefaultNetstreamDriverCAFile /path/to/tls/cert_file/cert.crt
$ActionSendStreamDriver gtls
$ActionSendStreamDriverMode 1
$ActionSendStreamDriverAuthMode x509/name
$ActionSendStreamDriverPermittedPeer *.domainname.tld
:msg, contains, "ipt-type1" @@anycast.ip.address:9999;msg_parse
& stop
Rsyslog result
Rsyslog converts the kernel log line to JSON and puts it on the wire.
Kernel packet log
[11507804.647848] ipt-type1: IN=bond0 OUT= MAC=ec:f4:bb:e7:13:a8:78:19:f7:38:0f:f0:08:00
SRC= DST= LEN=60 TOS=0x00 PREC=0x00 TTL=58 ID=28666 DF
PROTO=TCP SPT=59999 DPT=80 WINDOW=29200 RES=0x00 SYN URGP=0 MARK=0x7
Rsyslog JSON output
"_source": {
  "@timestamp": "2015-09-01T01:54:51.813+00:00",
  "@srvtype": "server_type",
  "@source_host": "hostname",
  "@fields": {
    "src": "",
    "dst": "",
    "field1": "99",
    "field2": "99",
    "field3": "99",
    "field4": "99",
    "mark": "0x7"
  },
  "@tags": [
    "ipt-type1"
  ],
  "@version": "1",
  "type": "ipt-type1",
  "host": ""
}
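The kernel-line-to-JSON conversion that the Rsyslog template performs can be sketched in Python for readers who find the template syntax hard to follow. This is an illustration only, not the production transport. The field names (`len`, `ttl`, `spt`, `dpt`, `mark`) are assumptions taken from the visible kernel log line, since the real template fields are generalized, and `kernel_line_to_json` is a hypothetical helper.

```python
import json
import re

# Illustrative field patterns; the real template's fields are redacted.
FIELD_PATTERNS = {
    "len": re.compile(r"\bLEN=([0-9]+)"),
    "ttl": re.compile(r"\bTTL=([0-9]+)"),
    "spt": re.compile(r"\bSPT=([0-9]+)"),
    "dpt": re.compile(r"\bDPT=([0-9]+)"),
    "mark": re.compile(r"\bMARK=(0x[0-9a-f]+)"),
}


def kernel_line_to_json(line, hostname, timestamp, tag):
    """Extract packet properties from one kernel log line into a JSON string."""
    fields = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(line)
        # Mirror the template's BLANK option: empty string when nothing matches.
        fields[name] = match.group(1) if match else ""
    document = {
        "@timestamp": timestamp,
        "@hostname": hostname,
        "@fields": fields,
        "@tags": [tag],
    }
    return json.dumps(document, sort_keys=True)


sample_line = (
    "[11507804.647848] ipt-type1: IN=bond0 OUT= SRC= DST= LEN=60 TOS=0x00 "
    "PREC=0x00 TTL=58 ID=28666 DF PROTO=TCP SPT=59999 DPT=80 WINDOW=29200 "
    "RES=0x00 SYN URGP=0 MARK=0x7"
)
sample_json = kernel_line_to_json(
    sample_line, "hostname", "2015-09-01T01:54:51.813+00:00", "ipt-type1"
)
```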
Rsyslog data transmission alternatives
There are many alternatives that are more user friendly in production today.
•  Logstash has egress functionality.
•  Lumberjack is a lightweight log shipper for Logstash.
•  Custom code could be written to transport messages.
•  Publish-subscribe messaging systems could be used (NSQ, Kafka, RabbitMQ, etc.).
•  Many GitHub projects exist, written in Python, Go, or your language of choice.
If we were building from the ground up today, Rsyslog might not be selected.
Logstash Ingestion
Configuration is broken into two parts: input and output.
Logstash is highly configurable and powerful, but we use it in its most basic form.
The configuration file is written in Logstash's own configuration syntax.
input {
  tcp {
    port => 9999
    type => 'custom_type_here'
    ssl_enable => 'true'
    ssl_cacert => '/path/to/certificate_authority/ca.crt'
    ssl_cert => '/path/to/server/cert/server.crt'
    ssl_key => '/path/to/server/key/server.key'
    codec => json {
      charset => "UTF-8"
    }
  }
}
output {
  if "ipt-type1" in [@tags] {
    elasticsearch_http {
      index => 'index-type1-%{+YYYY.MM.dd}'
      host => ''
      template => '/path/to/your/elasticsearch/index/template/file.tmpl'
      template_name => 'custom_template_name'
      template_overwrite => true
    }
  }
}
Elasticsearch and Data Storage
Stonefish randomly samples n% of all inbound traffic
•  Each protocol type has data stored in a unique index in Elasticsearch sharded by day.
•  100s of millions of records per protocol type per day are archived.
•  Data set granularity ranges from seconds to months.
Packet records are treated as immutable.
•  Packet records are never modified after ingestion.
•  Some updates and upserts are done to threat data for detected attacks.
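Since each protocol type gets one index per day, any query over a time range has to work out which daily indices to hit. A minimal Python sketch of that naming scheme follows. It assumes the `index-type1-YYYY.MM.dd` pattern from the Logstash output, UTC day boundaries, and timezone-aware datetimes; `daily_index_name` and `indices_for_range` are hypothetical helper names.

```python
from datetime import datetime, timedelta, timezone


def daily_index_name(prefix, when):
    """Return the index name for the UTC day containing `when`.

    `when` must be a timezone-aware datetime.
    """
    day = when.astimezone(timezone.utc).strftime("%Y.%m.%d")
    return "{}-{}".format(prefix, day)


def indices_for_range(prefix, start, end):
    """List every daily index a query from `start` to `end` must cover."""
    day = start.astimezone(timezone.utc).date()
    last = end.astimezone(timezone.utc).date()
    names = []
    while day <= last:
        names.append("{}-{}".format(prefix, day.strftime("%Y.%m.%d")))
        day += timedelta(days=1)
    return names


example_name = daily_index_name(
    "index-type1", datetime(2015, 9, 1, 1, 54, 51, tzinfo=timezone.utc)
)
```

A query that spans midnight therefore touches two indices, and a monthly report touches about thirty.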
Elasticsearch configuration and performance
Java configuration is one of the most important pieces of performance.
Elasticsearch is a Java application, so configuring the Java environment has
an enormous impact on performance.
The original Oracle JVM outperformed distribution-specific bundles in all of our testing.
•  A few key environment variables
–  JAVA_OPTS -Xms30g -Xmx30g
•  Async replication allows for faster ingestion
•  Bulk ingestion can increase storage performance when dealing with large volumes of
data.
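To make the bulk ingestion point concrete, here is a minimal Python sketch that builds a request body in the newline-delimited format the Elasticsearch bulk endpoint expects: one action line followed by one document line per record, with a trailing newline. The index and type names are placeholders, `build_bulk_body` is a hypothetical helper, and sending the body over HTTP is left out.

```python
import json


def build_bulk_body(index, doc_type, documents):
    """Build a newline-delimited bulk body that indexes each document."""
    lines = []
    for document in documents:
        # Action line: tells Elasticsearch where the next document goes.
        lines.append(json.dumps({"index": {"_index": index, "_type": doc_type}}))
        # Source line: the document itself, on a single line.
        lines.append(json.dumps(document))
    # The bulk format requires the body to end with a newline.
    return "\n".join(lines) + "\n"


example_body = build_bulk_body(
    "index-type1-2015.09.01",
    "ipt-type1",
    [{"@fields": {"ttl": "58"}}, {"@fields": {"ttl": "56"}}],
)
```

Batching many records into one request amortizes the per-request overhead, which is where the ingestion gain comes from.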
Elasticsearch and security
Later versions of Elasticsearch support authentication and encryption
Nginx SSL reverse proxy allows for many authentication schemes and
encrypted communications.
•  Nginx can provide discrete URL based authentication schemes
–  Auth tokens for writing data from servers
–  LDAP/AD/Kerberos for secure user access
Custom dashboard based on Flask, Angular, and D3
Normal traffic is consistent
Threats appear as abrupt changes to the traffic pattern
Elasticsearch does the heavy lifting
Elasticsearch facets allow us to query packet data for anomalies.
•  Facets allow us to write query logic based on the type of data we want to see.
We do not have to write ‘fuzzy’ logic or custom code algorithms to
determine where anomalies lie.
•  Elasticsearch finds common packet properties based on threshold
information supplied in queries to it.
•  Elasticsearch's built-in trend data is also used to determine the pattern a
particular signature is taking on (trending up, trending down, etc.).
Building the above functionality in house would have greatly
complicated our task and extended our timeframe to deployment.
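As an illustration of the kind of facet query described above, the Python sketch below builds a terms facet request over a time window and then picks out the values whose share of sampled packets meets a threshold. The field name `@fields.ttl`, the facet name `top_values`, and both helper functions are assumptions made for the example; the actual Stonefish queries and thresholds are not shown in this deck.

```python
def build_terms_facet_query(field, start, end, size=10):
    """Build a search body asking for the most common values of `field`."""
    return {
        "size": 0,  # only the facet counts are needed, not the hits
        "query": {
            "filtered": {
                "filter": {"range": {"@timestamp": {"gte": start, "lt": end}}}
            }
        },
        "facets": {"top_values": {"terms": {"field": field, "size": size}}},
    }


def terms_over_threshold(response, min_share, facet_name="top_values"):
    """Return (term, share) pairs whose share of the total meets `min_share`."""
    facet = response["facets"][facet_name]
    total = facet["total"]
    if total == 0:
        return []
    return [
        (entry["term"], entry["count"] / total)
        for entry in facet["terms"]
        if entry["count"] / total >= min_share
    ]


example_query = build_terms_facet_query(
    "@fields.ttl", "2015-09-01T01:54:00Z", "2015-09-01T01:55:00Z"
)
```

A single value that suddenly accounts for most of the sampled packets is a candidate signature property.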
Data Analysis
Elasticsearch data is continually analyzed for changes to packet metrics.
•  This is done via custom Python code.
•  The VDMS software retrieves scores for time intervals and compares them
to previous intervals for anomalous or out-of-bounds changes.
•  Each application type has custom query and detection logic for the most
accurate identification possible.
Many factors determine if traffic is a threat.
•  Does this traffic exceed a minimum threshold for this POP?
•  Are there signatures that exceed a minimum threshold for this POP?
•  If the answer to either of these questions is yes, generate signatures and
score them.
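The interval comparison can be sketched as follows. This is a simplified stand-in, not the VDMS detection logic: it assumes a per-POP minimum threshold and flags an interval when its count is a configurable multiple of the average of earlier intervals. The function name and both parameters are hypothetical.

```python
def is_anomalous(current, previous, min_threshold, ratio_limit):
    """Decide whether the latest interval looks out of bounds.

    current: packet count for the latest interval
    previous: list of counts for earlier intervals (the baseline)
    min_threshold: counts below this are never flagged
    ratio_limit: how many times the baseline counts as abnormal
    """
    if current < min_threshold:
        return False
    if not previous:
        # No history to compare against, so do not flag.
        return False
    baseline = sum(previous) / len(previous)
    if baseline == 0:
        # Traffic appeared where there was none and it exceeds the minimum.
        return True
    return current / baseline >= ratio_limit


history = [1000, 1100, 900]
flagged = is_anomalous(5000, history, min_threshold=2000, ratio_limit=3.0)
```

The minimum threshold keeps quiet POPs from raising alerts on small absolute changes.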
Drilling into the data
Elasticsearch facets are used as part of query and filter logic.
•  Aggregations build scores on combinations of packet header properties.
•  Signatures are weighted against one another to determine the percentage
of new connections globally and local to a POP.
•  Comparisons to previous data are done to confirm normal/abnormal
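A signature's share of traffic, both within one POP and globally, can be computed from sampled packet records as in the sketch below. It assumes each record is a dictionary carrying a `pop` key plus packet header properties; those key names and the helper functions are hypothetical, and in Stonefish the counting is done inside Elasticsearch rather than in Python.

```python
def signature_of(packet, fields):
    """Build the signature tuple for one packet from the chosen fields."""
    return tuple(packet.get(field) for field in fields)


def share_of(packets, fields, signature):
    """Fraction of `packets` whose signature equals `signature`."""
    if not packets:
        return 0.0
    hits = sum(1 for packet in packets if signature_of(packet, fields) == signature)
    return hits / len(packets)


def pop_and_global_share(packets, fields, signature, pop):
    """Return (share within `pop`, share across all POPs) for a signature."""
    local = [packet for packet in packets if packet.get("pop") == pop]
    return share_of(local, fields, signature), share_of(packets, fields, signature)


sample_packets = [
    {"pop": "pop-a", "ttl": 64, "len": 60},
    {"pop": "pop-a", "ttl": 64, "len": 60},
    {"pop": "pop-b", "ttl": 64, "len": 60},
    {"pop": "pop-b", "ttl": 58, "len": 60},
]
local_share, global_share = pop_and_global_share(
    sample_packets, ("ttl", "len"), (64, 60), "pop-a"
)
```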
Acting on the data
Anomalous traffic patterns detected
•  What is the percentage of total POP and global traffic?
•  Does this traffic exceed a minimum threshold for us to take action?
•  Is this anomalous traffic valid traffic from a customer event in progress?
•  This decision is key as large customers can spike heavily.
•  Is the signature specific enough to only filter attack traffic?
•  Custom signature scoring algorithm.
•  Mitigation rules are globally distributed within 60 seconds when thresholds
are exceeded.
•  The system automatically removes mitigation rules after a predetermined
amount of time once an attack is no longer observed.
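The automatic removal of mitigation rules can be pictured as a table keyed by signature that remembers when attack traffic was last seen. The Python sketch below is an assumption about how such bookkeeping might look, not the production implementation; `MitigationTable`, its methods, and the hold time are hypothetical, and times are plain seconds.

```python
class MitigationTable:
    """Track active mitigation rules and expire them after a quiet period."""

    def __init__(self, hold_seconds):
        self.hold_seconds = hold_seconds
        self._last_seen = {}

    def observe(self, signature, now):
        """Record attack traffic for `signature`; adds the rule if it is new."""
        self._last_seen[signature] = now

    def expire(self, now):
        """Remove and return rules not observed for at least hold_seconds."""
        expired = sorted(
            signature
            for signature, seen in self._last_seen.items()
            if now - seen >= self.hold_seconds
        )
        for signature in expired:
            del self._last_seen[signature]
        return expired

    def active(self):
        """Return the signatures that still have a rule in place."""
        return sorted(self._last_seen)


table = MitigationTable(hold_seconds=300)
table.observe("sig-a", now=0)
table.observe("sig-b", now=200)
removed = table.expire(now=300)
```

Refreshing the timestamp on every observation means a rule only ages out once the attack has actually gone quiet.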
Signature scoring
Custom algorithm generates a score for a signature.
•  The score gives us an indication of how specific or generic a signature is.
•  Generic signatures can filter valid traffic; the more specific the signature,
the better and the higher the score.
•  Scores range from 1 to 10.
•  Scores greater than 5 are very specific and safe to deploy.
•  Lower scored signatures require engineering review to determine if they
should be deployed.
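The actual scoring algorithm is custom and not shown here. Purely to illustrate the idea that more specific signatures score higher, the Python sketch below scores a signature by how many candidate packet properties it pins down, mapped onto the 1 to 10 range, and applies the "greater than 5" deployment rule. The field list, `score_signature`, and `can_auto_deploy` are all hypothetical.

```python
# Hypothetical candidate properties; the real property set is redacted.
CANDIDATE_FIELDS = (
    "src", "dst", "proto", "spt", "dpt", "len", "ttl", "window", "flags",
)


def score_signature(signature):
    """Score a signature from 1 (generic) to 10 (very specific).

    `signature` maps field names to values; a missing field or None
    is treated as a wildcard.
    """
    specified = sum(
        1 for field in CANDIDATE_FIELDS if signature.get(field) is not None
    )
    return 1 + round(9 * specified / len(CANDIDATE_FIELDS))


def can_auto_deploy(score):
    """Scores greater than 5 deploy without engineering review."""
    return score > 5


example_score = score_signature(
    {"proto": "tcp", "dpt": 80, "len": 60, "ttl": 58, "window": 29200}
)
```

A real scorer would likely weight properties differently, since pinning a source address narrows traffic far more than pinning a protocol.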
Threat storage
A separate Elasticsearch index is used to store threats and their
•  Track attack start and stop times
•  Anycast VIP being attacked
•  The protocol(s) used in the attack
•  The signatures associated with the attack
•  Rate of attack traffic
This provides us a historical view of all attacks targeted at our network.
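A threat record carrying the properties listed above could be written with an upsert, so the same attack is updated in place as it evolves. The Python sketch below builds the request path and body for the Elasticsearch update endpoint using `doc` and `doc_as_upsert`. The index name, type name, field names, and `build_threat_upsert` are assumptions for illustration; the real threat schema is not shown.

```python
def build_threat_upsert(threat_id, vip, protocols, signatures, rate_pps,
                        start, stop=None, index="threats", doc_type="threat"):
    """Return (path, body) for creating or updating one threat record."""
    document = {
        "vip": vip,                # Anycast VIP being attacked
        "protocols": protocols,    # protocol(s) used in the attack
        "signatures": signatures,  # signatures associated with the attack
        "rate_pps": rate_pps,      # rate of attack traffic
        "start": start,            # attack start time
        "stop": stop,              # None while the attack is ongoing
    }
    path = "/{}/{}/{}/_update".format(index, doc_type, threat_id)
    # doc_as_upsert inserts the document if the id does not exist yet.
    body = {"doc": document, "doc_as_upsert": True}
    return path, body


example_path, example_update = build_threat_upsert(
    "abc123", "vip-placeholder", ["tcp"], [{"ttl": 58, "dpt": 80}],
    rate_pps=250000, start="2015-09-01T01:54:51Z",
)
```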
Elasticsearch allows for historical reporting
Storing threats separately allows for fast and easy reporting.
A single month view in 2015
Average Per Day
Average Attack Size
Median Attack Size
Max Attack Size
Average POP Traffic %
Median POP Traffic %
Max POP Traffic %
Average Attack Duration
Median Attack Duration
Max Attack Duration
Total Time Under Attack
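The metrics in this view are simple summaries over the stored threat records. A minimal Python sketch follows, assuming each record has `start` and `stop` times in seconds and a `size` value; the key names, helper names, and sample numbers are illustrative and are not real attack data.

```python
from statistics import mean, median


def summarize(values):
    """Return the average, median, and max of a non-empty list of numbers."""
    if not values:
        raise ValueError("no values to summarize")
    return {"average": mean(values), "median": median(values), "max": max(values)}


def attack_report(attacks):
    """Summarize attack size and duration for a reporting period."""
    durations = [attack["stop"] - attack["start"] for attack in attacks]
    sizes = [attack["size"] for attack in attacks]
    return {
        "size": summarize(sizes),
        "duration": summarize(durations),
        # Simple sum; overlapping attacks would be counted twice.
        "total_time_under_attack": sum(durations),
    }


sample_attacks = [
    {"start": 0, "stop": 60, "size": 100},
    {"start": 100, "stop": 400, "size": 300},
    {"start": 500, "stop": 560, "size": 200},
]
report = attack_report(sample_attacks)
```

Reporting both average and median matters here because a few very large attacks can pull the average well away from the typical case.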
Success Stories
The system has been largely successful
•  Thousands of attacks have been identified and stopped by Stonefish.
•  Minimal performance impact has been experienced as a result of DDOS attacks.
•  No news is good news
•  VDMS has not been in the news regarding DDOS or outages.
•  Over the past years many networks have been impacted by sizable
sustained attacks.
•  Stonefish identified its first attack 30 seconds after going into production.
•  The system has 100% uptime for over 24 months.
•  Stonefish is protecting over 13Tbps of network capacity across 100+ POPs
•  5% of the total Internet is powered by VDMS
Additional information
Follow up
•  Chris Bradley
Is this something you would like to work on?
Thank you.