Real-time, High Throughput Traffic Analytics with Elasticsearch

Chris Bradley
christopher.bradley@verizondigitalmedia.com
February 04, 2016
Confidential and proprietary materials for authorized Verizon personnel and outside agencies only. Use, disclosure or
distribution of this material is not permitted to any unauthorized persons or third parties except by written agreement.
Disclaimers
Portions of the data we use to make analytical decisions have been removed.
The removal of this data does not impact the functionality of the system.
•  This redaction is done as a security measure as our network is under constant threat.
•  The overall system functionality is present here intact and only a small set of data
properties used to make analytical decisions have been removed.
•  File paths, network ports, data sampling rates, and specific packet properties are
types of data that have been generalized or redacted.
Software Versions
We are running recent versions of software but not bleeding edge.
Software discussed herein references the following versions.
•  Elasticsearch – 1.3.1
•  Logstash – 1.4.6
•  Rsyslog – 8.12.0
•  Ubuntu – 14.04 LTS
•  Python – 2.7.5
Disclaimers
One last thing…
We are still using facets instead of aggregations in Elasticsearch because the
system was built before aggregations existed.
We will be switching, but facets still function today (although they are
deprecated).
Facets and aggregations serve very similar functions; aggregations are just
more efficient.
Terminology
Terminology used throughout this presentation
•  Anycast – Global load balancing based on shared IP address assignment.
•  CDN – Content delivery network.
•  GSLB – Global server load balancing.
•  OSI Layer 3 – The network layer of the OSI model. In this case IPv4 or
IPv6.
•  OSI Layer 4 – The transport layer of the OSI model. In this case TCP, UDP,
or ICMP.
•  Signature – A tuple of packet properties that are common during an attack.
•  Spoof – The action of masking or faking IP addresses as part of an attack.
•  VDMS – Verizon Digital Media Services.
Overview
The VDMS network is under constant attack.
Over the past few years we have developed an application (Stonefish) to identify
threats in real time and mitigate them quickly.
•  Time to identification and time to resolution are independent metrics that are
critical to the performance of our CDN.
The purpose of the Stonefish application is to identify threats at OSI layer 3
and 4 and take action to filter this malicious traffic from our network.
CDNs and the DDOS problem
For CDNs, performance and uptime are the most critical metrics.
The VDMS mission is to be the fastest, most performant CDN on the planet.
•  We are incredibly sensitive to anything that impacts performance.
•  DDOS attacks are a serious threat to customer retention.
Predicting and preventing every single attack is impossible.
•  The original design goal of the system was to achieve the broadest network
coverage and protection that is realistically possible.
•  The nature of DDOS attacks makes them relatively easy to initiate but
difficult to track to origination point(s).
Stonefish Origin Story
Stonefish originated in 2011.
At that time EdgeCast was suffering from large-scale, multi-day DDOS
attacks.
•  Tracking attack signatures was a manual process.
•  Rules to filter signatures were generated manually.
•  These rules then had to be distributed globally to servers and routers.
This was not an effective means for dealing with attacks.
•  Time to identification and mitigation was not acceptable.
•  The effort and engineering resources required were significant.
•  CDN performance was impacted for unacceptable time periods while this
manual process occurred.
Stonefish Design Goals
The resulting concept was a system that automatically identifies and
stops attacks.
•  Identify an attack on our network anywhere in the world in 60 seconds.
•  Mitigate 99% of attacks on our network within 60 seconds of identification.
•  Monitor connection state information for millions of connections in near
real-time.
•  Block the most common attacks without human intervention.
VDMS Simplified POP View
[Diagram: layers of a POP, from the Internet inward]
•  Routing – Internet peering and uplinks, ACLs only
•  Firewall – Stateful inspection
•  Load Balance – Intelligent control plane
•  Edge – App layer
Stonefish Ecosystem
[Diagram: the Stonefish processing cycle]
Sample Data (IPTables) → Transmit Data (Rsyslog) → Ingest Data (Logstash) →
Store Data (Elasticsearch) → Process Data (Stonefish) → Push Rules (Stonefish)
Cluster Hardware
The ‘heart’ of Stonefish is an Elasticsearch cluster.
This is a fully collapsible cluster that utilizes the VDMS Anycast architecture to
provide 99.999% uptime.
Cluster Composition
•  240 Intel Xeon cores
•  640GB RAM
•  120TB SSD storage
•  Redundant 10Gb Ethernet links to each server in the cluster
Collection and Storage
Data collection occurs at the routing and load balancing layers.
•  The load balancing layer runs Linux.
–  The kernel logs sampled packet data via IPTables.
•  Rsyslog sends JSON formatted packet data to Anycast address.
•  Logstash receives JSON packet data and ingests to Elasticsearch.
All cluster servers are horizontally scalable
•  All servers in the cluster run an identical stack.
•  One Anycast address is used to access all servers in the cluster.
•  Any server in the cluster can ingest packet data.
[Diagram: data flow from edge to cluster]
Inbound traffic → Load balancers → Kernel packet logging → Rsyslog JSON formatting →
Rsyslog TCP TLS transmission → Stonefish Cluster (Logstash ingestion) →
Stonefish Cluster (Elasticsearch storage) → Stonefish Threat Detection → Stonefish Dashboard
Traffic types shown in the diagram: SYN, DNS, ICMP, UDP
Linux Kernel Packet Logging
The kernel provides packet logging, configured via the IPTables user space tools.
Example Log Rules
iptables -t filter -A log_chain -p tcp -m tcp --tcp-flags SYN SYN -m statistic --mode random --probability 0.0001 -j LOG --log-prefix "ipt-type1: " --log-level 7
iptables -t filter -A log_chain -p icmp -m statistic --mode random --probability 0.0001 -j LOG --log-prefix "ipt-type2: " --log-level 7
iptables -t filter -A log_chain -p udp --dport 53 -m statistic --mode random --probability 0.0001 -j LOG --log-prefix "ipt-type3: " --log-level 7
iptables -t filter -A log_chain -p udp -m statistic --mode random --probability 0.0001 -j LOG --log-prefix "ipt-type4: " --log-level 7
Output
Chain log_chain (1 references)
 pkts bytes target  prot opt in  out  source     destination
    0     0 LOG     tcp  --  *   *    0.0.0.0/0  0.0.0.0/0    tcp flags:0x02/0x02 statistic mode random probability 0.0001 LOG flags 0 level 7 prefix "ipt-type1: "
    0     0 LOG     icmp --  *   *    0.0.0.0/0  0.0.0.0/0    statistic mode random probability 0.0001 LOG flags 0 level 7 prefix "ipt-type2: "
    0     0 LOG     udp  --  *   *    0.0.0.0/0  0.0.0.0/0    udp dpt:53 statistic mode random probability 0.0001 LOG flags 0 level 7 prefix "ipt-type3: "
    0     0 LOG     udp  --  *   *    0.0.0.0/0  0.0.0.0/0    statistic mode random probability 0.0001 LOG flags 0 level 7 prefix "ipt-type4: "
Result
[11507705.594434] ipt-type2: IN=bond0 OUT= MAC=ec:f4:bb:e7:13:a8:78:19:f7:38:0f:f0:08:00 SRC=10.0.0.1 DST=10.1.0.1 LEN=68 TOS=0x00 PREC=0x00 TTL=61 ID=32341 PROTO=ICMP TYPE=3 CODE=1 [SRC=93.184.221.200 DST=87.209.64.100 LEN=40 TOS=0x00 PREC=0x00 TTL=58 ID=5286 DF PROTO=TCP SPT=443 DPT=56484 WINDOW=290 RES=0x00 ACK FIN URGP=0 ] MARK=0x8
[11507804.647848] ipt-type1: IN=bond0 OUT= MAC=ec:f4:bb:e7:13:a8:78:19:f7:38:0f:f0:08:00 SRC=10.0.0.2 DST=10.1.0.1 LEN=60 TOS=0x00 PREC=0x00 TTL=58 ID=28666 DF PROTO=TCP SPT=59999 DPT=80 WINDOW=29200 RES=0x00 SYN URGP=0 MARK=0x7
[11507804.942317] ipt-type1: IN=bond0 OUT= MAC=ec:f4:bb:e7:13:a8:78:19:f7:38:0f:f0:08:00 SRC=10.0.0.3 DST=10.1.0.1 LEN=60 TOS=0x00 PREC=0x00 TTL=56 ID=53421 DF PROTO=TCP SPT=59761 DPT=80 WINDOW=29200 RES=0x00 SYN URGP=0 MARK=0x7
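The log lines above are space-separated KEY=VALUE pairs plus a few bare flags. As a rough illustration of how such a line could be broken into fields, here is a minimal Python sketch. It is not the Stonefish parser (in the deck, Rsyslog does this work); the function name, the flag list, and the output keys are assumptions made for the example.

```python
import re

# KEY=VALUE pairs such as SRC=10.0.0.2; an empty value (OUT=) yields "".
_PAIR = re.compile(r"([A-Z]+)=(\S*)")
# The prefix set by --log-prefix, e.g. "ipt-type1:".
_PREFIX = re.compile(r"\b(ipt-type\d+):")
# Bare tokens that the LOG target prints without a value (assumed list).
_FLAGS = {"SYN", "ACK", "FIN", "RST", "PSH", "URG", "DF"}


def parse_packet_log(line):
    """Return a dict of the fields found in one kernel packet log line."""
    record = {}
    prefix = _PREFIX.search(line)
    if prefix:
        record["tag"] = prefix.group(1)
    for key, value in _PAIR.findall(line):
        # Keep the first occurrence only: ICMP errors embed a second,
        # inner header in square brackets that repeats SRC, DST, and so on.
        record.setdefault(key, value)
    record["flags"] = [tok for tok in line.split() if tok in _FLAGS]
    return record


if __name__ == "__main__":
    sample = ("[11507804.647848] ipt-type1: IN=bond0 OUT= SRC=10.0.0.2 "
              "DST=10.1.0.1 LEN=60 TTL=58 ID=28666 DF PROTO=TCP SPT=59999 "
              "DPT=80 WINDOW=29200 RES=0x00 SYN URGP=0 MARK=0x7")
    print(parse_packet_log(sample))
```

Keeping only the first occurrence of each key is a simplification: it records the outer header of an ICMP error and ignores the embedded one.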
Rsyslog data transmission
Rsyslog templating is not user friendly, but it is powerful and fast.
Example /etc/rsyslog.d/10-ipt-log-transmit.conf
$template msg_parse, "{\"@timestamp\": \"%TIMESTAMP:::date-rfc3339%\", \"@hostname\": \"%hostname%\", \"@fields\": {\"field1\": \"%msg:R,ERE,1,BLANK,0:FLD1=([0-9?.]+)--end%\", \"field2\": \"%msg:R,ERE,1,BLANK,0:FLD2=([0-9?.]+)--end%\", \"field3\": \"%msg:R,ERE,1,BLANK,0:FLD3=([0-9]+)--end%\", \"field4\": \"%msg:R,ERE,1,BLANK,0:FLD4=([0-9]+)--end%\", \"field5\": \"%msg:R,ERE,1,BLANK,0:FLD5=([0-9]+)--end%\", \"field6\": \"%msg:R,ERE,1,BLANK,0:FLD6=([0-9]+)--end%\", \"field7\": \"%msg:R,ERE,1,BLANK,0:FLD8=(0x[0-9a-f]+)--end%\"}, \"@tags\": [\"ipt-type1\"]}\n"
$MainMsgQueueSize 1000
$MainMsgQueueDiscardMark 800
$MaxMessageSize 1000k
$WorkDirectory /path/to/disk_assisted_queue_dir/daq
$ActionQueueFileName packetlog
$ActionQueueMaxDiskSpace 10g
$ActionQueueSaveOnShutdown on
$ActionQueueType LinkedList
$ActionResumeRetryCount -1
$SystemLogRateLimitInterval 0
$DefaultNetstreamDriverCAFile /path/to/tls/cert_file/cert.crt
$ActionSendStreamDriver gtls
$ActionSendStreamDriverMode 1
$ActionSendStreamDriverAuthMode x509/name
$ActionSendStreamDriverPermittedPeer *.domainname.tld
:msg, contains, "ipt-type1" @@anycast.ip.address:9999;msg_parse
& stop
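Each %msg:R,ERE,1,BLANK,0:...--end% expression in the template applies an extended regular expression to the message, takes submatch 1, and emits an empty string when nothing matches. The Python sketch below mimics that behaviour to show what the template produces. The field names and patterns (src, dst, ttl, mark) are assumptions, since the deck redacts the real ones, and the function names are hypothetical.

```python
import json
import re

# Hypothetical patterns; the deck generalizes the real ones to FLD1, FLD2, ...
FIELD_PATTERNS = {
    "src": r"SRC=([0-9.]+)",
    "dst": r"DST=([0-9.]+)",
    "ttl": r"TTL=([0-9]+)",
    "mark": r"MARK=(0x[0-9a-f]+)",
}


def extract(pattern, msg):
    """Return capture group 1, or "" when nothing matches (the BLANK mode)."""
    match = re.search(pattern, msg)
    return match.group(1) if match else ""


def to_json_line(msg, timestamp, hostname, tag):
    """Build one newline-terminated JSON document shaped like the template."""
    doc = {
        "@timestamp": timestamp,
        "@hostname": hostname,
        "@fields": {name: extract(p, msg) for name, p in FIELD_PATTERNS.items()},
        "@tags": [tag],
    }
    return json.dumps(doc, sort_keys=True) + "\n"


if __name__ == "__main__":
    print(to_json_line("ipt-type1: SRC=10.0.0.1 DST=10.1.0.1 TTL=58 MARK=0x7",
                       "2015-09-01T01:54:51+00:00", "hostname", "ipt-type1"))
```

Emitting an empty string for a missing field keeps every document the same shape, which makes the downstream index mapping predictable.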
Rsyslog result
Rsyslog transforms the kernel log line into JSON and puts it on the wire.
Kernel packet log
[11507804.647848] ipt-type1: IN=bond0 OUT= MAC=ec:f4:bb:e7:13:a8:78:19:f7:38:0f:f0:08:00
SRC=10.0.0.1 DST=10.1.0.1 LEN=60 TOS=0x00 PREC=0x00 TTL=58 ID=28666 DF
PROTO=TCP SPT=59999 DPT=80 WINDOW=29200 RES=0x00 SYN URGP=0 MARK=0x7
Rsyslog JSON output
{
  "_source": {
    "@timestamp": "2015-09-01T01:54:51.813+00:00",
    "@srvtype": "server_type",
    "@source_host": "hostname",
    "@fields": {
      "src": "10.0.0.1",
      "dst": "10.1.0.1",
      "field1": "99",
      "field2": "99",
      "field3": "99",
      "field4": "99",
      "mark": "0x7"
    },
    "@tags": [
      "ipt-type1"
    ],
    "@version": "1",
    "type": "ipt-type1",
    "host": "127.0.0.1"
  }
}
Rsyslog data transmission alternatives
There are many alternatives that are more user friendly in production today.
•  Logstash has egress functionality.
•  Lumberjack is a lightweight log shipper for Logstash.
•  Custom code could be written to transport messages.
•  Publish-subscribe messaging systems could be used (NSQ, Kafka, RabbitMQ, etc.).
•  Many GitHub projects exist, written in Python, Go, or your language of choice.
If we were building from the ground up today, Rsyslog might not be selected.
Logstash Ingestion
Our configuration is broken into two parts: input and output.
Logstash is highly configurable and powerful; we use it in its most basic form.
The configuration file uses Logstash's own block syntax, which loosely resembles JSON.
input {
  tcp {
    port => 9999
    type => 'custom_type_here'
    ssl_enable => 'true'
    ssl_cacert => '/path/to/certificate_authority/ca.crt'
    ssl_cert => '/path/to/server/cert/server.crt'
    ssl_key => '/path/to/server/key/server.key'
    codec => json {
      charset => "UTF-8"
    }
  }
}
output {
  if "ipt-type1" in [@tags] {
    elasticsearch_http {
      index => 'index-type1-%{+YYYY.MM.dd}'
      host => 'your.es.host.ip.or.localhost'
      template => '/path/to/your/elasticsearch/index/template/file.tmpl'
      template_name => 'custom_template_name'
      template_overwrite => true
    }
  }
}
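The index setting above produces one index per day, named from the event timestamp in UTC. Any client that queries the data needs to work out which daily indices cover a time window. The following Python sketch shows one way to do that; the function names are hypothetical, and the prefix is simply the placeholder from the configuration above.

```python
from datetime import datetime, timedelta, timezone


def daily_index(prefix, when):
    """Return the daily index name for a timezone-aware event time."""
    day = when.astimezone(timezone.utc)
    return "{}-{}".format(prefix, day.strftime("%Y.%m.%d"))


def indices_between(prefix, start, end):
    """Return every daily index name touched by the window [start, end]."""
    day = start.astimezone(timezone.utc).date()
    last = end.astimezone(timezone.utc).date()
    names = []
    while day <= last:
        names.append("{}-{}".format(prefix, day.strftime("%Y.%m.%d")))
        day += timedelta(days=1)
    return names


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    # A five-minute window usually maps to one index, two around midnight UTC.
    print(indices_between("index-type1", now - timedelta(minutes=5), now))
```

Restricting a query to the indices that cover the window keeps short-interval detection queries from touching months of archived data.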
Elasticsearch and Data Storage
Stonefish randomly samples n% of all inbound traffic
•  Each protocol type is stored in its own Elasticsearch index, with a new index created each day.
•  100s of millions of records per protocol type per day are archived.
•  Data set granularity ranges from seconds to months.
Packet records are treated as immutable.
•  Packet records are never modified.
•  Some updates and upserts are done to threat data for detected attacks.
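Because only a random fraction of packets is stored, counts read from Elasticsearch have to be scaled back up to estimate real traffic rates. A minimal sketch of that arithmetic follows. The probability of 0.0001 is only the value from the example IPTables rules (real sampling rates are redacted), and the function name is hypothetical.

```python
def estimate_packets_per_second(sampled_count, probability, interval_seconds):
    """Scale a sampled packet count back up to an estimated real rate."""
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return sampled_count / float(probability) / interval_seconds


if __name__ == "__main__":
    # 120 sampled packets in 60 seconds at a 1-in-10,000 sampling rate.
    print(estimate_packets_per_second(120, 0.0001, 60))
```

The estimate gets noisier as the sampled count gets smaller, which is one reason minimum thresholds matter when judging short intervals.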
Elasticsearch configuration and performance
The Java environment is one of the most important factors in performance.
Elasticsearch is a Java application, so configuring the Java environment has
an enormous impact on performance.
The Oracle JVM outperformed distribution-specific bundles in all of our testing.
•  A few key environment variables
–  JAVA_OPTS="-Xms30g -Xmx30g"
–  ES_HEAP_SIZE=30g
•  Async replication allows for faster ingestion.
•  Bulk ingestion can increase storage performance when dealing with large volumes of
data.
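Bulk ingestion means sending many documents in one request to the Elasticsearch bulk API, whose body is newline-delimited JSON: an action line followed by the document, repeated. In the deck Logstash handles this; the sketch below only shows the body format for a custom writer. The index and type names are placeholders and the function name is hypothetical.

```python
import json


def bulk_body(index, doc_type, docs):
    """Build a newline-delimited bulk request body that indexes each doc."""
    lines = []
    for doc in docs:
        # Action line: tells Elasticsearch where the next document goes.
        lines.append(json.dumps({"index": {"_index": index, "_type": doc_type}}))
        # Source line: the document itself, on a single line.
        lines.append(json.dumps(doc))
    # Every line, including the last, must end with a newline.
    return "".join(line + "\n" for line in lines)


if __name__ == "__main__":
    packets = [{"src": "10.0.0.1", "ttl": "58"}, {"src": "10.0.0.2", "ttl": "56"}]
    print(bulk_body("index-type1-2015.09.01", "ipt-type1", packets))
```

The resulting string would be sent as the body of a POST to the cluster's /_bulk endpoint.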
Elasticsearch and security
Later versions of Elasticsearch support authentication and encryption
natively.
An Nginx SSL reverse proxy allows for many authentication schemes and
encrypted communications.
•  Nginx can provide discrete URL based authentication schemes
–  Auth tokens for writing data from servers
–  LDAP/AD/Kerberos for secure user access
Stonefish brings it all together
Visualizing the data
Custom dashboard based on Flask, Angular, and D3
The result – Seeing attacks in real time
Normal traffic is consistent
Histograms show anomalies clearly
Threats are abrupt changes to traffic pattern
Elasticsearch does the heavy lifting
Elasticsearch facets allow us to query packet data for anomalies.
•  Allows us to write query logic based on what type of data we want to see.
We do not have to write ‘fuzzy’ logic or custom code algorithms to
determine where anomalies lie.
•  Elasticsearch finds common packet properties based on threshold
information supplied in queries to it.
•  Elasticsearch’s built-in trend data is also used to determine the pattern a
particular signature is following (trending up, trending down, etc.).
Building the above functionality in-house would have greatly
complicated our task and extended our time to deployment.
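As an illustration of the kind of request this involves, the sketch below builds a terms facet query that asks Elasticsearch for the most common values of one packet field over the last minute, along with the equivalent aggregation form. The field path, time window, and facet name are assumptions for the example and are not the queries Stonefish runs.

```python
def terms_facet_query(field, since="now-1m", size=10):
    """Request body: most common values of `field` since `since` (facet form)."""
    return {
        "size": 0,  # only the facet counts are needed, not the matching hits
        "query": {
            "filtered": {
                "query": {"match_all": {}},
                "filter": {"range": {"@timestamp": {"gte": since}}},
            }
        },
        "facets": {
            "top_values": {"terms": {"field": field, "size": size}}
        },
    }


def terms_aggregation_query(field, since="now-1m", size=10):
    """The same request expressed with the newer aggregations syntax."""
    body = terms_facet_query(field, since, size)
    del body["facets"]
    body["aggs"] = {"top_values": {"terms": {"field": field, "size": size}}}
    return body


if __name__ == "__main__":
    import json
    print(json.dumps(terms_facet_query("@fields.src"), indent=2))
```

Either body would be sent to the _search endpoint of the relevant daily index; the response lists each common value with its document count.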
Data Analysis
Elasticsearch data is continually analyzed for changes to packet metrics.
•  This is done via custom Python code.
•  The VDMS software retrieves scores for time intervals and compares them
to previous intervals for anomalous or out-of-bounds changes.
•  Each application type has custom query and detection logic for the most
accurate identification possible.
Many factors determine if traffic is a threat.
•  Does this traffic exceed a minimum threshold for this POP?
•  Are there signatures that exceed a minimum threshold for this POP?
•  If the answer to either of these questions is yes, generate signatures and
score them.
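The comparison of one interval against previous intervals can be sketched as below. This is an illustrative stand-in and not the VDMS detection logic: the median baseline, the ratio of 3, and the function names are all assumptions.

```python
def median(values):
    """Return the median of a non-empty list of numbers."""
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2:
        return float(ordered[mid])
    return (ordered[mid - 1] + ordered[mid]) / 2.0


def is_anomalous(current, previous, min_threshold, ratio=3.0):
    """Flag an interval that clears the POP floor and jumps past its baseline.

    current       -- packet count for the interval being checked
    previous      -- counts for earlier intervals of the same length
    min_threshold -- minimum count worth looking at for this POP
    ratio         -- how far above the baseline counts as abnormal (assumed)
    """
    if current < min_threshold:
        return False
    if not previous:
        # No history yet: only the floor can be applied.
        return True
    return current > median(previous) * ratio


if __name__ == "__main__":
    history = [1000, 1100, 900]
    print(is_anomalous(5000, history, min_threshold=2000))
    print(is_anomalous(2500, history, min_threshold=2000))
```

Using a median rather than a mean keeps a single earlier spike from inflating the baseline.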
Drilling into the data
Elasticsearch facets are used as part of query and filter logic.
•  Aggregations build scores on combinations of packet header properties.
•  Signatures are weighted against one another to determine the percentage
of new connections globally and local to a POP.
•  Comparisons to previous data are made to confirm normal or abnormal
behavior.
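To make the idea of scoring combinations of packet header properties concrete, the sketch below counts every combination of field values in a batch of sampled packets and reports the share of traffic each one represents. In Stonefish this counting is done by Elasticsearch; the local version here, its function name, and the 20% cut-off are assumptions for illustration only.

```python
from collections import Counter
from itertools import combinations


def signature_shares(packets, fields, min_share=0.2):
    """Return {signature: share} for field-value combinations above min_share.

    A signature is a tuple of (field, value) pairs, e.g.
    (("TTL", "58"), ("LEN", "60")).
    """
    total = len(packets)
    if total == 0:
        return {}
    counts = Counter()
    for packet in packets:
        for size in range(1, len(fields) + 1):
            for combo in combinations(fields, size):
                signature = tuple((name, packet.get(name)) for name in combo)
                counts[signature] += 1
    shares = {}
    for signature, count in counts.items():
        share = count / float(total)
        if share >= min_share:
            shares[signature] = share
    return shares


if __name__ == "__main__":
    sample = [{"TTL": "58", "LEN": "60"}] * 8 + [{"TTL": "61", "LEN": "40"}] * 2
    for signature, share in sorted(signature_shares(sample, ["TTL", "LEN"], 0.5).items()):
        print(signature, share)
```

The number of combinations grows quickly with the number of fields, so a real system would limit which fields are combined.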
Acting on the data
Anomalous traffic patterns detected
•  What is the percentage of total POP and global traffic?
•  Does this traffic exceed a minimum threshold for us to take action?
•  Is this anomalous traffic valid traffic from a customer event in progress?
•  This decision is key as large customers can spike heavily.
•  Is the signature specific enough to only filter attack traffic?
•  Custom signature scoring algorithm.
•  Mitigation rules are globally distributed within 60 seconds when thresholds
are exceeded.
•  The system automatically removes mitigation rules after a predetermined
amount of time once an attack is no longer observed.
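The automatic removal of mitigation rules can be pictured as a table of signatures with a last-seen time. The following sketch assumes a simple in-memory table and a fixed quiet period; the class, its methods, and the 600-second value are hypothetical and do not describe how Stonefish stores or distributes rules.

```python
class MitigationTable(object):
    """Track active mitigations and expire them once the attack goes quiet."""

    def __init__(self, quiet_period):
        self.quiet_period = quiet_period  # seconds without attack traffic
        self._last_seen = {}

    def observe(self, signature, now):
        """Record that attack traffic matching `signature` was seen at `now`."""
        self._last_seen[signature] = now

    def expire(self, now):
        """Remove and return signatures not observed for the quiet period."""
        stale = [sig for sig, seen in self._last_seen.items()
                 if now - seen >= self.quiet_period]
        for signature in stale:
            del self._last_seen[signature]
        return sorted(stale)

    def active(self):
        """Return the signatures that still have a mitigation in place."""
        return sorted(self._last_seen)


if __name__ == "__main__":
    table = MitigationTable(quiet_period=600)
    table.observe("sig-a", now=0)
    table.observe("sig-b", now=500)
    print(table.expire(now=600))
    print(table.active())
```

Each call to observe resets the clock for that signature, so a rule stays in place for as long as the attack keeps matching it.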
Signature scoring
Custom algorithm generates a score for a signature.
•  The score gives us an indication of how specific or generic a signature is.
•  Generic signatures can filter valid traffic. The more specific the signature,
the better and the higher the score.
•  Scores range from 1 to 10.
•  Scores greater than 5 are very specific and safe to deploy.
•  Lower scored signatures require engineering review to determine if they
should be deployed.
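The scoring algorithm itself is custom and is not described here, so the sketch below covers only the decision that follows it: scores above 5 deploy automatically, and everything else goes to engineering review. The function name and the shape of the input are assumptions.

```python
AUTO_DEPLOY_ABOVE = 5  # from the slide: scores greater than 5 are safe to deploy


def triage(scored_signatures):
    """Split (signature, score) pairs into auto-deploy and needs-review lists."""
    deploy, review = [], []
    for signature, score in scored_signatures:
        if not 1 <= score <= 10:
            raise ValueError("score out of range: %r" % (score,))
        if score > AUTO_DEPLOY_ABOVE:
            deploy.append(signature)
        else:
            review.append(signature)
    return deploy, review


if __name__ == "__main__":
    print(triage([("sig-a", 8), ("sig-b", 5), ("sig-c", 1)]))
```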
Threat storage
A separate Elasticsearch index is used to store threats and their
signatures.
•  Track attack start and stop times
•  Anycast VIP being attacked
•  The protocol(s) used in the attack
•  The signatures associated with the attack
•  Rate of attack traffic
This provides us with a historical view of all attacks targeted at our network.
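A threat record with the properties listed above could be written with the Elasticsearch update API, which accepts a partial document and can create the record if it does not exist yet. The sketch below builds such a request body. The field names are hypothetical, since the real threat schema is not shown.

```python
def threat_update_body(vip, protocols, signatures, rate_pps, start, stop=None):
    """Build an update-API body that creates or updates one threat record."""
    doc = {
        "vip": vip,                       # Anycast VIP being attacked
        "protocols": sorted(protocols),   # protocol(s) used in the attack
        "signatures": list(signatures),   # signatures associated with it
        "rate_pps": rate_pps,             # rate of attack traffic
        "start": start,                   # attack start time
        "stop": stop,                     # None while the attack is ongoing
    }
    # doc_as_upsert asks Elasticsearch to index `doc` when no record exists.
    return {"doc": doc, "doc_as_upsert": True}


if __name__ == "__main__":
    print(threat_update_body("192.0.2.10", ["tcp"], ["sig-a"], 336212,
                             "2015-09-01T01:54:51+00:00"))
```

The same body shape can be sent again when the attack ends, this time with the stop time filled in.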
Threat Visibility
Elasticsearch allows for historical reporting
Storing threats separately allows for fast and easy reporting.
A single month view in 2015

Metric                       Value
Total                        107
Average Per Day              3
Average Attack Size          336,212/sec
Median Attack Size           211,700/sec
Max Attack Size              3,752,166/sec
Average POP Traffic %        22%
Median POP Traffic %         17%
Max POP Traffic %            81%
Average Attack Duration      0:21:08
Median Attack Duration       0:23:00
Max Attack Duration          13:45:00
Total Time Under Attack      52:42:00
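A report like this can be derived from the stored threat records with a few lines of code. The sketch below summarizes a list of attack durations in seconds; the function names and the sample values are made up for illustration and are not the data behind the table.

```python
def fmt_duration(seconds):
    """Format a number of seconds as H:MM:SS."""
    hours, rest = divmod(int(seconds), 3600)
    minutes, secs = divmod(rest, 60)
    return "%d:%02d:%02d" % (hours, minutes, secs)


def summarize(durations):
    """Return report rows for a non-empty list of attack durations (seconds)."""
    if not durations:
        raise ValueError("no attacks to summarize")
    ordered = sorted(durations)
    count = len(ordered)
    mid = count // 2
    if count % 2:
        median = ordered[mid]
    else:
        median = (ordered[mid - 1] + ordered[mid]) / 2.0
    return {
        "total": count,
        "average_duration": fmt_duration(sum(ordered) / float(count)),
        "median_duration": fmt_duration(median),
        "max_duration": fmt_duration(ordered[-1]),
        "time_under_attack": fmt_duration(sum(ordered)),
    }


if __name__ == "__main__":
    print(summarize([60, 120, 600]))
```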
Success Stories
The system has been largely successful
•  1000s of attacks have been identified and stopped by Stonefish
•  Minimal performance impact has been experienced as a result of DDOS
attacks.
•  No news is good news
•  VDMS has not been in the news regarding DDOS or outages.
•  Over the past years many networks have been impacted by sizable
sustained attacks.
•  Stonefish identified its first attack 30 seconds after going into production.
•  The system has had 100% uptime for over 24 months.
•  Stonefish is protecting over 13Tbps of network capacity across 100+ POPs
•  5% of the total Internet is powered by VDMS
Additional information
Follow up
•  Chris Bradley
•  christopher.bradley@verizondigitalmedia.com
Is this something you would like to work on?
http://jobs.verizondigitalmedia.com
Thank you.