dpi_presentation_pisa

Usefulness of the results
- a forgotten evaluation metric
of traffic identification tools
Tomasz Bujlow (tbu@es.aau.dk)
Aalborg University
Agenda
• Few words about myself
• Motivations for traffic monitoring
• Existing methods and tools for traffic monitoring & classification
- and why they are far from being excellent
• A deep look into Deep Packet Inspection
• How to verify the accuracy of classification tools?
• Implementation and various applications of VBS
Tomasz Bujlow
• University of Southern Denmark (2007 – 2009)
Bachelor of Computer Engineering, Computer Engineering
• The Silesian University of Technology (2003 – 2008)
Master of Science in Engineering, Computer Engineering
• Aalborg University (2010 – 2014)
Doctor of Philosophy (PhD): Classification and analysis of computer network traffic
• Universitat Politecnica de Catalunya (January 2013 – April 2013)
Visiting PhD Student, CBA - Broadband Communications Research Group
• Cisco Certified Network Professional (2010)
Part I
Motivations for traffic monitoring
Why perform traffic monitoring?
• To obtain basic statistical information about different kinds of
flows in the network and improve Quality of Service
Our interests
● content (audio, video)
● application (P2P, FTP)
● service (YouTube, Facebook)
Why perform traffic monitoring?
• To learn which applications are most frequently
used in the network and enhance user experience by tuning
some network parameters or setting up dedicated proxies or
servers
Our interests
● application (Skype, HTTP)
Why perform traffic monitoring?
• To compare users located in the same network and group them
into profiled sections
Our interests
● application (Skype, BitTorrent)
● IP protocol (TCP / UDP)
Why perform traffic monitoring?
• To create graphs of traffic flow between different networks and
optimize the amount of bandwidth bought from different content
providers
Our interests
● service (YouTube, Facebook)
Why perform traffic monitoring?
• To introduce smart logging of traffic. Logging is now required by
law. The ability to recognize types of transmitted content makes it
possible to log only, for example, the text content of
websites, but not images or downloaded binary files. That will
save resources, especially storage space.
Our interests
● content (text, audio, video)
Why perform traffic monitoring?
• To create a traffic generator that imitates traffic generated by
particular applications, or the real traffic in the
network. That allows testing different solutions before
implementing them in the real network, and therefore
minimizing the cost.
Our interests
● IP protocol (TCP / UDP)
● application (HTTP, BitTorrent, Skype)
● content (audio, video)
● service (YouTube, Facebook)
Why perform traffic monitoring?
• To obtain precise data needed to create fast and accurate
traffic classifiers working in the network core, which are based
on statistical information (Machine Learning Algorithms).
Our interests
● IP protocol (TCP / UDP)
● application (HTTP, BitTorrent, Skype)
● content (audio, video)
● service (YouTube, Facebook)
Why perform traffic monitoring?
• To implement smart assessment of QoS in the network at the
users' level and in the core of the network
Our interests
● IP protocol (TCP / UDP)
● application (HTTP, BitTorrent, Skype)
● content (audio, video)
● service (YouTube, Facebook)
Why perform traffic monitoring?
• To understand the behavior of different applications, services, ...
[Figure panels: Web browsing, YouTube, World of Warcraft, Skype]
Our interests
● IP protocol (TCP / UDP)
● application (HTTP, BitTorrent, Skype)
● content (audio, video)
● service (YouTube, Facebook)
Why perform traffic monitoring?
• To detect malicious traffic, such as Botnet traffic
Our interests
● application (bot?)
Why perform traffic monitoring?
• To detect malicious traffic, such as DDoS attacks
Part II
Existing methods and tools for traffic
monitoring & classification
- and why they are far
from being excellent
Traffic classification – overview
• Classification by ports
• Deep Packet Inspection (DPI)
• QoS based (IP precedence, DSCP)
• Statistical classification
Port-based classification
• Very simple idea, widely used by network administrators to limit
unwanted traffic (generated by worms, spam, etc.)
• Implemented on almost all layer-3 switches existing on the
market
• Can classify only applications operating on fixed port numbers
• Very easy to cheat, so unreliable
What can we get?
● lower application-level protocol
(HTTP, POP3) for some old, well-known cases
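The idea can be sketched in a few lines of C. This is a minimal illustration, not code from any of the tools discussed here: `port_entry`, `PORT_TABLE`, and `classify_by_port` are hypothetical names, and the table holds only a handful of IANA well-known ports.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical port table: a few IANA well-known ports only. */
struct port_entry {
    uint16_t port;
    const char *protocol;
};

static const struct port_entry PORT_TABLE[] = {
    { 25,  "SMTP"  },
    { 53,  "DNS"   },
    { 80,  "HTTP"  },
    { 110, "POP3"  },
    { 443, "HTTPS" },
};

/* Returns the protocol registered for either port, or "UNKNOWN". */
const char *classify_by_port(uint16_t src_port, uint16_t dst_port)
{
    size_t n = sizeof(PORT_TABLE) / sizeof(PORT_TABLE[0]);
    for (size_t i = 0; i < n; i++) {
        if (PORT_TABLE[i].port == src_port || PORT_TABLE[i].port == dst_port)
            return PORT_TABLE[i].protocol;
    }
    return "UNKNOWN";
}
```

The sketch also shows why the method is easy to cheat: any application that chooses to listen on port 80 is reported as HTTP, whatever it actually sends.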
Deep Packet Inspection (DPI)
• Relies on inspecting the payload on the application layer
• Much more convenient to use than the previously described
methods
• Requires significant amounts of resources
• Numerous privacy and confidentiality issues
• Encryption makes DPI more difficult
• False positives and false negatives due to the
statistical methods implemented in DPI tools
What can we get?
● Everything we want: IP protocol, application, content, service, etc.
● But what kinds of results are really produced by the existing DPI tools?
DPI – is it really a consistent means?
• Ipoque PACE – application level, content container level [FLASH,
WINDOWSMEDIA, QUICKTIME]
• OpenDPI – an open-source fork of PACE, the same level of consistency
• nDPI – successor of OpenDPI, additionally: service provider level
[FACEBOOK, GOOGLE, TWITTER]
• Libprotoident – L4 level [TCP / UDP] + the application [BitTorrent],
content [Flash_Player], or service provider [YahooError]
• NBAR – consistent output on the application level
• L7-filter - consistent output on the application level
Today accuracy != consistency
● accurate tools (PACE, OpenDPI, nDPI) – inconsistent
● consistent tools (NBAR, L7-filter) – inaccurate
DPI – results by PACE, OpenDPI, nDPI
• applications and application protocols: BITTORRENT, RDP, SMB, NTP,
SSH, DNS, PANDO, NETBIOS, EDONKEY, SOPCAST,
DIRECT_DOWNLOAD_LINK, FTP, ICMP, QUICKTIME, MAIL_SMTP,
MAIL_IMAP, WINDOWSMEDIA, MAIL_POP, PPSTREAM, STUN,
STEAM
• low-level application protocol: HTTP, SSL
• content: FLASH, MPEG
• Undetected traffic: UNKNOWN
• nDPI adds a few services, such as Facebook, YouTube, and Google
DPI - effects of the consistency aspect
• Even if the classification results are consistent on the application level, other
levels are unknown (IP protocol, lower application protocol, content,
service). So, the usefulness of such results is very limited. However, they
can be used for accounting purposes on the application level.
• Mixing the levels of the results makes things even worse:
a) it is not possible to account for the traffic on any level, as only one chosen
level is ever given and the rest is unknown
b) as only one level is given, we do not know what is on any other level, so the
usefulness of such results is almost NONE!
DPI - reasons for the lack of consistency
• Most developers claim that “their tool provides the most detailed result, on
whatever level it is”
• However, how do we assess which level is more precise? Content (MP4 video),
content container (Flash), service (YouTube), or application protocol (HTTP)?
• Given that the obtained result is Flash, what is the real flow association?
a) TCP → HTTP → Flash → MP4 video → YouTube (regular file download)?
b) TCP → RTMP → Flash → Justin.tv (live TV streaming)?
c) TCP → FTP → Flash → EXE (executable file inside Flash container
transferred by FTP)?
DPI – how to generate useful results?
• Structure the results, so all the relevant classification levels are evaluated:
a) IP protocol level (TCP / UDP)
b) lower application-level protocol, such as HTTP, SSL, POP3, etc.
c) higher application-level protocol or application, such as SMTPS, Skype, BitTorrent,
Dropbox
d) content, such as MP4 video, FLV video, MP3 audio, JPG image
e) service, such as Facebook, YouTube, or Google
• Implemented by:
a) new version of PACE (partly and in a very limited manner)
b) new, development version of nDPI (full implementation)
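One way to picture such a structured result is a record with one field per level. The sketch below is an assumption made for illustration only: `flow_result`, `or_unknown`, and `format_result` are hypothetical names and do not come from PACE or nDPI.

```c
#include <stdio.h>

/* Hypothetical record: one field per classification level listed above.
 * A NULL field means that the level could not be determined. */
struct flow_result {
    const char *ip_protocol;   /* a) e.g. "TCP" */
    const char *lower_proto;   /* b) e.g. "HTTP" */
    const char *application;   /* c) e.g. "Dropbox" */
    const char *content;       /* d) e.g. "WebM" */
    const char *service;       /* e) e.g. "YouTube" */
};

static const char *or_unknown(const char *s)
{
    return s ? s : "unknown";
}

/* Writes all five levels into buf, so that no level is silently dropped.
 * Returns the value returned by snprintf. */
int format_result(const struct flow_result *r, char *buf, size_t len)
{
    return snprintf(buf, len,
                    "ip=%s lower=%s app=%s content=%s service=%s",
                    or_unknown(r->ip_protocol), or_unknown(r->lower_proto),
                    or_unknown(r->application), or_unknown(r->content),
                    or_unknown(r->service));
}
```

Because every level is always reported, even as "unknown", the traffic can be accounted for on any single level without guessing what the other levels were.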
DPI – results generated by new PACE
BitTorrent:plain:not_detected
SSL:generic:not_detected
RDP:no_subprotocols:not_detected
HTTP:generic:not_yet_detected
unknown:no_subprotocols:not_yet_detected
HTTP:generic:youtube
BitTorrent:uTP:not_detected
HTTP:generic:youtube
SMB/CIFS:no_subprotocols:not_detected
Socks:socksv5:not_yet_detected
SSH:no_subprotocols:not_detected
PPLIVE:no_subprotocols:not_detected
HTTP:generic:not_detected
Skype:unknown:not_detected
BitTorrent:encrypted:not_detected
PPSTREAM:no_subprotocols:not_detected
DNS:no_subprotocols:not_detected
Google:encrypted:not_detected
Pando:no_subprotocols:not_detected
unknown:no_subprotocols:not_detected
NETBIOS:no_subprotocols:not_detected
HTTP:media:not_detected
Yahoo:webmail:not_detected
FLASH:no_subprotocols:not_detected
eDonkey:plain:not_detected
HTTP:generic:facebook
DPI – results generated by nDPI-ng
• proto: TCP->SSL_with_certificate->POP3S, service: Google → encrypted POP3
session with a Google mail server
• proto: TCP->SSL_with_certificate, service: Twitter → encrypted connection to a
Twitter server
• proto: TCP->FTP_Data, content: JPG → file-transfer FTP session, which carries a
JPG image
• proto: TCP->SSL_with_certificate->Dropbox, service: Dropbox → encrypted Dropbox
session (the application is Dropbox) with the Dropbox server
• proto: TCP->SSL_with_certificate, service: Dropbox → encrypted session with a
Dropbox server, while the application is unknown (it can be a web browser
connection)
• proto: TCP->HTTP, content: WebM, service: YouTube → a flow from YouTube, which
transports a WebM movie
• proto: UDP->DNS, service: Facebook → DNS query about
a hostname belonging to Facebook
Using QoS markers
• Class of service (CoS): 3-bit field that is present in an Ethernet
frame header when 802.1Q VLAN tagging is present
• Very easy to cheat – everyone can set it to any value
• Most Internet Service Providers do not trust incoming QoS
markings from their customers
What can we get?
● Nothing more than previously set
by a user or an application
Using QoS markers
• IP packets contain the Type of Service field, which can be used
for layer-3 QoS marking
What can we get?
● Nothing more than previously set
by a user or an application;
limited to trusted devices in the
network
Using QoS markers
• Valid values for IP Precedence: 0 - 7
• Valid values for DSCP: 0 – 63
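Both markers are read from the same byte of the IP header. A minimal sketch, assuming the caller has already extracted the ToS byte from the packet; the function names are hypothetical.

```c
#include <stdint.h>

/* IP Precedence: the three most significant bits of the ToS byte (0-7). */
uint8_t ip_precedence(uint8_t tos)
{
    return (uint8_t)(tos >> 5);
}

/* DSCP: the six most significant bits of the same byte (0-63). */
uint8_t dscp(uint8_t tos)
{
    return (uint8_t)(tos >> 2);
}
```

For a ToS byte of 0xB8 this gives DSCP 46 (Expedited Forwarding) and IP Precedence 5, which shows that the two schemes overlap: the precedence value is simply the top three bits of the DSCP value.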
Statistical classification
• Based on rules, which can be written manually (slow and
inefficient) or derived automatically by the use of Machine
Learning Algorithms (MLAs)
• Very broad choice of MLAs: K-Nearest Neighbors, K-Means,
Naive Bayes Filter, C4.5, J48, Random Forest, etc
• Achievable detection rate is over 95%
• MLAs require significant amount of good quality training data
• But... the speed is the power!
What can we get?
● application
● content (indirectly)
● service (indirectly)
So how can we use the statistical methods?
What can we use to classify the traffic by the statistical methods?
• IP protocol level → Protocol field of the IP header
• Application level → statistical classification by packet sizes,
ports, TCP flags, flow durations, etc
• Content level → statistical classification by IP addresses
• Service provider level → statistical classification by IP
addresses
What is the real result?
● Pretty good accuracy for the cases on which the MLA was trained
● Poor accuracy for all the other cases
Identification of service providers
• Monitoring of DNS replies delivers the required information
• Problem: many service providers can share the same IP address (for example,
content served through a CDN, as in the Akamai reply below)
tcpdump -v -K -n -N -t -i eth0 udp src port 53
IP (tos 0x0, ttl 46, id 30600, offset 0, flags [none], proto UDP (17), length 102)
8.8.8.8.53 > 172.26.10.88.58238: 33261 2/0/0 www.facebook.com. CNAME star.c10r.facebook.com.,
star.c10r.facebook.com. A 31.13.72.17 (74)
IP (tos 0x0, ttl 46, id 26945, offset 0, flags [none], proto UDP (17), length 181)
8.8.8.8.53 > 172.26.10.88.46207: 10707 4/0/0 fbstatic-a.akamaihd.net. CNAME fbstatic-a.akamaihd.net.edgesuite.net.,
fbstatic-a.akamaihd.net.edgesuite.net. CNAME a1168.dsw4.akamai.net.,
a1168.dsw4.akamai.net. A 95.101.2.73, a1168.dsw4.akamai.net. A 95.101.2.91 (153)
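A monitoring point could keep the observed replies in a small table that maps each resolved address back to the name the client asked for. The sketch below is only an illustration under stated assumptions: the fixed-size array, the names `dns_entry`, `dns_cache_put`, and `dns_cache_get`, and the overwrite-on-conflict policy are all hypothetical, and parsing of the DNS reply itself is left out.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define CACHE_SIZE 64

/* Hypothetical cache entry: one resolved address and the queried name. */
struct dns_entry {
    char ip[16];         /* dotted IPv4 address taken from an A record */
    char hostname[128];  /* name the client originally asked for */
};

static struct dns_entry cache[CACHE_SIZE];
static size_t cache_len = 0;

/* Records an A record seen in a DNS reply. A newer entry for the same
 * address overwrites the older one. */
void dns_cache_put(const char *ip, const char *hostname)
{
    size_t i;
    for (i = 0; i < cache_len; i++)
        if (strcmp(cache[i].ip, ip) == 0)
            break;
    if (i == cache_len) {
        if (cache_len == CACHE_SIZE)
            return;      /* cache full: the record is dropped */
        cache_len++;
    }
    snprintf(cache[i].ip, sizeof cache[i].ip, "%s", ip);
    snprintf(cache[i].hostname, sizeof cache[i].hostname, "%s", hostname);
}

/* Returns the cached hostname for a flow's remote address, or NULL. */
const char *dns_cache_get(const char *ip)
{
    for (size_t i = 0; i < cache_len; i++)
        if (strcmp(cache[i].ip, ip) == 0)
            return cache[i].hostname;
    return NULL;
}
```

The overwrite in `dns_cache_put` is exactly where the shared-address problem shows up: when two providers resolve to the same CDN address, only the name seen last is remembered.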
Part III
A deep look into
Deep Packet Inspection
How much information is needed?
• It depends on the specific DPI tool
• Libprotoident requires only 4 bytes of packet payload in each
direction to recognize the traffic. The price: only IP protocol and
application levels can be determined.
• Other tools also process the following bytes, looking for specific
signatures of a content type or a service.
• Some signatures can identify the traffic after the first
packet with payload (as with DNS, NTP, or BitTorrent). Finding the
web service or content in an HTTP flow usually requires the first 4
packets.
• At most 10 packets in each direction should be
sufficient to determine all the flow characteristics.
Which information is used by DPI?
Libprotoident: comparison of the first 4 bytes of payload + the
packet lengths + port numbers
/* First 4 bytes of payload, in either direction */
if (!match_str_either(data, "\x01\x00\x00\x00"))
    return false;
/* Three fixed bytes followed by a wildcard byte */
if (!match_chars_either(data, 0x00, 0x00, 0x00, ANY))
    return false;
/* Payload lengths of the first packet in each direction */
if (data->payload_len[0] == 4 && data->payload_len[1] == 1)
    return true;
/* Port numbers */
if (data->server_port != 53 && data->client_port != 53)
    return false;
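To make the wildcard comparison concrete, here is a minimal, self-contained sketch of a four-byte signature check. `match_first4` and `ANY_BYTE` are hypothetical names chosen for this example; they are not Libprotoident functions and the real implementation may differ.

```c
#include <stdbool.h>
#include <stdint.h>

#define ANY_BYTE (-1)   /* hypothetical wildcard marker */

/* True if each of the first four payload bytes either equals the
 * expected value or the expected value is ANY_BYTE. */
bool match_first4(const uint8_t payload[4], int b0, int b1, int b2, int b3)
{
    const int expected[4] = { b0, b1, b2, b3 };
    for (int i = 0; i < 4; i++) {
        if (expected[i] != ANY_BYTE && payload[i] != expected[i])
            return false;
    }
    return true;
}
```

A check of this kind only ever looks at four bytes, which is why it is cheap but cannot say anything about content or service.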
Which information is used by DPI?
PACE / OpenDPI / nDPI perform the same kinds of checks:
if ((payload_len > 0) && match_first_bytes(packet->payload, "\xe9\x03\x41\x01"))
    NDPI_LOG(0, ndpi_struct, NDPI_LOG_DEBUG, "Found PPLIVE.\n");
if ((payload_len == 0) || ((payload_len == 2) && (packet->payload[0] == 0x05) && (packet->payload[1] == 0x00)))
    NDPI_LOG(0, ndpi_struct, NDPI_LOG_DEBUG, "Found SOCKS5.\n");
if ((payload_len == 0) || (payload_len == 49) || (payload_len == 94))
    NDPI_LOG(0, ndpi_struct, NDPI_LOG_DEBUG, "Found PPLIVE.\n");
if (packet->udp->dest == htons(5041) || packet->udp->source == htons(5041))
    NDPI_LOG(0, ndpi_struct, 0, "Possible PPLIVE ...\n");
Which information is used by DPI?
• But they are done for each packet separately (in the order in which
the packets arrive), so we do not have access to the payload of
the previous packet
• The detection status is kept in state variables associated with
the particular flow
Which information is used by DPI?
However, they also use a number of other methods, such as an IP address check:
/*
Apple (FaceTime, iMessage,...)
17.0.0.0/8
*/
if(((saddr & 0xFF000000 /* 255.0.0.0 */) == 0x11000000 /* 17.0.0.0 */)
|| ((daddr & 0xFF000000 /* 255.0.0.0 */) == 0x11000000 /* 17.0.0.0 */)) {
flow->ndpi_result_service = NDPI_RESULT_SERVICE_APPLE;
}
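The hard-coded mask above is one instance of a general prefix match. A minimal sketch of the general form, assuming both addresses are given in host byte order; `in_subnet` is a hypothetical helper, not an nDPI function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if addr (host byte order) lies inside network/prefix_len. */
bool in_subnet(uint32_t addr, uint32_t network, unsigned prefix_len)
{
    assert(prefix_len <= 32);
    /* Shifting a 32-bit value by 32 is undefined in C, so /0 is
     * handled as a special case. */
    uint32_t mask = prefix_len == 0 ? 0 : 0xFFFFFFFFu << (32 - prefix_len);
    return (addr & mask) == (network & mask);
}
```

With this helper the Apple check reads `in_subnet(saddr, 0x11000000, 8) || in_subnet(daddr, 0x11000000, 8)`. Like any address-based rule, it identifies the owner of the address block and not the application that produced the flow.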
Which information is used by DPI?
Or TCP flags:
if (packet->tcp->psh != 0 && flow->rtmp_bytes == 1537)
NDPI_LOG(0, ndpi_struct, NDPI_LOG_DEBUG,
Or even the number of processed packets:
if (flow->packet_counter > 20)
NDPI_LOG(0, ndpi_struct, NDPI_LOG_DEBUG.....
Which information is used by DPI?
In order to discover web services and types of HTTP content,
nDPI parses the HTTP headers to find the “host” and “content-type” lines.
The “host” field is compared against domain names associated
with the particular service, such as:
"amazon.com" -> NDPI_RESULT_SERVICE_AMAZON
"amazonaws.com" -> NDPI_RESULT_SERVICE_AMAZON
"amazon-adsystem.com" -> NDPI_RESULT_SERVICE_AMAZON
".apple.com" -> NDPI_RESULT_SERVICE_APPLE
".mzstatic.com" -> NDPI_RESULT_SERVICE_APPLE
Which information is used by DPI?
The “content-type” field is compared against predefined values
associated with the particular types of the content:
"video/mp4" -> NDPI_RESULT_CONTENT_MPEG
"video/mpeg" -> NDPI_RESULT_CONTENT_MPEG
"video/nsv" -> NDPI_RESULT_CONTENT_MPEG
"misc/ultravox" -> NDPI_RESULT_CONTENT_MPEG
"audio/ogg" -> NDPI_RESULT_CONTENT_OGG
"video/ogg" -> NDPI_RESULT_CONTENT_OGG
How to deal with the encrypted traffic?
• The share of encrypted web traffic grew from 20% (in 2011) to 45%
(in 2014) of all web traffic
• The content is always unknown
• The application protocol (HTTPS, POP3S, SMTPS, etc.) is
discovered based on ports (e.g., port 443 = HTTPS, port 465 = SMTPS)
• The service is discovered based on:
a) inspection of the server field in certificates (nDPI)
b) matching with services based on cached DNS replies (TSTAT)
Part IV
How to verify the accuracy
of classification tools?
The origin of the reference data
• The reference data (ground truth) are usually obtained in one
of the previously described ways, which causes incompleteness and
a high misclassification rate
• Publicly available databases very often contain incomplete and
inaccurate data
• So how to provide good quality data?
Monitoring on the user's level
• System sockets provide the name of the application associated
with each particular stream in the network
• Ability to split HTTP streams according to their content
• Fast and precise; avoids privacy issues
• Avoids the unreliability of port-based or statistical tools
Volunteer-Based System (VBS)
• Collects data from clients
• Enhanced privacy
• Application names are taken
from system sockets
• Recognizes different types of
HTTP contents
• Open-source, GPL licensed
• Windows (32/64-bit) and Linux
• Can be downloaded free of
charge from SourceForge:
http://vbsi.sourceforge.net
Design of the system
• The Volunteer-Based System consists of clients installed on users'
computers, the server located at Aalborg University, and
statistics generators. Each part of the software can be
developed independently, and the parts collaborate with each other
through SQL database interfaces
The concept of a flow
• Local end-point: IP address, port
• The way of transport: transport protocol
• Remote end-point: IP address, port
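The three elements of a flow can be written down as a key that every packet is compared against. This is a minimal sketch under assumptions: `flow_key` and `same_flow` are hypothetical names, IPv4 is assumed, and it is not taken from the VBS source code, which is written in Java.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flow key: local end-point, remote end-point, and the
 * transport protocol that connects them. */
struct flow_key {
    uint32_t local_ip;
    uint16_t local_port;
    uint32_t remote_ip;
    uint16_t remote_port;
    uint8_t  transport;   /* IP protocol number: 6 = TCP, 17 = UDP */
};

/* Two packets belong to the same flow when every field matches.
 * Fields are compared one by one because memcmp would also compare
 * the padding bytes inside the struct. */
bool same_flow(const struct flow_key *a, const struct flow_key *b)
{
    return a->local_ip == b->local_ip &&
           a->local_port == b->local_port &&
           a->remote_ip == b->remote_ip &&
           a->remote_port == b->remote_port &&
           a->transport == b->transport;
}
```

Keeping the local and remote end-points as separate fields, instead of source and destination, means that inbound and outbound packets of one connection map to the same key.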
Information logged for each flow
• Identifier of the client
• Start timestamp
• Hashed local, global, and remote IP addresses
• Local and remote ports
• Transport layer protocol
• Name of the application
• Name of the network
Information logged for each packet
• Identifier of the flow
• Direction (inbound / outbound)
• Size
• State of all TCP flags (for TCP flows)
• Time elapsed from the previous packet in the flow
• Type of the content (for HTTP flows)
Privacy is guaranteed!
• Masked users' identities
• Masked IP addresses (local, global, remote)
• Collecting only general information about transferred HTTP
content (as image/gif, video/flv, audio/mpeg)
• We do not perform deep inspection of application payloads; all
the collected information is obtained only from headers
Performance tests
• The tests were made on 16 client machines located in Poland
• The clients analyzed
121.21 GB of data
• The server stored 7.4 GB
of statistical data
• Communication between
the clients and the server
was responsible for
around 5% of the traffic
Performance of VBS
• CPU usage by our system does not exceed 5% on average
• The system has no noticeable impact on the performance of users'
computers
• VBS runs in the background as a Windows service or Linux
daemon, so it is completely transparent to the users
The official project website
The official project website http://vbsi.sourceforge.net contains:
• Broad description of the project
• Screenshots
• Roadmap
• Binary packages for Windows and Linux
• Source code in Git repository
• Comprehensive documentation of the source code
• Bug tracking and feature request system
Part V
Implementation and various
applications of VBS
- a host-based monitoring tool
Implementation of our VBS
• Developed in Java, using Eclipse environment
• Open source, GPL licensed, freely available to everyone
• VBS is split into client, server, and statistics generator
• Modular design, all parts can be developed independently
• Running in the background as a Windows service or Linux
daemon thanks to Yet Another Java Service Wrapper (YAJSW)
– an open source project that provides support for both 32-bit
and 64-bit versions of Windows and Linux
• The auto-update mechanism of the client
The client
The client is installed on users' computers and is responsible
for collecting data about the users' traffic and sending it
to the server. The client consists of the following modules:
• Packet capturer
• Socket monitor
• Flows generator
• Data transmitter
Packet capturer
• Uses jNetPcap Java library to collect packets from the network
interface. The library makes use of the installed WinPcap /
libpcap
• Captures all traffic passing the network interface except the
traffic from the local subnet (the traffic is filtered by Pcap on the
interface level)
• jNetPcap can detect and strip various headers (data-link, IP, TCP, UDP, HTTP, etc.)
• Packets are collected using native Pcap function loopPacket,
which saves the resources consumed by VBS
Socket monitor
• Calls the external socket monitoring tools every second to
ensure that even quick openings of sockets are registered
• Uses Netstat (Linux) or TCPViewCon (Windows) to get the list
of open sockets
• Monitors both TCP and UDP sockets and provides the
information about the time of opening, time of closing, and
name of the application associated with the socket
Flows generator
• Organizes the captured packets into flows
• Attaches the application name to the flow based on the
information from the socket monitor
• TCP flows are closed based on the time when the
corresponding socket is closed
• UDP flows are closed based on a timeout (one UDP socket can
be associated with many flows, since only the local end-point is
defined)
• Closed flows are stored in the SQLite database
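The timeout rule for UDP flows can be sketched as follows. The 60-second value and the names `udp_flow`, `udp_flow_touch`, and `udp_flow_expire` are assumptions made for this example; the slides do not state the timeout that VBS uses, and VBS itself is written in Java.

```c
#include <stdbool.h>
#include <time.h>

#define UDP_FLOW_TIMEOUT 60   /* seconds; an assumed value */

struct udp_flow {
    time_t last_packet;  /* arrival time of the most recent packet */
    bool   closed;
};

/* Called for every packet that matches the flow. */
void udp_flow_touch(struct udp_flow *f, time_t now)
{
    f->last_packet = now;
}

/* Called periodically; closes the flow once it has been idle for
 * longer than the timeout. Returns true if the flow is closed. */
bool udp_flow_expire(struct udp_flow *f, time_t now)
{
    if (!f->closed && difftime(now, f->last_packet) > UDP_FLOW_TIMEOUT)
        f->closed = true;
    return f->closed;
}
```

A TCP flow does not need this logic, because the closing of its socket already marks the end of the flow.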
Data transmitter
• When the local SQLite database file exceeds 700 kB, it is sent
to the server
• Raw sockets are used for the communication, but a simple
password authentication mechanism exists
• The transmitted file also includes the identifier of the client and
information about the computer on which the client is installed:
the version of the operating system and information about RAM and
CPU(s)
The server
• Responsible for registering and authenticating clients
• Receives SQLite database files from clients
• Obtained files are extracted into MySQL database installed on
the server machine
• There are the following tables in the MySQL database: Clients,
Flows, Packets, Applications, ContentTypes, Performance
• So far, we have collected information about 13,242,858 flows and
999,731,839 packets associated with them. The information
takes 75.4 GiB of disk space.
Risk assessment
• Our threat assessment showed that
the system can handle the majority of potential
security problems
Examples of the statistics
• Data obtained from 4 users:
- User 1 – private user in Denmark, joined December 28, 2011
- User 2 – private user in Poland, joined December 28, 2011
- User 3 – private user in Poland, joined December 31, 2011
- User 4 – private user in Denmark, joined April 24, 2012
• Statistics calculated for all users altogether and for each user
separately
• Statistics obtained on a per-application and a per-content-type basis
Number of flows vs amount of traffic
Top applications for all users
Torrent download vs upload
Top HTTP content-types for all users
Amounts of traffic vs content types
Characterizing the applications
• We tried to obtain characteristics of 5 different applications
based on traffic originated by all our users
• It is interesting to observe that 60% of packets (and 71% of
packets carrying data) for chrome are inbound
• For dropbox the number of inbound and outbound packets is
almost the same, but there is a large difference in the size of
the inbound and outbound packets
The End
Thank you for your attention!