
Decompression-Free Inspection:
DPI for Shared Dictionary
Compression over HTTP
Anat Bremler-Barr
Interdisciplinary Center Herzliya
Shimrit Tzur David
Interdisciplinary Center Herzliya &
The Hebrew University, Jerusalem
David Hay
The Hebrew University, Jerusalem
Yaron Koral
Tel Aviv University
1
Outline
Motivation
Background

◦ AC algorithm

Our solution
◦ The offline Phase
◦ The online phase

Experimental Results
2
Deep Packet Inspection (DPI)

Search for patterns in the packets' payload

Signature-based NIDS

Web-Application Firewalls
◦ Leakage prevention
◦ Content Filtering

Challenges:
◦ Thousands of known malicious patterns
◦ Real time, link rate

The performance of security tools is dominated by
the pattern-matching engine (Fisk & Varghese,
2002)
◦ Intrusion prevention
3
Compressed HTTP
84.1% of the top 1,000 sites compress
their traffic (a 19% increase in 8 months!).
Data compression is done by adding
references to repeated data.
There are two types of compression:

◦ Intra-response compression – the references
point to bytes within the response (Gzip/Deflate)
◦ Inter-responses/connections compression – the
references point to bytes in a separate file,
called dictionary (Google’s SDCH).
4
Example –
Intra-Response Compression
File1.html:
abcdefgabcd
File2.html:
abcdxyzbcdtr


TCP Connection Setup
Encode repeated
strings by pointer:
{distance, length}
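
The pointer idea can be illustrated with a minimal Python sketch. It assumes a simplified token list (literal strings and `(distance, length)` tuples); the function name `decode_intra` and the token encodings of the two files are illustrative assumptions, not the actual Gzip/Deflate bit format.

```python
def decode_intra(tokens):
    """Decode literals and back-references within a single response."""
    out = []
    for tok in tokens:
        if isinstance(tok, str):
            out.extend(tok)                  # literal bytes
        else:
            distance, length = tok           # pointer into earlier output
            for _ in range(length):
                out.append(out[-distance])
    return "".join(out)

# Hypothetical encodings of the two files from the slide.
print(decode_intra(["abcdefg", (7, 4)]))        # abcdefgabcd
print(decode_intra(["abcdxyz", (6, 3), "tr"]))  # abcdxyzbcdtr
```

Each response is decoded on its own, so File2.html cannot reuse the "abcd" already sent in File1.html.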
5
Example –
Inter-Response Compression




Dictionary:
abcd
File1.html:
abcdefgabcd
File2.html
abcdxyzbcdtr
TCP Connection Setup
Copy repeated
strings from the
dictionary:
(address, length)
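
A comparable sketch for dictionary-based copying, again under assumptions: the delta is modelled as a list of literals and `(address, length)` tuples, and the two delta encodings below are hypothetical (the encodings shown on the original slide did not survive). Real SDCH uses the VCDIFF format rather than this toy representation.

```python
def decode_inter(delta, dictionary):
    """Decode literals and (address, length) copies from a shared dictionary."""
    out = []
    for tok in delta:
        if isinstance(tok, str):
            out.append(tok)                                   # literal bytes
        else:
            address, length = tok
            out.append(dictionary[address:address + length])  # dictionary copy
    return "".join(out)

DICTIONARY = "abcd"  # dictionary from the slide
file1 = decode_inter([(0, 4), "efg", (0, 4)], DICTIONARY)
file2 = decode_inter([(0, 4), "xyz", (1, 3), "tr"], DICTIONARY)
print(file1)  # abcdefgabcd
print(file2)  # abcdxyzbcdtr
```

Because both files reference the same dictionary, repeated content is shared across responses and connections.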
6
Current NIDS Operation (1)
[Diagram: the client's request (GET /index.html, Accept-Encoding: SDCH) passes through the NIDS on its way to the server.]
NIDS: scan for intrusions
7
Current NIDS Operation (2)
[Diagram: the client's request (GET /index.html, Accept-Encoding: SDCH) passes through the NIDS on its way to the server.]
NIDS options: do not scan, or decompress, scan, and compress again
8
[Diagram: the client's request (GET /index.html, Accept-Encoding: SDCH) passes through the NIDS on its way to the server.]
NIDS: scan directly, with no decompression
9
Our Solution:
Decompression-Free Scanning

Focused on inter-response compression

Our algorithm works in two phases
◦ Offline phase - Scanning the dictionary
◦ Online phase - Scanning the delta files

Works at the rate of the compressed traffic
◦ Gains a 56% improvement compared with scanning
the plain text directly
10
Outline
Motivation
Background

◦ Aho-Corasick (AC) algorithm

Our solution
◦ The offline Phase
◦ The online phase

Experimental Results
11
Aho-Corasick (AC) Algorithm

Patterns: E, BE, BD, BCAA, BCD, CDBCAB

Finite State Machine (FSM)
◦ Regular states, accepting states

Goto function (black arrows)
◦ g(state, symbol) → state
◦ Example: g(s11, B) = s12

Each state corresponds to a label: the sequence of characters on its goto path from the root.
◦ The length of the label is the depth of the state
◦ Example: the label of s14 is BCAA

Failure function (red arrows)
◦ f(state) → state
◦ Taken when there is no goto transition
◦ Goes to the state whose label is the longest proper suffix of the current state's label
◦ Example: f(s11) = s13, so g(s11, A) → g(s13, A) = s14

[Figure: the AC automaton for the pattern set. State labels: s0 = root, s1 = E, s2 = B, s3 = BE, s4 = BD, s5 = BC, s6 = BCD, s7 = C, s8 = CD, s9 = CDB, s10 = CDBC, s11 = CDBCA, s12 = CDBCAB, s13 = BCA, s14 = BCAA]
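
The goto and failure functions can be sketched in Python as follows. This is a minimal illustration using the pattern set from the slide, not the authors' implementation; the function names and the list-of-dicts layout are assumptions. States are numbered in insertion order, so they differ from the slide's s0 to s14 numbering and are compared by label instead.

```python
from collections import deque

PATTERNS = ["E", "BE", "BD", "BCAA", "BCD", "CDBCAB"]  # from the slide

def build_ac(patterns):
    """Build Aho-Corasick tables; states are ints and 0 is the root."""
    goto, fail, label, out = [{}], [0], [""], [[]]
    for pat in patterns:                 # goto function: a trie of patterns
        s = 0
        for ch in pat:
            if ch not in goto[s]:
                goto.append({})
                fail.append(0)
                label.append(label[s] + ch)
                out.append([])
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].append(pat)
    queue = deque(goto[0].values())
    while queue:                         # breadth-first: shallow states first
        s = queue.popleft()
        for ch, t in goto[s].items():
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)       # longest proper suffix state
            out[t] = out[t] + out[fail[t]]     # inherit matches of suffixes
            queue.append(t)
    return goto, fail, label, out

def step(goto, fail, s, ch):
    """Follow failure transitions until a goto transition on ch exists."""
    while s and ch not in goto[s]:
        s = fail[s]
    return goto[s].get(ch, 0)

def scan(text, goto, fail, out, s=0):
    """Return the final state and all matches as (end_index, pattern)."""
    matches = []
    for i, ch in enumerate(text):
        s = step(goto, fail, s, ch)
        matches += [(i, p) for p in out[s]]
    return s, matches

goto, fail, label, out = build_ac(PATTERNS)
state, _ = scan("CDBCA", goto, fail, out)
print(label[state], "->", label[fail[state]])   # CDBCA -> BCA
print(scan("CDBCAA", goto, fail, out)[1])       # [(5, 'BCAA')]
```

The first printed line mirrors the slide's example f(s11) = s13, and the second shows the match found after falling back from CDBCA to BCA on the symbol A.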
Aho-Corasick Insights

The automaton remembers only its current state
◦ The input text ends with the label of the current state
◦ This label is the longest suffix of the text that can be a prefix of a match

No future pattern can begin before this label
Outline
Motivation
Background

◦ Aho-Corasick (AC) algorithm

Our solution
◦ The offline Phase
◦ The online phase

Experimental Results
14
Accelerator Algorithm Idea
The algorithm operates in two phases:
The Offline Phase:
◦ Scan the dictionary and store information about
the pattern matching results

The Online Phase:
◦ Scan the delta file and skip almost all referenced
bytes that were already scanned for patterns.
15
The Offline Phase

The dictionary is scanned using AC (from its first byte and from s0). We save the state after each byte.

Position:  0   1   2   3   4   5   6   7   8   9   10  11
Symbol:    D   B   E   A   A   C   D   B   C   A   B   C
State:     s0  s2  s3  s0  s0  s7  s8  s9  s10 s11 s12 s5

We also save information about the matched patterns that are found in the dictionary.
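
A minimal Python sketch of the offline phase, using the dictionary and pattern set from the slides. The function names and data layout are assumptions made for illustration; states are printed by label because the code numbers them differently from the slide.

```python
from collections import deque

PATTERNS = ["E", "BE", "BD", "BCAA", "BCD", "CDBCAB"]  # from the slides
DICTIONARY = "DBEAACDBCABC"                           # from the slides

def build_ac(patterns):
    """Build Aho-Corasick tables; states are ints and 0 is the root."""
    goto, fail, label, out = [{}], [0], [""], [[]]
    for pat in patterns:
        s = 0
        for ch in pat:
            if ch not in goto[s]:
                goto.append({})
                fail.append(0)
                label.append(label[s] + ch)
                out.append([])
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].append(pat)
    queue = deque(goto[0].values())
    while queue:                         # breadth-first: shallow states first
        s = queue.popleft()
        for ch, t in goto[s].items():
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            out[t] = out[t] + out[fail[t]]
            queue.append(t)
    return goto, fail, label, out

def offline_phase(dictionary, goto, fail, out):
    """Scan from the root; save the state after each byte and every match."""
    states, matches, s = [], [], 0
    for i, ch in enumerate(dictionary):
        while s and ch not in goto[s]:
            s = fail[s]
        s = goto[s].get(ch, 0)
        states.append(s)
        matches += [(i, p) for p in out[s]]   # (end position, pattern)
    return states, matches

goto, fail, label, out = build_ac(PATTERNS)
states, dict_matches = offline_phase(DICTIONARY, goto, fail, out)
print([label[s] for s in states])
print(dict_matches)   # [(2, 'BE'), (2, 'E'), (10, 'CDBCAB')]
```

The printed labels ('', 'B', 'BE', '', '', 'C', 'CD', 'CDB', 'CDBC', 'CDBCA', 'CDBCAB', 'BC') correspond to the saved states s0, s2, s3, s0, s0, s7, s8, s9, s10, s11, s12, s5.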
16
Challenges
Dictionary (positions 0-11): D B E A A C D B C A B C
Delta file: ABDB(5,4)AAB(1,4)
The uncompressed data is: A B D B C D B C A A B B E A A

Patterns/Signatures: E, BE, BD, BCAA, BCD, CDBCAB

We copy from an arbitrary position in the dictionary when the automaton is in an arbitrary state
◦ We show that no matter in what state and from which symbol we start to copy, the resulting state is reachable via failure transitions from the saved state.

Types of matches: left boundary, internal, right boundary
17
The Online Phase
Scan the delta file:

Uncompressed bytes - scan using AC.

Copy instruction (p,x): copy x bytes starting at dictionary position p
◦ The copied data was already scanned in the offline phase.
◦ We skip the scan for almost all of these bytes.

Handling internal matches is straightforward; see the paper for details.
18
The Online Phase - Right Boundary

When encountering a copy instruction (p,x), we want to stop scanning and jump to state[p+x-1]
◦ If the label of the state is longer than the copy value
  ▪ The label begins before the copy value
  ▪ The context of this state is not the same as in the online scan
  ▪ We take failure transitions to find a state with a sufficiently short label.
◦ Otherwise
  ▪ The label of the state is contained in the copy value
  ▪ This is the longest suffix that can lead to a match
Example – Right Boundary
Uncompressed data: …B
COPY(7,4): BCAB

Position:  0   1   2   3   4   5   6   7   8   9   10  11
Symbol:    D   B   E   A   A   C   D   B   C   A   B   C
State:     s0  s2  s3  s0  s0  s7  s8  s9  s10 s11 s12 s5

Go to state[10] = s12. depth(s12) > 4.
Go to f(s12) = s2. depth(s2) ≤ 4.
Current state is s2.
20
The Online Phase – Left Boundary

When encountering a copy instruction (p,x), we want to stop scanning and jump to state[p+x-1]
◦ If the number of bytes we have read from the copy value is less than the depth of the current state
  ▪ The label of the state begins before the copied bytes
  ▪ We scan the copy value until we reach a state whose label is no longer than the number of bytes read.
◦ Otherwise
  ▪ The label of the state is contained in the copy value
  ▪ Both offline and online scans have the same context
Example – Left Boundary
Uncompressed data: …B
COPY(5,4): CDBC

Position:  0   1   2   3   4   5   6   7   8   9   10  11
Symbol:    D   B   E   A   A   C   D   B   C   A   B   C
State:     s0  s2  s3  s0  s0  s7  s8  s9  s10 s11 s12 s5

(j = number of bytes read from the copy value)
j=0: depth=1, continue
j=1: depth=2, continue
j=2: depth=3, continue
j=3: stop scanning (depth(s9) ≤ 3)
22
Outline
Motivation
Background

◦ Aho-Corasick (AC) algorithm

Our solution
◦ The offline Phase
◦ The online phase

Experimental Results
23
Experimental Results

Input:
◦ google.com dictionary
◦ Pages for the 1,000 most popular Google queries.

Patterns
◦ Snort

The synthetic case
◦ Pattern files are built for each input file so that the input
file has different percentages of matches, from
25% to 100%.
24
The Algorithm Overheads
1. Traversing the failure transitions
◦ In the right boundary
2. Scanning the copy value
◦ In the left boundary
3. Memory consumption:
◦ The additional information of the offline phase.
◦ Total: 420 KB (per dictionary)
◦ Can be further reduced by a variable-length pointer encoding.
25
Failure Transitions –
Right Boundaries

If length ≥ depth, no failure transition is taken

In our experiments:
◦ The average is 2.35 failure transitions per file (average of 557 copy instructions per file)
26
Scanning the Copy Value – Left Boundary
Compression ratio – compressed/uncompressed
Scan ratio – scanned/uncompressed


Snort

The synthetic case
◦ Low percentage of matches: scan ratio ≈ compression ratio
◦ High percentage of matches
◦ Unrealistic case
◦ Scan ratio is between 1.05 and 1.2 times the compression ratio.
27
Regular Expression Results
Strings were extracted from the regular expressions and added to the pattern set.
When needed, we use an off-the-shelf Perl-compatible regular expression engine to scan additional parts of the text.

The overhead of the regular expressions is around 1%, which is almost negligible.
28
Questions??
29
Regular Expression

Very common in security-oriented patterns.
◦ In Snort, 55% of the rules contain a regular expression.

Composed of anchors and PCRE tokens.
For example, in the pattern: abc[1-9]*xyza{3,7}
The anchors are:
◦ abc
◦ xyz

The pcre tokens are:
◦ [1-9]*
◦ a{3,7}
30
Dealing with Regular Expression
1. The anchors are extracted from the regular expression offline.
2. The anchors are added to the pattern set.
3. If there is a regular expression all of whose anchors were matched:
◦ run an off-the-shelf regular expression engine until either a mismatch, a full pattern match, or the whole (limited) text is searched.
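
The three steps can be sketched in Python as below, using the pattern abc[1-9]*xyza{3,7} and its anchors. This is an assumption-laden illustration: Python's `re` module stands in for a Perl-compatible engine, the anchors are listed by hand rather than extracted automatically, and `find_anchors` is a hypothetical stand-in for the anchor matches that the string-matching scan would report.

```python
import re

REGEX = re.compile(r"abc[1-9]*xyza{3,7}")  # example pattern
ANCHORS = ["abc", "xyz"]                   # its anchors, listed by hand

def find_anchors(text):
    """Stand-in for the anchor matches reported by the string matcher."""
    return {a for a in ANCHORS if a in text}

def match_regex(text, matched_anchors):
    """Invoke the regex engine only if every anchor was already matched."""
    if not all(a in matched_anchors for a in ANCHORS):
        return None                        # engine is never invoked
    m = REGEX.search(text)
    return m.group(0) if m else None

for text in ["..abc123xyzaaaa..", "..abc123..", "xyz then abc"]:
    print(repr(text), "->", match_regex(text, find_anchors(text)))
```

Only the first text yields a match (abc123xyzaaaa). The second never reaches the engine because the anchor xyz is missing, and the third invokes the engine but fails the full pattern.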
31
Regular Expression –
Limited Search

In most cases, we can limit the search in at least one direction.
◦ If all tokens before the first anchor have a limited size, there is a bounded number of characters we should examine before the matched anchor.
◦ If all tokens after the last anchor have a limited size, there is a bounded number of characters we should examine after the matched anchor.
32
Memory Consumption
1. Doubling the size of the dictionary (for saving the offline scan results, one pointer per symbol)
2. Saving the matched list (for internal matches)
Our experiments:
◦ Match list size 40,000
◦ Dictionary size 116K symbols
◦ Pointer size 17 bits
Total memory consumption is 420 KB (per dictionary)
◦ Can be further reduced by a variable-length pointer encoding.
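
As a rough check of the first item, the size of the saved-state array follows from the figures above. The short Python calculation below assumes that "116K" means 116,000 symbols and that each symbol stores exactly one 17-bit pointer; how the rest of the 420 KB total is divided is not stated on the slide and is not modelled here.

```python
DICTIONARY_SYMBOLS = 116_000   # "116K symbols", assuming K = 1000
POINTER_BITS = 17              # one state pointer per dictionary symbol

state_array_bytes = DICTIONARY_SYMBOLS * POINTER_BITS / 8
addressable_states = 2 ** POINTER_BITS

print(state_array_bytes / 1000)   # 246.5 (KB for the saved states)
print(addressable_states)         # 131072 states fit in a 17-bit pointer
```

Under these assumptions the saved states take about 246.5 KB of the reported 420 KB.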
33