Binary Analysis for Grammar and Model Extraction: Techniques and

advertisement

Binary Analysis for Botnet

Reverse Engineering & Defense

Dawn Song

UC Berkeley

Binary Analysis Is Important for Botnet Defense

• Botnet programs: no source code, only binary

• Botnet defense needs internal understanding of botnet programs

– C&C reverse engineering

• Different possible commands, encryption/decryption

– Botnet traffic rewriting

– Botnet infiltration

– Botnet vulnerability discovery

BitBlaze Binary Analysis Infrastructure: Architecture

• The first infrastructure:

Novel fusion of static, dynamic, formal analysis methods

• Loop extended symbolic execution

• Grammar-aware symbolic execution

– Whole system analysis (including OS kernel)

– Analyzing packed/encrypted/obfuscated code

Vine:

Static Analysis

Component

TEMU:

Dynamic Analysis

Component

Rudder:

Mixed Execution

Component

BitBlaze Binary Analysis Infrastructure

BitBlaze: Security Solutions via Program Binary Analysis

 Unified platform to accurately analyze security properties of binaries

 Security evaluation & audit of third-party code

 Defense against morphing threats

 Faster & deeper analysis of malware

Detecting

Vulnerabilities

Generating

Filters

Dissecting

Malware

BitBlaze Binary Analysis Infrastructure

The BitBlaze Approach & Research Foci

 Semantics based, focus on root cause:

Automatically extracting security-related properties from binary code for effective vulnerability detection & defense

1. Build a unified binary analysis platform for security

– Identify & cater common needs of different security applications

– Leverage recent advances in program analysis, formal methods, binary instrumentation/analysis techniques for new capabilities

2. Solve real-world security problems via binary analysis

• Extracting security related models for vulnerability detection

• Generating vulnerability signatures to filter out exploits

• Dissecting malware for real-time diagnosis & offense: e.g., botnet infiltration

More than a dozen security applications & publications

Plans

• Building on BitBlaze to develop new techniques

• Automatic Reverse Engineering of C&C protocols of botnets

• Automatic rewriting of botnet traffic to facilitate botnet infiltration

• Vulnerability discovery of botnet

Preliminary Work

• Dispatcher: Enabling Active Botnet Infiltration using Automatic Protocol Reverse-Engineering

• Binary code extraction and interface identification for botnet traffic rewriting

• Botnet analysis for vulnerability discovery

Dispatcher: Enabling Active Botnet

Infiltration using Automatic Protocol

Reverse-Engineering

Juan Caballero

Pongsin Poosankam

Christian Kreibich

Dawn Song

Automatic Protocol Reverse-Engineering

• Process of extracting the application-level protocol used by a program, without the specification

– Automatic process

– Many undocumented protocols (C&C, Skype, Yahoo)

• Encompasses extracting:

1. the Protocol Grammar

2. the Protocol State Machine

• Message format extraction is prerequisite

Challenges for Active Botnet Infiltration

• Goal: Rewrite C&C messages on either dialog side

1. Understand both sides of C&C protocol

– Message structure

– Field semantics

2. Access to one side of dialog only

3. Handle encryption/obfuscation

Technical Contributions

1. Buffer deconstruction , a technique to extract the format of sent messages

 Earlier work only handles received messages

2. Field semantics inference techniques, for messages sent and received

3. Designing and developing Dispatcher

4. Extending a technique to handle encryption

5. Rewriting a botnet dialog using information extracted by Dispatcher

Message Format Extraction

• Extract format of a single message

• Required by Grammar and State Machine extraction

GET / HTTP/1.1

HTTP/1.1 200 OK

[Polyglot]

[Dispatcher]

Message Field Tree

HTTP/1.1 200 OK\r\n\r\n

MSG

[0:18]

Field Range: [3:3]

Field Boundary: Fixed

Field Semantics: Delimiter

Field Keywords: <none>

Target: Version

Status Line

[0:16]

Delimiter

[17:18]

Version

[0:7]

Delimiter

[8:8]

Status-Code

[9:11]

Delimiter

[12:12]

Reason

[13:14]

Message format extraction has 2 steps:

1. Extract tree structure

2. Extract field attributes

Delimiter

[15:16]

Sent vs. Received

• Both protocol directions from single binary

• Different problems

– Taint information harder to leverage

– Focus on how message is constructed, not processed

• Different techniques needed:

– Tree structure  Buffer Deconstruction

– Field attributes  New heuristics

Outline

Introduction

Problem

Techniques

Buffer Deconstruction

Field Semantics Inference

Handling encryption

Evaluation

Buffer Deconstruction

• Intuition

– Programs keep fields in separate memory buffers

– Combine those buffers to construct sent message

• Output buffer

– Holds message when “send” function invoked

– Or holds unencrypted message before encryption

• Recursive process

– Decompose a buffer into buffers used to fill it

– Starts with output buffer

– Stops when there’s nothing to recurse

Buffer Deconstruction

HTTP/1.1 200 OK\r\n\r\n

C(8) D(1) E(3) F(1)

A(17)

Output Buffer (19)

G(2)

B(2)

H(2)

Status Line

[0:16]

MSG

[0:18]

Delimiter

[17:18]

[0:7] [8:8] [9:11] [12:12] [13:14] [15:16]

Version Delimiter Status

Code

Delimiter Reason Delimiter

• Message field tree = inverse of output buffer structure

• Output is structure of message field tree

– No field attributes, except range

Field Attributes Inference

• Attributes capture extra information

– E.g., inter-field relationships

• Techniques identify

– Keywords

– Length fields

– Delimiters

– Variable-length field

– Arrays

Attribute

Field Range

Value

[StartOffset : EndOffset]

Field Boundary Fixed, Length, Delimiter

Field Semantics IP address, Timestamp, …

Field Keywords <list of keyworkds in field>

Field Semantics

• A field attribute in the message field tree

• Captures the type of data in the field

Field Semantics

Cookies

Error codes

Keyboard input

Keywords

File data

File information

Length

Padding

Filenames Ports

Hash / Checksum Registry data

Hostnames Sleep timers

Host information Stored data

IP addresses Timestamps

• Programs contain much semantic info  leverage it!

• Semantics in well-defined functions and instructions

– Prototype

• Similar to type inference

• Differs for received and sent messages

Field Semantic Inference

File path

GET /index.html HTTP/1.1

HTTP/1.1 200 OK

Content-Length: 25 File length

<html>Hello world!</html> stat(“index.html”, &file_info);

OUT IN OUT int stat(const char*path, struct stat *buf);

} struct stat {

… off_t st_size; /* total size in bytes */

Detecting Encoding Functions

• Encoding functions = (de)compression,

(de)(en)cryption, (de)obfuscation…

• High ratio of arithmetic & bitwise instructions

• Use read/write set to identify buffers

• Work-in-progress on extracting and reusing encoding functions

MegaD C&C protocol

• C&C on tcp/443 using proprietary encryption

• Use Dispatcher’s output to generate grammar

– 15 different messages seen (7 recv, 8 sent)

– 11 field semantics type MegaD_Message = record { msg_len : uint16; encrypted_payload: bytestring &length = 8*msg_len;

} &byteorder = bigendian; type encrypted_payload = record { version : uint16;

}; mtype : uint16; data : MegaD_data (mtype); type MegaD_data (msg_type: uint16) = case msg_type of {

0x00 -> m00 : msg_0;

[…] default -> unknown : bytestring &restofdata;

};

MegaD Dialog

C&C Server

SMTP Test Server

MegaD Rewriting

C&C Server

SMTP Test Server Template Server

Summary

• Buffer deconstruction, a technique to extract the format of sent messages

• Field semantics inference techniques, for messages sent and received

• Designed and developed Dispatcher

• Extended technique to handle encryption

• Rewrote MegaD dialog using information extracted by Dispatcher

Download