The 15th Annual Network and Distributed System Security Symposium

Automatic Protocol Format Reverse Engineering through Context-Aware Monitored Execution

Zhiqiang Lin (1), Xuxian Jiang (2), Dongyan Xu (1), Xiangyu Zhang (1)
(1) Purdue University  (2) George Mason University
February 12th, 2008

Motivation

Protocol reverse engineering
- A process to recover protocol specifications
- E.g., fields and their relationships

Applications:
- Network-based intrusion detection: DoS attacks, port scans, computer systems
- Network management: correctly recognize and monitor traffic
- Fuzz testing: a software testing technique
- …

Challenges

0x0040: cd46 4745 5420 2f6e 6577 732e 6874 6d6c
0x0050: 2048 5454 502f 312e 300d 0a55 7365 722d
0x0060: 4167 656e 743a 2057 6765 742f 312e 3130
0x0070: 2e32 2028 5265 6420 4861 7420 6d6f 6469
0x0080: 6669 6564 290d 0a41 6363 6570 743a 202a
0x0090: 2f2a 0d0a 486f 7374 3a20 3132 392e 3137
0x00a0: 342e 3838 2e37 310d 0a43 6f6e 6e65 6374
0x00b0: 696f 6e3a 204b 6565 702d 416c 6976 650d
0x00c0: 0a0d 0a

[Figure labels on the dump: Hierarchical, Parallel, Sequential]

- Multiple fields in a single message
- Non-static size of fields
- Complex relationships among protocol fields

Challenges

HTTP-Request = Request-Line
               (( general-header | request-header | entity-header ) CRLF)*
               CRLF
               [ message-body ]
Request-Line = Method SP Request-URI SP HTTP-Version CRLF

A BNF specification of the HTTP request (RFC 2616). [Figure labels on the grammar: Parallel, Sequential, Hierarchical]

- Hierarchical relation: a field can be further divided into multiple sub-fields.
- Sequential relation: captures the ordering between adjacent fields in a protocol.
- Parallel relation: the positions of two or more fields are exchangeable in the protocol specification.

Note: SP and CRLF are separators.

Related Work

Network trace
- Protocol Informatics
- Discoverer [W. Cui et al., Security'07]

Binary analysis
- Polyglot [J. Caballero et al., CCS'07]
- Automatic Network Protocol Analysis [G. Wondracek et al., NDSS'08]

Observation

int read_header(int sid) {
    ...
    sgets(line, sizeof(line)-1, conn[sid].socket);
    ...
    if (sscanf(line, "%[^ ] %[^ ] %[^ ]", conn[sid].dat->in_RequestMethod,
               conn[sid].dat->in_RequestURI, conn[sid].dat->in_Protocol)!=3)
    ...
    while (strlen(line)>0) {
        ...
        if (strncasecmp(line, "Cookie: ", 8)==0)
            strncpy(conn[sid].dat->in_Cookie, (char *)&line+8,
                    sizeof(conn[sid].dat->in_Cookie)-1);
        if (strncasecmp(line, "Host: ", 6)==0)
            strncpy(conn[sid].dat->in_Host, (char *)&line+6,
                    sizeof(conn[sid].dat->in_Host)-1);
        ...
        if (strncasecmp(line, "User-Agent: ", 12)==0)
            strncpy(conn[sid].dat->in_UserAgent, (char *)&line+12,
                    sizeof(conn[sid].dat->in_UserAgent)-1);
    }
    ...
}

Code snippet in http.c (null-httpd-0.5.0)

- The REQUEST LINE field is divided into METHOD, REQUEST URI and HTTP VERSION (the sscanf call).
- Cookie, Host and User-Agent are parallel fields (the strncasecmp checks inside the loop).

AutoFormat -- Basic Idea

[Figure: the input bytes "G E T" and "/ n e w s…" are shown as protocol fields, each under its own execution context: one context, one field; another context, another field.]

System Overview

Input: GET /news.html
Context-aware Execution Monitor: logs the call stack for each input byte that is accessed.

Log (offset, byte, call stack):

0 'G'
main->ap_mpm_run->0x15C57->0x15B38->0x15941->ap_process_connection->ap_run_process_connection->0xF5A8->ap_read_request->ap_rgetline_core->ap_get_brigade->0x2D2CE->ap_get_brigade->0x2D667->apr_brigade_split_line->memchr
1 'E'
main->ap_mpm_run->0x15C57->0x15B38->0x15941->ap_process_connection->ap_run_process_connection->0xF5A8->ap_read_request->ap_rgetline_core->ap_get_brigade->0x2D2CE->ap_get_brigade->0x2D667->apr_brigade_split_line->memchr
2 'T'
main->ap_mpm_run->0x15C57->0x15B38->0x15941->ap_process_connection->ap_run_process_connection->0xF5A8->ap_read_request->ap_rgetline_core->ap_get_brigade->0x2D2CE->ap_get_brigade->0x2D667->apr_brigade_split_line->memchr
…
24 '\n'
main->ap_mpm_run->0x15C57->0x15B38->0x15941->ap_process_connection->ap_run_process_connection->0xF5A8->ap_read_request->ap_rgetline_core->ap_get_brigade->0x2D2CE->ap_get_brigade->0x2D667->apr_brigade_split_line->memchr
…
0 'G'
main->ap_mpm_run->0x15C57->0x15B38->0x15941->ap_process_connection->ap_run_process_connection->0xF5A8->ap_read_request->ap_getword_white

EIP of the five logged entries, in order: 0x4BA56A2, 0x4BA56A2, 0x4BA56A2, 0x4BA56A2 (offsets 0, 1, 2 and 24, accessed in memchr), 0x1F7F3 (offset 0, accessed in ap_getword_white)

Protocol Field Identifier

Analyze the log file:
- Step 1: build the protocol field tree from the logged data.
- Step 2: refine the tree using three heuristics.
- Step 3: output the result.

Example: Apache log data

0 'G'
main->ap_mpm_run->0x15C57->0x15B38->0x15941->ap_process_connection->ap_run_process_connection->0xF5A8->ap_read_request->ap_rgetline_core->ap_get_brigade->0x2D2CE->ap_get_brigade->0x2D667->apr_brigade_split_line->memchr
1 'E'
main->ap_mpm_run->0x15C57->0x15B38->0x15941->ap_process_connection->ap_run_process_connection->0xF5A8->ap_read_request->ap_rgetline_core->ap_get_brigade->0x2D2CE->ap_get_brigade->0x2D667->apr_brigade_split_line->memchr
2 'T'
main->ap_mpm_run->0x15C57->0x15B38->0x15941->ap_process_connection->ap_run_process_connection->0xF5A8->ap_read_request->ap_rgetline_core->ap_get_brigade->0x2D2CE->ap_get_brigade->0x2D667->apr_brigade_split_line->memchr
…
24 '\n'
main->ap_mpm_run->0x15C57->0x15B38->0x15941->ap_process_connection->ap_run_process_connection->0xF5A8->ap_read_request->ap_rgetline_core->ap_get_brigade->0x2D2CE->ap_get_brigade->0x2D667->apr_brigade_split_line->memchr
…
24 '\n'
main->ap_mpm_run->0x15C57->0x15B38->0x15941->ap_process_connection->ap_run_process_connection->0xF5A8->ap_read_request->ap_rgetline_core
23 '\r'
main->ap_mpm_run->0x15C57->0x15B38->0x15941->ap_process_connection->ap_run_process_connection->0xF5A8->ap_read_request->ap_rgetline_core
0 'G'
main->ap_mpm_run->0x15C57->0x15B38->0x15941->ap_process_connection->ap_run_process_connection->0xF5A8->ap_read_request->ap_getword_white
1 'E'
main->ap_mpm_run->0x15C57->0x15B38->0x15941->ap_process_connection->ap_run_process_connection->0xF5A8->ap_read_request->ap_getword_white
2 'T'
main->ap_mpm_run->0x15C57->0x15B38->0x15941->ap_process_connection->ap_run_process_connection
->0xF5A8->ap_read_request->ap_getword_white
…

EIP of the nine logged entries, in order: 0x4BA56A2, 0x4BA56A2, 0x4BA56A2, 0x4BA56A2 (memchr), 0x26187, 0x26322 (ap_rgetline_core), 0x1F7F3, 0x1F7F3, 0x1F7F3 (ap_getword_white)

[Figure: the logged bytes are grouped by context into "GET /news.html HTTP/1.0\r\n", "\n", "\r" and "GET".]

Step 1 -- Building Protocol Field Tree

- A parent node contains the offsets of its children.
- The root contains the offsets of all input data.

[Figure: root node with children "GET /news.html HTTP/1.0\r\n" and "User-Agent: Wget/1.10.2 (Red Hat modified)\r\nAccept: */*\r\n…."]

Step 1: Building Protocol Field Tree

Problems in the initial tree:
- Missing SPACE before "/n"
- Redundancy in fields
- Overly fine-grained fields

[Figure: initial tree for "GET /news.html HTTP/1.0\r\n", with nodes such as "GET /news.html", "GET", "HTTP/1.0", "/", "news.html", "H", "TTP/1.0", "\r" and "\n"; several nodes are repeated.]

Step 2: Refinement (Tokenization)

Merge two child nodes if their content can form one token (based on text-based protocols).

[Figure: the tree before and after tokenization. Before merging, fragments such as "/" and "news.html", "H" and "TTP/1.0", and "\r" and "\n" are separate nodes.]

Step 2: Refinement (Redundant Node Deletion)

An internal node is redundant if it has only one child.

[Figure: the tree before and after redundant node deletion; repeated "GET /news.html HTTP/1.0\r\n" and "/news.html HTTP/1.0" nodes are removed.]

Step 2: Refinement (Node Insertion)

Insert a new child node under a parent if the offsets of its children do not match the offsets of the parent.

[Figure, before insertion: root "GET /news.html HTTP/1.0\r\n" with nodes "GET /news.html", "GET", "HTTP/1.0" and "\r\n".]
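The redundant-node deletion and node insertion rules above can be sketched in a few lines of Python. This is only an illustrative sketch, not AutoFormat's implementation: the `Node` class, the half-open offset ranges and both function names are hypothetical. Two assumptions go beyond the slides: a single-child node is collapsed only when the child covers exactly the same offsets (so no input bytes are lost), and each token is taken to include its trailing separator.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical field-tree node covering input offsets [start, end)."""
    start: int
    end: int
    children: List["Node"] = field(default_factory=list)


def delete_redundant(node: Node) -> None:
    """Collapse an internal node whose only child covers the same offsets.

    Assumption: the slide states only the one-child condition; this sketch
    also requires identical ranges so that no offsets disappear.
    """
    for child in node.children:
        delete_redundant(child)
    if len(node.children) == 1:
        only = node.children[0]
        if (only.start, only.end) == (node.start, node.end):
            node.children = only.children  # drop the duplicate level


def insert_missing(node: Node) -> None:
    """Add leaf children for parent offsets that no existing child covers."""
    if not node.children:
        return  # leaves have nothing to reconcile
    node.children.sort(key=lambda c: c.start)
    filled: List[Node] = []
    cursor = node.start
    for child in node.children:
        if child.start > cursor:  # gap before this child
            filled.append(Node(cursor, child.start))
        filled.append(child)
        cursor = max(cursor, child.end)
    if cursor < node.end:  # gap after the last child
        filled.append(Node(cursor, node.end))
    node.children = filled
    for child in node.children:
        insert_missing(child)


# Example on "GET /news.html HTTP/1.0\r\n" (25 bytes).
msg = "GET /news.html HTTP/1.0\r\n"
root = Node(0, 25, [
    Node(0, 25, [                     # duplicate of the root: redundant
        Node(0, 15, [Node(0, 4)]),    # "GET /news.html " with only "GET " below it
        Node(15, 23),                 # "HTTP/1.0"
        Node(23, 25),                 # "\r\n"
    ])
])
delete_redundant(root)
insert_missing(root)
for child in root.children[0].children:
    print(repr(msg[child.start:child.end]))
```

In this example the duplicate level under the root is removed, and a new leaf for the uncovered offsets 4 to 15 is inserted next to the "GET " node.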
[Figure, after insertion: the node "/news.html" is added, giving the nodes "GET /news.html HTTP/1.0\r\n", "GET /news.html", "GET", "/news.html", "HTTP/1.0" and "\r\n".]

Step 3: Output the Result

[Figure: final tree for "GET /news.html HTTP/1.0\r\n" with nodes "GET /news.html", "GET", "/news.html", "HTTP/1.0" and "\r\n", labeled 1 to 4.]

Hierarchical: given by the tree itself.

Parallel:
- Collect the execution history of each node.
- For a parent, if its child nodes share a similar history, mark it.

Sequential:
- Pre-order traversal of the tree, which lists the leaf nodes and the parents of multiple parallel nodes.

Evaluation

- Implemented on top of Valgrind-3.2.3 for the context-aware execution monitor.
- The approach also applies to QEMU and PIN.

Benchmark
- 30 messages with six known protocols and one unknown protocol.

Evaluation metric
- Re: ratio of exact match, Re = |A ∩ W| / |W|
- A: set of fields identified by AutoFormat
- W: set of fields identified by Wireshark

Overall Result

- Re(F): Re for finest-grained fields
- Re(H): Re for hierarchical fields
- Re(P): Re for parallel fields

Notes from the results table (the table itself is not reproduced here): "100% match with Wireshark"; "(-)" means |P| for Wireshark = 0.

Averages: Re(F) = 88.5%, Re(H) = 98.0%, Re(P) = 100.0%, Re = 93.4%

Discussion

- Dynamic trace dependency: AutoFormat does not detect message formats that are not present in the execution trace.
- Byte granularity: AutoFormat does not detect protocol fields at the bit level.
- Protocol state machine: AutoFormat does not correlate multiple messages of the same protocol session.
- Obfuscated binaries: AutoFormat does not handle this type of input.

Conclusion

The paper also includes Slapper worm messages as part of a second set of experimental results.

AutoFormat
- A tool for automatic protocol format extraction.

Key insight
- A protocol implementation is programmed to recognize the protocol format and usually contains protocol field-specific execution context. This context can be leveraged to infer the hierarchical structure of protocol fields, and even to recover their BNF structure.

Q&A

Thank you

For more information: {zlin, dxu, xyzhang}@cs.purdue.edu, xjiang@gmu.edu
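Backup: as a worked illustration of the evaluation metric Re = |A ∩ W| / |W| from the Evaluation slides, the sketch below computes the ratio for two sets of fields. The representation of a field as a (start, end) offset pair, the function name and the sample values are assumptions made for illustration; they are not data from the evaluation.

```python
from typing import Iterable, Optional, Tuple

Field = Tuple[int, int]  # hypothetical representation: (start, end) offsets


def exact_match_ratio(autoformat: Iterable[Field],
                      wireshark: Iterable[Field]) -> Optional[float]:
    """Return |A ∩ W| / |W|, or None when W is empty.

    The None case corresponds to the "(-)" entries, where Wireshark
    reports no fields of that kind.
    """
    a, w = set(autoformat), set(wireshark)
    if not w:
        return None
    return len(a & w) / len(w)


# Made-up example: Wireshark reports four fields, the tool matches two exactly.
w_fields = [(0, 4), (4, 15), (15, 23), (23, 25)]
a_fields = [(0, 4), (4, 15), (15, 25)]
print(exact_match_ratio(a_fields, w_fields))
```

Only exact matches count: the field (15, 25) overlaps two Wireshark fields but matches neither, so it contributes nothing to the ratio.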