Zhenyu Wu, Steven Gianvecchio, Mengjun Xie Advisor: Dr. Haining Wang The College of WILLIAM k MARY Internet & Ubiquitous Computing ◦ Billions of networked computers ◦ Playground for malware Suppression Techniques ◦ Static analysis Low latency, high throughput Widely used, IDS deployable ◦ Dynamic analysis The College of WILLIAM k MARY 2 Un-obfuscated ◦ Binary in plain Oligomorphism ◦ Segments of the binary ◦ Simple transformation (XOR) Polymorphism Metamorphism State of the Art ◦ Control-flow encryption ◦ Byte frequency manipulation Statistical analysis ◦ Anomalies in code body ◦ Meta transformation (P-code) Algorithmic detection ◦ Build in transformations ◦ Compression and encryption Unique substring Advanced pattern matching ◦ N-gram signatures Semantic analysis ◦ Persist high-level fingerprints The College of WILLIAM k MARY 3 WANTED $5,000,000 The College of WILLIAM k MARY 4 The College of WILLIAM k MARY 5 Polymorphism ◦ Compression & Encryption Nobody looks like a small dark box! The College of WILLIAM k MARY 6 Metamorphism ◦ Reordering Components Wanted $5,000,000 Cannot evade feature detections The College of WILLIAM k MARY 7 Control Flow Encryption ◦ Prevent feature analysis Increases suspicion The College of WILLIAM k MARY 8 The Real Player ◦ Assume other people’s identity (Mimicry) The College of WILLIAM k MARY 9 Lessons Learned: ◦ Evasion without obfuscating features ◦ Evasion by refusing inspection ◦ Evasion by mimicking Obfuscating original features Open to inspection, but disguises detection The College of WILLIAM k MARY 10 Mimimorphism: ◦ Reversible transformation of an executable that produces output statically resembles other benign programs ◦ Characteristics: Completely erases features from the original binary High order statistics matches benign executables Transformed payload consists of “meaningful” control flows, highly resemble those from benign executables The College of WILLIAM k MARY 11 Text Stenography Technique ◦ Transforms the input data and produces mimicry output copies that assume statistical and grammatical (structural) properties of another type of data ◦ Originally proposed by Peter Wayner as means to transport sensitive data under harsh surveillance Novel use of Huffman coding The College of WILLIAM k MARY 12 Huffman Coding 0 ◦ Digesting Builds a Huffman tree according to the symbol frequency ◦ Encoding Removes redundancies of the input data using a given Huffman tree ◦ Decoding Recovers the original data from the “condensed” data by emitting symbols according to the original Huffman tree 0 m 1 s 1 a Huffman Tree 01 s 00 m 01 a mass 000111 (32 bits) (6 bits) The College of WILLIAM k MARY 13 What if we decode a piece of random data? ◦ Produces “meaningless” data, but The output exhibits similar symbol frequency to the digest - and Input data can be recovered by Huffman encode Regular Mimic Function ◦ Learn: Build a Huffman tree from sample text ◦ Mimicry: Huffman decode on input (randomized) ◦ Recover: Huffman encode The College of WILLIAM k MARY 14 Insufficiencies chi ◦ Produces illegible, garbled text ◦ Frequency distributions follow 2n distribution High-order Mimic Function 0 0 l 1 c 1 0 n ◦ Captures interdependencies 0 p 1 t ins Build multiple Huffman trees One for each unique symbol prefix ◦ Produces “sensible” text with much more “natural” symbol frequency distributions rou 0 n 1 g 1 t Huffman “Forest” The College of WILLIAM k MARY 15 Mimicry of Peter Wayner’s paper ◦ Produced by 6th order mimic function Each of these historical reason, I don’t recommend using gA(t) to choose the safe. These one-to-one encoded with n leaves and punctuation. The starting every intended to find the same order mimic files. A Method is to break the trees by constructing the mimics the path down the most even though, offer no way that is, in this paper. Figure will not overflow memory. These produced by truncating letter. This need to handle n-th ordered compartment of nonsense words cannot bear any resemblance to B because this task is a Huffman showed in [1], [2], [3] among others. The College of WILLIAM k MARY 16 The Challenge: Machine Language Mimicking ◦ Consists of instructions and control flows Each instruction has a strict format to follow Machines never make “typo”, or use wrong “tense”! ◦ Mimic function has no knowledge of instructions Often makes mistakes generating instructions Have a low success rate of creating mimicry control flows Our Solution ◦ Integrate a custom assembler / disassembler ◦ Help the mimic function understand the language The College of WILLIAM k MARY 17 Digesting Mimicry Target Exec. Binaries XOR Disassemble PUSH High Order Instruction Mimic Function DEC MOV Control Flows Instruction Huffman Forest Mimicry Digest The College of WILLIAM k MARY 18 Digesting Instruction Prefix MOV XOR PUSH DEC XOR Exec. PUSH Binary DEC MOV Inst. Prefixes (Atomic op., repeat, operand size, etc.) 0 1 ModR/M (Mod / Reg. / R/M) SIB (Scale / Idx. / Base) 0 MOV 1 INC PUSH Displacement Instruction Huffman Tree COMMON_INST Structure The College of WILLIAM k MARY 19 Digesting MOV Instruction Encoding Instruction TemplatePrefix MOV XOR PUSH DEC XOR PUSH Inst. Prefixes (Atomic op., repeat, operand size, etc.) 0 DEC MOV 1 ModR/M (Mod / Reg. / R/M) SIB (Scale / Idx. / Base) 0 MOV 1 INC PUSH Displacement Instruction Huffman Tree COMMON_INST Structure The College of WILLIAM k MARY 20 Digesting Instruction Encoding Template MOV MOV 0 1 Inst. Prefixes (Atomic op., repeat, operand size, etc.) 16bit ModR/M (Mod / Reg. / R/M) Inst. Prefix SIB (Scale / Idx. / Base) Displacement REP 0 1 EAX 0 ECX 1 EDX ModR/M 0 1 2x8+16 SIB 3x4+0 …… Displacement The College of WILLIAM k MARY 21 Digesting Instruction Encoding Instruction TemplatePrefix MOV MOV Inst. Prefixes (Atomic op., repeat, operand size, etc.) 0 16bit XOR PUSH 0 1 DEC 1 REP EAX 0 ModR/M (Mod / Reg. / R/M) SIB (Scale / Idx. / Base) Inst. Prefix ECX 0 0 1 1 EDX INC ModR/M 1 1 MOV Displacement 0 2x8+16 SIB PUSH 3x4+0 …… Instruction Huffman Tree Displacement The College of WILLIAM k MARY 22 Digesting MOV Inst. Prefixes (Atomic op., repeat, operand size, etc.) Instruction Prefix Instruction Prefix XOR PUSH DEC XOR PUSH DEC ModR/M (Mod / Reg. / R/M) SIB (Scale / Idx. / Base) 0 MOV 0 MOV 1 1 INC PUSH Displacement Instruction Huffman Tree The College of WILLIAM k MARY 23 Digesting Instruction Prefix DEC PUSH DEC MOV 0 CMP XCHG 0MOV 1 0 DEC 1 0 MOV PUSH POP XOR PUSH DEC JMP 1 CALL MOV 0 MOV 1 1 INC PUSH Mimimorphic Digest The College of WILLIAM k MARY 24 Encoding PRNG Binary Data High Order Instruction Mimic Function Mimicry Digest Assemble Mimicry Binaries The College of WILLIAM k MARY 25 Encoding XOR PUSH DEC 01001001 Binary 10010101 Data 00010100 10001001 0 0 XOR MOV PUSH 1 Mimicry Digest 1 INC PUSH DEC Instruction Huffman Tree Instruction Prefix The College of WILLIAM k MARY 26 Encoding 01001001 Binary 10010101 Data 00010100 10001001 Instruction Encoding Template MOV XOR PUSH 0 1 DEC 16bit 0 REP 1 Inst. Prefix 0 INC 1 0 MOV 2x8+16 0 1 EAX 0 ECX 1 EDX ModR/M 1 PUSH 3x4+0 …… Instruction Huffman Tree SIB Displacement The College of WILLIAM k MARY 27 Encoding Instruction Encoding Template MOV 0 01001001 10010101 00010100 10001001 16bit 1 REP Inst. Prefix 0 1 EAX 0 ECX 1 EDX ModR/M 0 1 2x8+16 SIB 3x4+0 …… Displacement The College of WILLIAM k MARY 28 Encoding MOV Instruction Encoding Template MOV 01001001 10010101 00010100 10001001 0 1 Inst. Prefixes (Atomic op., repeat, operand size, REP etc.) 16bit ModR/M (Mod /Inst. Reg.Prefix / R/M) SIB 0 / 1Base) (Scale / Idx. Displacement 3x4+0 2x8+16 SIBStructure COMMON_INST 0 1 EAX 0 ECX 1 EDX ModR/M …… Displacement The College of WILLIAM k MARY 29 Encoding MOV 01001001 10010101 00010100 10001001 Inst. Prefixes (Atomic op., repeat, operand size, etc.) XOR PUSH ModR/M (Mod / Reg. / R/M) SIB (Scale / Idx. / Base) DEC MOV ? Displacement COMMON_INST Structure The College of WILLIAM k MARY 30 Encoding Instruction Prefix 01001001 10010101 00010100 10001001 XOR XOR PUSH DEC PUSH MOV DEC MOV The College of WILLIAM k MARY 31 Decoding Mimicry Binaries Disassemble High Order Instruction Mimic Function Mimicry Digest Binary Data PRNG The College of WILLIAM k MARY 32 Training ◦ Select 100 Windows XP system files as mimicry target They represent typical legitimate binaries ◦ Trained using 7th and 8th order mimimorphic engines Most control flow basic blocks have 7-8 instructions Evaluations ◦ Statistical Anomaly Tests Kolmogorov-Smirnov Test & Entropy Test ◦ Semantic Detection Test Control Flow Fingerprinting The College of WILLIAM k MARY 33 Statistical Tests 0.09 ◦ Kolmogorov-Smirnov Test Maximum byte frequency distribution differences Legitimate: 0.074±0.045; Mimimorphic: 0.093±0.006 ◦ Entropy Test 0.074 0.516 Measurement of predictability (or randomness) of data Legitimate: 6.353±0.258; Mimimorphic: 6.528±0.021 6.353 The College of WILLIAM k MARY 34 Semantic Tests ◦ Control Flow Fingerprinting Statically analyze executables (with a special disassembler) and extract control flow patterns Detecting malwares by matching their characteristic control flow patterns (i.e., shared fingerprints) ◦ Between original binary and Mimimorphic instances Shared fingerprints: the lower the better Only 1 out of 100 instances share a single fingerprint (out of hundreds of thousands fingerprints) The College of WILLIAM k MARY 35 Semantic Tests ◦ Between mimimorphic and legitimate binaries Shared fingerprints: the higher the better 7th order mimimorphic instances: Average 1856.46±372.5 (72.93 benign files) Minimum 1057 (44 files); Maximum 3321 (92 files) 8th order mimimorphic instances: Average 11407.99±912.42 (81.37 benign files) Minimum 9606 (70 files); Maximum 14216 (91 files) The College of WILLIAM k MARY 36 Semantic Tests ◦ A sample mimicry control flow pattern Reproduced by a 7th order mimimorphic instance The College of WILLIAM k MARY 37 Application Constraint ◦ Memory consumption: 600MB for 7th order and 1.2GB for 8th order mimimorphic transformation Disk-based on-demand digest storage ◦ Size increase: 20x inflation for 7th order and 30x for 8th order mimimorphic transformation Typical malware are less than 100KB Mimimorphism results in 2~3MB files The College of WILLIAM k MARY 38 We propose mimimorphism as a novel binary obfuscation technique ◦ Enhanced high order mimic functions with custom assembler / disassembler ◦ Achieves evasion by disguising, not refusing detection ◦ Effective against both statistical anomaly detection as well as semantic fingerprinting tests The College of WILLIAM k MARY 39 Robustness against other approaches ◦ Automatic n-gram detections Typical x86 instruction length: 2.1~2.8 8th order mimimorphism can approach 16-gram mimicry Existing n-gram detection algorithms can hardly scale up to ◦ Static semantic analysis Mimimorphism does not target specific detection techniques Focuses on reproducing features from benign programs Immune to lower order signature detections The College of WILLIAM k MARY 40 Robustness against other approaches ◦ Deep syntactic analysis Fails to exactly reproduce high level syntactic features: 45% “functions” do not have matching prologue and epilogue Many jump instructions go across function boundaries Detectable program-level anomalies Not all programs follow conventions Could lead to false positives The College of WILLIAM k MARY 41 The College of WILLIAM k MARY The Problem of the Unpacker ◦ Mimimorphic transformation does not provide solution for hiding the unpacker ◦ However, we believe unpackers do benefit from using mimimorphism Unpacker is the weakness of polymorphism because it is easy to be “spotted” – all other payload is not executable! All mimimorphic payload is “executable”, separating unpacker code from the payload becomes non-trivial The College of WILLIAM k MARY 43 Decoding Mimicry Binaries Disassemble High Order Instruction Mimic Function Mimicry Digest Binary Data PRNG The College of WILLIAM k MARY 44 Decoding Instruction Prefix MOV XOR PUSH DEC XOR 00Mimicry PUSH Binary DEC Decoded MOV Bits Inst. Prefixes (Atomic op., repeat, operand size, etc.) 0 1 ModR/M (Mod / Reg. / R/M) SIB (Scale / Idx. / Base) 0 MOV 1 INC PUSH Displacement Instruction Huffman Tree COMMON_INST Structure The College of WILLIAM k MARY 45 Decoding Decoded Bits Instruction Prefix MOV MOV 00 Inst. Prefixes 0 (Atomic op., repeat, operand size, etc.) 16bit XOR PUSH 0DEC 1 1 REP EAX 0 1 0 1 ModR/M (Mod / Reg. / R/M) Inst. Prefix Decoded Bits SIB (Scale / Idx. / Base) ECX 0 0 EDX INC ModR/M 1 1 Displacement 2x8+16 SIB COMMON_INST Structure MOV PUSH 3x4+0 …… Instruction Huffman Tree Displacement The College of WILLIAM k MARY 46 Decoding 0101 Decoded Bits MOV MOV Inst. Prefixes (Atomic op., repeat, operand size, etc.) ModR/M (Mod / Reg. / R/M) SIB (Scale / Idx. / Base) Displacement 0 16bit 1 REP Inst. Prefix 0 1 EAX 0 ECX 1 EDX ModR/M 0 1 2x8+16 SIB 3x4+0 …… Displacement The College of WILLIAM k MARY 47 Decoding 0101 Decoded Bits MOV MOV Inst. Prefixes (Atomic op., repeat, 00 operand size, etc.) ModR/M (Mod / Reg. / R/M) 0 16bit XOR PUSH 0DEC 1 1 REP EAX 0 1 0 1 Inst. Prefix ECX 0 Decoded BitsSIB (Scale / Idx. / Base) Displacement 0 EDX INC ModR/M 1 1 MOV 2x8+16 SIB PUSH 3x4+0 …… Instruction Huffman Tree Displacement The College of WILLIAM k MARY 48 Decoding Instruction Prefix Instruction Prefix XOR PUSH DEC XOR 01001001 00 0101 10010101 PUSH DEC 0 MOV Decoded Bits 0 MOV 1 1 INC PUSH Instruction Huffman Tree The College of WILLIAM k MARY 49