A Fast Approximate Detector for the Win32.Simile Malware

Edna Milgo and Yasmine Kandissounon
TSYS Department of Computer Science, Columbus State University
Mid-East Chapter of ACM 2008 Fall Conference, Nov. 20-21, 2008, Gatlinburg, Tennessee, USA

Terminology
- Malware: a program designed to disrupt or damage a system's normal function
- Malware signature: a malware's unique characteristic or pattern
- Malware detection: identification of malware, usually by looking for a signature
- Metamorphic malware: malware that alters its own code (signature) to thwart detection

Signatures are Specific
- Many forms of malware exist
- The malware detector extracts patterns: bytes or behavior that uniquely identify a malware instance
[Figure: Malware 1, Malware 2, and Malware 3 are matched by Signature 1, Signature 2, and Signature 3 in the malware detector.]
[Figure labels: W32.Simile's engine; W32.Simile; M; Simile Eve; W32.Simile Signature.]

Metamorphic Malware / Metamorphic Engine
- Too many signatures challenge the AV scanners!
- The cost is storage and time
[Figure: Win32.Simile passes through the metamorphic engine (M) to produce Generation 1, Generation 2, …, Generation N; the malware detector then needs Signature 1, Signature 2, …, Signature N.]

Metamorphic Engine Transforms the Malware
[Figure: Win32.Simile … passing through the metamorphic engine.]

Targeted Metamorphic Malware
- We target W32.Simile
- Very sophisticated
- Various transformations:
  - Code substitution
  - Expansion
  - Compression
  - Permutation
  - Code encryption

Code Substitution
- lea eax,[ecx+3]
- mov eax,ecx followed by add eax,3
Both forms leave ecx+3 in eax, so the engine can swap one for the other.

Current Detection Method
- Look for the string: metAPHOR 1b BY tHe MeNTAl drilLER/29A
- Current engine and DAT files
- But: time consuming to store one signature per variant
- Also expensive to update signature databases online

Proposed Approach
- Faster detection
- No need to store one signature for each variant
- Faster since only disassembly is needed

Instruction Frequency Vector
- Definition: maps opcode mnemonics (instructions) to the frequency of their occurrence in an assembly language program.
- IFVs are easy to compute in linear time!
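To illustrate the linear-time claim, here is a minimal sketch of an IFV computation over a disassembly listing. The function name `instruction_frequency_vector` and the input format (one instruction per string, mnemonic first) are assumptions made for this illustration, not details taken from the authors' implementation. The sample input is the opcode sequence of the deck's example program P.

```python
from collections import Counter


def instruction_frequency_vector(instructions):
    """Return a mapping from opcode mnemonic to occurrence count.

    Assumption: each item is one disassembled instruction whose first
    whitespace-separated token is the mnemonic (e.g. "mov eax,ecx").
    A single pass over the input gives linear running time.
    """
    counts = Counter()
    for line in instructions:
        tokens = line.split()
        if not tokens:  # skip blank lines in the listing
            continue
        counts[tokens[0].lower()] += 1
    return counts


# Opcode sequence of the example program P from the slides.
program_p = "mov add mov add sub push add mov sub".split()
ifv_p = instruction_frequency_vector(program_p)
print(dict(ifv_p))
```

Operands are ignored on purpose: only the mnemonic contributes to the vector, which matches the definition above.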
Example
Program P: mov add mov add sub push add mov sub
IFV(P): (mov, 3), (add, 3), (sub, 2), (push, 1)

  mov   add   sub   push
  3     3     2     1

Experimental Set Up
- Implemented a W32.Simile simulator: code expansion, code compression
- Generated variants: 10 for each of the first 5 generations of W32.Simile
- Generated benign samples: 50 random benign opcode sequences
- Generated the IFVs for each variant and each benign sequence
- Modeled IFV evolution as a Markov chain: the transition matrix of the chain captures the probability that the IFV of variant A mutates to the IFV of variant B. (We hence selected 10 IFVs for the matrix.)

Classification
- Calculate the Euclidean distances between IFVs, where i indexes the opcode entries of the two vectors:
  $d = \sqrt{\sum_{i=0}^{n} \left(IFV_x[i] - IFV_y[i]\right)^2}$
- Given a suspect program, set a threshold ε
- pv = percentage of variants within ε of the suspect
- pnv = percentage of non-variants within ε of the suspect
- If pv > pnv then the suspect is a variant, else the suspect is not a variant

Evaluation/Results
- Accuracy: Markov Chain Theory
- Perfect classifier.
- Smallest/largest recorded distances (v = variant, b = benign):
  lrd(v,v) = 109.27    srd(v,v) = 10.90
  lrd(v,b) = 5370.79   srd(v,b) = 4343.57
- A good ε is any threshold such that 109.27 < ε < 4343.57

Conclusion and Future Work

Impact
- IFV overcomes code permutation
- Faster since it simulates only the opcodes
- No need for variant signatures

Limitations / Future Work
- Is just a filter/quick check. More may be needed. Program analysis to the rescue.
- Detection is now limited to only 10 samples; need more.
- Mine Markov chain theory for more.
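The classification rule from the Classification slide (Euclidean distance, threshold ε, then comparing pv with pnv) could be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the function names are hypothetical, IFVs are represented as dictionaries with missing opcodes counted as zero, "within ε" is read as distance ≤ ε, and the sample vectors are made-up toy values rather than data from the experiments.

```python
import math


def euclidean_distance(ifv_x, ifv_y):
    """Euclidean distance between two IFVs given as {mnemonic: count} dicts.

    Assumption: an opcode absent from one vector has frequency 0 there.
    """
    opcodes = set(ifv_x) | set(ifv_y)
    return math.sqrt(
        sum((ifv_x.get(op, 0) - ifv_y.get(op, 0)) ** 2 for op in opcodes)
    )


def fraction_within(suspect, references, eps):
    """Fraction of reference IFVs at distance <= eps from the suspect."""
    if not references:
        return 0.0
    hits = sum(1 for ref in references if euclidean_distance(suspect, ref) <= eps)
    return hits / len(references)


def is_variant(suspect, variant_ifvs, benign_ifvs, eps):
    """Apply the slide's rule: the suspect is a variant if pv > pnv."""
    pv = fraction_within(suspect, variant_ifvs, eps)   # share of known variants nearby
    pnv = fraction_within(suspect, benign_ifvs, eps)   # share of non-variants nearby
    return pv > pnv


# Toy vectors for illustration only (not measurements from the study).
variants = [
    {"mov": 30, "add": 20, "push": 10},
    {"mov": 32, "add": 19, "push": 11},
]
benign = [{"mov": 5, "call": 40, "ret": 40}]

suspect_a = {"mov": 31, "add": 21, "push": 10}
suspect_b = {"mov": 6, "call": 41, "ret": 39}

print(is_variant(suspect_a, variants, benign, eps=5.0))
print(is_variant(suspect_b, variants, benign, eps=5.0))
```

Using fractions instead of percentages does not change the outcome, since only the comparison pv > pnv matters. With the distances reported on the Evaluation slide, any ε strictly between 109.27 and 4343.57 would separate the recorded variant-to-variant distances from the variant-to-benign ones.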