Detecting Undetectable Metamorphic Viruses Sujandharan Venkatachalam and Mark Stamp Department of Computer Science San Jose State University San Jose, California Abstract Signature-based detection provides a relatively simple and efficient method for detecting known viruses. At present, most antivirus systems rely primarily on signature detection. Metamorphic viruses are potentially one of the most difficult types of viruses to detect. Such viruses change their internal structure, which provides an effective means of avoiding signature detection. Previous work has provided a rigorous proof that a fairly simple metamorphic engine can generate viruses that are undetectable using “static analysis”. In this paper, we implement a metamorphic engine that produces such “undetectable” viruses and we show that, as expected, popular antivirus scanners fail to detect the resulting viruses. Finally, we analyze these same viruses using a previously developed approach based on hidden Markov models (HMM). This HMM-based approach easily detects the viruses. 1 Introduction Since the advent of malware, virus creation techniques and detection methodologies have evolved in an ongoing “arms race” [16] [21]. Virus creators want to evade popular detection mechanisms and since signature detection is the most popular, the most effort has gone towards evading signature detection. Virus programmers have created several types of viruses that attempt to bypass signature-based detection. As the name suggests, encrypted viruses encrypt their code, which effectively hides the signature. The Achilles heel of such viruses is that they include unencrypted decryption code, 1 which is susceptible to signature detection. Polymorphic viruses up the ante by morphing the decryption code, making detection more difficult. However, polymorphic viruses can be detected using emulation techniques. Metamorphic viruses do away with encryption entirely and instead rely on code morphing of the virus body to hide their signature [15], and thus the term “body polymorphic” is sometimes used. While metamorphic viruses can effectively hide their signature, the paper [23] shows that metamorphic viruses produced by the hacker community are generally not very metamorphic and even the most highly metamorphic are detectable using machine learning techniques—specifically, hidden Markov models (HMMs). The authors of [4] give the following intriguing result: “… we prove that reliable static detection of a particular category of metamorphic viruses is an NP-complete problem. Then we empirically illustrate our result by constructing a practical obfuscator which could be used by metamorphic viruses in the future to evade detection…” Furthermore, the authors of [4] state that “… static analysis is conceived as a process whose goal is to extract the semantics of a given program without any help of code execution.” In this paper, we have implemented a stand-alone metamorphic generator that satisfies the conditions in [4] and we have applied it to several well-known viruses. We show that, as expected, morphed versions of the base viruses are not detected using popular signature-based antivirus software. Then we show that these “undectectable” metamorphic viruses are, in fact, easily detected using the machine learning-based technique in [23]. Our detection technique does not employ on any code execution and, by most definitions, would be considered to use only “static analysis”. While our work does not contradict the paper [4], it does indicate that metamorphic viruses generated following the approach in [4] would not be particularly difficult to detect in practice. The organization of this paper is as follows. In Section 2 we provide a brief history of the evolution of virus development while Section 3 briefly outlines the corresponding history of virus detection. Then in Section 4 we consider code obfuscation techniques used in metamorphic 2 generators, and in Section 5 we briefly discuss the use of HMMs for metamorphic detection. In Section 6 we provide details on the metamorphic generator that we have developed—a generator that satisfies the conditions given in the paper [4]. Section 7 summarizes our experimental results. Finally, Section 8, Conclusions and Future Work, not surprisingly, contains our conclusions and provides suggestions for future work. 2 A Brief History of Viruses Early on in the Titanic struggle between the forces of good and evil (code, that is), signature based detection became the preferred method for detecting malware. Predictably, virus writers reacted by developing new techniques designed to evade signature detection. Here, we briefly outline the evolution of virus development and the parallel history of virus detection. In the next section, we consider some additional aspects of virus detection. A stealth virus takes control of the file management system and thereby attempts to conceal any change it has made to infected files. When virus scanners attempt to scan infected files, the virus redirects the scanner to some uninfected file. In order to detect such memory resident viruses, virus detection systems can scan active memory before scanning the file system. This type of detection proved effective against early stealth viruses, including Brain, Frodo and Whale [21]. As the name indicates, encrypted viruses try to bypass virus detection by self-encryption [5]. The code encryption implemented in such viruses effectively hides the signature of the underlying virus. However, the virus body must be decrypted before it can execute, and the decryption code is susceptible to signature scanning. Like encrypted viruses, polymorphic viruses try to bypass detection by self-encryption [5]. However, unlike encrypted viruses, these viruses mutate the decryptor code, making signature scanning much more challenging [17]. Figure 1 illustrates different variants of a polymorphic virus. Note that the decrypted body is the same in each case. 3 Figure 1: Polymorphic viruses [17] Polymorphic viruses can be detected using emulationif the emulated code is actually a virus, it will eventually decrypt itself, at which point standard signature scanning can be applied [2]. The virus “1260” (aka V2PX) is often considered the first polymorphic virus, and it paved the way for the creation of more sophisticated polymorphic code [27]. Metamorphic viruses do away with encryption and instead modify their entire code in such a way that each copy is functionally the same, but structurally different [3]. If the copies are sufficiently different, no common signature will be present. Figure 2 illustrates different generations of a metamorphic virus. Note that the code differs in each case, yet the viral copies all have the same function. 4 Figure 2: Metamorphic viruses [17] Apparently, well-designed metamorphic code cannot be detected via signature-based methods. However, it has been shown that metamorphic viruses produced by the hacker community can be detected using machine learning techniques [21]. The article [17] is an excellent introduction to metamorphic viruses; for additional information on metamorphic malware see [9], [10], or [13]. For a more detailed history of viruses in general, see [20]. 3 Virus Detection Signature based detection systems scan the files for specific patterns. Generally, a pattern of instructions present in a virus is identified for use as its signature. To reduce the number of false alarms, the goal is to find a pattern of bits that only appears in the virus, or at least is very uncommon in normal code. For example, the W32/Beast virus has a signature of 83EB 0274 EB0E 740A 81EB 0301 0000 [18]. 5 In Figure 3, we show the process that could be used to extract a signature from the Stoned virus [18]. First, the virus is disassembled and an uncommon opcode sequence is selected as the basis for constructing a signature. Figure 3: Signature in Stoned virus [18] While signature scanning is the most popular approach to malware detection, anomaly based detection has received considerable attention recently. An anomaly based system monitors for abnormal activity—if abnormal activity is identified, the system raises an alarm signaling the possible presence of malware [7] [15]. There are numerous practical difficulties with anomaly based systems, including finding a suitable “normal” baseline, determining how far away from normal the system can deviate in ordinary use, and adapting to legitimate changes in usage. In spite of such difficulties, there is great interest in anomaly detection, since it offers the potential for detecting attacks that have never been seen before—a feat that is not possible with signature detection. However, today anomaly detection (as defined here) is, at best, a research topic fraught with difficulty. Finally, emulation based detection relies on executing code in a virtual environment. As mentioned above, this type of detection is particularly useful for detecting polymorphic viruses. Table 1 summarizes the strengths and weaknesses of the detection methods discussed in this section. Table 1: Comparison of virus detection techniques [2] 6 Detection technique Strength Weakness Signature based Anomaly based Efficient Detect new malware Emulation based Polymorphic viruses Cannot detect new malware Costly, false positives, Unproven in practice Costly to implement 4 Code Obfuscation Metamorphic viruses use one or more code obfuscation techniques to produce structurally different versions of the same virus, while not altering the function of the code. The primary goal of the obfuscation is to avoid signature detectionif the viruses are sufficiently different, no common signature will exist and, ideally, no fixed signature will detect any significant percentage of the viruses. Below, we discuss a few of the most common code morphing techniques; see [11] and [19] for more examples. The code obfuscation techniques implemented in various hacker-produced metamorphic generators are summarized in Table 2. Table 2: Code obfuscation techniques [4] Evol (2000) Instruction substitution Instruction permutation Garbage code insertion Variable substitution Altering control flow Zmist (2001) Zperm (2000) Regswap (2000) MetaPHOR (2001) Inserting garbage instructions (e.g., NOP) between useful code blocks is a simple obfuscation technique used in all of the virus generators listed in Table 2. These garbage instructions do not alter the functionality but will tend to increase the size of the code. Viruses that contain garbage instructions are harder to detect using signatures since these instructions tend to break up predetermined signatures. Instruction reordering is another common metamorphic technique. In this method, the instructions in the virus code are shuffled in a random fashion, with the control flow adjusted (via jump statements, for example) to make the code execute in the appropriate order. Thus, the instructions are reordered within the code without altering the actual control flow. This method 7 can also effectively break up signatures. However, if too many jump instructions are inserted, this could be used as a simple heuristic for detecting malware. Figure 4 shows an example of code reordering. Subroutine reordering is a special case of code reordering. Reordering subroutines in the virus code does not change the control flow but could make signature detection more difficult. Instruction interchange is another useful obfuscation technique. In this method, instructions are replaced with equivalent instructions. Then metamorphic versions of a given base virus will have different patterns of opcodes that perform the same function. Figure 4: Code reordering [18] Register swapping, where different registers are used for equivalent operations, is a simple special case of interchanging instructions. Again, the idea is to change the opcode pattern and thereby bypass signature detection. This technique was the primary means of obfuscation used in one of the first metamorphic viruses, W95/Regswap [18]. According to [21], the most advanced of the hacker-produced metamorphic virus generator is the Next Generation Virus Construction Kit (NGVCK). Table 3 provides examples of code obfuscation used by NGVCK. Table 3: Code obfuscation in NGVCK 8 Base Version Morphed Version 1 Code Reordering Morphed Version 2 Garbage Insertion call delta delta: pop ebp sub ebp, offset delta call delta delta: sub dword ptr[esp], offset delta pop eax mov ebp, eax add ecx, 0031751B ; junk call delta delta: sub dword ptr[esp], offset delta sub ebx,00000909 ; junk mov edx, [esp] xchg ecx, eax ; junk add esp, 00000004 and ecx, 00005E44 ; junk xchg edx, ebp HEX equivalent: E8000000005D81ED05104000 HEX equivalent: E800000000812C2405104000588BE8 HEX equivalent: *812C240B104000*8B1424*83C404*87EA 5 HMMs for Virus Detection Machine learning techniques have been successfully applied to the problem of detecting metamorphic viruses [22] [23]. These techniques extract statistical information from training data and the resulting model can then be used to test any input file for similarity to the training data. While a variety of machine learning techniques have been used, here, as in [23], we focus on hidden Markov models. A hidden Markov model (HMM) is a state machine that attempts to model a “hidden” Markov process. The Markov process is hidden, in the sense that it cannot be directly observedthe actual observations are probabilistically related to the hidden process. In the context of metamorphic viruses, an HMM is trained on a specific metamorphic family. The training data consists of a sequence of opcodes derived from viruses, all of which were produced by a single metamorphic engine. Once the model is trained, it can be used to score an unknown file (again, based on an its extracted opcode sequence) to determine its similarity to the metamorphic family. For more details on the use of HMMs for metamorphic virus detection, see [22] and [23]; for related work involving profile HMMs see [2]; other machine learning techniques are applied to virus detection in [7] and [28]; for additional background on HMMs in general, see [14]. 9 6 Metamorphic Generator We have implemented a metamorphic virus generator that satisfies the conditions given in [4]. Recall that the paper [4] provides a rigorous proof that viruses generated using their approach cannot be detected using “static analysis”. Next, we briefly discuss the details of our metamorphic generator. A seed virus is the input to our metamorphic generator. The seed virus assembly code is split into small blocks and then reordered using conditional jump instructions and labels. The number of instructions in each block is variable and is typically set to an average value of six. The virus code is split into blocks, respecting the conditions given in [4], namely, code blocks cannot end with a jump or a NOP instruction. A precondition on the seed virus is that the entire code must appear in the code (or text) section of the assembly file, which implies that viruses that hide code in their data section cannot be used as a seed virus. After splitting the code into blocks, the blocks are randomly shuffled. Then labels are inserted and conditional jump instructions are used so as to maintain the original control flow. In addition, instruction reordering and garbage insertion are used for code obfuscation. In summary, our metamorphic engine performs the following steps: 1) Valid instructions in the assembly code of the seed file are identified 2) Blocks are generated based on the following conditions: (a) The first and last block of the code do not change (b) The last instruction of a block is not a label, JMP or NOP 3) Blocks are randomly permuted and labels and conditional jumps are inserted 4) Garbage instructions are randomly inserted according to a threshold value 5) Write the output file The garbage insertion is optional and the amount of garbage inserted is adjustable. The garbage instructions include various copy instructions and opaque predicates, with the garbage inserted between pairs of code blocks, after the block shuffling is completed. 10 Our generator has been successfully tested with several virus families. A few test cases are discussed in the next section; for more examples, see [24]. Next, we applied the HMM virus detection technique in [23] to our metamorphic viruses. To train an HMM model, 200 metamorphic copies of a given seed virus were created using our metamorphic engine. The metamorphic engine generates ASM files, each of which has the same functionality as the seed virus. These 200 files were assembled using the Borland Turbo TASM 5.0 assembler and linked using the Borland Turbo TLINK 7.1 linker to produce EXE files. The EXE files thus obtained were then disassembled using IDA Pro and opcode sequences were extracted. The steps performed in preparing the test data are given in Figure 6. TASM 5.0 Code obfuscation engine 200 ASM Files TLINK 7.1 200 OBJ Files IDA Pro 200 EXE Files 200 ASM Files Figure 5: Test data preparation From the set of 200 metamorphic virus files, 160 files were used for training and the remaining 40 were reserved for testing the resulting HMM model. Note that disassembled files obtained from EXE files were used for training and testing. Consequently, our training and testing is realistic, since, in general, only EXE files would be available to antivirus software. We performed 5-fold cross validation, that is, we split the 200 metamorphic virus files into 5 equal sets of 40 viruses each. From among these five sets, four sets were used for training and one set was reserved to test the trained HMM model. In addition to the 40 metamorphic test files, the test set also included 25 other viruses and 40 “normal” files. For the normal files, we used Cygwin utility files, which are similar in size to the virus files and also perform low-level operations, so they are somewhat similar in function to the actual viruses. These utility files should provide a relatively challenging test case for the HMM model. 11 7 Experimental Results For the test cases considered in this paper, the Next Generation Virus Creation Kit (NGVCK) was used to create the seed viruses. Other viruses were considered, with equally strong results in each case; see [24] for more details. In each case, popular antivirus scanners could detect the seed virus, but not the morphed viruses produced by our metamorphic generator. That is, our metamorphic generator is able to successfully bypass signature detection. However, regardless of the seed virus used, the HMM engine was able to easily distinguish the morphed viruses from normal code. Virus creation, analysis and testing experiments were conducted using the platform and tools listed in Table 4. Table 4: Experimental setup Experiment platform Programming language Disassembler Assembler Linker Virus generator Virus scanners Windows XP VMware virtual machine Perl5 OllyDbg v1.10 IDA Pro 4.9 Borland Turbo Assembler 5.0 Borland Turbo Linker 7.1 MPCGEN (Phalcon/Skism Mass Code Generator) G2 (Generation 2 Virus Generator) VCL32 (Virus Creation Lab for Win32) NGVCK (Next Generation Virus Creation Kit) Avast Home Edition 4.8 McAfee Antivirus 2009 As mentioned above, the seed viruses were detected by commercial antivirus software. For example, Figure 8 shows a screenshot of the security alert displayed by McAfee antivirus when it scanned one of our seed viruses. 12 Figure 6: Testing seed virus with McAfee Next, we consider one specific case. In this example, we used our engine to generate 200 metamorphic variants of the NGVCK seed virus. The parameters were set to generate variants with a threshold of two garbage instructions. Snippets of code from two of the resulting metamorphic variants appear in Figure 9. Figure 7: Sample metamorphic code 13 The metamorphic viruses were then assembled and the resulting executables scanned using popular antivirus scanners. The post-scan summaries for McAfee and Avast [1] are shown in Figures 10 and 11, respectively. From Figures 10 and 11 we see that, as expected, these popular scanners were not able detect the morphed executables as viruses. Figure 8: Scanning metamorphic viruses with McAfee Figure 9: Scanning metamorphic viruses with Avast An HMM model was trained using this set of metamorphic viruses. The number of distinct observation symbols (i.e., opcodes) ranged from 40 to 42 and the total number of observations ranged from 41472 to 42151. The resulting model was then used to score viruses and normal files. A typical HMM score graph appears in Figure 10. The same data appears in Figure 11, where a number of other (non-family) viruses have also been scored. In every case, a threshold 14 could easily be set that would provide flawless detection, that is, no false positives or false negatives would occur. In fact, the score differences are quite large given that the scores are computed on a per opcode basis. Note that some of the “other viruses” in Figure 11 are much closer in score to the NGVCK viruses than any of the normal files. The fact that these other viruses are more similar to our metamorphic viruses than normal code could be considered a positive feature of this detection approach—depending on how we set the threshold, we would detect some non-family viruses without any additional work. Figure 10: HMM scores 15 Figure 11: HMM scores including other viruses 8 Conclusions and Future Work In this paper, we analyzed metamorphic viruses satisfying the conditions given in the paper [4]. Recall that [4] provides a rigorous proof that such viruses are undetectable using “static analysis”. As expected, our metamorphic viruses are not susceptible to signature detection. However—and perhaps surprisingly—these viruses are easily detected via a machine learning approach. Specifically, we trained HMM models to detect such viruses. While the work presented here does not contradict [4], it does call into question the utility of relying on such a narrow definition of “static analysis”. We believe that by most accounts, our HMM approach would be considered a static technique. Unfortunately, at present our HMM-based detection method is not practical for scanning large data sets, primarily due to the time required to obtain opcode sequences during the scoring phase. The paper [25] is a promising first step towards improving the speed of scanning. At this point, a natural question to ask is whether a metamorphic generator can be produced that will evade signature detection and is undetectable using our HMM-based detector. The paper [5] was a first attempt to settle this question, while the paper [26] shows conclusively that a practical 16 metamorphic generator can evade both signature detection and the HMM based approach used in this paper. However, it is not yet entirely clear where the ultimate balance of power lies between metamorphic virus writers and detection. At first blush, the theoretical results in [4] seem to imply that metamorphic virus writers have an insurmountable advantage—at least from the perspective of static analysis. However, the results in this paper indicate that the reality of the situation is considerably more nuanced. In any case, many issues remain unresolved and further research is warranted. References [1] Avast Antivirus, http://www.avast.com/ [2] S. Attaluri, “Profile hidden Markov models for metamorphic virus analysis,” Master’s thesis, San Jose State University, 2007. http://www.cs.sjsu.edu/faculty/stamp/students/Srilatha_cs298Report.pdf [3] “Benny/29A", Theme: metamorphism, http://www.vx.netlux.org/lib/static/vdat/epmetam2.htm [4] J. Borello and L. Me, Code Obfuscation Techniques for Metamorphic Viruses, Journal in Computer Virology, Vol. 4, No. 3, August 2008, pp. 211-220 http://www.springerlink.com/content/233883w3r2652537 [5] P. Desai, “Towards an undetectable Computer Virus,” Master’s project report, Department of Computer Science, San Jose State University, 2008. http://www.cs.sjsu.edu/faculty/stamp/students/Desai_Priti.pdf [6] J. Dickinson, “The New Anti-Virus Formula,” Messaging News Press 2005. http://www.ironport.com/pdf/ironport_new_anti-virus_formula.pdf [7] IBM Corporation. (1996). “Neural Networks for Computer Virus Recognition”, http://www.research.ibm.com/antivirus/SciPapers/Tesauro/NeuralNets.html [8] IDA Pro, http://www.hex-rays.com/idapro/ [9] E. Konstantinou, “Metamorphic Virus: Analysis and Detection,” January 2008. [10] A. Lakhotia, “Are metamorphic viruses really invincible?” Virus Bulletin, December 2005. 17 [11] P. Mishra, “A taxonomy of software uniqueness transformations”, Master’s thesis, Department of Computer Science, San Jose State University, Dec. 2003 http://home.earthlink.net/~mstamp1/mss_v.html#masters [12] Orr, “The molecular virology of Lexotan32: Metamorphism illustrated,” 2007. http://www.antilife.org/files/Lexo32.pdf [13] Orr, “The viral Darwinism of W32.Evol: An in-depth analysis of a metamorphic engine,” 2006. http://www.antilife.org/files/Evol.pdf [14] M. Stamp, “A Revealing Introduction to Hidden Markov Models”, January 2004. http://www.cs.sjsu.edu/faculty/stamp/RUA/HMM.pdf [15] M. Stamp, Information Security: Principles and Practice, Wiley, 2006. [16] PCWorld. (2008). “Viruses Expected to Hit 1 Million This Year”, http://www.pcworld.com/article/144181/viruses_expected_to_hit_1_million_this_year.ht ml [17] P. Szor, P. Ferrie, “Hunting for Metamorphic”, Symantec Security Response. http://www.symantec.com/avcenter/reference/hunting.for.metamorphic.pdf [18] P. Szor, “The Art of Computer Virus Defense and Research,” Symantec Press 2005. [19] A. Venkatesan, Code obfuscation and metamoprhic virus detection, Master’s project report, Department of Computer Science, San Jose State University, 2008. [20] virus-scan-software.com, “A history of computer viruses”, http://www.virus-scansoftware.com/virus-scan-help/answers/the-history-of-computer-viruses.shtml [21] VX Heavens, http://vx.netlux.org/ [22] W. Wong, “Analysis and Detection of Metamorphic Computer Viruses,” Master’s thesis, San Jose State University, 2006. http://www.cs.sjsu.edu/faculty/stamp/students/Report.pdf [23] W. Wong and M. Stamp, Hunting for metamorphic engines, Journal in Computer Virology, Vol. 2, No. 3, December 2006, pp. 211-229 [24] S. Venkatachalam, Detecting undetectable computer viruses, Master’s project report, Department of Computer Science, San Jose State University, 2010 [25] D. Radhakrishnan, Approximate disassembly, Master’s project report, Department of Computer Science, San Jose State University, 2010 18 [26] D. Lin, Hunting for undectectable metamorphic viruses, Master’s project report, Department of Computer Science, San Jose State University, 2010 [27] Wikipedia, 1260 (computer virus), http://en.wikipedia.org/wiki/1260_(computer_virus) [28] A. Govindaraju, Exhaustive statistical analysis for detection of metamorphic malwares, Master’s project report, Department of Computer Science, San Jose State University, 2010 19