Detecting “undetectable” metamorphic viruses

advertisement
Detecting Undetectable Metamorphic Viruses
Sujandharan Venkatachalam and Mark Stamp
Department of Computer Science
San Jose State University
San Jose, California
Abstract
Signature-based detection provides a relatively simple and efficient method for detecting known
viruses. At present, most antivirus systems rely primarily on signature detection.
Metamorphic viruses are potentially one of the most difficult types of viruses to detect. Such
viruses change their internal structure, which provides an effective means of avoiding signature
detection. Previous work has provided a rigorous proof that a fairly simple metamorphic engine
can generate viruses that are undetectable using “static analysis”. In this paper, we implement a
metamorphic engine that produces such “undetectable” viruses and we show that, as expected,
popular antivirus scanners fail to detect the resulting viruses. Finally, we analyze these same
viruses using a previously developed approach based on hidden Markov models (HMM). This
HMM-based approach easily detects the viruses.
1 Introduction
Since the advent of malware, virus creation techniques and detection methodologies have
evolved in an ongoing “arms race” [16] [21]. Virus creators want to evade popular detection
mechanisms and since signature detection is the most popular, the most effort has gone towards
evading signature detection.
Virus programmers have created several types of viruses that attempt to bypass signature-based
detection. As the name suggests, encrypted viruses encrypt their code, which effectively hides
the signature. The Achilles heel of such viruses is that they include unencrypted decryption code,
1
which is susceptible to signature detection. Polymorphic viruses up the ante by morphing the
decryption code, making detection more difficult. However, polymorphic viruses can be detected
using emulation techniques. Metamorphic viruses do away with encryption entirely and instead
rely on code morphing of the virus body to hide their signature [15], and thus the term “body
polymorphic” is sometimes used. While metamorphic viruses can effectively hide their
signature, the paper [23] shows that metamorphic viruses produced by the hacker community are
generally not very metamorphic and even the most highly metamorphic are detectable using
machine learning techniques—specifically, hidden Markov models (HMMs).
The authors of [4] give the following intriguing result:
“… we prove that reliable static detection of a particular category of metamorphic viruses
is an NP-complete problem. Then we empirically illustrate our result by constructing a
practical obfuscator which could be used by metamorphic viruses in the future to evade
detection…”
Furthermore, the authors of [4] state that
“… static analysis is conceived as a process whose goal is to extract the semantics of a
given program without any help of code execution.”
In this paper, we have implemented a stand-alone metamorphic generator that satisfies the
conditions in [4] and we have applied it to several well-known viruses. We show that, as
expected, morphed versions of the base viruses are not detected using popular signature-based
antivirus software. Then we show that these “undectectable” metamorphic viruses are, in fact,
easily detected using the machine learning-based technique in [23]. Our detection technique does
not employ on any code execution and, by most definitions, would be considered to use only
“static analysis”. While our work does not contradict the paper [4], it does indicate that
metamorphic viruses generated following the approach in [4] would not be particularly difficult
to detect in practice.
The organization of this paper is as follows. In Section 2 we provide a brief history of the
evolution of virus development while Section 3 briefly outlines the corresponding history of
virus detection. Then in Section 4 we consider code obfuscation techniques used in metamorphic
2
generators, and in Section 5 we briefly discuss the use of HMMs for metamorphic detection. In
Section 6 we provide details on the metamorphic generator that we have developed—a generator
that satisfies the conditions given in the paper [4]. Section 7 summarizes our experimental
results. Finally, Section 8, Conclusions and Future Work, not surprisingly, contains our
conclusions and provides suggestions for future work.
2 A Brief History of Viruses
Early on in the Titanic struggle between the forces of good and evil (code, that is), signature
based detection became the preferred method for detecting malware. Predictably, virus writers
reacted by developing new techniques designed to evade signature detection. Here, we briefly
outline the evolution of virus development and the parallel history of virus detection. In the next
section, we consider some additional aspects of virus detection.
A stealth virus takes control of the file management system and thereby attempts to conceal any
change it has made to infected files. When virus scanners attempt to scan infected files, the virus
redirects the scanner to some uninfected file. In order to detect such memory resident viruses,
virus detection systems can scan active memory before scanning the file system. This type of
detection proved effective against early stealth viruses, including Brain, Frodo and Whale [21].
As the name indicates, encrypted viruses try to bypass virus detection by self-encryption [5]. The
code encryption implemented in such viruses effectively hides the signature of the underlying
virus. However, the virus body must be decrypted before it can execute, and the decryption code
is susceptible to signature scanning.
Like encrypted viruses, polymorphic viruses try to bypass detection by self-encryption [5].
However, unlike encrypted viruses, these viruses mutate the decryptor code, making signature
scanning much more challenging [17]. Figure 1 illustrates different variants of a polymorphic
virus. Note that the decrypted body is the same in each case.
3
Figure 1: Polymorphic viruses [17]
Polymorphic viruses can be detected using emulationif the emulated code is actually a virus, it
will eventually decrypt itself, at which point standard signature scanning can be applied [2]. The
virus “1260” (aka V2PX) is often considered the first polymorphic virus, and it paved the way
for the creation of more sophisticated polymorphic code [27].
Metamorphic viruses do away with encryption and instead modify their entire code in such a
way that each copy is functionally the same, but structurally different [3]. If the copies are
sufficiently different, no common signature will be present. Figure 2 illustrates different
generations of a metamorphic virus. Note that the code differs in each case, yet the viral copies
all have the same function.
4
Figure 2: Metamorphic viruses [17]
Apparently, well-designed metamorphic code cannot be detected via signature-based methods.
However, it has been shown that metamorphic viruses produced by the hacker community can be
detected using machine learning techniques [21].
The article [17] is an excellent introduction to metamorphic viruses; for additional information
on metamorphic malware see [9], [10], or [13]. For a more detailed history of viruses in general,
see [20].
3 Virus Detection
Signature based detection systems scan the files for specific patterns. Generally, a pattern of
instructions present in a virus is identified for use as its signature. To reduce the number of false
alarms, the goal is to find a pattern of bits that only appears in the virus, or at least is very
uncommon in normal code. For example, the W32/Beast virus has a signature of 83EB 0274
EB0E 740A 81EB 0301 0000 [18].
5
In Figure 3, we show the process that could be used to extract a signature from the Stoned virus
[18]. First, the virus is disassembled and an uncommon opcode sequence is selected as the basis
for constructing a signature.
Figure 3: Signature in Stoned virus [18]
While signature scanning is the most popular approach to malware detection, anomaly based
detection has received considerable attention recently. An anomaly based system monitors for
abnormal activity—if abnormal activity is identified, the system raises an alarm signaling the
possible presence of malware [7] [15].
There are numerous practical difficulties with anomaly based systems, including finding a
suitable “normal” baseline, determining how far away from normal the system can deviate in
ordinary use, and adapting to legitimate changes in usage. In spite of such difficulties, there is
great interest in anomaly detection, since it offers the potential for detecting attacks that have
never been seen before—a feat that is not possible with signature detection. However, today
anomaly detection (as defined here) is, at best, a research topic fraught with difficulty.
Finally, emulation based detection relies on executing code in a virtual environment. As
mentioned above, this type of detection is particularly useful for detecting polymorphic viruses.
Table 1 summarizes the strengths and weaknesses of the detection methods discussed in this
section.
Table 1: Comparison of virus detection techniques [2]
6
Detection technique
Strength
Weakness
Signature based
Anomaly based
Efficient
Detect new malware
Emulation based
Polymorphic viruses
Cannot detect new malware
Costly, false positives,
Unproven in practice
Costly to implement
4 Code Obfuscation
Metamorphic viruses use one or more code obfuscation techniques to produce structurally
different versions of the same virus, while not altering the function of the code. The primary goal
of the obfuscation is to avoid signature detectionif the viruses are sufficiently different, no
common signature will exist and, ideally, no fixed signature will detect any significant
percentage of the viruses. Below, we discuss a few of the most common code morphing
techniques; see [11] and [19] for more examples. The code obfuscation techniques implemented
in various hacker-produced metamorphic generators are summarized in Table 2.
Table 2: Code obfuscation techniques [4]
Evol (2000)
Instruction substitution
Instruction permutation
Garbage code insertion
Variable substitution
Altering control flow



Zmist
(2001)





Zperm
(2000)
Regswap
(2000)





 


MetaPHOR
(2001)





Inserting garbage instructions (e.g., NOP) between useful code blocks is a simple obfuscation
technique used in all of the virus generators listed in Table 2. These garbage instructions do not
alter the functionality but will tend to increase the size of the code. Viruses that contain garbage
instructions are harder to detect using signatures since these instructions tend to break up
predetermined signatures.
Instruction reordering is another common metamorphic technique. In this method, the
instructions in the virus code are shuffled in a random fashion, with the control flow adjusted
(via jump statements, for example) to make the code execute in the appropriate order. Thus, the
instructions are reordered within the code without altering the actual control flow. This method
7
can also effectively break up signatures. However, if too many jump instructions are inserted,
this could be used as a simple heuristic for detecting malware. Figure 4 shows an example of
code reordering.
Subroutine reordering is a special case of code reordering. Reordering subroutines in the virus
code does not change the control flow but could make signature detection more difficult.
Instruction interchange is another useful obfuscation technique. In this method, instructions are
replaced with equivalent instructions. Then metamorphic versions of a given base virus will have
different patterns of opcodes that perform the same function.
Figure 4: Code reordering [18]
Register swapping, where different registers are used for equivalent operations, is a simple
special case of interchanging instructions. Again, the idea is to change the opcode pattern and
thereby bypass signature detection. This technique was the primary means of obfuscation used in
one of the first metamorphic viruses, W95/Regswap [18].
According to [21], the most advanced of the hacker-produced metamorphic virus generator is the
Next Generation Virus Construction Kit (NGVCK). Table 3 provides examples of code
obfuscation used by NGVCK.
Table 3: Code obfuscation in NGVCK
8
Base Version
Morphed Version 1
Code Reordering
Morphed Version 2
Garbage Insertion
call delta
delta: pop ebp
sub ebp, offset delta
call delta
delta: sub dword ptr[esp], offset delta
pop eax
mov ebp, eax
add ecx, 0031751B ; junk
call delta
delta: sub dword ptr[esp], offset delta
sub ebx,00000909 ; junk
mov edx, [esp]
xchg ecx, eax
; junk
add esp, 00000004
and ecx, 00005E44 ; junk
xchg edx, ebp
HEX equivalent:
E8000000005D81ED05104000
HEX equivalent:
E800000000812C2405104000588BE8
HEX equivalent:
*812C240B104000*8B1424*83C404*87EA
5 HMMs for Virus Detection
Machine learning techniques have been successfully applied to the problem of detecting
metamorphic viruses [22] [23]. These techniques extract statistical information from training
data and the resulting model can then be used to test any input file for similarity to the training
data. While a variety of machine learning techniques have been used, here, as in [23], we focus
on hidden Markov models.
A hidden Markov model (HMM) is a state machine that attempts to model a “hidden” Markov
process. The Markov process is hidden, in the sense that it cannot be directly observedthe
actual observations are probabilistically related to the hidden process. In the context of
metamorphic viruses, an HMM is trained on a specific metamorphic family. The training data
consists of a sequence of opcodes derived from viruses, all of which were produced by a single
metamorphic engine. Once the model is trained, it can be used to score an unknown file (again,
based on an its extracted opcode sequence) to determine its similarity to the metamorphic family.
For more details on the use of HMMs for metamorphic virus detection, see [22] and [23]; for
related work involving profile HMMs see [2]; other machine learning techniques are applied to
virus detection in [7] and [28]; for additional background on HMMs in general, see [14].
9
6 Metamorphic Generator
We have implemented a metamorphic virus generator that satisfies the conditions given in [4].
Recall that the paper [4] provides a rigorous proof that viruses generated using their approach
cannot be detected using “static analysis”. Next, we briefly discuss the details of our
metamorphic generator.
A seed virus is the input to our metamorphic generator. The seed virus assembly code is split into
small blocks and then reordered using conditional jump instructions and labels. The number of
instructions in each block is variable and is typically set to an average value of six. The virus
code is split into blocks, respecting the conditions given in [4], namely, code blocks cannot end
with a jump or a NOP instruction. A precondition on the seed virus is that the entire code must
appear in the code (or text) section of the assembly file, which implies that viruses that hide code
in their data section cannot be used as a seed virus.
After splitting the code into blocks, the blocks are randomly shuffled. Then labels are inserted
and conditional jump instructions are used so as to maintain the original control flow. In
addition, instruction reordering and garbage insertion are used for code obfuscation. In summary,
our metamorphic engine performs the following steps:
1) Valid instructions in the assembly code of the seed file are identified
2) Blocks are generated based on the following conditions:
(a) The first and last block of the code do not change
(b) The last instruction of a block is not a label, JMP or NOP
3) Blocks are randomly permuted and labels and conditional jumps are inserted
4) Garbage instructions are randomly inserted according to a threshold value
5) Write the output file
The garbage insertion is optional and the amount of garbage inserted is adjustable. The garbage
instructions include various copy instructions and opaque predicates, with the garbage inserted
between pairs of code blocks, after the block shuffling is completed.
10
Our generator has been successfully tested with several virus families. A few test cases are
discussed in the next section; for more examples, see [24].
Next, we applied the HMM virus detection technique in [23] to our metamorphic viruses. To
train an HMM model, 200 metamorphic copies of a given seed virus were created using our
metamorphic engine. The metamorphic engine generates ASM files, each of which has the same
functionality as the seed virus. These 200 files were assembled using the Borland Turbo TASM
5.0 assembler and linked using the Borland Turbo TLINK 7.1 linker to produce EXE files. The
EXE files thus obtained were then disassembled using IDA Pro and opcode sequences were
extracted. The steps performed in preparing the test data are given in Figure 6.
TASM 5.0
Code obfuscation
engine
200 ASM Files
TLINK 7.1
200 OBJ Files
IDA Pro
200 EXE
Files
200 ASM
Files
Figure 5: Test data preparation
From the set of 200 metamorphic virus files, 160 files were used for training and the remaining
40 were reserved for testing the resulting HMM model. Note that disassembled files obtained
from EXE files were used for training and testing. Consequently, our training and testing is
realistic, since, in general, only EXE files would be available to antivirus software.
We performed 5-fold cross validation, that is, we split the 200 metamorphic virus files into 5
equal sets of 40 viruses each. From among these five sets, four sets were used for training and
one set was reserved to test the trained HMM model. In addition to the 40 metamorphic test files,
the test set also included 25 other viruses and 40 “normal” files. For the normal files, we used
Cygwin utility files, which are similar in size to the virus files and also perform low-level
operations, so they are somewhat similar in function to the actual viruses. These utility files
should provide a relatively challenging test case for the HMM model.
11
7 Experimental Results
For the test cases considered in this paper, the Next Generation Virus Creation Kit (NGVCK)
was used to create the seed viruses. Other viruses were considered, with equally strong results in
each case; see [24] for more details.
In each case, popular antivirus scanners could detect the seed virus, but not the morphed viruses
produced by our metamorphic generator. That is, our metamorphic generator is able to
successfully bypass signature detection. However, regardless of the seed virus used, the HMM
engine was able to easily distinguish the morphed viruses from normal code.
Virus creation, analysis and testing experiments were conducted using the platform and tools
listed in Table 4.
Table 4: Experimental setup
Experiment platform
Programming language
Disassembler
Assembler
Linker
Virus generator
Virus scanners
Windows XP
VMware virtual machine
Perl5
OllyDbg v1.10
IDA Pro 4.9
Borland Turbo Assembler 5.0
Borland Turbo Linker 7.1
MPCGEN (Phalcon/Skism Mass Code Generator)
G2 (Generation 2 Virus Generator)
VCL32 (Virus Creation Lab for Win32)
NGVCK (Next Generation Virus Creation Kit)
Avast Home Edition 4.8
McAfee Antivirus 2009
As mentioned above, the seed viruses were detected by commercial antivirus software. For
example, Figure 8 shows a screenshot of the security alert displayed by McAfee antivirus when it
scanned one of our seed viruses.
12
Figure 6: Testing seed virus with McAfee
Next, we consider one specific case. In this example, we used our engine to generate 200
metamorphic variants of the NGVCK seed virus. The parameters were set to generate variants
with a threshold of two garbage instructions. Snippets of code from two of the resulting
metamorphic variants appear in Figure 9.
Figure 7: Sample metamorphic code
13
The metamorphic viruses were then assembled and the resulting executables scanned using
popular antivirus scanners. The post-scan summaries for McAfee and Avast [1] are shown in
Figures 10 and 11, respectively. From Figures 10 and 11 we see that, as expected, these popular
scanners were not able detect the morphed executables as viruses.
Figure 8: Scanning metamorphic viruses with McAfee
Figure 9: Scanning metamorphic viruses with Avast
An HMM model was trained using this set of metamorphic viruses. The number of distinct
observation symbols (i.e., opcodes) ranged from 40 to 42 and the total number of observations
ranged from 41472 to 42151. The resulting model was then used to score viruses and normal
files. A typical HMM score graph appears in Figure 10. The same data appears in Figure 11,
where a number of other (non-family) viruses have also been scored. In every case, a threshold
14
could easily be set that would provide flawless detection, that is, no false positives or false
negatives would occur. In fact, the score differences are quite large given that the scores are
computed on a per opcode basis.
Note that some of the “other viruses” in Figure 11 are much closer in score to the NGVCK
viruses than any of the normal files. The fact that these other viruses are more similar to our
metamorphic viruses than normal code could be considered a positive feature of this detection
approach—depending on how we set the threshold, we would detect some non-family viruses
without any additional work.
Figure 10: HMM scores
15
Figure 11: HMM scores including other viruses
8 Conclusions and Future Work
In this paper, we analyzed metamorphic viruses satisfying the conditions given in the paper [4].
Recall that [4] provides a rigorous proof that such viruses are undetectable using “static
analysis”. As expected, our metamorphic viruses are not susceptible to signature detection.
However—and perhaps surprisingly—these viruses are easily detected via a machine learning
approach. Specifically, we trained HMM models to detect such viruses. While the work
presented here does not contradict [4], it does call into question the utility of relying on such a
narrow definition of “static analysis”. We believe that by most accounts, our HMM approach
would be considered a static technique.
Unfortunately, at present our HMM-based detection method is not practical for scanning large
data sets, primarily due to the time required to obtain opcode sequences during the scoring phase.
The paper [25] is a promising first step towards improving the speed of scanning.
At this point, a natural question to ask is whether a metamorphic generator can be produced that
will evade signature detection and is undetectable using our HMM-based detector. The paper [5]
was a first attempt to settle this question, while the paper [26] shows conclusively that a practical
16
metamorphic generator can evade both signature detection and the HMM based approach used in
this paper. However, it is not yet entirely clear where the ultimate balance of power lies between
metamorphic virus writers and detection.
At first blush, the theoretical results in [4] seem to imply that metamorphic virus writers have an
insurmountable advantage—at least from the perspective of static analysis. However, the results
in this paper indicate that the reality of the situation is considerably more nuanced. In any case,
many issues remain unresolved and further research is warranted.
References
[1]
Avast Antivirus, http://www.avast.com/
[2]
S. Attaluri, “Profile hidden Markov models for metamorphic virus analysis,” Master’s
thesis, San Jose State University, 2007.
http://www.cs.sjsu.edu/faculty/stamp/students/Srilatha_cs298Report.pdf
[3]
“Benny/29A", Theme: metamorphism,
http://www.vx.netlux.org/lib/static/vdat/epmetam2.htm
[4]
J. Borello and L. Me, Code Obfuscation Techniques for Metamorphic Viruses,
Journal in Computer Virology, Vol. 4, No. 3, August 2008, pp. 211-220
http://www.springerlink.com/content/233883w3r2652537
[5]
P. Desai, “Towards an undetectable Computer Virus,” Master’s project report,
Department of Computer Science, San Jose State University, 2008.
http://www.cs.sjsu.edu/faculty/stamp/students/Desai_Priti.pdf
[6]
J. Dickinson, “The New Anti-Virus Formula,” Messaging News Press 2005.
http://www.ironport.com/pdf/ironport_new_anti-virus_formula.pdf
[7]
IBM Corporation. (1996). “Neural Networks for Computer Virus Recognition”,
http://www.research.ibm.com/antivirus/SciPapers/Tesauro/NeuralNets.html
[8]
IDA Pro, http://www.hex-rays.com/idapro/
[9]
E. Konstantinou, “Metamorphic Virus: Analysis and Detection,” January 2008.
[10]
A. Lakhotia, “Are metamorphic viruses really invincible?” Virus Bulletin,
December 2005.
17
[11]
P. Mishra, “A taxonomy of software uniqueness transformations”, Master’s thesis,
Department of Computer Science, San Jose State University, Dec. 2003
http://home.earthlink.net/~mstamp1/mss_v.html#masters
[12]
Orr, “The molecular virology of Lexotan32: Metamorphism illustrated,” 2007.
http://www.antilife.org/files/Lexo32.pdf
[13]
Orr, “The viral Darwinism of W32.Evol: An in-depth analysis of a metamorphic
engine,” 2006. http://www.antilife.org/files/Evol.pdf
[14]
M. Stamp, “A Revealing Introduction to Hidden Markov Models”, January 2004.
http://www.cs.sjsu.edu/faculty/stamp/RUA/HMM.pdf
[15]
M. Stamp, Information Security: Principles and Practice, Wiley, 2006.
[16]
PCWorld. (2008). “Viruses Expected to Hit 1 Million This Year”,
http://www.pcworld.com/article/144181/viruses_expected_to_hit_1_million_this_year.ht
ml
[17]
P. Szor, P. Ferrie, “Hunting for Metamorphic”, Symantec Security Response.
http://www.symantec.com/avcenter/reference/hunting.for.metamorphic.pdf
[18]
P. Szor, “The Art of Computer Virus Defense and Research,” Symantec Press 2005.
[19]
A. Venkatesan, Code obfuscation and metamoprhic virus detection, Master’s project
report, Department of Computer Science, San Jose State University, 2008.
[20]
virus-scan-software.com, “A history of computer viruses”, http://www.virus-scansoftware.com/virus-scan-help/answers/the-history-of-computer-viruses.shtml
[21]
VX Heavens, http://vx.netlux.org/
[22]
W. Wong, “Analysis and Detection of Metamorphic Computer Viruses,” Master’s thesis,
San Jose State University, 2006.
http://www.cs.sjsu.edu/faculty/stamp/students/Report.pdf
[23]
W. Wong and M. Stamp, Hunting for metamorphic engines, Journal in Computer
Virology, Vol. 2, No. 3, December 2006, pp. 211-229
[24]
S. Venkatachalam, Detecting undetectable computer viruses, Master’s project report,
Department of Computer Science, San Jose State University, 2010
[25]
D. Radhakrishnan, Approximate disassembly, Master’s project report, Department of
Computer Science, San Jose State University, 2010
18
[26]
D. Lin, Hunting for undectectable metamorphic viruses, Master’s project report,
Department of Computer Science, San Jose State University, 2010
[27]
Wikipedia, 1260 (computer virus), http://en.wikipedia.org/wiki/1260_(computer_virus)
[28]
A. Govindaraju, Exhaustive statistical analysis for detection of metamorphic malwares,
Master’s project report, Department of Computer Science, San Jose State University,
2010
19
Download