Intelligent Malware Detection

West Virginia University
CpE 480
Group 8
Instructor: Yenumula Reddy
Mentor: Yanfang Ye
Intelligent
Malware Detection
Individual Research Paper
Joshua Suess
2 December 2014
Contents
Malware Today
Current Solutions
Stakeholders
Design
Objectives
Conclusion
Works Cited
Malware Today
What is malware? The word is a portmanteau of "malicious" and "software." Technically
speaking, malware is software whose purpose is to perform unwanted actions. Numerous terms
are used to classify malware, the most common of which are viruses, worms, Trojan horses, and
ransomware. Each class is distinguished by the features that let the software perform the
malicious tasks it was designed for, often while avoiding detection for long periods of
time. Most malware causes at least some of the following:
• Performance issues
• Connectivity issues
• Crashes
• Loss of control
  o Uninitiated shutdowns and restarts
  o Inability to perform input and output
• Unexplainable network activity
Viruses are the most commonly discussed malware because they are the most flexible. Viruses are
used to steal information, create and delete files, create botnets (clusters of infected computers used to
perform malicious activities such as denial-of-service attacks), and display advertisements, among
many other tasks. They spread by attaching themselves to other programs and documents. Infection
occurs when a user runs or opens the infected program or document. The Brain Virus,
A worm is a type of virus. Worms consume bandwidth and overload servers by exploiting holes in an
operating system's design. They often contain sections that can alter files, steal data, and build
botnets. The main difference between a worm and a virus is that a worm is self-replicating and
spreads without the need for a user to initiate it. Simply plugging in an infected drive or connecting to
a compromised network can allow a worm to spread to a machine. This makes a worm a very dangerous
piece of software, depending on its main function. The most newsworthy worm in recent history is
Stuxnet, which infected computers controlling Iranian uranium enrichment, causing centrifuges to spin
at incorrect speeds. This reportedly set Iranian nuclear production back more than two years.
Trojan horses (usually shortened to just Trojans) are simpler, yet more dangerous, forms of
malware than viruses. They are simpler in that they require users to manually download and
install them. This is often achieved by disguise: users are tricked into downloading what they believe is
legitimate software when it is in fact a Trojan. A Trojan is very dangerous because it can allow the
author or distributor to directly access a machine. It can steal data and files, perform keylogging (recording
all keys the user presses), steal electronic money (such as bitcoins), watch the screen, and use the machine in a
botnet, all while hiding the identity of the beneficiary of the information. An example Trojan
is the Scores virus, which infected Macintosh computers. It installed files and closed executing
programs at certain time intervals without the consent of the user, effectively crashing the computer.
Ransomware is a type of Trojan that does just what its name suggests. It takes a machine and
encrypts or password-protects all the files on it, effectively holding them hostage. It then usually
displays a message informing the user of the problem and how to go about fixing it. The usual
fix involves making a monetary payment to a foreign bank account via wire transfer. When payment is
made, the attacker provides the decryption key or password so the machine can be restored without
losing the files. Ransomware usually spreads like a worm, with the initial infection occurring when
an infected file is opened. Examples of ransomware include CryptoLocker, Ransom.A, and Cryzip.
The need for malware detection is as great as ever. It seems that every week a news article
reveals the details of a data breach at some company. The usual suspect is a variant or hybrid of the
malware types above. It typically infects the point-of-sale terminals or payment processing network,
siphoning off the credit card information of unsuspecting customers to be sold on the black market. There has
also been an outbreak of ransomware. Malware is out there, and it affects almost everyone every day in
some way, shape, or form. To avoid malware or rid a machine of it, the malware must first be detected,
which is often the most challenging part because of its adaptability.
To circumvent anti-malware programs, malware can be polymorphic, encrypt itself, and use
other obfuscation techniques. These techniques hide the true nature of the software by reorganizing,
scrambling, and packing its code, among other things, making the file look benign when it is
actually malicious. Since most current malware detection programs use past malware as a reference,
these changes often leave current strains undetectable until they are caught and analyzed so they can
be referenced in the future.
Current Solutions
All programs and code share certain features in order to be executable. They must make
function calls to the host operating system in order to receive computation time and memory.
Most programs are also written in high-level languages such as C, Java, C++, and Python, meaning they
begin as strings of text that are compiled or interpreted into instructions the computer can execute.
Programs and files are also classified by how they behave when they run and what they are made with.
These classifications are indicated by file extensions.
Most anti-malware software, including Symantec’s Norton, McAfee, and Kaspersky, uses
signatures to identify malware. A signature is a way to represent a file, whether benign or
malicious. Signatures are formed by anti-malware programs as they scan a file. They are based on the
classification of the file (extensions such as .exe, .jar, and .dll), pieces of code in the file, and, less
commonly, the behavior (function calls) the code exhibits when it is run. In practice, a signature
is usually a number or set of numbers computed from the aforementioned attributes. Signatures are
often called definitions and are stored as .dat files.
As an anti-malware program runs, it scans files and creates a signature for each. It then
compares the created signature to a table of known malicious signatures in the .dat file. If a match is
found, the program notifies the user. The user then has the opportunity to inspect the suspected
program and make the final decision as to whether it is malicious or benign. Many anti-malware programs
automatically delete known malicious files without the need for the user to take action.
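The scan-and-compare loop described above can be sketched in a few lines of Python. This is only
an illustration, assuming the signature is simply the SHA-256 digest of the file's bytes; commercial
products combine many more attributes. The names make_signature, scan, and KNOWN_MALICIOUS are
hypothetical, and the "payload" is a harmless placeholder string.

```python
import hashlib

def make_signature(data: bytes) -> str:
    """Assumption: a signature is just the SHA-256 digest of the file bytes."""
    return hashlib.sha256(data).hexdigest()

# Hypothetical signature table; a real product would load it from its definition file.
KNOWN_MALICIOUS = {make_signature(b"pretend-malware-payload")}

def scan(data: bytes) -> str:
    """Report 'malicious' when the file's signature is in the table, else 'no match'."""
    if make_signature(data) in KNOWN_MALICIOUS:
        return "malicious"
    return "no match"

print(scan(b"pretend-malware-payload"))
print(scan(b"ordinary document text"))
```

A "no match" result does not prove a file is benign; it only means the table has no entry for it.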
As described, malware is ever evolving, and each change alters the signature. If the new signature is far from
the original, as is usually the case with encryption, packing, and polymorphism, then the malware will not
be detected when it is compared to known signatures. This allows the malware to run unnoticed until
either the signature file is updated or the user realizes an infection has occurred.
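To make the weakness concrete, the sketch below flips a single bit in a placeholder "sample" and
shows that an exact-match signature no longer recognizes it. It assumes a whole-file SHA-256 hash
as the signature, which is a simplification, and the single-bit flip is only a crude stand-in for real
packing or polymorphism. All names are hypothetical.

```python
import hashlib

def make_signature(data: bytes) -> str:
    # Assumption: whole-file SHA-256 digest used as the signature.
    return hashlib.sha256(data).hexdigest()

original = b"pretend-malware-payload"
known_signatures = {make_signature(original)}

# Crude stand-in for obfuscation: flip the lowest bit of the first byte.
mutated = bytes([original[0] ^ 0x01]) + original[1:]

def is_detected(data: bytes) -> bool:
    """True only when the exact signature is already in the table."""
    return make_signature(data) in known_signatures

print("original detected:", is_detected(original))
print("mutated detected: ", is_detected(mutated))
```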
In signature-based systems, updating the signature table is one of the most important tasks.
Without updates, new malware will not be detectable unless it is very similar to known malware. As
such, the publishers of anti-malware technology update their signature tables very often. Many suites
download updates every day.
Stakeholders
There are three stakeholders in this product. First and foremost, there is the customer, or
primary stakeholder: the person or group having files scanned by the
product. This could be anyone who owns a computer, whether an individual, a business, or a
government entity. The second stakeholder is the group of people required to maintain the system. For
this software, as for many other software systems, the creators are
stakeholders because malware is an ever-evolving problem and the software must be updated to
protect against the latest threats. The final stakeholder is the computer on which the system is installed.
System constraints have to be taken into account to allow the widest array of users to install and run
our system.
Design
Our malware detection program will borrow some features of the signature-based method above.
We will compare signatures, but the way we create our signatures and how we compare them will be
different. We will take Windows Portable Executable (PE) files, the executable file format used by
Windows, and analyze them to create signatures based on their Windows API function calls.
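As a small illustration of what recognizing a PE file involves, the sketch below checks the two
markers a PE file carries: the "MZ" bytes at the start of the DOS header and the "PE\0\0" signature
found at the offset stored at byte 0x3C. It is written in Python for brevity rather than the C++ the
project plans to use, the function name looks_like_pe is hypothetical, and the test input is a
synthetic buffer, not a real executable.

```python
import struct

def looks_like_pe(data: bytes) -> bool:
    """Check the DOS 'MZ' magic and the 'PE\\0\\0' signature it points to."""
    if len(data) < 0x40 or data[:2] != b"MZ":
        return False
    # The 4-byte little-endian field at 0x3C holds the offset of the PE signature.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    return data[pe_offset:pe_offset + 4] == b"PE\x00\x00"

# Synthetic buffer that carries only the two markers, for demonstration.
fake = bytearray(0x84)
fake[0:2] = b"MZ"
struct.pack_into("<I", fake, 0x3C, 0x80)
fake[0x80:0x84] = b"PE\x00\x00"

print(looks_like_pe(bytes(fake)))
print(looks_like_pe(b"plain text, not an executable"))
```

Passing this check only means the file claims to be a PE file; extracting the API calls requires
parsing the headers that follow the signature.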
PE files are the basis for many malware samples today, which is why this software
project will focus on them. PE files have many header fields that instruct the computer how to
handle the file. Specifically, we will focus on the Windows API function calls. We plan to use the C++
programming language to extract the function calls from a PE file. This will be done by decompressing
the file and parsing the individual operating system calls it makes. These calls will make up the
signature table database that will be used to compare the programs in question to known malicious and
benign files. The database will be built in MySQL.
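One possible shape for the signature table is sketched below. It uses Python's built-in sqlite3
module with an in-memory database purely as a stand-in for MySQL so the example is self-contained.
The table layout, the function names, and the sample rows are all assumptions made for
illustration; the API names are real Windows functions, but their pairing with the labels here is
invented.

```python
import sqlite3

# In-memory SQLite database standing in for the planned MySQL database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE api_calls ("
    " file_id TEXT NOT NULL,"
    " label TEXT NOT NULL,"      # 'malicious' or 'benign'
    " api_name TEXT NOT NULL)"
)

def store_sample(file_id, label, api_names):
    """Insert one row per distinct API call extracted from a training file."""
    rows = [(file_id, label, name) for name in sorted(set(api_names))]
    conn.executemany("INSERT INTO api_calls VALUES (?, ?, ?)", rows)
    conn.commit()

def load_signature(file_id):
    """Return the set of API calls stored for a file (empty set if unknown)."""
    cursor = conn.execute(
        "SELECT api_name FROM api_calls WHERE file_id = ?", (file_id,)
    )
    return {row[0] for row in cursor}

store_sample("sample_a", "malicious",
             ["OpenProcess", "WriteProcessMemory", "CreateRemoteThread"])
store_sample("sample_b", "benign", ["CreateFileW", "ReadFile", "CloseHandle"])
print(sorted(load_signature("sample_a")))
```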
Our initial signature table will be made from a large sample, on the order of thousands, of
known malware with a wide range of complexities, as well as some benign files. The wider the range of
complexity and the larger the sample size, the more accurate the system will be at detecting malware
while passing benign files. It is important to include benign files as well as malicious ones because many
benign files make the same system calls as malicious files. Preventing false positives is one of the
main objectives of the system. The malicious files, of course, must be present to form the malicious
call signatures.
To do the actual classification, we plan to use the K-nearest neighbor classification method. This
method takes the unknown file and places it in a feature space alongside the training files based on its
API signature. We then find its “K nearest neighbors,” and because the neighbors are training files, we know
whether each is benign or malicious. If more neighbors are malicious than benign, the file is ruled
malicious. Likewise, if more neighbors are benign, the file is ruled benign.
A simple visual could be represented by:
Here a (-) is a benign file, a (+) is a malicious file, and the x is the file in question. In case
(a), the file would be called benign because its one nearest neighbor is benign. In case (b), the file would
be called malicious (a tie always goes toward malicious unless other information, such as user input, can
be taken into account). Lastly, in case (c), the file would be malicious.
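The voting rule, including the tie-goes-to-malicious policy, can be sketched as follows. The
distance measure is an assumption: the design does not yet name one, so this sketch uses the
Jaccard distance between sets of API calls. The training sets are invented for illustration, and
the function names are hypothetical.

```python
def jaccard_distance(a: set, b: set) -> float:
    """1 minus (shared calls / all calls); 0.0 means identical sets."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)

def classify(unknown: set, training: list, k: int) -> str:
    """Majority vote among the k nearest training files; ties go to malicious."""
    nearest = sorted(training, key=lambda item: jaccard_distance(unknown, item[0]))[:k]
    malicious_votes = sum(1 for _, label in nearest if label == "malicious")
    benign_votes = len(nearest) - malicious_votes
    return "malicious" if malicious_votes >= benign_votes else "benign"

# Invented training data: (set of API calls, label).
training = [
    ({"OpenProcess", "WriteProcessMemory", "CreateRemoteThread"}, "malicious"),
    ({"OpenProcess", "WriteProcessMemory", "VirtualAllocEx"}, "malicious"),
    ({"CreateFileW", "ReadFile", "CloseHandle"}, "benign"),
    ({"CreateFileW", "WriteFile", "CloseHandle"}, "benign"),
    ({"RegOpenKeyExW", "RegQueryValueExW", "RegCloseKey"}, "benign"),
]
unknown = {"OpenProcess", "WriteProcessMemory", "CloseHandle"}

for k in (1, 3, 4, 5):
    print(k, classify(unknown, training, k))
```

With this toy data, K = 4 produces a two-to-two tie that is resolved as malicious, while K = 5 takes
in every training file and the three benign samples outvote the two close malicious ones.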
Odd values of K are often chosen to avoid ties and remove the need to assume an outcome or prompt
the user for a decision. K must also be large enough to limit false positives and negatives but
small enough that it does not mistakenly pull in clusters of the other class. Our specific value of K
has yet to be determined; it will be chosen by testing on the actual files.
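One way such testing could be carried out is leave-one-out evaluation: each training file is held
out in turn, classified using the rest, and the fraction classified correctly is recorded for each
candidate K. The sketch below is an assumption about procedure, not the project's settled method. It
uses Jaccard distance on invented API-call sets, and all names are hypothetical.

```python
def jaccard_distance(a: set, b: set) -> float:
    union = a | b
    return 0.0 if not union else 1.0 - len(a & b) / len(union)

def knn_label(unknown: set, training: list, k: int) -> str:
    """Majority vote among the k nearest files; ties go to malicious."""
    nearest = sorted(training, key=lambda item: jaccard_distance(unknown, item[0]))[:k]
    malicious = sum(1 for _, label in nearest if label == "malicious")
    return "malicious" if malicious >= len(nearest) - malicious else "benign"

def leave_one_out_accuracy(samples: list, k: int) -> float:
    """Hold out each sample once and classify it with the remaining ones."""
    correct = 0
    for i, (features, label) in enumerate(samples):
        rest = samples[:i] + samples[i + 1:]
        if knn_label(features, rest, k) == label:
            correct += 1
    return correct / len(samples)

# Invented data: three malicious and three benign API-call sets.
samples = [
    ({"OpenProcess", "WriteProcessMemory", "CreateRemoteThread"}, "malicious"),
    ({"OpenProcess", "WriteProcessMemory", "VirtualAllocEx"}, "malicious"),
    ({"OpenProcess", "CreateRemoteThread", "VirtualAllocEx"}, "malicious"),
    ({"CreateFileW", "ReadFile", "CloseHandle"}, "benign"),
    ({"CreateFileW", "ReadFile", "WriteFile"}, "benign"),
    ({"CreateFileW", "CloseHandle", "WriteFile"}, "benign"),
]

for k in (1, 3, 5):
    print(f"K={k}: accuracy {leave_one_out_accuracy(samples, k):.2f}")
```

On this tiny data set K = 1 and K = 3 classify every held-out file correctly, while K = 5 misclassifies
all of them because the vote is swamped by the other class, which is the "too large" failure
described above.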
Our system will be designed with the following requirement in mind. If this
requirement is not met, the system will not work.
• Must run on one of the following Microsoft Windows operating systems, because the system
  is based on analyzing Windows API calls
  o Windows 95
  o Windows 98
  o Windows Millennium/2000/NT
  o Windows XP
  o Windows Vista
  o Windows 7
As with any engineering project, our product has design specifications. These detail what the
product will do and how it will do it. However, we must also make clear what the product will not
do. It will not be designed to, and will not, perform the following functions:
• Deletion of malicious files
• Directly or indirectly improving or fixing
  o System performance
  o Connectivity issues
• Anything not explicitly specified in the product documentation
Objectives
The needs of our system include the following, listed from most important to least important:
• Reliability (most important)
• Ease of Use
• Maintenance
• Cost
• Use of System Resources (least important)
Ideally all objectives will be met, but realistically, several of our objectives will probably
not be met due to time and resource constraints. Being reliably accurate is of the utmost importance. If
our system does not accurately identify at least 62% (the lowest rate among several popular detection systems)
of the malware it scans, it should be considered unsuccessful, as it would not compete with other products
currently on the market.
We would prefer that our malware detection system meet the following specifications. Failing to meet
them may lead to users having an unpleasant experience and could allow malware
to infect their systems.
• 83% or greater detection rate for malicious files
• Signature file updates required less than once per week
• 5% or less false positive rate
• Use less than 300 MB of disk space
• Use less than 10 MB of RAM
• Cost less than $35/year if marketed
• No maintenance cost for the end user
• Startup upon user request
• Able to be installed in under 10 minutes
• Allows user to scan individual files as well as numerous files at a time
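The two accuracy targets in this list can be checked directly from test results. The sketch below
assumes the usual definitions: detection rate is the share of malicious files flagged, and false
positive rate is the share of benign files wrongly flagged. The counts in the example are made up,
and the function names are hypothetical.

```python
def detection_rate(true_positives: int, false_negatives: int) -> float:
    """Fraction of malicious files that were flagged as malicious."""
    return true_positives / (true_positives + false_negatives)

def false_positive_rate(false_positives: int, true_negatives: int) -> float:
    """Fraction of benign files that were wrongly flagged as malicious."""
    return false_positives / (false_positives + true_negatives)

def meets_targets(tp, fn, fp, tn, min_detection=0.83, max_false_positive=0.05):
    """Compare measured rates with the 83% and 5% targets from the list above."""
    return (detection_rate(tp, fn) >= min_detection
            and false_positive_rate(fp, tn) <= max_false_positive)

# Made-up test run: 1000 malicious and 1000 benign files.
print(detection_rate(850, 150))
print(false_positive_rate(40, 960))
print(meets_targets(tp=850, fn=150, fp=40, tn=960))
```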
OBJECTIVE TREE
• Reliability
  o High detection rate
  o Low false positive rate
  o Signature file update
• Ease of Use
  o Signature file update
  o Short installation time
  o Short start up time
• Maintenance
  o Signature file update
• Cost
  o Low cost
• Use of System Resources
  o Low disk usage
  o Low RAM usage
  o Short installation time
  o Short start up time
Conclusion
This document summarizes an intelligent malware detection system. While
most current detection systems use the format of past malware as the basis for detection, our intelligent
system will take files apart and look at the functions they call. By comparing these function calls to
those of known malware, we hope to increase the detection rate for mutated, unknown malware.
Many technical decisions must still be made regarding matters such as the development environment and
user interface. These decisions will be made after further discussions with our mentor and among
ourselves. The core of the project, building an intelligent malware detection system using Windows API
calls, is, however, set.
Works Cited
"Malware." Definition. TechTerms.com, n.d. Web. 20 Oct. 2014.
Egele, Manuel; Scholte, Theodoor; Kirda, Engin; Kruegel, Christopher. A Survey on Automated
Dynamic Malware Analysis Techniques and Tools. SBA. 1 Oct. 2014.
"Common Malware Types: Cybersecurity 101." Veracode. N.p., n.d. Web. 21 Oct. 2014.
"Most Damaging Malware." About. N.p., n.d. Web. 21 Oct. 2014.
Ye, Yanfang, Dingding Wang, Tao Li, Dongyi Ye, and Qingshan Jiang. "An Intelligent PE-malware
Detection System Based on Association Mining." Journal in Computer Virology 4.4 (2008): 323-34. Web.
"What You Need to Know about 'Virus Signatures'" About. N.p., n.d. Web. 22 Oct. 2014.
Ye, Yanfang. "Classification." CS 480 - Fall 2014 Senior Design. Morgantown, 2014.