West Virginia University
CpE 480 Group 8
Instructor: Yenumula Reddy
Mentor: Yanfang Ye

Intelligent Malware Detection
Individual Research Paper
Joshua Suess
2 December 2014

Contents
Malware Today
Current Solutions
Stakeholders
Design
Objectives
Conclusion
Works Cited

Malware Today

What is malware? The word is a portmanteau of "malicious" and "software." Technically speaking, malware is software whose sole purpose is to perform unwanted actions. Numerous terms are used to classify malware; the most common are viruses, worms, Trojan horses, and ransomware. Each class is distinguished by the features that let the software carry out the malicious tasks it was designed for, often while avoiding detection for long periods of time.
Most malware causes at least some of the following:
• Performance issues
• Connectivity issues
• Crashes
• Loss of control
  o Uninitiated shutdowns and restarts
  o Inability to perform input and output
• Unexplainable network activity

Viruses are the most commonly discussed malware because they are the most flexible. Viruses are used to steal information, create and delete files, create botnets (clusters of infected computers used to perform malicious activities such as denial-of-service attacks), and display advertisements, among many other tasks. They spread by attaching themselves to other programs and documents. Infection occurs when a user runs or opens the infected program or document. One example is the Brain virus.

A worm is closely related to a virus. Worms consume bandwidth and overload servers by exploiting holes in an operating system's design. They often contain sections that can alter files, steal data, and build botnets. The main difference between a worm and a virus is that a worm is self-replicating and spreads without needing a user to start it. Simply plugging in an infected drive or connecting to a compromised network can allow a worm to spread to a machine. This makes a worm a very dangerous piece of software, depending on its main function. The most newsworthy worm in recent history is Stuxnet, which infected Iranian uranium enrichment computers and caused centrifuges to spin at incorrect speeds. This reportedly set Iranian nuclear production back more than two years.

Trojan horses (usually shortened to just Trojans) are a simpler, yet more dangerous, form of malware than viruses. They are simpler in that they require users to manually download and install them. This is often accomplished by disguise: users are tricked into downloading what they believe is legitimate software when it is in fact a Trojan. A Trojan is very dangerous because it can allow the author or distributor to directly access a machine.
It can steal data and files, perform keylogging (recording every key the user presses), steal electronic money (bitcoins), watch the screen, and use the machine in a botnet, all while hiding the identity of whoever benefits from the information. One example is the Scores virus, which infected Macintosh computers. It installed files and closed running programs at certain time intervals without the consent of the user, effectively crashing the computer.

Ransomware is a type of Trojan that does just what it sounds like. It takes a machine and encrypts or password-protects all the files on it, effectively holding them hostage. It then usually displays a message informing the user of the problem and how to go about fixing it. The usual fix involves making a payment to a foreign bank account via wire transfer. Once payment is made, the attacker provides the decryption key or password so the machine can be restored without losing the files. Ransomware usually spreads like a worm, with the initial infection occurring when an infected file is opened. Examples of ransomware include CryptoLocker, Ransom.A, and Cryzip.

The need for malware detection today is as great as ever. It seems that every week a news article reports the details of a data breach at some company. The usual suspect is a variant or hybrid of the types of malware above. It typically infects point-of-sale terminals or the payment processing network, siphoning off the credit card information of unsuspecting customers to be sold on the black market. There has also been an outbreak of ransomware. Malware is out there, and it affects almost everyone every day in some way, shape, or form.

To keep a machine free of malware or to remove it, the malware must first be detected, which is often the most challenging part because malware is so adaptable. To circumvent anti-malware programs, malware can be polymorphic, encrypt itself, and apply other obfuscation techniques.
These techniques hide the true nature of the software by reorganizing the code, scrambling it, and packing it, among other things, so that the file looks benign when it is actually malicious. Since most current malware detection programs use past malware as a reference, these changes often leave current strains undetectable until they are caught and analyzed so they can be referenced in the future.

Current Solutions

All programs and code share certain features that make them executable. They must make function calls to the host operating system in order to receive computation time and memory. Most programs are also written in high-level languages such as C, Java, C++, and Python, meaning they begin as text that is compiled or interpreted into instructions the computer can execute. Programs and files are also classified by how they behave when they run and what they are made with. These classifications are indicated by file extensions.

Most anti-malware software, including Symantec’s Norton, McAfee, and Kaspersky, uses signatures to identify malware. A signature is a way to represent a file, whether it is benign or malicious. Signatures are formed by anti-malware programs as they scan a file. They are based on the classification of the file (extensions such as .exe, .jar, and .dll), pieces of code in the file, and, less commonly, the behavior (function calls) the code produces when it is run. In practice, a signature is usually a number or set of numbers computed from these attributes. Signatures are often called definitions and are stored as .dat files.

As an anti-malware program runs, it scans files and creates a signature for each one. It then compares the created signature to a table of known malicious signatures in the .dat file. If a match is found, the program notifies the user. The user then has the opportunity to inspect the suspected program and make the final decision as to whether it is malicious or benign.
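The scan-and-compare loop described above can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that a signature is the SHA-256 digest of a file's bytes and that the definitions table is an in-memory set; real products derive signatures from richer attributes. The function names `make_signature` and `scan` and the sample byte strings are hypothetical.

```python
import hashlib

def make_signature(data: bytes) -> str:
    """Illustrative signature: the SHA-256 digest of the file contents."""
    return hashlib.sha256(data).hexdigest()

def scan(data: bytes, known_malicious: set[str]) -> bool:
    """Return True when the file's signature appears in the definitions table."""
    return make_signature(data) in known_malicious

# Hypothetical stand-ins for file contents (not real malware).
known_sample = b"example payload v1"
mutated_sample = b"example payload v2"  # differs from the known sample by one byte

# Definitions table built from previously analyzed samples.
definitions = {make_signature(known_sample)}

print(scan(known_sample, definitions))    # the known sample matches the table
print(scan(mutated_sample, definitions))  # the mutated sample does not match
```

Because any change to the bytes produces a different digest, an exact-match table of this kind recognizes only samples it has already seen.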
Many anti-malware programs automatically delete known malicious files without requiring any action from the user.

As described, malware is ever evolving, and each change alters the signature. If the new signature is far from the original (as is usually the case with encryption, packing, and polymorphism), then the malware will not be detected when it is compared to known signatures. This allows the malware to run unnoticed until either the signature file is updated or the user realizes an infection has occurred. In signature-based systems, updating the signature table is one of the most important tasks. Without updates, new malware will not be detectable unless it is very similar to known malware. As such, the publishers of anti-malware technology update their signature tables very often. Many suites download updates every day.

Stakeholders

There are three stakeholders for this product. First and foremost is the customer, or primary stakeholder. The primary stakeholder is the person or group having files scanned by the product. This could be anyone who owns a computer, whether an individual, a business, or a government entity. The second stakeholder is the group of people required to maintain the system. For this software, as for many other software systems, the creators are stakeholders because malware is an ever-evolving problem and the software must be updated to protect against the latest threats. The final stakeholder is the computer on which the system is installed. System considerations have to be taken into account to allow the widest array of users to install and run our system.

Design

Our malware detection program will borrow some features of the signature-based method described above. We will compare signatures, but the way we create our signatures and how we compare them will be different. We will take Windows PE files, which have a special format, and analyze them to create signatures based on their Windows API function calls.
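One way to picture a signature built from Windows API function calls is as a 0/1 vector over a fixed list of API names. The Python sketch below is an illustration only, not the project's implementation: the paper does not specify the encoding or the distance measure, so the five-name vocabulary, the helper names `api_signature` and `hamming_distance`, and the two example call sets are all assumptions. It also assumes the call names have already been extracted from each PE file.

```python
# Hypothetical vocabulary of Windows API function names used as features.
API_VOCABULARY = [
    "CreateFileW",
    "WriteFile",
    "RegSetValueExW",
    "CreateRemoteThread",
    "InternetOpenUrlW",
]

def api_signature(imported_calls: set[str]) -> list[int]:
    """Encode a file as a 0/1 vector: 1 if the file uses that API call."""
    return [1 if name in imported_calls else 0 for name in API_VOCABULARY]

def hamming_distance(a: list[int], b: list[int]) -> int:
    """Count the positions where two signatures differ."""
    return sum(x != y for x, y in zip(a, b))

# Hypothetical call sets, assumed to have been extracted from two PE files.
file_a = {"CreateFileW", "WriteFile"}
file_b = {"CreateFileW", "WriteFile", "CreateRemoteThread", "InternetOpenUrlW"}

sig_a = api_signature(file_a)
sig_b = api_signature(file_b)
print(sig_a, sig_b, hamming_distance(sig_a, sig_b))
```

Unlike an exact-match signature, a vector of this kind lets two files be compared by how many calls they share, so a file can be judged by its similarity to known samples.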
PE files are the basis for many malware samples today, which is why this software project will focus on them. PE files have many header fields that instruct the computer how to handle the file. Specifically, we will focus on the Windows API function calls. We plan to use the C++ programming language to extract the function calls from a PE file. This will be done by decompressing the file and parsing the individual operating system calls it makes. These calls will make up the signature table database that will be used to compare the programs in question to known malicious and benign files. The database will be built in MySQL.

Our initial signature table will be built from a large sample (on the order of thousands) of known malware with a wide range of complexities, as well as some benign files. The wider the range of complexity and the larger the sample size, the more accurate the system will be at detecting malware while passing benign files. It is important to include benign files as well as malicious ones because many benign files make the same system calls as malicious files. Preventing false positives is one of the main objectives of the system. The malicious files, of course, must be present to form the malicious call signatures as well.

To do the actual classification, we plan to use the K-nearest neighbor classification method. This method takes the unknown file and places it among the training files based on its API signature. We then find its “K-nearest neighbors,” and because the neighbors are training files, we know whether they are benign or malicious. If more neighbors are malicious than benign, the file is ruled malicious. Likewise, if more neighbors are benign, the file is ruled benign. A simple visual could be represented by:

[Figure: K-nearest neighbor examples, panels (a), (b), and (c)]

Here a (-) is a benign file, a (+) is a malicious file, and the x is the file in question. In case (a), the file would be called benign because its 1 nearest neighbor is benign.
In case (b), the file would be called malicious (a tie will always go toward malicious unless other information, such as user input, can be taken into account). Lastly, in case (c), the file would be malicious.

Odd values of K are often chosen to avoid ties and to remove the need to assume a result or prompt the user for a decision. K must also be large enough to limit false positives and false negatives but small enough that clusters of the other kind are not mistakenly included. Our specific value of K has yet to be determined; it will be chosen by testing on the actual files.

Our system will be designed with the following requirements in mind. If these requirements are not met, the system will not work.
• Must use one of the following Microsoft Windows operating systems, because the system is based on Windows API calls
  o Windows 95
  o Windows 98
  o Windows Millennium/2000/NT
  o Windows XP
  o Windows Vista
  o Windows 7

As with any engineering project, our product has design specifications. These detail what the product will do and how it will do it. However, we must also make clear what the product will not do. It will not be designed to perform, and will not perform, the following functions:
• Deletion of malicious files
• Directly or indirectly improving or fixing
  o System performance
  o Connectivity issues
• Anything not explicitly specified in the product documentation

Objectives

The needs of our system, ranked from most important to least important, are:
1. Reliability
2. Ease of Use
3. Maintenance
4. Cost
5. Use of System Resources

Ideally all objectives will be met, but realistically several of them will probably not be met due to time and resource constraints. Being reliably accurate is of the utmost importance. If our system does not accurately identify at least 62% of the malware it scans (the lowest rate among several popular detection systems), it should be considered unsuccessful because it does not compete with other products currently on the market.
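The reliability figures in this section, and the open question of which K to use, could be checked with a small experiment along the lines of the Python sketch below. It is an illustration under stated assumptions, not the project's implementation: files are represented as 0/1 API-call vectors, distance is the Hamming distance, ties go to malicious as stated in the Design section, and the tiny training and test sets are made up. The names `knn_classify` and `evaluate` are hypothetical.

```python
from collections import Counter

MALICIOUS, BENIGN = "+", "-"

def hamming(a, b):
    """Count the positions where two 0/1 vectors differ."""
    return sum(x != y for x, y in zip(a, b))

def knn_classify(unknown, training, k):
    """Majority vote among the k nearest training files; ties go to malicious."""
    nearest = sorted(training, key=lambda item: hamming(unknown, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return MALICIOUS if votes[MALICIOUS] >= votes[BENIGN] else BENIGN

def evaluate(test_set, training, k):
    """Return (detection rate, false positive rate) on a labeled test set.

    Assumes the test set contains at least one file of each class.
    """
    results = [(knn_classify(vec, training, k), label) for vec, label in test_set]
    malicious_total = sum(1 for _, label in results if label == MALICIOUS)
    benign_total = sum(1 for _, label in results if label == BENIGN)
    detected = sum(1 for pred, label in results if pred == label == MALICIOUS)
    false_pos = sum(
        1 for pred, label in results if pred == MALICIOUS and label == BENIGN
    )
    return detected / malicious_total, false_pos / benign_total

# Made-up training files: (API-call vector, known label).
training = [
    ((1, 1, 0, 0, 0), BENIGN),
    ((1, 0, 0, 0, 0), BENIGN),
    ((0, 1, 0, 0, 0), BENIGN),
    ((1, 1, 0, 1, 1), MALICIOUS),
    ((0, 1, 1, 1, 1), MALICIOUS),
    ((1, 0, 1, 1, 0), MALICIOUS),
]

# Made-up labeled test files used to measure the two rates.
test_set = [
    ((1, 1, 1, 0, 0), BENIGN),
    ((0, 0, 0, 0, 0), BENIGN),
    ((1, 1, 1, 1, 1), MALICIOUS),
    ((0, 0, 1, 1, 0), MALICIOUS),
]

for k in (1, 3):
    detection_rate, false_positive_rate = evaluate(test_set, training, k)
    print(k, detection_rate, false_positive_rate)
```

Running the same evaluation over several candidate values of K on real samples is one way the value of K could be selected and the detection rate compared against the 62% threshold.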
We would prefer that our malware detection system heed the following specifications. Not following them may give users an unpleasant experience and could allow malware to infect their systems.
• 83% or greater detection rate for malicious files
• Update signature file less than once per week
• 5% or less false positive rate
• Use less than 300MB of disk space
• Use less than 10MB of RAM
• Cost less than $35/year if marketed
• No maintenance cost for the end user
• Startup upon user request
• Able to be installed in under 10 minutes
• Allows user to scan individual files as well as numerous files at a time

Objective Tree
• Reliability
  o High detection rate
  o Low false positive rate
  o Signature file update
• Ease of Use
  o Signature file update
  o Short installation time
  o Short start-up time
• Maintenance
  o Signature file update
• Cost
  o Low cost
• Use of System Resources
  o Low disk usage
  o Low RAM usage
  o Short installation time
  o Short start-up time

Conclusion

This document has summarized an intelligent malware detection system. While most current detection systems use the form of past malware as the basis for detection, our intelligent system will take files apart and look at the functions they call. By comparing these calls with the function call rates of known malware, we hope to increase the detection rate for mutated, unknown malware. Many technical decisions, such as the development environment and the user interface, must still be made. They will be settled through further discussion with our mentor and among ourselves. The core of the project, building an intelligent malware detection system using Windows API calls, is, however, set.

Works Cited

"Malware." Definition. TechTerms.com, n.d. Web. 20 Oct. 2014.
Egele, Manuel, Theodoor Scholte, Engin Kirda, and Christopher Kruegel. "A Survey on Automated Dynamic Malware Analysis Techniques and Tools." SBA. 1 Oct. 2014.
"Common Malware Types: Cybersecurity 101." Veracode. N.p., n.d. Web. 21 Oct. 2014.
"Most Damaging Malware." About. N.p., n.d. Web. 21 Oct. 2014.
Ye, Yanfang, Dingding Wang, Tao Li, Dongyi Ye, and Qingshan Jiang. "An Intelligent PE-malware Detection System Based on Association Mining." Journal in Computer Virology 4.4 (2008): 323-34. Web.
"What You Need to Know about 'Virus Signatures.'" About. N.p., n.d. Web. 22 Oct. 2014.
Ye, Yanfang. "Classification." CS 480 - Fall 2014 Senior Design. Morgantown, 2014.