Chapter 4: Anti-Virus

Anti-Virus
Three tasks for anti-virus
1. Detection
   o Infected or not? Provably undecidable…
2. Identification
   o May be separate from detection, depending on detection method used
3. Disinfection
   o Remove the virus

Detection: Static Methods
Generic methods
   o Detects known and unknown viruses
   o For example, anomaly detection
Virus-specific methods
   o Detects known viruses
   o For example, signature detection
Static --- virus code not running
Dynamic --- virus code running

Detection Outcomes
Also can have ghost positive
Virus remnant “detected”
   o But virus is no longer there
How can this happen?
   o Previous disinfection was incomplete

Static Detection
Detection without running virus code
Three approaches…
1. Scanners
   o Signature
2. Heuristics
   o Look for “virus-like” code
3. Integrity Checkers
   o Hash/checksum

Scanners
On-demand
   o Files scanned when you say so
On-access
   o Constant scanning in background
   o Whenever file is accessed, it’s scanned

Scanners
Signature scanning
   o Viruses represented by “signature”
   o Signature == pattern of bits in a virus (might include wildcards)
“Hundreds of thousands of signatures”
Not feasible to scan one-by-one
   o Multiple pattern search
   o Efficiency is critical
We look in detail at several algorithms

Algorithm: Aho-Corasick
Developed in 1975 for bibliographic search
Based on finite automaton (graph)
   o Circles are search states
   o Edges are transitions
   o Double circles are final states/output
And a failure function
   o What to do when no suitable transition
   o I.e., where to resume “matching”

Algorithm: Aho-Corasick
When virus scanning, search for virus signature, which is a bit string
For simplicity, illustrate algorithm using English words
For our example…
Scan for any of the following words:
   o hi, hips, hip, hit, chip

Algorithm: Aho-Corasick
Aho-Corasick Example [figure]

Algorithm: Aho-Corasick
How to construct automaton?
   o And failure function
Build the automaton --- next slide
   o A “trie”, also known as a “prefix tree”
Then determine failure function
   o Two slides ahead

Aho-Corasick: Trie
Labels added in breadth-first order
Closest to root get smallest numbers

Aho-Corasick: Failure Function
Depth 1 nodes
   o Fail goes back to start state
For other states
   o Go back to earliest place where search can resume
   o Pseudo-code is in the book

Aho-Corasick
The bottom line…
Linear search that can find multiple signatures
   o Like searching in parallel for related signatures
Efficient representation of automaton is the challenge
   o Both time and space issues

Algorithm: Veldman
Linear search on “reduced” signatures
   o Sequential search on reduced set
From each signature, select 4 adjacent non-wildcard bytes
   o Want as many signatures as possible to have each selected 4-byte pattern
Then use 2 hash tables to filter…
   o Hash tables: 1st 2 bytes & 2nd 2 bytes

Algorithm: Veldman
Example
Suppose the following 5 signatures
   o blar?g, foo, greep, green, agreed
Select 4-byte patterns, no wildcards: [figure]

Algorithm: Veldman
Hashes act as filters
Test things that pass thru both filters
   o In this example, get things like “grar”

Algorithm: Veldman
Veldman allows for wildcards and complex signatures
   o Aho-Corasick does not
But both algorithms analyze every byte of input
Is it possible to do better?
   o That is, can we skip some of the input?
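
Before moving on to algorithms that skip input, here is a minimal Python sketch of the Aho-Corasick construction described earlier (trie first, then failure function), using the example words hi, hips, hip, hit, chip. The names build_automaton and search are illustrative, states are numbered in insertion order rather than the breadth-first labels on the trie slide, and text characters stand in for signature bytes. It is a sketch under those assumptions, not the book's pseudo-code.

```python
from collections import deque


def build_automaton(words):
    """Build the trie (goto), failure links, and output sets for `words`."""
    goto = [{}]        # goto[state][char] -> next state; state 0 is the start
    output = [set()]   # words recognized on entering each state
    for word in words:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                output.append(set())
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        output[state].add(word)

    fail = [0] * len(goto)
    queue = deque(goto[0].values())   # depth-1 states fail back to the start
    while queue:                      # breadth-first over the trie
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # A match ending at the failure state also ends here.
            output[nxt] |= output[fail[nxt]]
    return goto, fail, output


def search(text, automaton):
    """Return sorted (start_index, word) pairs for every match in `text`."""
    goto, fail, output = automaton
    state = 0
    hits = []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]       # no transition: follow failure links
        state = goto[state].get(ch, 0)
        for word in output[state]:
            hits.append((i - len(word) + 1, word))
    return sorted(hits)


automaton = build_automaton(["hi", "hips", "hip", "hit", "chip"])
for start, word in search("chips hit", automaton):
    print(start, word)
```

Each input character is consumed exactly once, which is the "linear search for multiple signatures" property noted on the bottom-line slide; the dictionaries used here are convenient but not the compact representation a production scanner would need.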
Algorithm: Wu-Manber
Like Veldman’s algorithm
   o But can skip over bytes that can’t possibly match
   o Faster, improved performance
Illustrate algorithm with same signatures used for Veldman’s:
   o blar?g, foo, greep, green, agreed

Algorithm: Wu-Manber
Calculate MINLEN
   o Min length of any pattern substring
Two hash tables
   o SHIFT --- number of bytes that can safely be skipped
   o HASH --- mapping to signatures
Input bytes denoted b_1, b_2, …, b_n
Start at b_MINLEN, consider byte pairs

Algorithm: Wu-Manber
Example: Suppose hash tables are… [figure]

Wu-Manber Example
Here, MINLEN = 3
Start at b_MINLEN

Algorithm: Wu-Manber
How to construct hash tables?
It’s a 4-step process
   o Calculate MINLEN
   o Initialize SHIFT table
   o Fill SHIFT table
   o Fill HASH table

Algorithm: Wu-Manber
Calculate MINLEN
   o Minimum number of adjacent, non-wildcard bytes in any signature
For this example, we have
   o blar?g   4
   o foo      3
   o greep    5
   o green    5
   o agreed   6
So we have MINLEN = 3

Algorithm: Wu-Manber
SHIFT table
Extract MINLEN pattern substrings
   o blar?g   bla
   o foo      foo
   o greep    gre
   o green    gre
   o agreed   agr
Extract all distinct 2-byte sequences
   o bl, la, fo, oo, gr, re, ag
If input pair is not one of these, safe to skip MINLEN - 1 bytes

Algorithm: Wu-Manber
SHIFT table
Initialize SHIFT table to MINLEN - 1
For 2-byte pairs: bl, la, fo, oo, gr, re, ag
   o Denote as xy
   o Let q_xy be rightmost ending position of xy in any pattern substring
   o For example, gr appears in agr and gre, but bl only in bla
   o So, q_gr = 3 while q_bl = 2
   o Then set SHIFT[xy] = MINLEN - q_xy
Note: Wildcard matches everything…

Algorithm: Wu-Manber
HASH table
If SHIFT[xy] = MINLEN - q_xy = 0
   o Then we are at right edge of a pattern
So, set HASH[xy] to all signatures with pattern substring ending xy
For example
   o HASH[gr] = agreed
   o HASH[re] = greep, green

Algorithm: Wu-Manber
Here, we illustrated simplest form of the algorithm
More advanced forms can handle 10s of thousands of signatures
Worst case performance is terrible
   o Sequential search thru every byte of input for every signature…
But tests show it’s good in practice

Testing
How can we know if scanner works?
Test on live viruses?
   o Might not be a good idea
EICAR standard antivirus test file
   o Not too useful either
So, what to do?
   o Author doesn’t have any suggestions!

Improving Performance
“Grunt scanning” --- scan everything
   o Slow slow slow
Search only beginning and end of files
Scan code entry point
   o And points reachable from entry point
If position of virus in file is known…
   o Make it part of the “signature”
Limit scans to size of virus(es)

Improving Performance
Only scan certain types of files
   o Not so viable today
Only rescan files that have changed
   o How to detect change?
   o Where to store this info? Cache? Database? Tagged to file?
   o Updates to signatures? Must rescan…
   o How to checksum efficiently?

Improving Performance
How to checksum efficiently?
   o Checksumming entire file might take longer than scanning it
   o Only checksum parts that are scanned
How to avoid checksum tampering?
   o Encrypt? Where to store the key?
   o Checksum the checksums?
   o Other?

Improving Performance
Improve the algorithm
   o Maybe tailor algorithms to file type
Optimize implementation
   o May be of limited value
Other?

Static Heuristics
Like having expert look at code…
Look for “virus-like” code
   o Static, so we don’t execute the code
2 step process
   o Gather data
   o Analyze data

Static Heuristics
What data to gather?
“Short signatures” or boosters
   o Junk code
   o Decryption loop
   o Self-modifying code
   o Undocumented API calls
   o Unusual/non-compiler instructions
   o Strings containing obscenities or “virus”
Stopper --- thing virus would not do

Static Heuristics
Other heuristics include…
Length of code
   o Too short? May be appended virus
Statistical analysis of instructions
   o Handwritten assembly
   o Encrypted code
Might look for signature heuristics
   o Common characteristics of signatures

Static Heuristics
Analysis phase
May be simple…
   o Weighted sum of various factors
   o Unusual opcodes, etc.
…or complex
   o Machine learning (HMM, neural nets, etc.)
   o Data mining
   o Heuristic search (genetic algorithm, etc.)

Integrity Checkers
Look for unauthorized change to files
Start with 100% clean files
Compute checksums/hashes
Store checksums
Recompute checksums and compare
   o If they differ, a change has occurred

Integrity Checkers
3 types of integrity checkers
Offline --- recompute checksums periodically (e.g., once/week)
Self-checking --- modify file to check itself when run
   o Essentially, a beneficial “virus”
   o For example, virus scanner self-checks
Integrity shell --- OS performs checksum before file executed

Detection: Dynamic Methods
Detection based on running the code
   o Observe the “behavior”
Two types of dynamic methods
   o Behavior monitor/blockers
   o Emulation

Behavior Monitor/Blocker
Monitor program as it runs
Watch for “suspicious” behavior
What is suspicious?
   o It’s too far from “normal”
What is normal?
   o A statistical measure --- mean, average
How far is too far?
   o Depends on variance, standard deviation

Behavior Monitor/Blocker
“Normal” monitored in 3 ways…
1. Actions that are permitted
   o White list, positive detection
2. Actions that are not permitted
   o Black list, negative detection
3. Some combination of these two
Analogies to immune system
   o Distinguish self from non-self

Behavior Monitor/Blocker
“Care must be taken… because anomalous behavior does not automatically imply viral behavior”
   o That’s an understatement!
This is the fundamental problem in anomaly detection
   o Potential for lots of false positives

Behavior Monitor/Blocker
Look for short “dynamic signatures”
   o Like signature detection, but input string generated dynamically
But what to monitor?
Infection-like behavior?
   o Open an exe for read/write
   o Read code start address from header
   o Write start address to header
   o Seek to end of exe, append to exe, etc.

Behavior Monitor/Blocker
How to reduce false positives?
   o Consider “ownership” --- some apps get more leeway (e.g., browser clearing cache)
How to prevent damage?
   o “Dynamic” implies code actually running…
   o System undo capability?
How long to monitor?
   o Monitoring increases overhead
   o Can virus outlast monitor?

Emulation
Execute code, but not for real…
Instead, emulate execution
Emulation can provide all of the info gotten thru code execution
   o But much safer
“Execute” code in emulator
   o Gather info for static/dynamic signatures or heuristics
   o Behavior blocker stuff applies too

Emulation
Emulation and polymorphic detection
   o Let virus decrypt itself
   o Then use ordinary signature scan
When has decryption occurred?
   o Use some heuristics…
   o Execution of code that was modified (decrypted) or in such a memory location
   o More than N bytes of modified code, etc.

Emulator Anatomy
Emulate by single-stepping thru code?
   o Easily detected by viruses (???)
   o Danger of virus “escaping” emulator
“A more elaborate emulation mechanism is needed”
   o Why?
Conceptually, 5 parts to an emulator
   o Next slide please…

Emulator Anatomy
5 parts to new-and-improved emulator
1. CPU emulation --- nothing more to say
2. Memory emulation
3. Hardware and OS emulation
4. Emulation controller
5. Extra analyses

Memory Emulation
This could be difficult…
   o 32-bit addressing, so 4G of “memory”
Do we need to emulate all of this?
   o No, most apps only use a small amount
Keep track of memory that’s modified and where it is located
   o Only need to deal with memory that is modified by a specific app/virus

Hardware/OS Emulation
Use stripped-down, fake OS, due to…
   o Copyright issues
   o Size
   o Startup time
   o Emulator needs additional monitoring
What about OS system calls?
   o Return faked/fixed values
   o Don’t faithfully emulate some low-level stuff

Emulation Controller
When does emulation stop?
   o Can’t expect to run code to completion…
Use heuristics to decide when to stop
   o Number of instructions?
   o Amount of time?
   o Threshold on percent of instructions that modify memory?
   o “Stoppers”? E.g., assume virus wouldn’t write output before being malicious

Emulator: Extra Analyses
Post-emulation analysis
For example, look at histogram of instructions
   o Does it match typical polymorphic?
   o Does it match a metamorphic family?
Other examples of post-emulation analysis???

If at First You Don’t Succeed
Emulation controller may re-invoke emulator for the following reasons
   o Rerun with different CPU parameters
   o Test interrupt handlers
   o Test multiple possible entry points
   o Test for self-replication on “goat” files
   o Test untaken branches in code
   o Test “unused” memory locations

Emulator Optimizations
Improve performance, reduce size and/or complexity
   o Use the real file system (with caution)
   o “Data” files must be checked for malware, use lots of stoppers
   o Cache state --- if match is found to previous (non-virus) run, go to next file
Cache register values, size, stack pointer and contents, number of writes, checksums, etc.

Comparison of Techniques
Recall, the techniques considered…
1. Scanning
2. Static heuristics
3. Integrity check
4. Behavior blocker
5. Emulation

Comparison of Techniques
Scanning
Pros:
   o Precise ID of malware
Cons:
   o Requires up-to-date signatures
   o Cannot detect new/unknown malware

Comparison of Techniques
Static heuristics
Pros:
   o Detect known and unknown malware
Cons:
   o Detected malware not identified
   o False positives

Comparison of Techniques
Integrity check
Pros:
   o Can be efficient and fast
   o Detect known and unknown malware
Cons:
   o Detected after infection & not identified
   o Can’t detect in new/modified file
   o Heavy burden on users/admins

Comparison of Techniques
Behavior blocker
Pros:
   o Known and unknown malware detected
Cons:
   o Probably won’t identify malware
   o High overhead
   o False positives
   o Malware runs on system before detected

Comparison of Techniques
Emulation
Pros:
   o Known, unknown, polymorphic detection
   o Malware executed in “safe” environment
Cons:
   o Slow
   o Malware might outlast emulator
   o Might not provide identification

Detection: Bottom Line
Static analysis is fast
   o Good approach when it works
Dynamic analysis can “peel away a layer of obfuscation”
   o Dynamic analysis is relatively costly

Verification, Quarantine, Disinfect
What to do after virus detected?
1. Verify that it really is a virus
2. Quarantine infected code
3. Disinfect --- remove infection
These are done rarely, so can be slow and costly in comparison to detection

Verification
After detection comes verification
Why verify?
   o Secondary test needed due to short, general signature, or…
   o …no signature, due to detection method
Behavior, heuristic, emulation, etc.
   o Do not usually provide identification
Writer might try to make virus look like some other virus

Verification
How to verify?
“X-ray” the virus
If encrypted, decrypt it, or frequency analysis might suffice
   o Like simple substitution cipher
Extract info/stats, etc.
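
One way the x-ray idea could look in code, assuming (purely for illustration; the slides do not fix a cipher) that the virus body is XORed with a single unknown key byte. XORing adjacent ciphertext bytes cancels the key, so a known plaintext signature can be located without decrypting first. The names xor_deltas and xray_find are hypothetical.

```python
def xor_deltas(data: bytes) -> bytes:
    """XOR of each adjacent byte pair; a constant one-byte XOR key cancels out."""
    return bytes(a ^ b for a, b in zip(data, data[1:]))


def xray_find(ciphertext: bytes, signature: bytes):
    """Look for `signature` hidden under a single-byte XOR key.

    Returns (offset, key) for the first match, or None if there is none.
    """
    if len(signature) < 2:
        raise ValueError("signature needs at least 2 bytes")
    offset = xor_deltas(ciphertext).find(xor_deltas(signature))
    if offset < 0:
        return None
    key = ciphertext[offset] ^ signature[0]   # recover the key byte
    return offset, key


body = b"junk" + b"VIRUS-BODY" + b"tail"
encrypted = bytes(b ^ 0x5A for b in body)     # simulated encrypted virus
print(xray_find(encrypted, b"VIRUS-BODY"))
```

Once the key is recovered, the body can be decrypted and checked against a longer virus-specific signature or checksum.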
Verification
After x-ray analysis…
   o Longer virus-specific signatures
   o Checksum all or part of virus
   o Call special-purpose verification code
Note that these probably won’t work on (good) metamorphic code

Quarantine
Isolate detected virus from system
   o Then ask user if it’s OK to disinfect
   o Or do further analysis of virus
How to quarantine virus?
   o Copy to a “quarantine” directory?
   o Hide it in “invisible” location?
   o Encrypt it?

Disinfect
Disinfect == remove infection
Not always possible to return file to its original state
   o E.g., file might have been overwritten
Disinfection methods…
Delete the infected file
   o Pros and cons?

Disinfect
Disinfection methods…
Restore files from backup
   o Pros and cons?
Use virus-specific info
   o Info may be found automatically --- compare infected files with uninfected
   o E.g., appended virus changes start address, appends itself to file, etc.
   o Like a chosen plaintext attack

Disinfect
Disinfection methods…
Use virus-behavior specific info
   o E.g., prepended virus changes header
Save some info about files
   o Header info, for example
   o Then changed parts can be restored
   o Integrates well with integrity checker
   o Restore parts until checksum matches…

Disinfect
Disinfection methods…
Use the virus to disinfect
   o Stealth virus may give original code
Generic disinfection
   o Virus may restore code when executed
   o Might be dangerous to run virus code…
   o …emulation is a better strategy, maybe even disinfect as part of detection

Virus Databases
What to put in a virus database?
   o Name of virus?
   o Characteristics of virus?
   o Signatures?
   o Encrypted/hashed signatures?
   o Disinfection info?
   o Other info?

Virus Databases
How to update database/signatures?
   o Push or pull?
   o Automatic or manual?
   o How often to update?
   o How to distribute updates?
   o Distribute entire database or deltas?
Also must be able to update AV software

Virus Updates
Update process is a BIG target
   o AV’s machines that distribute updates
   o Insider attack at AV site
   o Trick user into getting “AV” from attacker
   o Man-in-the-middle attack on communications between user/AV

Virus Description Languages
AV vendors have specialized virus description languages
2 examples given in the book

Short Subjects
A few quick points…
   o Anti-stealth techniques
   o Macro viruses
   o Compiler optimizations and detection

Anti-Stealth Techniques
Recall, stealth viruses hide presence
Anti-stealth as part of AV?
   o Detect and disable stealth --- check that OS calls go to right place
   o Bypass usual OS features --- direct calls to BIOS, for example

Macro Virus Detection
Macro viruses tricky to detect
   o Macros are in source code
   o Easy to change source
   o Robust execution when errors occur
So, any changes can create new virus
AV might create a new virus
   o E.g., incomplete disinfection
Macro virus can infect other macros

Macro Viruses
One redeeming feature…
They operate in restricted domain
   o So easier to determine “normal”
   o Reduces number of false positives
Most/all are not parasitic
   o More like companion viruses
All the usual detection techniques can be applied

Macro Viruses: Disinfection
Delete all macros in infected document
Delete all associated macros
Delete macro if in doubt (heuristic)
Emulation to find all macros used by infected macro, and delete them
Basic idea?
   o Err on side of caution/deletion
Macro viruses not so common today

Compiler Optimization
Compilers use techniques similar to AV
“Optimizing compiler” for detection??
   o Constant propagation --- reduces variables
   o Dead code (executed, but not needed)
   o Polymorphics may have lots of dead code
If used, efficiency could be an issue
   o Compilers extensively studied
   o Bad cases well-known, so virus writers might take advantage of these
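
To illustrate the dead-code idea, here is a small Python sketch of a backward liveness pass that strips junk instructions from straight-line code. The (dest, text, sources) tuple format and the function name eliminate_dead_code are assumptions made for this example; it also assumes no branches, memory writes, or flag side effects, which real polymorphic code would not respect.

```python
def eliminate_dead_code(instructions, live_out):
    """Drop instructions whose result is never read (backward liveness pass).

    instructions: list of (dest, text, sources) for straight-line code, where
    dest is the register written and sources are the registers read.
    live_out: registers whose values matter after the last instruction.
    """
    live = set(live_out)
    kept = []
    for dest, text, sources in reversed(instructions):
        if dest in live:
            live.discard(dest)     # this write satisfies the later read
            live.update(sources)   # so its inputs must stay alive
            kept.append((dest, text, sources))
        # otherwise the value is never used afterwards: dead code, drop it
    kept.reverse()
    return kept


code = [
    ("ecx", "mov ecx, 5", ()),
    ("ebx", "mov ebx, 7", ()),            # junk: only feeds edx below
    ("eax", "lea eax, [ecx+1]", ("ecx",)),
    ("edx", "mov edx, ebx", ("ebx",)),    # junk: edx is never used
]
for _, text, _ in eliminate_dead_code(code, live_out={"eax"}):
    print(text)
```

Normalizing code this way before signature matching is one reason the efficiency concern above matters: the analysis runs on every scanned sample, and an attacker can craft input that hits its known bad cases.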