Simple Substitution Distance and Metamorphic Detection
Gayathri Shanmugam, Richard M. Low, Mark Stamp
Simple Substitution Distance 1

The Idea
Metamorphic malware “mutates” with each infection
Measuring software similarity is one method of detection
But, how to measure similarity?
  o Lots of relevant previous work
Here, an unusual and interesting distance measure is considered

Simple Substitution Distance
We treat each metamorphic copy as if it is an “encrypted” version of “base” virus
  o Where the cipher is a simple substitution
Why simple substitution? Why might this work?
  o Easy to work with, fast algorithm to solve
  o Simple substitution cryptanalysis gives results that match family statistics
  o Accounts for modifications to files similar to some common metamorphic techniques

Motivation
Given a simple substitution ciphertext where plaintext is English…
  o If we cryptanalyze using English language statistics, we expect a good score
  o If we cryptanalyze using, say, French language statistics, we expect a not-so-good score
We can obtain opcode statistics for a metamorphic family
  o Using simple substitution cryptanalysis, a virus of same family should score well…
  o …but, a benign exe should not score as well
  o Assuming statistics of these families differ

Metamorphic Techniques
Many possible morphing strategies
Here, briefly consider
  o Register swapping
  o Garbage code insertion
  o Equivalent substitution
  o Transposition
  o Formal grammar mutation
At a high level --- substitution, transposition, insertion, and deletion

Register Swap
Register swapping
  o E.g., replace EBX register with EAX, provided EAX not in use
Very simple and used in some of first metamorphic malware
Not very effective
  o Why not?
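
One way to see why register swapping is weak is sketched below (an illustration, not taken from the slides). Swapping registers changes only operands, so the opcode sequence, and any statistic computed from it, stays the same. The helper names `opcodes` and `swap_registers` and the three-instruction listing are hypothetical.

```python
def opcodes(listing):
    """Return the opcode (mnemonic) of each instruction, dropping the operands."""
    return [line.split()[0].upper() for line in listing]

def swap_registers(listing, a, b):
    """Exchange register names a and b in every instruction (plain text replace)."""
    tmp = "\0"  # placeholder that never occurs in assembly text
    return [line.replace(a, tmp).replace(b, a).replace(tmp, b) for line in listing]

# Hypothetical code fragment; EAX is not in use, so EBX can be replaced by EAX.
original = ["MOV EBX,1", "ADD EBX,ECX", "PUSH EBX"]
morphed = swap_registers(original, "EBX", "EAX")

print(morphed)                                # operands differ from the original
print(opcodes(original) == opcodes(morphed))  # opcode sequence is unchanged
```

A detector that looks only at opcodes, or a signature with wildcards in the operand positions, is therefore unaffected by this kind of morphing.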

Garbage Insertion
Garbage code insertion
Two cases:
  o Dead code --- inserted, but not executed
      We can simply JMP over dead code
  o Do-nothing instructions --- executed, but have no effect on program
      Like NOP or ADD EAX,0
Relatively easy to implement
Effective at breaking signatures
Changes the opcode statistics

Code Substitution
Equivalent instruction substitution
  o For example, can replace SUB EAX,EAX with XOR EAX,EAX
Does not need to be 1 for 1 substitution
  o That is, can also include insertion/deletion
Unlimited number of substitutions
  o And can be very effective
Somewhat difficult to implement

Transposition
Transposition
  o Reorder instructions that have no dependency
For example, the sequence MOV R1,R2 / ADD R3,R4 can be reordered as ADD R3,R4 / MOV R1,R2
Can be highly effective
But, can be difficult to implement
  o Sometimes applied only to subroutines

Formal Grammar Mutation
Formal grammar mutation
View morphing engine as nondeterministic automaton
  o Allow transitions between any symbols
  o Apply formal grammar rules
Obtain many variants, high variation
Really just a formalization of other approaches, not a separate technique

Previous Work
Easy to prove that “good” metamorphic code is immune to signature detection
  o Why?
But, many successes detecting hacker-produced metamorphic malware…
  o HMM/PHMM/machine learning
  o Graph-based techniques
  o Statistics (chi-squared, naïve Bayes)
  o Structural entropy
  o Linear algebraic techniques

Topic of This Research
Measure similarity using simple substitution distance
We “decrypt” suspect file using statistics from a metamorphic family
  o If decryption is good, we classify it as a member of the same metamorphic family
  o If decryption is poor, we classify it as NOT a member of the given family

Simple Substitution Cipher
Simple substitution is one of the oldest and simplest means of encryption
A fixed key used to substitute letters
  o For example, Caesar’s cipher, substitute letter 3 positions ahead in alphabet
  o In general, any permutation can be key
Simple substitution cryptanalysis?
  o Statistical analysis of ciphertext

Simple Substitution Cryptanalysis
Suppose you observe the ciphertext
  PBFPVYFBQXZTYFPBFEQJHDXXQVAPTPQJKTOYQWIPBVWLXTOXBTFXQW
  AXBVCXQWAXFQJVWLEQNTOZQGGQLFXQWAKVWLXQWAEBIPBFXFQVX
  GTVJVWLBTPQWAEBFPBFHCVLXBQUFEVWLXGDPEQVPQGVPPBFTIXPFHXZH
  VFAGFOTHFEFBQUFTDHZBQPOTHXTYFTODXQHFTDPTOGHFQPBQWAQJJ
  TODXQHFOQPWTBDHHIXQVAPBFZQHCFWPFHPBFIPBQWKFABVYYDZBOT
  HPBQPQJTQOTOGHFQAPBFEQJHDXXQVAVXEBQPEFZBVFOJIWFFACFCCF
  HQWAUVWFLQHGFXVAFXQHFUFHILTTAVWAFFAWTEVOITDHFHFQAITIX
  PFHXAFQHEFZQWGFLVWPTOFFA
Analyze frequency counts…
Likely that ciphertext “F” represents “E”
  o And so on, at least for common letters

Simple Substitution Cryptanalysis
Can automate the cryptanalysis
  1. Make initial guess for key using frequency counts
  2. Compute oldScore
  3. Modify key by swapping adjacent elements
  4. Compute newScore
  5. If newScore > oldScore, let oldScore = newScore
  6. Else unswap key elements
  7. Goto 3
How to compute score?
  o Number of dictionary words in putative plaintext?
  o Much better to use English digraph statistics

Jakobsen’s Algorithm
Method on previous slide can be slow
  o Why?
Jakobsen’s algorithm uses similar idea, but fast and efficient
  o Ciphertext is only decrypted once
  o So algorithm is (essentially) independent of length of message
  o Then, only matrix manipulations required

Jakobsen’s Algorithm: Swapping
Assume plaintext is English, 26 letters
Let K = k1,k2,k3,…,k26 be putative key
Then we swap elements as follows
  o And let “|” represent “swap”
Restart this swapping from the beginning whenever the score improves

Jakobsen’s Algorithm: Swapping
Minimum swaps is 26 choose 2, or 325
Maximum is unbounded
Each swap requires a score computation
Average number of swaps, experimentally:
  o Ciphertext of length 500, average 1050 swaps
  o Ciphertext of length 8000, average just 630 swaps
So, work depends on length of ciphertext
  o More ciphertext, better scores, fewer swaps

Jakobsen’s Algorithm: Scoring
Let D = {dij} be digraph distribution corresponding to putative key K
Let E = {eij} be digraph distribution of English language
These matrices are 26 x 26
Compute score as

Jakobsen’s Algorithm
So far, nothing fancy here
  o Could see all of this in a CS 265 assignment
Jakobsen’s trick: Determine new D matrix from old D without decrypting
How to do so?
  o It turns out that swapping elements of K swaps corresponding rows and columns of D
See example on next slides…

Swapping Example
To simplify, suppose 10 letter alphabet
  E, T, A, O, I, N, S, R, H, D
Suppose you are given the ciphertext
  TNDEODRHISOADDRTEDOAHENSINEOAR
  DTTDTINDDRNEDNTTTDDISRETEEEEEAA
Frequency counts given by

Swapping Example
We choose the putative key K given here
The corresponding putative plaintext is
  AOETRENDSHRIEENATE
  RIDTOHSOTRINEAAEAS
  OEENOTEOAAAEESHNA
  TTTTTII
Corresponding digraph distribution D is

Swapping Example
Suppose we swap first 2 elements of K
  Previous key K
  New key K
Then decrypt using new K
And compute digraph matrix for new K

Swapping Example
Old D matrix vs new D matrix
What do you notice?
So what’s the point here?
This is good!

Jakobsen’s Algorithm

Proposed Similarity Score
Extract opcode sequences from collection of (family) viruses
  o All viruses from same metamorphic family
Determine n most common opcodes
  o Symbol n+1 used for all “other” opcodes
Use resulting digraph statistics to form matrix E = {eij}
  o Note that matrix is (n+1) x (n+1)

Scoring a File
Given an executable we want to score…
Extract its opcode sequence
Use opcode digraph stats to get D = {dij}
  o This matrix also (n+1) x (n+1)
Initial “key” K chosen to match monograph stats of virus family
  o Most frequent opcode in exe maps to most frequent opcode in virus family, etc.
Score based on distance between D and E
  o “Decrypt” D and score how closely it matches E
  o Jakobsen’s algorithm used for “decryption”

Example
Suppose only 5 common opcodes in family viruses (in descending frequency)
Extract following sequence from an exe
Initial “key” is
And “decrypt” is

Example
Given “decrypt”
Form D matrix
After swap
  o And so on…

Scoring Algorithm

Quantifying Success
Consider these 2 scatterplots of scores
Which is better (and why)?

ROC Curves
Plot true-positive rate vs false-positive rate
  o As “threshold” varies
Curve nearer 45-degree line is bad
Curve nearer upper-left is better

ROC Curves
Use ROC curves to quantify success
Area under the ROC curve (AUC)
  o Probability that randomly chosen positive instance scores higher than a randomly chosen negative instance
AUC of 1.0 implies ideal detection
AUC of 0.5 means classification is no better than flipping a coin

Parameter Selection
Tested the following parameters
  o Opcode matrix size
  o Scoring function
  o Normalization
  o Swapping strategy
None significant, except matrix size
  o So we only give results for matrix size

Opcode Matrix Size
Obtained following results
So, ironically, we use 26 x 26 matrix

Test Data
Tested the following metamorphic families
  o G2 --- known to be weak
  o NGVCK --- highly metamorphic
  o MWOR --- highly metamorphic and stealthy
      MWOR “padding ratios” of 0.5 to 4.0
For G2 and NGVCK
  o 50 files tested, Cygwin utilities for benign files
For each MWOR padding ratio
  o 100 files tested, Linux utilities for benign files
5-fold cross validation in each experiment

NGVCK and G2 Graphs

MWOR Score Graphs

MWOR ROC Curves
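
The scoring procedure and the AUC evaluation described in the slides above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The score is assumed to be the sum of absolute differences between D and E, as in Jakobsen's paper, so lower means a better match. The swap schedule (swap elements a and a+b, stepping a and then b, restarting after each improvement) is also assumed from that paper. The initial key is applied by indexing each file's opcodes by their own frequency rank. All function names and the toy opcode sequences are hypothetical.

```python
from collections import Counter
import numpy as np

def rank_symbols(opcodes, n):
    """Replace each opcode by its frequency rank; ranks >= n share the "other" symbol n."""
    ranked = [op for op, _ in Counter(opcodes).most_common()]
    rank = {op: min(r, n) for r, op in enumerate(ranked)}
    return [rank[op] for op in opcodes]

def digraph_matrix(symbols, size):
    """Relative frequency of each adjacent symbol pair, as a size x size matrix."""
    M = np.zeros((size, size))
    for a, b in zip(symbols, symbols[1:]):
        M[a, b] += 1
    total = M.sum()
    return M / total if total else M

def swap_key(D, i, j):
    """Jakobsen's trick: swapping key elements i and j swaps rows and columns i, j of D."""
    D = D.copy()
    D[[i, j], :] = D[[j, i], :]
    D[:, [i, j]] = D[:, [j, i]]
    return D

def distance(D, E):
    """Assumed score: sum of absolute differences (lower means a better match)."""
    return float(np.abs(D - E).sum())

def substitution_distance(D, E):
    """Hill climb over key swaps, restarting the swap schedule after each improvement."""
    n = D.shape[0]
    best = distance(D, E)
    a = b = 1
    while b < n:
        candidate = swap_key(D, a - 1, a + b - 1)
        d = distance(candidate, E)
        if d < best:
            D, best = candidate, d
            a = b = 1
        else:
            a += 1
            if a + b > n:
                a, b = 1, b + 1
    return best

def auc(positives, negatives):
    """Probability a random positive outscores a random negative (ties count half)."""
    wins = sum((p > q) + 0.5 * (p == q) for p in positives for q in negatives)
    return wins / (len(positives) * len(negatives))

# Toy data (made up): a family pattern, an opcode-renamed copy, and an unrelated file.
n = 3
family = ["mov", "add", "mov", "push", "mov", "add"] * 20
variant = ["xor", "sub", "xor", "pop", "xor", "sub"] * 20
benign = ["push", "push", "mov", "push", "push", "add"] * 20

E = digraph_matrix(rank_symbols(family, n), n + 1)
score = lambda seq: substitution_distance(digraph_matrix(rank_symbols(seq, n), n + 1), E)
print(score(variant), score(benign))
# Distances are lower-is-better, so negate them before computing AUC.
print(auc([-score(variant)], [-score(benign)]))
```

In this toy setup the renamed copy has exactly the same digraph structure as the family, so its distance is zero, while the unrelated file keeps a nonzero distance under every key. Real families would use many training files for E and a larger n, such as the 26 x 26 matrix chosen in the slides.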

MWOR AUC Statistics

Efficiency

Conclusions
+ Simple substitution score gives good results for challenging metamorphics
+ Scoring is fast and efficient
+ Applicable to other types of malware
- Requires opcodes

Related Work
Recently, we generalized Jakobsen’s algorithm to “combination” cipher
Simple substitution column transposition (SSCT)
Uses multiple D matrices
  o One D matrix for each column
  o Enables easy column manipulations
  o Overall, fast and effective SSCT attack

SSCT
SSCT for malware detection
This might be stronger malware score
  o Why?
Finding good test data is an issue
  o Can we find/make data where SSCT outperforms simple substitution score?
Currently studying this problem

Homophonic Substitution
Homophonic sub. allows more than one ciphertext symbol for each plaintext symbol
  o Easy to encrypt, but harder to break than simple substitution --- why?
Previous student developed Jakobsen-like algorithm for homophonic sub.
  o Uses a nested hill climb approach
This could be tested on malware

HMM
A different way to attack simple substitution ciphers?
Train an HMM (of course!)
  o Let A be 26 x 26, English digraph stats
  o Then train, without updating A matrix
  o Resulting B matrix is the key
  o Can work for homophonic case too
Any problems with this?

HMM with Random Restarts
HMM requires lots of data to converge
Often, we don’t have lots of data
In such cases, try random restarts
  o HMM should converge with less data if we start closer to the solution
  o Try enough random restarts, might start close enough to converge
How many random restarts?

HMM with Random Restarts
Could be applied to malware detection
  o However, slow and expensive
More relevant for cryptanalysis
Zodiac 340 cipher, for example
  o This has previously been analyzed using millions of random restarts

References
G. Shanmugam, R.M. Low, and M. Stamp, Simple substitution distance and metamorphic detection, Journal of Computer Virology and Hacking Techniques, 9(3):159-170, 2013
A. Dhavare, R.M. Low, and M. Stamp, Efficient cryptanalysis of homophonic substitution ciphers, Cryptologia, 37(3):250-281, 2013
T. Berg-Kirkpatrick and D. Klein, Decipherment with a million random restarts, http://www.cs.berkeley.edu/~tberg/papers/emnlp2013.pdf