Exposing Cryptographic Operations from Malware Binaries using Dynamic Information Analysis and Avalanche Effect

Anitha Abraham
Final Year M.Tech Student, School of Computer Sciences, Mahatma Gandhi University

Varghese Paul
Associate Professor, Department of IT, CUSAT

Chinju.K
Assistant Professor, School of Computer Sciences, Mahatma Gandhi University

Abstract—Malware is often used to disrupt computer operation, gather sensitive information, or gain access to private computer systems, and it is becoming increasingly stealthy. Most modern malware uses cryptographic algorithms to protect itself from analysis. The use of cryptographic algorithms and truly transient cryptographic secrets inside a malware binary is a key obstacle to effective malware analysis and defense. The proposed malware analysis framework automatically identifies and recovers the cryptographic operations and transient secrets from the execution of binary executables. The framework is based on the avalanche effect, a desirable property of cryptographic functions: a slight change in the input causes a significant change in the output, i.e., flipping a single input bit changes about half of the output bits. The avalanche effect also makes it possible to pinpoint the location, size and boundary of both the input and output buffers. Using it, the framework can accurately pinpoint the boundary of each cryptographic operation and recover truly transient cryptographic secrets that exist in memory only for an instant between multiple nested cryptographic operations.

Keywords—Binary analysis, Cryptographic operations, Key recovery, Transient secret, Avalanche effect.

I. INTRODUCTION

Malware analysis and reverse engineering seek to understand the inner workings of malware, and that understanding is invaluable in defending against it.

Most modern malware uses cryptographic algorithms to protect itself from analysis. To prevent its cryptographic secrets from being recovered by key-searching tools, modern malware can make those secrets truly transient in memory by encrypting or destroying them right after they are used at run-time. The use of cryptographic algorithms and truly transient cryptographic secrets inside a malware binary executable is a key obstacle to effective malware analysis and defense.

Several methods [2], [3] have been proposed to automatically recover cryptographic keys from process memory or from a file. However, these methods require the keys to be stored in plaintext form, so they are not effective when the keys are stored encrypted or are transient.

More recently, several binary analysis approaches [5], [6], [7] have been proposed. These approaches can automatically detect the existence of cryptographic operations in a given binary execution, and they rely on instruction profiling and on signatures of particular cryptographic implementations. However, they are not effective in the presence of multiple rounds of cryptographic operations.

To recover transient cryptographic secrets, one needs to identify reliably when and where those secrets will be in memory. This in turn requires accurately pinpointing the boundary of each of the multiple rounds of cryptographic operations. Existing binary analysis approaches cannot accurately pinpoint these boundaries, and they also fail to recover truly transient cryptographic secrets from the execution of a given binary executable.
The proposed malware binary analysis framework can accurately pinpoint the boundary of each individual cryptographic operation within multiple rounds of cryptographic operations. It can also recover truly transient secrets from the execution of a binary executable. It is built upon the avalanche effect, a desirable property of cryptographic functions: a one-bit change in the input or key causes a significant change in the output.

II. LITERATURE REVIEW

A literature survey was conducted to identify related research in this area. Summaries of the most relevant works are given below.

A. ReFormat: Automatic Reverse Engineering of Encrypted Messages

ReFormat is a system that derives the message format even when the message is encrypted. An encrypted input message is assumed to go through two phases: message decryption and normal protocol processing. Data lifetime analysis of run-time buffers can pinpoint the memory locations that contain the decrypted message generated in the first phase and accessed later in the second phase. ReFormat is designed for applications with a single boundary between decryption and normal protocol processing, although multiple such boundaries may exist in practice. It was also not designed to identify the buffers that hold the unencoded data before encoding. It is therefore not effective in the presence of multiple rounds of cryptographic operations.

B. Dispatcher: Enabling Active Botnet Infiltration using Automatic Protocol Reverse Engineering

Dispatcher extracts the format of protocol messages sent by an application. It makes a forward pass over the input execution trace and replicates the call stack of the application. It then computes the read set to identify the buffers that hold the unencrypted data before encryption. The read set is the set of locations read inside the encryption routine before being written.
Dispatcher also computes the write set to identify the buffers that hold the unencrypted data after decryption. The write set is the set of locations written inside the decryption routine and read later in the trace. However, this method is not effective in the presence of multiple rounds of cryptographic operations.

C. Automated Identification of Cryptographic Primitives in Binary Programs

This system identifies the cryptographic primitives within a given binary program in an automated way. It works in two stages. The first stage performs fine-grained dynamic binary instrumentation. The second stage applies several heuristics, each characterizing a unique aspect of cryptographic code, to the data collected by the first stage. These methods can successfully extract cryptographic keys from a given malware binary, but they are not effective in the presence of multiple nested cryptographic operations.

III. MALWARE ANALYSIS FRAMEWORK

A. Goals and Assumptions

The goal is to uncover the cryptographic operations and their secrets from the execution of a binary executable. Specifically, the framework determines whether the execution contains any cryptographic operation. If it does, the framework pinpoints the location of every cryptographic function, its mode, and the order of execution. It also pinpoints the location, size and boundary of the input and output buffers used by each identified cryptographic function, and determines exactly when the input and output of each function will be in which buffer. This makes it possible to recover the truly transient input and output of each cryptographic operation, which are destroyed or re-encrypted immediately after run-time use. The framework can also determine whether a key is used in each cryptographic operation.
If a key is present, the framework can recover it even if the key is destroyed right after run-time use.

The framework is assumed to be able to monitor the execution of the binary executable. It is also assumed that the input and the output of any called cryptographic function reside in a contiguous memory buffer at run-time.

B. The Principle of Malware Analysis Framework

The framework is designed upon the avalanche effect, a desirable property of cryptographic algorithms: a slight change in the input causes significant changes in the output. Cryptographic functions are designed to exhibit this effect, whereas code that performs no cryptographic operation generally does not. The avalanche effect is therefore a fairly unique and defining characteristic of all good cryptographic functions, which makes it possible to identify cryptographic operations in executables reliably.

The avalanche effect also makes it possible to pinpoint the location, size and boundary of both the input and output buffers. Changing different bits in the input buffer changes different bits in the output buffer, and the changed bits stay within the fixed output buffer. Let y ≥ 1 be the number of bytes of the output of the cryptographic function, so that the output has 8y bits. Because of the avalanche effect, any single-bit change in the input changes about 4y bits of the output on average. Those changed output bits are said to be "touched" by the bit change in the input.

If the locations of buffers p and q are known, the impact of buffer p on buffer q can be measured by repeatedly changing different bits in buffer p and comparing the corresponding results in buffer q. Since the locations of the input and output buffers are unknown in advance, this method is not practical. In addition, different runs of an executable may generate different outputs from the same input.
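The repeated-run measurement described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the framework's implementation: the helper names (`flip_bit`, `hamming_distance`, `average_changed_bits`, `byte_copy`) are hypothetical, SHA-1 from Python's `hashlib` stands in for a cryptographic function with y = 20 output bytes, and a plain byte copy stands in for non-cryptographic code.

```python
import hashlib


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of data with exactly one bit flipped."""
    buf = bytearray(data)
    buf[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(buf)


def hamming_distance(a: bytes, b: bytes) -> int:
    """Count the bit positions in which two equal-length buffers differ."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


def average_changed_bits(func, data: bytes) -> float:
    """Flip every input bit in turn (one run of func per flip) and
    average the number of output bits that change."""
    baseline = func(data)
    n_bits = len(data) * 8
    total = sum(
        hamming_distance(baseline, func(flip_bit(data, i)))
        for i in range(n_bits)
    )
    return total / n_bits


def sha1(data: bytes) -> bytes:
    """Cryptographic stand-in: 20-byte SHA-1 digest."""
    return hashlib.sha1(data).digest()


def byte_copy(data: bytes) -> bytes:
    """Non-cryptographic stand-in: copy the input into a 20-byte buffer."""
    return data[:20].ljust(20, b"\0")


y = 20  # output size in bytes for both stand-ins
message = b"sixteen byte msg"

avg_sha1 = average_changed_bits(sha1, message)
avg_copy = average_changed_bits(byte_copy, message)

print(f"SHA-1:     {avg_sha1:.1f} of {8 * y} output bits change per flipped input bit")
print(f"byte copy: {avg_copy:.1f} of {8 * y} output bits change per flipped input bit")
```

For the hash, the average should land close to 4y = 80 bits, while the byte copy changes a single output bit per flipped input bit. Note that this approach needs one run per flipped bit and already assumes the buffer locations are known, which is exactly the limitation pointed out above.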
Because repeated runs are impractical and may not be reproducible, it is desirable to measure the impact between any two chosen buffers with only one run of a given binary executable. If any bit in memory impacts any other bit in memory during the binary execution, there must be information flow between those two bits. If the information flow from a given source can be tracked, the impact between memory buffers can be measured and the avalanche effect detected with a single run of the binary executable.

C. Overall Architecture

Fig. 1 Overall Architecture

The framework is built upon dynamic binary analysis (DBA). Figure 1 shows its overall architecture. The framework dynamically intercepts and instruments the run-time instructions of the binary executable. It collects run-time information about the binary's execution, and it tracks and records the address and value of the bytes involved in taint propagation. It then analyzes the recorded run-time information and checks the patterns of instruction execution and memory access for any avalanche effect. Based on the detected avalanche effect patterns, it identifies cryptographic operations and determines the exact location, size and boundary of the input buffer, the output buffer and any key buffer involved in each identified operation.

D. Avalanche Effect Identification

If the taint sources are known and the information flow can be tracked across cryptographic functions, it is possible to determine whether any part of the information flow from a taint source exhibits the avalanche effect. The major challenge is that the location and size of the input buffer of the cryptographic operation are unknown. To identify the input buffer, every memory region and input source that could contain the input buffer of the cryptographic operation must be tainted. The monitored execution of the binary must then be checked periodically for the presence of the avalanche effect.
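The idea of detecting the avalanche effect from tracked information flow in a single run can be illustrated with a toy byte-level taint model. This is only a sketch under stated assumptions: it is not Taintgrind's bit-level propagation and not the framework's implementation. All names (`TByte`, `source`, `combine`, `toy_mix`, `toy_mask`, `full_taint_fraction`) are hypothetical, and `toy_mix` is a deliberately simple mixing routine, not a real cipher.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class TByte:
    """A byte value together with the set of input positions it depends on."""
    value: int
    taint: FrozenSet[int]


def source(data: bytes) -> List[TByte]:
    """Taint source: input byte i is labelled with its own position i."""
    return [TByte(b, frozenset({i})) for i, b in enumerate(data)]


def combine(a: TByte, b: TByte) -> TByte:
    """Any value-creating operation propagates the union of operand taints."""
    return TByte((a.value * 31 + b.value + 7) & 0xFF, a.taint | b.taint)


def toy_mix(inp: List[TByte]) -> List[TByte]:
    """Toy stand-in for a mixing routine: every output byte depends on an
    accumulator that has absorbed every input byte."""
    acc = TByte(0, frozenset())
    for b in inp:
        acc = combine(acc, b)
    return [combine(acc, b) for b in inp]


def toy_mask(inp: List[TByte]) -> List[TByte]:
    """Toy non-mixing routine: output byte i depends only on input byte i."""
    key = TByte(0x5A, frozenset())  # untainted constant
    return [TByte(b.value ^ key.value, b.taint | key.taint) for b in inp]


def full_taint_fraction(n_inputs: int, out: List[TByte]) -> float:
    """Fraction of output bytes that are tainted by every input byte."""
    all_labels = frozenset(range(n_inputs))
    hits = sum(1 for b in out if all_labels <= b.taint)
    return hits / len(out)


data = source(b"secret!!")
mix_fraction = full_taint_fraction(len(data), toy_mix(data))
mask_fraction = full_taint_fraction(len(data), toy_mask(data))

print("mixing routine:     ", mix_fraction)
print("byte-wise masking:  ", mask_fraction)
```

In this model a single execution is enough: the taint sets recorded alongside each value show whether every output byte was influenced by every input byte, without re-running the routine with modified inputs.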
The execution history is viewed as a sequence of routine invocations ordered by their completion time. For each routine invocation, the framework constructs a set containing the tainted buffers that were updated during that invocation or are still alive when the invocation completes. The collected run-time information includes a snapshot of the values of the tainted buffers. Because these snapshots are recorded, the framework can detect cryptographic operations even if the cryptographic input or key is destroyed right after use inside a function invocation.

For each set, every contiguous buffer in it is taken as a potential input buffer, and every contiguous buffer in the remaining sets as a potential output buffer. If there is an avalanche effect from one buffer to another, then every byte in the first buffer taints every byte in the second buffer. A metric named the taint contribution rate is used to measure the avalanche effect between buffers quantitatively. It is denoted C(buffer1, x, buffer2, y) and is defined as the portion of the first y bytes of buffer2 that are tainted by every byte of the first x bytes of buffer1. This metric is 100% if and only if there is an avalanche effect from the first x bytes of buffer1 to the first y bytes of buffer2. By trying different values of the four parameters, one can eventually find the values that make C(buffer1, x, buffer2, y) ≈ 100%. This makes it possible to pinpoint the location, size and boundary of both the input and the output buffers of cryptographic functions.

While searching for the avalanche effect, only the bytes that have a taint relationship need to be considered and aggregated. Owing to the nature of the avalanche effect, the exact range of the input buffer and output buffer can be discovered even if parts of the buffers are not involved in the cryptographic operation.

E. Empirical Evaluation

A prototype has been developed using the dynamic instrumentation framework Valgrind on Linux. Dynamic information analysis is done with the tool Taintgrind. To evaluate the framework's ability to detect cryptographic operations, a test program was designed that reads the plaintext from a disk file and processes it with multiple rounds of block cipher encryption and decryption followed by a final hash. The test program performs, in order, Blowfish encryption, AES encryption, AES decryption, Blowfish decryption and a SHA-1 hash. The two block ciphers use different secret keys. The plaintext is the input to the Blowfish encryption, and each subsequent step takes the output of the previous step as its input.

Taintgrind performs bit-level dynamic taint analysis. Taint propagation is computed for each value-creating memory or register operation. Taintgrind traces the flow of tainted, or marked, input data through an application during execution and logs the traversal of conditional jumps and system calls. Bit-precise taint propagation allows taintedness to propagate into bitfields and bit arrays, giving a more accurate view of the impact that input has on an application's execution.

Table 3.1 Information collected from test program

Table 3.2 Information collected by the proposed framework

Table 3.1 shows the information about each block cipher operation and the SHA-1 hash collected by the test program, which is used as the ground truth for validating the framework's analysis results. Table 3.2 shows the analysis output of the proposed malware binary analysis framework. The run-time execution information of each block cipher operation includes the entry point address of the routine that is the first to see the complete output buffer of the cryptographic operation at the end of its execution. By checking the symbol table of OpenSSL's crypto library, the routine symbol names that correspond to these run-time addresses can be found.
The run-time information reported by the framework matches the ground truth, which validates the accuracy of the framework's analysis results.

IV. CONCLUSION

This framework is based on the avalanche effect, the defining characteristic of all good cryptographic algorithms. It has been shown to detect block cipher and hash operations and to pinpoint exactly when and where the cryptographic input, output and key will be in memory, even if they exist only for a few microseconds. If the execution can be monitored, current software implementations achieve hardly any secrecy.

REFERENCES

[1] X. Li, X. Wang, and W. Chang. CipherXRay: Exposing Cryptographic Operations and Transient Secrets from Monitored Binary Execution. IEEE Transactions on Dependable and Secure Computing, vol. 11, no. 2, March/April 2014.

[2] A. Shamir and N. van Someren. Playing Hide and Seek with Stored Keys. In Proceedings of the Third International Conference on Financial Cryptography (FC 1999), pages 118–124, February 1999.

[3] T. Pettersson. Cryptographic Key Recovery from Linux Memory Dumps. In Presentation, Chaos Communication Camp, August 2007.

[4] J. A. Halderman, S. D. Schoen, N. Heninger, W. Clarkson, W. Paul, J. A. Calandrino, A. J. Feldman, J. Appelbaum, and E. W. Felten. Lest We Remember: Cold Boot Attacks on Encryption Keys. In Proceedings of the 17th USENIX Security Symposium, pages 45–60. USENIX, August 2008.

[5] Z. Wang, X. Jiang, W. Cui, X. Wang, and M. Grace. ReFormat: Automatic Reverse Engineering of Encrypted Messages. In Proceedings of the 14th European Symposium on Research in Computer Security (ESORICS 2009), pages 200–215, September 2009.

[6] F. Gröbert, C. Willems, and T. Holz. Automated Identification of Cryptographic Primitives in Binary Programs. In Proceedings of the 14th International Symposium on Recent Advances in Intrusion Detection (RAID 2011), September 2011.

[7] J. Caballero, P. Poosankam, C. Kreibich, and D. Song. Dispatcher: Enabling Active Botnet Infiltration using Automatic Protocol Reverse Engineering. In Proceedings of the 16th ACM Conference on Computer and Communications Security (CCS 2009), pages 621–634. ACM, October 2009.