Abstract of thesis presented to the Senate of Universiti Putra Malaysia in fulfilment of the requirement for the degree of Doctor of Philosophy

EFFECTIVE METAMORPHIC MALWARE DETECTION WITH STRUCTURAL FEATURES AND NONNEGATIVE MATRIX FACTORIZATION

By

LING YEONG TYNG

Month and 2021

Chairman: Associate Professor Nor Fazlida Mohd Sani, Ph.D.
Faculty: Computer Science and Information Technology

Metamorphic malware is well known for evading signature-based detection. It uses code obfuscation techniques to disguise its malicious behavior and to modify its own syntax or structure. It can therefore generate new variants within a malware family at each propagation, which makes detection difficult. Although several approaches have been proposed over the years, their results have been less than ideal for metamorphic malware detection. In addition, the tools used to extract features from a binary file are platform dependent, which restricts their scope. Moreover, the number of file features is large, which increases the workload of the detection system and delays its response.

To overcome these issues, this research proposes a framework for metamorphic malware detection that consists of three main parts. The first part proposes Nonnegative Matrix Factorization, a spectral-based feature reduction method, for metamorphic malware detection. The second part proposes five alternative feature representations of the raw bytes of a binary file: compression ratio, entropy, Jaccard similarity coefficient on hexadecimal bytes, Jaccard similarity coefficient on integer bytes, and the Chi-square statistic test. These representations reduce the prior knowledge required during the feature engineering step while improving detection results.
The third part applies Nonnegative Matrix Factorization and the new feature representations to structural similarity-based detection, machine learning-based detection using Random Forest with Conditional Inference Tree, and Hidden Markov Model-based detection. Because the proposed approach uses the raw bytes of executable files regardless of file format, it can be applied flexibly on other platforms.

Experimental evaluation of the proposed approach on existing datasets achieved satisfactory results. In structural similarity analysis, the accuracy ranged from 95% to 100% when Nonnegative Matrix Factorization with a low rank of 1 or 2 was applied to the compression ratio and entropy features. The Random Forest classifier showed the efficiency of the proposed approach, with an accuracy of 98% to 99% for metamorphic malware families. The Hidden Markov Model, using the entropy feature representation, achieved an accuracy of 96% to 99%. Based on these results, this study demonstrates the effectiveness of non-entropic feature representation with machine learning algorithms for metamorphic malware detection.
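The abstract does not give implementation details, so the following is only a minimal sketch of how two of the raw-byte representations (entropy and compression ratio) could be computed and then reduced with a low-rank Nonnegative Matrix Factorization. The fixed 256-byte windows, the use of zlib as the compressor, the multiplicative-update solver, and the names window_features, matmul, and nmf are all assumptions made for illustration and are not taken from the thesis.

```python
import math
import random
import zlib
from collections import Counter


def window_features(data: bytes, window: int = 256):
    """One [entropy, compression_ratio] row per non-overlapping window.

    The windowing scheme is an assumption for this sketch.
    """
    rows = []
    for start in range(0, len(data) - window + 1, window):
        chunk = data[start:start + window]
        probs = [count / window for count in Counter(chunk).values()]
        entropy = sum(-p * math.log2(p) for p in probs)  # bits per byte, 0..8
        ratio = len(zlib.compress(chunk)) / window  # compressed / original size
        rows.append([entropy, ratio])
    return rows


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def nmf(v, rank, iters=200, seed=0, eps=1e-9):
    """Factor nonnegative v (n x m) as w (n x rank) times h (rank x m).

    Uses the standard multiplicative updates for squared error; this solver
    is a stand-in and may differ from the one used in the thesis.
    """
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + eps for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + eps for _ in range(m)] for _ in range(rank)]
    for _ in range(iters):
        wt = [list(col) for col in zip(*w)]
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(rank)]
        ht = [list(col) for col in zip(*h)]
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(rank)]
             for i in range(n)]
    return w, h


# Synthetic input: four high-entropy windows followed by four all-zero windows.
sample = bytes(range(256)) * 4 + bytes(1024)
features = window_features(sample)
w, h = nmf(features, rank=2)  # each row of w is a reduced feature vector
```

Both features are nonnegative by construction, which is what makes them valid input for the factorization. In this sketch the rows of w would be the reduced representation passed to a downstream detector, and the rank of 2 mirrors the low ranks reported above.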