Learning to Detect and Classify Malicious Executables in the Wild


Reporter: 林佳宜

Email: M98570015@mail.ntou.edu.tw

2020/4/11


References

J. Zico Kolter and Marcus A. Maloof. Learning to Detect and Classify Malicious Executables in the Wild. JMLR, 2006.


Outline

Introduction
Classification Methodology
Experimental Design
Experimental Results
Conclusion


Introduction

Malicious code can
  • cause harm or subvert the system’s intended function

Malicious executables fall into three categories
  • viruses, worms, and Trojan horses

The paper describes the use of machine learning and data mining to
  • detect and classify malicious executables


Three main contributions

Detect and classify malicious executables
  • using techniques from text classification

Present empirical results
  • from an extensive study of inductive methods for detecting and classifying malicious executables

Show that the methods achieve high detection rates
  • even on completely new, previously unseen malicious executables


Several learning methods

Implemented in the Waikato Environment for Knowledge Analysis (WEKA)
  • IBk (instance-based, k-nearest neighbors)
  • naive Bayes
  • support vector machine (SVM)
  • J48 (decision tree)

Used the AdaBoost.M1 algorithm to
  • boost SVMs, J48, and naive Bayes
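As a rough illustration of what boosting does, the following is a minimal sketch of AdaBoost.M1 for two classes. It is not the WEKA implementation used in the study. The base learner here is a hypothetical one-feature decision stump over Boolean attributes, chosen only to keep the example short, and all function names are illustrative.

```python
import math

def train_stump(X, y, w):
    """Pick the Boolean feature and polarity with the lowest weighted error."""
    best = None
    for j in range(len(X[0])):
        for polarity in (0, 1):
            # The stump predicts class 1 when feature j equals `polarity`.
            err = sum(wi for xi, yi, wi in zip(X, y, w)
                      if (1 if xi[j] == polarity else 0) != yi)
            if best is None or err < best[0]:
                best = (err, j, polarity)
    return best  # (weighted error, feature index, polarity)

def stump_predict(stump, x):
    _, j, polarity = stump
    return 1 if x[j] == polarity else 0

def adaboost_m1(X, y, rounds=10):
    n = len(X)
    w = [1.0 / n] * n          # start with uniform example weights
    ensemble = []              # list of (vote weight, stump)
    for _ in range(rounds):
        stump = train_stump(X, y, w)
        err = stump[0]
        if err >= 0.5:         # base learner no better than chance: stop
            break
        err = max(err, 1e-10)  # assumption: clamp to avoid division by zero
        beta = err / (1.0 - err)
        ensemble.append((math.log(1.0 / beta), stump))
        # Down-weight correctly classified examples, then renormalize.
        w = [wi * (beta if stump_predict(stump, xi) == yi else 1.0)
             for xi, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def predict(ensemble, x):
    """Weighted vote of all stumps in the ensemble."""
    votes = {0: 0.0, 1: 0.0}
    for alpha, stump in ensemble:
        votes[stump_predict(stump, x)] += alpha
    return 1 if votes[1] > votes[0] else 0
```

Each round reweights the training set so that the next base classifier concentrates on the examples the previous ones got wrong. The final prediction is a vote weighted by log(1/beta).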


Data Collection

Gathered this collection in early 2003
◦ Benign executables
    • 1971
    • from Windows 2000 and XP operating systems
    • SourceForge
    • download.com
◦ Malicious executables
    • 1651
    • from the Web site VX Heavens
    • MITRE Corporation, the sponsors of this project

Recently obtained 291 malicious executables
  • from VX Heavens


Experimental Design

To evaluate the approach and methods
  • used stratified ten-fold cross-validation
  • randomly partitioned the executables into ten disjoint sets of equal size
  • used one as a testing set
  • combined the remaining nine to form a training set

Extracted n-grams from the executables in the training and testing sets

Selected the most relevant features from the training data

Conducted ROC analysis for each method
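The partitioning step can be sketched as follows. This is a minimal illustration of stratified k-fold splitting using only the Python standard library, not the tooling used in the study. The function names and the round-robin dealing strategy are assumptions made for the example.

```python
import random

def stratified_folds(labels, k=10, seed=0):
    """Partition example indices into k disjoint folds, preserving class ratios."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class round-robin so every fold keeps the class ratio.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds

def train_test_splits(labels, k=10, seed=0):
    """Yield (train_indices, test_indices): one fold tests, the other k-1 train."""
    folds = stratified_folds(labels, k, seed)
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test
```

Feature selection should use only the `train` indices of each split, which matches the slide's note that features were selected from the training data.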


Detecting Malicious Executables

Evaluated how well the learning methods detected malicious executables in
  • three experimental studies

The first was a pilot study to determine
  • the size of words and n-grams
  • the number of n-grams relevant for prediction

The second experiment applied all of the classification methods to
  • a small collection of executables

The third applied the methodology to
  • a larger collection of executables


Pilot Studies [1/2]

Pilot studies to determine three parameters
  • the size of n-grams
  • the size of words
  • the number of selected features

Extracted bytes from
  • 476 malicious executables and 561 benign executables
  • produced n-grams, for n = 4

Selected the best 10, 20, ..., 100, 200, ..., 1000, 2000, ..., 10000 n-grams
  • selecting 500 n-grams produced the best results
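The n-gram extraction can be sketched as below. The slide does not say how the window moves, so this example assumes overlapping n-grams, with the window sliding one byte at a time. The function names are illustrative.

```python
def byte_ngrams(data: bytes, n: int = 4) -> set:
    """Return the distinct overlapping n-byte sequences found in `data`."""
    # An input shorter than n bytes yields an empty set.
    return {data[i:i + n] for i in range(len(data) - n + 1)}

def hex_ngrams(data: bytes, n: int = 4) -> set:
    """Same n-grams, rendered as hex strings so they read like text tokens."""
    return {gram.hex() for gram in byte_ngrams(data, n)}
```

For example, with n = 3 the byte sequence ff 00 ab 3e 12 b3 gives the four terms ff00ab, 00ab3e, ab3e12, and 3e12b3.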


Pilot Studies [2/2]

Fixed the number of n-grams
  • at 500
  • varied n, the size of the n-grams

Evaluated the same methods for n = 1, 2, ..., 10
  • n = 4 produced the best results

Varied the size of the words (one byte, two bytes, etc.)
  • single bytes produced better results


Classification Methodology

Formed training examples
  • used the n-grams extracted from the executables
  • by viewing each n-gram as a Boolean attribute

Selected the most relevant attributes by
  • computing the information gain (IG) for each

Selected the top 500 n-grams
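The information gain formula itself did not survive on this slide. The sketch below assumes the standard mutual-information form for a Boolean attribute, IG(j) = sum over v in {0, 1} and over classes C of P(v, C) log [P(v, C) / (P(v) P(C))], with probabilities estimated from training-set counts. The function names and the tie-breaking rule are illustrative assumptions.

```python
import math
from collections import Counter

def information_gain(feature_values, labels):
    """Mutual information (in bits) between a Boolean attribute and the class."""
    n = len(labels)
    joint = Counter(zip(feature_values, labels))
    count_v = Counter(feature_values)
    count_c = Counter(labels)
    ig = 0.0
    for (v, c), count in joint.items():
        p_vc = count / n
        ig += p_vc * math.log2(p_vc / ((count_v[v] / n) * (count_c[c] / n)))
    return ig

def select_top_ngrams(ngram_sets, labels, k=500):
    """Rank every n-gram by information gain and keep the k best.

    `ngram_sets[i]` is the set of n-grams present in training executable i,
    so each n-gram acts as a Boolean attribute (present or absent).
    """
    vocabulary = set().union(*ngram_sets)
    scores = {}
    for gram in vocabulary:
        present = [gram in grams for grams in ngram_sets]
        scores[gram] = information_gain(present, labels)
    # Ties are broken by the n-gram itself to keep the result deterministic.
    return sorted(vocabulary, key=lambda g: (-scores[g], g))[:k]
```

This loop scores one n-gram at a time and is only practical for small vocabularies. With tens of millions of distinct n-grams, the counts would need to be accumulated in a single pass over the data.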


Experiment with a Small Collection

The executables produced 68,744,909 distinct n-grams

Areas under the ROC curves (AUC) with 95% confidence intervals
  • the boosted methods performed well
  • naive Bayes did not perform as well
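For readers unfamiliar with AUC, the following is a minimal sketch of one common way to compute it: as the fraction of (malicious, benign) pairs in which the malicious example receives the higher score. This is an illustration, not the ROC software used in the study, and it does not compute the confidence intervals mentioned above. Labels are assumed to be 1 for malicious and 0 for benign.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of scores."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one example of each class")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0   # malicious example ranked above benign one
            elif p == q:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))
```

An AUC of 1.0 means every malicious executable is scored above every benign one. A value of 0.5 corresponds to random ranking.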


Experiment with a Larger Collection

This collection consisted of
  • 1971 benign executables
  • 1651 malicious executables
  • over 255 million distinct n-grams of size four

Areas under the ROC curves (AUC) with 95% confidence intervals
  • boosted J48 outperformed all other methods


Classifying Executables by Payload Function

Classify malicious executables based on
  • the function of their payload
  • present results for three functional categories
  • opened a backdoor, mass-mailed, was an executable virus

Aim to reduce the effort of analyzing previously undiscovered malicious executables


Evaluating Real-world, Online Performance

Compared the actual detection rates
  • larger collection vs. the 291 new malicious executables

Selected three desired false-positive rates
  • 0.01, 0.05, 0.1

Detected about 98% of the new malicious executables
  • with boosted J48
  • at a desired false-positive rate of 0.05
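One way to turn a desired false-positive rate into an operating point is sketched below. It assumes each classifier outputs a numeric score where higher means more likely malicious, and that a file is flagged when its score exceeds a threshold. The function names are illustrative, and the procedure is an assumption about how such a threshold could be chosen, not a description of the study's exact method.

```python
def threshold_for_fpr(benign_scores, desired_fpr):
    """Lowest threshold whose false-positive rate on benign scores is <= desired_fpr."""
    ranked = sorted(benign_scores, reverse=True)
    allowed = int(desired_fpr * len(ranked))  # benign files we may flag
    if allowed >= len(ranked):
        return float("-inf")                  # every file may be flagged
    # Flagging scores strictly above this value flags at most `allowed` benign files.
    return ranked[allowed]

def detection_rate(malicious_scores, threshold):
    """Fraction of malicious executables flagged at the given threshold."""
    return sum(s > threshold for s in malicious_scores) / len(malicious_scores)
```

The threshold would be chosen on scores from the original collection and then applied unchanged to the new executables, so that the reported detection rate reflects performance on unseen data.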


Conclusion

Detecting and classifying unknown malicious executables with
  • machine learning, data mining, and text classification

Detecting malicious executables
  • boosted J48 produced the best detector, with an area under the ROC curve of 0.996

Classifying malicious executables by payload function
  • boosted J48 produced the best detectors, with areas under the ROC curve around 0.9


Questions
