Toward an Automatic, Online Behavioral Malware Classification System
Raymond Canzanese*, Moshe Kam*, Spiros Mancoridis**
* Department of Electrical and Computer Engineering
** Department of Computer Science
Data Fusion Laboratory, dfl.ece.drexel.edu

Goals
Our goal is to create self-protecting servers that can:
•  Detect the execution of malware on the system,
•  Classify the malware according to a set of behavioral features, and
•  Mitigate the malware infection based on its computed class.
The focus of this work is the classification and mitigation of the detected malware.
The described system consists of:
•  Sensors installed on production servers
that collect data as the servers perform
their day-to-day tasks;
•  A feature extractor that preprocesses
the data to use in the detector and
classifier;
•  A detector that detects when malware
executes;
•  A classifier that uses the features extracted from the sensors to determine what
type (category, family, subfamily) of malware executed on the server; and
•  A mitigator that uses information about the malware type to mitigate the
infection.
The focus of this work is on the on-line classification and mitigation of detected
malware.

Classification
We take a supervised learning approach
to online malware classification, using
antivirus vendor labels as ground truth to
learn a random forest of binary decision
trees.
The decision trees consist of a series of
simple threshold tests at each node, where
the leaf nodes indicate the computed
labels.

Accuracy vs. Time
•  Allowing malware to execute
longer enables more data
collection and provides more
accurate results.
•  Classification accuracy is highest
when the random forests are
trained with ground truth labels
that provide more specific
information about the malware.
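The random forest of threshold-test decision trees described under Classification can be illustrated with a minimal prediction-only sketch. The trees here are built by hand rather than learned, the labels (`class_a`, `class_b`, `class_c`), thresholds, and feature order are made up, and combining trees by majority vote is the usual random forest convention rather than something the poster states.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Sequence, Union


@dataclass
class Leaf:
    label: str  # computed malware label stored at a leaf


@dataclass
class Node:
    feature: int      # index of the feature tested at this node
    threshold: float  # go left if x[feature] <= threshold, else right
    left: "Tree"
    right: "Tree"


Tree = Union[Leaf, Node]


def predict_tree(tree: Tree, x: Sequence[float]) -> str:
    """Follow threshold tests from the root until a leaf is reached."""
    while isinstance(tree, Node):
        tree = tree.left if x[tree.feature] <= tree.threshold else tree.right
    return tree.label


def predict_forest(forest: Sequence[Tree], x: Sequence[float]) -> str:
    """Majority vote over the labels computed by the individual trees."""
    votes = Counter(predict_tree(tree, x) for tree in forest)
    return votes.most_common(1)[0][0]


# Hand-built toy forest; labels and thresholds are placeholders.
forest = [
    Node(0, 1.0, Leaf("class_a"), Leaf("class_b")),
    Node(1, 0.01, Leaf("class_a"),
         Node(0, 3.0, Leaf("class_b"), Leaf("class_c"))),
    Node(0, 2.0, Leaf("class_a"), Leaf("class_b")),
]

# Assumed feature order: [normalized mean change, variance change].
sample = [4.54, 0.014]
print(predict_forest(forest, sample))
```

In a real deployment the trees would be learned from samples labeled by antivirus vendors; the sketch only shows how a feature vector moves through threshold tests to a label.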

Feature Extraction
For classification, we extract features from three distinct types of sensors:
Performance Monitors, System Call Counters, and System Call Sequence Counters.
System Call Counters report the number of calls made per second to each system function.

Experimental Evaluation
The purpose of the experiments is to:
•  Determine which set of features provides the
most discrimination among malware classes;
•  Determine which ground truth labeling system
affords the best performance; and
•  Evaluate the overall accuracy of the classifier.
Feature selection is accomplished by ranking features according to the amount of
information they provide about the classification task and selecting a fixed number
of features.
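One way to sketch this ranking step is with mutual information between each feature and the class label. The poster does not name the information measure, so mutual information, the median-based binarization, and the toy data below are all assumptions made for illustration.

```python
import math
from collections import Counter
from statistics import median
from typing import List, Sequence


def mutual_information(values: Sequence[int], labels: Sequence[str]) -> float:
    """Mutual information (in bits) between a discrete feature and the labels."""
    n = len(values)
    joint = Counter(zip(values, labels))
    count_v = Counter(values)
    count_l = Counter(labels)
    mi = 0.0
    for (v, l), c in joint.items():
        # p(v,l) * log2(p(v,l) / (p(v) * p(l))), written with raw counts
        mi += (c / n) * math.log2(c * n / (count_v[v] * count_l[l]))
    return mi


def binarize(column: Sequence[float]) -> List[int]:
    """Crude discretization (assumption): 1 if above the column median."""
    m = median(column)
    return [int(x > m) for x in column]


def select_features(rows: Sequence[Sequence[float]],
                    labels: Sequence[str], k: int) -> List[int]:
    """Rank feature columns by mutual information; keep the top k indices."""
    columns = list(zip(*rows))
    scores = [mutual_information(binarize(col), labels) for col in columns]
    ranked = sorted(range(len(columns)), key=lambda j: scores[j], reverse=True)
    return ranked[:k]


# Toy data: 4 samples, 3 features; only feature 0 separates the classes.
rows = [
    [0.1, 5.0, 1.0],
    [0.2, 1.0, 1.1],
    [0.9, 4.0, 0.9],
    [0.8, 2.0, 1.2],
]
labels = ["class_a", "class_a", "class_b", "class_b"]
print(select_features(rows, labels, k=1))
```

Selecting a fixed `k` mirrors the "fixed number of features" in the text; how large `k` should be is an experimental question.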
System Call Sequence Counters report the number of calls made per second to each sequence of 2 system functions.
Since the sensors each report a time series of data, we collect the data before and
after the malware executes and compute the normalized change in mean and the
change in variance that occur around the infection time. These two features are
extracted for each of the sensors.
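The two features can be sketched for a single sensor as follows. The poster does not spell out the normalization; dividing the change in mean by the pre-infection variance is an assumption, chosen because it reproduces the poster's worked example (0.08 / 0.0176 ≈ 4.545, reported as 4.54). Population variance is likewise assumed because it matches the reported 0.018 and 0.032. The function name and the five-sample windows are illustrative.

```python
from statistics import fmean, pvariance
from typing import Sequence, Tuple


def change_features(series: Sequence[float],
                    infection_index: int) -> Tuple[float, float]:
    """Return (normalized mean change, variance change) around an infection.

    series[:infection_index] is the window before the malware executes,
    series[infection_index:] is the window after.
    """
    before = series[:infection_index]
    after = series[infection_index:]
    mean_before, mean_after = fmean(before), fmean(after)
    var_before, var_after = pvariance(before), pvariance(after)
    if var_before == 0:
        raise ValueError("pre-infection variance is zero; cannot normalize")
    # Assumed normalization: divide by the pre-infection variance.
    normalized_mean_change = (mean_after - mean_before) / var_before
    variance_change = var_after - var_before
    return normalized_mean_change, variance_change


# Sensor readings from the poster's example, t = 145 s to 154 s,
# with the infection assumed to occur between t = 149 s and t = 150 s.
data = [2.2, 2.6, 2.4, 2.4, 2.5, 2.6, 2.7, 2.4, 2.6, 2.2]
nmc, vc = change_features(data, infection_index=5)
print(f"normalized mean change = {nmc:.3f}, variance change = {vc:.4f}")
```

Running the same function over every sensor's time series would yield two features per sensor, as the text describes.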
time (s)    …  145  146  147  148  149 | 150  151  152  153  154  …
data        …  2.2  2.6  2.4  2.4  2.5 | 2.6  2.7  2.4  2.6  2.2  …
mean           2.42                    | 2.50
variance       0.018                   | 0.032
Normalized mean change: 4.54
Variance change: 0.014

We evaluate the classifier on a corpus of 800
distinct malware samples. Our experiments are
performed on virtual machine hosts configured as
web servers and database servers under heavy
computational load.
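The system call sequence counters named under Feature Extraction report per-second counts of sequences of 2 system functions. A minimal sketch of such a counter is given here, assuming a time-ordered trace of `(timestamp, call_name)` tuples. The trace format, the call names, and the choice to attribute each pair to the second in which its second call occurs are all assumptions.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple

Pair = Tuple[str, str]


def count_call_pairs(
    trace: Iterable[Tuple[float, str]]
) -> Dict[int, Counter]:
    """Count consecutive call pairs per one-second bin.

    trace: (timestamp_in_seconds, call_name) tuples in time order.
    Returns {second: Counter({(previous_call, call): count})}.
    """
    per_second: Dict[int, Counter] = defaultdict(Counter)
    previous = None
    for timestamp, call in trace:
        if previous is not None:
            # Assumption: a pair belongs to the second of its later call.
            per_second[int(timestamp)][(previous, call)] += 1
        previous = call
    return per_second


# Placeholder trace; call names are illustrative only.
trace = [
    (145.1, "open"), (145.3, "read"), (145.6, "read"),
    (146.2, "write"), (146.4, "close"), (146.9, "open"),
]
counts = count_call_pairs(trace)
for second in sorted(counts):
    print(second, dict(counts[second]))
```

Each distinct pair then yields its own per-second time series, which is why this sensor type produces many more features than the single-call counters.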
Performance Monitors report system resource usage every second:
•  CPU, disk, network, memory
•  Application-specific
•  VM-specific

Software Engineering Research Group

Feature Selection
•  The system call sequence features
provide the most discrimination.
•  Accuracy initially increases as more
features are selected.
•  Accuracy gradually decreases as
less important features are included
and the random forest models
suffer from overfitting.
SASO 2013, Drexel University, 10 September 2013

Confusion Matrix
[Figure: confusion matrix, color scale 0% to 100%]
•  Shows the differences between
the ground truth and computed
labels. Rows sum to 100%.
•  The classifier is able to perfectly
identify certain classes of malware.
•  Those malware classes with the
lowest performance are the most
generic classes.
•  The lack of blockwise-diagonal
structure indicates little similarity
within families.
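A confusion matrix whose rows sum to 100% can be computed as sketched here. The sketch assumes that rows correspond to ground truth labels and columns to computed labels, which the poster implies but does not state; the labels and predictions are toy values.

```python
from collections import Counter
from typing import Dict, Sequence


def confusion_matrix_percent(
    true_labels: Sequence[str], predicted_labels: Sequence[str]
) -> Dict[str, Dict[str, float]]:
    """Row-normalized confusion matrix: matrix[truth][computed] in percent."""
    classes = sorted(set(true_labels) | set(predicted_labels))
    counts = Counter(zip(true_labels, predicted_labels))
    matrix: Dict[str, Dict[str, float]] = {}
    for truth in classes:
        row_total = sum(counts[(truth, computed)] for computed in classes)
        matrix[truth] = {
            computed: (100.0 * counts[(truth, computed)] / row_total
                       if row_total else 0.0)
            for computed in classes
        }
    return matrix


# Toy labels for illustration only.
truth = ["class_a"] * 4 + ["class_b"] * 2 + ["class_c"] * 2
computed = ["class_a", "class_a", "class_a", "class_b",
            "class_b", "class_b", "class_c", "class_a"]
matrix = confusion_matrix_percent(truth, computed)
for label, row in matrix.items():
    print(label, row)
```

A perfectly identified class shows 100% on the diagonal of its row, as `class_b` does in this toy data.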

Mitigation
We can automatically mitigate malware infections using information about the
computed classes because:
•  Antivirus family and subfamily labels indicate specific malicious functions, and
•  The described classifier performs its best on the most specific labels.
Possible mitigations include:
•  Blocking specific network ports or services associated with the malware
•  Reverting unwanted configuration changes
•  Terminating services or processes associated with the malware
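As a rough sketch of how a computed class could select a mitigation, the lookup below tries the most specific label first and falls back to broader ones. The `category/family/subfamily` string format, every table entry, the port number, and the "alert operator" default are hypothetical; the functions only return descriptions and change nothing on the system.

```python
from typing import Dict, List

# Hypothetical table keyed by label prefix; all entries are placeholders.
MITIGATIONS: Dict[str, List[str]] = {
    "category_x/family_y/subfamily_z": [
        "block network port 12345",
        "terminate process example_service",
    ],
    "category_x": ["revert unwanted configuration changes"],
}


def plan_mitigation(label: str, table: Dict[str, List[str]]) -> List[str]:
    """Return the steps for the most specific matching label prefix."""
    parts = label.split("/")
    while parts:
        key = "/".join(parts)
        if key in table:
            return table[key]
        parts.pop()  # fall back from subfamily to family to category
    return ["alert operator"]  # assumed default when nothing matches


print(plan_mitigation("category_x/family_y/subfamily_z", MITIGATIONS))
print(plan_mitigation("category_x/family_q/subfamily_r", MITIGATIONS))
```

The fallback order reflects the point made above: the more specific the computed label, the more targeted the mitigation can be.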