URLDoc: Learning to Detect Malicious URLs using Online Logistic Regression
Presented by: Mohammed Nazim Feroz
11/26/2013

Motivation
- Web services create new opportunities for people to interact; they also create new opportunities for criminals
- Google detects about 300,000 malicious websites per month, a clear indication that criminals are exploiting these opportunities
- Almost all online threats have one thing in common: they require the user to click on a hyperlink or type in a website address

Motivation
- The user needs to perform sanity checks and assess the risk of visiting a URL
- Performing such an evaluation may be impossible for a novice user
- As a result, users often end up clicking links without paying close attention to the URLs, which leaves them vulnerable to the malicious websites that exploit them

Introduction
- The openness of the web gives criminals opportunities to upload malicious content
- Do techniques exist to prevent malicious content from entering the web?

Current Techniques
- Security practitioners have developed techniques such as blacklisting to protect users from malicious websites
- Although this approach has minimal overhead, it does not provide complete protection: only about 55% of malicious URLs are present in blacklists
- Another drawback of this approach is that malicious websites are absent from the blacklist during the window before their detection

Current Techniques
- Security researchers have done extensive work on detecting accounts on social networks that are used to spread malicious messages
- This approach still does not provide thorough protection in settings such as social networks, where interaction happens in real time, because a profile of malicious activity must first be built, and that process can take a considerable amount of time

Current Techniques
- The researchers behind TokDoc used a method that decides on a per-token basis whether a token requires automatic healing
- Their work uses n-grams and length as features for detecting malicious URLs
- This research builds on their idea by supplementing a subset of their features with host-based features, since the latter have been shown to carry a wealth of usable information

Approach
- URLDoc classifies URLs automatically based on lexical (textual) and host-based features
- Scalable machine learning algorithms from Mahout are used to develop and test the classifier
- Online learning is chosen over batch learning
- The classifier achieves 93-97% accuracy, detecting a large number of malicious hosts with a modest false positive rate

Approach
- If the predictor variables are correctly identified and the URLs' metadata is carefully derived, the machine learning algorithms used can sift through tens of thousands of features
- Online algorithms are preferred over batch-learning algorithms
- Batch-learning algorithms look at every example in the training dataset on every step and then update the weights of the classifier, a costly operation when the number of training examples is large

Approach
- Online algorithms update the weights according to the gradient of the error with respect to a single training example
- Online algorithms can therefore process datasets far more efficiently than batch algorithms

Problem Formulation
- URL classification lends itself naturally to a binary classification problem
- The target variable y(i) can take one of two possible values: malicious or benign
- With k predictor variables over all categories, each URL i is described by x1(i), ..., xk(i), a k-dimensional feature vector characterizing the URL
- The goal is to learn a function h(x) = y that maps the space of input values to the space of output values, so that h(x) is a good predictor for the corresponding value of y
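For logistic regression, h(x) is the sigmoid of a weighted sum of the features: h(x) = 1 / (1 + exp(-w.x)). A minimal Java sketch of that hypothesis follows; the weights and feature values below are hypothetical placeholders (in URLDoc the weights are learned by Mahout, as sketched later):

```java
/** Minimal sketch of the logistic hypothesis h(x) for binary URL classification. */
public class LogisticHypothesis {
    /** h(x) = 1 / (1 + e^(-w.x)): maps any real-valued score to a probability in (0, 1). */
    static double h(double[] w, double[] x) {
        double score = 0.0;
        for (int i = 0; i < w.length; i++) {
            score += w[i] * x[i];
        }
        return 1.0 / (1.0 + Math.exp(-score));
    }

    public static void main(String[] args) {
        double[] w = {0.8, -1.2, 2.5};  // hypothetical learned weights
        double[] x = {1.0, 0.0, 1.0};   // hypothetical feature vector for one URL
        double p = h(w, x);
        // Predict "malicious" when the probability crosses 0.5.
        System.out.println(p >= 0.5 ? "malicious (p=" + p + ")" : "benign (p=" + p + ")");
    }
}
```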
Problem Formulation
- There are two main phases involved in building a classification system
- The first phase creates the model (i.e., the function h(x)) produced by the learning algorithm
- The second phase uses that model to assign new data from the test dataset to its predicted target class
- The selection of the training dataset and its predictor variables, the target classes, and the learning algorithm are vital in the first phase
- Predicted labels are compared with known answers to evaluate the classifier

Overview of Features
- Lexical features
- These features have both binary and continuous values
- They include:
  - Length of the URL
  - Number of dots in the URL
  - Tokens present in the hostname, primary domain, and path parts of a URL
- Tokens in the hostname are further characterized as bigrams
- Bigrams capture patterns in character strings that recur in certain combinations, even when the individual tokens are permuted
- Example: www.depts.ttu.edu yields the bigrams "depts ttu" and "ttu edu"

Overview of Features
- Host-based features:
  - IP address of the URL (A record)
  - IP address of the mail exchanger (MX record)
  - IP address of the name server (NS record)
  - PTR record
  - AS number
  - IP prefix

Overview of Features
- Malicious websites have exhibited a pattern of being hosted in particular "bad" portions of the Internet
- Example: McColo provided hosting for major botnets, which in turn were responsible for sending 41% of the world's spam just before McColo's takedown in November 2008; McColo's AS number was 26780
- These portions of the Internet can be characterized on a regular basis by retraining on the predictor variables
- This allows keeping track of concept drift

Online Logistic Regression with SGD
- Logistic regression is a very flexible algorithm, as it allows the predictor variables to be of both types, continuous and binary
- Mahout greatly helps in the learning process by choosing a suitable learning rate, allowing the classification system to converge to the global minimum

Online Logistic Regression with SGD
- Compared to batch learning, online learning is usually much faster, adapts to change continuously, and copes much better with large training and test datasets
- Support Vector Machines were considered but not chosen, since they take longer to train than Online Logistic Regression
- Online Logistic Regression converges more quickly if the malicious and benign URLs in the training dataset are presented in random order

Feature Vector
- Feature hashing is used to encode the raw feature data into feature vectors
- In this approach, a reasonable size (i.e., dimension) is picked for the feature vector, and the data is hashed into feature vectors of the chosen size
- After careful consideration of the datasets, the feature vectors in this research live in a 100,000-dimensional space

Feature Vector Example
- The data is encoded into the feature vector as continuous, categorical, word-like, and text-like features using the Mahout API
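A sketch of how the hashed encoding and the online training step might look with Mahout's 0.x SGD classes. This is an illustration, not the deck's actual implementation: the class name, encoder names, crude tokenizer, and hyperparameters (learning rate, lambda) are placeholders; only a length feature and word-like token features are encoded here.

```java
import org.apache.mahout.classifier.sgd.L1;
import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.vectorizer.encoders.ContinuousValueEncoder;
import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;

public class UrlClassifierSketch {
    private static final int FEATURES = 100000; // hashed feature space, per the slides

    // Hashed encoders: each named encoder hashes into its own region of the vector.
    private final StaticWordValueEncoder tokenEncoder = new StaticWordValueEncoder("token");
    private final ContinuousValueEncoder lengthEncoder = new ContinuousValueEncoder("url-length");

    // 2 categories (malicious/benign), 100k hashed features, L1 prior for sparsity.
    private final OnlineLogisticRegression model =
        new OnlineLogisticRegression(2, FEATURES, new L1())
            .learningRate(1)   // placeholder hyperparameters, not the deck's values
            .lambda(1e-5);

    /** Encode a couple of illustrative lexical features into one hashed vector. */
    Vector encode(String url) {
        Vector v = new RandomAccessSparseVector(FEATURES);
        lengthEncoder.addToVector((String) null, url.length(), v); // continuous feature
        for (String token : url.split("[./?=&:_-]+")) {            // crude tokenizer
            if (!token.isEmpty()) {
                tokenEncoder.addToVector(token, v);                // word-like feature
            }
        }
        return v;
    }

    /** One SGD step on a single labeled example: this is the online update. */
    void train(String url, int label) { // label: 1 = malicious, 0 = benign
        model.train(label, encode(url));
    }

    /** For a 2-class model, classifyScalar returns P(category 1), i.e. P(malicious). */
    double maliciousProbability(String url) {
        return model.classifyScalar(encode(url));
    }
}
```

Because each train call touches only one example, the model can be updated as new URLs stream in, which is what makes the online setting suited to tracking concept drift.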
Results
- [Charts] Classification accuracy for 90/10, 80/20, and 50/50 training/test dataset splits (benign vs. malicious)

Other Approaches Attempted
- Term Frequency - Inverse Document Frequency
  - A bag-of-words approach was used, and a term (lexical features) by document (URL) matrix was created
  - Online Logistic Regression did not benefit from this term weighting
- Clustering
  - The URLs are viewed as a set of vectors in a vector space
  - Cosine similarity was used as the similarity measure between URLs (see the backup sketch after the last slide)
  - This research favored classification over clustering since the target classes of the URLs were known; clustering is known to be useful when the target classes are unknown

Future Work
- Study the various features extensively and use only those with the highest contributions; also add new features that would help in better classification
- Try algorithms that can benefit from parallelization

Summary
- A reliable framework for the classification of URLs is built
- A supervised learning method is used to learn the characteristics of both malicious and benign URLs and to classify them in real time
- The applicability and usefulness of Mahout for the URL classification task is demonstrated, and the benefits of an online setting over a batch setting are illustrated: the online setting enabled learning new trends in the characteristics of URLs over time

Questions?
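Backup: a minimal sketch of the cosine similarity measure mentioned under Other Approaches Attempted. This is an illustration, not the deck's implementation; the two vectors are hypothetical URL feature vectors.

```java
/** Backup slide: cosine similarity between two URL feature vectors. */
public class CosineSimilaritySketch {
    /** cos(a, b) = a.b / (|a| * |b|); assumes both vectors are nonzero. */
    static double cosine(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        double[] u1 = {1, 0, 1, 1};  // hypothetical feature counts for one URL
        double[] u2 = {1, 1, 1, 0};  // hypothetical feature counts for another URL
        // 1.0 means identical direction; 0.0 means no shared features.
        System.out.println("similarity = " + cosine(u1, u2));
    }
}
```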