CS 5375 Presentation

URLDoc: Learning to Detect
Malicious URLs using Online
Logistic Regression
Presented by:
Mohammed Nazim Feroz
 Web services create new opportunities for people to
interact, but they also create new opportunities for
criminals
 Google detects about 300,000 malicious websites per
month, a clear indication that criminals are exploiting
these opportunities
 Almost all online threats have something in common:
they require the user to click on a hyperlink or type
in a website address
 The user needs to perform sanity checks and assess
the risk of visiting a URL
 Performing such an evaluation might be impossible
for a novice user
 As a result, users often click links without paying
close attention to the URLs, which leaves them
vulnerable to exploitation by malicious websites
 Openness of the web exposes opportunities for
criminals to upload malicious content
 Do techniques exist to prevent malicious content from
entering the web?
Current Techniques
 Security practitioners have developed techniques such
as blacklisting in order to protect users from malicious
websites
 Although this approach has minimal overhead, it does
not provide complete protection, as only about 55% of
malicious URLs are present in blacklists
 Another drawback is that a malicious website is absent
from the blacklist until it has been detected
Current Techniques
 Security researchers have done extensive work on
detecting social network accounts that are used to
spread malicious messages
 This approach still does not protect users thoroughly
in settings such as social networks, where interaction
happens in real time: a profile of malicious activity
must be built first, and that can take a considerable
amount of time
Current Techniques
 The authors of TokDoc use a method that decides on a
per-token basis whether a token requires automatic
healing
 Their work uses n-grams and length as features for
detecting malicious URLs
 This research builds on their idea by supplementing a
set of their features with host-based features, as the
latter have exhibited a wealth of information
 URLDoc classifies URLs automatically based on the
lexical (textual) and host-based features
 Scalable machine learning algorithms from Mahout
are used to develop and test the classifier
 Online learning is chosen over batch learning
 The classifier achieves 93-97% accuracy by detecting
a large number of malicious hosts, with a modest false
positive rate
 If the predictor variables are correctly identified and
the URL metadata is carefully derived, the machine
learning algorithms can sift through tens of thousands
of features
 Online algorithms are preferred over batch-learning
algorithms
 Batch learning algorithms look at every example in
the training dataset on every step and then update the
weights of the classifier – a costly operation if the
number of training examples is large
 Online algorithms update the weights according to the
gradient of the error with respect to a single training
example
 Online algorithms are able to process datasets far
more efficiently than batch algorithms
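The difference between the two update styles can be sketched in Python as follows. This is a minimal illustration that assumes a logistic model and a toy dataset with a bias term and one feature; the function names, the learning rate of 0.5, and the data are all hypothetical, and this is not Mahout's implementation.

```python
import math
import random

def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict(w, x):
    """Estimated probability that feature vector x is malicious (label 1)."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)))

def batch_step(w, data, lr):
    """One batch update: the gradient is averaged over every training example."""
    grad = [0.0] * len(w)
    for x, y in data:
        err = predict(w, x) - y
        for j, xj in enumerate(x):
            grad[j] += err * xj
    return [wj - lr * gj / len(data) for wj, gj in zip(w, grad)]

def online_step(w, x, y, lr):
    """One online (SGD) update: the gradient comes from a single example."""
    err = predict(w, x) - y
    return [wj - lr * err * xj for wj, xj in zip(w, x)]

# Toy data (assumed): [bias, feature]; label 1 = malicious, 0 = benign.
data = [([1.0, 0.1], 0), ([1.0, 0.3], 0), ([1.0, 0.8], 1), ([1.0, 0.9], 1)]

rng = random.Random(0)
w = [0.0, 0.0]
for epoch in range(200):
    rng.shuffle(data)                      # present examples in random order
    for x, y in data:
        w = online_step(w, x, y, lr=0.5)   # cost per update: one example

# Pair each known label with the thresholded prediction.
predictions = [(y, predict(w, x) >= 0.5) for x, y in data]
```

Each call to `batch_step` touches the whole dataset before the weights move once, whereas `online_step` moves them after every example, which is why the online form scales to large or streaming datasets.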
Problem Formulation
 URL classification lends itself naturally to
formulation as a binary classification problem
 The target variable y(i) can take one of two possible
values: malicious or benign
 With k predictor variables over all feature categories,
each URL has values x1(i), …, xk(i), which form a
k-dimensional feature vector characterizing the URL
 The goal is to learn a function h(x)=y that maps the
space of input values to the space of output values so
that h(x) is a good predictor for the corresponding
value of y
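As a hedged illustration, the formulation can be written out for the logistic model named in the title. The encoding of malicious as 1 and benign as 0, the weight vector w, and the 0.5 decision threshold are notational assumptions, not taken from the slides.

```latex
\[
x^{(i)} = \bigl(x_1^{(i)}, \dots, x_k^{(i)}\bigr) \in \mathbb{R}^k,
\qquad
y^{(i)} \in \{0, 1\}
\]
\[
h_w(x) = \frac{1}{1 + e^{-w^\top x}},
\qquad
\hat{y} =
\begin{cases}
1 & \text{if } h_w(x) \ge 0.5 \\
0 & \text{otherwise}
\end{cases}
\]
```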
Problem Formulation
 Building a classification system involves two main
phases
 The first phase creates the model (i.e. the function h(x)) produced by the
learning algorithm
 The second phase makes use of that model to assign new data from the test
dataset to its predicted target class
 The selection of the training dataset and its predictor
variables, the target classes, and the learning
algorithm are vital in the first phase of building the
classification system
 Predicted labels are compared with known answers to
evaluate the classifier
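The comparison of predicted labels with known answers can be sketched as below. The label encoding (1 = malicious, 0 = benign), the function name, and the sample labels are assumptions made for illustration; they are not results from this work.

```python
def evaluate(predicted, actual):
    """Compare predicted labels with known labels (1 = malicious, 0 = benign)."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p == 1 and a == 1)  # malicious, caught
    tn = sum(1 for p, a in pairs if p == 0 and a == 0)  # benign, passed
    fp = sum(1 for p, a in pairs if p == 1 and a == 0)  # benign, flagged
    accuracy = (tp + tn) / len(pairs)
    # False positive rate: share of benign URLs wrongly flagged as malicious.
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return {"accuracy": accuracy, "false_positive_rate": fpr}

# Made-up labels for eight test URLs.
predicted = [1, 0, 1, 1, 0, 0, 1, 0]
actual    = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = evaluate(predicted, actual)
```

Reporting the false positive rate next to accuracy matters here because a classifier that flags many benign URLs is costly to users even when its overall accuracy looks high.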
Overview of Features
 Lexical features
 These features have values of both types: binary and
continuous
 These features include
 Length of the URL
 Number of dots in the URL
 Tokens present in the hostname, primary domain, and path parts of a URL
 Features in the hostname are further characterized as bigrams
 Bigrams capture patterns in which character strings are permuted randomly and occur in
certain combinations
 Example: www.depts.ttu.edu → Bigrams: depts ttu, ttu edu
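The lexical features listed above can be extracted with a short Python sketch. The function name, the token delimiters, and the choice to drop a leading "www" label before forming bigrams are assumptions; the last one is made only so that the output matches the example on this slide.

```python
import re
from urllib.parse import urlparse

def lexical_features(url):
    """Extract simple lexical features from a URL string."""
    # urlparse only fills in the hostname when a scheme is present.
    parsed = urlparse(url if "://" in url else "http://" + url)
    host = parsed.hostname or ""
    host_tokens = [t for t in host.split(".") if t]
    # Assumption: drop a leading "www" label before building bigrams.
    if host_tokens and host_tokens[0] == "www":
        host_tokens = host_tokens[1:]
    # Assumption: split the path on common URL delimiters.
    path_tokens = [t for t in re.split(r"[/?.=&_\-]+", parsed.path) if t]
    # Bigrams are pairs of adjacent hostname tokens.
    bigrams = [f"{a} {b}" for a, b in zip(host_tokens, host_tokens[1:])]
    return {
        "url_length": len(url),
        "num_dots": url.count("."),
        "host_tokens": host_tokens,
        "path_tokens": path_tokens,
        "host_bigrams": bigrams,
    }

features = lexical_features("www.depts.ttu.edu")
```

Length and dot count are continuous features, while each token or bigram acts as a binary feature that is either present in the URL or not.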
Overview of Features
 Host-Based features
 IP address of the URL – A Record
 IP address of the Mail Exchanger – MX Record
 IP address of the Name Server – NS Record
 PTR Record
 AS number
 IP Prefix
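One possible way to turn host-based properties into categorical features is sketched below. It assumes the DNS and routing data have already been looked up and stored in a dictionary, so no network access is performed. The field names, the documentation-range IP addresses, the AS number 64500, and the fixed /24 prefix length are all hypothetical; in practice the IP prefix and AS number would come from routing data rather than a fixed mask.

```python
import ipaddress

def host_feature_strings(record, prefix_len=24):
    """Turn already-resolved host data into categorical 'name=value' strings."""
    feats = []
    for key in ("a_record", "mx_record", "ns_record", "ptr_record", "as_number"):
        if record.get(key) is not None:
            feats.append(f"{key}={record[key]}")
    # Assumption: approximate the IP prefix with a fixed-length network
    # prefix of the A record (real prefixes come from routing tables).
    if record.get("a_record"):
        net = ipaddress.ip_network(f"{record['a_record']}/{prefix_len}", strict=False)
        feats.append(f"ip_prefix={net}")
    return feats

# Hypothetical record using addresses reserved for documentation.
record = {
    "a_record": "192.0.2.77",
    "mx_record": "198.51.100.5",
    "ns_record": "203.0.113.9",
    "ptr_record": "host77.example.net",
    "as_number": 64500,
}
feats = host_feature_strings(record)
```

Treating each value as a categorical string lets the classifier learn a separate weight for, say, a particular AS number or prefix.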
Overview of Features
 Malicious websites have exhibited a pattern of being
hosted in a particular “bad” portion of the Internet
 Example: McColo provided hosting for major botnets, which in turn were
responsible for sending 41% of the world’s spam just before McColo’s
takedown in November 2008. McColo’s AS number was 26780
 These portions of the Internet can be characterized on
a regular basis by retraining on the predictor variables
 This makes it possible to keep track of concept drift
Online Logistic Regression with SGD
 Logistic regression is a very flexible algorithm, as it
allows the predictor variables to be of both types: continuous and binary
 Mahout greatly helps in the learning process by
choosing an optimum learning rate, thus allowing
the classification system to converge to the global
minimum
Online Logistic Regression with SGD
 Compared with batch learning, online learning is
usually much faster, adapts to changes continuously,
and copes much better with large training and test
datasets
 Support Vector Machines were considered but not
chosen since they take a longer period of time to train
when compared to Online Logistic Regression
 Online Logistic Regression converges more quickly if
malicious and benign URLs from the training dataset
are presented in a random order
Feature Vector
 Feature hashing is used in order to encode the raw
feature data into feature vectors
 In this approach, a reasonable size (i.e. dimension) is
picked for the feature vector and the data is put into
feature vectors of the chosen size
 After careful consideration of the datasets, feature
vectors with on the order of 100,000 dimensions are
used in this research
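The hashing idea can be illustrated with a small Python sketch. It uses MD5 from the standard library only to get a stable hash and stores the vector sparsely as a dictionary; the function names and the sample features are assumptions, and Mahout's own encoders work differently in detail.

```python
import hashlib

NUM_FEATURES = 100_000  # vector dimension, as chosen in this research

def bucket(name, num_features=NUM_FEATURES):
    """Map a feature name to a stable index in [0, num_features)."""
    digest = hashlib.md5(name.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_features

def hash_features(categorical, continuous, num_features=NUM_FEATURES):
    """Encode features as a sparse vector {index: value} of fixed dimension."""
    vec = {}
    for name in categorical:                 # tokens, bigrams, AS numbers, ...
        idx = bucket(name, num_features)
        vec[idx] = vec.get(idx, 0.0) + 1.0   # colliding features add up
    for name, value in continuous.items():   # URL length, number of dots, ...
        idx = bucket(name, num_features)
        vec[idx] = vec.get(idx, 0.0) + float(value)
    return vec

# Hypothetical features for one URL.
vec = hash_features(
    ["host_token=depts", "host_bigram=depts ttu", "as_number=64500"],
    {"url_length": 17, "num_dots": 3},
)
```

Because the dimension is fixed in advance, previously unseen tokens still map to a valid index, which suits online learning where the vocabulary keeps growing; the cost is that unrelated features occasionally collide.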
Feature Vector Example
 The data is encoded into the feature vector as
continuous, categorical, word-like, and text-like
features using the Mahout API
[Figure: 90/10 training/test dataset split]
[Figure: 80/20 training/test dataset split]
[Figure: 50/50 training/test dataset split]
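The 90/10, 80/20, and 50/50 training/test splits could be produced along the following lines. This is a sketch under assumptions: the function name, the fixed random seed, and the placeholder examples are hypothetical, and the slides do not say how the splits were actually drawn.

```python
import random

def split_dataset(examples, train_fraction, seed=0):
    """Shuffle a copy of the examples and cut it into training and test parts."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)   # fixed seed for repeatability
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

# Placeholder (url, label) pairs; label 1 = malicious, 0 = benign.
examples = [(f"url-{i}", i % 2) for i in range(100)]

splits = {frac: split_dataset(examples, frac) for frac in (0.9, 0.8, 0.5)}
sizes = {frac: (len(train), len(test)) for frac, (train, test) in splits.items()}
```

Shuffling before the cut also mixes malicious and benign URLs, so neither part ends up dominated by one class.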
Other Approaches Attempted
 Term Frequency – Inverse Document Frequency
 A bag-of-words approach was used, and a term-document matrix was created
(terms are lexical features, documents are URLs)
 Online Logistic Regression turned out not to be affected by better term weighting
 Clustering
 The URLs are viewed as a set of vectors in vector space
 Cosine similarity was used as the similarity measure between URLs
 This research focused on classification rather than clustering, since the target
classes of the URLs were known; clustering is known to be useful when
the target classes are unknown
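The cosine similarity measure used in the clustering attempt can be sketched as follows, treating each URL as a bag of tokens with raw counts as vector components. The function name and the token lists are hypothetical, and raw counts are an assumption; a TF-IDF weighting could be substituted for the counts.

```python
import math
from collections import Counter

def cosine_similarity(tokens_a, tokens_b):
    """Cosine of the angle between two bag-of-tokens count vectors."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0   # an empty URL is treated as similar to nothing
    return dot / (norm_a * norm_b)

# Hypothetical token lists for three URLs.
u1 = ["depts", "ttu", "edu"]
u2 = ["www", "ttu", "edu"]
u3 = ["example", "com"]

sim_close = cosine_similarity(u1, u2)   # two of three tokens shared
sim_far = cosine_similarity(u1, u3)     # no tokens shared
```

URLs that share most of their tokens score close to 1 and would fall into the same cluster, while URLs with no tokens in common score 0.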
Future Work
 Study the various features extensively and use only
those with the highest contributions; also add new
features that would improve classification
 Try to use algorithms that can benefit from
 A reliable framework for the classification of URLs is
 A supervised learning method is used in order to learn
the characteristics of both malicious and benign URLs
and classify them in real time
 The applicability and usefulness of Mahout for the
URL classification task are demonstrated, and the
benefits of an online setting over a batch setting
are illustrated: the online setting enabled learning new
trends in the characteristics of URLs over time