CMSC 691 Project
Proposal
Purusharth Prakash and Mohid Farazi
24-Feb-12
1. Objective
Build a classification model for predicting employee resource access based on
historical data for Amazon employees.
2. Summary
Amazon has made available one year of historical data on the types of resources
accessed by a group of employees as part of the MLSP Competition [1]. The
primary goal of the project is to build a classification model that uses employee
characteristics as features to predict the set of resources an employee would
require access to. The challenging aspects of the problem include (i) multiclass
prediction and (ii) a large dataset.
Initial data analysis steps would involve (i) combining resource access information
distributed across multiple tables into a single data set, (ii) filtering the data to
ensure consistency, (iii) variable transformation for a reduced memory footprint and
(iv) normalization to allow estimation of similarity metrics and comparison.
Statistical parameters and data visualization techniques such as histograms would
be used to understand the characteristics and distribution of the data. These analysis
steps would help in identifying the important features in the data set. Eigenanalysis
of the covariance matrix, such as principal component analysis (PCA), would be
used to confirm the important features and, if possible, to reduce
the dimensionality.
Since this is a multiclass problem, standard classification methods can be used only
after dividing the problem into K binary classification problems (where K is the
number of classes). This would increase the computational complexity substantially, as
K ~ 10^5. The sub-division creates an additional problem of class imbalance,
which needs to be addressed during the classifier training process. An appropriate
classification model will be chosen based on the results of the data analysis. The Naïve
Bayesian classifier and decision-tree-based classifiers are attractive initial choices
for large datasets such as this one because of their simplicity and low computational
requirements. Low prediction accuracy or the characteristics revealed by data analysis
may necessitate the use of advanced classifier methods such as SVM and ensemble
methods. Classifier model performance will be evaluated using K-fold cross
validation.
3. Problem Description
The employee data consists of the following features associated with each employee:
Column Name        Description
DATE               data is time stamped once per week, a user will appear as
                   changes are made to their profile
EMPLOYEE_ID        a unique ID for each employee
MGR_ID             The EMPLOYEE_ID of the manager of the current EMPLOYEE_ID
                   record; an employee may have only one manager at a time
ROLE_ROLLUP_1      Company role grouping category id 1; e.g. US Engineering
ROLE_ROLLUP_2      Company role grouping category id 2;
ROLE_DEPTNAME      Company role department description e.g. Retail
ROLE_TITLE         Company role business title description e.g. Senior
                   Engineer Retail Manager
ROLE_FAMILY_DESC   Company role family extended description e.g. Retail
                   Manager, Software Engineering
ROLE_FAMILY        Company role family description e.g. Retail Manager
ROLE_CODE          Company role code; this code is unique to each role such
                   as ‘Manager’
At any given time each employee is allowed access to a certain set of resources, each
identified by a RESOURCE_ID. The objective is to use the employee features to predict
the set of resources to which the employee needs access. In particular, the
predictor should be able to provide a yes/no answer given a RESOURCE_ID and a set
of employee features. A set of historical data is provided which contains employee
details (features) and the corresponding resources for a period of one year.
3.1 Problem Characteristics
It is obvious that the problem requires building a classifier trained and validated
using the historical data provided. However the problem poses several unique
challenges as given below, which necessitate careful analysis before attempting to
build the model.
Large Dataset:
The employee data set contains approximately 8 x 10^6 rows and 10 columns.
This makes the data matrix too large to fit entirely in computer memory in its
present form. Automated tools such as WEKA [2] have trouble loading and
analyzing the entire dataset, as data is continuously swapped between
memory and the hard drive.
In addition to the above constraint, large datasets often pose the problem of
model overfitting, which must be addressed during classifier training.
Redundant Features:
Although the number of features is not high, there is still a possibility of
redundant features; e.g. ROLE_FAMILY and ROLE_FAMILY_DESC may not
be entirely independent.
Large number of possible Classes:
The number of possible class values is quite high (~10^5). In addition, each
employee can be associated with multiple class values. This precludes the
possibility of a simple classification.
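The memory and redundancy concerns above can be explored with a single streaming pass over the data. The following is a minimal sketch, not the project's implementation: the inline SAMPLE rows are made-up stand-ins for the real file, and the helper names encode_columns and determines are hypothetical. It replaces each string value by a small integer code stored in a compact array, then checks whether one column is fully determined by another.

```python
import csv
import io
from array import array

# Made-up sample standing in for the real file; values are illustrative only.
SAMPLE = """EMPLOYEE_ID,ROLE_FAMILY_DESC,ROLE_FAMILY
1,"Retail Manager, Software Engineering",Retail Manager
2,"Retail Manager, Software Engineering",Retail Manager
3,"Retail Manager, Operations",Retail Manager
4,Finance Analyst,Finance
"""


def encode_columns(rows, columns):
    """Stream rows once, replacing each string by a small integer code."""
    codebooks = {c: {} for c in columns}        # column -> {string: code}
    encoded = {c: array("I") for c in columns}  # compact unsigned-int storage
    for row in rows:
        for c in columns:
            book = codebooks[c]
            # New strings get the next unused code; known strings reuse theirs.
            code = book.setdefault(row[c], len(book))
            encoded[c].append(code)
    return codebooks, encoded


def determines(encoded, a, b):
    """True if every value of column a always co-occurs with one value of b."""
    seen = {}
    for x, y in zip(encoded[a], encoded[b]):
        if seen.setdefault(x, y) != y:
            return False
    return True


reader = csv.DictReader(io.StringIO(SAMPLE))  # a real run would pass an open file
codebooks, encoded = encode_columns(reader, ["ROLE_FAMILY_DESC", "ROLE_FAMILY"])

print(list(encoded["ROLE_FAMILY_DESC"]))                       # [0, 0, 1, 2]
print(determines(encoded, "ROLE_FAMILY_DESC", "ROLE_FAMILY"))  # True
print(determines(encoded, "ROLE_FAMILY", "ROLE_FAMILY_DESC"))  # False
```

If one column turns out to be determined by another in the real data, the dependent column carries no additional information and is a candidate for removal.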
4. Approach
The proposed approach consists of the following steps.
4.1 Data Pre-processing
Aggregation of different data sets:
Data is currently distributed across several data sets, which will be combined to
create a single consolidated data matrix.
Filtration:
Filtration will consist of removing inconsistent data and data with
missing or invalid values.
Normalization and Variable Transformation:
Variable transformation will allow for more efficient data storage, while
normalization will allow for computation of similarity metrics and
comparison.
4.2 Data Analysis
This will involve computing basic statistical parameters for the attributes, such as
mean and variance, as well as similarity metrics, such as the covariance matrix and
correlation coefficients. Distribution plots, e.g. histograms, will also be used to aid in
understanding the type and distribution of the data.
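The statistics named above are simple enough to compute in one pass per column. The sketch below is illustrative only: the two numeric columns a and b and the titles list are hypothetical stand-ins, and the sample (n - 1) form of the covariance is an assumption, since the proposal does not say which estimator will be used.

```python
from collections import Counter
from math import sqrt


def mean(xs):
    return sum(xs) / len(xs)


def covariance(xs, ys):
    """Sample covariance (divides by n - 1); covariance(xs, xs) is the variance."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)


def correlation(xs, ys):
    """Pearson correlation coefficient, always in [-1, 1]."""
    return covariance(xs, ys) / sqrt(covariance(xs, xs) * covariance(ys, ys))


# Hypothetical numeric columns; b is an exact multiple of a.
a = [1.0, 2.0, 3.0, 4.0]
b = [2.0, 4.0, 6.0, 8.0]
print(mean(a), covariance(a, a), correlation(a, b))

# A histogram of a categorical column is just a count per distinct value.
titles = ["Engineer", "Manager", "Engineer", "Engineer"]
print(Counter(titles).most_common())
```

A correlation close to +1 or -1 between two attributes is one signal that they may be redundant.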
4.3 Dimensionality Reduction
Principal Component Analysis (PCA) and Singular Value Decomposition (SVD) will
be used to investigate the principal components and the possibility of dimensionality
reduction. The high space/time complexity O(m^2) may prove to be a challenge given
the huge size of the data set.
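For reference, the standard relations behind PCA and its link to the SVD can be written out as follows. The symbols are assumptions introduced here for illustration: X is the centered data matrix with m rows and n feature columns, and k is the number of components retained.

```latex
% Covariance matrix of the centered m-by-n data matrix X
C = \frac{1}{m-1} X^{\top} X

% Eigen-decomposition: principal directions v_j with variances \lambda_j
C v_j = \lambda_j v_j, \qquad \lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_n \ge 0

% Equivalent route through the SVD of X
X = U \Sigma V^{\top}, \qquad \lambda_j = \frac{\sigma_j^{2}}{m-1}

% Projection onto the first k principal directions V_k = [v_1, \dots, v_k]
Z = X V_k

% Fraction of the total variance retained by k components
\frac{\sum_{j=1}^{k} \lambda_j}{\sum_{j=1}^{n} \lambda_j}
```

Dimensionality reduction is worthwhile only if this retained fraction is already close to 1 for some k noticeably smaller than n.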
4.4 Classification
Most standard classifiers are designed for binary classification, whereas the present
case is a multiclass problem. Two standard approaches are available [3], both of
which essentially convert the single multiclass problem into multiple binary
classification problems.
(i) 1-r method (one against rest)
A separate data set is created for each class value yi by treating all yi
values as 1 and all non-yi values as 0. This creates K binary
classification problems, where K is the number of classes. The advantage
of this approach is that it allows prediction of membership in each
individual class value, as required by the problem description. However,
the computational complexity is increased by a factor of K, which in the
present case is substantial (~10^5).
Another challenge with this approach is the class imbalance problem [4].
Since only a few rows will have a positive value for the class yi, the model
accuracy will be biased by the large number of negative values. This
will require alternative metrics to assess model performance, or
sampling-based approaches.
(ii) 1-1 method (one against one)
Each possible pair of class values is used to construct a data set, which
is used to train a binary classifier. Since this approach rejects all rows
that do not belong to either class, it creates a much smaller data set in
each case. This reduction in computational complexity is partially offset
because the number of data sets, and consequently the number of binary
classifiers, is now K(K-1)/2. Another advantage of this approach is that it
takes care of the class imbalance problem, as in general class values are
expected to be uniformly distributed.
However, the major disadvantage, which makes this approach unsuitable
for the present case, is that it is not possible to make a direct prediction
with respect to a single class value. All binary classifiers will need to be
evaluated and the results combined by some voting
mechanism for every single prediction. This may make the prediction
process extremely slow and unsuitable for practical use.
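The two methods above can be made concrete with a small sketch. Everything in it is illustrative: the resource IDs "R42" and "R7", the 1000-employee population and the helper names are hypothetical. It builds the one-against-rest labels for a single resource, shows why plain accuracy is misleading under class imbalance, and compares the number of binary classifiers each method requires.

```python
def one_vs_rest_labels(resource_sets, target):
    """Label 1 if the employee's resource set contains `target`, else 0."""
    return [1 if target in s else 0 for s in resource_sets]


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred):
    """Fraction of true positives that the classifier actually found."""
    positives = sum(y_true)
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    return hits / positives


def classifier_counts(k):
    """Number of binary classifiers for (one-against-rest, one-against-one)."""
    return k, k * (k - 1) // 2


# Hypothetical population: 5 of 1000 employees have access to resource "R42".
resource_sets = [{"R42"} if i < 5 else {"R7"} for i in range(1000)]
y = one_vs_rest_labels(resource_sets, "R42")

# A useless classifier that always answers "no" still looks accurate.
always_no = [0] * len(y)
print(accuracy(y, always_no))  # 0.995
print(recall(y, always_no))    # 0.0

print(classifier_counts(10**5))  # (100000, 4999950000)
```

The recall of zero is the kind of alternative metric that exposes what the 99.5% accuracy hides, and the pair count illustrates why evaluating every one-against-one classifier per prediction is impractical at this scale.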
Since K different binary classifiers are being trained now, the dimensionality
reduction may need to be re-evaluated in each individual case. However, given the
large computational requirements this may not be feasible.
4.5 Classifiers
The results of the data analysis will be the primary guide in choosing the appropriate
classification approach. A preliminary list of candidate classifiers is given below.
4.5.1 Naïve Bayes
This is the simplest classifier and the easiest to implement. Despite their simplicity,
Bayes classifiers perform reasonably well in practice [5-7] and are particularly
attractive for large datasets [6]. Given the large size of the problem, this may be the
only feasible method. In any case, this classifier could serve as a benchmark to assess
the performance of more sophisticated classifier models.
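As an indication of how little machinery such a benchmark needs, here is a minimal sketch of a Naïve Bayes classifier for categorical features. It is not the project's implementation: the class name, the toy rows and the use of add-one (Laplace) smoothing are all assumptions made for illustration. Each binary one-against-rest problem would train one such model.

```python
from collections import Counter, defaultdict
from math import log


class CategoricalNaiveBayes:
    """Naive Bayes over categorical features with add-one smoothing."""

    def fit(self, X, y):
        self.n = len(y)
        self.classes = sorted(set(y))
        self.class_counts = Counter(y)
        n_features = len(X[0])
        # counts[j][c][v]: training rows of class c with value v in feature j
        self.counts = [defaultdict(Counter) for _ in range(n_features)]
        self.values = [set() for _ in range(n_features)]
        for row, c in zip(X, y):
            for j, v in enumerate(row):
                self.counts[j][c][v] += 1
                self.values[j].add(v)
        return self

    def log_posterior(self, row, c):
        """Unnormalized log P(c | row) under the independence assumption."""
        lp = log(self.class_counts[c] / self.n)
        for j, v in enumerate(row):
            num = self.counts[j][c][v] + 1  # add-one smoothing
            den = self.class_counts[c] + len(self.values[j])
            lp += log(num / den)
        return lp

    def predict(self, row):
        return max(self.classes, key=lambda c: self.log_posterior(row, c))


# Toy rows of (role family, department); label 1 means access was granted.
X = [("eng", "retail"), ("eng", "retail"), ("eng", "books"),
     ("mgr", "retail"), ("mgr", "books"), ("mgr", "books")]
y = [1, 1, 1, 0, 0, 0]

model = CategoricalNaiveBayes().fit(X, y)
print(model.predict(("eng", "retail")))  # 1
print(model.predict(("mgr", "books")))   # 0
```

Training is a single counting pass over the rows, which is what makes the method attractive at this data size.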
4.5.2 Decision Trees
Decision trees usually perform well for large data sets, are computationally efficient
and generally outperform the Naïve Bayes classifier [6]. However, decision tree
training algorithms are relatively non-trivial to implement and may necessitate the use
of a standard library or software package.
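The core step of decision tree training is choosing which feature to split on. The proposal does not name a split criterion; the sketch below assumes entropy-based information gain, one common choice, and uses made-up feature values and labels. A full training algorithm would apply this selection recursively and add stopping and pruning rules, which is where most of the implementation effort lies.

```python
from collections import Counter, defaultdict
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels):
    """Entropy reduction obtained by splitting the rows on a categorical feature."""
    groups = defaultdict(list)
    for v, y in zip(values, labels):
        groups[v].append(y)
    n = len(labels)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Made-up data: feature_a separates the classes perfectly, feature_b not at all.
labels = [1, 1, 0, 0]
feature_a = ["x", "x", "y", "y"]
feature_b = ["p", "q", "p", "q"]

print(information_gain(feature_a, labels))
print(information_gain(feature_b, labels))
```

The feature with the larger gain (feature_a here) would be chosen for the split at this node.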
4.5.3 Advanced Classifiers
Depending on the results of preliminary data analysis and the performance of the above
classifiers, other advanced classification methods such as SVM and ensemble
methods may be investigated.
4.6 Model Comparison and Accuracy
K-fold Cross validation will be used to assess and compare the performance of each
classifier model [8]. The performance assessment will be used to guide further
modification to classifier models or investigate new models.
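The fold construction behind this evaluation can be sketched as follows. The function name k_fold_indices and the parameter n_folds are hypothetical; n_folds is used instead of K to avoid confusion with the number of classes. Shuffling with a fixed seed is an assumption made so that every classifier is compared on the same splits.

```python
import random


def k_fold_indices(n, n_folds, seed=0):
    """Yield (train, test) index lists; each row is in exactly one test fold."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)  # fixed seed: same folds for every model
    folds = [idx[i::n_folds] for i in range(n_folds)]
    for i in range(n_folds):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


splits = list(k_fold_indices(10, 5))
for train, test in splits:
    print(sorted(test), len(train))
```

Each classifier is trained on the train indices and scored on the held-out test indices; the scores are then averaged over the folds. With strongly imbalanced binary problems it may also be worth keeping the positive fraction similar across folds, which this sketch does not attempt.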
5. Proposed distribution of work and schedule
Initial data analysis to identify important features of the data set (2 weeks)
- combining resource access information (Mohid)
- filtering data to ensure consistency (Purusharth)
- variable transformation (Mohid)
- normalization (Purusharth)
PCA/Eigenanalysis (Purusharth and Mohid) (1 week)
Dividing problem into K binary classification problems (Purusharth and Mohid) (1 week)
Identifying and applying appropriate classification models (Purusharth and Mohid) (4 weeks)
Validation of model performance (Mohid and Purusharth) (2 weeks)
6. References
1. Weifeng Liu, V.D.C., Catherine Huang, Kenneth E. Hild, Ken Montanez. The
Eighth Annual MLSP Competition. 2012; Available from:
https://sites.google.com/site/amazonaccessdatacompetition/.
2. Hall, M., et al., The WEKA data mining software: an update. SIGKDD Explor.
Newsl., 2009. 11(1): p. 10-18.
3. Tan, P.-N., M. Steinbach, and V. Kumar, Introduction to data mining. 1st ed.
2006, Boston: Pearson Addison Wesley. xxi, 769 p.
4. Japkowicz, N. and S. Stephen, The class imbalance problem: A systematic study.
Intell. Data Anal., 2002. 6(5): p. 429-449.
5. Friedman, N., D. Geiger, and M. Goldszmidt, Bayesian Network Classifiers.
Mach. Learn., 1997. 29(2-3): p. 131-163.
6. Wu, X., et al., Top 10 algorithms in data mining. Knowl. Inf. Syst., 2007. 14(1):
p. 1-37.
7. Rish, I., An Empirical Study of the Naive Bayes Classifier, in IJCAI-01 Workshop
on Empirical Methods in AI. 2001.
8. Dietterich, T.G., Approximate statistical tests for comparing supervised
classification learning algorithms. Neural Comput., 1998. 10(7): p. 1895-1923.