CMSC 691 Project Proposal
Purusharth Prakash and Mohid Farazi
24-Feb-12

1. Objective

Build a classification model for predicting employee resource access based on historical data for Amazon employees.

2. Summary

Amazon has made available historical data on resource access by a group of employees over a period of one year as part of its MLSP Competition [1]. The primary goal of the project is to build a classification model that uses employee characteristics as features to predict the set of resources an employee requires access to. The challenging aspects of the problem are (i) multiclass prediction and (ii) the large dataset.

Initial data analysis will involve (i) combining resource-access information distributed across multiple tables into a single dataset, (ii) filtering the data to ensure consistency, (iii) variable transformation for a reduced memory footprint, and (iv) normalization to enable estimation of similarity metrics and comparison. Statistical parameters and data visualization techniques such as histograms will be used to understand the characteristics and distribution of the data. These analysis steps will help identify the important features in the dataset. Eigenanalysis of the covariance matrix, such as principal component analysis (PCA), will be used to identify the important features and, if possible, to reduce the dimensionality.

Since this is a multiclass problem, to use standard classification methods the problem will be divided into K binary classification problems (where K is the number of classes). This substantially increases the computational complexity, as K ~ 10^5. The sub-division also creates a class imbalance problem, which must be addressed during classifier training. Based on the results of the data analysis, an appropriate classification model will be chosen.
Naïve Bayes and decision-tree classifiers are attractive initial choices for large datasets such as this one due to their model simplicity and low computational requirements. Low prediction accuracy or the results of the data analysis may necessitate more advanced classification methods such as SVMs and ensemble methods. Classifier performance will be evaluated using K-fold cross-validation.

3. Problem Description

The employee data consists of the following features associated with each employee:

DATE               Data is time stamped once per week; a user will appear as changes are made to their profile
EMPLOYEE_ID        A unique ID for each employee
MGR_ID             The EMPLOYEE_ID of the manager of the current EMPLOYEE_ID record; an employee may have only one manager at a time
ROLE_ROLLUP_1      Company role grouping category id 1; e.g. US Engineering
ROLE_ROLLUP_2      Company role grouping category id 2
ROLE_DEPTNAME      Company role department description; e.g. Retail
ROLE_TITLE         Company role business title description; e.g. Senior Engineer Retail Manager
ROLE_FAMILY_DESC   Company role family extended description; e.g. Retail Manager, Software Engineering
ROLE_FAMILY        Company role family description; e.g. Retail Manager
ROLE_CODE          Company role code; this code is unique to each role, such as 'Manager'

At any given time each employee is allowed access to a certain set of resources identified by RESOURCE_ID. The objective is to use the employee features to predict the set of resources to which the employee should be allowed access. In particular, the predictor should provide a yes/no answer given a RESOURCE_ID and a set of employee features. A set of historical data is provided which contains employee details (features) and the corresponding resources for a period of one year.

3.1 Problem Characteristics

The problem clearly requires building a classifier trained and validated on the historical data provided.
However, the problem poses several unique challenges, given below, which necessitate careful analysis before attempting to build the model.

Large dataset: The employee dataset contains approximately 8 x 10^6 rows and 10 columns. This makes the data matrix too large to fit entirely in memory in its present form. Automated tools such as WEKA [2] have trouble loading and analyzing the entire dataset, as data is continuously swapped between memory and the hard drive. In addition to this constraint, large datasets often pose the problem of model overfitting, which must be addressed during classifier training.

Redundant features: Although the number of features is not high, there is still a possibility of redundant features; e.g., ROLE_FAMILY and ROLE_FAMILY_DESC may not be entirely independent.

Large number of possible classes: The number of possible class values is quite high (~10^5). In addition, each employee can be associated with multiple class values. This precludes a simple classification approach.

4. Approach

The proposed approach consists of the following steps.

4.1 Data Pre-processing

Aggregation of different datasets: Data is currently distributed across several datasets, which will be combined to create a single consolidated data matrix.

Filtration: Filtration will consist of removing inconsistent data or data with missing/invalid values.

Normalization and variable transformation: Variable transformation will allow for more efficient data storage, while normalization will allow for the computation of similarity metrics and comparison.

4.2 Data Analysis

This will involve computing basic statistical parameters for the attributes, such as mean and variance, as well as similarity metrics, such as the covariance matrix and correlation coefficients. Distribution plots, e.g., histograms, will also be used to aid in understanding the type and distribution of the data.
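The pre-processing steps above (aggregation, filtration, and variable transformation) can be sketched as follows. This is a minimal illustration on hypothetical toy records, not the actual Amazon tables; the field values and table layout are invented for the example.

```python
# Hypothetical raw tables keyed by EMPLOYEE_ID (toy data, not the real dataset)
employees = [
    {"EMPLOYEE_ID": 1, "ROLE_DEPTNAME": "Retail",      "ROLE_TITLE": "Senior Engineer"},
    {"EMPLOYEE_ID": 2, "ROLE_DEPTNAME": "Engineering", "ROLE_TITLE": None},  # missing value
    {"EMPLOYEE_ID": 3, "ROLE_DEPTNAME": "Retail",      "ROLE_TITLE": "Manager"},
]
access = [
    {"EMPLOYEE_ID": 1, "RESOURCE_ID": 4675},
    {"EMPLOYEE_ID": 3, "RESOURCE_ID": 4675},
    {"EMPLOYEE_ID": 3, "RESOURCE_ID": 25993},
]

# 1. Aggregation: join on EMPLOYEE_ID into one consolidated table
by_id = {e["EMPLOYEE_ID"]: e for e in employees}
rows = [{**by_id[a["EMPLOYEE_ID"]], **a} for a in access if a["EMPLOYEE_ID"] in by_id]

# 2. Filtration: drop rows with missing/invalid attribute values
rows = [r for r in rows if all(v is not None for v in r.values())]

# 3. Variable transformation: map each distinct string to a small integer
#    code, shrinking the memory footprint of categorical attributes
codes = {}
for r in rows:
    for key in ("ROLE_DEPTNAME", "ROLE_TITLE"):
        r[key] = codes.setdefault((key, r[key]), len(codes))

print(rows)
```

In practice the join and encoding would run out-of-core or in chunks, given that the full data matrix does not fit in memory.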
4.3 Dimensionality Reduction

Principal component analysis (PCA) and singular value decomposition (SVD) will be used to investigate the principal components and the possibility of dimensionality reduction. The high space/time complexity, O(m^2), may prove to be a challenge given the huge size of the dataset.

4.4 Classification

Most standard classifiers are designed for binary classification; however, the present case is a multiclass problem. Two standard approaches are available [3], both of which essentially convert the single multiclass problem into multiple binary classification problems.

(i) 1-r method (one against rest): A separate dataset is created for each class value y_i by treating all y_i values as 1 and all other values as 0. This creates K binary classification problems, where K is the number of classes. The advantage of this approach is that it allows prediction of membership in each individual class value, as required by the problem description. However, the computational complexity is increased by a factor of K, which in the present case is substantial (~10^5). Another challenge with this approach is the class imbalance problem [4]: since only a few rows will have a positive value for class y_i, the model accuracy will be biased by the large number of negative values. This will require alternative metrics to assess model performance, or sampling-based approaches.

(ii) 1-1 method (one against one): Each possible pair of class values is used to construct a dataset on which a binary classifier is trained. Since this approach rejects all rows that belong to neither class, it creates a much smaller dataset in each case. This reduction in computational complexity is partially offset because the number of datasets, and consequently the number of binary classifiers, is now K(K-1)/2. Another advantage of this approach is that it mitigates the class imbalance problem, since class values are in general expected to be uniformly distributed.
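The 1-r (one-against-rest) decomposition described above can be sketched on a toy dataset. The feature vectors and RESOURCE_ID values here are hypothetical; the point is that each of the K binary problems keeps all rows but relabels them, and that each one is heavily imbalanced toward the negative class.

```python
# Toy labelled data: (feature_vector, resource_id) -- invented values
data = [
    ((0, 1), 4675), ((0, 2), 4675),
    ((1, 1), 25993),
    ((1, 2), 75078), ((0, 0), 75078), ((1, 0), 75078),
]
classes = sorted({y for _, y in data})  # the K distinct RESOURCE_ID values

# Build K binary datasets: rows of class y_i become 1, all others become 0
binary_problems = {
    c: [(x, 1 if y == c else 0) for x, y in data]
    for c in classes
}

# Each binary problem keeps every row, but positives are scarce
for c, rows in binary_problems.items():
    positives = sum(label for _, label in rows)
    print(c, "positives:", positives, "negatives:", len(rows) - positives)
```

With the real data, K ~ 10^5 such datasets would be generated, which is why the class imbalance and the K-fold blow-up in training cost both matter.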
However, the major disadvantage, which makes this approach unsuitable for the present case, is that it is not possible to make a direct prediction with respect to a single class value. All binary classifiers must be evaluated and their results combined by some voting mechanism for every single prediction. This may make the prediction process extremely slow and unsuitable for practical use.

Since K different binary classifiers are now being trained, the dimensionality reduction may need to be re-evaluated for each individual problem. However, given the large computational requirements, this may not be feasible.

4.5 Classifiers

The results of the data analysis will primarily guide the choice of classification approach. A preliminary list of candidate classifiers is given below.

4.5.1 Naïve Bayes

This is the simplest classifier and the easiest to implement. Despite their simplicity, naïve Bayes classifiers perform reasonably well in practice [5-7] and are particularly attractive for large datasets [6]. Given the large size of the problem, this may be the only feasible method. In any case, this classifier can serve as a benchmark to assess the performance of more sophisticated classifier models.

4.5.2 Decision Trees

Decision trees usually perform well on large datasets, are computationally efficient, and usually perform better than naïve Bayes classifiers [6]. However, decision-tree training algorithms are relatively non-trivial to implement and may necessitate the use of a standard library or software package.

4.5.3 Advanced Classifiers

Depending on the results of the preliminary data analysis and the performance of the above classifiers, other advanced classification methods such as SVMs and ensemble methods may be investigated.

4.6 Model Comparison and Accuracy

K-fold cross-validation will be used to assess and compare the performance of each classifier model [8]. The performance assessment will be used to guide further modification of the classifier models or investigation of new models.

5.
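The K-fold cross-validation procedure of Section 4.6 can be sketched as follows. This is a minimal illustration with k = 4 on an invented binary-labelled dataset; the "model" is a stand-in majority-class predictor, not one of the classifiers proposed above.

```python
import random

random.seed(0)
# Hypothetical binary-labelled dataset: (feature, label) pairs
data = [(i, i % 3 == 0) for i in range(20)]
random.shuffle(data)

k = 4
folds = [data[i::k] for i in range(k)]  # k roughly equal folds

accuracies = []
for i in range(k):
    test = folds[i]                       # hold out fold i for evaluation
    train = [row for j, fold in enumerate(folds) if j != i for row in fold]
    # "Training": predict the majority label of the training split
    majority = sum(lbl for _, lbl in train) * 2 >= len(train)
    correct = sum(1 for _, lbl in test if lbl == majority)
    accuracies.append(correct / len(test))

print(sum(accuracies) / k)  # mean cross-validated accuracy
```

The same loop would be reused across classifier models so that each one is scored on identical train/test splits, making the comparison fair.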
Proposed Distribution of Work and Schedule

Initial data analysis to identify important features of the dataset (2 weeks):
- combining resource access information (Mohid)
- filtering data to ensure consistency (Purusharth)
- variable transformation (Mohid)
- normalization (Purusharth)

PCA/eigenanalysis (Purusharth and Mohid) (1 week)
Dividing the problem into K binary classification problems (Purusharth and Mohid) (1 week)
Identifying and applying appropriate classification models (Purusharth and Mohid) (4 weeks)
Validation of model performance (Mohid and Purusharth) (2 weeks)

6. References

1. Weifeng Liu, V.D.C., Catherine Huang, Kenneth E. Hild, Ken Montanez. The Eighth Annual MLSP Competition. 2012; Available from: https://sites.google.com/site/amazonaccessdatacompetition/.
2. Hall, M., et al., The WEKA data mining software: an update. SIGKDD Explor. Newsl., 2009. 11(1): p. 10-18.
3. Tan, P.-N., M. Steinbach, and V. Kumar, Introduction to data mining. 1st ed. 2006, Boston: Pearson Addison Wesley. xxi, 769 p.
4. Japkowicz, N. and S. Stephen, The class imbalance problem: A systematic study. Intell. Data Anal., 2002. 6(5): p. 429-449.
5. Friedman, N., D. Geiger, and M. Goldszmidt, Bayesian network classifiers. Mach. Learn., 1997. 29(2-3): p. 131-163.
6. Wu, X., et al., Top 10 algorithms in data mining. Knowl. Inf. Syst., 2007. 14(1): p. 1-37.
7. Rish, I., An empirical study of the naive Bayes classifier, in IJCAI-01 Workshop on Empirical Methods in AI. 2001.
8. Dietterich, T.G., Approximate statistical tests for comparing supervised classification learning algorithms. Neural Comput., 1998. 10(7): p. 1895-1923.