Data Mining

Overview of XLMiner

Activate XLMiner
1. Start Excel.
2. Click File/Options/Add-Ins.
3. Select COM Add-ins and click Go.
4. Check Analytic Solver Platform Addin.

XLMiner Ribbon
The XLMiner ribbon is divided into five segments: Get Data, Data Analysis, Time Series Analysis, Data Mining, and Tools.

Get Data
Use the Get Data button to draw a random sample of data, or to summarize data, from (i) an Excel worksheet, (ii) the PowerPivot “spreadsheet data model”, which can hold 10 to 100 million rows of data in Excel, (iii) an external SQL database such as Oracle, DB2 or SQL Server, or (iv) a dataset with up to billions of rows, stored across many hard disks in an external compute cluster running Apache Spark (https://spark.apache.org/), using the newly added Big Data feature. XLMiner offers several ways to bring in your data, including sampling from a worksheet or database and importing from a file folder.
Worksheet: Take a representative sample from a dataset in an Excel workbook.
Database: Take a representative sample from a dataset in an Oracle, SQL Server, MS Access, or PowerPivot database.
File Folder: Import and/or sample from a collection of text documents for use with Text Mining.
Big Data: Sample from or summarize a dataset with up to billions of rows, stored across many hard disks in an external compute cluster running Apache Spark.

Data Analysis
There are four task groups within Data Analysis: Explore, Transform, Cluster, and Text.

Explore
Chart Wizard: Create one or more charts of your data. XLMiner includes 8 chart types: bar charts, line charts, scatterplots, boxplots, histograms, parallel coordinates charts, scatterplot matrix charts, and variable charts.
Feature Selection: Gives insight into which variables are the most important or relevant for inclusion in your classification or prediction model, using various statistics and data analysis measures.
Existing Charts: Edit or view previously created charts.
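The idea behind Feature Selection can be illustrated with a small sketch. This is not XLMiner's own procedure; it assumes one simple measure, the absolute Pearson correlation between each candidate variable and the output variable, and the column names and values are hypothetical toy data.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)

def rank_features(columns, target):
    """Return (name, |r|) pairs, most strongly correlated with target first."""
    scores = [(name, abs(pearson(values, target)))
              for name, values in columns.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Hypothetical toy data: three candidate predictors and one output variable.
columns = {
    "ad_spend": [1.0, 2.0, 3.0, 4.0, 5.0],
    "store_id": [3.0, 1.0, 4.0, 1.0, 5.0],
    "constant": [7.0, 7.0, 7.0, 7.0, 7.0],
}
sales = [2.1, 3.9, 6.2, 8.1, 9.8]

ranking = rank_features(columns, sales)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Variables near the top of the ranking are stronger candidates for the model. A correlation measure only captures linear relationships, so it is one of several measures worth checking.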

Transform
Missing Data Handling: Routines for dealing with missing values, allowing the user either to delete the full record or to substitute a value of their choice.
Bin Continuous Data: Routine for binning continuous data for use with prediction and classification methods that do not support continuous data. Continuous variables can be binned using several user-specified options.
Transform Categorical Data:
Create Dummies: Converts a categorical variable with k categories (up to 30 distinct values) into k-1 dummy variables.
Create Category Scores: Assigns a numerical score to each category of a categorical variable.
Reduce Categories: Reduces the number of categories when a categorical variable has more than 30 categories.
Principal Components: Principal components analysis removes highly correlated or superfluous variables from large databases.

Cluster
K-Means Clustering: Clusters observations into K clusters so as to maximize similarity within clusters and dissimilarity between clusters.
Hierarchical Clustering: Clusters observations into K clusters so as to maximize similarity within clusters and dissimilarity between clusters, using the hierarchical method.

Text
Text Miner: Tool to analyze a collection of text documents for patterns and trends.

Time Series Analysis
Partition: Partitions data into training and validation sets.
ARIMA: XLMiner features two techniques for exploring trends in a dataset, ACF (autocorrelation function) and PACF (partial autocorrelation function). These techniques help the user explore patterns in the data that can be used in building the model. XLMiner also supports the analysis and forecasting of datasets that contain sequentially generated observations, such as predicting next year’s sales figures or monthly airline bookings, through partitioning, autocorrelations, ARIMA models, and smoothing techniques.
Smoothing: XLMiner offers four smoothing techniques: Exponential, Moving Average, Double Exponential, and Holt-Winters.
The first two techniques, Exponential and Moving Average, are relatively simple and should not be used on datasets involving seasonality. The last two are more advanced: Double Exponential accounts for a trend in the data, and Holt-Winters can also be used on datasets involving seasonality.

Data Mining
Partition: Partitions data into training, validation, and, if desired, test sets.

Classify
Discriminant Analysis: Constructs a set of linear functions of the predictor variables and uses these functions to predict the class of a new observation whose class is unknown. Common uses include classifying loan, credit card or insurance applicants into low- or high-risk categories, classifying student applications for college entrance, and classifying cancer patients into clinical studies.
Logistic Regression: A variant of ordinary regression used to predict the response (output) variable when it is dichotomous, that is, when it takes only two values such as yes/no, success/failure, or survive/die.
k-Nearest Neighbors: For each record to be classified, this method finds the k training observations closest to it, using a Euclidean distance measure to determine similarity between “neighbors”. The classes of these neighbors are used to assign a category to each member of the validation set.
Classification Tree: Also known as a decision tree, this method is a good choice when the goal is to generate easily understood and explained “rules” that can be translated into SQL or another query language.
Naive Bayes: This classification method first scans the training dataset and finds all records whose predictor values are equal. The most prevalent class in that group is then determined and assigned to the entire group. If a new observation’s predictor values equal those of the group, the new observation is assigned to this class.
Due to the simplicity of this method, a large number of records is required to obtain accuracy.
Neural Network: Artificial neural networks are based on the operation and structure of the human brain. These networks process one record at a time and “learn” by comparing their classification of the record (which at the beginning is largely arbitrary) with the known actual classification of the record. Errors from the initial classification of the first records are fed back into the network and used to modify the network’s algorithm the second time around. This continues for many iterations.

Predict
Multiple Linear Regression: This method predicts the response variable from one or more predictor variables, or is used to study the relationship between the response and the predictor variables.
k-Nearest Neighbors: Like the classification method of the same name above, this prediction method finds the k training observations closest to each record, using a Euclidean distance measure to determine similarity between “neighbors”. The responses of these neighbors are used to predict the value of the response for each member of the validation set.
Regression Trees: A regression tree may be considered a variant of a decision tree, designed to approximate real-valued functions rather than to classify. As with all regression techniques, XLMiner assumes a single numerical output (response) variable and one or more input (predictor) variables.
Neural Network: Neural networks process one record at a time and “learn” by comparing their prediction for the record (which at the beginning is largely arbitrary) with the known actual value of the response variable. Errors from the initial prediction of the first records are fed back into the network and used to modify the network’s algorithm the second time around. This continues for many iterations.
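As an illustration of k-Nearest Neighbors prediction, here is a minimal sketch. It assumes the usual formulation (find the k training records closest to a new record by Euclidean distance and average their responses) and is not XLMiner's implementation. The function names and the toy data are hypothetical.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two records of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_X, train_y, query, k):
    """Predict the response for `query` as the mean response of its k nearest neighbors."""
    # Pair each training record's distance to the query with its response,
    # then sort so the closest records come first.
    distances = sorted(
        (euclidean(row, query), y) for row, y in zip(train_X, train_y)
    )
    nearest = distances[:k]
    return sum(y for _, y in nearest) / k

# Hypothetical training data: two predictors per record, one numerical response.
train_X = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0), (8.0, 8.0)]
train_y = [10.0, 20.0, 30.0, 80.0]

prediction = knn_predict(train_X, train_y, (2.1, 2.1), k=3)
print(prediction)
```

For classification, the same neighbor search applies, with the average replaced by a majority vote over the neighbors' classes. In practice the predictors are usually rescaled first so that no single variable dominates the distance.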

Associate
The goal of association rule mining is to recognize associations and/or correlations among large sets of data items. A typical and widely used example of association rule mining is Market Basket Analysis.

Tools
Score: Use the Score icon to score new data in a database or worksheet with any of the Classification or Prediction algorithms. This facility matches the input variables to the database (or worksheet) fields and then performs the scoring on the database (or worksheet).
Help: Click the Help icon to enter a new license or activation code, open an example dataset (over 25 example datasets are provided, and most are used in the examples throughout this guide), open the online help, open this guide, or check for updates.
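
To make the Market Basket Analysis idea from the Associate group concrete, the sketch below computes the two standard measures of an association rule, support and confidence. The baskets are hypothetical toy data and the function names are illustrative, not part of XLMiner.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for basket in transactions if itemset <= set(basket))
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimated probability of `consequent` given `antecedent` for the rule antecedent -> consequent."""
    joint = support(transactions, set(antecedent) | set(consequent))
    base = support(transactions, antecedent)
    return joint / base if base else 0.0

# Hypothetical market baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

rule_support = support(baskets, {"bread", "milk"})
rule_confidence = confidence(baskets, {"bread"}, {"milk"})
print(rule_support, rule_confidence)
```

A rule such as "bread implies milk" is typically kept only when both measures exceed user-chosen thresholds.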
