Data Mining

Overview of XLMiner
Activate XLMiner
1. Start Excel
2. Click File/Options/Add-Ins
3. Select COM Add-ins and click Go as shown below.
4. Check Analytic Solver Platform Addin
XLMiner Ribbon
The XLMiner ribbon is broken up into five different segments as shown in the screenshot
below.
Get Data
Use the Get Data button to draw a random sample of data, or summarize data, from (i) an Excel
worksheet, (ii) the PowerPivot “spreadsheet data model” which can hold 10 to 100 million rows
of data in Excel, (iii) an external SQL database such as Oracle, DB2 or SQL Server, or (iv) a
dataset with up to billions of rows, stored across many hard disks in an external compute cluster
running Apache Spark (https://spark.apache.org/), using the newly added Big Data feature.
XLMiner offers several methods for importing your data, including sampling from a
worksheet or database and importing from a file folder.
Worksheet
Take a representative sample from a dataset in an Excel workbook.
Database
Take a representative sample from a dataset in an Oracle, SQL Server, MS Access, or Power Pivot database.
File Folder
Import and/or sample from a collection of text documents for use with Text Mining.
Big Data
Sample from or summarize a dataset with up to billions of rows, stored across many
hard disks in an external compute cluster running Apache Spark.
Data Analysis
There are four task groups within Data Analysis.
Explore
Chart Wizard
XLMiner includes eight chart types for creating one or more charts of your data: bar
charts, line charts, scatterplots, boxplots, histograms, parallel coordinates charts,
scatterplot matrix charts, and variable charts.
Feature Selection
Helps to give insight into which variables are the most important or relevant for inclusion
in your classification or prediction model using various types of statistics and data
analysis measures.
Existing Charts
Edit or view previously created charts.
Transform
Missing Data Handling
Routines for dealing with missing values, allowing the user either to delete the full
record or to substitute a value of his or her choice.
Bin Continuous Data
Routine for binning continuous data for use with prediction and classification methods
that do not support continuous data. Continuous variables can be binned using several
different user-specified options.
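The text does not list the binning options, so the following is only a minimal sketch of one common choice, equal-width binning, written in plain Python rather than XLMiner. The function name equal_width_bins and its parameters are hypothetical.

```python
def equal_width_bins(values, n_bins):
    """Assign each value a bin index 0..n_bins-1 using equal-width intervals.

    Assumption: equal-width binning is just one possible option; the
    source does not say which options XLMiner actually offers.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        # All values are identical, so everything falls in the first bin.
        return [0] * len(values)
    # The maximum value would land in bin n_bins, so clamp it to the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


print(equal_width_bins([0, 2.5, 5, 7.5, 10], n_bins=4))  # [0, 1, 2, 3, 3]
```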
Transform Categorical Data
Create Dummies:
Converts a categorical variable with k categories (up to 30 distinct values) into
(k-1) dummy variables.
Create Category Scores:
Assigns a numerical score to each category of a categorical variable.
Reduce Categories:
Reduces the number of categories when a categorical variable has more than 30 categories.
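To make the (k-1) dummy-variable idea concrete, here is a small sketch in plain Python. It is not XLMiner's implementation: the function name make_dummies is hypothetical, and using the first category in sorted order as the reference level is an assumption made for illustration.

```python
def make_dummies(values, max_categories=30):
    """Encode a categorical column as k-1 indicator (dummy) columns.

    Assumption: the first category in sorted order is the reference level
    and gets no column of its own. The cap of 30 mirrors the limit
    mentioned in the text.
    """
    categories = sorted(set(values))
    if len(categories) > max_categories:
        raise ValueError("too many categories; reduce them first")
    kept = categories[1:]  # drop the reference level
    rows = [[1 if v == c else 0 for c in kept] for v in values]
    return kept, rows


kept, rows = make_dummies(["red", "green", "blue", "green"])
print(kept)  # ['green', 'red']
print(rows)  # [[0, 1], [1, 0], [0, 0], [1, 0]]
```

A record in the reference category ("blue" here) is represented by all zeros, which is why k categories need only k-1 columns.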
Principal Components
Analysis to remove highly correlated or superfluous variables from large databases.
Cluster
K-Means Clustering
Clusters observations into K clusters so as to maximize within-cluster similarity and
between-cluster dissimilarity.
Hierarchical Clustering
Clusters observations into K clusters so as to maximize within-cluster similarity and
between-cluster dissimilarity, using the hierarchical method.
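As an illustration of the K-means idea, the sketch below implements the standard Lloyd iteration in plain Python. It is not XLMiner's code: the function name k_means, the fixed iteration count, and passing the starting centroids explicitly are all assumptions made to keep the example deterministic.

```python
import math


def k_means(points, centroids, iterations=10):
    """Lloyd's algorithm: alternate an assignment step and an update step."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for j in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(
                    sum(axis) / len(members) for axis in zip(*members)
                )
    return labels, centroids


points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels, centroids = k_means(points, centroids=[points[0], points[3]])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

In practice the result depends on the starting centroids, so implementations usually try several random starts.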
Text
Text Miner tool to analyze a collection of text documents for patterns and trends.
Time Series Analysis
Partition
Partitions data into training and validation sets.
ARIMA
XLMiner features two techniques for exploring trends in a dataset: ACF (autocorrelation
function) and PACF (partial autocorrelation function). These techniques help the user
explore patterns in the data that can be used in building the model.
XLMiner also supports the analysis and forecasting of datasets whose observations are
generated sequentially, for example to predict next year’s sales figures or monthly
airline bookings, through partitioning, autocorrelations, ARIMA models, and smoothing
techniques.
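For readers unfamiliar with the ACF, the sketch below computes the usual sample autocorrelation in plain Python. The function name acf is hypothetical and XLMiner's exact estimator is not described in the text, so treat this as the textbook formula rather than the tool's implementation. PACF is omitted because it requires an additional recursion.

```python
def acf(series, max_lag):
    """Sample autocorrelation for lags 0..max_lag.

    r(lag) = sum_t (x[t] - mean) * (x[t + lag] - mean) / sum_t (x[t] - mean)^2
    """
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]
    denom = sum(d * d for d in dev)
    return [
        sum(dev[t] * dev[t + lag] for t in range(n - lag)) / denom
        for lag in range(max_lag + 1)
    ]


# A series that repeats every 4 observations: the lag-4 autocorrelation is
# strongly positive, while the lag-2 autocorrelation is strongly negative.
seasonal = [10, 20, 30, 20] * 3
for lag, r in enumerate(acf(seasonal, max_lag=4)):
    print(lag, round(r, 3))
```

A spike at a particular lag, such as lag 4 here, is the kind of pattern that suggests a seasonal period to account for in the model.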
Smoothing
XLMiner offers four smoothing techniques: Exponential, Moving Average, Double
Exponential, and Holt-Winters. The first two, Exponential and Moving Average, are
relatively simple smoothing techniques and should not be used on datasets involving
seasonality. The last two are more advanced techniques that can be used on datasets
involving seasonality.
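The two simpler techniques can be sketched in a few lines of plain Python. The function names and the choice to initialise the exponential smoother with the first observation are assumptions for illustration; XLMiner's own options are not specified in the text.

```python
def moving_average(series, window):
    """Trailing moving average; the first window-1 points produce no value."""
    return [
        sum(series[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(series))
    ]


def exponential_smoothing(series, alpha):
    """Simple exponential smoothing: s[t] = alpha * x[t] + (1 - alpha) * s[t-1]."""
    smoothed = [series[0]]  # assumption: start from the first observation
    for x in series[1:]:
        smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed


print(moving_average([1, 2, 3, 4, 5], window=3))        # [2.0, 3.0, 4.0]
print(exponential_smoothing([10, 20, 30], alpha=0.5))   # [10, 15.0, 22.5]
```

Both methods average past values with no separate seasonal term, which is why they are not recommended for seasonal data.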
Data Mining
Partition
Partitions data into training, validation, and if desired, test sets.
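A random partition of this kind might look as follows in plain Python. The function name partition, the 60/30/10 split, and the seed value are illustrative assumptions, not XLMiner defaults.

```python
import random


def partition(records, train=0.6, validation=0.3, seed=12345):
    """Randomly split records into training, validation, and test sets.

    Whatever is left after the training and validation shares becomes the
    test set. Fixing the seed makes the split reproducible.
    """
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * validation)
    return (
        shuffled[:n_train],
        shuffled[n_train : n_train + n_val],
        shuffled[n_train + n_val :],
    )


train_set, validation_set, test_set = partition(range(100))
print(len(train_set), len(validation_set), len(test_set))  # 60 30 10
```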
Classify
Discriminant Analysis:
Constructs a set of linear functions of the predictor variables and uses these functions to
predict the class of a new observation with an unknown class. Common uses of this
method include: classifying loan, credit card or insurance applicants into low or high
risk categories, classifying student applications for college entrance, classifying cancer
patients into clinical studies, etc.
Logistic Regression:
A variant of ordinary regression which is used to predict the response variable, or the
output variable, when the response variable is a dichotomous variable (a variable that
takes only two values such as yes/no, success/failure, survive/die, etc.).
k-Nearest Neighbors:
This classification method divides a training dataset into groups of k observations using a
Euclidean distance measure to determine similarity between “neighbors”. These
classification groups are used to assign categories to each member of the validation
set.
Classification Tree:
Also known as Decision Trees, this classification method is a good choice when the goal is
to generate easily understood and explained “rules” that can be translated into SQL or
another query language.
Naive Bayes:
This classification method first scans the training dataset and finds all records where the
predictor values are equal. The most prevalent class of that group is then determined and
assigned to the entire collection of observations. If a new observation’s predictor
values equal those of this group, the new observation is assigned to this class. Because
the method is so simple, a large number of records is required to obtain accurate
results.
Neural Network:
Artificial neural networks are based on the operation and structure of the human brain.
These networks process one record at a time and “learn” by comparing their
classification of the record (which at the beginning is largely arbitrary) with the known
actual classification of the record. Errors from the initial classification of the first records
are fed back into the network and used to modify the network’s algorithm the second time
around. This continues for many, many iterations.
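Of the classifiers listed above, k-Nearest Neighbors is the easiest to show in a few lines. The sketch below is the standard textbook formulation (find the k training records closest to a new record by Euclidean distance and take a majority vote), not XLMiner's code; the function name knn_classify and the toy data are hypothetical.

```python
import math
from collections import Counter


def knn_classify(train_x, train_y, query, k=3):
    """Classify query by majority vote among its k nearest training records."""
    # Rank training records by Euclidean distance to the query.
    nearest = sorted(
        range(len(train_x)), key=lambda i: math.dist(train_x[i], query)
    )[:k]
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Toy data: two predictors per applicant, with a known risk class.
train_x = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
train_y = ["low", "low", "low", "high", "high", "high"]

print(knn_classify(train_x, train_y, query=(2, 2)))  # low
print(knn_classify(train_x, train_y, query=(7, 7)))  # high
```

Because distances depend on the scale of each predictor, variables are usually standardised before applying the method.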
Predict
Multiple Linear Regression:
This method predicts a response variable from one or more predictor variables, or is
used to study the relationship between the response and the predictor
variables.
k-Nearest Neighbors:
Like the classification method with the same name above, this prediction method divides
a training dataset into groups of k observations using a Euclidean Distance measure to
determine similarity between “neighbors”. These groups are used to predict the value of
the response for each member of the validation set.
Regression Trees:
A regression tree may be considered a variant of a decision tree, designed to
approximate real-valued functions rather than to perform classification. As
with all regression techniques, XLMiner assumes the existence of a single output
(response) variable and one or more input (predictor) variables. The output variable is
numerical.
Neural Network:
Neural networks process one record at a time and “learn” by comparing their prediction
for the record (which at the beginning is largely arbitrary) with the known actual value of
the response variable. Errors from the initial prediction of the first records are fed back
into the network and used to modify the network’s algorithm the second time around.
This continues for many, many iterations.
Associate
The goal of association rule mining is to recognize associations and/or correlations
among large sets of data items. A typical and widely used example of association rule
mining is Market Basket Analysis.
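The text does not say how rules are scored, so the sketch below uses the two standard measures from the association-rule literature, support and confidence, on a toy market-basket dataset. The function names and the baskets are illustrative assumptions, and this is not XLMiner's implementation.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in items."""
    items = set(items)
    return sum(items <= set(t) for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Of the transactions containing antecedent, the share that also
    contain consequent: support(A and C) / support(A)."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

print(support(baskets, {"bread", "milk"}))                 # 0.5
print(round(confidence(baskets, {"bread"}, {"milk"}), 3))  # 0.667
```

Here the rule "bread implies milk" holds in two of the three baskets that contain bread.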
Tools
Score:
Use the Score icon to score new data in a database or worksheet with any of the Classification
or Prediction algorithms. This facility matches the input variables to the database (or
worksheet) fields and then performs the scoring on the database (or worksheet).
Help:
Click the Help icon to enter a new license or activation code, open an example dataset
(over 25 example datasets are provided and most are used in the examples throughout
this guide), open the online help, open this guide, or check for updates.