Data Mining

advertisement
Excel 2007 Data Mining and Analysis Overview
Excel 2007 Data Mining and Analysis tools are an add-in that can be used with Excel. Once the add-in is installed, you will have
2 new tabs in Excel: Data Mining and Analyze. This document is divided into 2 sections; the first consists of the Data Mining
options, the second consists of the Analyze options.
Tables
In order to use these tabs, your worksheet needs to be defined as a Table. To do this, follow these basic steps:
1.
2.
3.
4.
5.
6.
7.
8.
9.
Use filtering to limit your spreadsheet to the rows and columns you’d like to analyze.
Press Ctrl-A to select the entire worksheet
Press Ctrl-C to copy the data that you’ve selected
Move to a new worksheet
Put your mouse in cell A1
Press Ctrl-V to paste the data you’ve selected into this new worksheet
Click on the ‘Insert’ tab, and click on the ‘Table’ button
Make sure that the ‘Heading’ box is checked
Click on ‘OK’
You should now have a new worksheet that consists of the data you’ll be mining and analyzing, and that data is in Table format.
Section 1: Data Mining Tab
Data Preparation Options
Selecting the right attributes from the source data and getting them into the right format for mining typically takes up a large
percentage of the time in a typical data mining project. This section provides tools that address common data preparation needs
for data mining:
·
Explore Data - This option will create a simple histogram of a column of data. It will give you an idea of what your data
looks like before you start to mine and analyze it. Below is a sample of gpa spreads.
·
Clean Data – There are two options: Remove outliers and Re-label. Remove Outliers allows you to remove uncommon
data from a column in your spreadsheet. This can help improve the accuracy of your analysis. Re-label changes discrete
values for a column of data. For example, your source data may contain “M” and “F” as states for the Gender column and
you prefer to use “Male” and “Female” for clarity when presenting the model results. Or if your data contains both ‘m’ and
‘M’, you can change them both to “Male”.
Sample Data – This option creates a reduced or balanced subset of data from your spreadsheet for building your mining and
analysis structures.
·
Partition Data - Split the source data into training and test sets, take a random sample of the input data or perform
oversampling to adjust for skewed distributions.
Data Modeling Options
This section covers the actual model definition and processing. It provides wizards that help you easily build common types of
mining models without worrying about the actual mining algorithms and associated parameters supported on the server.
Also included in this section are advanced options that allow the user to pick the exact mining algorithm and tweak additional
parameters.
Classify Data – Build a model based on existing data from one of your Excel tables. This option will select patterns
that predict values in a target column based on values in other columns.
Estimate – This wizard leads you through the steps to build an estimation model based on your Excel table. The
model uses patters to predict the continuous values of one column based on other columns.
Cluster – This builds a clustering model based on your data. It looks for groups of rows that share similar
characteristics and groups them together.
Associate – An Association Rules model finds associations between appear together in multiple rows.
Forecast – Data Mining Forecasting will find patterns in a series of Excel cells, and will use these patterns to predict
the evolution of that series of cells.
Section 2: Analyze Tab
Analyze Key Influencers
This tool analyzes the patterns in data that have the strongest influence on a certain outcome. For example, if you have
a worksheet of prospects, Analyze Key Influencers can analyze the factors that are key to determining which students
will have the highest gpa. It may not be obvious which fields are the key influencers. For example, you might find that
having a particular major might lead to a higher gpa. Here’s an example of a report that looks for key influences to
membership in Sigma Xi:
To use the Key Influencers function, follow these basic steps:
1.
2.
3.
4.
Create a new worksheet with a Table of the data you’d like to analyze (see top of this document)
Click on the Analyze tab
Click on Analyze Key Influencers
You will see the below set-up window.
In the drop-down box, choose the column in your spreadsheet that you’d like to analyze. In the above example, you
would choose the gpa column.
5.
Excel, by default, assumes that you’d like to use every column in the spreadsheet to look for key influences. If you have
many columns in your spreadsheet, this may not be practical. Click on the blue text ‘Choose columns to be used for
analysis’ to limit these columns.
When you first enter this window, all of the spreadsheet columns will be checked. Uncheck any that you do NOT
want to use for analysis.
6.
Excel uses ‘Discrimanant Analysis’ for analyzing key influencers. Discriminant function analysis is used to determine
which pieces of data discriminate between two or more naturally occurring groups. For example, an Admissions office
may want to investigate which variables discriminate between (1) those who apply to Carleton, and (2) Those who do
not. For that purpose the researcher could collect data on numerous variables. Most prospects will fall into one of the
2 categories; applied or did not apply. Discriminant Analysis could then be used to determine which variable(s) are the
best predictors of students' subsequent application choice.
Detect Categories
The Detect Categories tool detects the rows in the data table that contain similar characteristics and groups them into
categories. Each category identified is described by the distinguishing characteristics of the rows assigned to the
category. Based on those characteristics, you may choose to give the category a more understandable name. Breaking
the data into categories enables you to quickly identify the natural groupings in your data.
The tool also labels each row in the source sheet with the category that it was assigned to, in addition to producing a
report like one below:
Fill from Example
This tool automatically fills in missing data in a specific column for all rows in the selected table, based on exemplary
values provided by you for some of those rows. The tool employs SQL Server Data Mining to detect patterns in the
sample you provide and apply those to the remaining missing values.
In addition to filling in the missing values, the Fill From Example tool also produces a patterns report similar to the
Analyze Key Influencers tool that describes the rules that were applied to fill in the missing values
Forecast
The Forecast tool predicts future values based on trends in existing time series data. For example, you can forecast
what annual sales will be two years from now, based on current conditions and factors in your data. The tool appends
the forecast values as new rows to the table selected for forecasting.
The tool also produces a chart that graphically plots the existing and forecast values as shown below:
Highlight Exceptions
This tool analyzes and highlights data rows that do not appear to match the general patterns found in the rest of the
table. These exceptions could be either be due to data entry errors or they may be genuinely unusual values that
require further analysis. Highlighting exceptions can be useful because these values can significantly skew averages
and trends. If exceptions are caused by data entry errors, it is likely that you will want to fix these errors before doing
additional analysis or trending.
Highlight Exceptions not only highlights exception rows, it also highlights the specific column value in each row that
is likely to be the cause of the exception. You can change the column value in-place and the row is immediately reevaluated to determine if it is still an exception.
In addition, the tool produces an interactive exceptions report that can be used to dynamically vary the exception
threshold so you can choose to see more or less exceptions.
Here’s sample output from the Highlight Exceptions tool, with the source data and the exceptions report sheets
displayed side-by-side:
You
pull
can also use the Excel sorting
options to sort by ‘color’. This will
your exceptions to the top of the
spreadsheet for easy review.
Scenario Analysis
This tool allows you to model two types of scenarios and report the effect on either a single row or the entire table of input data:
Goal Seek – Analyzes and reports the factors that need to change to achieve a specific goal. Goal Seek answers
questions such as “What is the value in column X so that column Y has a value of Z”. For example, the below screen
shot shows having a goal of a gpa between 3.20 and 4.00, and the effect of changing the first major. The sample output
below shows the output produced by the tool for a single row in the sample data:
Goal seek uses what is called a Monty Carlo method of sampling. For more information on this technique, you can
consult Wikipedia at Monte Carlo Method.
What-If – Analyzes and reports the impact of a proposed or hypothetical change. For example, you can see the effect
of change in major if the student’s home state changes to Arizona.
With both Goal Seeking and What-if, a confidence is associated with your results. This confidence number ranges
from 0 to 100, with 100 being the most confident. If you get a high confidence number, it means that a loarge number
of rows in the table support the simulation.
Prediction Calculator
This tool detects the patterns that predict a specific value of a column based on values in the other columns. The tool
generates reports and charts of various scores. It may also generate an operational Prediction Calculator and a printerready sheet. The below is a sample of student data; looking at how home country might predict gpa.
Shopping Basket Analysis
The Shopping Basket Analysis tool facilitates cross-selling analysis. It can be applied on a table containing
transactions. The tool identifies groups of items that tend to appear together as well as rules that can be used in
recommendations. If the table contains a Value column associated with each item of each transaction, the tool also
computes the "lift" for each group and rule (a measure of how much the value of the respective item or group increases
in the context of the pattern identified by the tool).
Download