Final Cheat Sheet_Nadya

Excel Dashboards

Bullet Graphs in Excel
http://peltiertech.com/WordPress/vertical-bullet-graphs-in-excel/

Checkboxes and combo boxes:
http://peltiertech.com/Excel/Charts/ChartByControl.html
Naïve Bayes
Partition Data
Click on Training Set
XLMiner - Classification - Naive Bayes
Select input variables and output variables (flight status)
Next
Next
Check Summary Reports: Detailed report, score validation data summary report, Lift Charts
Go to Prior Class Probability
The 0.8 and 0.2 on top are the prior class probabilities
Copy data from the other spreadsheet - Paste Special - Transpose
Use VLOOKUP on the conditional probabilities
Then PRODUCT(I43:I49, I41) (multiply each one of the probabilities by the overall probability)
Then Naïve Bayes formula = Prob of on time / (Prob of on time + Prob of delay), where:
Prob of on time = probability of each condition given on time * overall on-time probability
Prob of delay = probability of each condition given delayed * overall delayed probability
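A quick Python sketch of the same calculation (the conditional probabilities below are made up; in practice they come from the transposed conditional probability table above):

from math import prod

# hypothetical values read off a transposed conditional-probability table
prior_ontime, prior_delay = 0.8, 0.2           # the prior class probabilities shown on top
cond_ontime = [0.30, 0.25, 0.60]               # P(each condition | on time) for the new flight
cond_delay  = [0.40, 0.10, 0.20]               # P(each condition | delayed) for the new flight

p_ontime = prod(cond_ontime) * prior_ontime    # like PRODUCT(I43:I49, I41)
p_delay  = prod(cond_delay)  * prior_delay
posterior_ontime = p_ontime / (p_ontime + p_delay)   # the Naive Bayes formula above
print(round(posterior_ontime, 3))              # 0.957 with these made-up numbers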
Data Cleaning and Descriptive Statistics:
The measures from a population are called parameters.
The measures from a sample are called statistics.
Types of sampling
 Probabilistic sampling (assume there is no order, no bias against any characteristic, and they are all heterogeneously mixed)
 simple random sampling with or without replacement (random number generator in Excel: use the function RANDBETWEEN(1, 1638), the number being the size of the population)
 proportional random sampling
o Let's say you want to do proportional random sampling on books on South America
o by viewing the number of books in each country and their percentage of the total, you can figure out, for instance, that for Argentina the proportion is 578/3192, so your sample needs to have the same proportion
o for example, if you have 200 samples, then you need 200 * (578/3192) ≈ 36 from Argentina
How to pick the samples, e.g. with flight delay data
 Enter observation IDs for each record
 sort your file on delay
 note the number of delays vs. non-delays. Let's say the first 400,000 records are non-delays
 RANDBETWEEN(1, 400000) to sample delay = 0. Keep the ratio of delays vs. non-delays in your sample the same as in the original data (see the sketch after this list)
 RANDBETWEEN(400001, 500000) to sample delay = 1
 Copy all of the resulting numbers and paste special – value
 then do vlookup, to get the data from the other sheet
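A rough Python version of those sampling steps (record counts are the ones from the notes; the sample size and sampling without replacement are assumptions):

import random

# per the sorted file above: records 1..400,000 are non-delays, 400,001..500,000 are delays
n_nondelay, n_total = 400_000, 500_000
sample_size = 1_000                                               # assumed sample size
n_delay = round(sample_size * (n_total - n_nondelay) / n_total)   # keep the original delay ratio
n_nondelay_s = sample_size - n_delay

nondelay_ids = random.sample(range(1, n_nondelay + 1), n_nondelay_s)   # like RANDBETWEEN, no repeats
delay_ids = random.sample(range(n_nondelay + 1, n_total + 1), n_delay)
sample_ids = nondelay_ids + delay_ids          # then look these IDs up in the data sheet (the VLOOKUP step)
print(len(sample_ids), len(delay_ids))         # 1000 200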
Outliers
 An outlier is anything more than 3 standard deviations from the mean
 Your range is mean - 3 sigma to mean + 3 sigma
 exclude any values outside of this range (e.g. conditional formatting to
highlight offending values, filter, etc)
 keep deleting until there are no values highlighted (as you delete, the mean + 3 sigma range recalculates and may highlight new values; that's OK, just keep deleting - see the sketch after this list)
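A small Python sketch of that iterative 3-sigma rule (the data is made up):

import statistics

values = [100 + (i % 7) - 3 for i in range(30)] + [500]   # made-up data with one obvious outlier

while True:
    mean = statistics.mean(values)
    sigma = statistics.pstdev(values)
    kept = [v for v in values if mean - 3 * sigma <= v <= mean + 3 * sigma]
    if len(kept) == len(values):     # nothing highlighted any more -> stop
        break
    values = kept                    # "delete", then the mean and sigma recalculate
print(len(values), min(values), max(values))   # 30 97 103: the 500 is gone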
Frequency Charts (e.g. frequency of house size from real estate)
 Insert Pivot Table
 House size is row
 Count (Price) as a column (can be anything else)
 Right click house size – group. Enter Bin size in “by”
 Right click resulting table – Insert Chart – Column Chart
 Eyeball data to determine outliers
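Roughly the same grouping sketched in Python (hypothetical house sizes, 500 sq ft bins):

from collections import Counter

sizes = [850, 1200, 1480, 1520, 2100, 2300, 2950, 3100, 4800]   # hypothetical house sizes
bin_size = 500                                                  # the "by" value in the pivot group
bins = Counter((s // bin_size) * bin_size for s in sizes)
for start in sorted(bins):
    print(f"{start}-{start + bin_size - 1}: {bins[start]}")     # frequency per bin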
Histograms: Data --> Data Analysis --> Histograms
Descriptive Stats: Data --> Data Analysis --> Descriptive Statistics
Box Plots: Add Ins --> Data Analysis --> XLMiner --> Charts --> Box Plot
Prediction and Classification Methods:
You will have three data sets - training, validation and test sets
 Training set is what you build the model on
 Validation is used for validating the quality of the model
 Test is testing the accuracy of the model
Data Partitioning:
 Open Data Set in Excel
 Add Ins --> XLMiner --> Partition Data --> Standard Partition
 Select all the variables and put them on the right
 Partitioning by default is set to Automatic - 60% training, 40% validation
 But you can also say "use partition variable"
 There, you say what kind of set you want by putting in a variable: t (test), s (training), v (validation)
 This generates the partitions and you can use the hyperlinks at the top to switch between training and validation data
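Outside XLMiner, the default 60/40 split amounts to something like this (made-up record IDs):

import random

records = list(range(1, 101))          # hypothetical record IDs
random.shuffle(records)
cut = int(len(records) * 0.6)          # 60% training, 40% validation (the Automatic default)
training, validation = records[:cut], records[cut:]
print(len(training), len(validation))  # 60 40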
Multiple Regression
You are trying to find how a dependent variable is related to independent variables.
You want to check:
 whether the dependent variable has a linear relationship with the
independent variable
 whether the independent variable is indeed independent
 to make sure that it is a continuous relationship rather than a discrete one
(e.g. one-bedroom, 2 bedroom, 3 bedroom is discrete)
Regression equation: Y = Alpha + Beta*X + error
In other words: Dependent variable = constant + the contribution of an
independent variable + something random
For example: House price = 3000 + 600* sq ft + E
You can say Y hat = a + b*x <-- this is an estimate (you drop the random part)
<-- a is an estimate of Alpha and b is an estimate of Beta
<-- The error in your estimate is Y - Y hat
<-- if you square that, you get the error squared
If you add up all the squared errors (one for each record i), that is the total error
There are models that minimize this error --> you use derivatives
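For reference, a tiny Python sketch of the least-squares answer you get from setting those derivatives to zero (the sq ft / price pairs are made up):

x = [1000, 1500, 2000, 2500, 3000]                   # made-up sq ft
y = [210_000, 330_000, 400_000, 520_000, 610_000]    # made-up prices

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sum((xi - xbar) ** 2 for xi in x)
a = ybar - b * xbar              # Y hat = a + b*x now has the smallest total squared error
print(round(a), round(b))        # 18000 198 for this made-up data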
If you want to minimize the error, you need to find the a and b that minimize the total squared error.
Before you run the regression, you want to make sure that there are no correlated variables (they are truly independent).
To find out, go to XLMiner - Charts - Matrix Plot - pick all the variables you are interested in (Beds, Baths, Sq Ft, Price).
Price and Sq Ft have an almost linear relationship (look in the lower right corner).
Lower left - discontinuous, and you also see that price and beds, and sq ft and beds, are also positively correlated...
So here you will pick Sq Ft, because more square feet means more bedrooms and bathrooms anyway.
It's the continuous variable, where beds and baths are discrete.
You also know that you have to separate them because beds and baths are discontinuous (lower right corner).
So let's say you can't decide which variable to use. Run regression for all three
variables independently (In Excel --> Data --> Data Analysis --> Regression.
Check residual plots).
Since bed vs. price residual plot is discontinuous, you can tell that beds is not a
good variable to use.
Alpha is what you set (it's your tolerance for error, typically it's 0.05 or less),
p is what you get.
Lower P --> Better result.
P <= Alpha – independent variable is significant
P > Alpha – independent variable is not significant and can be removed from the regression
|t| > 2 – significant
Also, you look at the adjusted R Square to see the explanatory power of the
model. Lower R means worse.
Check the standard error – make sure it’s low.
Look at Correlation (Excel --> Data Analysis --> Correlation)
Highlight the results and apply conditional formatting - color bar (Home --> Conditional Formatting --> Color Scales). Take absolute values first.
This is another way to determine multicollinearity (in addition to doing a Matrix Plot in XLMiner). A quick Python version follows.
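Sketch of the same check with pandas (hypothetical columns):

import pandas as pd

df = pd.DataFrame({                       # hypothetical real-estate data
    "beds":  [2, 3, 3, 4, 4, 5],
    "baths": [1, 2, 2, 3, 3, 4],
    "sqft":  [900, 1400, 1600, 2200, 2400, 3000],
    "price": [200, 310, 350, 470, 500, 620],
})
print(df.corr().abs())                    # absolute correlations; values near 1 flag multicollinearity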
What if I run multiple regression on all variables (using XLMiner).
Fitted values will give you the predicted value
Check Unstandardized, summary report
You know this is a problem because the coefficient for bedrooms is negative. So adding a bedroom reduces your house value???
The good news is, XlMiner will determine the best variable for you.
XLMiner --> Multiple Linear Regression
At Step 2, click Best Subset. Backwards elimination (it takes all the variables, and
eliminates the least significant first)
Look at the adjusted R Square - where is it tapering off? No more improvement between 5 and 6 variables.
Also, look at Cp - you want the highest R Square with Cp close to the total number of predictors. Pick subset #12 because its R Square is higher and its Cp is close to the total number of predictors.
Principle of parsimony - if you can do the job with two variables, don't use 3.
In fact, if you include too many variables, you overfit the data- you match the
model perfectly to the data and there is no predictive power.
From the output from XLMiner, click Subset Selection, then choose subset
It will automatically select the subset for you, but then you have to rerun the
regression on just this data.
The regression equation is under "Reg Model":
Prediction = constant + coeff.*input variable + coeff.*input variable + ...
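Plugging into that equation by hand, with made-up coefficients:

# made-up regression output: price = 3000 + 600*sqft + 15000*baths
constant, coef_sqft, coef_baths = 3000, 600, 15000
sqft, baths = 1800, 2
prediction = constant + coef_sqft * sqft + coef_baths * baths
print(prediction)                         # 1113000 for this hypothetical house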
Classification Errors and Costs
Misclassification - how many are placed in the incorrect category on the
test/validation data
Two kinds of Errors:
Individual Misclassification Error - this is for each category itself (you think that a mailing will generate business but it does not); usually associated with false positives or false negatives
Overall Misclassification Error: useful for evaluating the overall model
 Behavior of the errors with cut-off probability values
 If you provide a cut-off probability, then the classification algorithm will
reclassify according to the cut-off. Typical default cut-off is 50%
 Cut-off probability is dependent on misclassification cost and business
context
 Data Table (What-If Analysis) can be used to plot the behavior
For calculations and decision making for the future records, typically
validation results are used.
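A small sketch of what the Data Table is doing: recompute the overall error at a few cut-offs (the scored records are made up):

# hypothetical validation records: (predicted probability, actual class)
scored = [(0.92, 1), (0.80, 1), (0.70, 0), (0.55, 1), (0.40, 0), (0.35, 1), (0.20, 0), (0.05, 0)]

for cutoff in (0.3, 0.5, 0.7):
    errors = sum((1 if p >= cutoff else 0) != actual for p, actual in scored)
    print(cutoff, errors / len(scored))   # overall misclassification error at this cut-off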
Lift Chart (or gains chart) is a graphical way to see the effectiveness of the
classification model. If you do not use any classification and just send an offer to
everyone, then your response rate will be whatever is the underlying probability.
However, when you use a classification scheme, and then sort the target records
accordingly and send the offer, then your response rate should be much higher. The
ratio of gain is the lift.
The Decile chart shows the same information, only in blocks of 10% of the records.
Allows you to know when to stop targeting.
Let Us Recreate the Lift and Decile Chart for the Universal Bank Example
Sort the records in the validation score in descending order of the classification
probabilities
Create a new column on the left to number the cases serially from 1 to 1000
Create a column to count the cumulative number of 1’s (successes) in the actual
column
Complete the entries for all columns using appropriate formulas
Find out the actual number of 1s and 0s in the validation data set (hint: can be easily
done from classification confusion matrix) and create the overall prob of 1 and 0
In a new sheet, create a table that shows the number of successes in every 50 records, both as expected from the overall probability and from your actual cumulative column in the validation score worksheet.
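The same cumulative logic in a small Python sketch (ten made-up validation records, blocks of 2 instead of 50):

# (classification probability, actual 1/0) - made-up validation records
records = [(0.95, 1), (0.90, 1), (0.80, 0), (0.75, 1), (0.60, 0),
           (0.40, 1), (0.30, 0), (0.20, 0), (0.10, 0), (0.05, 0)]
records.sort(key=lambda r: r[0], reverse=True)        # descending classification probability
overall_rate = sum(a for _, a in records) / len(records)

cum = 0
for i, (_, actual) in enumerate(records, start=1):
    cum += actual                                      # cumulative number of 1's (successes)
    if i % 2 == 0:                                     # report in blocks of 2 records
        lift = (cum / i) / overall_rate                # gain over choosing at random
        print(i, cum, round(lift, 2))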
K Nearest Neighbor
 when you get a new record, you compare it to existing records
 you find the "distance" between this new record and the existing records


 e.g. with two attributes: =SQRT(SUMXMY2($H$3, A2) + ($I$3-B2)^2) gives the Euclidean distance between the new record (in H3:I3) and an existing record (in A2:B2)
 You decide to use the k records with the smallest distance (k has to be odd, but you set it)
 see Excel example in MBAD 698 folder
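The same distance-and-vote idea in Python (two attributes, made-up records, k = 3):

from math import dist
from collections import Counter

# hypothetical existing records: (attribute 1, attribute 2, class)
known = [(20, 1.0, "A"), (22, 1.2, "A"), (40, 3.5, "B"), (42, 3.8, "B"), (25, 1.5, "A")]
new_point = (30, 2.0)                                  # the new record to classify

k = 3                                                  # you pick k; odd avoids ties
nearest = sorted(known, key=lambda r: dist(new_point, r[:2]))[:k]
print(Counter(cls for *_, cls in nearest).most_common(1)[0][0])   # majority class among the k nearest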
To do this in XLMiner
 Partition data first
 On the training set, click inside the set
 Add Ins --> XLMiner --> Classification - K Nearest Neighbor
 On Step 2, select score on best K between 1 and specified value (will
let you do 19 max)
 Click on Prior class Probabilities. Best K will be highlighted there.
Classification Tree
Partition Data
XLMiner - Classification - Classification Tree
Select input and output variables
Run
When you look at the output tree, "less than" the split value goes to the left branch, "greater than" goes to the right
The number in between is the number of records in the existing set that fall into that category
e.g. someone with less than 100.5K income and less than 2.95 CC average --> not worth a personal loan
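Read as a rule, that example looks roughly like this (only the one branch from the notes; the real tree has more splits):

def personal_loan_decision(income_k, cc_avg):
    # the single rule given above; other branches of the tree are not shown
    if income_k < 100.5 and cc_avg < 2.95:
        return "not worth a personal loan"
    return "follow the other branches of the tree"

print(personal_loan_decision(80, 2.0))    # not worth a personal loan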
Classification Confusion Matrix (validation data)
             Predicted 1   Predicted 0
Actual 1          60            46
Actual 0          10           884

Lift table (validation data, cumulative successes in blocks of 50 records):
No of   Success rate when   Successes when   Decile
cases   chosen at random    Logit is used    Lift
    0        0                  0
   50        5.3               46
  100       10.6               75            7.075472
  150       15.9               89
  200       21.2               95            4.481132
  250       26.5               97
  300       31.8               99            3.113208
  350       37.1              101
  400       42.4              101            2.382075
  450       47.7              103
  500       53                103            1.943396
  550       58.3              104
  600       63.6              104            1.63522
  650       68.9              104
  700       74.2              105            1.415094
  750       79.5              105
  800       84.8              105            1.238208
  850       90.1              106
  900       95.4              106            1.111111
  950      100.7              106
 1000      106                106            1

(Decile Lift = cumulative successes with the model / cumulative successes expected at random, reported at each block of 100 records.)
Prior Prob: Success = 0.106, Failure = 0.894