Final Cheat Sheet_Nadya

Excel Dashboards

Bullet Graphs in Excel
http://peltiertech.com/WordPress/vertical-bullet-graphs-in-excel/

Checkboxes and combo boxes:
http://peltiertech.com/Excel/Charts/ChartByControl.html
Naïve Bayes
Partition Data
Click on Training Set
XLMiner - Classification - Naive Bayes
Select input variables and output variables (flight status)
Next
Next
Check Summary Reports: Detailed report, score validation data summary report, Lift Charts
Go to Prior Class Probability
The 0.8 and 0.2 on top are the prior class probabilities
Copy data from the other spreadsheet - Paste Special - Transpose
Use VLOOKUP on the conditional probabilities
Then PRODUCT(I43:I49, I41) (multiply each one of the probabilities by the overall probability)
Then Naïve Bayes formula = Prob of on time / (Prob of on time + Prob of delay), where:
Prob of on time = probability of each condition given on time * overall on-time probability
Prob of delay = probability of each condition given delayed * overall delayed probability
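A quick Python sketch of the same calculation (the conditional probabilities below are made up; in practice they come from the transposed conditional probability table above):

from math import prod

# hypothetical values read off a transposed conditional-probability table
prior_ontime, prior_delay = 0.8, 0.2           # the prior class probabilities shown on top
cond_ontime = [0.30, 0.25, 0.60]               # P(each condition | on time) for the new flight
cond_delay  = [0.40, 0.10, 0.20]               # P(each condition | delayed) for the new flight

p_ontime = prod(cond_ontime) * prior_ontime    # like PRODUCT(I43:I49, I41)
p_delay  = prod(cond_delay)  * prior_delay
posterior_ontime = p_ontime / (p_ontime + p_delay)   # the Naive Bayes formula above
print(round(posterior_ontime, 3))              # 0.957 with these made-up numbers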
Data Cleaning and Descriptive Statistics:
The measures from a population are called parameters.
The measures from a sample are called statistics.
Types of sampling
 Probabilistic sampling (assume there is no order, no bias against any characteristic, and they are all heterogeneously mixed)
 simple random sampling with or without replacement (random number generator in Excel: use the function RANDBETWEEN(1, 1638), the number being the size of the population)
 proportional random sampling
o Let's say you want to do proportional random sampling on books on South America
o by viewing the number of books in each country and their percentage of the total, you can figure out, for instance, that for Argentina the proportion is 578/3192, so your sample needs to have the same proportion
o for example, if you have 200 samples, then you need 200 * (578/3192) ≈ 36 from Argentina
How to pick the samples, e.g. with flight delay data
 Enter observation IDs for each record
 sort your file on delay
 note the number of delays vs. non-delays. Let's say the first 400,000 records are non-delays
 RANDBETWEEN(1, 400000) to sample delay = 0. Keep the ratio of delays vs. non-delays in your sample the same as in the original data (see the sketch after this list)
 RANDBETWEEN(400001, 500000) to sample delay = 1
 Copy all of the resulting numbers and paste special – value
 then do vlookup, to get the data from the other sheet
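A rough Python version of those sampling steps (record counts are the ones from the notes; the sample size and sampling without replacement are assumptions):

import random

# per the sorted file above: records 1..400,000 are non-delays, 400,001..500,000 are delays
n_nondelay, n_total = 400_000, 500_000
sample_size = 1_000                                               # assumed sample size
n_delay = round(sample_size * (n_total - n_nondelay) / n_total)   # keep the original delay ratio
n_nondelay_s = sample_size - n_delay

nondelay_ids = random.sample(range(1, n_nondelay + 1), n_nondelay_s)   # like RANDBETWEEN, no repeats
delay_ids = random.sample(range(n_nondelay + 1, n_total + 1), n_delay)
sample_ids = nondelay_ids + delay_ids          # then look these IDs up in the data sheet (the VLOOKUP step)
print(len(sample_ids), len(delay_ids))         # 1000 200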
Outliers
 An outlier is anything more than 3 standard deviations from the mean
 Your range is mean - 3 sigma to mean + 3 sigma
 exclude any values outside of this range (e.g. conditional formatting to
highlight offending values, filter, etc)
 keep deleting until there are no values highlighted (as you delete, the mean + 3 sigma range recalculates and may highlight new values; that's OK, just keep deleting - see the sketch after this list)
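A small Python sketch of that iterative 3-sigma rule (the data is made up):

import statistics

values = [100 + (i % 7) - 3 for i in range(30)] + [500]   # made-up data with one obvious outlier

while True:
    mean = statistics.mean(values)
    sigma = statistics.pstdev(values)
    kept = [v for v in values if mean - 3 * sigma <= v <= mean + 3 * sigma]
    if len(kept) == len(values):     # nothing highlighted any more -> stop
        break
    values = kept                    # "delete", then the mean and sigma recalculate
print(len(values), min(values), max(values))   # 30 97 103: the 500 is gone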
Frequency Charts (e.g. frequency of house size from real estate)
 Insert Pivot Table
 House size is row
 Count (Price) as a column (can be anything else)
 Right click house size – group. Enter Bin size in “by”
 Right click resulting table – Insert Chart – Column Chart
 Eyeball data to determine outliers
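Roughly the same grouping sketched in Python (hypothetical house sizes, 500 sq ft bins):

from collections import Counter

sizes = [850, 1200, 1480, 1520, 2100, 2300, 2950, 3100, 4800]   # hypothetical house sizes
bin_size = 500                                                  # the "by" value in the pivot group
bins = Counter((s // bin_size) * bin_size for s in sizes)
for start in sorted(bins):
    print(f"{start}-{start + bin_size - 1}: {bins[start]}")     # frequency per bin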
Histograms: Data --> Data Analysis --> Histograms
Descriptive Stats: Data --> Data Analysis --> Descriptive Statistics
Box Plots: Add Ins --> Data Analysis --> XLMiner --> Charts --> Box Plot
Prediction and Classification Methods:
You will have three data sets - training, validation and test sets
 Training set is what you build the model on
 Validation is used for validating the quality of the model
 Test is testing the accuracy of the model
Data Partitioning:
 Open Data Set in Excel
 Add Ins --> XLMiner --> Partition Data --> Standard Partition
 Select all the variables and put them on the right
 Partitioning by default is set to Automatic - 60% training, 40% validation
 But you can also say "use partition variable"
 There, you say what kind of set you want by putting in a variable: t (test), s (training), v (validation)
 This generates the partitions and you can use the hyperlinks at the top to switch between training and validation data
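Outside XLMiner, the default 60/40 split amounts to something like this (made-up record IDs):

import random

records = list(range(1, 101))          # hypothetical record IDs
random.shuffle(records)
cut = int(len(records) * 0.6)          # 60% training, 40% validation (the Automatic default)
training, validation = records[:cut], records[cut:]
print(len(training), len(validation))  # 60 40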
Multiple Regression
You are trying to find how a dependent variable is related to independent variables.
You want to check:
 whether the dependent variable has a linear relationship with the
independent variable
 whether the independent variable is indeed independent
 to make sure that it is a continuous relationship rather than a discrete one
(e.g. one-bedroom, 2 bedroom, 3 bedroom is discrete)
Regression equation: Y = Alpha + Beta*X + error
In other words: Dependent variable = constant + the contribution of an
independent variable + something random
For example: House price = 3000 + 600* sq ft + E
You can say Y hat = a + b*x <-- this is an estimate (you drop the random part)
<-- a is an estimate of Alpha and b is an estimate of Beta
<-- The error in your estimate is Y - Y hat
<-- if you square that, you get the error squared
If you add up all the squared errors (one for each record i), that is the total error
There are models that minimize this error --> you use derivatives
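For reference, a tiny Python sketch of the least-squares answer you get from setting those derivatives to zero (the sq ft / price pairs are made up):

x = [1000, 1500, 2000, 2500, 3000]                   # made-up sq ft
y = [210_000, 330_000, 400_000, 520_000, 610_000]    # made-up prices

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sum((xi - xbar) ** 2 for xi in x)
a = ybar - b * xbar              # Y hat = a + b*x now has the smallest total squared error
print(round(a), round(b))        # 18000 198 for this made-up data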
If you want to minimize the error, you need to find the a and b that minimize the total squared error.
Before you run the regression, you want to make sure that there are no correlated variables (they are truly independent).
To find out, go to XLMiner - Charts - Matrix Plot - pick all the variables you are interested in (Beds, Baths, Sq Ft, Price).
Price and Sq Ft have an almost linear relationship (look in the lower right corner).
Lower left - discontinuous, and you also see that price and beds, and sq ft and beds, are also positively correlated...
So here you will pick Sq Ft, because more square feet means more bedrooms and bathrooms anyway.
It's the continuous variable, where beds and baths are discrete.
You also know that you have to separate them because beds and baths are discontinuous (lower right corner).
So let's say you can't decide which variable to use. Run regression for all three
variables independently (In Excel --> Data --> Data Analysis --> Regression.
Check residual plots).
Since bed vs. price residual plot is discontinuous, you can tell that beds is not a
good variable to use.
Alpha is what you set (it's your tolerance for error, typically it's 0.05 or less),
p is what you get.
Lower P --> Better result.
P <= Alpha – independent variable is significant
P > Alpha – independent variable is not significant and can be removed from the regression
|t| > 2 – significant
Also, you look at the adjusted R Square to see the explanatory power of the
model. Lower R means worse.
Check the standard error – make sure it’s low.
Look at Correlation (Excel --> Data Analysis --> Correlation)
Highlight the results and apply conditional formatting - color bar (Home --> Conditional Formatting --> Color Scales). Take absolute values first.
This is another way to determine multicollinearity (in addition to doing a Matrix Plot in XLMiner). A quick Python version follows.
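Sketch of the same check with pandas (hypothetical columns):

import pandas as pd

df = pd.DataFrame({                       # hypothetical real-estate data
    "beds":  [2, 3, 3, 4, 4, 5],
    "baths": [1, 2, 2, 3, 3, 4],
    "sqft":  [900, 1400, 1600, 2200, 2400, 3000],
    "price": [200, 310, 350, 470, 500, 620],
})
print(df.corr().abs())                    # absolute correlations; values near 1 flag multicollinearity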
What if I run multiple regression on all variables (using XLMiner).
Fitted values will give you the predicted value
Check Unstandardized, summary report
You know this is a problem because the coefficient for bedrooms is negative. So adding a bedroom reduces your house value???
The good news is, XlMiner will determine the best variable for you.
XLMiner --> Multiple Linear Regression
At Step 2, click Best Subset. Backwards elimination (it takes all the variables, and
eliminates the least significant first)
Look at the adjusted R Square - where is it tapering off? No more improvement between 5 and 6 variables.
Also, look at Cp - you want the highest R Square with Cp close to the total number of predictors. Pick subset #12 because its R Square is higher and its Cp is close to the total number of predictors.
Principle of parsimony - if you can do the job with two variables, don't use 3.
In fact, if you include too many variables, you overfit the data- you match the
model perfectly to the data and there is no predictive power.
From the output from XLMiner, click Subset Selection, then choose subset
It will automatically select the subset for you, but then you have to rerun the
regression on just this data.
The regression equation is under "Reg Model":
Prediction = constant + coeff.*input variable + coeff.*input variable + ...
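Plugging into that equation by hand, with made-up coefficients:

# made-up regression output: price = 3000 + 600*sqft + 15000*baths
constant, coef_sqft, coef_baths = 3000, 600, 15000
sqft, baths = 1800, 2
prediction = constant + coef_sqft * sqft + coef_baths * baths
print(prediction)                         # 1113000 for this hypothetical house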
Classification Errors and Costs
Misclassification - how many are placed in the incorrect category on the
test/validation data
Two kinds of Errors:
Individual Misclassification Error - this is for each category itself (you think that a mailing will generate business but it does not); usually associated with false positives or false negatives
Overall Misclassification Error: useful for evaluating the overall model
 Behavior of the errors with cut-off probability values
 If you provide a cut-off probability, then the classification algorithm will
reclassify according to the cut-off. Typical default cut-off is 50%
 Cut-off probability is dependent on misclassification cost and business
context
 Data Table (What-If Analysis) can be used to plot the behavior
For calculations and decision making for the future records, typically
validation results are used.
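A small sketch of what the Data Table is doing: recompute the overall error at a few cut-offs (the scored records are made up):

# hypothetical validation records: (predicted probability, actual class)
scored = [(0.92, 1), (0.80, 1), (0.70, 0), (0.55, 1), (0.40, 0), (0.35, 1), (0.20, 0), (0.05, 0)]

for cutoff in (0.3, 0.5, 0.7):
    errors = sum((1 if p >= cutoff else 0) != actual for p, actual in scored)
    print(cutoff, errors / len(scored))   # overall misclassification error at this cut-off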
Lift Chart (or gains chart) is a graphical way to see the effectiveness of the
classification model. If you do not use any classification and just send an offer to
everyone, then your response rate will be whatever is the underlying probability.
However, when you use a classification scheme, and then sort the target records
accordingly and send the offer, then your response rate should be much higher. The
ratio of gain is the lift.
The Decile chart shows the same information, only in blocks of 10% of the records.
Allows you to know when to stop targeting.
Let Us Recreate the Lift and Decile Chart for the Universal Bank Example
Sort the records in the validation score in descending order of the classification
probabilities
Create a new column on the left to number the cases serially from 1 to 1000
Create a column to count the cumulative number of 1’s (successes) in the actual
column
Complete the entries for all columns using appropriate formulas
Find out the actual number of 1s and 0s in the validation data set (hint: can be easily
done from classification confusion matrix) and create the overall prob of 1 and 0
In a new sheet, create a table that shows the number of successes in every 50 records, both as expected from the overall probability and from your actual cumulative column in the validation score worksheet.
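The same cumulative logic in a small Python sketch (ten made-up validation records, blocks of 2 instead of 50):

# (classification probability, actual 1/0) - made-up validation records
records = [(0.95, 1), (0.90, 1), (0.80, 0), (0.75, 1), (0.60, 0),
           (0.40, 1), (0.30, 0), (0.20, 0), (0.10, 0), (0.05, 0)]
records.sort(key=lambda r: r[0], reverse=True)        # descending classification probability
overall_rate = sum(a for _, a in records) / len(records)

cum = 0
for i, (_, actual) in enumerate(records, start=1):
    cum += actual                                      # cumulative number of 1's (successes)
    if i % 2 == 0:                                     # report in blocks of 2 records
        lift = (cum / i) / overall_rate                # gain over choosing at random
        print(i, cum, round(lift, 2))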
K Nearest Neighbor
 when you get a new record, you compare it to existing records
 you find the "distance" between this new record and the existing records


 e.g. with two attributes: =SQRT(SUMXMY2($H$3, A2) + ($I$3-B2)^2) gives the Euclidean distance between the new record (in H3:I3) and an existing record (in A2:B2)
 You decide to use the k records with the smallest distance (k has to be odd, but you set it)
 see Excel example in MBAD 698 folder
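The same distance-and-vote idea in Python (two attributes, made-up records, k = 3):

from math import dist
from collections import Counter

# hypothetical existing records: (attribute 1, attribute 2, class)
known = [(20, 1.0, "A"), (22, 1.2, "A"), (40, 3.5, "B"), (42, 3.8, "B"), (25, 1.5, "A")]
new_point = (30, 2.0)                                  # the new record to classify

k = 3                                                  # you pick k; odd avoids ties
nearest = sorted(known, key=lambda r: dist(new_point, r[:2]))[:k]
print(Counter(cls for *_, cls in nearest).most_common(1)[0][0])   # majority class among the k nearest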
To do this in XLMiner
 Partition data first
 On the training set, click inside the set
 Add Ins --> XLMiner --> Classification - K Nearest Neighbor
 On Step 2, select score on best K between 1 and specified value (will
let you do 19 max)
 Click on Prior class Probabilities. Best K will be highlighted there.
Classification Tree
Partition Data
XLMiner - Classification - Classification Tree
Select input and output variables
Run
When you look at the output tree, "less than" the split value goes to the left branch, "greater than" goes to the right
The number in between is the number of records in the existing set that fall into that category
e.g. someone with less than 100.5K income and less than 2.95 CC average --> not worth a personal loan
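Read as a rule, that example looks roughly like this (only the one branch from the notes; the real tree has more splits):

def personal_loan_decision(income_k, cc_avg):
    # the single rule given above; other branches of the tree are not shown
    if income_k < 100.5 and cc_avg < 2.95:
        return "not worth a personal loan"
    return "follow the other branches of the tree"

print(personal_loan_decision(80, 2.0))    # not worth a personal loan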
Classification Confusion Matrix (validation data)
             Predicted 1   Predicted 0
Actual 1          60            46
Actual 0          10           884

Lift table (validation data, cumulative successes in blocks of 50 records):
No of   Success rate when   Successes when   Decile
cases   chosen at random    Logit is used    Lift
    0        0                  0
   50        5.3               46
  100       10.6               75            7.075472
  150       15.9               89
  200       21.2               95            4.481132
  250       26.5               97
  300       31.8               99            3.113208
  350       37.1              101
  400       42.4              101            2.382075
  450       47.7              103
  500       53                103            1.943396
  550       58.3              104
  600       63.6              104            1.63522
  650       68.9              104
  700       74.2              105            1.415094
  750       79.5              105
  800       84.8              105            1.238208
  850       90.1              106
  900       95.4              106            1.111111
  950      100.7              106
 1000      106                106            1

(Decile Lift = cumulative successes with the model / cumulative successes expected at random, reported at each block of 100 records.)
Prior Prob: Success = 0.106, Failure = 0.894