
Σtatistics αt KΣU

Course:        STAT 4400 – Binary Classification
Instructor:    Dr. Jennifer Priestley
Office:        WH106
Office Hours:  As needed
Email:         jpriestl@kennesaw.edu

Course Outline

We will meet in the 5th floor conference room on Thursdays from 12:30 – 2:30.

August 19, 2010
• Overview of Course
• Review of Expectations
• Introduction to Dataset
• Assignment of Teams?

August 26, 2010 – Data Cleansing
• Examine Dependent Variable
    o 0/1 assignments
    o Roll Rate?
• Examine Independent Variables
    o Mixed?
    o Missing?
    o Duplicate observations?

September 2, 2010 – Regression and Odds
• Need to identify a minimum of 25 variables to be retained for analysis
• Method 1: Regression
    o Use the VIF
    o Retain relevant variables for modeling
    o Examination of Correlations
• Method 2: Odds Ratios

September 9, 2010 – Discretization Process (Begin)
• Overview of Discretization – why do we do it?
• Two processes will be executed
    o The first process will convert the continuous variables into ordinal categories of equal width. Those categories will then be assigned two values:
        ▪ The first will be ordinal
        ▪ The second will be the odds ratio of the dependent (binary) outcome
    o The second process will convert continuous variables into categories of equal frequency. Those ranges will then be tested using two-sample independent t-tests (the dependent value will be the average of the binary dependent variable). Where range differences are not found to be significant, they will be collapsed. As above, the new categories will then be assigned two values:
        ▪ The first will be ordinal
        ▪ The second will be the odds ratio of the dependent (binary) outcome
• At the conclusion of the discretization process, each independent variable should have 5 forms.
    ▪ Undergrads will have 5*10 = 50 independent variables
    ▪ Grads will have 5*20 = 100 independent variables
• Note to Grad students: using the match key, merge in the SIC code, Tran code and Trans Amt data (TRAN)
• FUN AND EXCITING SAS CODE

September 16, 2010 – Deliverable 1 Due: Report of Variable Selection and Justification

September 23, 2010 – Continue Discretization

September 30, 2010 – Sampling and Intro to Logistic Regression
    o Split dataset into model development and validation files
    o Overview of Logistic Regression

October 7, 2010 – Development of Logistic Regression Model

October 14, 2010 – Deliverable 2 Due: Report of Discretization

October 21, 2010 – November 4, 2010 – Model Evaluation
    o Global Classification
    o Maximize Good
    o Minimize Bad
    o Minimize Type 1
    o Minimize Type 2
    o ROC Curves
    o Profitability Function
• Exploration of TRAN File and incorporation of information

November 11, 2010 – Lab Day

November 18, 2010 – Deliverables 3 and 4: Preliminary Final Report and Presentations
Note: You will have to present me with functional code that I will run against a cleansed sample of the data – I need to just push “enter” and get a final profitability figure. I will tell you in advance which variables will be in the file.

December 2, 2010 – Presentations to faculty

DISCRETIZATION CODE…

Data jlp.discfile (keep = AGE TRADES RBAL MATCHKEY CRELIM DELQID);
    Set JLP.creditmaster;
Run;

Data jlp.disctest;
    Set jlp.discfile;
    If DELQID = . then delete;
    else If DELQID < 3 then GOODBAD = 0;
    Else GOODBAD = 1;
Run;

*/////////DISCRETIZATION ONE/////////// Code to develop ordinal categories of equal width;
Proc Means data=jlp.disctest min max mean median n MISSING;
    Var Trades RBAL AGE;
Run;

Proc Freq data=jlp.disctest;
    Tables GOODBAD;
Run;

Data jlp.disctest1;
    set jlp.disctest;
    if age = .
        then delete;
    else if AGE < 30 then ORDAGE = 1;
    else if AGE < 40 then ORDAGE = 2;
    else if AGE < 50 then ORDAGE = 3;
    else if AGE < 60 then ORDAGE = 4;
    else if AGE < 70 then ORDAGE = 5;
    else if AGE < 80 then ORDAGE = 6;
    else ORDAGE = 7;
Run;

Proc Means data=jlp.disctest1;
    Var GOODBAD;
    Class ORDAGE;
Run;

*Code to create the odds ratio version of the variable;
Proc Summary data=jlp.disctest1 mean missing std;
    Class ORDAGE;
    Var ORDAGE GOODBAD;
    Output out=summary mean=avg_indp avg_dep;
Run;

Data summary;
    Set summary;
    ODSAGE = log(avg_dep/(1-avg_dep));
    ODSAGE1 = (avg_dep/(1-avg_dep));
Run;

Proc Print data=summary;
Run;

*Code to merge the summary file values back into the dataset;
Proc sort data=summary;
    by ORDAGE;
Run;

Proc sort data=jlp.Disctest1;
    by ORDAGE;
Run;

Data jlp.disctest2 (drop = _TYPE_ _FREQ_ avg_indp avg_dep);
    Merge summary jlp.disctest1;
    by ORDAGE;
Run;

*At this point, we have 3 versions of the AGE variable;
Proc print data=jlp.disctest2;
    Var AGE ORDAGE ODSAGE ODSAGE1;
Run;

Proc Means data=jlp.disctest2;
    Var Age;
    Class ODSAGE1;
Run;

*/////////DISCRETIZATION TWO/////////// Code to develop ordinal categories of equal Frequencies;
Proc Rank data=jlp.disctest2 out=ranked groups = 10 ties = high;
    ranks rank;
    var Age;
Run;

Proc Freq data=ranked;
    Tables RANK;
Run;

Proc Summary data=ranked missing mean std min max;
    class rank;
    Var Age GOODBAD;
    Output out = summaryrank
        mean = avg_indp avg_dep
        std = std_indp std_dep
        min = min_indp min_dep
        max = max_indp max_dep;
Run;

Proc Sort data=summaryrank;
    by Rank;
Run;

Proc Print data=summaryrank (obs=1000);
Run;

*Sort ranked by Rank first so the BY-merge below does not fail on unsorted data;
Proc Sort data=ranked;
    by Rank;
Run;

Data RANKINGFILE;
    Merge ranked summaryrank;
    by Rank;
Run;

Proc Print data=rankingfile (obs=1000);
Run;

*Note that the file reference here will be whatever was output from the Proc Summary;
**DO NOT CHANGE...EVER!;
proc sort data=summaryrank;
    by rank;
run;

%let pval_col=0.15;

Data summaryrank;
    Set summaryrank;
    var="Age";
    If _TYPE_ = 1;
    rename _freq_ = numobs;
Run;

data temp;
    set summaryrank;
    by var;
    retain _finit _rnk1 _rnk2
        _freq1 _freq2 _aindp1 _aindp2 _mindp1 _mindp2 _mxindp1 _mxindp2
        _adep1 _adep2 _sdep1 _sdep2 _sdenom1 _sdenom2;

    *use _finit to flag when first initialization occurs;
    *First initialization is when the first non-missing level is;
    *encountered. it is from this point forward that the t-tests;
    *take place, since missing levels are automatically outputted;
    *as their own level;
    if first.var then _finit=0;

    if rank<=.Z then output;
    else do;
        if _finit=0 then do;
            _finit=1;
            _rnk1=rank;
            _freq1=numobs;
            _aindp1=numobs*avg_indp;
            _mindp1=min_indp;
            _mxindp1=max_indp;
            _adep1=numobs*avg_dep;
            if std_dep=. then _sdep1=0;
            else _sdep1=(numobs-1)*std_dep*std_dep;
            _sdenom1=numobs-1;
        end;
        else do;
            _w1=(_sdep1/_sdenom1)/_freq1;
            _w2=(std_dep**2)/numobs;
            _df=((_w1+_w2)**2)/((_w1**2)/(_freq1-1)+(_w2**2)/(numobs-1));
            _t=abs(_adep1/_freq1-avg_dep)/sqrt(_w1+_w2);
            pvalue=(1-probt(_t,round(_df,1)))*2;

            * If t-test is significant, then output cumulated variables;
            * Otherwise continue cumulating;
            if pvalue<=&pval_col then do;
                * Store current row;
                _rnk2=rank;
                _freq2=numobs;
                _aindp2=numobs*avg_indp;
                _mindp2=min_indp;
                _mxindp2=max_indp;
                _adep2=numobs*avg_dep;
                if std_dep=. then _sdep2=0;
                else _sdep2=(numobs-1)*std_dep*std_dep;
                _sdenom2=numobs-1;

                * Switch cumulated variables with current row and output;
                rank=_rnk1;
                numobs=_freq1;
                avg_indp=_aindp1/_freq1;
                min_indp=_mindp1;
                max_indp=_mxindp1;
                avg_dep=_adep1/_freq1;
                std_dep=sqrt(_sdep1/_sdenom1);
                output;

                * Set first variables to current row;
                _rnk1=_rnk2;
                _freq1=_freq2;
                _aindp1=_aindp2;
                _mindp1=_mindp2;
                _mxindp1=_mxindp2;
                _adep1=_adep2;
                _sdep1=_sdep2;
                _sdenom1=_sdenom2;
            end;
            else do;
                _rnk1=rank;
                _freq1=_freq1+numobs;
                _aindp1=_aindp1+numobs*avg_indp;
                _mxindp1=max_indp;
                _adep1=_adep1+numobs*avg_dep;
                if std_dep ne .
                    then _sdep1=_sdep1+(numobs-1)*std_dep*std_dep;
                _sdenom1=_sdenom1+numobs-1;
            end;
            *end;

            * If last row for current variable, then output;
            if last.var then do;
                rank=_rnk1;
                numobs=_freq1;
                avg_indp=_aindp1/_freq1;
                min_indp=_mindp1;
                max_indp=_mxindp1;
                avg_dep=_adep1/_freq1;
                std_dep=sqrt(_sdep1/_sdenom1);
                output;
            end;
        end;
    end;

    drop _finit _rnk1 _rnk2 _freq1 _freq2 _aindp1 _aindp2 _mindp1 _mindp2
         _mxindp1 _mxindp2 _adep1 _adep2 _sdep1 _sdep2 _sdenom1 _sdenom2
         _w1 _w2 _df _t /*pvalue*/;
run;

***********End of secret code. You can resume normal procedures.;
Proc Print data=temp;
Run;

****************************************Stopped here;
Data temp;
    set temp;
    ODSEQAGE = log(avg_dep/(1-avg_dep));
    ODSEQAGE1 = (avg_dep/(1-avg_dep));
Run;

Proc sort data=temp;
    by rank;
Run;

Proc Print data=temp;
Run;

Proc Freq data=ranked;
    Tables Rank;
Run;

Data ranked1;
    Set ranked;
    if rank = 0 then rank = 0;
    else if rank = 1 then rank = 3;
    else if rank = 2 then rank = 3;
    else if rank = 3 then rank = 3;
    else if rank = 4 then rank = 4;
    else if rank = 5 then rank = 7;
    else if rank = 6 then rank = 7;
    else if rank = 7 then rank = 7;
    else if rank = 8 then rank = 9;
    else if rank = 9 then rank = 9;
    else rank=rank;
Run;

Proc sort data=ranked1;
    by rank;
Run;

Proc sort data=temp;
    by rank;
Run;

Data jlp.disctest3;
    Merge ranked1 temp;
    by rank;
Run;

Proc means data=jlp.disctest3;
    var ODSEQAGE ODSEQAGE1;
    class Rank;
Run;

*****At this point, you can assign the ranks as 1, 2, 3...;
Proc contents data=jlp.disctest3;
Run;

Proc Freq data=jlp.disctest3;
    Tables Rank;
Run;

Data jlp.disctest4;
    set jlp.disctest3;
    if rank = 0 then ORDEQAge = 0;
    else if rank = 3 then ORDEQAge = 1;
    else if rank = 4 then ORDEQAge = 2;
    else if rank = 7 then ORDEQAge = 3;
    else if rank = 9 then ORDEQAge = 4;
    else rank = rank;
run;

*Lets take a look at the new variations of AGE;
Proc Print data=jlp.disctest4 (obs=1000);
    Var Age ORDAGE ODSAGE ODSAGE1 ORDEQAGE ODSEQAGE ODSEQAGE1;
Run;

Proc means data=jlp.disctest4 Missing;
    Var Age;
    ID ORDAGE ORDEQAGE;
Run;

*Now...what are you going to do about the missing obs?;
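
For reference, the collapsing step inside the DO NOT CHANGE block appears to be an unequal-variance (Welch) two-sample t-test between the bins cumulated so far (group 1) and the next rank bin (group 2). The formulas below are a reading of that code, not an official specification. The symbols n, \bar{y}, s^2, w and \nu are notation introduced here for illustration: n is the number of observations (numobs, _freq1), \bar{y} is the mean of GOODBAD (avg_dep), s_2^2 is std_dep squared, and s_1^2 is the pooled within-bin variance _sdep1/_sdenom1. The last line restates how the odds (ODSAGE1, ODSEQAGE1) and log odds (ODSAGE, ODSEQAGE) columns are computed.

```latex
\begin{align*}
w_1 &= \frac{s_1^2}{n_1}, \qquad w_2 = \frac{s_2^2}{n_2} \\
t &= \frac{\lvert \bar{y}_1 - \bar{y}_2 \rvert}{\sqrt{w_1 + w_2}} \\
\nu &= \frac{(w_1 + w_2)^2}{\dfrac{w_1^2}{n_1 - 1} + \dfrac{w_2^2}{n_2 - 1}} \\
p &= 2\,\bigl(1 - F_{t}(t;\ \operatorname{round}(\nu))\bigr) \\
\text{odds} &= \frac{\bar{y}}{1 - \bar{y}}, \qquad
\text{log odds} = \ln\!\left(\frac{\bar{y}}{1 - \bar{y}}\right)
\end{align*}
```

Here F_t is the Student t distribution function (PROBT in the code). Adjacent bins keep cumulating while p > &pval_col (0.15 in this program); once p <= &pval_col the cumulated group is written out and the current bin starts a new group. As a quick check of the last line, a bin with a bad rate of 0.20 has odds 0.20/0.80 = 0.25 and log odds ln(0.25), roughly -1.386.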
