Σtatistics αt KΣU
Course: STAT 4400 – Binary Classification
Instructor: Dr. Jennifer Priestley
Office: WH106
Office Hours: As needed
Email: jpriestl@kennesaw.edu
Course Outline
We will meet in the 5th floor conference room on
Thursdays from 12:30 – 2:30.
August 19, 2010
• Overview of Course
• Review of Expectations
• Introduction to Dataset
• Assignment of Teams?
August 26, 2010 – Data Cleansing
• Examine Dependent Variable
  o 0/1 assignments
  o Roll Rate?
• Examine Independent Variables
  o Mixed?
  o Missing?
  o Duplicate observations?
September 2, 2010 – Regression and Odds
• Need to identify a minimum of 25 variables to be retained for analysis
• Method 1: Regression
  o Use the VIF
  o Retain relevant variables for modeling
  o Examination of Correlations
• Method 2: Odds Ratios
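The two screening methods above can be illustrated outside SAS. The following is a minimal Python sketch, not course code: the function names and toy data are hypothetical, the VIF shortcut 1/(1 - r^2) applies only when exactly two predictors are compared, and the odds ratio is computed for a 0/1 predictor against the 0/1 outcome.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length numeric lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def vif_two_predictors(x1, x2):
    """VIF for the special case of two predictors: 1 / (1 - r^2).

    With more predictors, r^2 would be replaced by the R^2 from
    regressing one predictor on all of the others.
    """
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r * r)


def odds_ratio(flag, outcome):
    """Odds ratio of outcome = 1 for flag = 1 versus flag = 0.

    Raises ZeroDivisionError if a cell of the 2x2 table needed in the
    denominator is empty; real data would need a rule for that case.
    """
    a = b = c = d = 0
    for f, y in zip(flag, outcome):
        if f == 1 and y == 1:
            a += 1  # flagged, bad
        elif f == 1 and y == 0:
            b += 1  # flagged, good
        elif f == 0 and y == 1:
            c += 1  # not flagged, bad
        else:
            d += 1  # not flagged, good
    return (a * d) / (b * c)


# Hypothetical toy data, for illustration only.
X1 = [1, 2, 3, 4, 5]
X2 = [2, 1, 4, 3, 5]
FLAG = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
OUTCOME = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

if __name__ == "__main__":
    print(vif_two_predictors(X1, X2))
    print(odds_ratio(FLAG, OUTCOME))
```

A large VIF suggests a predictor is largely redundant with the others, while an odds ratio far from 1 suggests the predictor separates the two outcome classes.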
September 9, 2010 – Discretization Process (Begin)
• Overview of Discretization – why do we do it?
• Two processes will be executed
  o The first process will convert the continuous variables into ordinal categories of equal width. Those categories will then be assigned two values:
    ▪ The first will be ordinal.
    ▪ The second will be the odds ratio of the dependent (binary) outcome.
  o The second process will convert continuous variables into categories of equal frequency. Those ranges will then be tested using two-sample independent t-tests (comparing the mean of the binary dependent variable between ranges). Where range differences are not found to be significant, they will be collapsed. As above, the new categories will then be assigned two values:
    ▪ The first will be ordinal.
    ▪ The second will be the odds ratio of the dependent (binary) outcome.
  o At the conclusion of the discretization process, each independent variable should have 5 forms.
    ▪ Undergrads will have 5*10 = 50 independent variables
    ▪ Grads will have 5*20 = 100 independent variables
  o Note to Grad students: using the match key, merge in the SIC code, Tran code and Trans Amt data (TRAN)
  o FUN AND EXCITING SAS CODE
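As a language-neutral illustration of the first (equal-width) process described above, here is a minimal Python sketch; the course itself uses SAS. The function names, the toy ages and outcomes, and the choice of four bins are all hypothetical. Unlike the course's SAS code, which uses fixed cutpoints, this sketch derives the bin width from the observed minimum and maximum. The value computed per bin is the number of bad accounts divided by the number of good accounts, which equals p/(1 - p) for the bin's bad rate p.

```python
import math
from collections import defaultdict


def equal_width_bins(values, n_bins):
    """Assign each value an ordinal bin 1..n_bins of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        return [1 for _ in values]  # all values identical: one bin
    bins = []
    for v in values:
        idx = int((v - lo) / width) + 1
        bins.append(min(idx, n_bins))  # the maximum falls in the last bin
    return bins


def odds_by_bin(bins, outcome):
    """Odds of a bad outcome (1) within each bin: bads / goods."""
    totals = defaultdict(int)
    bads = defaultdict(int)
    for b, y in zip(bins, outcome):
        totals[b] += 1
        bads[b] += y
    odds = {}
    for b in sorted(totals):
        goods = totals[b] - bads[b]
        # A bin with no good accounts has undefined (infinite) odds.
        odds[b] = bads[b] / goods if goods > 0 else math.inf
    return odds


# Hypothetical toy data, for illustration only (1 = bad, 0 = good).
AGES = [22, 25, 31, 38, 45, 47, 52, 58, 63, 70]
GOODBAD = [1, 1, 0, 1, 0, 0, 1, 0, 0, 1]

ORD = equal_width_bins(AGES, 4)
ODDS = odds_by_bin(ORD, GOODBAD)

if __name__ == "__main__":
    for age, b in zip(AGES, ORD):
        print(age, b, ODDS[b])
```

Each observation ends up with two new values: the ordinal bin number and the odds attached to that bin.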
September 16, 2010 – Deliverable 1 Due: Report of Variable selection and justification
September 23, 2010 – Continue Discretization
September 30, 2010 – Sampling and Intro to Logistic Regression
o Split dataset into model development and validation files
o Overview of Logistic Regression
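The development/validation split mentioned above can be sketched as a seeded random partition. This is a hypothetical Python illustration rather than course code: the 70/30 proportion, the seed, and the function name are assumptions, not requirements stated in this outline.

```python
import random


def split_dev_val(records, dev_fraction=0.7, seed=2010):
    """Randomly partition records into development and validation lists.

    dev_fraction and seed are illustrative defaults. Fixing the seed
    makes the split reproducible from run to run.
    """
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    cut = int(round(len(shuffled) * dev_fraction))
    return shuffled[:cut], shuffled[cut:]


if __name__ == "__main__":
    dev, val = split_dev_val(range(100))
    print(len(dev), len(val))
```

The model is fit on the development records only; the validation records are held back so that performance is measured on data the model has not seen.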
October 7, 2010 – Development of Logistic Regression Model
October 14, 2010 – Deliverable 2 Due: Report of Discretization
October 21, 2010 – November 4, 2010 – Model Evaluation
o Global Classification
o Maximize Good
o Minimize Bad
o Minimize Type 1
o Minimize Type 2
o ROC Curves
o Profitability Function
o Exploration of TRAN File and incorporation of information
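The evaluation criteria listed above all derive from the same table of predicted versus actual outcomes at a chosen probability cutoff. The Python sketch below is a hypothetical illustration, not the course's profitability function: the gain and loss amounts, the toy scores, and the function names are assumptions. It also assumes that 1 means a bad account, that a Type 1 error is a good account predicted bad, and that a Type 2 error is a bad account predicted good.

```python
def confusion_counts(actual, prob_bad, cutoff):
    """Count outcomes when prob_bad >= cutoff is predicted bad (1)."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for y, p in zip(actual, prob_bad):
        predicted = 1 if p >= cutoff else 0
        if predicted == 1 and y == 1:
            counts["tp"] += 1  # bad account correctly rejected
        elif predicted == 1 and y == 0:
            counts["fp"] += 1  # good account rejected (Type 1 here)
        elif predicted == 0 and y == 0:
            counts["tn"] += 1  # good account correctly accepted
        else:
            counts["fn"] += 1  # bad account accepted (Type 2 here)
    return counts


def profit(counts, gain_per_good=100, loss_per_bad=300):
    """Hypothetical profit: accepted goods earn, accepted bads cost."""
    return counts["tn"] * gain_per_good - counts["fn"] * loss_per_bad


def best_cutoff(actual, prob_bad, cutoffs):
    """Return (cutoff, profit) for the most profitable cutoff tried."""
    scored = [(profit(confusion_counts(actual, prob_bad, c)), c)
              for c in cutoffs]
    top_profit, top_cutoff = max(scored)
    return top_cutoff, top_profit


# Hypothetical toy scores, for illustration only (1 = bad, 0 = good).
ACTUAL = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
PROB_BAD = [0.05, 0.10, 0.20, 0.30, 0.45, 0.60, 0.35, 0.55, 0.70, 0.90]

if __name__ == "__main__":
    print(confusion_counts(ACTUAL, PROB_BAD, 0.5))
    print(best_cutoff(ACTUAL, PROB_BAD, [0.3, 0.4, 0.5, 0.6, 0.7]))
```

Sweeping the cutoff and recording the false positive and true positive rates at each value is also how the points of an ROC curve are produced.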
November 11, 2010 – Lab Day
November 18, 2010 – Deliverable 3 and 4: Preliminary Final Report and Presentations
Note: You will have to present me with functional code that I will run against a cleansed
sample of the data – I need to just push “enter” and get a final profitability figure. I will
tell you in advance which variables will be in the file.
December 2, 2010 – Presentations to faculty
DISCRETIZATION CODE…
Data jlp.discfile (keep = AGE TRADES RBAL MATCHKEY CRELIM DELQID);
Set JLP.creditmaster;
Run;
Data jlp.disctest;
Set jlp.discfile;
If DELQID = . then delete;
else If DELQID <3 then GOODBAD = 0;
Else GOODBAD = 1;
Run;
*/////////DISCRETIZATION ONE///////////
Code to develop ordinal categories of equal width;
Proc Means data=jlp.disctest min max mean median n MISSING;
Var Trades RBAL AGE;
Run;
Proc Freq data=jlp.disctest;
Tables GOODBAD;
Run;
Data jlp.disctest1;
set jlp.disctest;
if age = . then delete;
else if AGE <30 then ORDAGE = 1;
else if AGE <40 then ORDAGE = 2;
else if AGE <50 then ORDAGE = 3;
else if AGE <60 then ORDAGE = 4;
else if AGE <70 then ORDAGE = 5;
else if AGE <80 then ORDAGE = 6;
else ORDAGE = 7;
Run;
Proc Means data=jlp.disctest1;
Var GOODBAD;
Class ORDAGE;
Run;
*Code to create the odds ratio version of the variable;
Proc Summary data=jlp.disctest1 mean missing std;
Class ORDAGE;
Var ORDAGE GOODBAD;
Output out=summary mean=avg_indp avg_dep;
Run;
Data summary;
Set summary;
ODSAGE = log(avg_dep/(1-avg_dep));
ODSAGE1 = (avg_dep/(1-avg_dep));
Run;
Proc Print data=summary;
Run;
*Code to merge the summary file values back into the dataset;
Proc sort data=summary;
by ORDAGE;
Run;
Proc sort data=jlp.Disctest1;
by ORDAGE;
Run;
Data jlp.disctest2 (drop = _TYPE_ _FREQ_ avg_indp avg_dep);
Merge summary jlp.disctest1;
by ORDAGE;
Run;
*At this point, we have 3 versions of the AGE variable: raw (AGE), ordinal (ORDAGE), and odds-based (ODSAGE is the log odds, ODSAGE1 the odds);
Proc print data=jlp.disctest2;
Var AGE ORDAGE ODSAGE ODSAGE1;
Run;
Proc Means data=jlp.disctest2;
Var Age;
Class ODSAGE1;
Run;
*/////////DISCRETIZATION TWO///////////
Code to develop ordinal categories of equal Frequencies;
Proc Rank data=jlp.disctest2 out=ranked groups = 10 ties = high;
ranks rank;
var Age;
Run;
Proc Freq data=ranked;
Tables RANK;
Run;
Proc Summary data=ranked missing mean std min max;
class rank;
Var Age GOODBAD;
Output out = summaryrank mean = avg_indp avg_dep std = std_indp std_dep
min=min_indp min_dep
max=max_indp max_dep;
Run;
Proc Sort data=summaryrank;
by Rank;
Run;
Proc Print data=summaryrank (obs=1000);
Run;
Data RANKINGFILE;
Merge ranked summaryrank;
by Rank;
Run;
Proc Print data=rankingfile (obs=1000);
Run;
*Note that the file reference here will be whatever was output from the
Proc Summary;
**DO NOT CHANGE...EVER!;
proc sort data=summaryrank; by rank; run;
%let pval_col=0.15;
Data summaryrank;
Set summaryrank;
var="Age";
If _TYPE_ = 1;
rename _freq_ = numobs;
Run;
data temp;
set summaryrank;
by var;
retain _finit _rnk1 _rnk2 _freq1 _freq2 _aindp1 _aindp2
_mindp1 _mindp2 _mxindp1 _mxindp2 _adep1 _adep2
_sdep1 _sdep2 _sdenom1 _sdenom2;
*use _finit to flag when first initialization occurs;
*First initialization is when the first non-missing level is;
*encountered. it is from this point forward that the t-tests;
*take place, since missing levels are automatically outputted;
*as their own level;
if first.var then _finit=0;
if rank<=.Z then output;
else do;
if _finit=0 then do;
_finit=1;
_rnk1=rank;
_freq1=numobs;
_aindp1=numobs*avg_indp;
_mindp1=min_indp;
_mxindp1=max_indp;
_adep1=numobs*avg_dep;
if std_dep=. then _sdep1=0;
else _sdep1=(numobs-1)*std_dep*std_dep;
_sdenom1=numobs-1;
end;
else do;
_w1=(_sdep1/_sdenom1)/_freq1;
_w2=(std_dep**2)/numobs;
_df=((_w1+_w2)**2)/((_w1**2)/(_freq1-1)+(_w2**2)/(numobs-1));
_t=abs(_adep1/_freq1-avg_dep)/sqrt(_w1+_w2);
pvalue=(1-probt(_t,round(_df,1)))*2;
* If t-test is significant, then output cumulated variables;
* Otherwise continue cumulating;
if pvalue<=&pval_col then do;
* Store current row;
_rnk2=rank;
_freq2=numobs;
_aindp2=numobs*avg_indp;
_mindp2=min_indp;
_mxindp2=max_indp;
_adep2=numobs*avg_dep;
if std_dep=. then _sdep2=0;
else _sdep2=(numobs-1)*std_dep*std_dep;
_sdenom2=numobs-1;
* Switch cumulated variables with current row and output;
rank=_rnk1;
numobs=_freq1;
avg_indp=_aindp1/_freq1;
min_indp=_mindp1;
max_indp=_mxindp1;
avg_dep=_adep1/_freq1;
std_dep=sqrt(_sdep1/_sdenom1);
output;
* Set first variables to current row;
_rnk1=_rnk2;
_freq1=_freq2;
_aindp1=_aindp2;
_mindp1=_mindp2;
_mxindp1=_mxindp2;
_adep1=_adep2;
_sdep1=_sdep2;
_sdenom1=_sdenom2;
end;
else do;
_rnk1=rank;
_freq1=_freq1+numobs;
_aindp1=_aindp1+numobs*avg_indp;
_mxindp1=max_indp;
_adep1=_adep1+numobs*avg_dep;
if std_dep ne . then _sdep1=_sdep1+(numobs-1)*std_dep*std_dep;
_sdenom1=_sdenom1+numobs-1;
end;
*end;
* If last row for current variable, then output;
if last.var then do;
rank=_rnk1;
numobs=_freq1;
avg_indp=_aindp1/_freq1;
min_indp=_mindp1;
max_indp=_mxindp1;
avg_dep=_adep1/_freq1;
std_dep=sqrt(_sdep1/_sdenom1);
output;
end;
end;
end;
drop _finit _rnk1 _rnk2 _freq1 _freq2 _aindp1 _aindp2
_mindp1 _mindp2 _mxindp1 _mxindp2 _adep1 _adep2
_sdep1 _sdep2 _sdenom1 _sdenom2 _w1 _w2 _df _t /*pvalue*/;
run;
***********End of secret code.
You can resume normal procedures.;
Proc Print data=temp;
Run;
****************************************Stopped here;
Data temp;
set temp;
ODSEQAGE = log(avg_dep/(1-avg_dep));
ODSEQAGE1 = (avg_dep/(1-avg_dep));
Run;
Proc sort data=temp;
by rank;
Run;
Proc Print data=temp;
Run;
Proc Freq data=ranked;
Tables Rank;
Run;
Data ranked1;
Set ranked;
if rank = 0 then rank = 0;
else if rank = 1 then rank = 3;
else if rank = 2 then rank = 3;
else if rank = 3 then rank = 3;
else if rank = 4 then rank = 4;
else if rank = 5 then rank = 7;
else if rank = 6 then rank = 7;
else if rank = 7 then rank = 7;
else if rank = 8 then rank = 9;
else if rank = 9 then rank = 9;
else rank=rank;
Run;
Proc sort data=ranked1;
by rank;
Run;
Proc sort data=temp;
by rank;
Run;
Data jlp.disctest3;
Merge ranked1 temp;
by rank;
Run;
Proc means data=jlp.disctest3;
var ODSEQAGE ODSEQAGE1;
class Rank;
Run;
*****At this point, you can assign the ranks as 1, 2, 3...;
Proc contents data=jlp.disctest3;
Run;
Proc Freq data=jlp.disctest3;
Tables Rank;
Run;
Data jlp.disctest4;
set jlp.disctest3;
if rank = 0 then ORDEQAge = 0;
else if rank = 3 then ORDEQAge = 1;
else if rank = 4 then ORDEQAge = 2;
else if rank = 7 then ORDEQAge = 3;
else if rank = 9 then ORDEQAge = 4;
else rank = rank;
run;
*Now take a look at the new variations of AGE;
Proc Print data=jlp.disctest4 (obs=1000);
Var Age ORDAGE ODSAGE ODSAGE1 ORDEQAGE ODSEQAGE ODSEQAGE1;
Run;
Proc means data=jlp.disctest4 Missing;
Var Age;
ID ORDAGE ORDEQAGE;
Run;
*Now...what are you going to do about the missing obs?;