Generalized Additive Models for
Binary Data
Safi, Seyed Roozmehr
Wang, Ying
ISQS5349, 2013 Spring
Recall what we learned
• General linear model
$y \sim N(\mu, \sigma^2)$
$E(y) = \mu = b_0 + b_1 X_1 + b_2 X_2 + \dots + b_p X_p$, or $\mu = X\beta$
• Generalized linear model
$y \sim \mathrm{ExpoFam}(\mu, \text{etc.})$
$g(\mu) = b_0 + b_1 X_1 + b_2 X_2 + \dots + b_p X_p$, or $g(\mu) = X\beta = \eta$
$E(y) = \mu = g^{-1}(\eta)$
• Additive model
$E(y) = \mu = b_0 + f_1(x_1) + f_2(x_2) + \dots + f_p(x_p)$
GAM
• Generalized additive model (GAM)
$y \sim \mathrm{ExpoFam}(\mu, \text{etc.})$
$g(\mu) = b_0 + f_1(x_1) + f_2(x_2) + \dots + f_p(x_p)$
$E(y) = \mu = g^{-1}(\eta)$
• Pros: allows nonparametric (smooth) effects of the predictors
• Cons: risk of overfitting
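The relation between the mean and the predictor used in the GLM and GAM definitions above (g applied to the mean gives the predictor, and the inverse of g recovers the mean) can be illustrated with a short Python sketch. The dictionary name `LINKS`, the helper names, and the choice of three links are assumptions made for illustration; they do not come from the slides.

```python
import math

# Illustrative (link, inverse link) pairs; names are assumptions for this sketch.
LINKS = {
    # identity link, commonly paired with normal responses
    "identity": (lambda mu: mu, lambda eta: eta),
    # logit link, commonly paired with binary responses
    "logit": (lambda mu: math.log(mu / (1.0 - mu)),
              lambda eta: 1.0 / (1.0 + math.exp(-eta))),
    # log link, commonly paired with count responses
    "log": (math.log, math.exp),
}

def mean_from_predictor(eta, link="logit"):
    """Return mu = g^{-1}(eta) for the named link."""
    _, inverse = LINKS[link]
    return inverse(eta)

def predictor_from_mean(mu, link="logit"):
    """Return eta = g(mu) for the named link."""
    g, _ = LINKS[link]
    return g(mu)
```

Applying `predictor_from_mean` and then `mean_from_predictor` with the same link returns the original mean, up to floating-point error.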
Ordinary logistic regression
GAM for binary data
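The bodies of the two slides above are not preserved in this text. As a hedged sketch of the contrast their titles name, assuming the usual logit link: ordinary logistic regression makes the log-odds linear in each predictor, while a GAM for binary data replaces each linear term with a smooth function. The function names, coefficients, and example smooths below are hypothetical and are not taken from the presentation.

```python
import math

def inv_logit(eta):
    """Inverse logit link: maps a predictor value eta to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))

def logistic_prob(b0, coefs, x):
    """Ordinary logistic regression: eta = b0 + sum_j b_j * x_j."""
    eta = b0 + sum(b * xj for b, xj in zip(coefs, x))
    return inv_logit(eta)

def gam_prob(b0, smooths, x):
    """GAM for binary data: eta = b0 + sum_j f_j(x_j), each f_j a smooth function."""
    eta = b0 + sum(f(xj) for f, xj in zip(smooths, x))
    return inv_logit(eta)

# Hypothetical smooths: a linear effect and a U-shaped effect centered at 3.
smooths = [lambda v: 0.5 * v, lambda v: (v - 3.0) ** 2]

p_linear = logistic_prob(-1.0, [0.5, 0.0], [2.0, 3.0])   # eta = -1 + 1 + 0
p_additive = gam_prob(-1.0, smooths, [2.0, 3.0])         # eta = -1 + 1 + 0
p_far = gam_prob(-1.0, smooths, [2.0, 5.0])              # eta = -1 + 1 + 4
```

When every smooth is a straight line through the origin, `gam_prob` reduces to `logistic_prob`; the U-shaped second term is the kind of effect a single linear coefficient cannot represent.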
A Demonstration
Data: From an online B2B exchange (1220 companies).
Purpose: To distinguish cheaters from good members.
Predictors:
• Industry: production, distribution, investment…
• Number of months since last membership renewal
• Number of days on which the member logged in during the past 60
days
• Years since joining: 1 to 10
• Membership renewal duration
• Type of service bought: standard, limited edition…
Target: whether the member has cheated beyond a threshold
(event = 1) or not (event = 0).
Data Preparation
• Selecting a subset of companies that have
been members for at least a year, resulting in
648 records.
• Deciding on a threshold for defining
“cheaters” (at least once a year, at least once
every two years…).
• Handling missing values (185 missing values,
but PROC GAM used all data)
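The preparation steps above could be sketched in Python as follows. The record layout, the field names (`years_member`, `cheat_count`, `login_days`), the sample records, and the one-incident-per-year threshold are all assumptions made for illustration; the slides do not state the final threshold or how the source data were stored.

```python
# Hypothetical member records; None marks a missing value.
members = [
    {"id": 1, "years_member": 3.0, "cheat_count": 4, "login_days": 12},
    {"id": 2, "years_member": 0.5, "cheat_count": 1, "login_days": 30},
    {"id": 3, "years_member": 2.0, "cheat_count": 1, "login_days": None},
    {"id": 4, "years_member": 5.0, "cheat_count": 0, "login_days": 7},
]

def is_cheater(cheat_count, years_member, min_rate=1.0):
    """Return 1 if incidents per year reach min_rate, else 0."""
    return int(cheat_count / years_member >= min_rate)

# Step 1: keep companies that have been members for at least one year.
eligible = [m for m in members if m["years_member"] >= 1.0]

# Step 2: define the binary target from the chosen threshold.
for m in eligible:
    m["cheat1"] = is_cheater(m["cheat_count"], m["years_member"])

# Step 3: count missing predictor values among the eligible records.
n_missing = sum(1 for m in eligible if m["login_days"] is None)
```

In this sketch, `min_rate=1.0` corresponds to "at least once a year" and `min_rate=0.5` to "at least once every two years", so the effect of the threshold choice on the target can be compared directly.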
PROC GAM
• Dependent variable: cheat1 (binary)
• The 3 Predictors:
– Industry type (1 to 8)
– Time elapsed since last membership renewal
– Login frequency
Output and Data Summary
Output: the parametric part
• The parametric terms are significant for two
variables and not significant for one variable.
Results, the non-parametric terms
• Non-parametric terms are significant for all three
variables.
Thank you. Questions?