Generalized Additive Models for Binary Data
Safi, Seyed Roozmehr; Wang, Ying
ISQS5349, Spring 2013

Recall what we learned
• General linear model
  $y \sim N(\mu, \sigma^2)$
  $E(y) = \mu = b_0 + b_1 X_1 + b_2 X_2 + \dots + b_p X_p$, or $\mu = X\beta$
• Generalized linear model
  $y \sim \mathrm{ExpoFam}(\mu, \text{etc.})$
  $g(\mu) = b_0 + b_1 X_1 + b_2 X_2 + \dots + b_p X_p$, or $g(\mu) = X\beta$
  $E(y) = \mu = g^{-1}(X\beta)$
• Additive model
  $E(y) = b_0 + f_1(x_1) + f_2(x_2) + \dots + f_p(x_p)$

GAM
• Generalized additive model (GAM)
  $y \sim \mathrm{ExpoFam}(\mu, \text{etc.})$
  $g(\mu) = b_0 + f_1(x_1) + f_2(x_2) + \dots + f_p(x_p)$
  $E(y) = \mu = g^{-1}\big(b_0 + f_1(x_1) + f_2(x_2) + \dots + f_p(x_p)\big)$
• Pros: allows nonparametric (flexible, possibly nonlinear) effects of the predictors
• Cons: risk of overfitting

Ordinary logistic regression

GAM for binary data

A Demonstration
Data: From an online B2B exchange (1220 companies).
Purpose: To distinguish cheaters from good members.
Predictors:
• Industry: production, distribution, investment…
• Number of months since last membership renewal.
• Number of days on which the member logged in during the past 60 days.
• Years since joined: 1 to 10 years.
• Membership renewal duration.
• Type of service bought: standard, limited edition…
Target: Whether the member has cheated beyond a threshold (Event = 1) or not (Event = 0).

Data Preparation
• Selecting a subset of companies that have been members for at least a year, resulting in 648 records.
• Deciding on a threshold for defining “cheaters” (at least once a year, at least once every two years…).
• Handling missing values (185 missing values, but PROC GAM used all data).

PROC GAM
• Dependent variable: cheat1 (binary)
• The 3 predictors:
  – Industry type (1 to 8)
  – Time elapsed since last membership renewal
  – Login frequency

Output and Data Summary

Output: the parametric part
• The parametric terms are significant for two variables and not significant for one variable.

Results: the nonparametric terms
• The nonparametric terms are significant for all three variables.

Thank you. Questions?
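
Supplementary sketch: ordinary logistic regression vs. GAM for binary data

The two models differ only in the predictor passed through the inverse logit link: a linear combination of the predictors for ordinary logistic regression, a sum of smooth functions for the GAM. The Python sketch below illustrates that difference under stated assumptions. The coefficients, the smooth functions, and the example member are hypothetical values chosen for illustration. They are not estimates from the B2B exchange data and this is not the PROC GAM fitting procedure.

```python
import math


def logit(mu):
    """Link function g(mu) = log(mu / (1 - mu)) for a binary response."""
    return math.log(mu / (1.0 - mu))


def inv_logit(eta):
    """Inverse link g^{-1}(eta): maps any real predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))


def logistic_regression_prob(x, b0, b):
    """Ordinary logistic regression: eta = b0 + b1*x1 + ... + bp*xp."""
    eta = b0 + sum(bj * xj for bj, xj in zip(b, x))
    return inv_logit(eta)


def gam_prob(x, b0, smooths):
    """GAM for binary data: eta = b0 + f1(x1) + ... + fp(xp)."""
    eta = b0 + sum(f(xj) for f, xj in zip(smooths, x))
    return inv_logit(eta)


# Hypothetical member: 6 months since last renewal, 20 login days in 60 days.
x = [6.0, 20.0]
b0 = -1.0            # hypothetical intercept
b = [0.10, -0.05]    # hypothetical slopes

# With linear f_j, the GAM reduces to ordinary logistic regression.
linear_smooths = [lambda m: 0.10 * m, lambda d: -0.05 * d]
p_lin = logistic_regression_prob(x, b0, b)
p_gam_linear = gam_prob(x, b0, linear_smooths)

# With curved f_j (hypothetical shapes), the GAM can represent nonlinear effects.
curved_smooths = [
    lambda m: 0.5 * math.sin(m / 3.0),
    lambda d: -0.002 * (d - 30.0) ** 2,
]
p_gam_curved = gam_prob(x, b0, curved_smooths)

print("logistic regression P(Event = 1):", round(p_lin, 4))
print("GAM with linear f_j P(Event = 1):", round(p_gam_linear, 4))
print("GAM with curved f_j P(Event = 1):", round(p_gam_curved, 4))
```

In practice the functions f_j are not specified by hand as they are here. They are estimated from the data by a smoother, and that flexibility is where the risk of overfitting comes from.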