A Two-Stage Ensemble of Classification, Regression, and Ranking Models for Advertisement Ranking

Presenter: Prof. Shou-de Lin

Team NTU members: Kuan-Wei Wu, Chun-Sung Ferng, Chia-Hua Ho, An-Chun Liang, Chun-Heng Huang, Wei-Yuan Shen, Jyun-Yu Jiang, Ming-Hao Yang, Ting-Wei Lin, Ching-Pei Lee, Perng-Hwa Kung, Chin-En Wang, Ting-Wei Ku, Chun-Yen Ho, Yi-Shu Tai, I-Kuei Chen, Wei-Lun Huang, Che-Ping Chou, Tse-Ju Lin, Han-Jay Yang, Yen-Kai Wang, Cheng-Te Li, Prof. Hsuan-tien Lin

About Team NTU (the catch-up version)
• A team from the EECS college of National Taiwan University
• This year's team is led by Prof. Hsuan-tien Lin and Prof. Shou-de Lin
• We have a course aimed at training students to analyze real-world, large-scale datasets.
– Every year we recruit new students to participate in this course as well as in the KDD Cup.
– The majority of our students are undergraduates; they are inexperienced, but they are smart and quick learners.
• Since 2008, the NTU team has won four KDD Cup championships (and one 3rd place) in the past five years.

Facts about Track 2
• Predict the click-through rate (#click / #impression) of ads on a search engine
• 155,750,158 instances in training and 20,297,594 instances in testing
• Each training instance can be viewed as a vector (#click, #impression, DisplayURL, AdID, AdvertiserID, Depth, Position, QueryID, KeywordID, TitleID, DescriptionID, UserID)
• Testing instances share the same format except for the lack of #click and #impression
• Gender and age of users, as well as token information, are also provided
• Goal: maximize AUC on the test set

Framework for Track 2
• Individual models in five different categories
• Validation set blending to combine a portion of the models, boosting performance and enhancing diversity
• Test set ensemble to aggregate the high-performance blending models into our final solution
• This three-stage framework was also exploited successfully in our solutions for KDD Cup 2011

[Framework diagram: Classification Models / Regression Models / Ranking Models / Combined Regression and Ranking Models / Matrix Factorization Models → Validation Set Blending Models → Test Set Ensemble → Result]

Validation Set
• We tried several strategies to create the validation set, but none of them represents the testing performance more faithfully than the very naïve one below.
• We divide the training data into sub-train and validation sets (sub-train : validation = 10 : 1)
– The models' performance on the validation set and on the test set is slightly inconsistent; we think this is caused by the different ratios of cold-start users in each set (6.9% in the validation set, but 57.7% in the test set)
• Our conclusion: it is non-trivial to create a validation set on which a model's performance is consistent with its performance on the test set

General Features
• We create six categories of features, and each individual model may use a different subset of them
– Categorical features
– Sparse features
– Click-through rate features
– ID raw value features
– Other numerical features
– Token similarity features
• In Track 2, we find no "killer features" such as the sequential features in Track 1.
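Below is a rough, runnable illustration of the naïve 10 : 1 split described above and of how AUC can be computed from aggregated (#click, #impression) rows. It is a minimal sketch only: the slides do not specify how the rows were partitioned, so the random row split, scikit-learn's roc_auc_score, and the toy numbers are all assumptions.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def split_sub_train_validation(rows, ratio=10, seed=0):
    """Randomly split training rows into sub-train : validation = ratio : 1."""
    rng = np.random.default_rng(seed)
    is_sub_train = rng.random(len(rows)) < ratio / (ratio + 1.0)
    return rows[is_sub_train], rows[~is_sub_train]

def expanded_auc(clicks, impressions, scores):
    """AUC when each aggregated row stands for #click positive and
    (#impression - #click) negative instances that share one score."""
    labels, expanded_scores = [], []
    for c, n, s in zip(clicks, impressions, scores):
        labels.extend([1] * int(c) + [0] * int(n - c))
        expanded_scores.extend([s] * int(n))
    return roc_auc_score(labels, expanded_scores)

# Toy usage on made-up rows.
rows = np.arange(110)
sub_train, validation = split_sub_train_validation(rows)
print(len(sub_train), len(validation))            # roughly 100 vs 10
print(expanded_auc([2, 0], [2, 4], [0.9, 0.1]))   # 1.0: the clicked ad scored higher
```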
Categorical & Sparse Features • Categorical features – Only for Naïve Bayes – We treat IDs such as UserID, AdID as categorical features directly • Sparse binary features – Expand categorical features into binary indicator features – Most of the features=0 Click-through Rate Features • For each category, we generate the average click-through rate as a one-dimensional feature • For example, for each AdID, we compute the average clickthrough rate for all instances of the same AdID as one feature. • To handle biased CRT due to insufficient statistics, we apply additive smoothing: – Smoothing significantly boosts the performance # click , we use 0.05 and 75 in our experiment # impression ID Raw Value • We observed numerical value of ID contain some information • For example, the figure below plots the total #impression for each KeywrodID, and shows that #impressions decrease when value of KeywordID increase • We guess the ID values may contain time information in it Other Numerical Features • Features for position & depth – ad’s position – depth – relative position, (depth-position)/depth • Number of tokens for QueryID, KeywordID, TitleID and DescriptionID • Weighted number of tokens for QueryID, KeywordID, TitleID and DescriptionID, each token is weighted by its IDF value • Number of impression of categorical features Token’s Similarity Features • Tokens similarity between QueryID, KeywordID, TitleID and DescriptionID as features. – C(4,2)=6 pairs of similarity as 6 features – cosine similarity between tf-idf vector of tokens – alternatively, we use LDA model to extract topics for QueryID, KeywordID, TitleID and DescriptionID, and then generate cosine similarity between latent topics Individual Models • The click-through rate prediction problem is modeled as classification, regression and ranking problems • For each strategy, we exploit several models and most of them reach competitive performance Classification Models Combined Regression and Ranking Models Regression Models Matrix Factorization Models Ranking Models Validation Set Blending Models Test Set Ensemble Result Individual Models: Classification Models (1) • We split each training instance into #click positive samples and (#impression-#click ) negative samples • We apply two classification methods – Naïve Bayes – Logistic Regression Classification Models Combined Regression and Ranking Models Regression Models Matrix Factorization Models Ranking Models Validation Set Blending Models Test Set Ensemble Result Individual Models: Classification Models (2) • Naïve Bayes – Additive smoothing and Good-Turing are applied with promising results – The best AUC is 0.7760 on the public test set • Logistic Regression – Train on sampled subset to reduce the training time – Separate users into two group (userID=0 or not), train two models on for these groups and then combine the results – This model achieve 0.7888 on the public Test set Individual Models: Regression Models (1) • For the regression models, we use CTR # click as target # impression to predict • Two methods in this category – Linear Regression – Support Vector Regression Classification Models Combined Regression and Ranking Models Regression Models Matrix Factorization Models Ranking Models Validation Set Blending Models Test Set Ensemble Result Individual Models: Regression Models (2) • Linear Regression – degree-2 polynomial expansion on numerical value features – 0.7352 AUC on the public test set • Support vector Regression – Use degree-2 polynomial expansion – The best AUC of this 
Individual Models: Ranking Models (1)
• We split each training instance into #click positive samples and (#impression − #click) negative samples
• Optimize pairwise ranking
• Two methods in this category
– Rank Logistic Regression
– RankNet

Individual Models: Ranking Models (2)
• Rank Logistic Regression
– Optimized by Stochastic Gradient Descent (SGD)
– The best AUC is 0.722 on the public test set
• RankNet
– Optimizes the cross-entropy loss C_ij = −r̂_ij + log(1 + e^{r̂_ij}) with a neural network, where r̂_ij = r̂_i − r̂_j and r̂_i = Σ_{j=1..H} w_j^(2) · tanh(w_j^(1)ᵀ x_i)
– Uses SGD to update the parameters
– The best result is 0.7577 on the public test set
Surprisingly, the ranking-based models do not outperform the other models, perhaps because they are more complicated to train and their parameters are harder to tune.

Individual Models: Combined Regression and Ranking Models (1)
• We also explore another model that combines the ranking loss and the regression loss
• In this model we try to optimize a weighted combination of H and L, where H is the ranking loss and L is the regression loss
• Solved by an SGD-based SVM solver; the best AUC is 0.7819

Individual Models: Matrix Factorization Models (1)
• We also have feature-based factorization models, which exploit latent information in the data
• Two different matrix factorization models are provided: one optimizes a regression loss, and the other optimizes a ranking loss

Individual Models: Matrix Factorization Models (2)
• Regression-based
– Divide the features into two groups: α as the user-side features and β as the item-side features
– The prediction for an instance is r̂(x_i) = Σ_j w_j^(u) α_j + Σ_j w_j^(i) β_j + bias + (Σ_j p_j α_j)ᵀ (Σ_j q_j β_j)
– Minimize RMSE
– The best AUC is 0.7776 on the public test set

Individual Models: Matrix Factorization Models (3)
• Ranking-based
– The prediction for an instance is r̂(x_i) = Σ_{j=1..B} w_j x_ij + Σ_{j=1..B} Σ_{k=j+1..B} ⟨p_j, p_k⟩ x_ij x_ik
– Features can belong to α, β, or both
– Optimize pairwise ranking: minimize Σ_{i=1..N} Σ_{j=1..N} L(r̂(x_i) − r̂(x_j)) over positive-negative pairs, where L(δ) = ln(1 + exp(−δ))
– The best AUC is 0.7968 on the public test set

Validation Set Blending (1)
• Blend models and additional features non-linearly
• Re-blending to gain additional enhancement
• Four models for blending
– Support Vector Regression (SVR)
– RankNet (RN)
– Combined Regression and Ranking Models (CRR)
– LambdaMART (LM)
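Looking back at the ranking-based models above, they all come down to a pairwise loss of the form L(δ) = ln(1 + exp(−δ)) on score differences. Below is a minimal NumPy sketch with a plain linear scorer trained by SGD; the pair-sampling scheme, learning rate, and toy data are assumptions rather than the team's settings.

```python
import numpy as np

def sgd_pairwise_rank(X_pos, X_neg, epochs=5, lr=0.1, seed=0):
    """Linear scorer w.x trained with SGD on the pairwise logistic loss
    L(delta) = ln(1 + exp(-delta)), delta = w.x_pos - w.x_neg."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X_pos.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(X_pos)):
            j = rng.integers(len(X_neg))              # sample one negative per positive
            delta = (X_pos[i] - X_neg[j]) @ w
            grad = -1.0 / (1.0 + np.exp(delta))       # dL/d(delta)
            w -= lr * grad * (X_pos[i] - X_neg[j])    # pushes delta upward
    return w

# Toy usage: clicked instances tend to have a larger first feature.
rng = np.random.default_rng(1)
X_pos = rng.normal(1.0, 1.0, size=(200, 3))
X_neg = rng.normal(0.0, 1.0, size=(200, 3))
print(sgd_pairwise_rank(X_pos, X_neg))   # first weight comes out clearly positive
```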
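And a rough sketch of what the validation set blending could look like, using SVR (one of the four blenders listed above) on top of the individual models' validation scores plus extra features; the data shapes, targets, and hyperparameters are placeholders, not the team's actual setup.

```python
import numpy as np
from sklearn.svm import SVR

# Assumed stage-1 output: each individual model has scored the validation
# and test sets (one column per model); random numbers stand in here.
rng = np.random.default_rng(0)
val_scores, test_scores = rng.random((3000, 6)), rng.random((1000, 6))
val_extra, test_extra = rng.random((3000, 3)), rng.random((1000, 3))   # extra features
val_ctr = rng.random(3000) * 0.1           # #click / #impression on validation rows

# Stage 2: fit a non-linear blender on the validation set only,
# then use it to score the test set.
blender = SVR(kernel="rbf", C=1.0)
blender.fit(np.hstack([val_scores, val_extra]), val_ctr)
blended_test_scores = blender.predict(np.hstack([test_scores, test_extra]))
```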
Validation Set Blending (2)
• Strategies for model selection
1. By the difference between validation AUC and test AUC
2. Or select a diverse model set by hand
• Different score types
– Raw score
– Normalized score
– Ranked score

Performance of the blending models:
Model                     Public Test Set AUC
SVR                       0.8038
RN (with re-blending)     0.8062
CRR (with re-blending)    0.8051
LM (with re-blending)     0.8060

Test Set Ensemble (1)
• Ensemble the selected models from validation set blending
• Combine the models linearly
• The weights of the linear combination depend on the AUC on the public test set
• It achieves 0.8064 on the public test set

Final Result
• We apply a uniform average over the top five models on the leaderboard to aggregate our final solution. It achieves 0.8089 on the private test set (0.8070 on the public test set), which outperforms all the other competitors in this competition.

Take Home Points
• The main reasons for our success:
– Tried diverse models (ranking, classification, regression, factorization)
– Novel ways of feature engineering (e.g., smoothing, latent features from LDA, ID values, etc.)
– Complex two-stage blending models
– Perseverance (we have probably tried more failed models than effective ones)

Acknowledgement
• We truly thank
– the organizers for designing a successful competition
– the NTU EECS college, the CSIE department, and the INTEL-NTU center for their support
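As a small appendix-style illustration of the last two steps (the linear test set ensemble and the final uniform average), here is a minimal sketch; the weighting scheme shown is only an assumption of how AUC-based weights might be plugged in, and the score matrix is a toy placeholder.

```python
import numpy as np

def linear_ensemble(score_matrix, weights=None):
    """Combine model scores column-wise; uniform weights give the plain
    average used for the final submission."""
    scores = np.asarray(score_matrix, dtype=float)
    if weights is None:                                # uniform average
        weights = np.ones(scores.shape[1])
    weights = np.asarray(weights, dtype=float)
    return scores @ (weights / weights.sum())

# Toy usage: three blended models scoring four test instances.
test_scores = np.array([[0.91, 0.88, 0.93],
                        [0.12, 0.20, 0.15],
                        [0.55, 0.60, 0.58],
                        [0.40, 0.35, 0.42]])
print(linear_ensemble(test_scores))                      # uniform average
print(linear_ensemble(test_scores, weights=[3, 1, 2]))   # e.g. AUC-based weights
```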