A Two-Stage Ensemble of Classification, Regression, and Ranking Models for Advertisement Ranking
Presenter: Prof. Shou-de Lin
Team NTU members: Kuan-Wei Wu, Chun-Sung Ferng, Chia-Hua Ho, An-Chun Liang, Chun-Heng Huang, Wei-Yuan Shen, Jyun-Yu Jiang, Ming-Hao Yang, Ting-Wei Lin, Ching-Pei Lee, Perng-Hwa Kung, Chin-En Wang, Ting-Wei Ku, Chun-Yen Ho, Yi-Shu Tai, I-Kuei Chen, Wei-Lun Huang, Che-Ping Chou, Tse-Ju Lin, Han-Jay Yang, Yen-Kai Wang, Cheng-Te Li, Prof. Hsuan-tien Lin
About Team NTU (the catch-up version)
• A team from the EECS college of National Taiwan University
• This year’s team is led by Prof. Hsuan-tien Lin and Prof. Shou-de Lin
• We offer a course aimed at training students to analyze real-world, large-scale datasets
– Every year we recruit new students to participate in this course as well as in the KDD Cup
– Most of our students are undergraduates; they are inexperienced but smart and quick learners
• Since 2008, the NTU team has won the KDD Cup championship 4 times (plus a 3rd place) in the past 5 years
Facts about Track 2
• Predict the click-through rate (#click/#impression) of ads on a search engine
• 155,750,158 instances in training and 20,297,594
instances in testing
• Each training instance can be viewed as a vector (#click,
#impression, DisplayURL, AdID, AdvertiserID, Depth,
Position, QueryID, KeywordID, TitleID, DescriptionID,
UserID)
• Testing instances share the same format except for the lack of #click and #impression
• Users’ gender and age, as well as token information, are also provided
• Goal: Maximize AUC on testing
Framework for Track 2
• Individual models in five different categories
• Validation set blending to combine a portion of the models, boosting performance and enhancing diversity
• Test set ensemble to aggregate the high-performance blending models into our final solution
• This 3-stage framework was also exploited successfully in our solutions for KDD Cup 2011
[Framework diagram: classification models, regression models, combined regression and ranking models, ranking models, and matrix factorization models → validation set blending models → test set ensemble → result]
Validation Set
• We tried several strategies to create the validation set, but none of them represents testing performance more faithfully than the very naïve one below.
• We divide the training data into sub-train and validation sets (sub-train : validation = 10 : 1)
– Models’ performance on the validation set and the test set is slightly inconsistent; we think this is because of the different ratios of cold-start users in each set (6.9% in the validation set, but 57.7% in the test set)
• Our conclusion: It is non-trivial to create a validation
set on which the model’s performance is consistent
with that of the testing dataset
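A minimal sketch of the naïve split above: a 10:1 partition of the training rows into sub-train and validation sets. The slide does not say how rows were assigned, so a random split, the function name, and the row counts below are illustrative assumptions.

```python
import numpy as np

def split_train_validation(n_rows, ratio=10, seed=0):
    """Randomly split row indices into sub-train and validation at roughly ratio:1."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_rows)
    n_valid = n_rows // (ratio + 1)
    return idx[n_valid:], idx[:n_valid]  # sub-train indices, validation indices

sub_train_idx, valid_idx = split_train_validation(n_rows=1_100)
print(len(sub_train_idx), len(valid_idx))  # roughly 10 : 1
```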
General Features
• We create 6 categories of features, and each individual model
may use different subsets of them
– Categorical features
– Sparse features
– Click-through rate features
– ID raw value features
– Other numerical features
– Token similarity features
• In Track 2, we found no ‘killer features’ such as the sequential features in Track 1.
Categorical & Sparse Features
• Categorical features
– Only for Naïve Bayes
– We treat IDs such as UserID, AdID as categorical features
directly
• Sparse binary features
– Expand categorical features into binary indicator features
– Most of the feature values are 0
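A minimal sketch of expanding one categorical ID column into sparse binary indicator features, assuming a scipy CSR representation; the column choice and ID values are illustrative.

```python
from scipy.sparse import csr_matrix

def one_hot_sparse(ids):
    """Expand a categorical ID column into a sparse binary indicator matrix."""
    col_of = {v: j for j, v in enumerate(sorted(set(ids)))}
    rows = list(range(len(ids)))
    cols = [col_of[v] for v in ids]
    data = [1] * len(ids)
    return csr_matrix((data, (rows, cols)), shape=(len(ids), len(col_of)))

X_user = one_hot_sparse([10024, 77, 10024, 0])  # e.g. a UserID column
print(X_user.toarray())  # most entries are 0
```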
Click-through Rate Features
• For each category, we generate the average click-through rate
as a one-dimensional feature
• For example, for each AdID, we compute the average click-through rate over all instances with the same AdID as one feature.
• To handle biased CTR estimates due to insufficient statistics, we apply additive smoothing:
– Smoothing significantly boosts the performance
$\mathrm{CTR}_{\text{smoothed}} = \dfrac{\#\mathrm{click} + \alpha\beta}{\#\mathrm{impression} + \beta}$, and we use $\alpha = 0.05$ and $\beta = 75$ in our experiment
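A minimal sketch of the smoothed CTR feature with the α and β values from the slide, grouped here by AdID as an example; the same computation applies to any ID column.

```python
from collections import defaultdict

def smoothed_ctr(rows, alpha=0.05, beta=75.0):
    """rows: iterable of (ad_id, clicks, impressions) tuples.
    Returns {ad_id: (#click + alpha*beta) / (#impression + beta)}."""
    clicks, imps = defaultdict(float), defaultdict(float)
    for ad_id, c, m in rows:
        clicks[ad_id] += c
        imps[ad_id] += m
    return {k: (clicks[k] + alpha * beta) / (imps[k] + beta) for k in imps}

print(smoothed_ctr([(1, 3, 100), (1, 0, 50), (2, 0, 5)]))
# AdID 2 has very few impressions, so its smoothed CTR stays close to alpha
```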
ID Raw Value
• We observed that the numerical values of IDs contain some information
• For example, plotting the total #impression for each KeywordID shows that #impression decreases as the value of KeywordID increases
• We guess the ID values may contain time information
Other Numerical Features
• Features for position & depth
– ad’s position
– depth
– relative position, (depth-position)/depth
• Number of tokens for QueryID, KeywordID, TitleID and
DescriptionID
• Weighted number of tokens for QueryID, KeywordID, TitleID and DescriptionID, where each token is weighted by its IDF value
• Number of impressions for each categorical feature
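A minimal sketch of two of the numerical features above: the relative position and the IDF-weighted token count. The token list, document frequencies, and the exact IDF formula are illustrative assumptions.

```python
import math

def relative_position(depth, position):
    """Relative position of the ad: (depth - position) / depth."""
    return (depth - position) / depth

def weighted_token_count(tokens, doc_freq, n_docs):
    """Sum of IDF weights over the tokens of a query/keyword/title/description."""
    return sum(math.log(n_docs / (1 + doc_freq.get(t, 0))) for t in tokens)

print(relative_position(depth=3, position=1))                        # 0.666...
print(weighted_token_count(["12", "307"], {"12": 40, "307": 3}, 1000))
```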
Token Similarity Features
• Token similarity among QueryID, KeywordID, TitleID and DescriptionID as features
– C(4,2) = 6 pairs of similarity as 6 features
– Cosine similarity between tf-idf vectors of tokens
– Alternatively, we use an LDA model to extract topics for QueryID, KeywordID, TitleID and DescriptionID, and then compute cosine similarity between the latent topic vectors
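A minimal sketch of one of the six pairwise similarities, here between query tokens and keyword tokens, using scikit-learn tf-idf vectors and cosine similarity; tokens in the data are hashed IDs, and the values below are illustrative. The LDA variant would replace the tf-idf vectors with per-document topic distributions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def token_cosine(query_tokens, keyword_tokens):
    """Cosine similarity between the tf-idf vectors of two token lists."""
    docs = [" ".join(query_tokens), " ".join(keyword_tokens)]
    tfidf = TfidfVectorizer().fit_transform(docs)
    return cosine_similarity(tfidf[0], tfidf[1])[0, 0]

print(token_cosine(["123", "456", "789"], ["123", "789", "555"]))
```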
Individual Models
• The click-through rate prediction problem is modeled
as classification, regression and ranking problems
• For each strategy, we exploit several models and
most of them reach competitive performance
Individual Models: Classification Models (1)
• We split each training instance into #click positive samples
and (#impression - #click) negative samples
• We apply two classification methods
– Naïve Bayes
– Logistic Regression
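A minimal sketch of the instance expansion used for the classification (and ranking) models: one aggregated row becomes #click positive and (#impression - #click) negative binary samples. Field names are illustrative; in practice one would usually keep a single weighted row instead of literally replicating it.

```python
def expand_instance(features, clicks, impressions):
    """Yield (features, label): `clicks` positives, then the remaining negatives."""
    for _ in range(clicks):
        yield features, 1
    for _ in range(impressions - clicks):
        yield features, 0

samples = list(expand_instance({"AdID": 42, "Depth": 2, "Position": 1},
                               clicks=2, impressions=5))
print(len(samples))  # 5 samples: 2 positive, 3 negative
```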
Individual Models: Classification Models (2)
• Naïve Bayes
– Additive smoothing and Good-Turing are applied with
promising results
– The best AUC is 0.7760 on the public test set
• Logistic Regression
– Train on sampled subset to reduce the training time
– Separate users into two groups (UserID = 0 or not), train one model for each group, and then combine the results
– This model achieves 0.7888 on the public test set
Individual Models: Regression Models (1)
• For the regression models, we use $\mathrm{CTR} = \dfrac{\#\mathrm{click}}{\#\mathrm{impression}}$ as the target to predict
• Two methods in this category
– Linear Regression
– Support Vector Regression
Individual Models: Regression Models (2)
• Linear Regression
– degree-2 polynomial expansion on numerical value
features
– 0.7352 AUC on the public test set
• Support Vector Regression
– Use degree-2 polynomial expansion
– The best AUC of this model is 0.7705 on the public test set
Individual Models: Ranking Models (1)
• We split each training instance into #click positive samples
and (#impression - #click) negative samples
• Optimize pairwise ranking
• Two methods in this category
– Rank Logistic Regression
– RankNet
Individual Models: Ranking Models (2)
• Rank Logistic Regression
– Optimizes the pairwise logistic ranking loss with stochastic gradient descent (SGD)
– The best AUC is 0.722 on the public test set
• RankNet
– Optimizes the cross-entropy loss $C_{ij} = -\hat{r}_{ij} + \log(1 + e^{\hat{r}_{ij}})$ with a neural network, where $\hat{r}_{ij} = \sigma(\hat{r}_i - \hat{r}_j)$ and $\hat{r}_i = \sum_{j=1}^{H} w_j^{(2)} \tanh\big(\sum_k w_{jk}^{(1)T} x_{ik}\big)$
– Uses SGD to update the parameters
– The best result is 0.7577 on the public test set
Surprisingly, the ranking-based models do not outperform the other models, perhaps because they are more complicated to train and their parameters are harder to tune.
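A minimal sketch of one SGD step on the RankNet-style pairwise cross-entropy above, using a plain linear scorer instead of the one-hidden-layer network on the slide; the learning rate and data are illustrative.

```python
import numpy as np

def ranknet_pair_step(w, x_pos, x_neg, lr=0.1):
    """One SGD step on C = -r + log(1 + exp(r)) with r = w . (x_pos - x_neg),
    where x_pos should be ranked above x_neg."""
    diff = x_pos - x_neg
    r = w @ diff
    grad = (1.0 / (1.0 + np.exp(-r)) - 1.0) * diff  # dC/dw = (sigmoid(r) - 1) * diff
    return w - lr * grad

w = np.zeros(3)
w = ranknet_pair_step(w, np.array([1.0, 0.2, 0.0]), np.array([0.3, 0.1, 1.0]))
print(w)  # weights move so the clicked example scores higher
```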
Individual Models: Combined Regression and
Ranking Models (1)
• We also explore another model that combines the ranking
loss and the regression loss
• In this model we try to optimize a weighted combination of the two losses, where H is the ranking loss and L is the regression loss
• Solved by an SGD-based SVM; the best AUC is 0.7819 on the public test set
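A minimal sketch of a combined objective in this spirit: each SGD step samples either a pairwise ranking term (H) or a pointwise squared-error regression term (L) for a linear scorer. The mixing weight, learning rate, and sampling scheme are illustrative assumptions, not the exact setting used by the team.

```python
import numpy as np

def crr_step(w, x_hi, y_hi, x_lo, y_lo, rng, mix=0.5, lr=0.05):
    """One SGD step on mix * H + (1 - mix) * L, where (x_hi, y_hi) should
    rank above / have a higher CTR than (x_lo, y_lo)."""
    if rng.random() < mix:                               # ranking term H (pairwise logistic)
        d = x_hi - x_lo
        grad = (1.0 / (1.0 + np.exp(-(w @ d))) - 1.0) * d
    else:                                                # regression term L (squared error)
        grad = 2.0 * (w @ x_hi - y_hi) * x_hi
    return w - lr * grad

rng = np.random.default_rng(0)
w = np.zeros(2)
w = crr_step(w, np.array([1.0, 0.0]), 0.2, np.array([0.0, 1.0]), 0.01, rng)
print(w)
```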
Individual Models: Matrix Factorization Models (1)
• We also have feature-based factorization models, which
exploit latent information from data
• Two different matrix factorization models are used: one optimizes a regression loss, and the other optimizes a ranking loss
Individual Models: Matrix Factorization Models (2)
• Regression-Based
– Divide features into two groups: α as user-side features and β as item-side features
– The prediction for an instance is
$\hat{r}(x_i) = \sum_j \big( w_j^{(u)} \alpha_j + w_j^{(i)} \beta_j \big) + \big( \sum_j p_j \alpha_j \big)^{T} \big( \sum_j q_j \beta_j \big)$,
where the linear summations act as user-side and item-side bias terms
– Minimize RMSE
– The best AUC is 0.7776 on the public test set
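A minimal sketch of the prediction above: linear terms on the user-side (α) and item-side (β) features plus an inner product of their aggregated latent vectors. Dimensions, initialization, and variable names are illustrative.

```python
import numpy as np

def mf_predict(alpha, beta, w_u, w_i, P, Q):
    """r_hat = sum_j w_u[j]*alpha[j] + sum_j w_i[j]*beta[j]
               + (sum_j P[j]*alpha[j]) . (sum_j Q[j]*beta[j])"""
    latent_u = P.T @ alpha  # aggregated user-side latent vector
    latent_i = Q.T @ beta   # aggregated item-side latent vector
    return w_u @ alpha + w_i @ beta + latent_u @ latent_i

rng = np.random.default_rng(0)
alpha, beta = rng.random(4), rng.random(3)            # user-side / item-side feature values
print(mf_predict(alpha, beta, rng.random(4), rng.random(3),
                 rng.random((4, 8)), rng.random((3, 8))))
```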
Individual Models: Matrix Factorization Models (3)
• Ranking-Based
– The prediction for an instance is
$\hat{r}(x_i) = \sum_{j=1}^{B} w_j x_{ij} + \sum_{j=1}^{B} \sum_{k=j+1}^{B} \langle p_j, p_k \rangle \, x_{ij} x_{ik}$,
where $B$ is the number of features and $x_{ij}$ is the $j$-th feature value of instance $x_i$
– Features can belong to α, β or both
– Optimize pairwise ranking as
$\min \; \sum_{i=1}^{N} \sum_{j=1}^{N} L\big( \hat{r}(x_i) - \hat{r}(x_j) \big) + \frac{\lambda}{2} \lVert \theta \rVert^{2}$, where $L(x) = \ln(1 + \exp(-x))$ and $\theta$ denotes the model parameters
– The best AUC is 0.7968 on the public test set
Validation Set Blending (1)
• Blend models and additional features non-linearly
• Re-blend to obtain additional improvement
• Four models for blending
– Support Vector Regression (SVR)
– RankNet (RN)
– Combined Regression and Ranking Models (CRR)
– LambdaMART (LM)
Validation Set Blending (2)
• Strategies for model selection
1. Select by the difference between validation AUC and test AUC
2. Or manually select a diverse set of models
• Different score representations
– Raw score
– Normalized score
– Ranked score
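A minimal sketch of the three score representations listed above; the exact normalization used is not given on the slide, so min-max scaling is shown as one plausible choice.

```python
import numpy as np

def score_variants(scores):
    s = np.asarray(scores, dtype=float)
    normalized = (s - s.min()) / (s.max() - s.min())   # min-max normalized score
    ranked = np.argsort(np.argsort(s)) / (len(s) - 1)  # rank-transformed score in [0, 1]
    return s, normalized, ranked

print(score_variants([0.31, 0.90, 0.45]))
```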
Performance of blending models:
Model                     Public Test Set AUC
SVR                       0.8038
RN (with re-blending)     0.8062
CRR (with re-blending)    0.8051
LM (with re-blending)     0.8060
Test Set Ensemble (1)
• Ensemble the selected models from validation set
blending
• Combine the models linearly
• Weights of the linear combination depend on the AUC on the public test set
• It achieves 0.8064 on the public test set
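A minimal sketch of a linear test-set ensemble whose weights come from each model's public-test AUC. The slide does not give the exact weighting rule; weighting by AUC above 0.5 is only an illustration.

```python
import numpy as np

def auc_weighted_ensemble(predictions, aucs):
    """predictions: list of per-instance score arrays; aucs: public-test AUC of each model."""
    w = np.asarray(aucs) - 0.5          # illustrative choice: weight by AUC above random
    w = w / w.sum()
    return sum(wi * np.asarray(p) for wi, p in zip(w, predictions))

final = auc_weighted_ensemble([[0.2, 0.9], [0.3, 0.8]], aucs=[0.8038, 0.8062])
print(final)
```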
Final Result
• We apply a uniform average over the top five models on the leaderboard to form our final solution. It achieves 0.8089 on the private test set (0.8070 on the public test set), outperforming all other competitors in this competition.
Take Home Points
• The main reasons for our success:
– Tried diverse models (ranking, classification,
regression, factorization)
– Novel ways of feature engineering (e.g., smoothing, latent features from LDA, raw ID values, etc.)
– Complex two-stage blending models
– Perseverance (we probably tried more failed models than effective ones)
Acknowledgement
• We truly thank
– organizers for designing a successful competition
– NTU EECS college, CSIE department, and the INTEL-NTU center for their support