Fast Prediction of New Feature Utility

Hoyt Koepke
Misha Bilenko
Machine Learning in Practice
• Problem formulated as a prediction task
• Implement learner, get supervision
• Design, refine features
• Train, validate, ship
To improve accuracy, we can improve:
– Training
– Supervision
– Features
Improving Accuracy By Improving
• Training
– Algorithms, objectives/losses, hyper-parameters, …
• Supervision
– Cleaning, labeling, sampling, semi-supervised
• Representation: refine/induce/add new features
– Most ML engineering for mature applications happens here!
– Process: let’s try this new extractor/data stream/transform/…
• Manual or automatic [feature induction: Della Pietra et al.’97]
Evaluating New Features
• Standard procedure:
– Add features, re-run train/test/CV, hope accuracy improves
• In many applications, this is costly
– Computationally: full re-training is O(hours)
– Monetarily: cost per feature-value (must check on a small sample)
– Logistically: infrastructure pipelined, non-trivial, under-documented
• Goal: efficiently check whether a new feature can improve accuracy without retraining
Feature Relevance ≠ Feature Selection
• Selection objective: removing existing features
• Relevance objective: decide if a new feature is worth adding
• Most feature selection methods either use re-training or estimate $\mathrm{Utility}(\text{feature}, \text{label} \mid \text{current features})$
• Feature relevance requires estimating $\mathrm{Utility}(\text{feature}, \text{label} \mid \text{current features}, \text{predictor})$
Formalizing New Feature Relevance
• Supervised learning setting
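– Training set $\{(X_i, Y_i)\}_{i=1 \ldots N}$
– Current predictor $f_0 = \arg\min_{f \in \mathcal{F}_{\mathcal{X}}} \mathbb{E}\, l(Y, f(X)) = \arg\min_{f \in \mathcal{F}_{\mathcal{X}}} L(Y, f(X))$
– New feature $X'$
[Diagram: table with one row per example $i = 1 \ldots N$ and columns $X'_i$, $X_i$, $Y_i$, $\hat{Y}_i = f_0(X_i)$, $L(Y_i, \hat{Y}_i)$]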
• Hypothesis: can a better predictor be learned with the new feature?
$\min_{f \in \mathcal{F}_{\mathcal{X},\mathcal{X}'}} L(Y, f(X, X')) < L(Y, f_0(X))$
• Too general. Instead, let’s test an additive form:
$\exists h \in \mathcal{F}_{\mathcal{X},\mathcal{X}'}$ s.t. $L(Y, f_0(X) + h(X, X')) < L(Y, f_0(X))$
• For efficiency, we can just test:
$\exists h \in \mathcal{F}_{\mathcal{X}'}$ s.t. $L(Y, f_0(X) + h(X')) < L(Y, f_0(X))$
Hypothesis Test for New Feature Relevance
• We want to test whether $X'$ has incremental signal:
$H_0$: $\exists h \in \mathcal{F}_{\mathcal{X}'}$ s.t. $L(Y, f_0(X) + h(X')) < L(Y, f_0(X))$
• Intuition: loss gradient tells us how to improve the predictor
• Consider functional loss gradient $\Lambda_{f_0} = \left.\frac{\partial L}{\partial f}\right|_{f_0}$
– Since $f_0$ is locally optimal, $\mathbb{E}[\Lambda_{f_0}] = 0$: no descent direction exists
• Theorem: under reasonable assumptions, $H_0$ is equivalent to:
$H_1$: $\min_{\beta \in \mathbb{R}} L(Y, f_0(X) + \beta g^*(X')) < L(Y, f_0(X))$
$H_2$: $\mathbb{E}[g^*(X') \cdot \Lambda_{f_0}] > 0$
where $g^* = \arg\max_{g \in \mathcal{F}_{\mathcal{X}'},\ \mathrm{std}(g) = 1} \mathbb{E}[g(X') \cdot \Lambda_{f_0}]$
Hypothesis Test for New Feature Relevance
𝔼 𝑔∗ 𝑋 ′ ⋅ Λ𝑓0 > 0
• Intuition: can 𝑋 ′ yield a descent direction in functional space?
• Why this is cool:
Testing new feature relevance for a broad class of losses
⟺
testing correlation between feature and normalized loss gradient
𝑋′1
𝑋1
π‘Œ1
π‘Œ1 𝐿(π‘Œ1 , π‘Œ1 ) 𝛻𝐿(π‘Œ1 , π‘Œ1 )
𝑓0
→
….
𝑋′𝑁
𝑋𝑁
π‘Œπ‘
π‘Œπ‘ 𝐿(π‘Œ1 , π‘Œ1 ) 𝛻𝐿(π‘Œπ‘ , π‘Œπ‘ )
Testing Correlation to Loss Gradient
• We don’t have a consistent test for $\mathbb{E}[g^*(X') \cdot \Lambda_{f_0}] > 0$
…but $\mathbb{E}[\Lambda_{f_0}] = 0$ ($f_0$ locally optimal), so the above is equivalent to:
$\exists g$ s.t. $\mathbb{E}[g^*(X') \cdot \Lambda_{f_0}] - \mathbb{E}[g^*(X')]\,\mathbb{E}[\Lambda_{f_0}] > 0$
…for which we can design a consistent bootstrap test!
• Intuition
– We need to test if we can train an $L_2$ regressor $g^*: X' \to \Lambda_{f_0}$
– We want it to be as powerful as possible and work on small samples
Q: How do we distinguish between true correlation and overfitting?
A: We correct by the correlation from $g^*$: bootstrap($X'$) → bootstrap($\Lambda_{f_0}$)
New Feature Relevance: Algorithm
(1) Train best-fit $L_2$ regressor $g^*: X' \to \Lambda_{f_0}$
- Compute correlation between predictions and targets
(2) Repeat $N_{\mathrm{bootstrap}}$ times
a) Draw independent bootstrap samples of $X'$ and $\Lambda_{f_0}$
b) Train best-fit $L_2$ regressor, compute correlation
(3) Score: correlation (1) corrected by (2)
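The three steps can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the $L_2$ regressor is a cubic polynomial least-squares fit (any regressor could be substituted), the feature is a single real-valued column, and the correction in step (3) is taken to be the standardized form (v − mean(t)) / std(t), with v the step (1) correlation and t the step (2) correlations. The names `fit_predict_l2`, `correlation` and `relevance_score` are hypothetical.

```python
import numpy as np

def fit_predict_l2(x, target, degree=3):
    """Least-squares polynomial fit of target on x; returns in-sample predictions.
    Stand-in for the 'best-fit L2 regressor' (an assumption of this sketch)."""
    coeffs = np.polyfit(x, target, degree)
    return np.polyval(coeffs, x)

def correlation(a, b):
    """Pearson correlation, defined as 0 when either input is constant."""
    if np.std(a) == 0 or np.std(b) == 0:
        return 0.0
    return float(np.corrcoef(a, b)[0, 1])

def relevance_score(x_new, loss_grad, n_bootstrap=200, seed=0):
    rng = np.random.default_rng(seed)
    n = len(x_new)

    # (1) Fit the regressor on the real pairing and measure correlation.
    v = correlation(fit_predict_l2(x_new, loss_grad), loss_grad)

    # (2) Resample the feature and the gradient independently, so any real
    #     association is broken and only the overfitting effect remains.
    t = np.empty(n_bootstrap)
    for b in range(n_bootstrap):
        xb = x_new[rng.integers(0, n, size=n)]
        gb = loss_grad[rng.integers(0, n, size=n)]
        t[b] = correlation(fit_predict_l2(xb, gb), gb)

    # (3) Correct the observed correlation by the bootstrap distribution.
    return (v - t.mean()) / t.std()
```

A feature that truly correlates with the loss gradient should score well above one that is pure noise, because the noise feature's in-sample correlation is matched by the bootstrap correlations it is corrected against.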
Connection to Boosting
• AnyBoost/gradient boosting additive form:
– $f_0(X) + g^*(X')$ vs. $f_0(X) + g^*(X)$
– Gradient vs. coordinate descent in functional space
• AnyBoost/GB: generalization
• This work: consistent hypothesis test for feasibility
– Statistical stopping criteria for boosting?
Experimental Validation
• Natural methodology: compare to full re-training
• For each feature 𝑋′:
– Actual $\Delta L = L(f(X)) - L(f(X, X'))$
– Predicted $\Delta L = \frac{v - \mathrm{mean}(t)}{\mathrm{std}(t)}$
• We are mainly interested in high-$\Delta L$ features
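For reference, the "actual" side of this comparison can be sketched as a retrain with and without the candidate feature. The code below is an assumption-laden illustration: it uses a linear least-squares model, mean squared error as the loss, and a held-out split, none of which are specified in the slides. The helper names are hypothetical.

```python
import numpy as np

def fit_linear(X, y):
    """Least-squares linear model with an intercept; returns the weight vector."""
    A = np.column_stack([np.ones(len(X)), X])
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    return w

def predict_linear(w, X):
    return np.column_stack([np.ones(len(X)), X]) @ w

def mse(y, pred):
    return float(np.mean((y - pred) ** 2))

def actual_delta_loss(X_tr, xnew_tr, y_tr, X_te, xnew_te, y_te):
    """Held-out loss without the new feature minus held-out loss with it.
    Positive values mean the new feature helped after full retraining."""
    base = mse(y_te, predict_linear(fit_linear(X_tr, y_tr), X_te))
    Xa_tr = np.column_stack([X_tr, xnew_tr])
    Xa_te = np.column_stack([X_te, xnew_te])
    augmented = mse(y_te, predict_linear(fit_linear(Xa_tr, y_tr), Xa_te))
    return base - augmented
```

This is the expensive quantity that the predicted score is meant to rank features by without paying for the retrain.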
Datasets
• WebSearch: each “feature” is a signal source
• E.g., “Body” source defines all features that depend on document body:
– BM25(query, Body), Count(query, Body), Length(Body)
• Signal source examples: AnchorText, ClickLog, etc.
Results: Adult
Results: Housing
Results: WebSearch
Comparison to Feature Selection
New Feature Relevance: Summary
• Evaluating new features by re-training can be costly
– Computationally, Financially, Logistically
• Fast alternative: testing correlation to loss gradient
• Black-box algorithm: $L_2$ regression for (almost) any loss!
• Just one approach, lots of future work:
– Alternatives to hypothesis testing: info-theory, optimization, …
– Semi-supervised methods
– Back to feature selection?
– Removing black-box assumptions