Fast Prediction of New Feature Utility
Hoyt Koepke, Misha Bilenko

Machine Learning in Practice
• Problem formulated as a prediction task
• Implement learner, get supervision
• Design, refine features
• Train, validate, ship
• To improve accuracy, we can improve:
  – Training
  – Supervision
  – Features

Improving Accuracy By Improving…
• Training
  – Algorithms, objectives/losses, hyper-parameters, …
• Supervision
  – Cleaning, labeling, sampling, semi-supervised methods
• Representation: refine/induce/add new features
  – Most ML engineering for mature applications happens here!
  – Typical process: "let's try this new extractor/data stream/transform/…"
  – Manual or automatic [feature induction: Della Pietra et al. '97]

Evaluating New Features
• Standard procedure:
  – Add features, re-run train/test/CV, hope accuracy improves
• In many applications, this is costly
  – Computationally: full re-training is $O(\text{hours})$
  – Monetarily: there is a cost per feature value (must check on a small sample)
  – Logistically: infrastructure is pipelined, non-trivial, under-documented
• Goal: efficiently check whether a new feature can improve accuracy, without retraining

Feature Relevance ≠ Feature Selection
• Selection objective: remove existing features
• Relevance objective: decide whether a new feature is worth adding
• Most feature selection methods either use re-training or estimate $\text{utility}(\text{feature}, \text{label})$, ignoring current features
• Feature relevance requires estimating $\text{utility}(\text{feature}, \text{label} \mid \text{current features}, \text{predictor})$

Formalizing New Feature Relevance
• Supervised learning setting
  – Training set $\{(x_i, y_i)\}_{i=1,\dots,n}$
  – Current predictor $f_0 = \arg\min_{f \in \mathcal{F}_{\mathcal{X}}} \mathbb{E}\,\ell(y, f(x)) = \arg\min_{f \in \mathcal{F}_{\mathcal{X}}} L(y, f(x))$
  – New feature $x'$
• [Diagram: each example $(x'_i, x_i, y_i)$ is scored by $f_0$, yielding the per-example loss $L(y_i, f_0(x_i))$]
• Hypothesis: can a better predictor be learned with the new feature?
  $\min_{f \in \mathcal{F}_{\mathcal{X},\mathcal{X}'}} L(y, f(x, x')) < L(y, f_0(x))$
• Too general. Instead, let's test an additive form:
  $\exists h \in \mathcal{F}_{\mathcal{X},\mathcal{X}'} \text{ s.t. } L(y, f_0(x) + h(x, x')) < L(y, f_0(x))$
• For efficiency, we can just test:
  $\exists h \in \mathcal{F}_{\mathcal{X}'} \text{ s.t. } L(y, f_0(x) + h(x')) < L(y, f_0(x))$

Hypothesis Test for New Feature Relevance
• We want to test whether $x'$ has incremental signal:
  $\exists h \in \mathcal{F}_{\mathcal{X}'} \text{ s.t. } L(y, f_0(x) + h(x')) < L(y, f_0(x))$
• Intuition: the loss gradient tells us how to improve the predictor
• Consider the functional loss gradient $\nabla_{\!f_0} = \left.\frac{\partial L}{\partial f}\right|_{f_0}$
  – Since $f_0$ is locally optimal, $\mathbb{E}[\nabla_{\!f_0}] = 0$: no descent direction exists within $\mathcal{F}_{\mathcal{X}}$
• Theorem: under reasonable assumptions, $H_1$ is equivalent to:
  $\min_{\beta \in \mathbb{R}} L(y, f_0(x) + \beta h^*(x')) < L(y, f_0(x)) \;\Longleftrightarrow\; \mathbb{E}[h^*(x') \cdot \nabla_{\!f_0}] > 0$
  where $h^* = \arg\max_{h \in \mathcal{F}_{\mathcal{X}'},\, \mathrm{std}(h)=1} \mathbb{E}[h(x') \cdot \nabla_{\!f_0}]$
• Intuition: can $x'$ yield a descent direction in functional space?
• Why this is cool: testing new feature relevance for a broad class of losses ⟺ testing correlation between the feature and the normalized loss gradient (a small code sketch of this statistic follows this slide)
• [Diagram: each example $(x'_i, x_i, y_i)$ now also yields a per-example loss gradient $\nabla L(y_i, f_0(x_i))$]
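To make the test statistic concrete, here is a minimal sketch (not from the talk; all function names are illustrative) of the per-example loss gradient for two common losses, plus the statistic for a single scalar candidate feature. In the scalar linear case, the unit-variance maximizer $h^*$ is just the standardized feature, so $\mathbb{E}[h^*(x') \cdot \nabla_{\!f_0}]$ reduces, up to the scale factor $\mathrm{std}(\nabla_{\!f_0})$, to the Pearson correlation between $x'$ and the gradient.

```python
import numpy as np

def loss_gradient(y, f0_scores, loss="squared"):
    """Per-example functional loss gradient dL/df evaluated at f0.

    squared:  L = (y - f)^2 / 2       =>  grad = f - y
    logistic: y in {0,1}, f a logit   =>  grad = sigmoid(f) - y
    """
    if loss == "squared":
        return f0_scores - y
    if loss == "logistic":
        return 1.0 / (1.0 + np.exp(-f0_scores)) - y
    raise ValueError(f"unknown loss: {loss}")

def scalar_relevance_statistic(x_new, grad):
    """Pearson correlation between a scalar candidate feature and the
    loss gradient; proportional to E[h*(x') * grad_f0] when h* ranges
    over unit-variance linear functions of x'."""
    return np.corrcoef(x_new, grad)[0, 1]
```

A value far from zero suggests a functional descent direction that uses only $x'$; the bootstrap test described next turns this intuition into a calibrated score.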
Testing Correlation to Loss Gradient
• We don't have a consistent test for $\mathbb{E}[h^*(x') \cdot \nabla_{\!f_0}] > 0$
  …but $\mathbb{E}[\nabla_{\!f_0}] = 0$ ($f_0$ is locally optimal), so the above is equivalent to:
  $\exists h \text{ s.t. } \mathbb{E}[h^*(x') \cdot \nabla_{\!f_0}] - \mathbb{E}[h^*(x')]\,\mathbb{E}[\nabla_{\!f_0}] > 0$
  …for which we can design a consistent bootstrap test!
• Intuition
  – We need to test whether we can train an $L_2$ regressor $h^*\!: x' \to \nabla_{\!f_0}$
  – We want the test to be as powerful as possible and to work on small samples
• Q: How do we distinguish true correlation from overfitting?
  A: We correct by the correlation achieved by $h^*\!: \text{bootstrap}(x') \to \text{bootstrap}(\nabla_{\!f_0})$

New Feature Relevance: Algorithm
(1) Train the best-fit $L_2$ regressor $h^*\!: x' \to \nabla_{\!f_0}$
  – Compute the correlation between predictions and targets
(2) Repeat $n_{\text{bootstrap}}$ times:
  a) Draw independent bootstrap samples of $x'$ and $\nabla_{\!f_0}$
  b) Train the best-fit $L_2$ regressor, compute the correlation
(3) Score: the correlation from (1), corrected by the null distribution from (2)
(A code sketch of this procedure appears after the summary.)

Connection to Boosting
• AnyBoost/gradient boosting use the same additive form:
  – $f_0(x) + h^*(x')$ vs. $f_0(x) + h^*(x)$
  – Gradient vs. coordinate descent in functional space
• AnyBoost/GB: generalization
• This work: a consistent hypothesis test for feasibility
  – Statistical stopping criteria for boosting?

Experimental Validation
• Natural methodology: compare to full re-training
• For each feature $x'$:
  – Actual $\Delta L = L(f(x)) - L(f(x, x'))$
  – Predicted $\Delta L$: the test score $\dfrac{v - \text{mean}(t)}{\text{std}(t)}$, where $v$ is the observed correlation and $t$ the bootstrap correlations
• We are mainly interested in high-$\Delta L$ features

Datasets
• WebSearch: each "feature" is a signal source
• E.g., the "Body" source defines all features that depend on the document body:
  – $BM25(\text{Query}, \text{Body})$, $\text{Count}(\text{Query}, \text{Body})$, $\text{Length}(\text{Body})$
• Signal source examples: AnchorText, ClickLog, etc.

Results: Adult
[results plot]

Results: Housing
[results plot]

Results: WebSearch
[results plot]

Comparison to Feature Selection
[results plot]

New Feature Relevance: Summary
• Evaluating new features by re-training can be costly
  – Computationally, financially, logistically
• Fast alternative: test correlation to the loss gradient
• Black-box algorithm: $L_2$ regression for (almost) any loss!
• Just one approach, lots of future work:
  – Alternatives to hypothesis testing: information theory, optimization, …
  – Semi-supervised methods
  – Back to feature selection?
  – Removing black-box assumptions
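To tie the pieces together, here is a compact sketch of the three-step bootstrap test from the Algorithm slide, assuming a vector-valued candidate feature and a plain linear model as the $L_2$ regressor (the talk treats the regressor as a black box, so any regressor could be substituted). Function names, `n_bootstrap`, and the final z-score normalization (matching the Predicted $\Delta L$ formula above) are my choices, not the authors'.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def fit_and_correlate(x_new, grad):
    """Steps (1)/(2b): train a best-fit L2 regressor h*: x' -> grad_f0
    and return the correlation between its predictions and the targets."""
    h = LinearRegression().fit(x_new, grad)
    return np.corrcoef(h.predict(x_new), grad)[0, 1]

def new_feature_relevance(x_new, grad, n_bootstrap=200, seed=0):
    """Score a candidate feature's correlation with the loss gradient
    against a bootstrap null in which the pairing is broken.

    x_new: (n, d) array of candidate-feature values
    grad:  (n,) array of per-example loss gradients at f0
    """
    rng = np.random.default_rng(seed)
    n = len(grad)

    # (1) Observed correlation on the true (x', grad) pairing.
    v = fit_and_correlate(x_new, grad)

    # (2) Null distribution: resample x' and grad *independently*,
    # destroying any real association while preserving the marginals
    # (and hence the regressor's ability to overfit small samples).
    t = np.empty(n_bootstrap)
    for b in range(n_bootstrap):
        rows_x = rng.integers(0, n, size=n)
        rows_g = rng.integers(0, n, size=n)
        t[b] = fit_and_correlate(x_new[rows_x], grad[rows_g])

    # (3) Score: observed correlation corrected by the null,
    # i.e., (v - mean(t)) / std(t).
    return (v - t.mean()) / t.std()
```

For example, with `grad = loss_gradient(y, f0_scores)` from the earlier sketch, a score well above typical Gaussian quantiles (say 2 or 3) would suggest that $x'$ offers a descent direction and may be worth the cost of full retraining.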