Cost-Sensitive Learning for Large-Scale Hierarchical Classification of Commercial Products
Jianfu Chen, David S. Warren
Stony Brook University

Classification is a fundamental problem in information management.
• Product description → UNSPSC category
• Email content → Spam / Ham

Example: the UNSPSC taxonomy
• Segment: Office Equipment and Accessories and Supplies (44), Vehicles and their Accessories and Components (25), Food Beverage and Tobacco Products (50)
• Family: Marine transport (11), Motor vehicles (10), Aerospace systems (20)
• Class: Safety and rescue vehicles (17), Passenger motor vehicles (15), Product and material transport vehicles (16)
• Commodity: Buses (02), Automobiles or cars (03), Limousines (06)

How should we design a classifier for a given real-world task?

Method 1. No design
• Training set → f(x) → test set
• Try off-the-shelf classifiers: SVM, logistic regression, decision tree, neural network, ...
• Implicit assumption: we are trying to minimize the error rate, or equivalently, maximize accuracy.

Method 2. Optimize what we really care about
• What is the classifier used for? How do we evaluate the performance of a classifier according to our interests?
• Quantify what we really care about, then optimize it.

Hierarchical classification of commercial products
• Task: map a textual product description to a class in the UNSPSC taxonomy (Segment → Family → Class → Commodity, as in the example above).
• A product taxonomy helps customers find desired products quickly.
  – Facilitates exploring similar products (looking for gift ideas for a kid? browse Toys & Games: dolls, puzzles, building toys, ...)
  – Helps product recommendation
  – Facilitates corporate spend analysis

We assume misclassification of products leads to revenue loss.
• A mouse whose textual description is classified under "Desktop computer and accessories" (alongside mice and keyboards) realizes its expected annual revenue.
• The same mouse misclassified under "Pet" loses part of its potential revenue.

What do we really care about?
• A vendor's business goal is to maximize revenue, or equivalently, to minimize revenue loss.
• Observation 1: the misclassification cost of a product depends on its potential revenue.
• Observation 2: the misclassification cost of a product depends on how far apart the true class and the predicted class are in the taxonomy.

The proposed performance evaluation metric: average revenue loss

$R_{emp} = \frac{1}{N} \sum_{(x, y, y') \in D} v(x) \cdot \delta_{yy'}$

• The example weight $v(x)$ is the potential annual revenue of product x.
• The error function $\delta_{yy'}$ is the loss ratio: the percentage of the potential revenue a vendor will lose due to misclassification from class y to class y'.
  – It is a non-decreasing monotonic function of the hierarchical distance between y and y', $\delta_{yy'} = f(d(y, y'))$, for example:

    d(y, y'):        0     1     2     3     4
    $\delta_{yy'}$:  0     0.2   0.4   0.6   0.8

Learning: minimizing average revenue loss
• $R_{emp} = \frac{1}{N} \sum_{(x, y, y') \in D} v(x) \cdot \delta_{yy'}$
• Minimize a convex upper bound of $R_{emp}$: a multi-class SVM with margin re-scaling.

Multi-class SVM with margin re-scaling

$\min_{w,\xi} \ \frac{1}{2}\|w\|^2 + \frac{C}{N}\sum_{i=1}^{N}\xi_i$
s.t. $\forall i, \forall y' \ne y_i:\ w_{y_i}^\top x_i - w_{y'}^\top x_i \ \ge\ \Delta(x_i, y_i, y') - \xi_i$, and $\xi_i \ge 0$,
where the re-scaled margin is $\Delta(x_i, y_i, y') = v(x_i)\cdot\delta_{y_i y'}$.
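To make the two pieces above concrete, here is a minimal NumPy sketch (not from the slides; function names such as `average_revenue_loss` and `margin_rescaled_hinge` are illustrative) of the average revenue loss metric built from the loss-ratio table $\delta_{yy'} = f(d(y, y'))$, and of the per-example margin-rescaled hinge term that the SVM minimizes, assuming linear per-class scorers $w_y$.

```python
import numpy as np

# Loss-ratio table from the slides: delta_{yy'} = f(d(y, y')), where d(y, y')
# is the hierarchical (tree) distance between the true and predicted class.
LOSS_RATIO = {0: 0.0, 1: 0.2, 2: 0.4, 3: 0.6, 4: 0.8}


def average_revenue_loss(y_true, y_pred, revenue, tree_dist):
    """Average revenue loss R_emp = (1/N) * sum_i v(x_i) * delta_{y_i, y'_i}.

    revenue[i]       -- potential annual revenue v(x_i) of product i
    tree_dist(y, yp) -- hierarchical distance d(y, y') in the taxonomy
    """
    losses = [revenue[i] * LOSS_RATIO[tree_dist(y_true[i], y_pred[i])]
              for i in range(len(y_true))]
    return float(np.mean(losses))


def margin_rescaled_hinge(W, x, y, delta_row):
    """Margin-rescaled hinge term for one example (x, y):

        max_{y'} [ Delta(x, y, y') + w_{y'} . x ] - w_y . x

    W         -- (num_classes, num_features) matrix of per-class weights w_y
    delta_row -- delta_row[y'] = Delta(x, y, y'), with Delta(x, y, y) = 0,
                 so the max over all y' also covers the xi_i >= 0 case.
    """
    scores = W @ x                    # w_{y'} . x for every class y'
    augmented = delta_row + scores    # loss-augmented scores
    return float(np.max(augmented) - scores[y])
```

Regularized training then minimizes $\frac{1}{2}\|W\|^2 + \frac{C}{N}\sum_i$ of this per-example term (the usual slack-variable rewriting of the constrained form above), e.g. by stochastic subgradient descent or a cutting-plane method; setting `delta_row` to $v(x_i)\,\delta_{y_i y'}$ gives the revenue-loss variant.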
Multi-class SVM with margin re-scaling (continued)
• The objective is a convex upper bound of the empirical risk $\frac{1}{N}\sum_{i=1}^{N}\Delta(x_i, y_i, y')$.
• Any loss function can be plugged in as $\Delta(x_i, y_i, y')$:

    0-1:      $[y_i \ne y']$                 — error rate (standard multi-class SVM)
    VALUE:    $v(x_i)\,[y_i \ne y']$         — product revenue
    TREE:     $D(y_i, y')$                   — hierarchical distance
    REVLOSS:  $v(x_i)\,\delta_{y_i y'}$      — revenue loss

Dataset
• UNSPSC (United Nations Standard Products and Services Code) dataset
  – data source: multiple online marketplaces oriented toward DoD and federal government customers (GSA Advantage, DoD EMALL)
  – taxonomy structure: 4-level balanced tree (the UNSPSC taxonomy)
  – #examples: 1.4M
  – #leaf classes: 1,073
• Product revenues are simulated: revenue = price × sales

Experimental results
[Bar chart: average revenue loss (in K$) of the different loss functions (0-1, TREE, VALUE, REVLOSS) under the IDENTITY and UNIT settings; plotted values include 47.708, 48.082, 4.745, 4.964, 5.092, and 5.082.]

What's wrong?
• In the constraints $w_{y_i}^\top x_i - w_{y'}^\top x_i \ge \Delta(x_i, y_i, y') - \xi_i$, the required margin $\Delta(x_i, y_i, y') = v(x_i)\cdot\delta_{y_i y'}$ is the raw revenue loss, which ranges from a few thousand to several million dollars.

Loss normalization
• Linearly scale the loss function to a fixed range $[1, \tilde{\delta}_{max}]$, say [1, 10]:

  $\tilde{\Delta}(x, y, y') = 1 + \dfrac{\Delta(x, y, y') - \Delta_{min}}{\Delta_{max} - \Delta_{min}} \cdot (\tilde{\delta}_{max} - 1)$

• The objective now upper bounds both the 0-1 loss and the average normalized loss.

Final results
[Bar chart: average revenue loss (in K$) of the different loss functions (0-1, TREE, VALUE, REVLOSS) under the IDENTITY, UNIT, and RANGE settings; plotted values include 47.708, 48.082, 4.745, 4.964, 5.092, 4.387, 5.082, and 4.371.]
• 7.88% reduction in average revenue loss!

Conclusion
• What do we really care about for this task: minimizing the error rate, or minimizing revenue loss?
• Performance evaluation metric = empirical risk, the average misclassification cost:
  $R_{emp} = \frac{1}{N}\sum_{(x, y, y') \in D} w(x)\cdot\Delta(y, y')$
• How do we approximate the performance evaluation metric to make it tractable? Model + tractable loss function + optimization to find the best parameters (regularized empirical risk minimization).
• A general method: multi-class SVM with margin re-scaling and loss normalization.

Thank you! Questions?
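Backup: a minimal sketch of the range normalization from the loss-normalization slide, assuming the [1, 10] target range used there; the helper name `normalize_loss` is illustrative, and presumably only the misclassification losses (y' ≠ y) are rescaled while $\Delta(x, y, y)$ stays 0, which is what makes the objective an upper bound on the 0-1 loss as well.

```python
import numpy as np


def normalize_loss(delta, target_max=10.0):
    """Linearly rescale raw misclassification losses Delta(x, y, y') to [1, target_max]:

        Delta~ = 1 + (Delta - Delta_min) / (Delta_max - Delta_min) * (target_max - 1)

    delta -- array of raw loss values, e.g. revenue losses v(x) * delta_{yy'}
             over all training examples and wrong classes y' != y.
    """
    delta = np.asarray(delta, dtype=float)
    d_min, d_max = delta.min(), delta.max()
    if d_max == d_min:               # degenerate case: every loss is identical
        return np.ones_like(delta)
    return 1.0 + (delta - d_min) / (d_max - d_min) * (target_max - 1.0)
```

For example, `normalize_loss([2_000, 50_000, 3_000_000])` maps the few-thousand-dollar loss to 1, the multi-million-dollar loss to 10, and the mid-range loss to about 1.14, so no single high-revenue product dominates the margin constraints.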