Cost-Sensitive Learning for Large-Scale Hierarchical Classification

Cost-Sensitive Learning for Large-Scale Hierarchical Classification of
Commercial Products
Jianfu Chen, David S. Warren
Stony Brook University
Classification is a fundamental problem
in information management.
Examples:
• Email content → Spam / Ham
• Product description → UNSPSC
UNSPSC taxonomy levels:
• Segment: Office Equipment and Accessories and Supplies (44); Vehicles and their Accessories and Components (25); Food Beverage and Tobacco Products (50)
• Family: Marine transport (11); Motor vehicles (10); Aerospace systems (20)
• Class: Safety and rescue vehicles (17); Passenger motor vehicles (15); Product and material transport vehicles (16)
• Commodity: Buses (02); Automobiles or cars (03); Limousines (06)
How should we design a classifier
for a given real world task?
Method 1. No Design
Training Set → f(x) → Test Set
Try Off-the-shelf Classifiers
SVM
Logistic-regression
Decision Tree
Neural Network
...
Implicit Assumption: We are trying to minimize
error rate, or equivalently, maximize accuracy
Method 2. Optimize what we really care about
What’s the use of the
classifier?
How do we evaluate the
performance of a classifier
according to our interests?
Quantify what we
really care about
Optimize what we care about
Hierarchical classification of commercial products
Textual product description → UNSPSC
• Segment: Office Equipment and Accessories and Supplies (44); Vehicles and their Accessories and Components (25); Food Beverage and Tobacco Products (50)
• Family: Marine transport (11); Motor vehicles (10); Aerospace systems (20)
• Class: Safety and rescue vehicles (17); Passenger motor vehicles (15); Product and material transport vehicles (16)
• Commodity: Buses (02); Automobiles or cars (03); Limousines (06)
Product taxonomy helps customers to
find desired products quickly.
• Facilitates exploring similar products
• Helps product recommendation
• Facilitates corporate spend analysis
Example: Looking for gift ideas for a kid?
Toys & Games → dolls, puzzles, building toys, ...
We assume misclassification of
products leads to revenue loss.
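[Figure: the textual product description of a mouse is classified into the product taxonomy (Product → ... → Desktop computer and accessories → mouse, keyboard, ...; Product → ... → pet).]
• Classified as mouse: realize an expected annual revenue
• Misclassified, for example as pet: lose part of the potential revenue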
What do we really care about?
A vendor’s business goal is to maximize
revenue, or equivalently, minimize revenue
loss
Observation 1: the misclassification
cost of a product depends on its
potential revenue.
Observation 2: the misclassification cost of a
product depends on how far apart the true class
and the predicted class are in the taxonomy.
[Figure: the textual product description of a mouse and its candidate classes in the product taxonomy: mouse and keyboard under "Desktop computer and accessories", and pet elsewhere in the taxonomy.]
The proposed performance evaluation metric:
average revenue loss
$$R_{em} = \frac{1}{m} \sum_{(x, y, y') \in D} v(x) \cdot L_{yy'}$$
where each term $v(x) \cdot L_{yy'}$ is the revenue loss of product x.
• example weight $v(x)$ is the potential annual
revenue of product x
• error function $L_{yy'}$ is the loss ratio
– the percentage of the potential revenue a vendor will
lose due to misclassification from class y to class y'.
– a monotonically non-decreasing function of the hierarchical
distance between y and y', $f(d(y, y'))$
d(y, y')    0    1    2    3    4
L_{yy'}     0    0.2  0.4  0.6  0.8
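The metric and the loss-ratio table can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes 8-digit UNSPSC codes with two digits per level (segment, family, class, commodity), takes d(y, y') to be the number of levels from the leaves up to the lowest common ancestor (which matches the 0 to 4 range in the table), and uses made-up revenues. The function names and the code `44101503` are hypothetical.

```python
def hierarchical_distance(y_true: str, y_pred: str) -> int:
    """Levels from the leaves up to the lowest common ancestor.

    Assumption: 8-digit codes, two digits per level, 4 levels.
    """
    levels = 4
    for k in range(levels):
        if y_true[2 * k:2 * k + 2] != y_pred[2 * k:2 * k + 2]:
            return levels - k
    return 0


def loss_ratio(d: int) -> float:
    """Loss ratio from the table: 0, 0.2, 0.4, 0.6, 0.8 for d = 0..4."""
    return 0.2 * d


def average_revenue_loss(examples) -> float:
    """examples: iterable of (revenue v(x), true code y, predicted code y')."""
    examples = list(examples)
    total = sum(
        v * loss_ratio(hierarchical_distance(y, y_pred))
        for v, y, y_pred in examples
    )
    return total / len(examples)


# Illustrative data: 25101503 is "Automobiles or cars", 25101502 is "Buses".
D = [
    (1000.0, "25101503", "25101503"),  # correct, d = 0
    (1000.0, "25101503", "25101502"),  # sibling commodity, d = 1
    (500.0, "25101503", "44101503"),   # different segment, d = 4
]
avg = average_revenue_loss(D)
print(avg)
```

A cheap product misclassified far away (d = 4) can cost more than an expensive one misclassified into a sibling class, which is what the two observations combine to express.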
Learning – minimizing average revenue loss
$$R_{em} = \frac{1}{m} \sum_{(x, y, y') \in D} v(x) \cdot L_{yy'}$$
Minimize convex upper bound
Multi-class SVM with margin re-scaling
Margin requirement: $\theta_{y_i}^T x_i - \theta_{y'}^T x_i \ge L(x_i, y_i, y') = v(x_i) \cdot L_{y_i y'}$
$$\min_{\theta, \xi} \; \frac{1}{2} \|\theta\|^2 + \frac{C}{m} \sum_{i=1}^{m} \xi_i$$
$$\text{s.t. } \forall i, \forall y': \; \theta_{y_i}^T x_i - \theta_{y'}^T x_i \ge L(x_i, y_i, y') - \xi_i, \quad \xi_i \ge 0$$
Multi-class SVM with margin re-scaling
$$\min_{\theta, \xi} \; \frac{1}{2} \|\theta\|^2 + \frac{C}{m} \sum_{i=1}^{m} \xi_i$$
$$\text{s.t. } \forall i, \forall y': \; \theta_{y_i}^T x_i - \theta_{y'}^T x_i \ge L(x_i, y_i, y') - \xi_i, \quad \xi_i \ge 0$$
The slack term $\frac{1}{m} \sum_{i=1}^{m} \xi_i$ is a convex upper bound of $\frac{1}{m} \sum_{i=1}^{m} L(x_i, y_i, y')$.
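To see why the slack bounds the loss, note that at the optimum each slack equals the largest constraint violation, $\xi_i = \max_{y'} [L(x_i, y_i, y') + \theta_{y'}^T x_i - \theta_{y_i}^T x_i]$. The predicted class scores at least as high as the true class, so its term alone is already at least the loss of the prediction. The NumPy sketch below checks this on a toy example; the function name, weights, and loss values are illustrative assumptions, not taken from the slides.

```python
import numpy as np


def margin_rescaled_hinge(theta, x, y_true, loss_row):
    """Optimal slack xi_i for one example under margin re-scaling.

    theta:    (num_classes, num_features) weights, one row per class
    x:        (num_features,) feature vector
    y_true:   index of the true class y_i
    loss_row: (num_classes,) values L(x_i, y_i, y'), with loss_row[y_true] == 0
    """
    scores = theta @ x
    violations = loss_row + scores - scores[y_true]
    # The y' = y_i entry is exactly 0, so the maximum is never negative.
    return float(np.max(violations))


# Toy example with 3 classes and 2 features (all numbers are made up).
theta = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
x = np.array([0.2, 1.0])
y_true = 0
loss_row = np.array([0.0, 2.0, 5.0])

predicted = int(np.argmax(theta @ x))
xi = margin_rescaled_hinge(theta, x, y_true, loss_row)
print(predicted, loss_row[predicted], xi)
```

Here the slack exceeds the loss actually incurred by the prediction, which is the per-example form of the upper bound.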
plug in any loss function
Name      L(x_i, y_i, y')        What it measures
0-1       [y_i ≠ y']             error rate (standard multi-class SVM)
VALUE     v(x_i) [y_i ≠ y']      product revenue
TREE      D(y_i, y')             hierarchical distance
REVLOSS   v(x_i) L_{y_i y'}      revenue loss
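The four choices differ only in the function plugged into the margin constraint, as the following Python sketch illustrates. The toy distance table for mouse, keyboard, and pet, the revenue value, and the linear loss ratio 0.2 · d are assumptions made for the example; all function names are hypothetical.

```python
# Hypothetical hierarchical distances between three leaf classes.
TOY_DISTANCE = {
    ("mouse", "keyboard"): 1,
    ("mouse", "pet"): 4,
    ("keyboard", "pet"): 4,
}


def tree_distance(y, y_pred):
    """D(y, y'): symmetric lookup in the toy table, 0 for identical classes."""
    if y == y_pred:
        return 0
    key = (y, y_pred) if (y, y_pred) in TOY_DISTANCE else (y_pred, y)
    return TOY_DISTANCE[key]


def zero_one(v, y, y_pred):
    return float(y != y_pred)


def value(v, y, y_pred):
    return v * float(y != y_pred)


def tree(v, y, y_pred):
    return float(tree_distance(y, y_pred))


def revloss(v, y, y_pred):
    # Assumed loss ratio: 0.2 per level of hierarchical distance.
    return v * 0.2 * tree_distance(y, y_pred)


LOSSES = {"0-1": zero_one, "VALUE": value, "TREE": tree, "REVLOSS": revloss}

# A mouse with a made-up potential revenue of 1000, predicted as "pet".
results = {name: fn(1000.0, "mouse", "pet") for name, fn in LOSSES.items()}
print(results)
```

Only REVLOSS uses both the example weight and the taxonomy, so it is the only one of the four that matches the average revenue loss metric term by term.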
Dataset
• UNSPSC (United Nations Standard Products and Services Code) dataset
– data source: multiple online marketplaces oriented toward DoD and Federal government customers (GSA Advantage, DoD EMALL)
– taxonomy structure: 4-level balanced tree (UNSPSC taxonomy)
– #examples: 1.4M
– #leaf classes: 1073
• Product revenues are simulated
– revenue = price * sales
Experimental results
[Bar chart: average revenue loss (in K$) of different algorithms. Loss functions on the x-axis: 0-1, TREE, VALUE, REVLOSS. Series: IDENTITY, UNIT. Bar labels: 47.708, 48.082, 4.745, 4.964, 5.092, 5.082.]
What’s wrong?
$$\min_{\theta, \xi} \; \frac{1}{2} \|\theta\|^2 + \frac{C}{m} \sum_{i=1}^{m} \xi_i$$
$$\text{s.t. } \forall i, \forall y' \ne y_i: \; \theta_{y_i}^T x_i - \theta_{y'}^T x_i \ge L(x_i, y_i, y') - \xi_i, \quad \xi_i \ge 0$$
where $L(x_i, y_i, y') = v(x_i) \cdot L_{y_i y'}$
Revenue loss ranges from a few K to several M
Loss normalization
• Linearly scale loss function to a fixed range
$[1, M_{max}]$, say [1, 10]
$$L^s(x, y, y') = 1 + \frac{L(x, y, y') - L_{min}}{L_{max} - L_{min}} \cdot (M_{max} - 1)$$
The objective now upper bounds both 0-1 loss
and the average normalized loss.
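A minimal NumPy sketch of this scaling follows. It assumes the scaling is applied to the losses of misclassifications only (correct predictions keep loss 0) and that L_min and L_max are the smallest and largest of those losses; the function name and the sample values are illustrative.

```python
import numpy as np


def normalize_losses(losses, m_max=10.0):
    """Linearly rescale misclassification losses to the range [1, m_max]."""
    losses = np.asarray(losses, dtype=float)
    l_min, l_max = losses.min(), losses.max()
    if l_max == l_min:
        # Degenerate case: all losses equal, map everything to 1.
        return np.ones_like(losses)
    return 1.0 + (losses - l_min) / (l_max - l_min) * (m_max - 1.0)


# Made-up raw revenue losses spanning a few K to several M.
raw = [2_000.0, 50_000.0, 3_000_000.0]
scaled = normalize_losses(raw, m_max=10.0)
print(scaled)
```

Because every normalized misclassification loss is at least 1, a bound on the normalized loss is also a bound on the 0-1 loss, and the required margins no longer span three orders of magnitude.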
Final results
[Bar chart: average revenue loss (in K$) of different algorithms. Loss functions on the x-axis: 0-1, TREE, VALUE, REVLOSS. Series: IDENTITY, UNIT, RANGE. Bar labels: 47.708, 48.082, 4.745, 4.964, 5.092, 4.387, 5.082, 4.371.]
7.88% reduction in average revenue loss!
Conclusion
• What do we really care about for this task? Minimize error rate? Minimize revenue loss?
Performance evaluation metric = empirical risk, average misclassification cost:
$$R_{em} = \frac{1}{m} \sum_{(x, y, y') \in D} L(x, y, y') = \frac{1}{m} \sum_{(x, y, y') \in D} w(x) \cdot \Delta(y, y')$$
• How do we approximate the performance evaluation metric to make it tractable?
Model + Tractable loss function
• Find the best parameters
Optimization: regularized empirical risk minimization
A general method: multiclass SVM with margin re-scaling and loss normalization
Thank you!
Questions?