
Decision Tree Learning: Entropy, GINI Index, Information Gain

Dr. B. C. Roy Engineering College
Name of the Exam: CA2
Title: Learning of Decision Tree: Top-Down induction
of Decision Tree, Choice of the attribute, Entropy,
Information Gain, GINI Index explain with example
Paper Name: Machine Learning
Paper Code: PCC-DS 601(E)
Department: CSE (Data Science)
Semester: 6th Semester
Academic Year: 2024-25
Student Name: Sujoy Paul
Student University Roll no: 12030523067
Assigned Professor for the Paper: Prof. Banashree Chatterjee
Designation: Assistant Professor, CSE(DS)
Table of Contents
1. Introduction
2. Top-Down Induction of Decision Trees
3. Choice of the Attribute
4. Entropy and Information Gain
○ Definition of Entropy
○ Definition of Information Gain
○ Example of Entropy and Information Gain Calculation
5. GINI Index
○ Definition of GINI Index
○ Example of GINI Index Calculation
6. Comparative Analysis of Information Gain and GINI Index
7. Conclusion
8. References
1. Introduction
Decision Trees are widely used machine learning algorithms for classification and
regression tasks. They model decisions in a tree-like structure, where each internal
node represents an attribute, each branch represents a decision rule, and each leaf
node represents an outcome. Learning a tree amounts to repeatedly selecting the best
attribute on which to split the dataset, so that the resulting tree is shallow and
predictively accurate.
2. Top-Down Induction of Decision Trees
The top-down induction method is a recursive approach to constructing a decision
tree. It involves:
● Selecting the best attribute based on a criterion such as Information Gain or the GINI Index.
● Splitting the dataset into subsets based on the selected attribute.
● Repeating the process recursively until a stopping condition is met, such as all instances belonging to a single class or reaching a predefined tree depth (see the sketch below).
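To make the recursion concrete, here is a minimal Python sketch of top-down induction. The data layout and names are illustrative assumptions, not from the source: rows are tuples whose last element is the class label, and the scoring function `information_gain` is the one sketched in Section 4.

```python
from collections import Counter

def build_tree(rows, attributes, depth=0, max_depth=5):
    """Grow a decision tree top-down. rows: tuples ending in a class label;
    attributes: column indices still available for splitting."""
    labels = [row[-1] for row in rows]
    # Stopping conditions: pure node, no attributes left, or depth limit reached.
    if len(set(labels)) == 1 or not attributes or depth == max_depth:
        return Counter(labels).most_common(1)[0][0]  # leaf: majority class
    # Greedy step: pick the attribute with the highest score (see Section 4).
    best = max(attributes, key=lambda a: information_gain(rows, a))
    children = {}
    for value in {row[best] for row in rows}:
        subset = [row for row in rows if row[best] == value]
        rest = [a for a in attributes if a != best]
        children[value] = build_tree(subset, rest, depth + 1, max_depth)
    return {"split_on": best, "children": children}
```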
3. Choice of the Attribute
The selection of attributes at each node greatly impacts the performance of the
decision tree. The choice is typically made based on:
● Entropy and Information Gain (used by the ID3 and C4.5 algorithms)
● GINI Index (used by the CART algorithm)
4. Entropy and Information Gain
Definition of Entropy
Entropy measures the impurity of a dataset S. It is defined as:
Entropy(S) = -\sum_{i=1}^{c} p_i \log_2 p_i
where p_i is the proportion of instances belonging to class i in dataset S, and c is the number of classes.
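As a quick illustration, a small Python helper (an illustrative sketch, not from the source) computes the entropy of a list of class labels:

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy(S) = -sum over classes of p_i * log2(p_i)."""
    total = len(labels)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(labels).values())

print(entropy(["+"] * 5 + ["-"] * 5))  # 1.0: a 50/50 split is maximally impure
```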
Definition of Information Gain
Information Gain measures the reduction in entropy achieved by splitting the dataset
on an attribute A. It is defined as:
Gain(S, A) = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} Entropy(S_v)
where S_v is the subset of S for which attribute A takes the value v.
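Continuing the sketch, a hypothetical `information_gain` helper built on the `entropy` function above; it again assumes rows are tuples whose last element is the class label:

```python
def information_gain(rows, attribute):
    """Gain(S, A) = Entropy(S) - sum(|S_v|/|S| * Entropy(S_v))."""
    labels = [row[-1] for row in rows]
    total = len(rows)
    remainder = 0.0
    for value in {row[attribute] for row in rows}:
        subset_labels = [row[-1] for row in rows if row[attribute] == value]
        remainder += (len(subset_labels) / total) * entropy(subset_labels)
    return entropy(labels) - remainder
```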
Example of Entropy and Information Gain Calculation
Consider a dataset with the following class distribution:
●​ 5 positive (+) examples
●​ 5 negative (-) examples
Entropy before splitting:
Entropy(S) = -0.5 \log_2 0.5 - 0.5 \log_2 0.5 = 1.0
If an attribute splits the dataset into subsets:
● Subset 1: (3+, 1-) → Entropy = 0.811
● Subset 2: (2+, 4-) → Entropy = 0.918
Information Gain:
Gain(S, A) = 1.0 - \frac{4}{10}(0.811) - \frac{6}{10}(0.918) = 1.0 - 0.324 - 0.551 = 0.125
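These numbers can be reproduced with the `entropy` helper sketched above:

```python
before = entropy(["+"] * 5 + ["-"] * 5)   # 1.0
s1 = entropy(["+"] * 3 + ["-"] * 1)       # ~0.811
s2 = entropy(["+"] * 2 + ["-"] * 4)       # ~0.918
gain = before - (4 / 10) * s1 - (6 / 10) * s2
print(round(gain, 3))                     # 0.125
```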
5. GINI Index
Definition of GINI Index
GINI Index measures the impurity of a dataset using:
Gini(S) = 1 - \sum_{i=1}^{c} p_i^2
where p_i is the proportion of instances belonging to class i. A lower GINI index indicates a purer node.
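Mirroring the entropy helper, a minimal Gini sketch (illustrative, not from the source):

```python
from collections import Counter

def gini(labels):
    """Gini(S) = 1 - sum over classes of p_i^2."""
    total = len(labels)
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())

print(gini(["+"] * 5 + ["-"] * 5))  # 0.5: maximum impurity for two classes
```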
Example of GINI Index Calculation
Consider the same dataset (5+, 5-). GINI index before splitting:
Gini(S) = 1 - 0.5^2 - 0.5^2 = 0.5
If an attribute splits the dataset into:
● Subset 1: (3+, 1-) → GINI = 0.375
● Subset 2: (2+, 4-) → GINI = 0.444
Weighted GINI:
Gini_{split} = \frac{4}{10}(0.375) + \frac{6}{10}(0.444) = 0.150 + 0.267 = 0.417
Since 0.417 is lower than the parent's 0.5, the split reduces impurity.
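Likewise, the `gini` helper above reproduces the weighted figure:

```python
g1 = gini(["+"] * 3 + ["-"] * 1)          # 0.375
g2 = gini(["+"] * 2 + ["-"] * 4)          # ~0.444
weighted = (4 / 10) * g1 + (6 / 10) * g2
print(round(weighted, 3))                 # 0.417, down from 0.5 before the split
```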
6. Comparative Analysis of Information Gain and GINI Index
● Information Gain is used in ID3 and C4.5; in its plain form it favours attributes with many unique values, a bias that C4.5 mitigates with the gain ratio.
● GINI Index is used in CART and is faster to compute, since it avoids the logarithms that Information Gain requires.
● GINI Index is less biased towards attributes with many values.
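In practice, both criteria are available off the shelf: scikit-learn's DecisionTreeClassifier accepts criterion="gini" or criterion="entropy", so the two measures can be compared directly (the Iris dataset here is just a convenient stand-in, not from the source):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
for criterion in ("gini", "entropy"):
    clf = DecisionTreeClassifier(criterion=criterion, random_state=0)
    score = cross_val_score(clf, X, y, cv=5).mean()  # 5-fold cross-validation
    print(f"{criterion}: mean CV accuracy = {score:.3f}")
```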
7. Conclusion
Decision Trees are powerful yet interpretable models for classification. The choice of
attribute selection criterion impacts their accuracy and performance. Information
Gain and GINI Index are two widely used measures, each with its strengths. A
deeper understanding of these methods helps in optimizing tree-based models.
8. References
● Quinlan, J. R. (1986). "Induction of Decision Trees." Machine Learning, 1(1), 81-106.
● Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classification and Regression Trees. Wadsworth.
● Mitchell, T. M. (1997). Machine Learning. McGraw-Hill.