Dr. B. C. Roy Engineering College

Name of the Exam: CA2
Title: Learning of Decision Trees: Top-Down Induction of Decision Trees, Choice of the Attribute, Entropy, Information Gain, and GINI Index, Explained with Examples
Paper Name: Machine Learning
Paper Code: PCC-DS 601(E)
Department: CSE (Data Science)
Semester: 6th Semester
Academic Year: 2024-25
Student Name: Sujoy Paul
Student University Roll No.: 12030523067
Assigned Professor for the Paper: Prof. Banashree Chatterjee
Designation: Assistant Professor, CSE (DS)

Table of Contents

1. Introduction
2. Top-Down Induction of Decision Trees
3. Choice of the Attribute
4. Entropy and Information Gain
   ○ Definition of Entropy
   ○ Definition of Information Gain
   ○ Example of Entropy and Information Gain Calculation
5. GINI Index
   ○ Definition of GINI Index
   ○ Example of GINI Index Calculation
6. Comparative Analysis of Information Gain and GINI Index
7. Conclusion
8. References

1. Introduction

Decision Trees are widely used machine learning algorithms for classification and regression tasks. They model decisions in a tree-like structure in which each internal node represents a test on an attribute, each branch represents a decision rule, and each leaf node represents an outcome. The learning process consists of selecting the best attribute to split on at each node, so as to build a tree of small depth and high predictive accuracy.

2. Top-Down Induction of Decision Trees

Top-down induction is a recursive, greedy approach to constructing a decision tree. It involves:

● Selecting the best attribute at the current node according to a splitting criterion such as Information Gain or the GINI Index.
● Partitioning the dataset into subsets based on the values of the selected attribute.
● Repeating the process recursively on each subset until a stopping condition is met, such as all instances belonging to a single class or the tree reaching a predefined depth.

3. Choice of the Attribute

The attribute selected at each node strongly influences the performance of the resulting tree. The choice is typically made using:

● Entropy and Information Gain (in the ID3 and C4.5 algorithms)
● GINI Index (in the CART algorithm)

4. Entropy and Information Gain

Definition of Entropy

Entropy measures the impurity of a dataset. For a dataset S with c classes, it is defined as:

Entropy(S) = -\sum_{i=1}^{c} p_i \log_2 p_i

where p_i is the proportion of instances in S belonging to class i.

Definition of Information Gain

Information Gain measures the reduction in entropy achieved by splitting the dataset on an attribute A. It is defined as:

Gain(S, A) = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} \, Entropy(S_v)

where S_v is the subset of S for which attribute A takes value v.

Example of Entropy and Information Gain Calculation

Consider a dataset with the following class distribution:

● 5 positive (+) examples
● 5 negative (-) examples

Entropy before splitting:

Entropy(S) = -(5/10) \log_2(5/10) - (5/10) \log_2(5/10) = 1.0

Suppose an attribute splits the dataset into two subsets:

● Subset 1: (3+, 1-) → Entropy = 0.811
● Subset 2: (2+, 4-) → Entropy = 0.918

Information Gain:

Gain(S, A) = 1.0 - (4/10)(0.811) - (6/10)(0.918) = 1.0 - 0.324 - 0.551 = 0.125

5. GINI Index

Definition of GINI Index

The GINI Index measures the impurity of a dataset as:

Gini(S) = 1 - \sum_{i=1}^{c} p_i^2

A lower GINI index indicates a purer node.

Example of GINI Index Calculation

Consider the same dataset (5+, 5-), whose GINI index before splitting is 1 - (0.5^2 + 0.5^2) = 0.5. Suppose the same attribute splits it into:

● Subset 1: (3+, 1-) → GINI = 0.375
● Subset 2: (2+, 4-) → GINI = 0.444

Weighted GINI after the split:

Gini_{split} = (4/10)(0.375) + (6/10)(0.444) = 0.417

Since 0.417 < 0.5, the split reduces impurity, as illustrated by the code sketch below.
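The calculations in sections 4 and 5 can be reproduced in a few lines of code. The following is a minimal Python sketch, not part of the original assignment: the helper names entropy, gini, and weighted_impurity are illustrative, and the label lists simply encode the (5+, 5-) dataset and its (3+, 1-) / (2+, 4-) split from the worked examples above.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total)
                for n in Counter(labels).values())

def gini(labels):
    """GINI impurity of a list of class labels."""
    total = len(labels)
    return 1 - sum((n / total) ** 2 for n in Counter(labels).values())

def weighted_impurity(subsets, measure):
    """Size-weighted impurity of a split under the given measure."""
    total = sum(len(s) for s in subsets)
    return sum(len(s) / total * measure(s) for s in subsets)

# Dataset from the worked examples: 5 positive and 5 negative instances,
# split by some attribute into (3+, 1-) and (2+, 4-).
parent = ['+'] * 5 + ['-'] * 5
subsets = [['+'] * 3 + ['-'] * 1, ['+'] * 2 + ['-'] * 4]

info_gain = entropy(parent) - weighted_impurity(subsets, entropy)
weighted_gini = weighted_impurity(subsets, gini)

print(f"Entropy before split : {entropy(parent):.3f}")  # 1.000
print(f"Information Gain     : {info_gain:.3f}")        # 0.125
print(f"Weighted GINI        : {weighted_gini:.3f}")    # 0.417
```

In a full top-down inducer, a score like this would be computed for every candidate attribute at a node, and the node would split on the attribute with the highest Information Gain (or, equivalently for CART, the lowest weighted GINI).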
6. Comparative Analysis of Information Gain and GINI Index

● Information Gain is used in ID3 and C4.5; in its plain form it is biased towards attributes with many distinct values (C4.5 counteracts this with the Gain Ratio).
● The GINI Index is used in CART and is less biased towards attributes with many values.
● Information Gain requires logarithms, making it more expensive to compute.
● The GINI Index involves only squares and sums, so it is faster to evaluate.

7. Conclusion

Decision Trees are powerful yet interpretable models for classification. The choice of attribute selection criterion directly affects their accuracy and efficiency. Information Gain and the GINI Index are two widely used measures, each with its own strengths. A deeper understanding of these criteria helps in optimizing tree-based models.

8. References

● Quinlan, J. R. (1986). "Induction of Decision Trees." Machine Learning, 1(1), 81-106.
● Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classification and Regression Trees. Wadsworth.
● Mitchell, T. M. (1997). Machine Learning. McGraw-Hill.