Definition: A supervised learning algorithm used for both classification and regression tasks.
Structure: A tree-like structure where each internal node represents a test on an attribute, each branch represents the outcome of the test, and each leaf node represents a class label or a prediction.
Root Node: The starting point of the tree, representing the entire dataset.
Internal Nodes (Decision Nodes): Represent tests on different attributes.
Leaf Nodes (Terminal Nodes): Represent the final outcome or prediction.
Branches: Represent the possible outcomes of the test at each internal node; each branch leads to either another internal node or a leaf node.
Divide and Conquer approach: Recursively split the dataset into smaller subsets based on the values of attributes.
Goal: To create subsets where all data points within each subset belong to the same class (for classification) or have similar values (for regression).
Imagine you’re building a decision tree to classify animals. The dataset contains features like:
Hair: Yes/No
Feathers: Yes/No
Milk: Yes/No
Eggs: Yes/No
Type: Mammal, Bird, Reptile, etc. (target variable)
Steps to build the tree:
Start at the root node: Evaluate the feature that most effectively splits the dataset into homogeneous classes using measures like Information Gain (Entropy) or Gini Impurity.
For example, the root node might split based on the Hair feature.
Branch out: Create sub-nodes for each possible outcome of the root node split. Repeat the process by choosing the best feature to split further.
If Hair = Yes, the next decision might be based on Milk.
If Hair = No, the decision might consider Feathers.
Stop splitting: Continue until:
Nodes are pure (contain a single class).
No features remain to split.
A stopping criterion (such as maximum depth) is met.
Leaf nodes: These are the terminal nodes where classification outcomes are assigned.
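The steps above can be sketched with scikit-learn's DecisionTreeClassifier. The tiny animal dataset below is hypothetical (the specific animals and labels are made up for illustration), but the feature names mirror the ones described above:

```python
# Illustrative sketch of the animal example. The rows below are
# hypothetical; columns are Hair, Feathers, Milk, Eggs (1 = Yes, 0 = No).
from sklearn.tree import DecisionTreeClassifier, export_text

X = [
    [1, 0, 1, 0],  # dog     -> Mammal
    [1, 0, 1, 0],  # cat     -> Mammal
    [0, 1, 0, 1],  # pigeon  -> Bird
    [0, 1, 0, 1],  # eagle   -> Bird
    [0, 0, 0, 1],  # snake   -> Reptile
    [0, 0, 0, 1],  # lizard  -> Reptile
]
y = ["Mammal", "Mammal", "Bird", "Bird", "Reptile", "Reptile"]

# criterion="entropy" selects splits by Information Gain, as in the text.
clf = DecisionTreeClassifier(criterion="entropy", random_state=0)
clf.fit(X, y)

# Print the learned rules; the splits resemble the Hair/Feathers logic above.
print(export_text(clf, feature_names=["Hair", "Feathers", "Milk", "Eggs"]))
print(clf.predict([[1, 0, 1, 0]]))  # a hairy, milk-producing animal -> Mammal
```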
We use the famous Boston Housing dataset, which contains information about housing prices in various neighborhoods. Features include:
CRIM: Per capita crime rate.
RM: Number of rooms per dwelling.
AGE: Proportion of owner-occupied units built before 1940.
PRICE: Median value of owner-occupied homes (target variable).
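For the regression side, a minimal sketch with DecisionTreeRegressor is shown below. Note that the Boston Housing loader was removed from scikit-learn (version 1.2 onward), so this sketch substitutes a small synthetic stand-in that reuses the feature names above; the coefficients generating PRICE are invented for illustration:

```python
# Hedged sketch: a regression tree on Boston-style features. The data is
# synthetic (the real load_boston was removed from scikit-learn 1.2+).
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
n = 200
CRIM = rng.exponential(3.0, n)   # per capita crime rate
RM = rng.normal(6.0, 0.7, n)     # rooms per dwelling
AGE = rng.uniform(0, 100, n)     # % of units built before 1940
# Invented relationship: more rooms raise the price, crime lowers it.
PRICE = 9.0 * RM - 0.5 * CRIM - 0.02 * AGE + rng.normal(0, 2, n)

X = np.column_stack([CRIM, RM, AGE])
reg = DecisionTreeRegressor(max_depth=4, random_state=0)
reg.fit(X, PRICE)

# Predict the price for a low-crime, spacious, mid-age dwelling.
print(reg.predict([[0.1, 7.0, 30.0]]))
```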
Information Gain: Measures the reduction in entropy after splitting the dataset on an attribute.
Gini Impurity: Measures the probability of misclassifying a randomly chosen element if it were labeled randomly according to the class distribution in the subset.
Definition: A measure of the impurity or randomness of a dataset. Formula: \[ Entropy = - \sum_{i} P(i) \log_2 P(i) \]
This formula calculates the randomness or impurity in a dataset, where \(P(i)\) is the proportion of samples belonging to class \(i\) and the sum runs over all classes.
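The entropy formula can be computed directly with the standard library; the label lists below are hypothetical:

```python
# Entropy = -sum_i P(i) * log2(P(i)), computed over class proportions.
from math import log2

def entropy(labels):
    total = len(labels)
    probs = [labels.count(c) / total for c in set(labels)]
    return sum(-p * log2(p) for p in probs if p > 0)

print(entropy(["A", "A", "B", "B"]))  # maximally mixed (two classes): 1.0
print(entropy(["A", "A", "A", "A"]))  # pure node: 0.0
```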
Definition: How much uncertainty (or entropy) is reduced after splitting the data at a particular node. It's a key concept in building decision trees.
Formula: \[ Information\ Gain = Entropy(Parent) - \sum_{i} \left( \frac{N_{i}}{N_{Total}} \times Entropy(Child_{i}) \right) \]
Where:
\(N_{i}\) is the number of samples in the \(i\)-th child node.
\(N_{Total}\) is the total number of samples in the parent node.
\(Entropy\) denotes the entropy calculation for both the parent and child nodes.
\[ Gini\ Impurity = 1 - \sum_{i} P(i)^2 \]
Where:
\(P(i)\) is the proportion of instances belonging to class \(i\).
Gini Impurity measures how often a randomly chosen element would be misclassified if it were labeled randomly according to the class distribution. It is another common metric for assessing splits in decision trees.
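Like entropy, Gini impurity is a one-liner over the class proportions; the label lists are hypothetical:

```python
# Gini Impurity = 1 - sum_i P(i)^2 over the class proportions.
def gini(labels):
    total = len(labels)
    return 1.0 - sum((labels.count(c) / total) ** 2 for c in set(labels))

print(gini(["A", "A", "B", "B"]))  # maximally mixed (two classes): 0.5
print(gini(["A", "A", "A", "A"]))  # pure node: 0.0
```

Note that for two classes Gini peaks at 0.5 while entropy peaks at 1.0; both are zero for a pure node, which is why they rank candidate splits similarly in practice.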
Simple to understand and to interpret. Trees can be visualized.
Requires little data preparation. Other techniques often require data normalization, the creation of dummy variables, and the removal of blank values. Note, however, that this algorithm does not support missing values.
Able to handle both numerical and categorical data.
Able to handle multi-output problems.
Uses a white box model. Results are easy to interpret.
Possible to validate a model using statistical tests. That makes it possible to account for the reliability of the model.
Makes no assumptions about the data distribution, because the algorithm is non-parametric.
It can be used for feature engineering, such as predicting missing values, and is suitable for variable selection.
It can easily capture non-linear patterns.
Decision-tree learners can create over-complex trees that do not generalize the data well. This is called overfitting. Mechanisms such as pruning, setting the minimum number of samples required at a leaf node or setting the maximum depth of the tree are necessary to avoid this problem.
Decision trees can be unstable because small variations in the data might result in a completely different tree being generated. This problem is mitigated by using decision trees within an ensemble.
Decision tree learners create biased trees if some classes dominate. It is therefore recommended to balance the dataset before fitting the decision tree.
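The mitigations above can be sketched with scikit-learn hyperparameters: max_depth and min_samples_leaf cap tree growth, ccp_alpha applies cost-complexity pruning, and class_weight="balanced" counters class imbalance by reweighting samples inversely to class frequency (an alternative to resampling the data itself). The dataset and parameter values below are illustrative:

```python
# Hedged sketch of overfitting and imbalance mitigations on synthetic data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Imbalanced synthetic dataset: roughly 90% class 0, 10% class 1.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Unconstrained tree: fits the training set perfectly (overfitting risk).
unpruned = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

# Constrained tree: depth cap, leaf-size floor, cost-complexity pruning,
# plus balanced class weights to counter the skewed labels.
pruned = DecisionTreeClassifier(max_depth=4, min_samples_leaf=5,
                                ccp_alpha=0.005, class_weight="balanced",
                                random_state=0).fit(X_tr, y_tr)

print(unpruned.get_depth(), unpruned.score(X_tr, y_tr))  # deep; train acc 1.0
print(pruned.get_depth(), pruned.score(X_te, y_te))      # shallow; test acc
```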