Definition: A supervised learning algorithm used for both classification and regression tasks.
Structure: A tree-like structure where each internal node represents a test on an attribute, each branch represents the outcome of the test, and each leaf node represents a class label or a prediction.
Root Node: The starting point of the tree, representing the entire dataset.
Internal Nodes (Decision Nodes): Represent tests on different attributes.
Leaf Nodes (Terminal Nodes): Represent the final outcome or prediction.
Branches: Represent the possible outcomes of the test at each internal node; each branch leads to either another internal node or a leaf node.
Divide and Conquer approach: Recursively split the dataset into smaller subsets based on the values of attributes.
Goal: To create subsets where all data points within each subset belong to the same class (for classification) or have similar values (for regression).
Imagine you’re building a decision tree to classify animals. The dataset contains features like:
Hair: Yes/No
Feathers: Yes/No
Milk: Yes/No
Eggs: Yes/No
Type: Mammal, Bird, Reptile, etc. (target variable)
Steps to build the tree:
Start at the root node: Evaluate the feature that most effectively splits the dataset into homogeneous classes using measures like Information Gain (Entropy) or Gini Impurity.
For example, the root node might split based on the Hair feature.
Branch out: Create sub-nodes for each possible outcome of the root node split. Repeat the process by choosing the best feature to split further.
If Hair = Yes, the next decision might be based on Milk.
If Hair = No, the decision might consider Feathers.
Stop splitting: Continue until:
Nodes are pure (contain a single class).
No features remain to split.
A stopping criterion (such as maximum depth) is met.
Leaf nodes: These are the terminal nodes where classification outcomes are assigned.
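The steps above can be sketched with scikit-learn's DecisionTreeClassifier. The tiny animal dataset below is hypothetical (the specific animals and labels are made up for illustration), but the feature names mirror the ones described above:

```python
# Illustrative sketch of the animal example. The rows below are
# hypothetical; columns are Hair, Feathers, Milk, Eggs (1 = Yes, 0 = No).
from sklearn.tree import DecisionTreeClassifier, export_text

X = [
    [1, 0, 1, 0],  # dog     -> Mammal
    [1, 0, 1, 0],  # cat     -> Mammal
    [0, 1, 0, 1],  # pigeon  -> Bird
    [0, 1, 0, 1],  # eagle   -> Bird
    [0, 0, 0, 1],  # snake   -> Reptile
    [0, 0, 0, 1],  # lizard  -> Reptile
]
y = ["Mammal", "Mammal", "Bird", "Bird", "Reptile", "Reptile"]

# criterion="entropy" selects splits by Information Gain, as in the text.
clf = DecisionTreeClassifier(criterion="entropy", random_state=0)
clf.fit(X, y)

# Print the learned rules; the splits resemble the Hair/Feathers logic above.
print(export_text(clf, feature_names=["Hair", "Feathers", "Milk", "Eggs"]))
print(clf.predict([[1, 0, 1, 0]]))  # a hairy, milk-producing animal -> Mammal
```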
We use the famous Boston Housing dataset, which contains information about housing prices in various neighborhoods. Features include:
CRIM: Per capita crime rate.
RM: Number of rooms per dwelling.
AGE: Proportion of owner-occupied units built before 1940.
PRICE: Median value of owner-occupied homes (target variable).
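For the regression side, a minimal sketch with DecisionTreeRegressor is shown below. Note that the Boston Housing loader was removed from scikit-learn (version 1.2 onward), so this sketch substitutes a small synthetic stand-in that reuses the feature names above; the coefficients generating PRICE are invented for illustration:

```python
# Hedged sketch: a regression tree on Boston-style features. The data is
# synthetic (the real load_boston was removed from scikit-learn 1.2+).
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
n = 200
CRIM = rng.exponential(3.0, n)   # per capita crime rate
RM = rng.normal(6.0, 0.7, n)     # rooms per dwelling
AGE = rng.uniform(0, 100, n)     # % of units built before 1940
# Invented relationship: more rooms raise the price, crime lowers it.
PRICE = 9.0 * RM - 0.5 * CRIM - 0.02 * AGE + rng.normal(0, 2, n)

X = np.column_stack([CRIM, RM, AGE])
reg = DecisionTreeRegressor(max_depth=4, random_state=0)
reg.fit(X, PRICE)

# Predict the price for a low-crime, spacious, mid-age dwelling.
print(reg.predict([[0.1, 7.0, 30.0]]))
```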
Information Gain: Measures the reduction in entropy after splitting the dataset on an attribute.
Gini Impurity: Measures the probability of misclassifying a randomly chosen element if it were labeled randomly according to the class distribution in the subset.
Definition: A measure of the impurity or randomness of a dataset. Formula: \[ Entropy = - \sum_{i} P(i) \log_2 P(i) \]
This formula calculates the randomness or impurity in a dataset, where \(P(i)\) is the proportion of samples belonging to class \(i\) and the sum runs over all classes.
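The entropy formula can be computed directly with the standard library; the label lists below are hypothetical:

```python
# Entropy = -sum_i P(i) * log2(P(i)), computed over class proportions.
from math import log2

def entropy(labels):
    total = len(labels)
    probs = [labels.count(c) / total for c in set(labels)]
    return sum(-p * log2(p) for p in probs if p > 0)

print(entropy(["A", "A", "B", "B"]))  # maximally mixed (two classes): 1.0
print(entropy(["A", "A", "A", "A"]))  # pure node: 0.0
```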
Definition: How much uncertainty (or entropy) is reduced after splitting the data at a particular node. It's a key concept in building decision trees.
Formula: \[ Information\ Gain = Entropy(Parent) - \sum_{i} \left( \frac{N_{i}}{N_{Total}} \times Entropy(Child_{i}) \right) \]
Where:
\(N_{i}\) is the number of samples in the \(i\)-th child node.
\(N_{Total}\) is the total number of samples in the parent node.
\(Entropy\) denotes the entropy calculation for both the parent and child nodes.
\[ Gini\ Impurity = 1 - \sum_{i} P(i)^2 \]
Where:
\(P(i)\) is the proportion of instances belonging to class \(i\).
Gini Impurity measures how often a randomly chosen element would be misclassified if it were labeled randomly according to the class distribution. It is another common metric for assessing splits in decision trees.
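Like entropy, Gini impurity is a one-liner over the class proportions; the label lists are hypothetical:

```python
# Gini Impurity = 1 - sum_i P(i)^2 over the class proportions.
def gini(labels):
    total = len(labels)
    return 1.0 - sum((labels.count(c) / total) ** 2 for c in set(labels))

print(gini(["A", "A", "B", "B"]))  # maximally mixed (two classes): 0.5
print(gini(["A", "A", "A", "A"]))  # pure node: 0.0
```

Note that for two classes Gini peaks at 0.5 while entropy peaks at 1.0; both are zero for a pure node, which is why they rank candidate splits similarly in practice.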
Simple to understand and to interpret. Trees can be visualized.
Requires little data preparation. Other techniques often require data normalization, the creation of dummy variables, and the removal of blank values. Note, however, that this algorithm does not support missing values.
Able to handle both numerical and categorical data.
Able to handle multi-output problems.
Uses a white box model. Results are easy to interpret.
Possible to validate a model using statistical tests. That makes it possible to account for the reliability of the model.
Makes no assumptions about the data distribution, because the algorithm is non-parametric.
It can be used for feature engineering, such as predicting missing values, and is suitable for variable selection.
It can easily capture non-linear patterns.
Decision-tree learners can create over-complex trees that do not generalize the data well. This is called overfitting. Mechanisms such as pruning, setting the minimum number of samples required at a leaf node or setting the maximum depth of the tree are necessary to avoid this problem.
Decision trees can be unstable because small variations in the data might result in a completely different tree being generated. This problem is mitigated by using decision trees within an ensemble.
Decision tree learners create biased trees if some classes dominate. It is therefore recommended to balance the dataset before fitting the decision tree.
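The mitigations above can be sketched with scikit-learn hyperparameters: max_depth and min_samples_leaf cap tree growth, ccp_alpha applies cost-complexity pruning, and class_weight="balanced" counters class imbalance by reweighting samples inversely to class frequency (an alternative to resampling the data itself). The dataset and parameter values below are illustrative:

```python
# Hedged sketch of overfitting and imbalance mitigations on synthetic data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Imbalanced synthetic dataset: roughly 90% class 0, 10% class 1.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Unconstrained tree: fits the training set perfectly (overfitting risk).
unpruned = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

# Constrained tree: depth cap, leaf-size floor, cost-complexity pruning,
# plus balanced class weights to counter the skewed labels.
pruned = DecisionTreeClassifier(max_depth=4, min_samples_leaf=5,
                                ccp_alpha=0.005, class_weight="balanced",
                                random_state=0).fit(X_tr, y_tr)

print(unpruned.get_depth(), unpruned.score(X_tr, y_tr))  # deep; train acc 1.0
print(pruned.get_depth(), pruned.score(X_te, y_te))      # shallow; test acc
```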