What is a decision tree?
A decision tree is a predictive model that maps decisions and their possible outcomes in a tree-like structure.
It is used for both classification and regression tasks.
Consider the following dataset. It has two features, weather and temperature, and a yes/no target: whether we play outside.
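Weather     Temperature   Play Outside
Sunny       Hot           No
Sunny       Mild          Yes
Rainy       Cool          No
Rainy       Mild          No
Overcast    Cool          Yes
Overcast    Mild          Yes
(This is the same data that the Julia implementation below encodes numerically.)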
What is entropy?
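Entropy measures how impure a set of labels is: it is 0 when every example belongs to the same class and largest when the classes are evenly mixed. For a set S in which class c appears with proportion p_c,
H(S) = - Σ_c p_c · log2(p_c)
For two classes split 50/50, H(S) = 1 bit; for a pure set, H(S) = 0.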
Information gain measures how much information a split gives us about the target, and it is computed from entropy.
We take the entropy of the data before the split and subtract the weighted entropy of each partition produced by the split.
We then select the split that yields the largest reduction in entropy, or equivalently, the largest increase in information.
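Concretely, for a split of S into partitions S_1, …, S_k,
IG(S, split) = H(S) - Σ_i (|S_i| / |S|) · H(S_i)
that is, the entropy before the split minus the weighted average entropy of the partitions.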
Let us build a decision tree step by step
Step 1: Calculate Root Node Entropy
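The dataset above has three Yes and three No examples at the root, so
H(root) = -(3/6)·log2(3/6) - (3/6)·log2(3/6) = 1 bit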
Step 2: Calculate information gain for each feature
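For this hand calculation, each value of a feature forms its own branch. Splitting on weather gives Sunny = {No, Yes} (entropy 1), Rainy = {No, No} (entropy 0) and Overcast = {Yes, Yes} (entropy 0):
IG(root, weather) = 1 - [(2/6)·1 + (2/6)·0 + (2/6)·0] ≈ 0.667
Splitting on temperature gives Hot = {No} (entropy 0), Mild = {Yes, No, Yes} (entropy ≈ 0.918) and Cool = {No, Yes} (entropy 1):
IG(root, temperature) = 1 - [(1/6)·0 + (3/6)·0.918 + (2/6)·1] ≈ 0.208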
Step 3: Choose the best feature based on information gain
IG(root, weather) = 0.667
IG(root, temperature) = 0.208
Thus weather is the best feature because it has the higher information gain.
Step 4: Split the data
Splitting on weather gives three branches: the rainy branch is pure (all No), the overcast branch is pure (all Yes), and the sunny branch still contains one Yes and one No.
Step 5: Find the best feature at the impure internal node
The sunny branch is still impure, so we compute the information gain of the remaining feature, temperature, on that subset and split on it.
Step 6: Repeat these steps until every leaf is pure or a stopping condition (such as a maximum depth or a minimum number of samples per node) is reached
Implementation in Julia
Import package
StatsBase is used to compute the mode (the majority class) at the leaves.
using StatsBase
Dataset definition
We encode the three weather conditions as the numbers 1, 2, 3 and the three temperatures likewise as 1, 2, 3. The first column of X is weather and the second is temperature.
# Features: Weather (1=Sunny, 2=Rainy, 3=Overcast), Temperature (1=Hot, 2=Mild, 3=Cool)
# Labels: Play Outside (0=No, 1=Yes)
X = [
1 1;
1 2;
2 3;
2 2;
3 3;
3 2
]
println(typeof(X)) # Matrix{Int64} on a typical 64-bit system
y = [0, 1, 0, 0, 1, 1] # Labels (Play Outside: No or Yes)
Function for entropy calculation
function entropy(labels)
n = length(labels)
if n == 0
return 0.0 # Entropy is 0 for an empty set
end
probs = [count(labels .== c) / n for c in unique(labels)] # proportion of each class
return -sum(p * log2(p) for p in probs if p > 0) # Shannon entropy in bits
end
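As a quick sanity check with the dataset above (three No and three Yes), the root entropy should come out to exactly 1 bit:
println(entropy(y))      # 1.0 (an even 3/3 split of the labels)
println(entropy(y[1:4])) # ≈ 0.811 (three No and one Yes)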
Function for information gain calculation
In this code block, left_y holds the labels that go down the left branch of the node and right_y holds the labels that go down the right branch.
function information_gain(X, y, feature_index, threshold)
# Perform element-wise comparison for splitting
left_indices = findall(X[:, feature_index] .<= threshold)
right_indices = findall(X[:, feature_index] .> threshold)
left_y = y[left_indices]
right_y = y[right_indices]
# Calculate entropy before the split
entropy_before = entropy(y)
# Calculate entropy after the split
n = length(y)
entropy_after = (length(left_y) / n) * entropy(left_y) + (length(right_y) / n) * entropy(right_y)
# Information gain is the reduction in entropy
return entropy_before - entropy_after
end
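As a quick check with the dataset above (note that this function splits on a numeric threshold, so the gain for weather is not the same as the three-way split computed by hand in Step 2):
println(information_gain(X, y, 1, 2)) # ≈ 0.459 (weather <= 2 vs. weather > 2)
println(information_gain(X, y, 2, 1)) # ≈ 0.191 (temperature <= 1 vs. temperature > 1)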
Find the best split
Finding the best split means finding both the best feature and the best threshold. For example, splitting feature 1 (weather) at the value 1 separates sunny days (weather = 1) from the rest.
function best_split(X, y)
best_gain = -Inf
best_feature = -1
best_threshold = -1
n_features = size(X, 2)
for feature_index in 1:n_features
thresholds = unique(X[:, feature_index]) # Test each unique value in the feature
for threshold in thresholds
gain = information_gain(X, y, feature_index, threshold)
if gain > best_gain
best_gain = gain
best_feature = feature_index
best_threshold = threshold
end
end
end
return best_feature, best_threshold, best_gain
end
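On the full dataset this should pick the weather feature with threshold 2, i.e. the split checked above:
println(best_split(X, y)) # (1, 2, 0.459...), i.e. feature 1 (weather) at threshold 2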
Code for building decision tree
Tree building stops when the depth reaches the maximum depth, when the entropy of a node becomes zero, or when a node contains one data point or none. Also note that build_tree calls itself recursively until a stopping criterion is reached.
function build_tree(X, y, depth=0, max_depth=3)
# Check stopping conditions
if entropy(y) == 0 || length(y) <= 1 || depth >= max_depth
# Return a leaf node with the majority class
return mode(y)
end
# Find the best split
feature_index, threshold, gain = best_split(X, y)
if gain == 0 # No further information can be gained
return mode(y)
end
# Split the dataset
left_indices = findall(X[:, feature_index] .<= threshold)
right_indices = findall(X[:, feature_index] .> threshold)
left_X = X[left_indices, :]
right_X = X[right_indices, :]
left_y = y[left_indices]
right_y = y[right_indices]
# Build the left and right subtrees recursively
left_tree = build_tree(left_X, left_y, depth + 1, max_depth)
right_tree = build_tree(right_X, right_y, depth + 1, max_depth)
# Return the decision node
return Dict(
:feature_index => feature_index,
:threshold => threshold,
:left => left_tree,
:right => right_tree
)
end
Train and print the decision tree
tree = build_tree(X, y)
# Print the decision tree
println(tree)
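The printed tree is a nested Dict for internal nodes and an integer class label at the leaves. As a minimal sketch of how it could be used (the predict helper below is not part of the original code), one can walk the Dict until a leaf is reached:
# Hypothetical helper: classify one sample by walking the Dict-based tree
function predict(node, x)
    node isa Dict || return node # a leaf stores the majority class directly
    if x[node[:feature_index]] <= node[:threshold]
        return predict(node[:left], x)
    else
        return predict(node[:right], x)
    end
end
# Predictions on the training data should reproduce y = [0, 1, 0, 0, 1, 1]
println([predict(tree, X[i, :]) for i in 1:size(X, 1)])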