Random forest

An individual model may make mistakes, but if we combine the outputs of multiple models, the chance of an incorrect answer is reduced. Such a combination of models is called an ensemble, and the process of combining their outputs is called ensemble learning.
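
To see why this helps, consider a toy calculation: if three independent models are each correct 70% of the time, a majority vote among them is correct whenever at least two of them are right, which happens about 78% of the time:

p = 0.7
# majority is right if all three are right, or exactly two are right
p_majority = p**3 + 3 * p**2 * (1 - p)
print(p_majority)  # 0.784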

For this to work, the models need to be different. If we train the same decision tree model 10 times on the same data, all 10 trees will make identical predictions, so the ensemble is no better than a single tree.

The easiest way to obtain different models is to train each tree on a different random subset of the features. When we combine their predictions, their individual mistakes average out, giving the ensemble more predictive power than any single tree, as the sketch below illustrates.
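
To make the idea concrete, here is a minimal sketch (not scikit-learn's exact algorithm): train several decision trees, each on a random half of the features, and average their predictions. It assumes X_train, y_train, and X_val are the NumPy arrays prepared earlier.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
n_features = X_train.shape[1]

trees = []
subsets = []

for _ in range(10):
    # each tree sees a different random half of the features
    subset = rng.choice(n_features, size=n_features // 2, replace=False)
    tree = DecisionTreeRegressor(random_state=1)
    tree.fit(X_train[:, subset], y_train)
    trees.append(tree)
    subsets.append(subset)

# averaging the individual predictions gives the ensemble's prediction
ensemble_pred = np.mean(
    [t.predict(X_val[:, s]) for t, s in zip(trees, subsets)],
    axis=0,
)

A real random forest goes further: it also trains each tree on a bootstrap sample of the rows and re-picks the candidate features at every split, but the averaging idea is the same.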

This way of putting together multiple decision trees into an ensemble is called a Random Forest. Scikit-learn contains an implementation of a random forest, so we can use it for solving our problem.

Training a random forest

We need to import RandomForestRegressor from the ensemble package:

from sklearn.ensemble import RandomForestRegressor

Let’s create and train a random forest model:

rf = RandomForestRegressor(n_estimators=10, random_state=1, n_jobs=-1)
rf.fit(X_train, y_train)

Setting n_estimators=10 specifies the number of trees in the ensemble, random_state=1 fixes the seed of the random-number generator so the results are reproducible, and n_jobs=-1 tells scikit-learn to train the trees using all available CPU cores.

After training finishes, we can evaluate the model’s performance on the validation set:

y_pred_val = rf.predict(X_val)
RMSE(y_val, y_pred_val)
#42.13724207871227
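
The RMSE helper is assumed to be defined earlier; if you don't have it at hand, a minimal version, assuming y_val and the predictions are NumPy arrays, looks like this:

import numpy as np

def RMSE(y, y_pred):
    # root mean squared error between actual and predicted values
    error = y - y_pred
    return np.sqrt((error ** 2).mean())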

The number of trees in the ensemble is an important parameter that influences the performance of the model. Usually, a model with more trees is better than one with fewer trees; on the other hand, adding too many trees is not always helpful. Let’s see how the validation RMSE changes as the ensemble grows from 10 to 200 trees:

import numpy as np
import matplotlib.pyplot as plt
import mplcyberpunk
plt.style.use("cyberpunk")

rmse = []
n_estimators = np.arange(10, 201, 10)

for n in n_estimators:
    # train a forest with n trees and record its validation RMSE
    rf = RandomForestRegressor(n_estimators=n, random_state=1, n_jobs=-1)
    rf.fit(X_train, y_train)
    y_pred_val = rf.predict(X_val)
    rmse.append(RMSE(y_val, y_pred_val))

plt.plot(n_estimators, rmse)
plt.title("Number of trees vs RMSE")
plt.xlabel("Number of trees")
plt.ylabel("RMSE")

The RMSE decreases as the number of trees increases, reaching its minimum at around 80 trees; with more than 80 trees, the RMSE begins to increase again.
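
Rather than eyeballing the plot, we can read the best value directly from the sweep, using the rmse list and the n_estimators array from the loop above:

# index of the smallest validation RMSE in the sweep
best = int(np.argmin(rmse))
print(n_estimators[best], rmse[best])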

Parameter tuning for random forest