What is Random forest?

Random forest is an algorithm that builds on top of decision trees. It can be used for both classification and regression tasks.

It is a form of ensemble learning, wherein multiple individual machine learning models are trained and then combined to obtain the final prediction.

In Random forest, each of these individual machine learning models is a decision tree trained on a random subset of the training data with a random subset of training features.

Sounds intimidating? 😰 Here's an illustration to help you out.

Figure: A Random forest combining several decision trees to make a prediction. Image courtesy: Tony Yiu

Simply put, Random forests rely on the power of unrelated crowds (a.k.a. decision trees with random data).

Put together a group of models without any correlation and they'll perform better than any individual model. That's because while some members of the crowd (individual decision trees) may be wrong, others may be right.

As a group, the unrelated crowds help make better decisions.

Why the name Random forest, you might wonder. 🤔 The forest comes from the multiple trees that take part in the algorithm. The random comes from the random selection of training points and features.

How does a Random forest work?

Bootstrap aggregating (also known as bagging) is the basis for the working of Random forest.

Bagging

In bagging, for each tree in the ensemble, we randomly sample a fixed number of training points from the training dataset, with replacement. Each tree is then trained on its own sampled data. In the end, the results of all the trees are aggregated.
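
Here's a minimal sketch of that sampling step, assuming scikit-learn is available. The function name `train_bagged_trees` and its parameters are illustrative, not from any particular library.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def train_bagged_trees(X, y, n_trees=10, random_state=0):
    """Train `n_trees` decision trees, each on its own bootstrap sample."""
    rng = np.random.default_rng(random_state)
    n_samples = X.shape[0]
    trees = []
    for _ in range(n_trees):
        # Sample training points *with replacement* (a bootstrap sample).
        idx = rng.integers(0, n_samples, size=n_samples)
        tree = DecisionTreeClassifier().fit(X[idx], y[idx])
        trees.append(tree)
    return trees
```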

Like we mentioned earlier, Random forests work for both regression and classification. Here's how.

For a classification task, you take the majority vote across all the trees, and for a regression task, you take the average of the output values of all the trees.
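
Continuing the sketch above, the two aggregation rules might look like this (assuming integer class labels for the classification case):

```python
import numpy as np

def predict_classification(trees, X):
    # Each row of `votes` holds one tree's predicted labels for all samples.
    votes = np.array([tree.predict(X) for tree in trees]).astype(int)
    # Majority vote: pick the most frequent label for each sample (column).
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)

def predict_regression(trees, X):
    # Average the output values of all the trees.
    return np.mean([tree.predict(X) for tree in trees], axis=0)
```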

Random forests and bagging

Random forest takes the concept of bagging slightly further. Instead of just sampling training data for each tree, it also samples the features.

So, for each tree, we select only a fixed number of features.

In machine learning, a feature is a measurable property of your data that you can use in your analysis. If you're analyzing the fitness of your customers, then age is a feature you might consider in your analysis.

These features are randomly selected from all the available features. Therefore, each tree of the ensemble is now trained on a random subset of the training data with a random subset of the features. This helps reduce variance (how much the model's predictions change when the training data changes) in the resulting Random forest model and makes it less sensitive to outliers.
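
As a sketch, the per-tree feature sampling could be added to our earlier code like this. The parameter `n_features_per_tree` is illustrative; note that scikit-learn's own Random forest instead re-samples features at every split of every tree, via its `max_features` parameter.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def train_random_forest(X, y, n_trees=10, n_features_per_tree=3, random_state=0):
    rng = np.random.default_rng(random_state)
    n_samples, n_features = X.shape
    forest = []
    for _ in range(n_trees):
        # Bootstrap sample of rows, plus a random subset of feature columns.
        idx = rng.integers(0, n_samples, size=n_samples)
        feats = rng.choice(n_features, size=n_features_per_tree, replace=False)
        tree = DecisionTreeClassifier().fit(X[idx][:, feats], y[idx])
        forest.append((tree, feats))  # remember which features each tree saw
    return forest
```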

What are the drawbacks of the Random forest algorithm?

The biggest strength of a decision tree is its interpretability. However, in Random forest, you train a large number of individual decision trees independently. This makes it more difficult to interpret Random forests.

To tackle this, several methods exist. An example is treeinterpreter.

Treeinterpreter is a Python library that helps you determine the impact of each feature on the predictions of a Random forest model.
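
Its usage is short. Assuming an already-fitted scikit-learn forest `rf` and some samples `X_test`, a sketch looks like this:

```python
# Requires `pip install treeinterpreter`.
from treeinterpreter import treeinterpreter as ti

prediction, bias, contributions = ti.predict(rf, X_test)
# For each sample, the prediction decomposes into the dataset-wide bias
# plus one contribution per feature, so you can see which features
# pushed an individual prediction up or down.
```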

Nonetheless, the ease of interpretability that decision trees offer is lost in Random forests. This makes it difficult to explain the model's predictions.

Yet another issue with Random forest algorithms is the number of hyperparameters: the number of trees, the depth of each tree, and the number of features to be considered, among others.

These hyperparameters take time to tune and often have to be adjusted heuristically. As a result, tuning a Random forest to achieve above-average results can be an extremely time-consuming process.
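
One common way to automate part of that search is scikit-learn's RandomizedSearchCV. The value ranges below are illustrative, not recommendations, and `X_train`, `y_train` stand in for your own training data.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

param_distributions = {
    "n_estimators": [100, 200, 500],        # number of trees
    "max_depth": [None, 5, 10, 20],         # depth of each tree
    "max_features": ["sqrt", "log2", 0.5],  # features considered per split
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions,
    n_iter=10,   # number of random combinations to try
    cv=5,        # 5-fold cross-validation per combination
    random_state=0,
)
# search.fit(X_train, y_train)
# print(search.best_params_)
```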

When should you use the Random forest algorithm?

Random forest is one of the most widely used algorithms in machine learning and has been a part of various Kaggle competition solutions.

It has several use cases, from fraud detection in banking and recommendation systems in streaming services to predicting stock market behavior and determining the right combination of chemicals for medicines.

You should definitely try Random forests on tabular datasets. You can expect really good results after investing some time in hyperparameter tuning.
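
For example, a first attempt on a small tabular dataset takes only a few lines with scikit-learn (using its built-in iris data here):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)
print(rf.score(X_test, y_test))  # mean accuracy on the held-out split
```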


See also

Articles you might be interested in

  1. Understanding Random Forest
  2. Sklearn Random forest classifier documentation
  3. Treeinterpreter - a method to explain outputs of Random Forest