Random forests, often also called random decision forests, are a machine learning technique that can be used for both classification and regression problems. They work by constructing a number of decision tree classifiers or regressors and combining the outputs of all the trees into a single result. Because they are built on the simple idea of the wisdom of crowds, random forests are very powerful machine learning tools: they keep the simplicity of decision trees while adding the power of the ensemble.
In this article we are going to explore the basic concepts behind random forests and then see how we can implement them using Python and Scikit-Learn. The full code for this article can be found on Github.
Interested in more stories like this? Follow me on Twitter at @b_dmarius and I'll post there every new article.
- Decision Trees
- Introduction to Random Forest Classifiers
- Introduction to Random Forest Regressors
- Random Forest algorithm explained
- Decision Trees vs Random Forests
- Random Forests applications and use cases
- Random Forest Sklearn Python implementation
- Visualizing Random Forests
Decision Trees are a type of supervised machine learning model that analyses the features of the given data and tries to learn how to split the data into trees and subtrees so that it can correctly predict new values.
A simple, layman's explanation of a decision tree is that it figures out the best questions you can ask of a dataset so that the answers to those questions help you classify a data entry into one of many categories, or correctly estimate some value for that entry.
I'm sure you've seen a tree or even a decision tree before, so here's a quick image that can help us better understand the concept.
You see, the tree has nodes, edges and leaf nodes, and you navigate it from top to bottom to classify a data entry or estimate a value. The goal of the decision tree algorithm is to figure out which of the features in the data we should use in our decision tree and what the correct way to structure the tree is.
If you need more information on decision trees, you can check out Decision Tree Classifiers Explained, where we go into a more in-depth explanation of the concept. You can also check out how to implement a Decision Tree classifier in Python using Scikit-Learn.
Introduction To Random Forest Classifiers
A random forest classifier is, as the name implies, a collection of decision tree classifiers, each doing its best to offer the best output. Because we are talking about classification, and there is no order relation between two or more classes, the final output of the random forest classifier is the mode of the classes.
This means the "winner" class is the one that appears most often in the list of outputs from all the decision trees used.
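As a quick sketch of that majority vote (the vote list below is hypothetical, not from any dataset in this article), the mode can be computed with `collections.Counter`:

```python
from collections import Counter

def majority_vote(predictions):
    # The mode: the class returned by the largest number of trees
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical outputs from five decision trees
votes = ["Yes", "No", "Yes", "Yes", "No"]
print(majority_vote(votes))  # prints Yes -- it got 3 of the 5 votes
```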
We all know Machine Learning models are never perfect and that sometimes they offer us incorrect data, especially when the data is sparse or we don't have lots of features we can look at.
But random forests are far better than single decision trees because they use the decisions of an ensemble of trees to determine the final output. And while some decision trees may indeed be wrong, there's a high chance that, given carefully prepared data, most of the trees in the forest will offer a better output.
Introduction To Random Forest Regressors
Decision trees, and therefore random forests, can also be used for regression problems. While in classification tasks we need to predict the class to which an entry in our dataset belongs, in regression tasks the goal is to predict a continuous value by modelling a set of independent variables.
For random forest regressors, the final answer of the ensemble is the mean of the list of all outputs offered by every decision tree in the forest.
The rest of the process is basically the same as for decision tree classifiers. You deploy a series of decision trees, let them build their own decisions, and then use the wisdom of the crowds to produce one final output.
The power of the algorithm again lies in the number of individual trees, because the mean of a series of numbers is heavily influenced by where most of the values in the series are located.
More high values pull the mean higher, low values push it lower, and evenly distributed values keep the mean in the middle.
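To make that concrete, here is a tiny hypothetical example (the per-tree estimates are invented) of averaging the outputs of a forest of regressors:

```python
def forest_mean(predictions):
    # The ensemble's answer: the mean of all per-tree outputs
    return sum(predictions) / len(predictions)

# Four trees agree on values around 10-12; one outlier says 30
estimates = [10.0, 12.0, 11.0, 30.0, 12.0]
print(forest_mean(estimates))  # prints 15.0 -- the single high value pulls the mean up
```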
Random Forest algorithm explained
As I kept thinking about the wisdom of the crowds while writing this article, I realized there is one key aspect we need to take into consideration when deciding whether or not to use random forests.
In real life, an ideal democracy is powerful because the collective power of all the people who vote is greater than any individual's power, and that helps push the democracy forward for the benefit of everybody. But what happens if, suddenly, one individual begins to gain more influence among their peers? That individual will start influencing others, who will then push the democracy in a direction that might not be the best decision.
That's also the case with decision trees: we need to make sure the trees don't get to influence each other. In the language of statistics, that means finding a way to ensure the decision trees are not correlated, and this is achieved by employing two methods.
The first method is based on choosing a random subset of features for each individual tree in the forest.
Let's say we have a dataset with N available features. A normal decision tree looks at all N features, but in a random forest each individual tree analyzes only a subset of M < N features, with the M features chosen randomly from the whole set. Because the trees are forced to look at different parts of the data, it is natural that their results will be less correlated with each other.
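A minimal sketch of this first method (the feature names and the `feature_subset` helper are hypothetical): each tree receives M features sampled without replacement from the N available.

```python
import random

def feature_subset(all_features, m):
    # Sample M of the N features, without replacement
    return random.sample(all_features, m)

# Hypothetical feature set, N = 4
all_features = ["weather", "timeOfWeek", "timeOfDay", "season"]
print(feature_subset(all_features, 2))  # e.g. ['season', 'weather']
```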
The second method is called bootstrap aggregation, and it means choosing a random sample of the entries in the dataset to train each decision tree on. The sampling is done with replacement, with the sample kept at the same size as the whole set, so duplicates can appear and every decision tree still gets a full-sized training set.
For example, if the dataset has the rows [a, b, c, d, e] for a specific column, a decision tree might be trained on a sample that looks like [a, a, b, c, c]. Notice that the set and the sample have the same number of elements, but the elements are drawn randomly and can repeat.
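That sampling step can be sketched in one line with Python's standard library (`bootstrap_sample` is a made-up helper name): `random.choices` draws with replacement, so duplicates are expected.

```python
import random

def bootstrap_sample(rows):
    # Draw len(rows) elements WITH replacement -- repeats are allowed
    return random.choices(rows, k=len(rows))

rows = ["a", "b", "c", "d", "e"]
sample = bootstrap_sample(rows)
print(sample)  # e.g. ['a', 'a', 'b', 'c', 'c'] -- same size as the original set
```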
Taking all of this into consideration, we can sketch a simple algorithm for building a random forest of decision trees. Let's say we want to build a random forest of T trees from a dataset with R rows and N features.
For each one of the T decision trees:
- Select a random subset of features from the whole set of features
- Select a random subset of rows from the whole dataset
- Build the current tree using only the selected feature and row subsets
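The three steps above can be sketched with Scikit-Learn's DecisionTreeClassifier as the building block. This is an illustrative toy under stated assumptions (the tiny dataset and the `build_forest` function are invented for this sketch), not how Scikit-Learn implements its forests internally:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)

def build_forest(X, y, n_trees=5, n_sub_features=2):
    forest = []
    n_rows, n_features = X.shape
    for _ in range(n_trees):
        # Step 1: a random subset of features, without replacement
        feature_idx = rng.choice(n_features, size=n_sub_features, replace=False)
        # Step 2: a bootstrap sample of rows, with replacement
        row_idx = rng.choice(n_rows, size=n_rows, replace=True)
        # Step 3: train one tree on just those rows and features
        tree = DecisionTreeClassifier().fit(X[row_idx][:, feature_idx], y[row_idx])
        forest.append((tree, feature_idx))  # keep the feature subset for prediction
    return forest

# Tiny invented dataset: 6 rows, 3 features, binary labels
X = np.array([[0, 1, 2], [1, 0, 2], [2, 1, 0], [0, 0, 1], [2, 2, 1], [1, 2, 0]])
y = np.array([1, 1, 0, 1, 0, 0])
forest = build_forest(X, y)
print(len(forest))  # 5 trees, each with its own feature subset
```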
Decision Trees vs Random Forests
So all the previous sections suggest that random forests will usually perform better than simple decision trees. But as with all machine learning problems, choosing one model over another is a process of experimentation and depends on your data and on the objectives you have set for your model.
Decision trees might be more suitable when you need a simple model that trains quickly on small or simple datasets. If the data is simple in terms of features or entries, a single decision tree will most probably perform very well.
A decision tree might also be more appropriate when you want a model that's easy to understand and visualize, or when you need your model to be explainable. That way your results will be more predictable and you can easily explain to other interested parties how the model works.
You can also choose decision trees when you don't worry, or don't want to worry, about certain features in your dataset being correlated. If the features are correlated, a decision tree will perform far more poorly than a random forest.
For all other cases, you can certainly go with random forests. As a general rule of thumb, we might say that you should try decision trees first, then random forests, and see which works best for you.
Random forests applications and use cases
All of the previous sections bring us to a simple conclusion: a random forest is like a decision tree, but on steroids. So all the use cases that apply to a decision tree will most definitely also apply to a random forest. That being said:
- You can use random forests both for classification and regression tasks.
- Use random forests when you have already tried a decision tree, concluded that you need higher accuracy, and the computational costs in terms of money, data or time are not a problem
- Use random forests when you have tried a decision tree already and after testing and validation you observe your decision tree is overfitting
- Use random forests if your dataset has too many features for a decision tree to handle
Random Forest Python Sklearn implementation
We can use the Scikit-Learn python library to build a random forest model in no time and with very few lines of code.
We will first need to install a few dependencies before we begin.
```shell
pip3 install scikit-learn
pip3 install matplotlib
pip3 install pydotplus
pip3 install ipython
```
We will use the same small dataset we built for the Decision Tree model. Here's a quick peek at the data.
The goal here is to predict whether we'll encounter a traffic jam on our way through the city, depending on 3 variables: the weather, whether it's a weekday or a weekend, and the moment of the day. We can see that we have a classification problem, so we will build a Random Forest Classifier.
Let's import our dependencies into our project.
```python
from sklearn import preprocessing
from sklearn.ensemble import RandomForestClassifier
from IPython.display import Image
import pydotplus
from sklearn import tree
```
The code for building the small dataset will be in the Github repository for this article, but the main idea is that we'll have four methods, one for each of the columns from the table in the image above.
```python
weather = getWeather()
timeOfWeek = getTimeOfWeek()
timeOfDay = getTimeOfDay()
trafficJam = getTrafficJam()
```
The next thing we need to do is encode our data using the LabelEncoder class from Scikit-Learn. What this encoder does is take each unique value in a column and map it to a number from 0 to n-1 (where n is the number of unique values in that column), so that each unique text value corresponds to a specific number.
We will use this encoding for every value in our dataset so that the model can work with numbers and perform well on our dataset.
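As a quick illustration of what LabelEncoder does (the weather strings below are hypothetical, not the article's exact dataset), unique values are sorted alphabetically and numbered from 0:

```python
from sklearn import preprocessing

labelEncoder = preprocessing.LabelEncoder()
# Three unique strings become 0..2, assigned in alphabetical order
encoded = labelEncoder.fit_transform(["Sunny", "Rainy", "Snowy", "Sunny"])
print(encoded.tolist())                # [2, 0, 1, 2]
print(labelEncoder.classes_.tolist())  # ['Rainy', 'Snowy', 'Sunny']
```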
```python
# Create the encoder, then encode the features and the labels
labelEncoder = preprocessing.LabelEncoder()
encodedWeather = labelEncoder.fit_transform(weather)
encodedTimeOfWeek = labelEncoder.fit_transform(timeOfWeek)
encodedTimeOfDay = labelEncoder.fit_transform(timeOfDay)
encodedTrafficJam = labelEncoder.fit_transform(trafficJam)
```
We will then build our features dataset that will be fed into our classifier.
```python
# Build the features
features = []
for i in range(len(encodedWeather)):
    features.append([encodedWeather[i], encodedTimeOfWeek[i], encodedTimeOfDay[i]])
```
Now it's time to train our random forest classifier using the RandomForestClassifier class from Scikit-Learn. We are going to use only 5 estimators (meaning 5 decision trees). The reason for going with such a low number of estimators is that they are more than enough for such a small dataset (even a single decision tree works perfectly fine), and I want them to be easy to visualize in the next section.
```python
classifier = RandomForestClassifier(n_estimators=5)
classifier.fit(features, encodedTrafficJam)
```
It's time now to use the classifier for predictions. We can do that very easily like this:
```python
# ["Snowy", "Workday", "Morning"]
print(classifier.predict([[2, 1, 2]]))
# Prints [1], meaning "Yes"

# ["Clear", "Weekend", "Lunch"]
print(classifier.predict([[0, 0, 1]]))
# Prints [0], meaning "No"
```
Of course the classifier works perfectly! 😀 Using a Random Forest classifier for such a small dataset will definitely offer you great results. I went for a small dataset so that we can concentrate better on the model and not on the data.
Visualizing Random Forests
We can visualize a random forest by looking at each decision tree individually and examining its structure (see why I went with so few estimators?). Here's a method we can use for visualizing a decision tree.
```python
def printTree(classifier, index):
    feature_names = ['Weather', 'Time of Week', 'Time of Day']
    # LabelEncoder sorts labels alphabetically, so class 0 is "No" and class 1 is "Yes"
    target_names = ['No', 'Yes']
    # Build the data
    dot_data = tree.export_graphviz(classifier, out_file=None,
                                    feature_names=feature_names,
                                    class_names=target_names)
    # Build the graph
    graph = pydotplus.graph_from_dot_data(dot_data)
    # Write the image
    Image(graph.create_png())
    graph.write_png("tree" + str(index) + ".png")
```
We can iterate through all the estimators from the Random Forest and call this method.
```python
for index in range(len(classifier.estimators_)):
    printTree(classifier.estimators_[index], index)
```
Here are our five decision trees.
You can see that even for such a small dataset, each and every decision tree looks at least a little bit different from the others.
This was a long but fun article to write. We've given an introduction to random forest classifiers and regressors, analyzed the key differences between random forests and decision trees, and then seen how we can build a random forest classifier using Scikit-Learn.
Where to go from here? Don't forget you can check a more in-depth explanation of decision trees or a step by step tutorial on how to implement a decision tree classifier using scikit-learn.
Thank you so much for reading this!