Reinforcement Learning is a Machine Learning paradigm in which agents learn to take the best decisions in order to maximize a reward. It is a very popular type of Machine Learning because some view it as a way to build algorithms that act as closely as possible to how human beings do: choosing, at every step, the action that leads to the highest possible reward.


In this article we are going to take a first step into the Reinforcement Learning field with a basic introduction, followed by an overview of some of the most popular use cases for Reinforcement Learning. Then we are going to take a look at a particular type of Reinforcement Learning, called Q-Learning.

Interested in more stories like this? Follow me on Twitter at @b_dmarius and I'll post there every new article.
This article is part of a two-part mini-series on Reinforcement Learning. In this article we are going to explore the theoretical concepts of Reinforcement Learning. If you want to, you can jump straight to a concrete example: Reinforcement Learning Tic Tac Toe Python implementation.

Article Overview

  • What is Reinforcement Learning
  • Reinforcement Learning applications and examples
  • When to use Reinforcement Learning
  • What is Q-Learning
  • Conclusions and next steps

What is Reinforcement Learning

Reinforcement Learning is one of the three basic paradigms for Machine Learning, with the other two being Supervised Learning and Unsupervised Learning.

In Reinforcement Learning we build agents that try to solve a problem by looking at the current state and trying to find the best action to take, so that in the end the reward they receive for solving the problem is maximized.

Actually the usual approach here is to reward the agent when it manages to solve the problem and penalize it when it fails. Other approaches may use gradual rewards, meaning the reward gets bigger as the agent does better at solving the problem: finding a faster or a more complete solution.

When we build Supervised Machine Learning models we prepare pairs of inputs and labeled outputs so that the model can know what's correct and what's wrong and try to figure out meaningful correlations between inputs and outputs.

When we build Unsupervised Machine Learning models we don't have labeled outputs; instead, the model looks for structure in the data on its own, and we evaluate its output with metrics we define to decide whether the result is good enough.

In Reinforcement Learning we have neither of these. Here we let the agent take whatever steps it decides to take, and only at the end do we reward or penalize it, then we let it try solving the problem again. The basic idea is that the agent will learn which actions led it to solving the problem and which actions led it to failing. In this way, it will learn a correlation between a state, the actions it took from that state, and whether they led to success or failure.
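
To make this concrete, here is a minimal sketch of that interaction loop in Python. The `env` object and its methods (`reset`, `available_actions`, `step`, `final_reward`) are illustrative assumptions, not a specific library API; the only point is that the reward arrives once, at the end of the episode.

```python
def run_episode(env, choose_action):
    """Let the agent act until the episode ends; the reward arrives only at the end."""
    state = env.reset()
    history = []                        # (state, action) pairs taken this episode
    done = False
    while not done:
        action = choose_action(state, env.available_actions(state))
        history.append((state, action))
        state, done = env.step(action)
    # Only now do we reward or penalize the agent for the whole run,
    # e.g. +1 if it solved the problem, -1 if it failed.
    reward = env.final_reward()
    return history, reward
```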

Reinforcement Learning applications and examples

Many people have heard about Reinforcement Learning for the first time when they read about DeepMind and their projects. The company leverages Reinforcement Learning along with other techniques for various use cases, but I think their most popular project is AlphaGo, a computer program that plays the board game Go. AlphaGo was the first program to ever beat a 9-dan professional Go player, and since then it has had three successively more powerful iterations.

Other than that, many articles around the internet (including mine today) use Reinforcement Learning methods for game playing, because the problem statement is simpler than for other use cases and so it lets you focus on the RL algorithm itself (thus allowing for a better introductory overview).

But Reinforcement Learning can be used for a variety of other applications, so it's not just for fun projects or didactic and research purposes.

A primary use case is in robotics, where RL techniques can be used to allow robots to perform certain actions that may be too difficult or too dangerous for human beings. By allowing a robot to learn from its previous actions and apply that knowledge to future actions, we can build robots that can achieve human-level performance and perhaps even go beyond it!

Reinforcement Learning can also be used by businesses to help them make decisions and model certain business use cases. A good RL agent can help you figure out how to optimize your business processes, your stocks or other resources in your company.

RL can also be used for trading. Here the state of the problem can be represented by the current price levels of certain stocks, and the actions are defined by whether or not the agent buys some stocks.

When to use Reinforcement Learning

It is important to note here that even though we've described a contrast between Reinforcement Learning, Supervised and Unsupervised Learning, the three paradigms are not actually competing with each other. That means there is no point in comparing RL with the other two types of Machine Learning, because they usually address different types of problems.

Reinforcement Learning is best used for problems where, at any given point in time, the model needs to be told how well or how poorly it has been performing, and based on that it should take the next actions.

In Supervised Learning we tell the model how it should have performed: whether it made the correct decision or not. So for cases where there is a clear delimitation between good and bad decisions, and we can build a training set and a test set based on that, we should use a Supervised Learning method.

Take for example a classification problem - for example Decision Trees or a Naive Bayes Classifier. The model looks at some inputs and tries to map those inputs to one of the predefined classes. We tell the model whether it has made the correct decision, and based on that information it will rework the mappings to improve its results.

In Unsupervised Learning we observe the output of the model and decide whether it is good enough or whether the model should try to come up with a better solution. For this we need to define some metrics to measure whether the solution is good enough, or how close we are to a sufficient solution. If we can do that, then we should use Unsupervised Learning.

Take for example a k-means clustering algorithm or a topic modelling algorithm. The model comes up with a candidate solution, and we define some metrics by which we can conclude whether the solution is good enough. If the solution is not ready yet, the model keeps going in the same direction and tries one more iteration before we verify again whether we are ready to stop.

In Reinforcement Learning, finding the perfect solution is more difficult, because the solution is not found after one iteration but after a variable number of trials. Let's say we want to teach an agent to play a game. If the agent wins the game, it is not safe to say that the last move made it win. The win was rather the result of several good moves made throughout the game. And the same applies to losing: the final result is made up of several good or poor moves made throughout the game.

What is Q-Learning

Q-Learning is a particular type (and one of the most popular implementations) of Reinforcement Learning. Before we move to explaining the Q-Learning method, let's agree on a few definitions.

  • Agent: the model we have built - it can be a player in a game, or a trader on the stock market.
  • Environment: the space the Agent observes and in which it applies its decisions.
  • Action: this one is pretty self-explanatory. It's the move the player takes in the game or the action to buy or not to buy on the stock market.
  • State: current state of the environment and of the agent.
  • Reward: it's the feedback the environment gives back to the agent after an action is taken: can be points in a game, or profits on the stock market.
  • Value: an estimation made by the agent on the reward it's going to get. This value will be adjusted after the reward is received and it will be used to adjust the behaviour of the agent.
  • Policy: the strategy of the agent - a mapping from states (and the values the agent estimates) to the actions it is going to take.

The Q in Q-Learning stands for quality. Q-Learning is a type of Reinforcement Learning where the model tries to assess the quality of each possible action given the current state of the environment, and then takes the best action each time (meaning the highest-quality action). The assessment of the quality of the possible actions is based on past decisions and the rewards the agent has received from the environment.

To understand how that works, let's first write down the Bellman equation (as most Q-Learning implementations use this equation or some variation of it).

The Q-Learning update rule, derived from the Bellman equation:

Q(state, action) ← Q(state, action) + α · (reward + γ · max_a Q(next state, a) − Q(state, action))

Here α is the learning rate, γ is the discount factor, and max_a Q(next state, a) is the highest Q value among the actions a available from the next state. Source: https://en.wikipedia.org/wiki/Q-learning
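
In code, one step of this update can look like the sketch below. It assumes the Q table is a plain Python dict mapping (state, action) pairs to numbers, which is one simple way to store it (not the only one); unseen pairs default to 0.

```python
def q_update(q, state, action, reward, next_state, next_actions,
             alpha=0.5, gamma=0.9):
    """Apply one Q-Learning update, following the equation above.

    q            -- dict mapping (state, action) pairs to Q values
    next_actions -- actions available from next_state (empty if the game ended)
    alpha        -- learning rate, gamma -- discount factor
    """
    old_value = q.get((state, action), 0.0)
    # The best value the agent currently believes it can get from the next state.
    best_future = max((q.get((next_state, a), 0.0) for a in next_actions),
                      default=0.0)
    q[(state, action)] = old_value + alpha * (reward + gamma * best_future - old_value)
    return q
```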

Now let's use the concrete example of learning to play a game as an RL agent. Let's say we are at a particular stage during the game and it's our turn to make a move.

We first assess which move we think is best given the current state of the game. For this we store a table: a mapping between every possible state, every available move, and the Q value at the intersection of that state and action.

We choose the action with the highest Q value and keep doing this until the end of the game. Then we look at the result: if the agent won the game, it receives a positive reward; if it lost, we penalize it.

The agent then uses the Bellman equation to propagate that reward back into the Q mapping table, so that next time it knows which moves are best in a particular situation.
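
One simple way to do this, assuming the only reward arrives at the very end of the game, is to record the (state, action) pairs the agent visited and walk them backwards once the result is known. This is a sketch of that idea, not the only possible update scheme:

```python
def propagate_reward(q, history, final_reward, alpha=0.5, gamma=0.9):
    """Push the end-of-game reward back through the recorded moves.

    history -- list of (state, action) pairs taken during the game, in order
    """
    target = final_reward               # the last move is judged by the game result
    for state, action in reversed(history):
        old_value = q.get((state, action), 0.0)
        q[(state, action)] = old_value + alpha * (target - old_value)
        # Earlier moves are judged by the discounted value of the move that followed.
        target = gamma * q[(state, action)]
    return q
```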

If there are several possible actions with the same highest Q value, we can simply choose one of them at random.
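
Picking the best move, including the random tie-break described above, can then be a small helper like this (again assuming the dict-based Q table from the earlier sketch):

```python
import random

def best_action(q, state, available_actions):
    """Return the action with the highest Q value; break ties at random."""
    values = {a: q.get((state, a), 0.0) for a in available_actions}
    highest = max(values.values())
    return random.choice([a for a, v in values.items() if v == highest])
```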

There are two parameters here that we can play with to see what makes our agent learn better.

  • Learning rate: this accounts for how quickly the agent takes in new information versus how strongly it holds on to old information. The learning rate is a value in the (0, 1) interval. A value closer to 0 means the agent tends to stick to old information: new updates will have a smaller effect, so the agent will adapt more slowly.
  • Discount factor: this helps our agent weigh the temporal difference between the several actions taken during one iteration. In our game example, a low discount factor will make our agent attribute the win or loss mostly to the last moves of the game, while a high discount factor will spread the credit to earlier moves as well (a short numeric illustration follows right after this list).
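
As a quick numeric illustration (the numbers are arbitrary, chosen only to show the effect of the two parameters), here is how the same update moves the Q value for different values of alpha and gamma:

```python
# Current Q value 0.2, reward 1.0, best Q value of the next state 0.5.
old_q, reward, best_next = 0.2, 1.0, 0.5

for alpha in (0.1, 0.9):        # slow learner vs fast learner
    for gamma in (0.1, 0.9):    # short-sighted vs far-sighted
        new_q = old_q + alpha * (reward + gamma * best_next - old_q)
        print(f"alpha={alpha}, gamma={gamma} -> new Q = {new_q:.3f}")
```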

One more important thing that needs to be taken into account is allowing the model to explore different strategies. From time to time we should let our agent take its next move not based on the Q values, but by choosing randomly from the list of available actions. This will allow it to explore the environment better and maybe come up with better solutions.
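
This idea is commonly called an epsilon-greedy strategy: with a small probability the agent explores a random move, otherwise it exploits the best known one. A minimal sketch, reusing the dict-based Q table assumed earlier:

```python
import random

def choose_action(q, state, available_actions, epsilon=0.1):
    """Epsilon-greedy selection: explore with probability epsilon, exploit otherwise."""
    if random.random() < epsilon:
        return random.choice(available_actions)
    values = {a: q.get((state, a), 0.0) for a in available_actions}
    highest = max(values.values())
    return random.choice([a for a, v in values.items() if v == highest])
```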

Conclusions and next steps

In this article we've made a small introduction to the large field of Reinforcement Learning. We've seen what Reinforcement Learning is, how it compares to Supervised and Unsupervised Learning, and how it can be used, based on some real-world applications and examples. Then we jumped to a particular implementation of Reinforcement Learning: the Q-Learning algorithm.

The next step would be to look at a concrete example: Reinforcement Learning Tic Tac Toe Python implementation.

Interested in more stories like this? Follow me on Twitter at @b_dmarius and I'll post there every new article.