Reinforcement learning is a Machine Learning paradigm oriented on agents learning to take the best decisions in order to maximize a reward. It is a very popular type of Machine Learning algorithms because some view it as a way to build algorithms that act as close as possible to human beings: choosing the action at every step so that you get the highest reward possible.

Change of Perspectives
Photo by Lea Böhm / Unsplash
Interested in more stories like this? Follow me on Twitter at @b_dmarius and I'll post there every new article.
This article is part of a two-part mini-series on Reinforcement Learning. In this article we are going to explore the practical aspects of implementing a Reinforcement Learning agent in Python for a Tic Tac Toe Game. If you want to focus on theoretical aspects of Reinforcement Learning, please check: Reinforcement Learning Explained. What Is Reinforcement Learning.
The entire code for this project can be found on the Tic Tac Toe Reinforcement Learning Python Implementation project on Github. Feel free to star the repository if it helped you in any way.

While in the other article we've explored the technical aspects of Reinforcement Learning, this time we will focus on the more practical aspects of the task.

Reinforcement Learning Tic Tac Toe Python Implementation

So let's jump right into the code. We will need to install only 2 dependencies for this one. We will use matplotlib and pandas so that we can visualize our game state and our agent results.

pip3 install matplotlib
pip3 install pandas


Next up we will define some constants which we will be used in our implementation. The constants are defined in a separate file so that they can be reused everywhere. Here is our file.

from enum import Enum

EMPTY = ' '
VERTICAL_SEPARATOR = '---------------'

class Strategies(Enum):
    Q = "Q"

We have some markers defined for both players, some values that are used for displaying our board and/or our game results, and an enum to define our 2 strategies for the game: some players will play by randomly selecting a move from a list of available moves, while some other players will use Q-Learning to select the best move.

Game board

Now let's define our game board. This will be a class to hold all information about the board of the game and will have the following responsibilities:

  • Hold the current board configuration for the current game
  • Reset the board after a game is ended
  • Compute the game result: we will look at the current board configuration and see if the game was ended or not, and in the case the game is ended, find out who is the winner. Because we've defined our constants earlier, the board does not know what strategy are the 2 players having.
  • Offer a list of available moves at any given point of time. Another approach would have been to let the players try their moves until they find a valid move, but that is computationally expensive and frankly not the right way. So the players will receive only the valid, available moves to choose from.
  • Return a deep copy of the current board configuration. You will see in a minute why we need this.
  • Print board(for debugging purposes).

Here is the content of our file.

import constants
import copy
import os

class Board:

    def __init__(self):
        self.board = []

    def resetBoard(self):
        self.board = [
            [0, 0, 0],
            [0, 0, 0],
            [0, 0, 0]

    def getGameResult(self):
        # Rows
        for i in range(len(self.board)):
            rowSet = set(self.board[i])
            if len(rowSet) == 1 and self.board[i][0] != 0:
                return self.board[i][0]

        # Columns
        for i in range(len(self.board)):
            column = []
            for j in range(len(self.board[i])):
            columnSet = set(column)
            if len(columnSet) == 1 and self.board[0][i] != 0:
                return self.board[0][i]

        # First diagonal
        diagonal = []
        for i in range(len(self.board)):
        if len(set(diagonal)) == 1 and self.board[0][0] != 0:
            return self.board[0][0]

        # Second diagonal
        diagonal = []
        for i in range(len(self.board)):
            diagonal.append(self.board[i][len(self.board) - i - 1])
        if len(set(diagonal)) == 1 and self.board[0][2] != 0:
            return self.board[0][2]

        for i in range(len(self.board)):
            for j in range(len(self.board[i])):
                if self.board[i][j] == constants.EMPTY_VAL:
                    return constants.GAME_STATE_NOT_ENDED

        return constants.GAME_STATE_DRAW

    def getAvailableMoves(self):
        availableMoves = []
        for i in range(len(self.board)):
            for j in range(len(self.board[i])):
                if (self.board[i][j]) == constants.EMPTY_VAL:
                    availableMoves.append([i, j])
        return availableMoves

    def move(self, position, value):
        availableMoves = self.getAvailableMoves()
        for i in range(len(availableMoves)):
            if position[0] == availableMoves[i][0] and position[1] == availableMoves[i][1]:
                self.board[position[0]][position[1]] = value

    def getBoardCopy(self):
        return copy.deepcopy(self.board)

    def printBoard(self):
        for i in range(len(self.board)):
            print(' ', end='')
            for j in range(len(self.board[i])):
                if constants.PLAYER_X_VAL == self.board[i][j]:
                    print(constants.PLAYER_X, end='')
                elif constants.PLAYER_O_VAL == self.board[i][j]:
                    print(constants.PLAYER_O, end='')
                elif constants.EMPTY_VAL == self.board[i][j]:
                    print(constants.EMPTY, end='')
                print(constants.HORIZONTAL_SEPARATOR, end='')


Now it's time to write our code for strategies that the players can take during the game.

There are 2 strategies that can be used in this game:

  • Random: the player will receive a list of available moves from the game and will choose randomly from that list
  • Q-Learning: the player will receive a list of available moves from the game and will use Q-Learning to choose the best move.

For this I've first written a Strategy class to serve as an abstract class for our 2 strategy implementations. This class will never be instantiated, I've only used it for clean code purposes.

Here's the code for our file.

import abc

class Strategy:

    def __init__(self, type):
        self.type = type

    def selectMove(self,  availableMoves, board, playerValue):
        """Select a move from the list of available moves"""

    def reward(self, rewardValue):
        """Applicable only for Q Strategy"""

    def resetStatesHistory(self):
        """Applicable only for Q Strategy """

The type that the Strategy received in the constructor is one from the Strategies enum in the file. The selectMove method will be used by both strategies, while reward and resetStatesHistory will actually only be used by the Q-Learning strategy.

With this being said, let's take a look into our Random Strategy class. This will be a very simple one, with only one method which receives the list of available moves and chooses one randomly.

Let's see the code for our class.

from strategy import Strategy
import random
import constants

class RandomStrategy(Strategy):

    def __init__(self):
        Strategy.__init__(self, constants.Strategies.RANDOM)

    def selectMove(self, availableMoves, board, playerValue):
        return availableMoves[random.randrange(0, len(availableMoves))]

Now comes the juicy part, our Q-Learning strategy. To understand better what I'm doing here, you really need to read about the theoretical aspects on Reinforcement Learning and Q-Learning.

First we will setup a few parameters for our Q-Learning Bellman equation:

  • Learning rate: how quickly the agent learns new information versus how long does the user remember old information.
  • Discount factor/Decay rate: to help our agent decide which moves are more important for the result of the game: early or late moves.
  • Exploratory rate: The probability that the agent will try a random move instead of following the Q-Learning algorithm. I'm using this to allow the agent to explore uncharted teritory with the hopes this will allow it to discover new strategies.

You can play around with these parameters, you might get even better results.

For the Q-Learning algorithm to work, we need to store all the states our agent ever encounters during its lifetime. This will allow our agent to remember which action he took in which state, so that next time he encounters the same state, it will take the same action.

This is the memory of our agent and it will be stored in the states field from our class. This is a map where the key is a hash of our board(a unique string that is generated from the board configuration) and the value is the Q-value of that board configuration, meaning the value obtained by using the Bellman equation.

The last field in our class is the states_history field, which holds the hashes of all the board configuration encountered by the player in the last game. After the game is ended, the model will go through this history in reverse order and compute new q-values that will be used in the new game. This is the learning part of our model.

Let's see the content of our file.

import copy
import random

import constants
from strategy import Strategy

class QStrategy(Strategy):

    def __init__(self):
        Strategy.__init__(self, constants.Strategies.Q)
        self.learningRate = 0.2
        self.decay = 0.8
        self.exploratory_rate = 0.1
        self.states = {}
        self.states_history = []

    def selectMove(self, availableMoves, board, playerValue):
        probability = random.uniform(0, 1)
        if probability <= self.exploratory_rate:
            bestMove = availableMoves[random.randrange(0, len(availableMoves))]
            maxValue = float("-inf")
            bestMove = availableMoves[0]
            for availableMove in availableMoves:
                # get a copy of a board
                boardCopy = copy.deepcopy(board)
                boardCopy[availableMove[0]][availableMove[1]] = playerValue
                value = self.states.get(self.computeHash(boardCopy)) if self.states.get(
                    self.computeHash(boardCopy)) else 0
                if value > maxValue:
                    maxValue = value
                    bestMove = availableMove

        # Remember this state
        boardCopy = copy.deepcopy(board)
        boardCopy[bestMove[0]][bestMove[1]] = playerValue
        return bestMove

    def computeHash(self, board):
        """Rearrange the board matrix as a single vector of 9 elements and convert it to string"""
        return str(board)

    def reward(self, rewardValue):
        for stateHash in reversed(self.states_history):
            if self.states.get(stateHash) is None:
                self.states[stateHash] = 0
            self.states[stateHash] += self.learningRate * (self.decay * rewardValue - self.states[stateHash])
            rewardValue = self.states[stateHash]

    def resetStatesHistory(self):
        self.states_history = []

The selectMove method is, well, the method to select a move 😀. It gets the list of available moves, the current board configuration and whether the player plays with X or O.

We generate a random number between 0 and our exploratory_rate to see if we should allow our model to explore this time or use the q-learning values.

If we need to use our Q-values, we take each and every available move, compute the Q-value for that and try to find the maximum value. If the player has not seen this board configuration ever before, it will not know what to do so we will just use 0 for that situation. After a few thousand games played, this will tend not to happen again, which means our player will know better what to do.

The other very important method is the reward method, which will be called at the end of each game. This will take the positive or negative reward(depending on the game winner) and use the Bellman equation to update the Q values for the board configuration we have seen in the game that just ended.


The next class is the Player class. This holds a Strategy and acts according to that strategy(random or Q-based).

Here's the content of our file.

class Player:

    def __init__(self, value, strategy):
        self.value = value
        self.strategy = strategy

    def getValue(self):
        return self.value

    def move(self, availableMoves, board):
        return self.strategy.selectMove(availableMoves, board, self.value)

    def reward(self, rewardValue):

    def resetStatesHistory(self):


Finally, we've arrived to our Game class. The game has two players and one board, of course, so that's what it takes to construct an object of our Game class.

This has two important methods:

  • playGame controls the flow of one complete game of Tic Tac Toe. While the game is still on(it hasn't ended), it switches the move from one player to another and back). When the game is ended, we reward our players: 1 for the winner, -1 for the loser, and 0.2 to both players in case of a draw. Of course, you can play with these moves and see if you get better results. Giving a bigger value for winning and a smaller value for losing will turn your model into a more aggresive player, while rewarding more for draws will turn your model into a more defensive player.
  • playManyGames is used to control the flow for more than just one game. This takes care of calling the playGame method, then resetting the board of the game between two games. It also collects statistics about the current run so that we know how our players perform.

Here's the content of our file.

import constants
import copy
import pandas as pd

class Game:

    def __init__(self, player1, player2, board):
        self.player1 = player1
        self.player2 = player2
        self.board = board
        self.playerXWins = 0
        self.playerOWins = 0
        self.draws = 0

    def playGame(self):
        playerToMove = self.player1
        while (self.board.getGameResult() == constants.GAME_STATE_NOT_ENDED):
            availableMoves = self.board.getAvailableMoves()
            selectedMove = playerToMove.move(availableMoves, copy.deepcopy(self.board.getBoardCopy()))
            self.board.move(selectedMove, playerToMove.getValue())
            if playerToMove == self.player1:
                playerToMove = self.player2
                playerToMove = self.player1
        if self.board.getGameResult() == self.player1.getValue():
        elif self.board.getGameResult() == self.player2.getValue():


    def playManyGames(self, numberOfGames):
        self.playerXWins = 0
        self.playerOWins = 0
        self.draws = 0
        numberOfGamesAxis = []
        playerXWinsAxis = []
        playerOWinsAxis = []
        drawsAxis = []
        for i in range(numberOfGames):
            totalWins = self.playerXWins + self.playerOWins + self.draws
            if totalWins > 0:
                self.printStatistics(self.playerXWins, self.playerOWins, self.draws, i + 1)
            if self.board.getGameResult() == constants.PLAYER_X_VAL:
                self.playerXWins = self.playerXWins + 1
            elif self.board.getGameResult() == constants.PLAYER_O_VAL:
                self.playerOWins = self.playerOWins + 1
                self.draws = self.draws + 1
            numberOfGamesAxis.append(i + 1)

        statistics = {"Games": numberOfGamesAxis,
                     "X": playerXWinsAxis,
                     "O": playerOWinsAxis,
                     "Draws": drawsAxis}
        return pd.DataFrame(statistics, columns = ['Games', 'X', 'O', 'Draws'])

    def printStatistics(self, playerXWins, playerOWins, draws, numberOfGames):
        print ("Number of games: {}".format(numberOfGames))
        totalWins = playerXWins + playerOWins + draws
        print('X Wins: {} ({}%)'.format(str(playerXWins), str(int(playerXWins * 100 / totalWins))))
        print('O Wins: {} ({}%)'.format(str(playerOWins), str(int(playerOWins * 100 / totalWins))))
        print('Draws:  {} ({}%)'.format(str(draws), str(int(draws * 100 / totalWins))))

    def printFinalStatistics(self):
        totalWins = self.playerXWins + self.playerOWins + self.draws
        print('X Wins: {} ({}%)'.format(str(self.playerXWins), str(int(self.playerXWins * 100 / totalWins))))
        print('O Wins: {} ({}%)'.format(str(self.playerOWins), str(int(self.playerOWins * 100 / totalWins))))
        print('Draws:  {} ({}%)'.format(str(self.draws), str(int(self.draws * 100 / totalWins))))

We also have a method to print the statistics. I've used the method to follow in real time how the players perform while waiting for thousands of games to be played.

Putting it all together

It's time to put all our code together, test it and see our results. We will play 3 games.

  • Random vs Random - I'm using this to see whether X or O tend to win more games in Tic Tac Toe. This will help us put our Q Learning model performance in perspective
  • Random vs Q Learning model
  • Q Learning model vs Random

After the games are played we will gather our statistics, plot the results and discuss on them. Each round will consist of 10000 games each.

Let's see the content of our class.

from random_strategy import RandomStrategy
from q_strategy import QStrategy
from player import Player
import constants
from board import Board
from game import Game
import matplotlib.pyplot as plt

if __name__=="__main__":

    print ("Game 1 - Random vs Random")
    player1Strategy = RandomStrategy()
    player2Strategy = RandomStrategy()
    player1 = Player(constants.PLAYER_X_VAL, player1Strategy)
    player2 = Player(constants.PLAYER_O_VAL, player2Strategy)
    board1 = Board()
    game1 = Game(player1, player2, board1)
    df = game1.playManyGames(10000)
    df.plot(x="Games", y=["X", "O", "Draws"], title="X - Random; O - Random")
    df.plot(x="Games", y=["X", "O", "Draws"])

    print("Game 2 - Q vs Random")
    player3Strategy = QStrategy()
    player3 = Player(constants.PLAYER_X_VAL, player3Strategy)
    board2 = Board()
    game2 = Game(player3, player2, board2)
    df = game2.playManyGames(10000)
    df.plot(x="Games", y=["X", "O", "Draws"], title="X - Q; O - Random")
    df.plot(x="Games", y=["X", "O", "Draws"])

    print("Game 3 - Random vs Q")
    player4Strategy = QStrategy()
    player4 = Player(constants.PLAYER_O_VAL, player4Strategy)
    board3 = Board()
    game3 = Game(player1, player4, board3)
    df = game3.playManyGames(10000)
    df.plot(x="Games", y=["X", "O", "Draws"], title="X - Random; O - Q")
    df.plot(x="Games", y=["X", "O", "Draws"])


First game ends with the following statistics.

X Wins: 5766 (57%)
O Wins: 2936 (29%)
Draws:  1298 (12%)

It's seems that X players tend to win more games. Obviously, our plot tells us the same story.

Reinforcement Learning Tic Tac Toe Python implementation - Random vs Random
Random vs Random

You can see that the percentage of wins keeps the same ratio over the course of the 10000 games, but the X player is the clear winner.

Now let's see the statistics of our Q vs Random round.

X Wins: 9115 (91%)
O Wins: 376 (3%)
Draws:  509 (5%)

It seems clear here that the Q model is crashing the Random player. In fact, there are more draws than O victories.

Let's plot our results.

Reinforcement Learning Tic Tac Toe Python implementation - Q vs Random
Q vs Random

It seems that the Q model was crushing it even from the beginning.

Now let's see our results for our Random vs Q round.

X Wins: 1270 (12%)
O Wins: 7010 (70%)
Draws:  1720 (17%)

Here we have more wins for the random player than in the previous round and more draws. But still the Q model is winning by a wide margin.

Let's plot our results.

Reinforcement Learning Tic Tac Toe Python implementation - Random vs Q
Random vs Q

Aha! In the begginng, the players were kind of equal, the model was still learning(and we saw from the random vs random round that it's more difficult to be the second player). But from a certain point on, the Q model was again crushing the other player, confirming for us that indeed the model was learning.


This was a fun project for me. We saw how to build a Tic Tac Toe game using Python and then how to build a Reinforcement Learning agent that uses Tic Tac Toe to win the game. Next stop should be writing a UI for the game so that we can play agains the Q model and see how much it has learned.

Don't forget, the entire code for this project can be found on the Tic Tac Toe Reinforcement Learning Python Implementation project on Github. Feel free to star the repository if it helped you in any way.
Interested in more stories like this? Follow me on Twitter at @b_dmarius and I'll post there every new article.