<h1 id="bayes-q">A Bayesian Perspective on Q-Learning</h1>
<p><em>Posted 2020-10-21 at <a href="https://brandinho.github.io/bayes-q">https://brandinho.github.io/bayes-q</a></em></p>
<p>This post has moved. Please follow this link: <a href="https://brandinho.github.io/bayesian-perspective-q-learning/">A Bayesian Perspective on Q-Learning</a>.</p>
<div id="disqus_thread"></div>
<script>
/**
* RECOMMENDED CONFIGURATION VARIABLES: EDIT AND UNCOMMENT THE SECTION BELOW TO INSERT DYNAMIC VALUES FROM YOUR PLATFORM OR CMS.
* LEARN WHY DEFINING THESE VARIABLES IS IMPORTANT: https://disqus.com/admin/universalcode/#configuration-variables*/
/*
var disqus_config = function () {
this.page.url = https://brandinho.github.io; // Replace PAGE_URL with your page's canonical URL variable
this.page.identifier = /bayes-q; // Replace PAGE_IDENTIFIER with your page's unique identifier variable
};
*/
(function() { // DON'T EDIT BELOW THIS LINE
var d = document, s = d.createElement('script');
s.src = 'https://brandinho-github-io.disqus.com/embed.js';
s.setAttribute('data-timestamp', +new Date());
(d.head || d.body).appendChild(s);
})();
</script>
<p><em>Tags: Reinforcement Learning</em></p>
<h1 id="quantworld">Deep Learning Presentation at Quant World 2018</h1>
<p><em>Posted 2019-04-02 at <a href="https://brandinho.github.io/quantworld">https://brandinho.github.io/quantworld</a></em></p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/Zy6mp_pRD94" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe>
<div id="disqus_thread"></div>
<script>
/**
* RECOMMENDED CONFIGURATION VARIABLES: EDIT AND UNCOMMENT THE SECTION BELOW TO INSERT DYNAMIC VALUES FROM YOUR PLATFORM OR CMS.
* LEARN WHY DEFINING THESE VARIABLES IS IMPORTANT: https://disqus.com/admin/universalcode/#configuration-variables*/
/*
var disqus_config = function () {
this.page.url = https://brandinho.github.io; // Replace PAGE_URL with your page's canonical URL variable
this.page.identifier = /quantworld; // Replace PAGE_IDENTIFIER with your page's unique identifier variable
};
*/
(function() { // DON'T EDIT BELOW THIS LINE
var d = document, s = d.createElement('script');
s.src = 'https://brandinho-github-io.disqus.com/embed.js';
s.setAttribute('data-timestamp', +new Date());
(d.head || d.body).appendChild(s);
})();
</script>
<p><em>Tags: Deep Learning</em></p>
<h1 id="cost-function-gradients">The Math of Loss Functions</h1>
<p><em>Posted 2019-02-06 at <a href="https://brandinho.github.io/cost-function-gradients">https://brandinho.github.io/cost-function-gradients</a></em></p>
<h2 id="overview">Overview</h2>
<p>In this post we will go over some of the math associated with popular supervised learning loss functions. Specifically, we are going to focus on linear, logistic, and softmax regression. We show that the derivatives used for parameter updates are the same for all of those models! Most people probably won’t care because they use automatic differentiation libraries like TensorFlow, but I find it cool.</p>
<p>Each section in this blog is going to start out with a few lines of math that explain how the model works. Then we are going to dive into the derivative of the loss function with respect to \(z\). I go into a lot of detail when calculating the derivatives - probably more than necessary. I do this because I want everybody to completely understand how the math works.</p>
<p>We will lay out the math for each of the models in the following way:</p>
<ul>
<li>Define a linear equation which will be denoted \(z\)</li>
<li>Define an activation function, if there is one</li>
<li>Define a prediction (transform of the linear equation) which will be denoted \(\hat{y}\)</li>
<li>Define a loss function which will be denoted \(\mathcal{L}\)</li>
</ul>
<p>I am laying it out this way to maintain consistency between each of the models. Keep this in mind when you realize how silly it is to move from \(z\) to \(\hat{y}\) for linear regression.</p>
<p>Before diving into the math it is important to note that I shape the input \(X\) as \((\text{# instances}, \text{# features})\), and the weight matrix \(w\) as \((\text{# features}, \text{# outputs})\). This is an important distinction because the equations would look slightly different if you used \(X^\boldsymbol{\top}\), which a lot of people use for some reason. I personally dislike using rows for features and columns for instances, but if it floats your boat then go for it (you’ll just need to make minor changes to the math notation).</p>
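<p>To make the shape convention concrete, here is a tiny numpy sketch (the dimensions are made up, purely for illustration):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

# Hypothetical dimensions, purely to illustrate the shape convention
n_instances, n_features, n_outputs = 100, 5, 1

X = np.random.normal(size = (n_instances, n_features))  # (# instances, # features)
w = np.random.normal(size = (n_features, n_outputs))    # (# features, # outputs)
b = np.zeros(n_outputs)

z = np.matmul(X, w) + b
print(z.shape)  # (100, 1), one output per instance
</code></pre></div></div>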
<h2 id="linear-regression">Linear Regression</h2>
<p>To start, let’s define our core functions for linear regression:</p>
\[\begin{align*}
&\text{Linear Equation}: &&z = Xw + b \\[1.5ex]
&\text{Activation Function}: &&\text{None} \\[1.5ex]
&\text{Prediction}: &&\hat{y} = z \\[0.5ex]
&\text{Loss Function}: &&\mathcal{L} = \frac{1}{2}(\hat{y} - y)^2
\end{align*}\]
<p>We can also define the functions in python code:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="n">weights</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">normal</span><span class="p">(</span><span class="n">size</span> <span class="o">=</span> <span class="n">n_features</span><span class="p">).</span><span class="n">reshape</span><span class="p">(</span><span class="n">n_features</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span>
<span class="n">bias</span> <span class="o">=</span> <span class="mi">0</span>
<span class="k">def</span> <span class="nf">linear_regression_inference</span><span class="p">(</span><span class="n">inputs</span><span class="p">):</span>
<span class="k">return</span> <span class="n">np</span><span class="p">.</span><span class="n">matmul</span><span class="p">(</span><span class="n">inputs</span><span class="p">,</span> <span class="n">weights</span><span class="p">)</span> <span class="o">+</span> <span class="n">bias</span>
<span class="k">def</span> <span class="nf">calculate_error</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">):</span>
<span class="c1">### Mean Squared Error (I know I'm not taking an average, but you get the point)
</span> <span class="n">y_hat</span> <span class="o">=</span> <span class="n">linear_regression_inference</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
<span class="k">return</span> <span class="mf">0.5</span> <span class="o">*</span> <span class="p">(</span><span class="n">yhat</span> <span class="o">-</span> <span class="n">y</span><span class="p">)</span><span class="o">**</span><span class="mi">2</span>
</code></pre></div></div>
<p>We are interested in calculating the derivative of the loss with respect to \(z\). Throughout this post, we will do this by applying the chain rule:</p>
\[\frac{\partial \mathcal{L}}{\partial z} = \frac{\partial \mathcal{L}}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial z}\]
<p>First we will calculate the partial derivative of the loss with respect to our prediction:</p>
\[\frac{\partial \mathcal{L}}{\partial \hat{y}} = \hat{y} - y\]
<p>Next, although silly, we calculate the partial derivative of our prediction with respect to the linear equation. Of course, because the prediction is just the linear equation in linear regression, the partial derivative is just 1:</p>
\[\frac{\partial \hat{y}}{\partial z} = 1\]
<p>When we combine them together, the derivative of the loss with respect to the linear equation is:</p>
\[\frac{\partial \mathcal{L}}{\partial z} = \hat{y} - y\]
<p>Although this was pretty straightforward, the next two sections are a bit more involved, so buckle up. Get ready to have your mind blown as you learn that \(\frac{\partial \mathcal{L}}{\partial z} = (\hat{y} - y)\) for logistic regression and softmax regression as well!</p>
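<p>If you want to sanity-check this result, a quick finite-difference comparison (a toy sketch with made-up numbers) should agree with \(\hat{y} - y\):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

# Toy check that dL/dz = y_hat - y for linear regression (made-up numbers)
z, y = 1.3, 0.4          # y_hat = z because there is no activation function
loss = lambda z_: 0.5 * (z_ - y)**2

eps = 1e-6
numerical_grad = (loss(z + eps) - loss(z - eps)) / (2 * eps)
analytical_grad = z - y  # y_hat - y

print(numerical_grad, analytical_grad)  # both approximately 0.9
</code></pre></div></div>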
<h2 id="logistic-regression">Logistic Regression</h2>
<p>Like linear regression, we will define the core functions for logistic regression:</p>
\[\begin{align*}
&\text{Linear Equation}: &&z = Xw + b \\[0.5ex]
&\text{Activation Function}: &&\sigma(z) = \frac{1}{1 + e^{-z}} \\[0.5ex]
&\text{Prediction}: &&\hat{y} = \sigma(z) \\[1.5ex]
&\text{Loss Function}: &&\mathcal{L} = -(y\log\hat{y} + (1-y)\log(1-\hat{y}))
\end{align*}\]
<p>We can also define the functions in python code:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="n">weights</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">normal</span><span class="p">(</span><span class="n">size</span> <span class="o">=</span> <span class="n">n_features</span><span class="p">).</span><span class="n">reshape</span><span class="p">(</span><span class="n">n_features</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span>
<span class="n">bias</span> <span class="o">=</span> <span class="mi">0</span>
<span class="k">def</span> <span class="nf">sigmoid</span><span class="p">(</span><span class="n">x</span><span class="p">):</span>
<span class="k">return</span> <span class="mi">1</span> <span class="o">/</span> <span class="p">(</span><span class="mi">1</span> <span class="o">+</span> <span class="n">np</span><span class="p">.</span><span class="n">exp</span><span class="p">(</span><span class="o">-</span><span class="n">x</span><span class="p">))</span>
<span class="k">def</span> <span class="nf">logistic_regression_inference</span><span class="p">(</span><span class="n">x</span><span class="p">):</span>
<span class="k">return</span> <span class="n">sigmoid</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">matmul</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">weights</span><span class="p">)</span> <span class="o">+</span> <span class="n">bias</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">calculate_error</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">):</span>
<span class="c1">### Binary Cross-Entropy
</span> <span class="n">y_hat</span> <span class="o">=</span> <span class="n">logistic_regression_inference</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
<span class="k">return</span> <span class="o">-</span><span class="p">(</span><span class="n">y</span> <span class="o">*</span> <span class="n">np</span><span class="p">.</span><span class="n">log</span><span class="p">(</span><span class="n">y_hat</span><span class="p">)</span> <span class="o">+</span> <span class="p">(</span><span class="mi">1</span> <span class="o">-</span> <span class="n">y</span><span class="p">)</span> <span class="o">*</span> <span class="n">np</span><span class="p">.</span><span class="n">log</span><span class="p">(</span><span class="mi">1</span> <span class="o">-</span> <span class="n">y_hat</span><span class="p">))</span>
</code></pre></div></div>
<p><strong>NOTE</strong>: When calculating the error for logistic regression we usually add a small constant inside the \(\log\) calculation to prevent taking the log of 0.</p>
<p>Again, we use the chain rule to calculate the partial derivative of interest:</p>
\[\frac{\partial \mathcal{L}}{\partial z} = \frac{\partial \mathcal{L}}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial z}\]
<p>The partial derivative of the loss with respect to our prediction is pretty simple to calculate:</p>
\[\frac{\partial \mathcal{L}}{\partial \hat{y}} = -\frac{y}{\hat{y}} + \frac{1-y}{1-\hat{y}}\]
<p>Next we will calculate the derivative of our prediction with respect to the linear equation. We can use a little algebra to move things around and get a nice expression for the derivative:</p>
\[\begin{align*}
\frac{\partial \hat{y}}{\partial z} &= \frac{\partial}{\partial z}\left[\frac{1}{1 + e^{-z}}\right] \\[0.75ex]
&= \frac{e^{-z}}{(1 + e^{-z})^2} \\[0.75ex]
&= \frac{1 + e^{-z} - 1}{(1 + e^{-z})^2} \\[0.75ex]
&= \frac{1 + e^{-z}}{(1 + e^{-z})^2} - \frac{1}{(1 + e^{-z})^2} \\[0.75ex]
&= \frac{1}{1 + e^{-z}} - \frac{1}{(1 + e^{-z})^2} \\[0.75ex]
&= \frac{1}{1 + e^{-z}} \left(1 - \frac{1}{1 + e^{-z}}\right) \\[0.75ex]
&= \hat{y}(1 - \hat{y})
\end{align*}\]
<p>Isn’t that awesome?! Anyways, enough of my love for math, let’s move on. Now we’ll combine the two partial derivatives to get our final expression for the derivative of the loss with respect to the linear equation.</p>
\[\begin{align*}
\frac{\partial \mathcal{L}}{\partial z} &= \left(-\frac{y}{\hat{y}} + \frac{1-y}{1-\hat{y}}\right)\hat{y}(1 - \hat{y}) \\[0.75ex]
&= -\frac{y}{\hat{y}}\hat{y}(1 - \hat{y}) + \frac{1-y}{1-\hat{y}}\hat{y}(1 - \hat{y}) \\[0.75ex]
&= -y(1 - \hat{y}) + (1-y)\hat{y} \\[0.75ex]
&= -y + y\hat{y} + \hat{y} - y\hat{y} \\[0.75ex]
&= \hat{y} - y
\end{align*}\]
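<p>The same kind of finite-difference sanity check works here too (again a toy sketch, with an arbitrary logit and label):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

# Toy check that dL/dz = y_hat - y for the sigmoid + binary cross-entropy pair
sigmoid = lambda z_: 1 / (1 + np.exp(-z_))

z, y = 0.7, 1.0  # arbitrary logit and binary label
bce = lambda z_: -(y * np.log(sigmoid(z_)) + (1 - y) * np.log(1 - sigmoid(z_)))

eps = 1e-6
numerical_grad = (bce(z + eps) - bce(z - eps)) / (2 * eps)
analytical_grad = sigmoid(z) - y

print(numerical_grad, analytical_grad)  # both approximately -0.332
</code></pre></div></div>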
<p>Would you look at that, it’s the exact same!! If you think that is cool (which you should), then just wait for the next section where we go through softmax regression.</p>
<h2 id="softmax-regression">Softmax Regression</h2>
<p>Once again, we will define the core functions for softmax regression:</p>
\[\begin{align*}
&\text{Linear Equation}: &&z = Xw + b \\[0.5ex]
&\text{Activation Function}: &&\varphi(z_i) = \frac{e^{z_i}}{\sum_n e^{z_n}} \\[0.5ex]
&\text{Prediction}: &&\hat{y_i} = \varphi(z_i) \\[1.5ex]
&\text{Loss Function}: &&\mathcal{L} = -\sum_i y_i\log\hat{y_i}
\end{align*}\]
<p>We can also define the functions in python code:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="n">weights</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">normal</span><span class="p">(</span><span class="n">size</span> <span class="o">=</span> <span class="n">n_features</span><span class="p">).</span><span class="n">reshape</span><span class="p">(</span><span class="n">n_features</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span>
<span class="n">bias</span> <span class="o">=</span> <span class="mi">0</span>
<span class="k">def</span> <span class="nf">softmax</span><span class="p">(</span><span class="n">x</span><span class="p">):</span>
<span class="k">return</span> <span class="n">np</span><span class="p">.</span><span class="n">exp</span><span class="p">(</span><span class="n">x</span><span class="p">)</span> <span class="o">/</span> <span class="n">np</span><span class="p">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">exp</span><span class="p">(</span><span class="n">x</span><span class="p">),</span> <span class="n">axis</span> <span class="o">=</span> <span class="mi">1</span><span class="p">).</span><span class="n">reshape</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span><span class="mi">1</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">softmax_regression_inference</span><span class="p">(</span><span class="n">x</span><span class="p">):</span>
<span class="k">return</span> <span class="n">softmax</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">matmul</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">weights</span><span class="p">)</span> <span class="o">+</span> <span class="bp">self</span><span class="p">.</span><span class="n">bias</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">calculate_error</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">):</span>
<span class="c1">### Categorical Cross-Entropy
</span> <span class="n">y_hat</span> <span class="o">=</span> <span class="n">softmax_regression_inference</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
<span class="k">return</span> <span class="o">-</span><span class="n">np</span><span class="p">.</span><span class="n">mean</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">y</span> <span class="o">*</span> <span class="n">np</span><span class="p">.</span><span class="n">log</span><span class="p">(</span><span class="n">yhat</span><span class="p">),</span> <span class="n">axis</span> <span class="o">=</span> <span class="mi">1</span><span class="p">))</span>
</code></pre></div></div>
<p><strong>NOTE</strong>: With softmax regression, we also typically add a small constant inside \(\log\) for the same reason as logistic regression.</p>
<p>For the last time, we will restate the partial derivative using the chain rule. Because the loss depends on \(z_j\) through every output \(\hat{y_i}\), we now sum over the output components:</p>
\[\frac{\partial \mathcal{L}}{\partial z_j} = \sum_i \frac{\partial \mathcal{L}}{\partial \hat{y_i}} \frac{\partial \hat{y_i}}{\partial z_j}\]
<p>Let’s calculate the first partial derivative of the loss with respect to our prediction:</p>
\[\frac{\partial \mathcal{L}}{\partial \hat{y_i}} = -\frac{y_i}{\hat{y_i}}\]
<p>That was pretty easy! Now let’s tackle the monster… the partial derivative of our prediction with respect to the linear equation:</p>
\[\frac{\partial \hat{y_i}}{\partial z_j} = \frac{\sum_n e^{z_n} \frac{\partial}{\partial z_j}[e^{z_i}] - e^{z_i} \frac{\partial}{\partial z_j}\left[\sum_n e^{z_n}\right]}{\left(\sum_n e^{z_n}\right)^2}\]
<p>It is important to realize that we need to break this down into two parts. The first is when \(i = j\) and the second is when \(i \neq j\).</p>
<p>if \(i = j\):</p>
\[\begin{align*}
\frac{\partial \hat{y_i}}{\partial z_j} &= \frac{e^{z_j}\sum_n e^{z_n} - e^{z_i}e^{z_j}}{\left(\sum_n e^{z_n}\right)^2} \\[0.75ex]
&= \frac{e^{z_i}\sum_n e^{z_n}}{\left(\sum_n e^{z_n}\right)^2} - \frac{e^{z_i}e^{z_j}}{\left(\sum_n e^{z_n}\right)^2} \\[0.75ex]
&= \frac{e^{z_i}}{\sum_n e^{z_n}} - \frac{e^{z_i}e^{z_j}}{\left(\sum_n e^{z_n}\right)^2} \\[0.75ex]
&= \frac{e^{z_i}}{\sum_n e^{z_n}} - \frac{e^{z_i}}{\sum_n e^{z_n}} \frac{e^{z_j}}{\sum_n e^{z_n}} \\[0.75ex]
&= \frac{e^{z_i}}{\sum_n e^{z_n}} \left(1 - \frac{e^{z_j}}{\sum_n e^{z_n}}\right) \\[0.75ex]
&= \hat{y_i}(1 - \hat{y_j})
\end{align*}\]
<p>if \(i \neq j\):</p>
\[\begin{align*}
\frac{\partial \hat{y_i}}{\partial z_j} &= \frac{0 - e^{z_i}e^{z_j}}{\left(\sum_n e^{z_n}\right)^2} \\[0.75ex]
&= - \frac{e^{z_i}}{\sum_n e^{z_n}} \frac{e^{z_j}}{\sum_n e^{z_n}} \\[0.75ex]
&= - \hat{y_i}\hat{y_j}
\end{align*}\]
<p>We can therefore combine them as follows:</p>
\[\frac{\partial \mathcal{L}}{\partial z_j} = - \hat{y_i}(1 - \hat{y_j})\frac{y_i}{\hat{y_i}} - \sum_{i \neq j} \frac{y_i}{\hat{y_i}}(-\hat{y}_i\hat{y_j})\]
<p>The left side of the equation is where \(i = j\), while the right side is where \(i \neq j\). You will notice that we can cancel out a few terms, so the equation now becomes:</p>
\[\frac{\partial \mathcal{L}}{\partial z_j} = - y_i(1 - \hat{y_j}) + \sum_{i \neq j} y_i\hat{y_j}\]
<p>These next few steps trip some people out, so pay close attention. The first thing we’re going to do is change the subscript on the left side from \(y_i\) to \(y_j\) since \(i = j\) for that part of the equation:</p>
\[\frac{\partial \mathcal{L}}{\partial z_j} = - y_j(1 - \hat{y_j}) + \sum_{i \neq j} y_i\hat{y_j}\]
<p>Next, we are going to multiply out the left side of the equation to get:</p>
\[\frac{\partial \mathcal{L}}{\partial z_j} = - y_j + y_j\hat{y_j} + \sum_{i \neq j} y_i\hat{y_j}\]
<p>We will then factor out \(\hat{y_j}\) to get:</p>
\[\frac{\partial \mathcal{L}}{\partial z_j} = - y_j + \hat{y_j}\left(y_j + \sum_{i \neq j} y_i\right)\]
<p>This is where the magic happens. The term inside the brackets is simply \(y\) summed over all of its components: \(y_j\) covers the \(i = j\) case and the summation covers every \(i \neq j\). Since \(y\) is a one-hot encoded vector, that sum equals one:</p>
\[y_j + \sum_{i \neq j} y_i = \sum_i y_i = 1\]
<p>So our final partial derivative equals:</p>
\[\frac{\partial \mathcal{L}}{\partial z_j} = \hat{y_j} - y_j \quad \Longrightarrow \quad \frac{\partial \mathcal{L}}{\partial z} = \hat{y} - y\]
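<p>And once more, a quick numerical check of the softmax result (a toy sketch with arbitrary logits and a one-hot label):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

# Toy check that dL/dz_j = y_hat_j - y_j for softmax + categorical cross-entropy
def softmax(z_):
    exps = np.exp(z_ - np.max(z_))  # subtract the max for numerical stability
    return exps / np.sum(exps)

z = np.array([1.0, -0.5, 2.0])  # arbitrary logits
y = np.array([0.0, 1.0, 0.0])   # one-hot label
cross_entropy = lambda z_: -np.sum(y * np.log(softmax(z_)))

eps = 1e-6
numerical_grad = np.array([
    (cross_entropy(z + eps * np.eye(3)[j]) - cross_entropy(z - eps * np.eye(3)[j])) / (2 * eps)
    for j in range(3)
])
analytical_grad = softmax(z) - y

print(numerical_grad)   # matches softmax(z) - y component by component
print(analytical_grad)
</code></pre></div></div>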
<h2 id="partial-derivative-to-update-parameters">Partial Derivative to Update Parameters</h2>
<p>As you may have noticed, the equation for \(z\) is the same for all of the models mentioned above. This means that the derivative for the parameter updates will also be the exact same, since the only other step is to chain together \(\frac{\partial \mathcal{L}}{\partial z}\) and \(\frac{\partial z}{\partial w}\):</p>
\[\frac{\partial \mathcal{L}}{\partial w} = \frac{\partial \mathcal{L}}{\partial z} \frac{\partial z}{\partial w}\]
<p>Since \(\frac{\partial z}{\partial w} = X\), to get the partial derivative of the loss with respect to the weights, we simply take the matrix product of the transpose of the input with \((\hat{y} - y)\). We transpose \(X\) to make the shapes line up nicely for matrix multiplication. Thus, we get:</p>
\[\frac{\partial \mathcal{L}}{\partial w} = X^\boldsymbol{\top}(\hat{y} - y)\]
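<p>Putting it all together, a single gradient descent step looks roughly like this for all three models (a sketch that assumes X, y, y_hat, weights, and bias are already defined as in the snippets above):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># A single (full-batch) gradient descent step, shared by all three models
learning_rate = 0.01

y_hat = linear_regression_inference(X)   # or the logistic / softmax version
grad_w = np.matmul(X.T, y_hat - y)       # dL/dw = X^T (y_hat - y)
grad_b = np.sum(y_hat - y, axis = 0)     # dL/db is the column sum of (y_hat - y)

weights = weights - learning_rate * grad_w
bias = bias - learning_rate * grad_b
</code></pre></div></div>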
<h2 id="concluding-remarks">Concluding Remarks</h2>
<p>After reading this post it might be tempting to say that you can use Mean Squared Error (MSE) for logistic regression since the derivatives for linear and logistic regression are the same. However, this is incorrect. It is important to realize that the derivative only works out to be the same because there is no activation function for linear regression. If you now have a sigmoid activation function in the output, then \(\frac{\partial \mathcal{L}}{\partial z} \neq (\hat{y} - y)\) for \(\mathcal{L}_{MSE}\).</p>
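<p>To see why, we can chain together the two derivatives we already computed: the MSE term contributes \((\hat{y} - y)\) and the sigmoid contributes \(\hat{y}(1 - \hat{y})\), giving</p>
\[\frac{\partial \mathcal{L}_{MSE}}{\partial z} = (\hat{y} - y)\,\hat{y}(1 - \hat{y})\]
<p>The extra \(\hat{y}(1 - \hat{y})\) factor shrinks toward zero whenever the sigmoid saturates near \(0\) or \(1\), even if the prediction is badly wrong, which is one reason cross-entropy is the standard pairing for a sigmoid output.</p>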
<p>I hope you enjoyed learning about the math behind some supervised learning loss functions! In the future I might make another blog post about loss functions, except with less math and more visuals.</p>
<div id="disqus_thread"></div>
<script>
/**
* RECOMMENDED CONFIGURATION VARIABLES: EDIT AND UNCOMMENT THE SECTION BELOW TO INSERT DYNAMIC VALUES FROM YOUR PLATFORM OR CMS.
* LEARN WHY DEFINING THESE VARIABLES IS IMPORTANT: https://disqus.com/admin/universalcode/#configuration-variables*/
/*
var disqus_config = function () {
this.page.url = https://brandinho.github.io; // Replace PAGE_URL with your page's canonical URL variable
this.page.identifier = /cost-function-gradients; // Replace PAGE_IDENTIFIER with your page's unique identifier variable
};
*/
(function() { // DON'T EDIT BELOW THIS LINE
var d = document, s = d.createElement('script');
s.src = 'https://brandinho-github-io.disqus.com/embed.js';
s.setAttribute('data-timestamp', +new Date());
(d.head || d.body).appendChild(s);
})();
</script>
<p><em>Tags: Gradient Descent</em></p>
<h1 id="APPO">Accelerated Proximal Policy Optimization</h1>
<p><em>Posted 2018-12-29 at <a href="https://brandinho.github.io/APPO">https://brandinho.github.io/APPO</a></em></p>
<h2 id="overview">Overview</h2>
<p>This post is going to be a little different than the other ones that I’ve made (and probably quite different than most blog posts out there) because I’m not going to be showcasing a finished algorithm. Rather, I’m going to show some of the progress I’ve made in developing a new algorithm that builds off of <a href="https://arxiv.org/pdf/1707.06347.pdf">Proximal Policy Optimization (PPO)</a> and <a href="http://proceedings.mlr.press/v28/sutskever13.pdf">Nesterov’s Accelerated Gradient (NAG)</a>. The new algorithm is called Accelerated Proximal Policy Optimization (APPO). The reason I’m making a post about an incomplete algorithm is so other researchers can help <strong>accelerate</strong> its development. I only ask that you cite this blog post if you use this algorithm in a research paper.</p>
<h2 id="nesterovs-accelerated-gradient">Nesterov’s Accelerated Gradient</h2>
<p>We already know how PPO works from my <a href="https://brandinho.github.io/mario-ppo/">previous blog post</a>, so now the only background information we need is NAG. In this post I will not be explaining how gradient descent works, so for those who are not familiar with gradient descent and want a comprehensive explanation, I highly recommend <a href="http://ruder.io/optimizing-gradient-descent/">Sebastian Ruder’s post</a>. I actually used that post to first learn gradient descent a couple years ago.</p>
<p>Below is the update rule for vanilla gradient descent. We have our parameters (weights), \(\theta\), which we update with our gradients \(\nabla_{\theta}J(\theta)\). If you are not familiar with this notation, the \(\nabla_\theta\) refers to a vector of partial derivatives with respect to our parameters (also called the gradient vector). \(J(\theta)\) represents the cost function given our parameters, while \(\eta\) represents our learning rate.</p>
\[\theta = \theta - \eta \nabla_{\theta}J(\theta)\]
<p>The problem with vanilla gradient descent however, is that progress is quite slow during training (shown on the left side of the image below). You will notice a large amount of oscillations across the error surface. To prevent overshooting, we use a small learning rate, which ultimately makes training slow. To help solve this problem, we use the momentum algorithm (shown on the right side of the image below), which is basically just an exponentially weighted average of the gradients.</p>
<p style="text-align: center;"><img src="https://brandinho.github.io/images/momentum.png" alt="alt" /></p>
\[v_t = \gamma v_{t-1} + \eta \nabla_{\theta}J(\theta)\]
\[\theta = \theta - v_t\]
<p>The two new terms introduced in the equations above are \(\gamma\), which is the decay factor, and \(v_t\), which is the exponentially weighted average for the current update. Let’s consider two scenarios to see why momentum helps move our parameters in the right direction faster, while also dampening oscillations. In scenario 1, imagine our previous update \(v_{t-1}\) was a positive number for one of our parameters. Now imagine the current partial derivative for that parameter is also positive. With the momentum update rule, we will be accelerating our parameter update in that direction by adding \(v_{t-1}\) to the already positive partial derivative. The same logic works for negative partial derivatives if \(v_{t-1}\) is negative. In scenario 2, imagine if \(v_{t-1}\) and \(\nabla_{\theta}J(\theta)\) had opposite directions (i.e. one is positive and the other is negative). In that case they will somewhat cancel each other out, which ultimately makes the gradient update smaller (dampening the oscillation).</p>
<p>While this sounds great, there is one pretty obvious flaw in its design - what happens when we have a large momentum term \(v_{t-1}\) and we have reached a local minimum (i.e. the current gradient is \(\sim 0\))? Well, if we use the momentum update rule, then we will overshoot the local minimum because we have to add \(\gamma v_{t-1}\) to the current gradient. To prevent this from happening, we can anticipate the effect of \(\gamma v_{t-1}\) on our parameters and calculate that gradient vector to come up with \(v_t\). So if we used our previous example and assume \(v_{t-1}\) had a large positive value for most of the parameters, then after anticipating what our parameters will be, the gradient vector will consist of mostly negative numbers since we overshot the local minimum. Now when we add the two together, they cancel each other out (for the most part). This is known as Nesterov’s Accelerated Gradient and is shown below:</p>
\[v_t = \gamma v_{t-1} + \eta \nabla_{\theta}J(\theta - \gamma v_{t-1})\]
\[\theta = \theta - v_t\]
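<p>Here is a minimal numpy sketch of the three update rules side by side (assuming a generic gradient function that returns \(\nabla_{\theta}J(\theta)\)):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

def sgd_step(theta, gradient, eta = 0.01):
    # Vanilla gradient descent
    return theta - eta * gradient(theta)

def momentum_step(theta, v, gradient, eta = 0.01, gamma = 0.9):
    # Exponentially weighted average of the gradients
    v = gamma * v + eta * gradient(theta)
    return theta - v, v

def nesterov_step(theta, v, gradient, eta = 0.01, gamma = 0.9):
    # Evaluate the gradient at the anticipated ("looked ahead") parameters
    v = gamma * v + eta * gradient(theta - gamma * v)
    return theta - v, v

# Example on J(theta) = theta^2, whose gradient is 2 * theta
grad = lambda t: 2 * t
theta, v = np.array([5.0]), np.array([0.0])
for _ in range(100):
    theta, v = nesterov_step(theta, v, grad)
print(theta)  # close to the minimum at 0
</code></pre></div></div>
<p>The only difference between momentum and NAG is where the gradient gets evaluated: at the current parameters, or at the anticipated parameters \(\theta - \gamma v_{t-1}\).</p>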
<p>You might be wondering how the concept behind NAG can be used to improve PPO. To understand this, we first need to understand some of the drawbacks of PPO.</p>
<h2 id="areas-of-improvement-for-ppo">Areas of Improvement for PPO</h2>
<p>Even though I think PPO is an awesome algorithm, upon examining it more closely I noticed a few things that I would like to improve. I want to change the following aspects of PPO:</p>
<ol>
<li>It is a reactive algorithm and I want to make it a proactive algorithm</li>
<li>Using a ratio to measure divergence from the old policy handicaps updates for low probability actions</li>
</ol>
<p>Before diving into these points, I want to define a word I will be using going forward: <strong>training round</strong> is defined as the series of updates to our policy network after collecting a certain number of experiences. For example, we can define one training round to be 4 epochs after playing two episodes of a game.</p>
<h3 id="1-reactive-vs-proactive">1) Reactive vs Proactive</h3>
<p>Clipping only takes effect after the policy gets pushed outside of the range (it also depends on the sign of advantage). As such, it is a reactive algorithm because it only restricts movement once the policy has already moved outside of the specified bounds. This means that PPO does not ensure the new policy is proximal to the old policy because \(r_t(\theta)\) can easily move well below \(1 - \epsilon\) or above \(1 + \epsilon\). As we will see later in this post, by using the accelerated gradient concept behind NAG, we can design a proactive algorithm which anticipates how much the policy will move. We can then use the anticipated move to better control our policy and keep it roughly within the bounds.</p>
<h3 id="2-ratio-vs-absolute-difference">2) Ratio vs Absolute Difference</h3>
<p>The bounds that we set for the policy in PPO are based on a ratio, which I do not like. The denominator \(\pi_{\theta_\text{old}}\) matters a lot because if it is too small, then learning is severely impaired. Let’s use an example to show you what I mean. Imagine two scenarios:</p>
<ol>
<li>Low probability action: \(\pi_{\theta_\text{old}}(s_t, a_t) = 2\%\)</li>
<li>High probability action: \(\pi_{\theta_\text{old}}(s_t, a_t) = 70\%\)</li>
</ol>
<p>Now let’s assume that \(\epsilon = 0.2\). This means that we restrict our new policy to be \(1.6\% \leq \pi_\theta \leq 2.4\%\) for the first scenario and \(56\% \leq \pi_\theta \leq 84\%\) for the second scenario. Wait a second… under the first scenario the policy can only move within a range that is \(0.8\%\) wide, while in the second scenario the policy can move within a range that is \(28\%\) wide. If the low probability action should actually have a high probability, then it will take forever to get it to where it should be. However, if we use an absolute difference, then the range in which the new policy can move is the exact same regardless of how small or large the probability of taking an action under our old policy is.</p>
<p><strong>NOTE</strong> - There is the obvious exception when the probability of taking an action is near \(0\%\) or \(100\%\). In those cases the lower and upper bound on the new policy is bounded at \(0\%\) and \(100\%\) respectively. However, I don’t consider this a drawback because it is the same when using the ratio.</p>
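<p>The asymmetry is easy to see with a quick back-of-the-envelope calculation (a toy sketch using the numbers from the two scenarios above):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>epsilon = 0.2

for pi_old in [0.02, 0.70]:
    # PPO's ratio-based bounds on the new policy
    ratio_low, ratio_high = pi_old * (1 - epsilon), pi_old * (1 + epsilon)
    # Absolute-difference bounds, clipped to valid probabilities
    abs_low, abs_high = max(pi_old - epsilon, 0.0), min(pi_old + epsilon, 1.0)
    print(pi_old, ratio_high - ratio_low, abs_high - abs_low)

# 0.02: ratio range is 0.008 wide, absolute range is 0.22 wide
# 0.70: ratio range is 0.28 wide, absolute range is 0.40 wide
</code></pre></div></div>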
<h2 id="initial-attempt-to-improve-ppo">Initial Attempt to Improve PPO</h2>
<p>Let’s deal with the ratio first. In order to get rid of the denominator problem, we define \(\hat{r}_t(\theta)\) as the absolute difference between the new policy and the old policy:</p>
\[\hat{r}_t(\theta) = \left|\pi_\theta(a_t \mid s_t) - \pi_{\theta_\text{old}}(a_t \mid s_t)\right|\]
<p>Next, we need to make the algorithm proactive instead of reactive. To do this, we create an additional neural network that is responsible for telling us how much the policy would change if we implement a policy gradient update. We will denote its parameters as \(\theta_\text{pred}\). At the start of each mini-batch, we reset \(\theta_\text{pred}\) to be equal to \(\theta\). As a result, we can define:</p>
\[\hat{r}_t(\theta_\text{pred}) = \left|\pi_{\theta_\text{pred}}(a_t \mid s_t) - \pi_{\theta_\text{old}}(a_t \mid s_t)\right|\]
<p>Now we can see exactly how much our policy will change if we apply a policy gradient update. In order to constrain the amount our policy can change to be within \(\pi_{\theta_\text{old}}(a_t \mid s_t) \pm \epsilon\), we can calculate the following <strong>shrinkage factor</strong>:</p>
\[shrinkage = \frac{\epsilon}{\max(\hat{r}_t(\theta_\text{pred}), \epsilon)}\]
<p>I thought that applying this shrinkage factor to the gradients when updating \(\pi_\theta\) would constrain \(\hat{r}_t(\theta) \leq \epsilon\). Boy was I wrong. I was making a linear extrapolation on a function approximation that is non-linear… It was clear that I needed a better way to ensure our policy stays within the specified range per training round.</p>
<h2 id="current-state-of-appo">Current state of APPO</h2>
<p>Okay so the shrinkage factor didn’t work, what else can we do? Let’s take a page out of the supervised learning book! After calculating \(\pi_{\theta_\text{pred}}\), we can see if it moves outside of the bounds \(\pi_{\theta_\text{old}}(a_t \mid s_t) \pm \epsilon\). If so, then we can use Mean Squared Error (MSE) as the loss function for those samples and move \(\pi_\theta\) towards the bound that it crossed. On the other hand, if it is within the range, then you can update \(\pi_\theta\) with the regular policy gradient method.</p>
<p>There is one important nuance that we should keep in mind. If we consider a neural network update in isolation, then the method above works great. However, given that we train on multiple mini-batches afterwards, the subsequent changes in the neural network weights can easily push our policy well beyond our specified range. To prevent this, I found that increasing the number of epochs during training, while also shuffling samples between each epoch, significantly reduces the probability of this occurring. I use 10 epochs, but you can probably get away with a smaller number. Empirically, this method has been shown to constrain the new policy to be within the specified bound, with an occasional small deviation outside of the range after each training round.</p>
<p>You will notice that by using the method above, an if statement splits the mini-batch into two smaller batches: one to be trained with MSE and the other to be trained with a policy gradient loss. If you don’t want to split up your mini-batch with an if statement during training, then you can update the whole mini-batch with the following loss function:</p>
\[\mathcal{L} = \frac{1}{n}\sum^n_{i=1}\left(\pi^{(i)}_{\theta} - \text{clip}(\pi^{(i)}_{\theta_\text{pred}}, \pi^{(i)}_{\theta_\text{old}} - \epsilon, \pi^{(i)}_{\theta_\text{old}} + \epsilon)\right)^2\]
<p>This is more computationally expensive because you no longer train a portion of the mini-batch using the policy gradient method (which requires fewer epochs than the MSE portion). Nonetheless, it is still an option for those who don’t like breaking up the loss function with an if statement.</p>
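<p>For reference, here is a minimal numpy sketch of that single loss (assuming pi_theta, pi_pred, and pi_old are arrays holding the probabilities of the sampled actions for the mini-batch):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

def appo_mse_loss(pi_theta, pi_pred, pi_old, epsilon = 0.2):
    # Clip the predicted post-update policy to the allowed band around the old policy,
    # then pull the current policy toward that clipped target with a squared error.
    # In a deep learning framework the target would be treated as a constant
    # (e.g. wrapped in a stop-gradient) so gradients only flow through pi_theta.
    target = np.clip(pi_pred, pi_old - epsilon, pi_old + epsilon)
    return np.mean((pi_theta - target)**2)
</code></pre></div></div>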
<h2 id="concluding-remarks">Concluding Remarks</h2>
<p>Results currently look promising, but I don’t think the algorithm is complete. I will continue to work on it and I welcome any feedback!</p>
<div id="disqus_thread"></div>
<script>
/**
* RECOMMENDED CONFIGURATION VARIABLES: EDIT AND UNCOMMENT THE SECTION BELOW TO INSERT DYNAMIC VALUES FROM YOUR PLATFORM OR CMS.
* LEARN WHY DEFINING THESE VARIABLES IS IMPORTANT: https://disqus.com/admin/universalcode/#configuration-variables*/
/*
var disqus_config = function () {
this.page.url = https://brandinho.github.io; // Replace PAGE_URL with your page's canonical URL variable
this.page.identifier = /APPO; // Replace PAGE_IDENTIFIER with your page's unique identifier variable
};
*/
(function() { // DON'T EDIT BELOW THIS LINE
var d = document, s = d.createElement('script');
s.src = 'https://brandinho-github-io.disqus.com/embed.js';
s.setAttribute('data-timestamp', +new Date());
(d.head || d.body).appendChild(s);
})();
</script>
<p><em>Tags: Reinforcement Learning, Neural Networks, Policy Gradient</em></p>
<h1 id="mario-ppo">Playing Super Mario Bros with Proximal Policy Optimization</h1>
<p><em>Posted 2018-12-02 at <a href="https://brandinho.github.io/mario-ppo">https://brandinho.github.io/mario-ppo</a></em></p>
<h2 id="overview">Overview</h2>
<p>In this post, our AI agent will learn how to play Super Mario Bros by using Proximal Policy Optimization (PPO). We want our agent to learn how to play by only observing the raw game pixels so we use convolutional layers early in the network, followed by dense layers to get our policy and state-value output. The architecture of our model is shown below.</p>
<p style="text-align: center;"><img src="https://brandinho.github.io/images/mario-model-architecture.png" alt="alt" /></p>
<p>To find the code, please follow this <a href="https://github.com/brandinho/Super-Mario-Bros-PPO">link</a>.</p>
<p>Throughout this post, I’m going to explain each of the model’s components. First we start with the convolutional layers.</p>
<h2 id="convolutional-neural-network">Convolutional Neural Network</h2>
<p>Convolutional neural networks (CNNs) are widely used in image recognition, and have achieved very impressive results to date. They have their own set of issues, such as the inability to take important spatial hierarchies into account, which <a href="https://arxiv.org/pdf/1710.09829.pdf">capsule networks</a> attempt to address. However, we don’t think that this significantly impacts an agent’s ability to play a video game from raw pixels so convolutional layers will be just fine for our algorithm.</p>
<p>Unfortunately I’m not going to fully explain CNNs because that would take a whole post on its own. Rather, I’m going to explain some of the most important concepts for our model. If you want a more detailed explanation, I highly recommend <a href="https://colah.github.io">Christopher Olah’s blog</a> - all his posts are incredible. Also <a href="https://www.coursera.org/learn/convolutional-neural-networks">Andrew Ng’s course</a> is awesome!</p>
<p>The first thing to understand is that every image is comprised of pixels, and every pixel is represented as a numerical value (or combination of values). The images from the game screen use the RGB color model, which means that for each pixel in the picture, there are going to be 3 numbers associated with it. The numbers correspond to how much red, green, and blue light to add to the image. An example of the RGB codes for the Mario picture are shown below:</p>
<p style="text-align: center;"><img src="https://brandinho.github.io/images/mario-color-codes.png" alt="alt" /></p>
<p>Okay great, now what do we do with these pixel values? Convolutional layers are a great way to deal with raw pixel inputs into a neural network. Each convolutional layer consists of multiple filters, which extract important information about an image. You can think of each convolutional layer as a building block for the next. For example, the first layer can put together the pixels to form edges, the second layer can put together the edges to form shapes, the third layer can put the shapes together to form objects, etc.</p>
<p>The filters work by performing an operation called convolution, shown in the image below. The operation works by taking the sum of the element-wise product between a portion of the image and the filter (also called a kernel). It focuses on a portion of the image because we need the two matrices to be the same size. In our example, we perform convolution on the bottom-right portion of the image. The filter shown below is specifically designed to detect vertical edges in an image. However, in practice we don’t preset the filter weights to perform a specific task - instead the neural network will learn the weights that it deems the best with backpropagation.</p>
<p style="text-align: center;"><img src="https://brandinho.github.io/images/convolution-operation.png" alt="alt" /></p>
<p>I know I said that the operation being performed above is convolution, but that is not completely true… We’re technically performing cross-correlation, but everyone refers to this operation in the neural network context as convolution. Let me explain why. To actually perform convolution, you need to either flip the source image or the kernel. The reason why we don’t do this for CNNs is because it adds unnecessary complexity. Why is it unnecessary? Because the neural network learns the weights for the kernel anyways, so if you needed to flip the kernel, the CNN will automatically learn the flipped kernel weights, making the actual flipping pointless. Since flipping does not make a difference, cross-correlation is equivalent to convolution in this context.</p>
<p>As mentioned before, the kernel is applied to a portion of the image, so we have to slide the kernel over the whole image to account for all the portions. Below we show an example of the filter in action! We used some different numbers - they don’t actually mean anything, I just made them up:</p>
<p style="text-align: center;"><img src="/images/ConvNet.gif" alt="Alt Text" /></p>
<p>The last concept that I want to introduce for CNNs is stride. The stride determines how many pixels the filter jumps over between convolution operations. For example, in our animation above, the stride was 1 because it moved one pixel at a time. But if we specify a stride of 2, then it will move two pixels at a time (skipping over one pixel). The larger the stride, the smaller the output from the convolutional layer. Below we show what a stride of 2 looks like for the same input and kernel:</p>
<p style="text-align: center;"><img src="/images/ConvNet2.gif" alt="Alt Text" /></p>
<p>Now that we understand how the neural network is able to deal with pixelated inputs, we will move onto the feed-forward (dense) portion of our model - it splits into a value estimation stream and a policy stream. Below we show the implementation of convolutional layers followed by a flattening layer in TensorFlow:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="c1"># Convolutional Layers
</span> <span class="n">conv1</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">layers</span><span class="p">.</span><span class="n">conv2d</span><span class="p">(</span><span class="n">inputs</span> <span class="o">=</span> <span class="n">inputs</span><span class="p">,</span> <span class="n">filters</span> <span class="o">=</span> <span class="n">n_filters</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">kernel_size</span> <span class="o">=</span> <span class="n">kernel_size</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span>
<span class="n">strides</span> <span class="o">=</span> <span class="p">[</span><span class="n">n_strides</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">n_strides</span><span class="p">[</span><span class="mi">0</span><span class="p">]],</span> <span class="n">padding</span> <span class="o">=</span> <span class="s">"valid"</span><span class="p">,</span> <span class="n">activation</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">elu</span><span class="p">,</span> <span class="n">trainable</span> <span class="o">=</span> <span class="n">trainable</span><span class="p">)</span>
<span class="n">conv2</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">layers</span><span class="p">.</span><span class="n">conv2d</span><span class="p">(</span><span class="n">inputs</span> <span class="o">=</span> <span class="n">conv1</span><span class="p">,</span> <span class="n">filters</span> <span class="o">=</span> <span class="n">n_filters</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="n">kernel_size</span> <span class="o">=</span> <span class="n">kernel_size</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span>
<span class="n">strides</span> <span class="o">=</span> <span class="p">[</span><span class="n">n_strides</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="n">n_strides</span><span class="p">[</span><span class="mi">1</span><span class="p">]],</span> <span class="n">padding</span> <span class="o">=</span> <span class="s">"valid"</span><span class="p">,</span> <span class="n">activation</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">elu</span><span class="p">,</span> <span class="n">trainable</span> <span class="o">=</span> <span class="n">trainable</span><span class="p">)</span>
<span class="c1"># Flatten the last Convolutional Layer
</span> <span class="n">first_dimension</span> <span class="o">=</span> <span class="nb">round</span><span class="p">((((</span><span class="n">image_height</span> <span class="o">-</span> <span class="n">kernel_size</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)</span> <span class="o">/</span> <span class="n">n_strides</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span> <span class="o">-</span> <span class="n">kernel_size</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)</span> <span class="o">/</span> <span class="n">n_strides</span><span class="p">[</span><span class="mi">1</span><span class="p">])</span>
<span class="n">second_dimension</span> <span class="o">=</span> <span class="nb">round</span><span class="p">((((</span><span class="n">image_width</span> <span class="o">-</span> <span class="n">kernel_size</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)</span> <span class="o">/</span> <span class="n">n_strides</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span> <span class="o">-</span> <span class="n">kernel_size</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)</span> <span class="o">/</span> <span class="n">n_strides</span><span class="p">[</span><span class="mi">1</span><span class="p">])</span>
<span class="n">dimensionality</span> <span class="o">=</span> <span class="n">first_dimension</span> <span class="o">*</span> <span class="n">second_dimension</span> <span class="o">*</span> <span class="n">n_filters</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span>
<span class="n">conv2_flat</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">reshape</span><span class="p">(</span><span class="n">conv2</span><span class="p">,</span> <span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="n">dimensionality</span><span class="p">])</span>
</code></pre></div></div>
<h2 id="the-value-function">The Value Function</h2>
<p>In reinforcement learning, we often care about value functions - specifically, the state-value function \(V(s)\) and the action-value function \(Q(s,a)\). Before diving into some math, I want to explain these concepts intuitively. \(V(s)\) tells us how good it is to be in a particular state. In Super Mario Bros, the goal is to go all the way to the right side of the map, as fast as possible. Thus, we get a positive (negative) reward if we move to the right (left), while getting a negative reward every time the clock ticks. Let’s let M1 and M2 represent two of Mario’s possible positions. If we define \(V(s)\) as the expectation of \(G_t\), which is the cumulative discounted reward from time step \(t\), then we realize that \(V(s_{M2}) > V(s_{M1})\).</p>
<p style="text-align: center;"><img src="https://brandinho.github.io/images/mario-value.png" alt="alt" style="max-width: 300px; height: auto;" /></p>
<p>If my previous statement did not completely make sense, let’s make it a bit more concrete with some math. Let’s let \(R_t\) represent the reward from time step \(t\). We will define the cumulative discounted reward from time step \(t\) as:</p>
\[G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \ldots = \sum_{k=0}^{\infty} \gamma^kR_{t+k+1}\]
<p>where \(\gamma\) is a discount factor that we apply to future rewards. This is a math trick that makes an infinite sum finite since \(0 \leq \gamma \leq 1\). Although technically if \(\gamma = 1\) then the sum is still infinite because all future rewards have an equal weight. However, we generally use \(\gamma < 1\).</p>
<p>Now that we know how \(G_t\) is defined mathematically, let’s revisit our previous statement: \(V(s_{M2}) > V(s_{M1})\). The farther Mario is from the right, the longer it takes to get to the end of the map. If it takes longer to get to the end of the map, then we have to add up more negative rewards to our cumulative sum (since we get a negative reward every time the clock ticks). Thus, it makes sense that \(V(s_{M2}) > V(s_{M1})\).</p>
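<p>For a finite episode, \(G_t\) is easy to compute by working backwards through the rewards. Here is a small sketch with a made-up reward sequence:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

def discounted_returns(rewards, gamma = 0.99):
    # G_t = R_{t+1} + gamma * G_{t+1}, accumulated from the end of the episode backwards
    returns = np.zeros(len(rewards))
    running_sum = 0.0
    for t in reversed(range(len(rewards))):
        running_sum = rewards[t] + gamma * running_sum
        returns[t] = running_sum
    return returns

# e.g. small negative rewards while the clock ticks, a positive reward at the end
print(discounted_returns([-0.1, -0.1, -0.1, 1.0]))
</code></pre></div></div>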
<p>Great, now that we have an intuition into how the state-value function works, let’s do some algebra to get a very important equation in reinforcement learning:</p>
\[\begin{align*}
V(s) &= \mathbb{E}[\, G_t \, | \, S_t=s \,] \\
&= \mathbb{E}[\, R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \ldots \, | \, S_t=s \,] \\
&= \mathbb{E}[\, R_{t+1} + \gamma (R_{t+2} + \gamma R_{t+3} + \ldots) \, | \, S_t=s \,] \\
&= \mathbb{E}[\, R_{t+1} + \gamma G_{t+1} \, | \, S_t=s \,] \\
&= \mathbb{E}[\, R_{t+1} + \gamma V(S_{t+1}) \, | \, S_t=s \,]
\end{align*}\]
<p>The equation that we end up with is known as the Bellman equation. If we think about it, it’s actually quite intuitive: the value for being in a particular state is equal to the expected reward we will receive from that state plus the discounted expected value of being in the next state. Let’s break this down a bit more. If the value for being in a state is equal to the sum of discounted future rewards, then \(V(s_{t+1})\) is the sum of discounted rewards after \(t+1\). So if we add \(R_{t+1}\) to \(\gamma V(s_{t+1})\), then we get the sum of discounted rewards after \(t\), which is \(V(s)\).</p>
<p>Alternatively, we can write the Bellman equation as,</p>
\[V(s)=\mathcal{R}_s + \gamma \sum_{s^\prime \in \mathcal{S}} \mathcal{P}_{ss^\prime} V(s^\prime)\]
<p>where \(\mathcal{P}_{ss^\prime}\) refers to the probability transition matrix (i.e. the probability of moving from \(s\) to \(s^{\prime}\) for all \(\mathcal{S}\)).</p>
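<p>Because this form of the Bellman equation is linear in \(V\), we can solve it exactly for a small environment with a known transition matrix. Here is a toy sketch with made-up rewards and transition probabilities:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

# Toy 3-state Markov reward process (all numbers are made up); the last state is
# terminal, so it loops onto itself with zero reward
R = np.array([-1.0, -1.0, 0.0])    # expected reward in each state
P = np.array([[0.5, 0.5, 0.0],     # P[s, s'] = probability of moving from s to s'
              [0.0, 0.5, 0.5],
              [0.0, 0.0, 1.0]])
gamma = 0.9

# V = R + gamma * P V, which rearranges to (I - gamma * P) V = R
V = np.linalg.solve(np.eye(3) - gamma * P, R)
print(V)
</code></pre></div></div>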
<p>Up until now, we were talking about the state-value function, but what about \(Q(s,a)\)? Most times, people actually care more about \(Q\) than \(V\). The reason is that they want to know how to act in a given state, rather than the value of being in a state. This is exactly what \(Q(s,a)\) helps you determine because it tells you the value for taking a specific action in a given state. Thus, if you calculate the Q-value for all actions you can take (assuming the action space is discrete), then you can choose the action that has the maximum value. The super popular Q-learning algorithm learns the mapping from states to Q-values, so that an agent knows which actions will yield the highest cumulative discounted reward.</p>
<p>Let’s solidify our understanding of state-value and action-value. There is going to be a bit more math in this part, so get ready! First, let’s define a new term: the mapping from states to actions is defined as the policy and is denoted as \(\pi(a \mid s)\). Although policies can be deterministic, we are going to read \(\pi(a \mid s)\) as “the probability of taking an action given the state”. I find that reading equations out loud in plain english helps solidify my understanding, so that’s what I’m going to do for the next few equations.</p>
<p>First we show the value of being in state \(s\) by following policy \(\pi\). It is equal to the sum of Q-values, which correspond to particular actions, multiplied by the probability of taking those actions according to policy \(\pi\).</p>
\[v_{\pi}(s)=\sum_{a \in \mathcal{A}}\pi(a \mid s)q_{\pi}(s,a)\]
<p>Let’s break down \(v_{\pi}(s)\) in english:</p>
<ul>
<li>In any state, there are multiple actions that we can take</li>
<li>We take each action according to a probability distribution</li>
<li>Each action has a different value associated with it</li>
<li>Thus, the value of being in a state is equal to the weighted average of the action-values, in which the weights are the probabilities of taking each action</li>
</ul>
<p>Next we show the value of taking action \(a\) in state \(s\) by following policy \(\pi\). It is equal to the expected reward from taking an action plus the discounted expected value of being in the next state.</p>
\[q_{\pi}(s,a)=\mathcal{R}_s^a + \gamma \sum_{s^\prime \in \mathcal{S}}\mathcal{P}_{ss^\prime}^{a}v_{\pi}(s^\prime)\]
<p>Let’s break down \(q_{\pi}(s,a)\) in english:</p>
<ul>
<li>For any action an agent takes, it receives a reward</li>
<li>When an agent takes an action, it can end up in a different state
<ul>
<li>Imagine if your action was to move to the right - your agent is now in a new state</li>
</ul>
</li>
<li>Sometimes environments have randomness embedded in them
<ul>
<li>Imagine if you try to move to the right, but wind pushes you back and you end up to the left of your original position</li>
</ul>
</li>
<li>Thus, by taking an action in a given state, there is a probability that the agent will end up in various new states</li>
<li>As a result, the value of taking an action in a given state is equal to the immediate reward from taking that action plus the weighted average of state-values for the next state multiplied by a discount factor.
<ul>
<li>The weights are the probabilities of ending up in the next states.</li>
</ul>
</li>
</ul>
<p>The previous two equations shown were half-step lookaheads. To show the full one-step lookaheads, we can plug in the previous equations to obtain the following:</p>
\[v_{\pi}(s)=\sum_{a \in \mathcal{A}}\pi(a \mid s)\left(\mathcal{R}_s^a + \gamma \sum_{s^\prime \in \mathcal{S}}\mathcal{P}_{ss^\prime}^{a}v_{\pi}(s^\prime)\right)\]
\[q_{\pi}(s,a)=\mathcal{R}_s^a + \gamma \sum_{s^\prime \in \mathcal{S}}\mathcal{P}_{ss^\prime}^{a}\sum_{a^\prime \in \mathcal{A}}\pi(a^\prime|s^\prime)q_{\pi}(s^\prime,a^\prime)\]
<p>If you understood the intuition for the first two equations, then you should have no problem with the two equations above - they are simply an extension using the exact same logic.</p>
<h2 id="policy-gradient">Policy Gradient</h2>
<p>What if we want to skip the middle part and just learn a mapping from states to actions without estimating the value of taking an action? We can do this with the policy gradient method, in which we explicitly learn \(\pi\)! Well sort of… we will soon see why we will actually need to incorporate the value function, but until then, let’s walk through a simple implementation of a policy gradient. Let’s consider the loss function:</p>
\[\mathcal{L} = r \times \log \pi(s,a)\]
<p>We want to maximize \(\mathcal{L}\), which is equivalent to minimizing \(-\mathcal{L}\) (we usually perform gradient descent, so minimizing a loss function is the convention). By minimizing \(-\mathcal{L}\), we ensure that we increase the probability of taking an action that gives us a positive reward, and decrease the probability of taking an action that gives us a negative reward. That seems like a good idea, right? Not really… let’s go through an example to understand why. Imagine there are 3 actions that an agent can take with rewards of \([-1,3,20]\) in a particular state. There are two main problems with this approach:</p>
<ol>
<li>Credit Assignment Problem</li>
<li>Multiple “Good” Actions</li>
</ol>
<p>The credit assignment problem refers to the fact that rewards can be temporally delayed. For example, if an agent takes an action in time step \(t\), the reward might come well after \(t+1\). An example in Super Mario Bros is when our agent has to jump over a tube; multiple frames elapse from the time it presses the jump button to the time it actually makes it over the tube. The number of time steps that can possibly elapse between actions and rewards differs for each situation, so how do we solve this problem? Although this is not a perfect solution, we can use value functions, specifically \(q_{\pi}(s,a)\). Since \(q_{\pi}(s,a)\) sums all future discounted rewards from taking action \(a\) and following policy \(\pi\), our agent can take into account rewards that are temporally delayed. Our loss function now becomes:</p>
\[\mathcal{L} = q_{\pi}(s,a) \times \log \pi(s,a)\]
<p>Let’s now assume that \([-1,3,20]\) represents Q-values instead of rewards. We still have an issue because there are multiple actions with a positive expected value. Imagine we sample the second action, which has a positive Q-value. Based on our new policy gradient loss function, the parameter update would increase the probability of taking that action, since \(q_{\pi}(s,a)\) is positive. But what about action 3? It has a much higher Q-value than action 2, so we need a way to tell the model to decrease the probability of selecting action 2 and select action 3 instead. That is what the advantage function helps us do.</p>
<h2 id="the-advantage-function">The Advantage Function</h2>
<p>Rather than looking at how good it is to take an action, advantage tells us how good an action is relative to other actions. This subtlety is important because we want to select actions that are better than average, as opposed to any action that has a positive expected value. To do this, we have to strip out the state-value from the action-value to get a pure estimate of how good an action is. We define advantage as:</p>
\[A(s,a) = Q(s,a) - V(s)\]
<p>If we assume that our policy follows a uniform distribution (equal probability for each action), then \(V(s) = 7.33\), which means that \(A(s,a) = [-8.3,-4.3,12.7]\). Using our new loss function for policy gradients,</p>
\[\mathcal{L} = A_{\pi}(s,a) \times \log \pi(s,a)\]
<p>we see that after selecting action 2, our agent will decrease the probability of selecting that action again in the same state because it has a negative advantage (its value is worse than the average). This is great, it does exactly what we want it to do! However, we don’t know the true advantage function (much like the value functions), so we have to estimate it. Luckily, there are a few ways to do this, but I’m going to focus on one method - using the temporal difference error (\(\delta_{TD}\)) from our value estimation.</p>
<p>Let me back up a little to explain what temporal difference error is. Remember when we saw this somewhat complicated equation earlier:</p>
\[q_{\pi}(s,a)=\mathcal{R}_s^a + \gamma \sum_{s^\prime \in \mathcal{S}}\mathcal{P}_{ss^\prime}^{a}\sum_{a^\prime \in \mathcal{A}}\pi(a^\prime|s^\prime)q_{\pi}(s^\prime,a^\prime)\]
<p>Well it turns out that it will come in handy after all! Just a refresher - the equation above considers all possible paths. But what if we just sample one action from our policy and sample the next state from the environment? Well then it becomes:</p>
\[q_{\pi}(s,a) = r + \gamma v_{\pi}(s^{\prime})\]
<p>Keep this in mind while I explain \(\delta_{TD}\). As the name implies, temporal difference error refers to the difference between the one-step lookahead and the current estimate. We can calculate \(\delta_{TD}\) for either the state-value or action-value, but in this example we’re using the state-value. When we sample, the one-step lookahead equation for state-value becomes \(v_{\pi}(s) = r + \gamma v_{\pi}(s^{\prime})\). You’ll notice that the left side is a pure estimate, while the right side is a mix of estimation and actual data from the environment. This means that the right side contains more information about the environment than the left! By taking the difference between the two we obtain:</p>
\[\delta_{TD} = r + \gamma v_{\pi}(s^{\prime}) - v_{\pi}(s)\]
<p>and by minimizing \(\delta_{TD}^2\), we move our value estimation closer to the actual value function. This is because we are continually moving our estimate closer to a target that contains more data from the actual environment. In addition to using \(\delta_{TD}\) to optimize our value network, it turns out that we can also use it to estimate advantage. Wait, what? How? Let’s bring back \(q_{\pi}(s,a)\):</p>
\[q_{\pi}(s,a) = r + \gamma v_{\pi}(s^{\prime})\]
<p>Recall what advantage is defined as:</p>
\[A = Q - V\]
<p>Now let’s take a look at \(\delta_{TD}\) again:</p>
\[\delta_{TD} = \underbrace{r + \gamma v_{\pi}(s^{\prime})}_{q_{\pi}(s,a)} - v_{\pi}(s)\]
<p>which means that \(\delta_{TD} \approx A\).</p>
<h2 id="generalized-advantage-estimation">Generalized Advantage Estimation</h2>
<p>The <a href="https://arxiv.org/pdf/1506.02438.pdf">paper</a> we are referencing in this section focused on continuous control, but its method can also be used for a discrete action space, like the one we are working with.</p>
<p>We will denote our advantage estimate as \(\hat{A}_t\). Like any other estimate, \(\hat{A}_t\) is subject to bias (although it has low variance). To get an unbiased estimate, we need to get rid of the value estimate completely and sum all future rewards in an episode. This is known as the Monte Carlo return, and it has high variance. As with most things in machine learning, there is a tradeoff - this one is known as the bias-variance tradeoff in reinforcement learning. Generalized Advantage Estimation (GAE) is a great solution that significantly reduces variance while maintaining a tolerable level of bias. It is parameterized by \(\gamma \in [0,1]\) and \(\lambda \in [0,1]\), where \(\gamma\) is the discount factor mentioned earlier in this blog, and \(\lambda\) is the decay parameter used to take an exponentially weighted average of k-step advantage estimators. It is analogous to <a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.132.7760&rep=rep1&type=pdf">Sutton’s TD(\(\lambda\))</a>.</p>
<p>Before we get into some of the math, I want to note that \(\gamma\) and \(\lambda\) serve different purposes. To determine the scale of the value function, we use \(\gamma\). In other words, the value of \(\gamma\) determines how nearsighted (\(\gamma\) near 0) or farsighted (\(\gamma\) near 1) we want our agent to be in its value estimate. No matter how accurate our value function is, if \(\gamma < 1\), we introduce bias into the policy gradient estimate. On the other hand, \(\lambda\) is a decay factor and \(\lambda < 1\) only introduces bias when the value function is inaccurate.</p>
<p>I’m going to spare you the details on the derivation of GAE because I feel like we’ve gone through enough math for one post. However, if you have any questions just let me know in the comments section below and I’ll explain it in-depth. As mentioned before, GAE is defined as the exponentially weighted average of k-step advantage estimators. The equation is shown below:</p>
\[\hat{A}^{GAE(\gamma,\lambda)}_t = \sum^{\infty}_{l=0}(\gamma \lambda)^l\delta_{t+l}\]
<p>Below we show an implementation of GAE in python:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="k">def</span> <span class="nf">get_gaes</span><span class="p">(</span><span class="n">rewards</span><span class="p">,</span> <span class="n">state_values</span><span class="p">,</span> <span class="n">next_state_values</span><span class="p">,</span> <span class="n">GAMMA</span><span class="p">,</span> <span class="n">LAMBDA</span><span class="p">):</span>
<span class="n">deltas</span> <span class="o">=</span> <span class="p">[</span><span class="n">r_t</span> <span class="o">+</span> <span class="n">GAMMA</span> <span class="o">*</span> <span class="n">next_v</span> <span class="o">-</span> <span class="n">v</span> <span class="k">for</span> <span class="n">r_t</span><span class="p">,</span> <span class="n">next_v</span><span class="p">,</span> <span class="n">v</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">rewards</span><span class="p">,</span> <span class="n">next_state_values</span><span class="p">,</span> <span class="n">state_values</span><span class="p">)]</span>
<span class="n">gaes</span> <span class="o">=</span> <span class="n">copy</span><span class="p">.</span><span class="n">deepcopy</span><span class="p">(</span><span class="n">deltas</span><span class="p">)</span>
<span class="k">for</span> <span class="n">t</span> <span class="ow">in</span> <span class="nb">reversed</span><span class="p">(</span><span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">gaes</span><span class="p">)</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)):</span>
<span class="n">gaes</span><span class="p">[</span><span class="n">t</span><span class="p">]</span> <span class="o">=</span> <span class="n">gaes</span><span class="p">[</span><span class="n">t</span><span class="p">]</span> <span class="o">+</span> <span class="n">LAMBDA</span> <span class="o">*</span> <span class="n">GAMMA</span> <span class="o">*</span> <span class="n">gaes</span><span class="p">[</span><span class="n">t</span> <span class="o">+</span> <span class="mi">1</span><span class="p">]</span>
<span class="k">return</span> <span class="n">gaes</span><span class="p">,</span> <span class="n">deltas</span>
</code></pre></div></div>
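<p>As a quick sanity check, here is how the function above might be called on a toy three-step trajectory. The numbers are made up; in the actual agent the value estimates come from the value network.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import copy  # required by get_gaes (it calls copy.deepcopy internally)

rewards = [0.0, 0.0, 1.0]             # toy rewards for a 3-step trajectory
state_values = [0.5, 0.6, 0.7]        # V(s_t) from the value network
next_state_values = [0.6, 0.7, 0.0]   # V(s_{t+1}); 0 for the terminal step

gaes, deltas = get_gaes(rewards, state_values, next_state_values, GAMMA=0.99, LAMBDA=0.95)
print(deltas)  # the one-step TD errors
print(gaes)    # their exponentially weighted sums, i.e. the advantage estimates
</code></pre></div></div>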
<p>If you understand the equation above, then you might find this next part pretty cool, otherwise you can just skip over it. There are two special cases of the formula above, when \(\lambda=0\) and \(\lambda=1\):</p>
\[GAE(\gamma,0): \hat{A}_t := \delta_t = r_t + \gamma V(S_{t+1}) - V(S_t)\]
\[GAE(\gamma,1): \hat{A}_t := \sum^{\infty}_{l=0}\gamma^l\delta_{t+l} = \sum^{\infty}_{l=0}\gamma^lr_{t+l} - V(S_t)\]
<p>When we have \(0 < \lambda < 1\), our GAE is making a compromise between bias and variance. From now on, our loss function for the policy gradient becomes:</p>
\[\mathcal{L} = \hat{A}^{GAE(\gamma,\lambda)} \times \log \pi(s,a)\]
<p>Going forward, when you see \(\hat{A}_t\), we are actually referring to \(\hat{A}^{GAE(\gamma,\lambda)}_t\).</p>
<h2 id="proximal-policy-optimization">Proximal Policy Optimization</h2>
<p>We’re finally done catching up on all the background knowledge - time to learn about Proximal Policy Optimization (PPO)! This algorithm is from <a href="https://arxiv.org/pdf/1707.06347.pdf">OpenAI’s paper</a>, and I highly recommend checking it out to get a more in-depth understanding after reading my blog.</p>
<p>PPO takes inspiration from <a href="https://arxiv.org/pdf/1502.05477.pdf">Trust Region Policy Optimization</a> (TRPO), which maximizes a “surrogate” objective function:</p>
\[L^{CPI}(\theta) = \hat{\mathbb{E}}_t\big[r_t(\theta)\hat{A}_t\big]\]
<p>where \(r_t(\theta)\) represents the probability ratio of our current policy versus our old policy:</p>
\[r_t(\theta) = \frac{\pi_{\theta}(a_t \mid s_t)}{\pi_{\theta_\text{old}}(a_t \mid s_t)}\]
<p>TRPO also constrains how far the new policy can move from the old one (measured by KL divergence), which I’m not going to get into, but if you’re interested, I highly recommend reading the paper. While TRPO is quite impressive, it is complex and computationally expensive to run. As a result, OpenAI came up with a simpler, more general algorithm that has better sample complexity (empirically). The idea is to limit how much our policy can change during each round of updates by clipping \(r_t(\theta)\) to a range determined by \(\epsilon\):</p>
\[L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\big[\min(r_t(\theta)\hat{A}_t, \text{clip}(r_t(\theta), 1 - \epsilon, 1 + \epsilon)\hat{A}_t)\big]\]
<p>The reason we do this is that conventional policy gradient methods are very sensitive to the choice of step size. If the step size is too small, training progresses too slowly. If the step size is too large, the policy can overshoot the optimal policy during training, making learning noisy and unstable. By limiting how much our policy can change per round of updates, we reduce this sensitivity to the step size. An implementation of \(L^{CLIP}\) in Python is shown below:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="k">with</span> <span class="n">tf</span><span class="p">.</span><span class="n">variable_scope</span><span class="p">(</span><span class="s">'actor_loss'</span><span class="p">):</span>
<span class="n">action_probabilities</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">reduce_sum</span><span class="p">(</span><span class="n">policy</span> <span class="o">*</span> <span class="n">tf</span><span class="p">.</span><span class="n">one_hot</span><span class="p">(</span><span class="n">indices</span> <span class="o">=</span> <span class="n">actions</span><span class="p">,</span> <span class="n">depth</span> <span class="o">=</span> <span class="n">output_dimension</span><span class="p">),</span> <span class="n">axis</span> <span class="o">=</span> <span class="mi">1</span><span class="p">)</span>
<span class="n">old_action_probabilities</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">reduce_sum</span><span class="p">(</span><span class="n">old_policy</span> <span class="o">*</span> <span class="n">tf</span><span class="p">.</span><span class="n">one_hot</span><span class="p">(</span><span class="n">indices</span> <span class="o">=</span> <span class="n">actions</span><span class="p">,</span> <span class="n">depth</span> <span class="o">=</span> <span class="n">output_dimension</span><span class="p">),</span> <span class="n">axis</span> <span class="o">=</span> <span class="mi">1</span><span class="p">)</span>
<span class="n">ratios</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">exp</span><span class="p">(</span><span class="n">tf</span><span class="p">.</span><span class="n">log</span><span class="p">(</span><span class="n">action_probabilities</span><span class="p">)</span> <span class="o">-</span> <span class="n">tf</span><span class="p">.</span><span class="n">log</span><span class="p">(</span><span class="n">old_action_probabilities</span><span class="p">))</span>
<span class="n">clipped_ratios</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">clip_by_value</span><span class="p">(</span><span class="n">ratios</span><span class="p">,</span> <span class="n">clip_value_min</span> <span class="o">=</span> <span class="mi">1</span> <span class="o">-</span> <span class="n">_clip_value</span><span class="p">,</span> <span class="n">clip_value_max</span> <span class="o">=</span> <span class="mi">1</span> <span class="o">+</span> <span class="n">_clip_value</span><span class="p">)</span>
<span class="n">clipped_loss</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">minimum</span><span class="p">(</span><span class="n">tf</span><span class="p">.</span><span class="n">multiply</span><span class="p">(</span><span class="n">GAE</span><span class="p">,</span> <span class="n">ratios</span><span class="p">),</span> <span class="n">tf</span><span class="p">.</span><span class="n">multiply</span><span class="p">(</span><span class="n">GAE</span><span class="p">,</span> <span class="n">clipped_ratios</span><span class="p">))</span>
<span class="n">actor_loss</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">reduce_mean</span><span class="p">(</span><span class="n">clipped_loss</span><span class="p">)</span>
</code></pre></div></div>
<p>You will notice in the image below (taken from the PPO paper) that there are certain values of \(r_t(\theta)\) where the gradient is 0. When the advantage is positive, the cutoff point is \(1 + \epsilon\). When the advantage is negative, the cutoff point is \(1 - \epsilon\). By taking the minimum of the clipped and unclipped objective, as demonstrated below, we are creating a lower bound on the unclipped objective. In other words, we ignore a change in \(r_t(\theta)\) when it makes the objective improve, which is why the lower bound is also known as the pessimistic bound.</p>
<p style="text-align: center;"><img src="https://brandinho.github.io/images/PPO-objective.png" alt="alt" /></p>
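<p>To see where those flat regions come from, here is a small NumPy sketch that evaluates the clipped surrogate over a grid of ratios (the advantage values and \(\epsilon\) are arbitrary):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

epsilon = 0.2
ratios = np.linspace(0.0, 2.0, 9)

def l_clip(ratio, advantage, eps=epsilon):
    clipped = np.clip(ratio, 1 - eps, 1 + eps)
    return np.minimum(ratio * advantage, clipped * advantage)

# For positive advantages the objective is flat above 1 + epsilon;
# for negative advantages it is flat below 1 - epsilon
print(l_clip(ratios, advantage=1.0))
print(l_clip(ratios, advantage=-1.0))
</code></pre></div></div>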
<p>Our implementation has a unique feature that I haven’t mentioned yet: after the convolutional layers, we concatenate a series of one-hot encodings corresponding to the previous actions our agent took. We do this because there are a few cases in which a combination of buttons needs to be pressed in a specific order; by taking previous actions into account, we allow our agent to learn such sequences. The video shown below was created after relatively little training using PPO on a MacBook Pro. I plan on running the algorithm for longer and updating the video sometime in the near future:</p>
<video controls="controls" allowfullscreen="true">
<source src="/images/mario.avi" type="video/mp4" />
</video>
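<p>For reference, here is a rough sketch of the action-history idea described above - one-hot encoding the last few actions and concatenating them onto the flattened convolutional features before the policy head. The layer sizes, placeholder shapes, and variable names are illustrative, not taken from the actual code:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import tensorflow as tf  # TF1-style, matching the snippets above

history_length, output_dimension = 4, 12   # illustrative values

frames = tf.placeholder(tf.float32, [None, 84, 84, 4])
prev_actions = tf.placeholder(tf.int32, [None, history_length])

conv_output = tf.layers.conv2d(frames, filters=32, kernel_size=8, strides=4, activation=tf.nn.relu)
conv_features_flat = tf.layers.flatten(conv_output)

# One-hot encode the last few actions and append them to the visual features
prev_actions_one_hot = tf.one_hot(prev_actions, depth=output_dimension)
prev_actions_flat = tf.reshape(prev_actions_one_hot, [-1, history_length * output_dimension])
policy_input = tf.concat([conv_features_flat, prev_actions_flat], axis=1)
</code></pre></div></div>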
<h2 id="concluding-remarks">Concluding Remarks</h2>
<p>In this post, we covered a lot of reinforcement learning background and learned how PPO works. We saw that using GAE with PPO is a clever way to deal with the credit assignment problem while keeping the bias-variance tradeoff in check. We also learned a little bit about convolutional neural networks as a way to handle pixel inputs. I hope you can take what you learned in this post and apply it to your favorite games!</p>
Brandon Da Silvabrandasilva9@gmail.comReinforcement Learning, Neural Networks, Policy GradientLearning How to Run with Genetic Algorithms2018-11-18T00:00:00+00:002018-11-18T00:00:00+00:00https://brandinho.github.io/genetic-algorithm<h2 id="overview">Overview</h2>
<p>When most people think of Deep Reinforcement Learning, they probably think of Q-networks or policy gradients. Both of these methods require you to calculate derivatives and use gradient descent. In this post, we are going to explore a derivative-free method for optimizing a policy network. Specifically, we are going to be using a genetic algorithm on DeepMind’s <a href="https://arxiv.org/pdf/1801.00690.pdf">Control Suite</a> to allow the “cheetah” physical model to learn how to run. You can find the complete code on my <a href="https://github.com/brandinho/Genetic-Algorithm-Control-Suite">github repo</a>.</p>
<h2 id="genetic-algorithm-background">Genetic Algorithm Background</h2>
<p>Genetic algorithms (GAs) are inspired by natural selection, as put forth by Charles Darwin. The idea is that over generations, the heritable traits of a population change because of <em>mutation</em> and the concept of <em>survival of the fittest</em>.</p>
<p>Similar to natural selection, GAs iterate over multiple generations to evolve a population. The population in our case is going to consist of a bunch of neural network weights, which define our cheetah agents. You can think of each set of neural network weights as an individual agent in the population - usually called a chromosome or genotype. Chromosomes are usually encoded as binary strings, but since we want to optimize neural network weights, we will adapt the encoding for continuous numbers. Each neural network weight in our chromosome can be referred to as a gene. After iterating through all the generations, and continually improving the cheetah’s chromosome (its neural network weights), we hope that it learns how to run.</p>
<h3 id="initialization">Initialization</h3>
<p>To begin the process, we need to initialize our population of agents. We sample the initial neural network weights from a normal distribution with a scaling factor outlined in <a href="http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf">Glorot and Bengio’s paper</a>:</p>
\[Var[W^i] = \frac{2}{n_i + n_{i+1}}\]
<p>where \(W^i\) refers to the weight matrix in the \(i^\text{th}\) layer, while \(n_i\) and \(n_{i+1}\) refer to the input and output dimensionality of that layer. Below you’ll see python code to implement the population initialization, where <code class="language-plaintext highlighter-rouge">scaling_factor</code> is a vector of variances calculated according to the equation above:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="n">population</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">multivariate_normal</span><span class="p">(</span><span class="n">mean</span> <span class="o">=</span> <span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">*</span><span class="n">scaling_factor</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span>
<span class="n">cov</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">diag</span><span class="p">(</span><span class="n">scaling_factor</span><span class="p">),</span>
<span class="n">size</span> <span class="o">=</span> <span class="n">population_size</span><span class="p">)</span>
</code></pre></div></div>
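<p>The snippet above assumes a <code class="language-plaintext highlighter-rouge">scaling_factor</code> vector has already been built. As a minimal sketch (with a made-up architecture, and ignoring bias terms), it could be constructed by repeating each layer’s Glorot variance once per weight in that layer:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

layer_sizes = [17, 64, 6]   # illustrative observation, hidden, and action dimensions

# Repeat each layer's Glorot variance once per weight in that layer:
# Var[W^i] = 2 / (n_i + n_{i+1})
scaling_factor = np.concatenate([
    np.full(n_in * n_out, 2.0 / (n_in + n_out))
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
])
population_size = 40
</code></pre></div></div>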
<h3 id="selection">Selection</h3>
<p>Now that we have a population, we can have the agents within the population compete against each other! The agents that are the most “fit” have the highest probability of passing their genes onto the next generation. We will define fitness as the cumulative reward of our agent over the span of an episode. As you might have guessed by the way we defined it, fitness refers to how good an agent is at performing the task we want it to learn. Those that are better at performing the task will have a better chance of being selected as parents to breed a new generation. There are two primary methods for parent selection - <strong>Roulette</strong> and <strong>Tournament</strong>.</p>
<p>The roulette method selects parents with a probability proportional to their fitness score. This is why it is also called <em>Fitness Proportionate Selection</em>.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="c1"># Roulette Wheel Selection
</span> <span class="n">position</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">2</span><span class="p">):</span>
<span class="n">random_number</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">uniform</span><span class="p">(</span><span class="n">low</span> <span class="o">=</span> <span class="mi">0</span><span class="p">,</span> <span class="n">high</span> <span class="o">=</span> <span class="n">scores_cumulsum</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">])</span>
<span class="n">position</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="nb">next</span><span class="p">(</span><span class="n">x</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">scores_cumulsum</span><span class="p">)</span> <span class="k">if</span> <span class="n">x</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">></span> <span class="n">random_number</span><span class="p">))</span>
<span class="n">parent_1</span> <span class="o">=</span> <span class="n">population</span><span class="p">[</span><span class="n">population_index_sorted</span><span class="p">[</span><span class="n">position</span><span class="p">[</span><span class="mi">0</span><span class="p">]]]</span>
<span class="n">parent_2</span> <span class="o">=</span> <span class="n">population</span><span class="p">[</span><span class="n">population_index_sorted</span><span class="p">[</span><span class="n">position</span><span class="p">[</span><span class="mi">1</span><span class="p">]]]</span>
</code></pre></div></div>
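<p>The snippet above relies on two precomputed arrays. One consistent way to build them (assuming non-negative fitness scores, which holds here since the cheetah’s per-step rewards are between 0 and 1) is sketched below - note this is an illustration, not necessarily how the repo does it:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

# Toy fitness scores (cumulative episode rewards) for a population of 4 agents
competition_scores = np.array([12.3, 48.0, 30.5, 5.1])

# Sort agents by fitness; the cumulative sum of the sorted scores is what the
# roulette wheel spins over (selection probability proportional to fitness)
population_index_sorted = np.argsort(competition_scores)
scores_cumulsum = np.cumsum(competition_scores[population_index_sorted])
</code></pre></div></div>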
<p>The tournament method runs two tournaments in parallel with different subsets of the total population. The competitors for each tournament are chosen at random. The winners from each tournament are selected as the parents to breed the next generation.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="c1"># Tournament Selection
</span> <span class="n">k</span> <span class="o">=</span> <span class="n">population_size</span> <span class="o">//</span> <span class="mi">2</span>
<span class="n">tournament_population</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">zeros</span><span class="p">((</span><span class="n">k</span><span class="p">,</span> <span class="mi">2</span><span class="p">))</span>
<span class="n">total_competitors</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">choice</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">arange</span><span class="p">(</span><span class="n">population_size</span><span class="p">),</span> <span class="n">k</span> <span class="o">*</span> <span class="mi">2</span><span class="p">,</span> <span class="n">replace</span> <span class="o">=</span> <span class="bp">False</span><span class="p">)</span>
<span class="n">tournament_population</span><span class="p">[:,</span><span class="mi">0</span><span class="p">]</span> <span class="o">=</span> <span class="n">competition_scores</span><span class="p">[</span><span class="n">total_competitors</span><span class="p">[:</span><span class="n">k</span><span class="p">]]</span>
<span class="n">tournament_population</span><span class="p">[:,</span><span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="n">competition_scores</span><span class="p">[</span><span class="n">total_competitors</span><span class="p">[</span><span class="n">k</span><span class="p">:]]</span>
<span class="n">parent_indexes</span> <span class="o">=</span> <span class="n">total_competitors</span><span class="p">[</span><span class="n">np</span><span class="p">.</span><span class="n">argmax</span><span class="p">(</span><span class="n">tournament_population</span><span class="p">,</span> <span class="n">axis</span> <span class="o">=</span> <span class="mi">0</span><span class="p">)</span> <span class="o">+</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([</span><span class="mi">0</span><span class="p">,</span><span class="n">k</span><span class="p">])]</span>
<span class="n">parent_1</span> <span class="o">=</span> <span class="n">population</span><span class="p">[</span><span class="n">parent_indexes</span><span class="p">[</span><span class="mi">0</span><span class="p">],]</span>
<span class="n">parent_2</span> <span class="o">=</span> <span class="n">population</span><span class="p">[</span><span class="n">parent_indexes</span><span class="p">[</span><span class="mi">1</span><span class="p">],]</span>
</code></pre></div></div>
<h3 id="elitism">Elitism</h3>
<p>One thing we can do to improve performance in our algorithm is introduce the concept of elitism. This refers to the act of carrying over the most fit agents to the next generation without altering their chromosomes through crossover or mutation (which we will explore very soon). We do this because we always want to preserve the best agents from one generation to the next; it is not guaranteed that any of the children will be more fit than their parents.</p>
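<p>A minimal sketch of elitism - copying the fittest few chromosomes straight into the next generation before any crossover or mutation happens - might look like this (array shapes and variable names are illustrative):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

n_elites = 2
# population: one chromosome (flattened weights) per row; competition_scores: fitness per agent
population = np.random.randn(40, 100)   # illustrative shapes
competition_scores = np.random.rand(40)

# Indices of the fittest agents, highest fitness first
elite_indexes = np.argsort(competition_scores)[::-1][:n_elites]

# Carry their chromosomes into the next generation unchanged
new_population = np.zeros_like(population)
new_population[:n_elites] = population[elite_indexes]
# The remaining rows are then filled in via selection, crossover, and mutation
</code></pre></div></div>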
<h3 id="crossover">Crossover</h3>
<p>Now that we know how to select the parents from the population, let’s talk breeding! Crossover, also called recombination, takes the chromosomes of two parents and combines them to form children in the next generation. Here are a few ways you can combine two chromosomes:</p>
<p>The first and easiest way is to perform <strong>One Point</strong> crossover. You randomly select a partition in the chromosome, as indicated by the red line below. The child gets the left side of the partition from one parent and the right side from the other parent.</p>
<p style="text-align: center;"><img src="https://brandinho.github.io/images/one_point_crossover.png" alt="alt" /></p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="n">partition</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">randint</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">parent_1</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span>
<span class="c1"># Select which parent will be the "left side"
</span> <span class="k">if</span> <span class="n">which_parent</span> <span class="o">==</span> <span class="s">"Parent 1"</span><span class="p">:</span>
<span class="n">child</span> <span class="o">=</span> <span class="n">parent_1</span>
<span class="n">child</span><span class="p">[</span><span class="n">partition</span><span class="p">:]</span> <span class="o">=</span> <span class="n">parent_2</span><span class="p">[</span><span class="n">partition</span><span class="p">:]</span>
<span class="k">elif</span> <span class="n">which_parent</span> <span class="o">==</span> <span class="s">"Parent 2"</span><span class="p">:</span>
<span class="n">child</span> <span class="o">=</span> <span class="n">parent_2</span>
<span class="n">child</span><span class="p">[</span><span class="n">partition</span><span class="p">:]</span> <span class="o">=</span> <span class="n">parent_1</span><span class="p">[</span><span class="n">partition</span><span class="p">:]</span>
</code></pre></div></div>
<p>Building on the previous method is <strong>Two Point</strong> crossover. This is conceptually the same, except you randomly select two points, which serve as a lower and upper bound. The child gets the elements outside of the bounds from one parent, and the elements within the bounds from the other parent.</p>
<p style="text-align: center;"><img src="https://brandinho.github.io/images/two_point_crossover.png" alt="alt" /></p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="n">lower_limit</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">randint</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">parent_1</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">-</span><span class="mi">1</span><span class="p">)</span>
<span class="n">upper_limit</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">randint</span><span class="p">(</span><span class="n">lower_limit</span><span class="o">+</span><span class="mi">1</span><span class="p">,</span> <span class="n">parent_1</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span>
<span class="c1"># Select which parent will be the "outside bounds"
</span> <span class="k">if</span> <span class="n">which_parent</span> <span class="o">==</span> <span class="s">"Parent 1"</span><span class="p">:</span>
<span class="n">child</span> <span class="o">=</span> <span class="n">parent_1</span>
<span class="n">child</span><span class="p">[</span><span class="n">lower_limit</span><span class="p">:</span><span class="n">upper_limit</span><span class="o">+</span><span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="n">parent_2</span><span class="p">[</span><span class="n">lower_limit</span><span class="p">:</span><span class="n">upper_limit</span><span class="o">+</span><span class="mi">1</span><span class="p">]</span>
<span class="k">elif</span> <span class="n">which_parent</span> <span class="o">==</span> <span class="s">"Parent 2"</span><span class="p">:</span>
<span class="n">child</span> <span class="o">=</span> <span class="n">parent_2</span>
<span class="n">child</span><span class="p">[</span><span class="n">lower_limit</span><span class="p">:</span><span class="n">upper_limit</span><span class="o">+</span><span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="n">parent_1</span><span class="p">[</span><span class="n">lower_limit</span><span class="p">:</span><span class="n">upper_limit</span><span class="o">+</span><span class="mi">1</span><span class="p">]</span>
</code></pre></div></div>
<p>Unlike the previous two methods, which required the swapped genes to be in a sequence, the <strong>Uniform</strong> crossover does not. Rather, it randomly selects, with a uniform distribution, the indexes to be swapped during crossover.</p>
<p style="text-align: center;"><img src="https://brandinho.github.io/images/uniform_crossover.png" alt="alt" /></p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="n">random_sequence</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">choice</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">arange</span><span class="p">(</span><span class="n">parent_1</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]),</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">randint</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">parent_1</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]),</span> <span class="n">replace</span> <span class="o">=</span> <span class="bp">False</span><span class="p">)</span>
<span class="k">if</span> <span class="n">which_parent</span> <span class="o">==</span> <span class="s">"Parent 1"</span><span class="p">:</span>
<span class="n">child</span> <span class="o">=</span> <span class="n">parent_1</span>
<span class="n">child</span><span class="p">[</span><span class="n">np</span><span class="p">.</span><span class="n">sort</span><span class="p">(</span><span class="n">random_sequence</span><span class="p">)]</span> <span class="o">=</span> <span class="n">parent_2</span><span class="p">[</span><span class="n">np</span><span class="p">.</span><span class="n">sort</span><span class="p">(</span><span class="n">random_sequence</span><span class="p">)]</span>
<span class="k">elif</span> <span class="n">which_parent</span> <span class="o">==</span> <span class="s">"Parent 2"</span><span class="p">:</span>
<span class="n">child</span> <span class="o">=</span> <span class="n">parent_2</span>
<span class="n">child</span><span class="p">[</span><span class="n">np</span><span class="p">.</span><span class="n">sort</span><span class="p">(</span><span class="n">random_sequence</span><span class="p">)]</span> <span class="o">=</span> <span class="n">parent_1</span><span class="p">[</span><span class="n">np</span><span class="p">.</span><span class="n">sort</span><span class="p">(</span><span class="n">random_sequence</span><span class="p">)]</span>
</code></pre></div></div>
<p>For the last crossover method, we’ll switch it up a little bit with the <strong>Arithmetic</strong> crossover. As the name implies, rather than swapping genes to form a new chromosome, we will do some arithmetic to make a new one. We take a simple weighted average of the two parents’ chromosomes, where the weight is randomly generated.</p>
<p style="text-align: center;"><img src="https://brandinho.github.io/images/arithmetic_crossover.png" alt="alt" /></p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="n">random_weight</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">rand</span><span class="p">()</span>
<span class="n">child</span> <span class="o">=</span> <span class="n">parent_1</span> <span class="o">*</span> <span class="n">random_weight</span> <span class="o">+</span> <span class="n">parent_2</span> <span class="o">*</span> <span class="p">(</span><span class="mi">1</span> <span class="o">-</span> <span class="n">random_weight</span><span class="p">)</span>
</code></pre></div></div>
<p>Personally, I like using all of the crossover methods, so each time my algorithm performs crossover I randomly select one of the above methods with equal probability.</p>
<p>When setting up a genetic algorithm we define a probability of performing crossover, \(p_\text{cross}\). Thus, with \(1 - p_\text{cross}\) probability, we carry over the parent chromosomes to the next generation without crossover. Since we are going to use elitism in our algorithm, we will probably want to set \(p_\text{cross}\) to be close to 1 because otherwise there is a high probability that we will have duplicate chromosomes in the next generation.</p>
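<p>Putting the pieces together, the breeding step for a single child might look roughly like the sketch below (arithmetic crossover is shown for brevity; the real loop would dispatch to any of the operators above):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

p_cross = 0.95  # probability of performing crossover for a given pair of parents

# Toy parents - in the real algorithm these come from roulette or tournament selection
parent_1, parent_2 = np.random.randn(10), np.random.randn(10)

if np.random.rand() < p_cross:
    # Apply one of the crossover operators above, chosen with equal probability
    # (arithmetic crossover shown here for brevity)
    random_weight = np.random.rand()
    child = parent_1 * random_weight + parent_2 * (1 - random_weight)
else:
    # No crossover: carry a parent forward unchanged (it may still be mutated later)
    child = parent_1.copy() if np.random.rand() < 0.5 else parent_2.copy()
</code></pre></div></div>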
<h3 id="mutation">Mutation</h3>
<p>After reviewing some of the crossover methods, you might be thinking that we’re just combining genes together without changing their order (with the exception of the arithmetic operator). This means that our chromosomes will be bounded by the initialized values from the first generation, which limits how much our agents can evolve. To ensure this doesn’t happen, we need to maintain genetic diversity - we do this with the mutation operator.</p>
<p>Similar to crossover, there are multiple ways to perform mutation. For my implementation, I randomly select a gene with probability \(p_\text{mutate}\) and add Gaussian noise to it:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="n">noise</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">standard_normal</span><span class="p">()</span> <span class="o">*</span> <span class="n">noise_scale</span>
<span class="n">mutation_position</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">randint</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">population</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">])</span>
<span class="n">child</span><span class="p">[</span><span class="n">mutation_position</span><span class="p">]</span> <span class="o">=</span> <span class="n">child</span><span class="p">[</span><span class="n">mutation_position</span><span class="p">]</span> <span class="o">+</span> <span class="n">noise</span>
</code></pre></div></div>
<p>Even though I kept my implementation relatively simple, you can get a bit fancier by implementing some of the mutation methods outlined below. The first is the <strong>Swap</strong> mutation, which selects two random positions in the chromosome and swaps their genes:</p>
<p style="text-align: center;"><img src="https://brandinho.github.io/images/swap_mutation.png" alt="alt" style="max-width: 300px; height: auto;" /></p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="n">random_positions</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">choice</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">arange</span><span class="p">(</span><span class="n">child</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]),</span> <span class="mi">2</span><span class="p">,</span> <span class="n">replace</span> <span class="o">=</span> <span class="bp">False</span><span class="p">)</span>
<span class="n">value_1</span><span class="p">,</span> <span class="n">value_2</span> <span class="o">=</span> <span class="n">child</span><span class="p">[</span><span class="n">random_positions</span><span class="p">[</span><span class="mi">0</span><span class="p">]],</span> <span class="n">child</span><span class="p">[</span><span class="n">random_positions</span><span class="p">[</span><span class="mi">1</span><span class="p">]]</span>
<span class="n">child</span><span class="p">[</span><span class="n">random_positions</span><span class="p">[</span><span class="mi">0</span><span class="p">]],</span> <span class="n">child</span><span class="p">[</span><span class="n">random_positions</span><span class="p">[</span><span class="mi">1</span><span class="p">]]</span> <span class="o">=</span> <span class="n">value_2</span><span class="p">,</span> <span class="n">value_1</span>
</code></pre></div></div>
<p>Another method you can implement is the <strong>Inversion</strong> mutation, which selects two random positions and inverts/reverses the substring of genes between them:</p>
<p style="text-align: center;"><img src="https://brandinho.github.io/images/inversion_mutation.png" alt="alt" style="max-width: 300px; height: auto;" /></p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="n">lower_limit</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">randint</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">child</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">-</span><span class="mi">1</span><span class="p">)</span>
<span class="n">upper_limit</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">randint</span><span class="p">(</span><span class="n">lower_limit</span><span class="o">+</span><span class="mi">1</span><span class="p">,</span> <span class="n">child</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span>
<span class="n">child</span><span class="p">[</span><span class="n">lower_limit</span><span class="p">:</span><span class="n">upper_limit</span><span class="o">+</span><span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="n">child</span><span class="p">[</span><span class="n">lower_limit</span><span class="p">:</span><span class="n">upper_limit</span><span class="o">+</span><span class="mi">1</span><span class="p">][::</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span>
</code></pre></div></div>
<p>Lastly, you can implement the <strong>Scramble</strong> mutation, which selects two random positions and scrambles the positions of the genes within them:</p>
<p style="text-align: center;"><img src="https://brandinho.github.io/images/scramble_mutation.png" alt="alt" style="max-width: 300px; height: auto;" /></p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="n">lower_limit</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">randint</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">child</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">-</span><span class="mi">1</span><span class="p">)</span>
<span class="n">upper_limit</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">randint</span><span class="p">(</span><span class="n">lower_limit</span><span class="o">+</span><span class="mi">1</span><span class="p">,</span> <span class="n">child</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span>
<span class="n">scrambled_order</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">choice</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">arange</span><span class="p">(</span><span class="n">lower_limit</span><span class="p">,</span> <span class="n">upper_limit</span><span class="o">+</span><span class="mi">1</span><span class="p">),</span> <span class="n">upper_limit</span> <span class="o">+</span> <span class="mi">1</span> <span class="o">-</span> <span class="n">lower_limit</span><span class="p">,</span> <span class="n">replace</span> <span class="o">=</span> <span class="bp">False</span><span class="p">)</span>
<span class="n">child</span><span class="p">[</span><span class="n">lower_limit</span><span class="p">:</span><span class="n">upper_limit</span><span class="o">+</span><span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="n">child</span><span class="p">[</span><span class="n">scrambled_order</span><span class="p">]</span>
</code></pre></div></div>
<h2 id="deepminds-control-suite">DeepMind’s Control Suite</h2>
<p>Great, now that we have all the pieces to make a genetic algorithm, let’s put them together to train an agent on the “cheetah” domain from DeepMind’s Control Suite. For those who are not familiar with the library, it is powered by the MuJoCo physics engine and provides you with an environment for training agents on a set of continuous control tasks. For our experiment, we want the cheetah to learn how to run.</p>
<p>The thing that I really like about this library is that it has a standardized structure. For example, the library provides you with an observation of the environment and a reward for every action you take. The state observation for our domain task is a combination of the cheetah’s position and velocity. The reward, \(r\), is a function of the forward velocity, \(v\), up to a maximum of \(10 m/s\):</p>
\[r(v) = max(0, min(v/10, 1))\]
<p>We run each episode for 500 frames and calculate the fitness, \(f\), as:</p>
\[f = \sum_{i=1}^{500}r_i\]
<p>At each time step, our agent has to take 6 actions in parallel - one for the movement of each of its limbs. The action vector for our cheetah has the following property: \(\boldsymbol{a} \in \mathcal{A} \equiv [-1,1]^{6}\). Thus, for our policy, we are going to use a neural network with a 6-dimensional \(\tanh\) output. We flatten all of the neural network weights into a one-dimensional array in order to apply the crossover and mutation operators mentioned above. After the child chromosome is created, we reshape the weights so they can be used in a neural network for the next generation. Overall, we used 1000 generations with a population size of 40 to train our cheetah.</p>
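<p>Here is a rough sketch of that flatten/reshape round trip, assuming the weights are stored as a list of NumPy matrices (bias terms are omitted for brevity):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

def flatten_weights(weight_matrices):
    # Chromosome: all weights concatenated into a single 1-D array
    return np.concatenate([w.ravel() for w in weight_matrices])

def unflatten_weights(chromosome, shapes):
    # Rebuild the per-layer weight matrices for the next generation's network
    matrices, start = [], 0
    for shape in shapes:
        size = int(np.prod(shape))
        matrices.append(chromosome[start:start + size].reshape(shape))
        start += size
    return matrices

shapes = [(17, 64), (64, 6)]   # illustrative layer shapes (observation, hidden, action)
weights = [np.random.randn(*s) for s in shapes]
chromosome = flatten_weights(weights)
restored = unflatten_weights(chromosome, shapes)
</code></pre></div></div>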
<p>When the training process starts (Generation 1), we see that the cheetah doesn’t know how to move and ends up falling backwards:</p>
<p style="text-align: center;"><img src="/images/cheetah_start.gif" alt="Alt Text" /></p>
<p>As training progresses (Generation 250), the cheetah learns how to run forward. However, we see that near the end of the episode it loses control of its stride and falls flat on its face:</p>
<p style="text-align: center;"><img src="/images/cheetah_mid.gif" alt="Alt Text" /></p>
<p>At the end of the training process (Generation 1000), we see that the cheetah learns how to run, while also maintaining its center of gravity during large strides:</p>
<p style="text-align: center;"><img src="/images/cheetah_end.gif" alt="Alt Text" /></p>
<p>Awesome, we did it!</p>
<h2 id="concluding-remarks">Concluding Remarks</h2>
<p>In this post we learned how genetic algorithms can be used to optimize parameters of a neural network for a continuous control task. In a future post we will explore an application where we mix genetic algorithms (derivative-free method) and policy gradients (derivative-based method) for better training.</p>
Brandon Da Silvabrandasilva9@gmail.comReinforcement Learning, Neural Networks, Genetic AlgorithmLearning Probability Distributions in Bounded Action Spaces2018-11-12T00:00:00+00:002018-11-12T00:00:00+00:00https://brandinho.github.io/bayesian-policy<h2 id="overview">Overview</h2>
<p>In this post we will learn how to apply reinforcement learning in a probabilistic manner. More specifically, we will look at some of the difficulties in applying conventional approaches to bounded action spaces, and provide a solution. This blog assumes you have some knowledge of deep learning. If not, check out <a href="http://neuralnetworksanddeeplearning.com/chap1.html">Michael Nielsen’s book</a> - it is very comprehensive and easy to understand.</p>
<h2 id="reinforcement-learning-background">Reinforcement Learning Background</h2>
<p>I am not going to provide a complete background on Reinforcement Learning (RL) because there are already some excellent resources online such as <a href="https://medium.com/emergent-future/simple-reinforcement-learning-with-tensorflow-part-0-q-learning-with-tables-and-neural-networks-d195264329d0">Arthur Juliani’s blogs</a> and <a href="https://www.youtube.com/watch?v=2pWv7GOvuf0&list=PLzuuYNsE1EZAXYR4FJ75jcJseBmo4KQ9-">David Silver’s lectures</a>. I highly recommend going through both to get a solid understanding of the fundamentals. With that said, I will explain some concepts that are important for this blog post.</p>
<p>At the most basic level, the goal of RL is to learn a mapping from states to actions. To understand what this means, I think it is important to take a step back and understand the RL framework more generally. Cue the overused RL diagram:</p>
<p style="text-align: center;"><img src="https://brandinho.github.io/images/AgentEnvironment.jpg" alt="alt" /></p>
<p>The first thing to notice is that there is a feedback loop between the agent and the environment. For clarity, the agent refers to the AI that we are creating, while the environment refers to the world that the agent has to navigate through. In order to navigate through an environment, the agent has to take actions. The specific actions will depend on the domain - we will describe a few fairly soon. After the agent takes an action, it receives an observation of the environment (the current state) and a reward (assuming we don’t have sparse rewards).</p>
<p>After interacting with the environment for long enough, we hope that our agent learns how to take actions that maximize its cumulative reward over the long-term. It is important to realize that the best action in one state is not necessarily the best action in another state. So going back to our statement about mapping states to actions, this simply means that we want our agent to learn the best actions to take in each environment state. The function that maps states to actions is called a policy and is denoted as \(\pi(a \mid s)\). Usually we read \(\pi(a \mid s)\) as: <em>probability of taking action \(a\), given we are in state \(s\)</em>. However, just as a side note, your policy does not have to be defined probabilistically - you can define it deterministically as well.</p>
<p>Now let’s talk a bit about actions an agent can take. The first distinction I would like to make is between discrete actions and continuous actions. When we refer to discrete actions, we simply mean that there is a finite set of possible actions an agent can take. For example, in pong an agent can decide to move up or down. On the other hand, continuous actions have an infinite number of possibilities. An example of a continuous action, although kind of silly, is the hiding position of an agent if it is playing hide and seek.</p>
<p>Given enough time, the agent can theoretically hide anywhere - so the action space is unbounded. In contrast, we can have a continuous action space that is bounded. An example close to my heart is position sizing when trading a financial asset. The bounds are -1 (100% Short) and 1 (100% Long). To map states to that bounded action space, we can use \(\tanh\) in the final layer of a neural network. That seems pretty easy… so why am I writing a blog post about it? Oftentimes we need more than just a deterministic output, especially when the underlying data has a low signal-to-noise ratio. The additional piece of information that we need is <em>uncertainty</em> in our agent’s decision. We will use a Bayesian approach to model a posterior distribution and sample from this distribution to estimate the uncertainty. Don’t worry if that doesn’t completely make sense yet - it will by the end of this post!</p>
<h2 id="probability-distributions">Probability Distributions</h2>
<p>For a great introduction to Bayesian statistics I suggest reading <a href="https://www.countbayesie.com">Will Kurt’s blog</a> - Count Bayesie. It’s awesome.</p>
<p>Distributions can be thought of as representing beliefs about the world. Specifically as it relates to our task at hand, the probability distributions represent our beliefs in how good an action is, given the state. In the financial markets context, where the action space is continuous and bounded between -1 and 1, a mean close to 1 represents a belief that it is a good time to buy that asset, so we should long it. A mean close to -1 represents the opposite, so we should short the asset. Building on this example, if the standard deviation of our distribution is large (small) then our agent is uncertain (certain) in its decision. In other words, if the agent’s policy has a large standard deviation, then it has not developed a strong belief yet.</p>
<p>Whenever you hear anyone talking about Bayesian statistics, you always hear the terms “prior” and “posterior”. Simply put, a prior is your belief about the world <em>before</em> receiving new information. However, once you receive new information, then you update your prior distribution to form a posterior distribution. After that, if you receive more information, then your posterior becomes your prior, and the new information gets incorporated to form a new posterior distribution. Essentially, there is this feedback loop of continual learning that happens as more and more new information gets processed by your agent. Below we visually show one iteration of this loop:</p>
<p style="text-align: center;"><img src="https://brandinho.github.io/images/prior_posterior.png" alt="alt" /></p>
<p>Our goal is to learn a good posterior distribution on actions, conditioned on the state that the agent is in. If you are familiar with <a href="https://arxiv.org/pdf/1506.02142.pdf">this paper</a>, then you might be thinking that we can just use Monte Carlo (MC) dropout with a \(\tanh\) output layer. For those who are not familiar with this concept, let me explain. Dropout is a technique that was originally used for neural network regularization. With each pass, it will randomly “drop” neurons from each hidden layer by setting their output to 0. This reduces the output’s dependency on any one particular neuron, which should help generalization. However, researchers at Cambridge found that keeping dropout on during inference can be used to approximate a posterior distribution. This is because each time you pass inputs through the network, a different set of neurons will be dropped, so the output is going to be different for each run - creating a distribution of outputs.</p>
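<p>Here is a minimal sketch of the idea (a toy TensorFlow 1.x network with made-up sizes and names, not the exact architecture behind the plots below): we leave dropout on at inference time and run many forward passes on the same state to build up a distribution of actions.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code> import numpy as np
 import tensorflow as tf  # TensorFlow 1.x

 state_dim, hidden_dim = 4, 32
 states = tf.placeholder(tf.float32, [None, state_dim])
 keep_prob = tf.placeholder(tf.float32)

 weights_hidden = tf.Variable(tf.random_normal([state_dim, hidden_dim], stddev = 0.1))
 bias_hidden = tf.Variable(tf.zeros([hidden_dim]))
 weights_out = tf.Variable(tf.random_normal([hidden_dim, 1], stddev = 0.1))
 bias_out = tf.Variable(tf.zeros([1]))

 hidden = tf.nn.relu(tf.matmul(states, weights_hidden) + bias_hidden)
 hidden_dropped = tf.nn.dropout(hidden, keep_prob = keep_prob)  # dropout stays on at inference
 policy = tf.nn.tanh(tf.matmul(hidden_dropped, weights_out) + bias_out)

 with tf.Session() as sess:
     sess.run(tf.global_variables_initializer())
     state = np.random.randn(1, state_dim)
     # A different set of neurons is dropped on every pass, so repeated runs on the
     # same state spread out into an approximate posterior over actions
     action_samples = [sess.run(policy, {states: state, keep_prob: 0.9}) for _ in range(1000)]
</code></pre></div></div>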
<p>The great thing about this architecture is that you can easily pass gradients through the policy network. The loss function that we are minimizing throughout this blog is \(\mathcal{L} = - r \times \pi(s)\), where \(r\) denotes the reward and \(\pi(s)\) denotes the policy output given the state (i.e. the action). We want to demonstrate how the distribution changes in a controlled environment, so we use the same state input throughout all of our experiments and continually feed the agent a positive reward to view the changes during training. A minimal sketch of this objective follows, and after it is the first example using the MC dropout method and a \(\tanh\) output layer.</p>
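<p>This sketch assumes <code class="language-plaintext highlighter-rouge">policy</code> is the network’s output tensor (as in the toy network above) and uses a hypothetical reward placeholder:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code> reward = tf.placeholder(tf.float32, [None, 1])

 # L = -r * pi(s): we push the policy towards actions that received positive rewards
 loss = tf.reduce_mean(-reward * policy)
 train_op = tf.train.AdamOptimizer(learning_rate = 1e-3).minimize(loss)
</code></pre></div></div>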
<p><img src="/images/MC_dropout_posterior.gif" alt="Alt Text" /></p>
<p>I omitted a kernel density estimation (KDE) plot on top of the histogram because, as training progressed, the KDE became much more jagged and no longer representative of the actual probability density function (PDF). I was using <code class="language-plaintext highlighter-rouge">sns.kdeplot</code> - if anyone knows how to fix this, please let me know in the comments section!</p>
<p>There are two things that I don’t particularly like about this approach. The first is that it is possible to have multiple peaks in the distribution, as seen when the neural network is first initialized. I realize that as training went on, only one peak emerged. However, the fact that an agent can potentially learn such a distribution (with multiple peaks) makes me uncomfortable. If we go back to our example in the financial markets, an action of -1 will have the exact opposite reward compared to an action of 1 (because it is the other side of the trade), so having peaks at both ends of the spectrum is quite confusing. I would much rather have one peak near 0 with a large standard deviation if the agent is uncertain which action to take. The second is that it becomes overly optimistic in its decision when compared to a Gaussian output (we will see this later), which may indicate that it is understating the uncertainty.</p>
<p>I will digress for a moment to state that a multimodal distribution (a distribution with multiple peaks) is not always bad. For example, imagine an agent navigating through a room, where the policy dictates the angle at which it will move: there could be two different angles that, while initially sending it in different directions, ultimately lead it to the same end location. However, for this post, we will stick to the example in the financial markets, where a multimodal distribution doesn’t make sense.</p>
<p>Instead of using MC dropout, we can try using a normal distribution in the output and see if things improve. The architecture of our neural network now becomes:</p>
<p style="text-align: center;"><img src="https://brandinho.github.io/images/gaussian_output.png" alt="alt" /></p>
<p>If our neural network parameters are denoted by \(\theta\), then we can define \(\mu_{\theta}\) and \(\sigma_{\theta}\) as outputs of the neural network, such that:</p>
\[\pi \sim \mathcal{N}(\mu_{\theta}(s), \sigma_{\theta}(s))\]
<h2 id="reparameterization-trick">Reparameterization Trick</h2>
<p>We want to update the policy network with backpropagation (similar to what we did with the MC dropout architecture), but you’ll notice that we have a bit of a problem - a random variable is now part of the computation graph. This is a problem because gradients cannot flow through a random sampling node. However, by using the reparameterization trick, we can move the random node outside of the path that gradients flow through and feed in samples drawn from the distribution as constants. Inference is exactly the same, but now our neural network can perform backpropagation.</p>
<p>To do this, we define a random variable \(\varepsilon\), which does not depend on \(\theta\). The new architecture becomes:</p>
<p style="text-align: center;"><img src="https://brandinho.github.io/images/gaussian_reparameterized.png" alt="alt" /></p>
\[\varepsilon \sim \mathcal{N}(0,I)\]
\[\pi = \mu_{\theta}(s) + \sigma_{\theta}(s) \cdot \varepsilon\]
<p>Python code to take the random variable outside of the computation graph is shown below (I’m only showing the relevant portion of the computation graph):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="kn">import</span> <span class="nn">tensorflow</span> <span class="k">as</span> <span class="n">tf</span>
<span class="n">policy_mu</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">tanh</span><span class="p">(</span><span class="n">tf</span><span class="p">.</span><span class="n">matmul</span><span class="p">(</span><span class="n">previous_layer</span><span class="p">,</span> <span class="n">weights_mu</span><span class="p">)</span> <span class="o">+</span> <span class="n">bias_mu</span><span class="p">)</span>
<span class="n">policy_sigma</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">softplus</span><span class="p">(</span><span class="n">tf</span><span class="p">.</span><span class="n">matmul</span><span class="p">(</span><span class="n">previous_layer</span><span class="p">,</span> <span class="n">weights_sigma</span><span class="p">)</span> <span class="o">+</span> <span class="n">bias_sigma</span><span class="p">)</span>
<span class="n">epsilon</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">random_normal</span><span class="p">(</span><span class="n">shape</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">shape</span><span class="p">(</span><span class="n">policy_sigma</span><span class="p">),</span> <span class="n">mean</span> <span class="o">=</span> <span class="mi">0</span><span class="p">,</span> <span class="n">stddev</span> <span class="o">=</span> <span class="mi">1</span><span class="p">,</span> <span class="n">dtype</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">float32</span><span class="p">)</span>
<span class="n">policy</span> <span class="o">=</span> <span class="n">policy_mu</span> <span class="o">+</span> <span class="n">policy_sigma</span> <span class="o">*</span> <span class="n">epsilon</span>
</code></pre></div></div>
<p>Now to get the neural network to work in a bounded space, we can clip outputs to be between -1 and 1. We simply change the last line of code in our network to:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="n">policy</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">clip_by_value</span><span class="p">(</span><span class="n">policy_mu</span> <span class="o">+</span> <span class="n">policy_sigma</span> <span class="o">*</span> <span class="n">epsilon</span><span class="p">,</span> <span class="n">clip_value_min</span> <span class="o">=</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="n">clip_value_max</span> <span class="o">=</span> <span class="mi">1</span><span class="p">)</span>
</code></pre></div></div>
<p>The resulting distribution is shown below:</p>
<p><img src="/images/clipped_posterior.gif" alt="Alt Text" /></p>
<p>There is one obvious flaw in this approach - all of the clipped samples pile up at exactly -1 or 1, which creates spikes at the bounds and a very unbalanced distribution. To fix this, we will sample \(\varepsilon\) from a truncated normal distribution.</p>
<h2 id="truncated-normal-solution">Truncated Normal Solution</h2>
<p>A truncated normal distribution is similar to a normal distribution, in that it is defined by a mean (\(\mu\)) and standard deviation (\(\sigma\)). However, the key distinction is that the distribution’s range is limited to be within a lower and upper bound. Typically the lower bound is denoted by \(a\) and the upper bound is denoted by \(b\), but I’m going to use \(L\) and \(U\) because I think it is easier to follow.</p>
<p>One might think that the bounds we define for the distribution should be the same as the bounds of our policy, but that won’t work if we want to use reparameterization. This is because the bounds apply to \(\varepsilon\) and not \(\pi\). Since we scale \(\varepsilon\) by \(\sigma\) and shift it by \(\mu\), applying bounds of -1 and 1 directly to \(\varepsilon\) can result in a \(\pi\) that extends beyond the bounds. To make this point more clear, let’s say we defined our bounds as \(-1 \leq \varepsilon \leq 1\), with \(\mu = 0.5 , \, \sigma = 1\). If we generate a sample \(\varepsilon = 0.9\), then after you apply the transformation \(\mu + \sigma \cdot \varepsilon\), you get \(\pi = 0.5 + 1 \cdot 0.9 = 1.4\), which is beyond the upper bound.</p>
<p>To generate the proper upper and lower bounds, we will use the equations below:</p>
\[L = \frac{-1 - \mu_{\theta}}{\sigma_{\theta}}\]
\[U = \frac{1 - \mu_{\theta}}{\sigma_{\theta}}\]
<p>Using our previous example, we find that \(U = 0.5\), which means that the largest \(\varepsilon\) we can sample is 0.5. Plugging this into our reparameterized equation, we see that the largest \(\pi\) we can generate is 1. Similarly, \(L = -1.5\), which means that the lowest \(\pi\) we can generate is -1. Perfect, we figured it out!</p>
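<p>A quick numeric check of that example in plain Python (using the same \(\mu = 0.5\) and \(\sigma = 1\) from above):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code> mu, sigma = 0.5, 1.0

 lower = (-1 - mu) / sigma   # -1.5
 upper = (1 - mu) / sigma    #  0.5

 print(mu + sigma * upper)   #  1.0 -> largest action we can generate
 print(mu + sigma * lower)   # -1.0 -> smallest action we can generate
</code></pre></div></div>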
<p>Given the PDF for a normal distribution:</p>
\[p(\varepsilon) = \frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{1}{2}\left(\frac{\varepsilon - \mu}{\sigma}\right)^2}\]
<p>We will let \(F(\varepsilon)\) denote our cumulative distribution function (CDF). Our truncated density now becomes:</p>
\[p(\varepsilon \mid L \leq \varepsilon \leq U) = \frac{p(\varepsilon)}{F(U) - F(L)} \, \, \text{for} \, L \leq \varepsilon \leq U\]
<p>The denominator, \(F(U) - F(L)\), is the normalizing constant that allows the truncated density to integrate to 1. The reason we do this is because, as shown below, we are only sampling from a portion of \(p(\varepsilon)\).</p>
<p style="text-align: center;"><img src="https://brandinho.github.io/images/truncated_distribution.png" alt="alt" /></p>
<p>You can import <code class="language-plaintext highlighter-rouge">scipy</code> and use the following snippet to generate samples from a truncated normal distribution (the Mu and Sigma values below are illustrative placeholders for the outputs of the network):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="kn">import</span> <span class="nn">scipy.stats</span> <span class="k">as</span> <span class="n">stats</span>
<span class="n">mu_dims</span> <span class="o">=</span> <span class="mi">3</span> <span class="c1"># Dimensionality of the Mu generated by the Neural Network
</span>
<span class="n">n_samples</span> <span class="o">=</span> <span class="mi">10000</span>
<span class="n">sn_mu</span> <span class="o">=</span> <span class="mi">0</span> <span class="c1"># Standard Normal Mu
</span> <span class="n">sn_sigma</span> <span class="o">=</span> <span class="mi">1</span> <span class="c1"># Standard Normal Sigma
</span>
<span class="n">generator</span> <span class="o">=</span> <span class="n">stats</span><span class="p">.</span><span class="n">truncnorm</span><span class="p">((</span><span class="n">lower_bound</span> <span class="o">-</span> <span class="n">sn_mu</span><span class="p">)</span> <span class="o">/</span> <span class="n">sn_sigma</span><span class="p">,</span> <span class="p">(</span><span class="n">upper_bound</span> <span class="o">-</span> <span class="n">sn_mu</span><span class="p">)</span> <span class="o">/</span> <span class="n">sn_sigma</span><span class="p">,</span> <span class="n">loc</span> <span class="o">=</span> <span class="n">sn_mu</span><span class="p">,</span> <span class="n">scale</span> <span class="o">=</span> <span class="n">sn_sigma</span><span class="p">)</span>
<span class="n">epsilons</span> <span class="o">=</span> <span class="n">generator</span><span class="p">.</span><span class="n">rvs</span><span class="p">([</span><span class="n">n_samples</span><span class="p">,</span> <span class="n">mu_dims</span><span class="p">])</span>
</code></pre></div></div>
<p><img src="/images/posterior.gif" alt="Alt Text" /></p>
<p>This distribution looks a lot nicer than both of the previous approaches, and has some nice properties:</p>
<ul>
<li>It only has one peak at all times</li>
<li>Outputs do not need to be clipped</li>
<li>The policy doesn’t look overly optimistic</li>
</ul>
<h2 id="concluding-remarks">Concluding Remarks</h2>
<p>In this post, we examined a few approaches to approximating a posterior distribution over our policy. Ultimately, we feel that using a neural network with a truncated normal policy is the best approach out of those examined. We learned how to reparameterize a truncated normal, which allows us to train the policy network using backpropagation.</p>
<h2 id="acknowledgments">Acknowledgments</h2>
<p>I would like to thank <a href="https://www.linkedin.com/in/alek-riley-609073110/">Alek Riley</a> for his feedback on how to improve the clarity of certain explanations.</p>
<div id="disqus_thread"></div>
<script>
/**
* RECOMMENDED CONFIGURATION VARIABLES: EDIT AND UNCOMMENT THE SECTION BELOW TO INSERT DYNAMIC VALUES FROM YOUR PLATFORM OR CMS.
* LEARN WHY DEFINING THESE VARIABLES IS IMPORTANT: https://disqus.com/admin/universalcode/#configuration-variables*/
/*
var disqus_config = function () {
this.page.url = https://brandinho.github.io; // Replace PAGE_URL with your page's canonical URL variable
this.page.identifier = /bayesian-policy; // Replace PAGE_IDENTIFIER with your page's unique identifier variable
};
*/
(function() { // DON'T EDIT BELOW THIS LINE
var d = document, s = d.createElement('script');
s.src = 'https://brandinho-github-io.disqus.com/embed.js';
s.setAttribute('data-timestamp', +new Date());
(d.head || d.body).appendChild(s);
})();
</script>
<noscript>Please enable JavaScript to view the <a href="https://disqus.com/?ref_noscript">comments powered by Disqus.</a></noscript>Brandon Da Silvabrandasilva9@gmail.comReinforcement Learning, Neural Networks, Bayesian