Can we use Q-value distributions to efficiently explore the state-action space?

Explore Exploit

The agent is pretrained and all of its progress is stored. Use the slider to choose the learned Q-values as of that episode in training. After a couple hundred episodes, the agent learns the optimal policy with certainty, which reduces exploration significantly. The full Q-value distributions for each state can be seen here.

Recent work by Dabney et al. suggests that the brain represents reward predictions as probability distributions Experiments were conducted on mice using single-unit recordings from the ventral tegmental area. . This contrasts against the widely adopted approach in reinforcement learning (RL) of modelling single scalar quantities (expected values). In fact, by using distributions we are able to quantify uncertainty in the decision-making process. Uncertainty is especially important in domains where making a mistake can result in the inability to recover Examples of such domains include autonomous vehicles, healthcare, and the financial markets. . Research in risk-aware reinforcement learning has emerged to address such problems . However, another important application of uncertainty, which we focus on in this article, is efficient exploration of the state-action space.


The purpose of this article is to clearly explain Q-Learning from the perspective of a Bayesian. As such, we use a small grid world and a simple extension of tabular Q-Learning to illustrate the fundamentals. Specifically, we show how to extend the deterministic Q-Learning algorithm to model the variance of Q-values with Bayes' rule. We focus on a sub-class of problems where it is reasonable to assume that Q-values are normally distributed and derive insights when this assumption holds true. Lastly, we demonstrate that applying Bayes' rule to update Q-values comes with a challenge: it is vulnerable to early exploitation of suboptimal policies.

This article is largely based on the seminal work from Dearden et al. . Specifically, we expand on the assumption that Q-values are normally distributed and evaluate various Bayesian exploration policies. One key distinction is that we model $$\mu$$ and $$\sigma^2$$, while the authors of the original Bayesian Q-Learning paper model a distribution over these parameters. This allows them to quantify uncertainty in their parameters as well as the expected return - we only focus on the latter.

Epistemic vs Aleatoric Uncertainty

Since Dearden et al. model a distribution over the parameters, they can sample from this distribution and the resulting dispersion in Q-values is known as epistemic uncertainty. Essentially, this uncertainty is representative of the "knowledge gap" that results from limited data (i.e. limited observations). If we close this gap, then we are left with irreducible uncertainty (i.e. inherent randomness in the environment), which is known as aleatoric uncertainty .

One can argue that the line between epistemic and aleatoric uncertainty is rather blurry. The information that you feed into your model will determine how much uncertainty can be reduced. The more information you incorporate about the underlying mechanics of how the environment operates (i.e. more features), the less aleatoric uncertainty there will be.

It is important to note that inductive bias also plays an important role in determining what is categorized as epistemic vs aleatoric uncertainty for your model.

Important Note about Our Simplified Approach:

Since we only use $$\sigma^2$$ to represent uncertainty, our approach does not distinguish between epistemic and aleatoric uncertainty. Given enough interactions, the agent will close the knowledge gap and $$\sigma^2$$ will only represent aleatoric uncertainty. However, the agent still uses this uncertainty to explore. This is problematic because the whole point of exploration is to gain knowledge, which indicates that we should only explore using epistemic uncertainty.

Since we are modelling $$\mu$$ and $$\sigma^2$$, we begin by evaluating the conditions under which it is appropriate to assume Q-values are normally distributed.

When Are Q-Values Normally Distributed?

The readers who are familiar with Q-Learning can skip over the collapsible box below.

Temporal Difference Learning

Temporal Difference (TD) learning is the dominant paradigm used to learn value functions in reinforcement learning . Below we will quickly summarize a TD learning algorithm for Q-values, which is called Q-Learning. First, we will write Q-values as follows :

\overbrace{Q_\pi(s,a)}^\text{current Q-value} = \overbrace{R_s^a}^\text{expected reward for (s,a)} + \overbrace{\gamma Q_\pi(s^{\prime},a^{\prime})}^\text{discounted Q-value at next timestep}

We will precisely define Q-value as the expected value of the total return from taking action $$a$$ in state $$s$$ and following policy $$\pi$$ thereafter. The part about $$\pi$$ is important because the agent's view on how good an action is depends on the actions it will take in subsequent states. We will discuss this further when analyzing our agent in the game environment.

For the Q-Learning algorithm, we sample a reward $$r$$ from the environment, and estimate the Q-value for the current state-action pair $$q(s,a)$$ and the next state-action pair $$q(s^{\prime},a^{\prime})$$ For Q-Learning, the next action $$a^{\prime}$$ is the action with the largest Q-value in that state: $$\max_{a^{\prime}} q(s^{\prime}, a^{\prime})$$. . We can represent the sample as:

q(s,a) = r + \gamma q{(s^\prime,a^\prime)}

The important thing to realize is that the left side of the equation is an estimate (current Q-value), and the right side of the equation is a combination of information gathered from the environment (the sampled reward) and another estimate (next Q-value). Since the right side of the equation contains more information about the true Q-value than the left side, we want to move the value of the left side closer to that of the right side. We accomplish this by minimizing the squared Temporal Difference error ($$\delta^2_{TD}$$), where $$\delta_{TD}$$ is defined as:

\delta_{TD} = r + \gamma q(s^\prime,a^\prime) - q(s,a)

The way we do this in a tabular environment, where $$\alpha$$ is the learning rate, is with the following update rule:

q(s,a) \leftarrow \alpha(r_{t+1} + \gamma q(s^\prime,a^\prime)) + (1 - \alpha) q(s,a)

Updating in this manner is called bootstrapping because we are using one Q-value to update another Q-value.

We will use the Central Limit Theorem (CLT) as the foundation to understand when Q-values are normally distributed. Since Q-values are sample sums, then they should look more and more normally distributed as the sample size increases . However, the first nuance that we will point out is that rewards must be sampled from distributions with finite variance. Thus, if rewards are sampled distributions such as Cauchy or Lévy, then we cannot assume Q-values are normally distributed.

Otherwise, Q-values are approximately normally distributed when the number of effective timesteps $$\widetilde{N}$$ is large We can think of effective timesteps as the number of full samples. . This metric is comprised of three factors:

We combine the factors above to formally define the number of effective timesteps: \widetilde{N} = \frac{1}{\xi + 1}\sum_{i=0}^{N-1}\gamma^{i}

Below we visually demonstrate how each factor affects the normality of Q-values We scale the Q-values by $$\widetilde{N}$$ because otherwise the distribution of Q-values moves farther and farther to the right as the number of effective timesteps increases, which distorts the visual. :

Select whether the underlying distribution follows a skew-normal or a Bernoulli distribution. In the Google Colab notebook we also include three statistical tests of normality for the Q-value distribution.

Experiment in a Notebook

There is a caveat in the visual analysis above for environments that have a terminal state. As the agent moves closer to the terminal state, then $$N$$ will progressively get smaller and Q-values will look less normally distributed. Nonetheless, it is reasonable to assume that Q-values are approximately normally distributed for most states in dense reward environments if we use a large $$\gamma$$.

Bayesian Interpretation

We preface this section by noting that the following interpretations are only theoretically justified when we assume Q-values are normally distributed. We begin by defining the general update rule using Bayes' Theorem:

\text{posterior} \propto \text{likelihood} \times \text{prior}

When using Gaussians, we have an analytical solution for the posterior A Gaussian is conjugate to itself, which simplifies the Bayesian updating process significantly; instead of computing integrals for the posterior, we have closed-form expressions . :

\mu = \frac{\sigma^2_1}{\sigma^2_1 + \sigma^2_2}\mu_2 + \frac{\sigma^2_2}{\sigma^2_1 + \sigma^2_2}\mu_1 \sigma^2 = \frac{\sigma^2_1\sigma^2_2}{\sigma^2_1 + \sigma^2_2}

By looking at a color-coded comparison, we can see that deterministic Q-Learning is equivalent to updating the mean using Bayes' rule:

\begin{aligned} &\color{green}\mu& &\color{black}=& &\color{orange}\frac{\sigma^2_1}{\sigma^2_1 + \sigma^2_2}& &\color{red}\mu_2& &\color{black}+& &\color{purple}\frac{\sigma^2_2}{\sigma^2_1 + \sigma^2_2}& &\color{blue}\mu_1& \\ \\ &\color{green}q(s,a)& &\color{black}=& &\color{orange}\alpha& &\color{red}(r_{t+1} + \gamma q(s^\prime,a^\prime))& &\color{black}+& &\color{purple}(1 - \alpha)& &\color{blue}q(s,a)& \end{aligned}

What does this tell us about the deterministic implementation of Q-Learning, where $$\alpha$$ is a hyperparameter? Since we don't model the variance of Q-values in deterministic Q-Learning, $$\alpha$$ does not explicitly depend on the certainty in Q-values. Instead, we can interpret $$\alpha$$ as being the ratio of how implicitly certain the agent is in its prior, $$q(s,a)$$, relative to the likelihood, $$r + \gamma q(s^\prime,a^\prime)$$ Our measurement is $$r + \gamma q(s^\prime,a^\prime)$$ since $$r$$ is information given to us directly from the environment. We represent our likelihood as the distribution over this measurement: $$\mathcal{N}\left(\mu_{r + \gamma q(s^\prime,a^\prime)}, \sigma^2_{r + \gamma q(s^\prime,a^\prime)}\right)$$. . For deterministic Q-Learning, this ratio is generally constant and the uncertainty in $$q(s,a)$$ does not change as we get more information.

What happens "under the hood" if we keep $$\alpha$$ constant? Right before the posterior from the previous timestep becomes the prior for the current timestep, we increase the variance by $$\sigma^2_{\text{prior}_{(t-1)}} * \alpha$$ When $$\alpha$$ is held constant, the variance of the prior implicitly undergoes the following transformation: $$\sigma^2_{\text{prior}_{(t)}} = \sigma^2_{\text{posterior}_{(t-1)}} + \sigma^2_{\text{prior}_{(t-1)}} * \alpha$$.

Let us first state that $$\alpha = \frac{\sigma^2_\text{prior}}{\sigma^2_\text{prior} + \sigma^2_\text{likelihood}}$$, which can be deduced from the color-coded comparison in the main text.
Given the update rule $$ \sigma^2_{\text{posterior}_{(t)}} = \frac{\sigma^2_{\text{prior}_{(t)}} \times \sigma^2_{\text{likelihood}_{(t)}}}{\sigma^2_{\text{prior}_{(t)}} + \sigma^2_{\text{likelihood}_{(t)}}} $$, we know that $$\sigma^2_{\text{posterior}_{(t)}} \lt \sigma^2_{\text{prior}_{(t)}}$$
We also know that the update rule works in such a way that $$\sigma^2_{\text{prior}_{(t)}} = \sigma^2_{\text{posterior}_{(t-1)}}$$
Therefore, we can state that $$\sigma^2_{\text{prior}_{(t)}} \lt \sigma^2_{\text{prior}_{(t-1)}}$$ if we assume $$\sigma^2_\text{likelihood}$$ does not change over time. This means that $$\alpha_{(t)} \neq \alpha_{(t-1)}$$
In order to make $$\alpha_{(t)} = \alpha_{(t-1)}$$, we need to increase $$\sigma^2_{\text{posterior}_{(t-1)}}$$ before it becomes $$\sigma^2_{\text{prior}_{(t)}}$$. We solve for this amount below:

$$ \begin{aligned} \sigma^2_{\text{posterior}_{(t-1)}} + X &= \sigma^2_{\text{prior}_{(t-1)}} \\ \frac{\sigma^2_{\text{prior}_{(t-1)}} \times \sigma^2_\text{likelihood}}{\sigma^2_{\text{prior}_{(t-1)}} + \sigma^2_{likelihood}} + X &= \sigma^2_{\text{prior}_{(t-1)}} \\ X &= \sigma^2_{\text{prior}_{(t-1)}} \left(1 - \frac{\sigma^2_\text{likelihood}}{\sigma^2_{\text{prior}_{(t-1)}} + \sigma^2_\text{likelihood}} \right) \\ X &= \sigma^2_{\text{prior}_{(t-1)}} * \alpha \end{aligned} $$
. This keeps the uncertainty ratio between the likelihood and the prior constant An alternative interpretation is that the variance for the prior and likelihood are both decreasing in such a way that keeps the ratio between them constant. However, we do not think it is reasonable to assume that the variance of the sampled reward would constantly decrease as the agent becomes more certain in its prior. . Below we visualize this interpretation by comparing the "regular" Bayesian update to the constant $$\alpha$$ update:

Click the right arrow to calculate the posterior given the prior and likelihood. Click the right arrow a second time to see the previous posterior transform into the new prior for the next posterior update. Use the slider to select different values for the starting $$\alpha$$. NOTE: Larger starting values of $$\alpha$$ make the distinction visually clear.

Now that we know what happens under the hood when we hold $$\alpha$$ constant, it is worth noting that not everyone holds it constant. In practice, researchers also decay $$\alpha$$ for the agent to rely less on new information (implicitly becoming more certain) for each subsequent timestep . Although deterministic Q-Learning largely depends on heuristics to create a decay schedule, Bayesian Q-Learning has it built in:

\alpha = \frac{\sigma^2_{q(s,a)}}{\sigma^2_{q(s,a)} + \sigma^2_{r + \gamma q(s^\prime,a^\prime)}}

As our agent updates its belief about the world it will naturally create a decay schedule that corresponds to how certain it is in its prior. As uncertainty decreases, so does the learning rate. Note that the learning rate is bespoke for each state-action pair because it is possible to become more confident in specific state-action pairs faster than others Some reasons include visiting those state-action pairs more often than others, or simply because they are inherently less noisy. .


Exploration Policies

There are many ways we can use a distribution over Q-values to explore as an alternative to the $$\varepsilon$$-greedy approach. Below we outline a few, and evaluate each in the final section of this article.

Below we visualize the different exploration policies in action:

The circles represent the evaluation criteria for the agent's actions. The agent chooses the action with the circle that is farthest to the right. For epsilon-greedy, we use $$\varepsilon = 0.1$$. The "sample" button only appears for stochastic exploration policies.

By interacting with the visual above, one might wonder if we can infer what the "exploration parameter" is for the other stochastic policy, Q-value sampling, which does not explicitly define $$\varepsilon$$. We explore this question in the next section.

Implicit $$\varepsilon$$

In contrast to deterministic Q-Learning, where we explicitly define $$\varepsilon$$ as the exploration hyperparameter, when we use Q-value sampling there is an implicit epsilon $$\hat{\varepsilon}$$. Before defining $$\hat{\varepsilon}$$, we will get some notation out of the way. Let's define two probability distributions, $$x_1 \sim \mathcal{N}(\mu_1, \sigma^2_1)$$ and $$x_2 \sim \mathcal{N}(\mu_2, \sigma^2_2)$$. To calculate the probability that we sample a value $$x_1 \gt x_2$$, we can use the following equation, where $$\Phi$$ represents the cumulative distribution function :

\begin{aligned} &\mu = \mu_1 - \mu_2 \\ &\sigma = \sqrt{\sigma^2_1 + \sigma^2_2} \\ &Pr(x_1 \gt x_2) = 1 - \Phi\left(\frac{-\mu}{\sigma}\right) \end{aligned}

With this equation, we can now calculate the probability of sampling a larger Q-value for a reference action $$\hat{a}$$ relative to another action. If we do this for each action that an agent can make (excluding the reference action) and calculate the joint probability, then we get the probability that the sampled Q-value for $$\hat{a}$$ is larger than all other actions In a given state, the Q-value for one action should be independent of the other Q-values in that state. This is because you can only take one action at a time, and we generally apply Q-learning to MDPs, where the Markov property holds (i.e. history does not matter). Thus, to calculate the joint probability, it is simply a multiplication of the marginal probabilities. :

\bar{P}_{\hat{a}} = \prod_{a}^{\mathcal{A}}Pr(x_{\hat{a}} \gt x_a), \quad \text{for} \,\, a \neq \hat{a}

We then find the action with the largest $$\bar{P}_{a}$$ because that is the action that we would select if we were not exploring Since we're using normal distributions, $$\text{arg}\max{\bar{P}_{a}}$$ happens to correspond to the Q-value with the largest mean. .

a_{max} = \text{arg}\max{\bar{P}_{a}}, \quad \forall \,\, a \in \mathcal{A}

Then, if we sum up the probabilities of sampling the largest Q-value, for all actions except the exploitation action, then we get the probability that we will explore:

\hat{\varepsilon} = \frac{1}{C}\sum_{a}^{\mathcal{A}}\bar{P}_{a}, \quad \text{for} \,\, a \neq a_{max}

Where $$C$$ is the normalizing constant (sum of all $$\bar{P}_{a}$$)

Applying Bayes' Rule

We will now put the theory into practice! By inspecting the learning process, we can see that there is a key challenge in applying Bayes' rule to Q-Learning. Specifically, we focus on diverging Q-value distributions, which can cause agents to become confident in suboptimal policies.

Game Setup

As researchers in the financial markets, we designed the environment after a sub-class of problems that share similar characteristics. These problems are characterized by environments that give a reward at every timestep, where the mean and variance of the rewards depends on the state that the agent is in This is equivalent to the return received for any trade/investment, where the expected return and volatility depends on the market regime. . To achieve this, we use a modified version of the Cliff World environment :

From any state in the grid the agent can take one of the following actions: $$[\text{Left, Right, Up, Down}]$$. If the agent is on the outer edge of the grid and moves towards the edge, then the agent stays in the same place (imagine running into a wall).

Analyzing the Learned Distributions

Below we show the Q-value distributions learned by our agent for each state-action pair. We use an arrow to highlight the learned policy.

Hover your mouse above the positions on the grid to see the Q-value distributions for each state-action pair. The distributions are colored with a red-white-green gradient (ranging from -50 to 50).

By hovering our mouse over the path, we realize that the agent does not learn the "true" Q-value distribution for all state-action pairs. Only the pairs that guide it through the path seem to be accurate. This happens because the agent stops exploring once it thinks it has found the optimal policy Even if agents do not learn the true Q-values, they can still learn the optimal policy if they learn the relative value of actions in a state. The relative value of actions is referred to as the advantage . . Below we see that learning plateaus once exploration stops:

Click on a state (square on grid) and action (arrow) to see the learning progress for that state-action pair.

One thing that always happens when using Bayes' rule (after enough episodes) is that the agent finds its way to the goal without falling off the cliff. However, it does not always find the optimal path. Below we color states according to how often they are visited during training - darker shades represent higher visitation rates. We see that state visitations outside of the goal trajectory are almost non-existent because the agent becomes anchored to the path that leads it to the goal.

Let's dig into the exact state that is responsible for the agent either finding the optimal policy or not. We will call this the "critical state" and highlight it with a star in the figure above. When analyzing what happens during training, we see that the cause of the problem is that the Q-value distributions diverge. We will use Q-Value sampling for the following analysis. Since the agent explores via Q-Value sampling, once the density of the joint distribution approaches 0, the agent will always sample a higher Q-value from one distribution relative to the other. Thus, it will never take the action from the Q-value distribution with a lower mean. Let's look at a visual illustration of this concept:

We will represent the distribution that we toggle as $$x_1$$ and the static distribution as $$x_2$$. The first bar represents $$Pr(x_1 \gt x_2)$$ and the second bar represents $$\hat{\varepsilon}$$. When visualized, it is clear that $$\hat{\varepsilon}$$ is just the overlapping area under the two distributions The agent only explores when there is a possibility of sampling a higher value from either distribution, which is only the case when there is a decent amount of overlap between the distributions. . Let us now inspect the learning progress at the critical state:

Optimal Suboptimal

Whether the agent finds the optimal policy or the suboptimal policy, we notice that exploration stops as soon as the Q-values diverge far enough. This can be seen as the training progress flat lines for the action with a lower mean. Therefore, a risk in applying Bayes' rule to Q-learning is that the agent does not explore the optimal path before the distributions diverge.

Impact of Policy on Perception

We will use the agent that learned the suboptimal policy for a quick experiment. At the critical state, we know that the Q-value distributions diverge in such a way that the agent will never sample a Q-value for $$\text{Down}$$ that is higher than $$\text{Right}$$, and thus it will never move down. However, what if we force the agent to move down and see what it does from that point on? Test it out below:

Click on one of the arrows (actions) and see the path the agent goes on after it takes that action. We run 10 paths each run.

By forcing the agent to move down, we realize that there are times when it goes around the danger zone to the goal. We will explain what is happening with an analogy:

Imagine getting into a car accident at intersection X when you are learning to drive. You will associate that intersection with a bad outcome (low Q-value) and take a detour going forward. Overtime you will get better at driving (policy improvement) and if you accidentally end up at intersection X, you will do just fine. The problem is that you never revisit intersection X because it is hard to decouple the bad memory from the fact that you were a bad driver at the time.

This problem is highlighted in one of David Silver's lectures, where he states that although Thompson sampling (Q-value sampling in our case) is great for bandit problems, it does not deal with sequential information well in the full MDP case . It only evaluates the Q-value distribution using the current policy and does not take into account the fact that the policy can improve. We will see the consequence of this in the next section.


To evaluate the exploration policies previously discussed, we compare the cumulative regret for each approach in our game environment. Regret is the difference between the return obtained from following the optimal policy compared to the actual policy that the agent followed If the agent follows the optimal policy, then it will have a regret of $$0$$. .

Median Median with Range
Click on the legend elements to add/remove them from the graph. The range was generated with 50 initializations. Play around with the hyperparameters for any of the benchmarks in the Google Colab notebook.
Experiment in a Notebook

Although experiments in our game environment suggest that Bayesian exploration policies explore more efficiently on average, there seems to be a wider range of outcomes. Additionally, given our analysis on diverging Q-value distributions, we know that there are times when Bayesian agents can become anchored to suboptimal policies. When this happens, the cumulative regret looks like a diagonal line $$\nearrow$$, which can be seen protruding from the range of outcomes.

In conclusion, while Bayesian Q-Learning sounds great theoretically, it can be challenging to apply in actual environments. This challenge only gets harder as we move to more realistic environments with larger state-action spaces. Nonetheless, we believe modelling distributions over value functions is an exciting area of research and has the ability to achieve state-of-the-art (SOTA) results, as demonstrated in some related works on distributional RL.

Related Work

Although we focus on modelling Q-value distributions in a tabular setting, a lot of exciting research has gone into using function approximations to model these distributions . More recently, a series of distributional RL papers using deep neural networks have emerged achieving SOTA results in Atari-57. The first of such papers introduced the categorical DQN (C51) architecture as a way to discretize Q-values into bins and then assign a probability to each bin .

One of the weaknesses in C51 is the discretization of Q-values as well as the fact that you have to specify a minimum and maximum value. To overcome these weaknesses, work has been done to "transpose" the problem with quantile regression . With C51 they adjust the probability for each Q-value range, but with quantile regression they adjust the Q-values for each probability range A probability range is more formally known as a quantile - hence the name "quantile regression". . Following this research, the implicit quantile network (IQN) was introduced to learn the full quantile function as opposed to learning a discrete set of quantiles . The current SOTA improves on IQN by fully parameterizing the quantile function; both the quantile fractions and the quantile values are parameterized .

Others specifically focus on modelling value distributions for efficient exploration . Osband et al. also focus on efficient exploration, but in contrast to other distributional RL approaches, they use randomized value functions to approximately sample from the posterior . Another interesting approach for exploration uses the uncertainty Bellman equation to propagate uncertainty across multiple timesteps .


I would like to thank Wei Xie, Sylvie Shang Shi, and Paola Soto for their constructive criticism on the flow and clarity of this article. I would like to thank all of the authors on Distill because their amazing papers inspired me to learn javascript and create this article. Lastly, I would like to thank the Distill team for open sourcing their technology which greatly improved the aesthetics of this article.


For attribution in academic contexts, please cite this work as

Da Silva, "A Bayesian Perspective on Q-Learning", 2020.

BibTeX citation

    author = {Da Silva, Brandon},
    title = {A Bayesian Perspective on Q-Learning},
    year = {2020},
    note = {},