Extensive Math Warning!


TreeQN AND ATreeC

n-step Q-Learning with synchronous environment threads.

Starting at a timestep $t$, we roll forward $n_{env} = 16$ threads for $n=5$ timesteps each. We then bootstrap off the final states only and gather all $n_{env} \times n = 80$ transitions in a single batch for the backward pass, minimising the loss:

\[\mathcal{L}_{\text{nstep-Q}} = \sum_\text{envs} \sum_{j=1}^n \left ( \sum_{k=1}^j [\gamma^{j-k}r_{t+n-k}] + \gamma^j \max_{a'} Q(s_{t+n}, a' , \theta^{-}) - Q (s_{t+n-j}, a_{t+n-j},\theta)\right)^2.\]

If the episode terminates we use the remaining episode return as the target, without bootstrapping.

[Figure: n-step Q-learning]
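As a rough sketch (not the paper's code), here is how the n-step targets and loss for a single environment thread might look in PyTorch; `q_net` and `target_net` are assumed to be any modules mapping a batch of observations to one Q-value per action:

```python
import torch

def nstep_q_loss(q_net, target_net, obs, actions, rewards, final_obs, done, gamma=0.99):
    """n-step Q-learning loss for one environment thread (illustrative sketch).

    obs:       (n, ...) observations s_{t}, ..., s_{t+n-1}
    actions:   (n,) long tensor of actions taken
    rewards:   (n,) rewards received
    final_obs: the bootstrap state s_{t+n}
    done:      True if the episode terminated before s_{t+n}
    """
    n = rewards.shape[0]
    # Bootstrap value: max_a' Q(s_{t+n}, a'; theta^-), or 0 if the episode ended.
    with torch.no_grad():
        if done:
            bootstrap = torch.zeros(())
        else:
            bootstrap = target_net(final_obs.unsqueeze(0)).max(dim=1).values.squeeze(0)

    # Work backwards: R_j = r_j + gamma * R_{j+1}, starting from the bootstrap value.
    returns = torch.zeros(n)
    running = bootstrap
    for i in reversed(range(n)):
        running = rewards[i] + gamma * running
        returns[i] = running

    # Q(s_{t+n-j}, a_{t+n-j}; theta) for the actions actually taken.
    q_taken = q_net(obs).gather(1, actions.unsqueeze(1)).squeeze(1)
    return ((returns - q_taken) ** 2).sum()
```

The batched version simply gathers the $n_{env} \times n$ transitions from all threads and sums their losses before the backward pass.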

The algorithm’s actor-critic counterpart is A2C, a synchronous variant of A3C in which a policy $\pi$ and state-value function $V(s)$ are trained using the gradient:

\[\begin{aligned} \Delta\theta = \sum_\text{envs} \sum_{j=1}^n &\nabla_{\theta_\pi} \log \pi (a_{t+n-j} \mid s_{t+n-j}) A_j (s_{t+n-j},a_{t+n-j}) + \beta \nabla_{\theta_\pi} H(\pi(s_{t+n-j})) \\ &+ \alpha \nabla_{\theta_V} A_j(s_{t+n-j},a_{t+n-j})^2 \end{aligned}\]

where $A_j$ is an advantage estimate given by $\sum_{k=1}^j \gamma^{j-k} r_{t+n-k} + \gamma^j V(s_{t+n}) - V(s_{t+n-j})$, $H$ is the policy entropy, $\beta$ is a hyperparameter tuning the degree of entropy regularization, and $\alpha$ is a hyperparameter controlling the relative learning rates of actor and critic.

This is just a way of applying the A3C gradient update rules:

\[\begin{aligned} \theta' &\colon d \theta \leftarrow d \theta + \nabla_{\theta'} \log \pi(a_i \mid s_i ; \theta')(R - V(s_i;\theta_v')) \\ \theta_v' &\colon d \theta_v \leftarrow d \theta_v + \partial(R - V(s_i;\theta_v'))^2 / \partial \theta_v' \end{aligned}\]

in the n-step, synchronous setting.
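For concreteness, here is a minimal PyTorch sketch of an A2C loss implementing these updates (the advantage is detached in the policy term, and the critic term is written as a loss to be minimized); `policy_value_net` is an assumed module returning action logits and state values:

```python
import torch
import torch.nn.functional as F

def a2c_loss(policy_value_net, obs, actions, returns, beta=0.01, alpha=0.5):
    """A2C loss for a batch of transitions (flattened over envs and time steps).

    obs:     (B, ...) observations
    actions: (B,) long tensor of actions taken
    returns: (B,) n-step returns bootstrapped with V(s_{t+n}), computed without gradient
    """
    logits, values = policy_value_net(obs)             # (B, |A|), (B,)
    log_probs = F.log_softmax(logits, dim=1)
    probs = log_probs.exp()

    advantages = returns - values                       # A_j
    # Policy gradient term: the advantage is treated as a constant here.
    policy_loss = -(log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
                    * advantages.detach()).sum()
    # Entropy bonus encourages exploration.
    entropy = -(probs * log_probs).sum(dim=1).sum()
    # Critic regression on the squared advantage.
    value_loss = advantages.pow(2).sum()

    return policy_loss - beta * entropy + alpha * value_loss
```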

Anyway, these algorithms were chosen for their simplicity and reasonable wallclock speeds, but TreeQN can also be used with other algorithms, as described in Section 3 of the paper. The paper's experiments use the OpenAI Baselines implementations.

DQN can be represented as follows:

[Figure: DQN architecture]

The canonical neural network architecture in deep RL with visual observations has a series of convolutional layers followed by two fully connected layers, where the final layer produces one output for each action-value. We can think of this network as first calculating an encoding $z_t$ of the state $s_t$ which is then evaluated by the final layer to estimate $Q^* (s_t, a_t)$.
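A minimal PyTorch sketch of this canonical architecture (the layer sizes are the usual Atari-DQN choices, not necessarily what the paper uses):

```python
import torch.nn as nn

class DQN(nn.Module):
    """Canonical DQN: conv encoder -> fully connected layers -> one Q-value per action."""

    def __init__(self, in_channels, num_actions, embedding_dim=512):
        super().__init__()
        # encode(s_t): convolutional stack producing the state embedding z_t.
        self.encoder = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
            nn.LazyLinear(embedding_dim), nn.ReLU(),   # input size inferred from the observation shape
        )
        # Final layer evaluates z_t to estimate Q*(s_t, a) for every action.
        self.q_head = nn.Linear(embedding_dim, num_actions)

    def forward(self, x):
        z = self.encoder(x)
        return self.q_head(z)
```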

In tree-search on-line planning, a look-ahead tree of possible future states is constructed by recursively applying an environment model. These states are typically evaluated by a heuristic, a learned value function, or Monte-Carlo rollouts. Backups through the tree aggregate these values along with the immediate rewards accumulated along each path to estimate the value of taking an action in the current state. The paper focuses on a simple tree search with a deterministic transition function and no value uncertainty estimates, but the approach can be extended to other tree-search variants like UCT.

TreeQN

TreeQN is an end-to-end differentiable tree-structured architecture for deep reinforcement learning.

TreeQN uses a recursive tree-structured neural network between the encoded state $z_t$ and the predicted state-action values $Q(s_t, a)$, instead of directly estimating the state-action value from the current encoded state $z_t$ using fully connected layers as in DQN. Specifically, TreeQN uses a recursive model to refine its estimate of $Q(s_t, a)$ via learned transition, reward, and value functions, and a tree backup. Because these learned components are shared throughout the tree, TreeQN implements an inductive bias, missing from DQN, that reflects the prior knowledge that the Q-values are properties of a stationary Markov process.

Specifically, TreeQN learns an action-dependent transition function that, given a state representation $z_{l|t}$, predicts the next state representation $z_{l+1|t}^{a_i}$ for action $a_i \in \mathcal{A}$, and the corresponding reward ${\hat r}_{l|t}^{a_i}$. To make the distinction between internal planning steps and steps taken in the environment explicit, we write $z_{l|t}$ to denote the encoded state at time $t$ after $l$ internal transitions, starting with $z_{0|t}$ for the encoding of $s_t$. TreeQN applies this transition function recursively to construct a tree containing the state representations and rewards received for all possible sequences of actions up to some predefined depth $d$.
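A sketch of how this recursive expansion could look, assuming a `transition(z)` that returns per-action next-state embeddings of shape `(B, |A|, k)` and a `reward(z)` that returns per-action rewards, both along the lines of the modules sketched under Model Components below:

```python
def expand_tree(z, transition, reward, depth):
    """Recursively expand all action sequences up to `depth` from embedding z.

    z: (B, k) batch of state embeddings z_{l|t}.
    Returns, per action, a tuple (predicted reward, next embedding, subtree).
    """
    if depth == 0:
        return []
    rewards = reward(z)        # (B, |A|) predicted immediate rewards r_hat_{l|t}^{a_i}
    next_zs = transition(z)    # (B, |A|, k) predicted next embeddings z_{l+1|t}^{a_i}
    tree = []
    for a in range(next_zs.shape[1]):
        child_z = next_zs[:, a]
        tree.append((rewards[:, a], child_z,
                     expand_tree(child_z, transition, reward, depth - 1)))
    return tree
```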

[Figure: TreeQN architecture — encoding, tree transitioning, value prediction & backup]

The value of each predicted state $V(z)$ is estimated with a value function module. Using these values and the predicted rewards, TreeQN then performs a tree backup, mixing the $k$-step returns along each path in the tree using $TD(\lambda)$. This corresponds to “Value Prediction & Backup” in the above diagram, and can be formalized as

\[\begin{aligned} Q^l(z_{l|t}, a_i) &= r(z_{l|t}, a_i) + \gamma V^{(\lambda)} (z_{l+1|t}^{a_i}) \\ V^{(\lambda)}(z_{l|t}) &= \begin{cases}V(z_{l|t})& l = d \\ (1-\lambda) V(z_{l|t}) + \lambda \, \text{b}\big(Q^{l}(z_{l|t},a_j)\big)& l < d\end{cases} \end{aligned}\]

where $\text{b}$ is a function to recursively perform the backup. For $0<\lambda<1$, value estimates of the intermediate states are mixed into the final Q-estimate, which encourages the intermediate nodes of the tree to correspond to meaningful states, and reduces the impact of outlier values.
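A sketch of this recursive backup over the tree built by the expansion sketch above, assuming a `value_fn` giving $V(z)$ and a `backup_fn` implementing $\text{b}$ (e.g. the softmax-weighted sum sketched later); the default $\gamma$ and $\lambda$ here are illustrative, not the paper's settings:

```python
import torch

def tree_backup(tree, value_fn, backup_fn, gamma=0.99, lam=0.8):
    """Compute the backed-up Q^l(z_{l|t}, a_i) for every action at a tree node.

    tree: list over actions of (reward, child_z, child_subtree).
    Returns a (B, |A|) tensor of Q-values; calling this on the root gives Q(s_t, a).
    """
    q_values = []
    for r, child_z, subtree in tree:
        v = value_fn(child_z)                                   # V(z_{l+1|t}^{a_i})
        if subtree:                                             # interior node (l + 1 < d)
            child_q = tree_backup(subtree, value_fn, backup_fn, gamma, lam)
            # V^(lambda): mix the node's own value with the backed-up child Q-values.
            v = (1 - lam) * v + lam * backup_fn(child_q)
        q_values.append(r + gamma * v)                          # Q^l = r + gamma * V^(lambda)
    return torch.stack(q_values, dim=1)                         # (B, |A|)
```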

When $\lambda=1$ and $\text{b}$ is the standard hard $\max$ function, the above simplifies to a backup through the tree using the familiar Bellman equation:

\[Q(z_{l|t}, a_i) = r(z_{l|t},a_i) + \begin{cases}\gamma V(z_{d|t}^{a_i}) & l = d-1 \\ \gamma \max_{a_j} Q(z_{l+1|t}^{a_i},a_j) & l < d -1 \end{cases}\]

Crucially, during training we backpropagate all the way from the final $Q$-estimate, through the value prediction, tree transitioning, and encoding layers of the tree, i.e., the entire network. Learning these components jointly ensures that they are useful for planning on-line.

Model Components

TreeQN’s components are described in more detail below.

Encoder function. As in DQN, a series of convolutional layers produces an embedding of the observed state, $z_{0|t} = \text{encode}(s_t)$.

Transition function. We first apply a single fully connected layer to the current state embedding, shared by all actions. This generates an intermediate representation $(z_{l+1|t}^\text{env})$ that could carry information about action-agnostic changes to the environment. In addition, we use a fully connected layer per action, which is applied to the intermediate representation to calculate a next-state representation that carries information about the effect of taking action $a_i$. We use residual connections for these layers:

\[\begin{aligned} z_{l+1|t}^\text{env} &= z_{l|t} + \tanh (\mathbf{W}^\text{env} z_{l|t} + \mathbf{b}^\text{env}), \\ z_{l+1|t}^{a_i} &= z_{l+1|t} ^ \text{env} + \tanh(\mathbf{W}^{a_i} z_{l+1|t}^\text{env}), \end{aligned}\]

where $\mathbf{W}^{a_i},\mathbf{W}^\text{env} \in \mathbb{R}^{k \times k}, \mathbf{b}^\text{env} \in \mathbb{R}^k$ are learnable parameters. Note that the next-state representation is calculated for every action $a_i$ independently using the respective transition matrix $\mathbf{W}^{a_i}$, but this transition function is shared for the same action throughout the tree.

A caveat is that the model can still learn to use different parts of the latent state space in different parts of the tree, which could undermine the intended parameter sharing in the model structure. To help TreeQN learn useful transition functions that maintain quality and diversity in their latent states, we introduce a unit-length projection of the state representations by simply dividing a state’s vector representation by its L2 norm before each application of the transition function, $z_{l|t} = z_{l|t} / \| z_{l|t} \|$. This prevents the magnitude of the representation from growing or shrinking, which encourages the behaviour of the transition function to be more consistent throughout the tree.
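A possible PyTorch version of this transition function, with the residual connections and the unit-length projection (module and parameter names are my own):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TransitionFunction(nn.Module):
    """Shared, action-dependent transition: z_{l|t} -> z_{l+1|t}^{a_i} for every action."""

    def __init__(self, embedding_dim, num_actions):
        super().__init__()
        # Action-agnostic step, shared by all actions (W^env, b^env).
        self.env_fc = nn.Linear(embedding_dim, embedding_dim)
        # One transition matrix W^{a_i} per action, shared at every depth of the tree.
        self.action_fcs = nn.ModuleList(
            [nn.Linear(embedding_dim, embedding_dim, bias=False) for _ in range(num_actions)]
        )

    def forward(self, z):
        # Unit-length projection keeps the magnitude of the embedding constant.
        z = F.normalize(z, p=2, dim=-1)
        # Residual action-agnostic update.
        z_env = z + torch.tanh(self.env_fc(z))
        # Residual per-action updates, stacked along a new action dimension.
        next_z = [z_env + torch.tanh(fc(z_env)) for fc in self.action_fcs]
        return torch.stack(next_z, dim=1)           # (B, |A|, k)
```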

Reward function. In addition to predicting the next state, we also predict the immediate reward for every action $a_i \in \mathcal{A}$ in state $z_{l|t}$ using

\[\hat{\mathbf{r}}(z_{l|t}) = \mathbf{W}_2^r \text{ReLU}(\mathbf{W}_1^r z_{l|t} + \mathbf{b}_1^r) + \mathbf{b}_2^r,\]

where $\mathbf{W}_1^r \in \mathbb{R}^{m \times k}$, $\mathbf{W}_2^r \in \mathbb{R}^{|\mathcal{A}| \times m}$, and $\text{ReLU}$ is the rectified linear unit. The predicted reward for a particular action, $\hat{r}_{l|t}^{a_{i}}$, is the $i$-th element of the vector $\hat{\mathbf{r}}(z_{l|t})$.
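A corresponding sketch of the reward head (the hidden size $m$ below is a placeholder, not the paper's value):

```python
import torch.nn as nn

class RewardFunction(nn.Module):
    """Predicts the immediate reward for every action from a state embedding."""

    def __init__(self, embedding_dim, num_actions, hidden_dim=64):
        super().__init__()
        # Two-layer MLP: W_2^r ReLU(W_1^r z + b_1^r) + b_2^r.  hidden_dim plays the role of m.
        self.net = nn.Sequential(
            nn.Linear(embedding_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, num_actions),
        )

    def forward(self, z):
        return self.net(z)                           # (B, |A|), one reward per action
```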

Value function. The value of a state representation $z$ is estimated as

\[V(\mathbf{z}) = \mathbf{w}^\top \mathbf{z} + b,\]

where $\mathbf{w} \in \mathbb{R}^k$.
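As a sketch, the value head is just a linear layer on the embedding (again, the module name is my own):

```python
import torch.nn as nn

class ValueFunction(nn.Module):
    """Linear value head: V(z) = w^T z + b."""

    def __init__(self, embedding_dim):
        super().__init__()
        self.fc = nn.Linear(embedding_dim, 1)

    def forward(self, z):
        return self.fc(z).squeeze(-1)                # (B,)
```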

Backup function. We use the following function that can be recursively applied to calculate the tree backup:

\[\text{b}(\mathbf{x}) = \sum_i x_i \text{softmax}(\mathbf{x})_i.\]

Using a hard $\max$ for calculating the backup would result in gradient information only being used to update parameters along the maximal path in the tree. By contrast, the $\text{softmax}$ allows us to use downstream gradient information to update parameters along all paths. Furthermore, it potentially reduces the impact of outlier value predictions. With a learned temperature for the $\text{softmax}$, this function could represent the hard $\max$ arbitrarily closely. However, we did not find an empirical difference so we left the temperature at 1.
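A sketch of this softmax-weighted backup, with the temperature exposed as a parameter even though it is left at 1 here:

```python
import torch

def softmax_backup(q, dim=-1, temperature=1.0):
    """b(x) = sum_i x_i * softmax(x)_i, applied over the action dimension.

    A soft alternative to the hard max: gradients flow along every path in the
    tree, not just the maximal one. As temperature -> 0 this approaches the hard max.
    """
    weights = torch.softmax(q / temperature, dim=dim)
    return (q * weights).sum(dim=dim)
```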
