RL in Robotics
Robotics..
Reinforcement Learning
We consider the standard reinforcement learning formalism consisting of an agent interacting with an environment. To simplify the exposition we assume that the environment is fully observable.
An environment is described by a set of states $\mathcal{S}$, a set of actions $\mathcal{A}$, a distribution of initial states $p(s_0)$, a reward function $r \colon \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$, transition probabilities $p(s_{t+1} \mid s_t, a_t)$, and a discount factor $\gamma \in [0,1]$.
A deterministic policy is a mapping from states to actions: $\pi \colon \mathcal{S} \rightarrow \mathcal{A}$. Every episode starts with sampling an initial state $s_0$. At every timestep $t$ the agent produces an action based on the current state: $a_t = \pi(s_t)$. It then receives the reward $r_t = r(s_t,a_t)$, and the environment’s new state is sampled from the distribution $p(\cdot \mid s_t, a_t)$. The discounted sum of future rewards is called the return: $R_t = \sum^\infty_{i=t} \gamma^{i-t} r_i$. The agent’s goal is to maximize its expected return $\mathbb{E}_{s_0}[R_0 \mid s_0]$. The Q-function or action-value function is defined as $Q^\pi(s_t,a_t) = \mathbb{E}[R_t \mid s_t,a_t]$.
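The interaction loop and the return defined above can be illustrated with a minimal Python sketch. The function names (`rollout`, `discounted_return`) and the toy environment are hypothetical and chosen only for illustration. The episode is truncated at a finite horizon, and the transition is deterministic, which is a special case of sampling from $p(\cdot \mid s_t, a_t)$.

```python
def rollout(policy, reward_fn, transition_fn, s0, horizon):
    """Run one episode of fixed length and return the rewards r_0, ..., r_{T-1}."""
    s, rewards = s0, []
    for _ in range(horizon):
        a = policy(s)                    # a_t = pi(s_t)
        rewards.append(reward_fn(s, a))  # r_t = r(s_t, a_t)
        s = transition_fn(s, a)          # next state (deterministic here, by assumption)
    return rewards


def discounted_return(rewards, gamma, t=0):
    """Compute R_t = sum_{i >= t} gamma^(i - t) * r_i for a finite reward list."""
    ret = 0.0
    # Accumulate backwards: R_i = r_i + gamma * R_{i+1}.
    for r in reversed(rewards[t:]):
        ret = r + gamma * ret
    return ret


# Toy environment (an assumption): integer states, the policy always moves right,
# and the reward is 1 whenever the step lands on a state >= 2.
policy = lambda s: 1
reward_fn = lambda s, a: 1.0 if s + a >= 2 else 0.0
transition_fn = lambda s, a: s + a

rewards = rollout(policy, reward_fn, transition_fn, s0=0, horizon=4)
R0 = discounted_return(rewards, gamma=0.5)
```

The backward accumulation uses the recursion $R_t = r_t + \gamma R_{t+1}$, which follows directly from the definition of the return.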
Hindsight Experience Replay
Multi-goal RL
We are interested in training agents which learn to achieve multiple different goals. We follow the approach from Universal Value Function Approximators (Schaul et al., 2015a), i.e. we train policies and value functions which take as input not only a state $s \in \mathcal{S}$ but also a goal $g \in \mathcal{G}$. Moreover, we show that training an agent to perform multiple tasks can be easier than training it to perform only one task, and therefore our approach may be applicable even if there is only one task we would like the agent to perform (a similar situation was recently observed by Pinto and Gupta (2016)).
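One simple way to let a policy depend on both a state and a goal is to concatenate the two into a single input vector. The sketch below shows this with a hypothetical linear policy; the names `concat_input` and `linear_policy`, the linear form, and the weights are assumptions made for illustration only.

```python
def concat_input(state, goal):
    """Build the joint input (s, g) by concatenating state and goal vectors."""
    return list(state) + list(goal)


def linear_policy(weights, state, goal):
    """Hypothetical goal-conditioned policy: a = W [s; g]."""
    x = concat_input(state, goal)
    # One output (action dimension) per row of the weight matrix.
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


# Illustrative weights: a 1-D action equal to (first state coordinate - goal).
weights = [[1.0, 0.0, -1.0]]
action = linear_policy(weights, state=(3.0, 2.0), goal=(1.0,))
```

The same concatenated input could equally be fed to a value function, so that one set of parameters is shared across all goals.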
We assume that every goal $g \in \mathcal{G}$ corresponds to some predicate $f_g \colon \mathcal{S} \rightarrow \{0,1\}$ and that the agent’s goal is to achieve any state $s$ that satisfies $f_g(s) = 1$. If we want to specify the desired state of the system exactly, we may use $\mathcal{S} = \mathcal{G}$ and $f_g(s) = [s = g]$. The goals can also specify only some properties of the state, e.g. suppose that $\mathcal{S} = \mathbb{R}^2$ and we want to be able to achieve an arbitrary state with a given value of the $x$ coordinate. In this case $\mathcal{G} = \mathbb{R}$ and $f_g((x,y)) = [x=g]$.
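The two goal predicates described above can be written as a short Python sketch. The function names are hypothetical, and states are represented as tuples purely for illustration. Exact equality is used to mirror the definitions in the text.

```python
def exact_state_predicate(g):
    """f_g(s) = [s = g], for the case where the goal space equals the state space."""
    return lambda s: int(s == g)


def x_coordinate_predicate(g):
    """f_g((x, y)) = [x = g], for states in R^2 and goals in R."""
    return lambda s: int(s[0] == g)


# A goal that only constrains the x coordinate is satisfied by many states.
f = x_coordinate_predicate(1.0)
satisfied = [f((1.0, 5.0)), f((1.0, -3.0)), f((2.0, 5.0))]
```

The example shows that a goal specifying only part of the state is achieved by every state that agrees with it on that part, whereas the exact predicate is achieved by a single state.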