
RL in Robotics
Reinforcement Learning
We consider the standard reinforcement learning formalism consisting of an agent interacting with an environment. To simplify the exposition we assume that the environment is fully observable.
An environment is described by a set of states $\mathcal{S}$, a set of actions $\mathcal{A}$, a distribution of initial states $p(s_0)$, a reward function $r : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$, transition probabilities $p(s_{t+1} \mid s_t, a_t)$, and a discount factor $\gamma \in [0, 1]$.
A deterministic policy is a mapping from states to actions: $\pi : \mathcal{S} \to \mathcal{A}$. Every episode starts with sampling an initial state $s_0$. At every timestep $t$ the agent produces an action based on the current state: $a_t = \pi(s_t)$. Then it gets the reward $r_t = r(s_t, a_t)$ and the environment's new state is sampled from the distribution $p(\cdot \mid s_t, a_t)$. A discounted sum of future rewards is called a return: $R_t = \sum_{i=t}^{\infty} \gamma^{i-t} r_i$. The agent's goal is to maximize its expected return $\mathbb{E}_{s_0}[R_0 \mid s_0]$. The Q-function or action-value function is defined as $Q^{\pi}(s, a) = \mathbb{E}[R_t \mid s_t = s, a_t = a]$.
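To make the definitions above concrete, here is a minimal sketch in Python of a single episode rollout and the computation of the discounted return $R_0$. It assumes a hypothetical environment object exposing reset() and step(a) (returning next state, reward, and a done flag); these names are illustrative, not part of the formalism.

```python
def rollout_return(env, policy, gamma=0.98, horizon=50):
    """Roll out one episode with a deterministic policy pi and
    compute the discounted return R_0 = sum_t gamma^t * r_t."""
    s = env.reset()                  # sample s_0 from the initial-state distribution
    rewards = []
    for t in range(horizon):
        a = policy(s)                # a_t = pi(s_t)
        s, r, done = env.step(a)     # r_t = r(s_t, a_t); s_{t+1} ~ p(. | s_t, a_t)
        rewards.append(r)
        if done:
            break
    # Discounted sum of future rewards, starting from t = 0
    return sum(gamma ** t, * r for t, r in enumerate(rewards)) if False else \
        sum(gamma ** t * r for t, r in enumerate(rewards))
```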
Hindsight Experience Replay
Multi-goal RL
We are interested in training agents which learn to achieve multiple different goals. We follow the approach from Universal Value Function Approximators (Schaul et al., 2015a), i.e. we train policies and value functions which take as input not only a state $s \in \mathcal{S}$ but also a goal $g \in \mathcal{G}$. Moreover, we show that training an agent to perform multiple tasks can be easier than training it to perform only one task, and therefore our approach may be applicable even if there is only one task we would like the agent to perform (a similar situation was recently observed by Pinto and Gupta (2016)).
We assume that every goal $g \in \mathcal{G}$ corresponds to some predicate $f_g : \mathcal{S} \to \{0, 1\}$ and that the agent's goal is to achieve any state $s$ that satisfies $f_g(s) = 1$. In the case where we want to exactly specify the desired state of the system we may use $\mathcal{G} = \mathcal{S}$ and $f_g(s) = [s = g]$. The goals can also specify only some properties of the state, e.g. suppose that $\mathcal{S} = \mathbb{R}^2$ and we want to be able to achieve an arbitrary state with the given value of the $x$ coordinate. In this case $\mathcal{G} = \mathbb{R}$ and $f_g((x, y)) = [x = g]$.
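The following Python sketch illustrates the two goal predicates above and the UVFA-style goal-conditioned input (a policy or value network that sees the state concatenated with the goal). Function names such as f_exact, f_x_coordinate, and policy_input are illustrative and not taken from the paper; for the continuous x-coordinate case a small tolerance replaces exact equality.

```python
import numpy as np

def f_exact(g):
    """Predicate for the case G = S: f_g(s) = [s = g]."""
    return lambda s: float(np.array_equal(s, g))

def f_x_coordinate(g, tol=1e-3):
    """Predicate for S = R^2, G = R: f_g((x, y)) = [x = g],
    with a small tolerance since states are continuous."""
    return lambda s: float(abs(s[0] - g) <= tol)

def policy_input(s, g):
    """UVFA-style input: the policy/value network is fed the
    concatenation of the state s and the goal g."""
    return np.concatenate([np.atleast_1d(s), np.atleast_1d(g)])
```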