Extensive Math Warning!

Multimodal Imitation Learning with Partial Observability

By learning from expert demonstrations, Imitation Learning (IL) can guide reinforcement learning (RL) through scarce-reward situations, or even settings where no appropriate reward signal is available [2]. Existing IL methods, such as Behavior Cloning (BC) [6], Inverse Reinforcement Learning (IRL) [1], and Generative Adversarial Imitation Learning (GAIL) [5], only learn from single-expert demonstrations for a single task under the full observability assumption.

To learn from unstructured demonstrations, multi-modal imitation learning was proposed to disentangle the different distributions of intentions or similar semantic features [4]. Multi-modal imitation learning was also explored in Partially Observable Markov Decision Process (POMDP) settings with RNN-based belief state representations [3].

In this work, we explore applying multimodal imitation learning to a manipulation task with uncertainty in the poses of the unactuated objects.

The cracker box is pushed left or right so that the blue can can be grasped.
Multimodal imitation learning of this manipulation task requires accounting for uncertainty, such as perception errors.

  1. The state is represented by the poses of the two objects.

  2. The observation is the pose of the visible object, corrupted by pose estimation error with standard deviation ε (perception error).

  3. The gripper is not wide enough to grasp the cracker box, so it has to push the box first and then grasp the can.

  4. The action is given as the difference of the end-effector configuration.

  5. Expert demonstrations are given in two modes: (a) push the cracker box to the left first; (b) push the cracker box to the right first.
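The observation model in item 2 can be sketched as a noisy readout of the visible object's pose. The dictionary layout, the `observe` function, and the noise level `EPS` below are all illustrative assumptions, not the actual environment interface:

```python
import numpy as np

EPS = 0.01  # assumed standard deviation of the pose-estimation noise (the ε above)

def observe(state, rng, eps=EPS):
    """Return a noisy observation of the visible object's pose.

    `state` is assumed to hold planar poses (x, y, yaw) for the cracker
    box and the can; only the currently visible object is observed,
    perturbed by zero-mean Gaussian noise with std-dev `eps`.
    (Hypothetical interface for illustration only.)
    """
    visible = "can" if state["can_visible"] else "crackerbox"
    pose = np.asarray(state[visible], dtype=float)
    return visible, pose + rng.normal(0.0, eps, size=pose.shape)

rng = np.random.default_rng(0)
state = {"crackerbox": [0.4, 0.1, 0.0], "can": [0.5, 0.0, 0.0], "can_visible": True}
name, obs = observe(state, rng)
```

Because the true pose is never observed directly, the agent must maintain a belief over object poses rather than act on raw observations.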

Hypothesis

  1. RNNs are not appropriate for belief state estimation in manipulation tasks.

  2. Our model learns both expert trajectory distributions well.

  3. InfoGAIL cannot imitate the demonstrations given above.

Multi-expert 2D trajectories. Unlabeled expert demonstrations have two modes: starting from (0, 0), go counter-clockwise or clockwise. Blue and green represent two distinguishable sets of optimal behaviors. Our method precisely recovers the multi-modal policy while InfoGAIL fails.

  • In the figure above, InfoGAIL fails to recover the multi-modal policy in partially observable situations. Modeling the task as a POMDP and updating the belief state solves this problem.

  • The belief state embedding is given by an RNN as follows:

\[b_t (\tau_t;\phi) = \text{RNN}_\phi (b_{t-1} (\tau_{t-1};\phi), z_t, a_{t-1}).\]
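The recurrence above can be sketched as a single-layer tanh RNN cell that folds the previous belief \(b_{t-1}\), the current observation \(z_t\), and the previous action \(a_{t-1}\) into a new belief \(b_t\). The class name, dimensions, and initialization below are illustrative, not the paper's actual architecture:

```python
import numpy as np

class BeliefRNN:
    """Minimal tanh-RNN belief encoder:
    b_t = tanh(W [b_{t-1}; z_t; a_{t-1}] + c),
    a sketch of b_t(τ_t; φ) = RNN_φ(b_{t-1}(τ_{t-1}; φ), z_t, a_{t-1}).
    """

    def __init__(self, obs_dim, act_dim, belief_dim, seed=0):
        rng = np.random.default_rng(seed)
        in_dim = belief_dim + obs_dim + act_dim
        # Small random weights; φ corresponds to (W, c) here.
        self.W = rng.normal(0.0, 1.0 / np.sqrt(in_dim), (belief_dim, in_dim))
        self.c = np.zeros(belief_dim)

    def step(self, b_prev, z_t, a_prev):
        """One belief update from (b_{t-1}, z_t, a_{t-1}) to b_t."""
        x = np.concatenate([b_prev, z_t, a_prev])
        return np.tanh(self.W @ x + self.c)

# Roll the belief forward over a short observation/action history.
rnn = BeliefRNN(obs_dim=3, act_dim=2, belief_dim=8)
b = np.zeros(8)  # b_0: uninformative initial belief
for z, a in [(np.ones(3), np.zeros(2)), (np.zeros(3), np.ones(2))]:
    b = rnn.step(b, z, a)
```

In practice a gated cell (e.g. a GRU or LSTM) would replace the plain tanh update, and the policy and discriminator would condition on \(b_t\) instead of the raw observation.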

References

[1] Pieter Abbeel and Andrew Y. Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the Twenty-First International Conference on Machine Learning, ICML '04, page 1, New York, NY, USA, 2004. Association for Computing Machinery.

[2] Dario Amodei, Chris Olah, Jacob Steinhardt, Paul F. Christiano, John Schulman, and Dan Mané. Concrete problems in AI safety. CoRR, abs/1606.06565, 2016.

[3] Zipeng Fu, Minghuan Liu, Ming Zhou, and Weinan Zhang. Multi-modal imitation learning in partially observable environments. 2019.

[4] Karol Hausman, Yevgen Chebotar, Stefan Schaal, Gaurav S. Sukhatme, and Joseph J. Lim. Multi-modal imitation learning from unstructured demonstrations using generative adversarial nets. CoRR, abs/1705.10479, 2017.

[5] Jonathan Ho and Stefano Ermon. Generative adversarial imitation learning. In D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc., 2016.

[6] Dean A. Pomerleau. Efficient training of artificial neural networks for autonomous navigation. Neural Computation, 3(1):88–97, 1991.

B-Boy Seiok

Pianist, B-Boy, street artist
