P. Abolghasemi, A. Mazaheri, M. Shah, and L. Bölöni

Pay Attention! - Robustifying a Deep Visuomotor Policy through Task-Focused Attention


Cite as:

P. Abolghasemi, A. Mazaheri, M. Shah, and L. Bölöni. Pay Attention! - Robustifying a Deep Visuomotor Policy through Task-Focused Attention. In Proc. of Conference on Computer Vision and Pattern Recognition (CVPR-2019), pp. 4254–4262, 2019.

Download:

Download Video 

Abstract:

Several recent studies have demonstrated the promise of deep visuomotor policies for robot manipulator control. Despite impressive progress, these systems are known to be vulnerable to physical disturbances, such as accidental or adversarial bumps that make them drop the manipulated object. They also tend to be distracted by visual disturbances such as objects moving in the robot's field of view, even if the disturbance does not physically prevent the execution of the task. In this paper, we propose an approach for augmenting a deep visuomotor policy trained through demonstrations with Task Focused visual Attention (TFA). The manipulation task is specified with a natural language text such as "move the red bowl to the left". This allows the visual attention component to concentrate on the current object that the robot needs to manipulate. We show that even in benign environments, the TFA allows the policy to consistently outperform a variant with no attention mechanism. More importantly, the new policy is significantly more robust: it regularly recovers from severe physical disturbances (such as bumps causing it to drop the object) from which the baseline policy, i.e. with no visual attention, almost never recovers. In addition, we show that the proposed policy performs correctly in the presence of a wide class of visual disturbances, exhibiting a behavior reminiscent of human selective visual attention experiments. Our proposed approach consists of a VAE-GAN network that encodes the visual input and feeds it to a Motor network that moves the robot joints. In addition, our approach benefits from a teacher network for the TFA that leverages the textual input command to robustify the visual encoder against various types of disturbances.
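The sketch below (not the authors' code) illustrates the high-level data flow described in the abstract: a visual encoder produces a latent state, a text-conditioned attention teacher focuses that encoder on the task-relevant object, and a motor network maps the latent state to joint commands. All module names, dimensions, and the simple soft-attention formulation are assumptions made for illustration.

```python
# Minimal sketch of the pipeline described in the abstract; module names,
# dimensions, and the soft-attention formulation are assumptions, not the
# authors' implementation.
import torch
import torch.nn as nn


class VisualEncoder(nn.Module):
    """Encodes a camera frame into a latent vector (stand-in for the VAE-GAN encoder)."""
    def __init__(self, latent_dim=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
        )
        self.fc = nn.Linear(64 * 32 * 32, latent_dim)  # assumes a 128x128 input frame

    def forward(self, image, attention=None):
        feat = self.conv(image)                  # (B, 64, 32, 32)
        if attention is not None:
            feat = feat * attention              # suppress regions unrelated to the task
        return self.fc(feat.flatten(1))


class AttentionTeacher(nn.Module):
    """Maps a text-command embedding to a spatial attention map over image features."""
    def __init__(self, text_dim=32):
        super().__init__()
        self.fc = nn.Linear(text_dim, 32 * 32)

    def forward(self, text_embedding):
        attn = torch.sigmoid(self.fc(text_embedding))   # (B, 1024)
        return attn.view(-1, 1, 32, 32)                 # broadcast over feature channels


class MotorNetwork(nn.Module):
    """Maps the latent visual state to robot joint commands."""
    def __init__(self, latent_dim=64, num_joints=7):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(),
                                 nn.Linear(128, num_joints))

    def forward(self, latent):
        return self.mlp(latent)


# Example forward pass with random stand-in data.
encoder, teacher, motor = VisualEncoder(), AttentionTeacher(), MotorNetwork()
image = torch.randn(1, 3, 128, 128)      # camera frame
command = torch.randn(1, 32)             # embedding of "move the red bowl to the left"
joints = motor(encoder(image, teacher(command)))
print(joints.shape)                      # torch.Size([1, 7])
```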

BibTeX:

@inproceedings{Abolghasemi-2019-PayAttention,
title={Pay Attention! - Robustifying a Deep Visuomotor Policy through Task-Focused Attention},
author={P. Abolghasemi and A. Mazaheri and M. Shah and L. B{\"o}l{\"o}ni},
year={2019},
booktitle = {Proc. of Conference on Computer Vision and Pattern Recognition (CVPR-2019)},
pages = {4254--4262},
doi = {10.1109/CVPR.2019.00438},
abstract = {
  Several recent studies have demonstrated the promise of deep visuomotor policies for robot manipulator control. Despite impressive progress, these systems are known to be vulnerable to physical disturbances, such as accidental or adversarial bumps that make them drop the manipulated object. They also tend to be distracted by visual disturbances such as objects moving in the robot's field of view, even if the disturbance does not physically prevent the execution of the task. In this paper, we propose an approach for augmenting a deep visuomotor policy trained through demonstrations with Task Focused visual Attention (TFA). The manipulation task is specified with a natural language text such as ``move the red bowl to the left''. This allows the visual attention component to concentrate on the current object that the robot needs to manipulate. We show that even in benign environments, the TFA allows the policy to consistently outperform a variant with no attention mechanism. More importantly, the new policy is significantly more robust: it regularly recovers from severe physical disturbances (such as bumps causing it to drop the object) from which the baseline policy, i.e. with no visual attention, almost never recovers. In addition, we show that the proposed policy performs correctly in the presence of a wide class of visual disturbances, exhibiting a behavior reminiscent of human selective visual attention experiments. Our proposed approach consists of a VAE-GAN network which encodes the visual input and feeds it to a Motor network that moves the robot joints. Also, our approach benefits from a teacher network for the TFA that leverages textual input command to robustify the visual encoder against various types of disturbances.
 }
}

Generated by bib2html.pl (written by Patrick Riley, Lotzi Boloni ) on Fri Jan 29, 2021 20:15:22