View-Invariant Policy Learning
via Zero-Shot Novel View Synthesis

Anonymous Authors

To supplement our submission, this page provides visualizations of augmented dataset trajectories, policy rollouts from novel test viewpoints in real and simulated environments, comparisons between view synthesis methods, and a brief overview of our work.

We aim to learn policies that generalize to novel viewpoints from widely available,
offline single-view RGB robotic trajectory data.

Abstract

Large-scale visuomotor policy learning is a promising approach toward developing generalizable manipulation systems. Yet, policies that can be deployed across diverse embodiments, environments, and observational modalities remain elusive. In this work, we investigate how knowledge from large-scale visual data of the world may be used to address one axis of variation for generalizable manipulation: observational viewpoint. Specifically, we study single-image novel view synthesis models, which learn 3D-aware scene-level priors by rendering images of the same scene from alternate camera viewpoints given a single input image. For practical application to diverse robotic data, these models must operate zero-shot, performing view synthesis on unseen tasks and environments. We empirically analyze view synthesis models within a simple data-augmentation scheme to understand their capabilities for learning viewpoint-invariant policies from single-viewpoint demonstration data. Upon evaluating the robustness of policies trained with our method to out-of-distribution camera viewpoints, we find that they outperform baselines in both simulated and real-world manipulation tasks.

Learning View-Invariance




Depiction of the data augmentation scheme that we study: observations are replaced with viewpoint-augmented versions of the same scene, while action labels are held constant.
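To make the scheme concrete, below is a minimal Python sketch of the augmentation loop. The `nvs_model.render` call and the `sample_pose` helper are hypothetical placeholders standing in for a single-image view synthesis model and a novel-viewpoint sampler; they are not the actual ZeroNVS interface.

```python
def augment_trajectory(trajectory, nvs_model, sample_pose):
    """Replace each observation in a demonstration with a synthesized
    novel view; action labels are held constant."""
    augmented = []
    for obs, action in trajectory:
        # Sample a novel camera pose relative to the original viewpoint.
        novel_pose = sample_pose()
        # Hypothetical single-image NVS call: render the same scene from
        # the sampled viewpoint given only the original RGB frame.
        novel_obs = nvs_model.render(obs, camera_pose=novel_pose)
        # The action label is unchanged across viewpoints.
        augmented.append((novel_obs, action))
    return augmented
```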

Example Augmented Dataset Trajectories


Below, we show examples of expert demonstration trajectories augmented by various novel view synthesis models. For ZeroNVS (finetuned), the model is finetuned on synthetic data from MimicGen tasks for simulated environments and on the DROID dataset for the real setting, qualitatively improving the fidelity of generated images and quantitatively improving downstream policy performance.
Please see the manuscript and appendix for more details about finetuning.

[Video grid: augmented demonstration trajectories for the Coffee, Stack, Threading, Hammer, and Cup on saucer (real) tasks, comparing the single original view with depth estimation + reprojection, ZeroNVS, and ZeroNVS (finetuned) augmentations.]

Qualitative Behavior of Learned Policies on Novel Test Viewpoints


Here we show rollouts of policies trained with view synthesis model augmentation, evaluated from random test viewpoints drawn from the quarter-circle arc distribution. We note that even when the policies augmented using the finetuned ZeroNVS model fail, they tend to make more progress on the task than the single-view and depth-estimation-plus-reprojection baselines. Critically, note that these policies are all trained on a single-view source demonstration dataset.
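For reference, here is a hedged sketch of what sampling test viewpoints on a quarter-circle arc could look like; the radius, camera height, and arc placement are illustrative assumptions, not the exact evaluation configuration.

```python
import numpy as np

def sample_arc_viewpoint(radius=0.8, height=0.5, rng=None):
    """Sample a camera position on a quarter-circle arc around the
    workspace origin. All numeric values are illustrative."""
    rng = rng or np.random.default_rng()
    theta = rng.uniform(0.0, np.pi / 2)  # 90-degree arc
    position = np.array([radius * np.cos(theta),
                         radius * np.sin(theta),
                         height])
    # Point the camera at the workspace center (the origin).
    forward = -position / np.linalg.norm(position)
    return position, forward
```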


[Video grid: policy rollouts on the Hammer, Coffee, and Stack tasks for policies trained with the single original view, depth estimation + reprojection, and ZeroNVS (finetuned).]

Learning & Deploying View-Invariant Policies on Real Robots


Using the ZeroNVS novel view synthesis model, we learn diffusion policies that take as input both third-person and (unaugmented) wrist-camera observations to solve a "put cup in saucer" task from multiple novel viewpoints. Here we show successful rollouts from policies trained on datasets augmented by a pretrained ZeroNVS model and by a ZeroNVS model finetuned on the DROID dataset; finetuning further improves performance. Again, the original training dataset contains trajectories observed from only a single RGB view.
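As a sketch of how such a training example could be assembled (again with the hypothetical `nvs_model.render` interface from above), only the third-person observation is viewpoint-augmented, while the wrist image and action label pass through unchanged:

```python
def build_training_example(third_person_obs, wrist_obs, action,
                           nvs_model, sample_pose):
    """Assemble one policy-training example. Only the third-person view
    is replaced with a synthesized novel view; the wrist-camera frame
    and the action label are kept as-is."""
    novel_view = nvs_model.render(third_person_obs,
                                  camera_pose=sample_pose())
    return {
        "third_person": novel_view,  # viewpoint-augmented observation
        "wrist": wrist_obs,          # unaugmented wrist-camera frame
        "action": action,            # label held constant
    }
```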

In contrast, a baseline model trained on the single original view and wrist observations is heavily overfitted to the original view, and often fails to reach toward the cup when tested on novel views:

[Videos: baseline rollouts from the original view and two novel views.]

Additionally, we find that policies trained only on wrist observations can have difficulty localizing the correct object to grasp (in this case, the cup). Note that the third-person view here is shown only for visualization and is not provided to the policy: