Act3D: 3D Feature Field Transformers for Multi-Task Robotic Manipulation

Carnegie Mellon University

Abstract

3D perceptual representations are well suited for robot manipulation because they readily encode occlusions and simplify spatial reasoning. Many manipulation tasks require high spatial precision in end-effector pose prediction, which typically demands high-resolution 3D feature grids that are computationally expensive to process. As a result, most manipulation policies operate directly in 2D, forgoing 3D inductive biases.

We propose Act3D, a manipulation policy transformer that represents the robot's workspace using a 3D feature field with adaptive resolution. The model lifts 2D pre-trained features to 3D using sensed depth, and attends to them to compute features for sampled 3D points. It samples 3D point grids in a coarse-to-fine manner, featurizes them using relative-position attention, and selects where to focus the next round of point sampling. In this way, it **efficiently computes 3D action maps of high spatial resolution**.
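The lifting step mentioned above can be illustrated with a standard pinhole back-projection. This is a minimal sketch, not the authors' implementation: the function names `unproject_pixel` and `lift_features`, the intrinsics `fx, fy, cx, cy`, and the toy 2x2 inputs are all assumptions made for illustration, and the points are left in the camera frame (a multi-view setup would additionally need camera extrinsics, which the text does not describe).

```python
def unproject_pixel(u, v, depth, fx, fy, cx, cy):
    """Back-project pixel (u, v) with metric depth to a camera-frame 3D point.

    Assumes an ideal pinhole camera with focal lengths (fx, fy) and
    principal point (cx, cy); these parameters are illustrative.
    """
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def lift_features(feature_map, depth_map, fx, fy, cx, cy):
    """Pair each 2D feature vector with the 3D point its pixel projects to.

    `feature_map` stands in for the output of a pre-trained 2D backbone and
    is assumed to have the same height and width as `depth_map`.
    """
    cloud = []
    for v, (feat_row, depth_row) in enumerate(zip(feature_map, depth_map)):
        for u, (feat, depth) in enumerate(zip(feat_row, depth_row)):
            if depth <= 0:  # skip pixels without a valid depth reading
                continue
            point = unproject_pixel(u, v, depth, fx, fy, cx, cy)
            cloud.append((point, feat))
    return cloud


# Toy 2x2 example: one feature vector and one depth value per pixel.
features = [[[1.0, 0.0], [0.0, 1.0]],
            [[0.5, 0.5], [0.2, 0.8]]]
depths = [[1.0, 2.0],
          [0.0, 4.0]]  # 0.0 marks a missing depth reading
cloud = lift_features(features, depths, fx=2.0, fy=2.0, cx=0.5, cy=0.5)
```

Each entry of `cloud` is a `(point, feature)` pair, which is the kind of 3D feature cloud that sampled query points could then attend to.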

Act3D sets a new state of the art on RLBench, an established manipulation benchmark. Our model achieves a 10% absolute improvement over the previous SOTA 2D multi-view policy on 74 RLBench tasks and a 22% absolute improvement with 3x less compute over the previous SOTA 3D policy. In thorough ablations, we show the importance of relative spatial attention, large-scale vision-language pre-trained 2D backbones, and weight tying across coarse-to-fine attentions.

Results

We test Act3D on learning single-task and multi-task manipulation policies from demonstrations, in simulation and in the real world. In simulation, we evaluate on RLBench in two settings to ensure a clear comparison with prior work: a single-task setting with 74 tasks proposed by HiveFormer, and a multi-task, multi-variation setting with 18 tasks and 249 variations proposed by PerAct.

Single-task performance. On 74 RLBench tasks across 9 categories, Act3D reaches an 83% success rate, an absolute improvement of 10% over InstructRL, the prior SOTA in this setting.

Multi-task performance. On 18 RLBench tasks with 249 variations, Act3D reaches a 65% success rate, an absolute improvement of 22% over PerAct, the prior SOTA in this setting.

Act3D is a manipulation policy Transformer that casts 6-DoF keypose prediction as 3D detection with adaptive spatial computation
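To make "adaptive spatial computation" concrete, the coarse-to-fine point sampling can be sketched as a loop that scores a small grid of 3D points, keeps the best one, and resamples a finer grid around it. This is a hedged illustration only: in Act3D the scores come from a learned attention-based model, whereas here `toy_score` is a hypothetical stand-in (negative distance to a known target), and the grid size, number of levels, and shrink factor are assumed values, not the paper's settings.

```python
import itertools
import math


def sample_grid(center, half_extent, points_per_axis):
    """Return a regular grid of 3D points in a cube centered at `center`."""
    steps = [
        -half_extent + 2.0 * half_extent * i / (points_per_axis - 1)
        for i in range(points_per_axis)
    ]
    return [
        (center[0] + dx, center[1] + dy, center[2] + dz)
        for dx, dy, dz in itertools.product(steps, repeat=3)
    ]


def coarse_to_fine_search(score_fn, center, half_extent,
                          points_per_axis=5, levels=3, shrink=0.25):
    """Refine a 3D location by repeatedly sampling around the best point."""
    best = center
    for _ in range(levels):
        candidates = sample_grid(best, half_extent, points_per_axis)
        best = max(candidates, key=score_fn)  # keep the top-scoring point
        half_extent *= shrink                 # zoom in for the next round
    return best


# Hypothetical stand-in for a learned scorer: closer to `target` is better.
target = (0.31, -0.12, 0.47)


def toy_score(point):
    return -math.dist(point, target)


estimate = coarse_to_fine_search(toy_score, center=(0.0, 0.0, 0.0),
                                 half_extent=0.5)
error = math.dist(estimate, target)
```

With these toy settings, three rounds evaluate 3 x 125 = 375 points, whereas a dense grid at the final spacing of 0.015625 over the same unit cube would need 65^3 = 274,625 points. That gap is the intuition behind computing high-resolution action maps without a dense high-resolution grid.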
