ReorientDiff: Diffusion Model based Reorientation for Object Manipulation

In Submission

Georgia Institute of Technology

TL;DR: A diffusion model for language-conditioned, multi-step object manipulation that enables precise object placement.


The ability to manipulate objects into desired configurations is a fundamental requirement for robots in many practical applications. While some goals can be achieved by directly picking and placing the objects of interest, most tasks require object reorientation for precise placement. In such scenarios, the object must be reoriented and repositioned into intermediate poses that facilitate accurate placement at the target pose. To this end, we propose ReorientDiff, a reorientation planning method based on a diffusion model. The method uses both visual inputs from the scene and goal-specific language prompts to plan intermediate reorientation poses. Specifically, the scene and language-task information are mapped into a joint scene-task representation feature space, which is then used to condition the diffusion model. The diffusion model samples intermediate poses from this representation using classifier-free guidance and then uses gradients of learned feasibility-score models for implicit iterative pose refinement. The method is evaluated on a set of YCB objects with a suction gripper and achieves a success rate of 95.2% in simulation. Overall, our study presents a promising approach to the reorientation challenge in manipulation: learning a conditional distribution over reorientation poses, which is an effective step towards more generalizable object manipulation.
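The classifier-free guidance step mentioned above can be illustrated with a minimal sketch. It assumes the common formulation in which a conditional and an unconditional noise prediction are blended with a guidance weight. The function name `cfg_noise`, the argument names, and the exact weighting convention are illustrative assumptions, not taken from the ReorientDiff implementation.

```python
def cfg_noise(eps_cond, eps_uncond, guidance_weight):
    """Blend two noise predictions, classifier-free-guidance style.

    eps_cond:   noise predicted with the scene-task conditioning (list of floats)
    eps_uncond: noise predicted with the conditioning dropped (list of floats)
    guidance_weight: 0 gives the unconditional prediction, 1 the conditional
                     one, and values above 1 extrapolate toward the condition.
    All names and the weighting convention are assumptions for illustration.
    """
    return [u + guidance_weight * (c - u) for c, u in zip(eps_cond, eps_uncond)]


# Toy usage with made-up two-dimensional predictions.
blended = cfg_noise([1.0, 2.0], [0.0, 1.0], guidance_weight=2.0)
```

Other write-ups use the equivalent form (1 + w) * eps_cond - w * eps_uncond, which shifts the meaning of the weight by one.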


(a). We construct the scene-task representation feature space using the pre-trained foundation model CLIP and a segmentation encoder:

(b). We then use the scene-task representation to condition the diffusion model:
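As a rough illustration of steps (a) and (b), the sketch below builds a single conditioning vector from a language feature and a scene feature. It is a toy stand-in under stated assumptions: the real features would come from CLIP and the segmentation encoder and would pass through a learned embedding network, whereas here they are plain lists fused by normalization and concatenation. The function names are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)


def joint_scene_task_embedding(language_feat, scene_feat):
    """Fuse language and scene features into one conditioning vector.

    Assumption: fusion is a simple normalize-then-concatenate. The actual
    scene-task representation is produced by a learned network.
    """
    return l2_normalize(language_feat) + l2_normalize(scene_feat)


# Toy usage with made-up feature vectors.
condition = joint_scene_task_embedding([3.0, 4.0], [0.0, 2.0])
```

The resulting vector plays the role of the condition passed to the diffusion model at every denoising step.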


Evaluating Scene-Task Representations

We evaluate our method on a set of YCB-objects and a suction gripper in simulation. The performance of the scene-task embedding network is shown in the following figure:

Diffusion Model Sampling Performance

Videos of Robot Manipulation using Sampled Reorient Poses

[Videos: reorient1 through reorient8]


The primary contributions of this work are:

  • We are the first to model reorientation poses as a conditional distribution, which offers a better prior for sampling intermediate poses than rejection sampling from a random prior.
  • A framework that incorporates both visual inputs from the scene and goal-specific language prompts to plan intermediate reorientation poses. Intuitively, the scene and the task together should be sufficient to plan the intermediate poses.
  • To address kinodynamic grasp constraints during inference, we apply classifier guidance with learned feasibility-score models to sample from the learned prior only those poses that are valid with respect to feasible grasp poses.
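The feasibility-guided refinement in the last contribution can be sketched in isolation. The example below assumes a hand-written, differentiable feasibility score over a two-dimensional "pose" and nudges a sample along the gradient of its log score. In the actual method the score models are learned and the gradient would be applied inside the reverse diffusion steps. Every function, constant, and the made-up feasible pose here is a hypothetical illustration.

```python
import math


def feasibility_score(pose):
    """Hypothetical stand-in for a learned feasibility model.

    Returns a value in (0, 1] that peaks at a made-up feasible pose (0.5, -0.2).
    """
    dx, dy = pose[0] - 0.5, pose[1] + 0.2
    return math.exp(-(dx * dx + dy * dy))


def grad_log_score(pose, h=1e-5):
    """Central finite-difference gradient of log(feasibility_score)."""
    grad = []
    for i in range(len(pose)):
        up, down = list(pose), list(pose)
        up[i] += h
        down[i] -= h
        diff = math.log(feasibility_score(up)) - math.log(feasibility_score(down))
        grad.append(diff / (2 * h))
    return grad


def guided_refine(pose, steps=50, scale=0.05):
    """Iteratively move a sampled pose toward higher feasibility."""
    pose = list(pose)
    for _ in range(steps):
        g = grad_log_score(pose)
        pose = [p + scale * gi for p, gi in zip(pose, g)]
    return pose


# Toy usage: start from an infeasible sample and refine it.
start = [2.0, 1.0]
refined = guided_refine(start)
```

Compared with rejection sampling, which discards infeasible samples, this kind of guidance reuses each sample and steers it toward the feasible region.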


      title={ReorientDiff: Diffusion Model based Reorientation for Object Manipulation},
      author={Mishra, Utkarsh A and Chen, Yongxin},
      journal={arXiv preprint arXiv:2303.12700},