SayPlan: Grounding Large Language Models using 3D Scene Graphs for Scalable Robot Task Planning

1QUT Centre for Robotics 2University of Adelaide 3CSIRO Data61

*Equal Contribution

📌 Conference on Robot Learning (CoRL), 2023 (Oral Talk)

📌 Robotics: Science and Systems (RSS) Robot Representations For Scene Understanding, Reasoning and Planning, 2023 (Oral Talk)

SayPlan scales the grounding of task plans generated by large language models to multi-room and multi-floor environments using 3D scene graph representations.

Abstract

Large language models (LLMs) have demonstrated impressive results in developing generalist planning agents for diverse tasks. However, grounding these plans in expansive, multi-floor, and multi-room environments presents a significant challenge for robotics. We introduce SayPlan, a scalable approach to LLM-based, large-scale task planning for robotics using 3D scene graph (3DSG) representations. To ensure the scalability of our approach, we: (1) exploit the hierarchical nature of 3DSGs to allow LLMs to conduct a semantic search for task-relevant subgraphs from a smaller, collapsed representation of the full graph; (2) reduce the planning horizon for the LLM by integrating a classical path planner; and (3) introduce an iterative re-planning pipeline that refines the initial plan using feedback from a scene graph simulator, correcting infeasible actions and avoiding planning failures. We evaluate our approach on two large-scale environments spanning up to 3 floors, 36 rooms and 140 objects, and show that it is capable of grounding large-scale, long-horizon task plans from abstract, natural language instructions for a real-world mobile manipulator robot to execute.

Approach

This work is motivated by the need to ground task plans generated by LLMs in large-scale, multi-floor and multi-room environments. 3DSGs capture a rich topological and hierarchically-organised semantic graph representation of such environments, with the versatility to encode the information required for task planning, including object states, predicates, affordances and attributes, in natural language suitable for parsing by an LLM. We leverage a JSON representation of this graph as input to a pre-trained LLM; however, to ensure that plans scale to larger scenes, we introduce two key components that exploit the hierarchical nature of 3DSGs: semantic search and iterative re-planning.
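To make this concrete, the snippet below sketches how a single 3DSG node might be serialised to JSON before being passed to the LLM. The field names and values (e.g. fridge_1, affordances, contains) are illustrative assumptions, not the exact schema used by SayPlan.

    # A single 3DSG node serialised to JSON for the LLM. Field names and
    # values are illustrative assumptions, not SayPlan's exact schema.
    import json

    fridge_node = {
        "id": "fridge_1",
        "type": "asset",
        "room": "kitchen",
        "state": "closed",                    # object state
        "affordances": ["open", "close"],     # actions the robot can apply
        "attributes": {"colour": "white"},    # natural-language attributes
        "contains": ["milk_1"],               # predicate: containment
    }

    print(json.dumps(fridge_node, indent=2))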


Semantic Search

[Animation: semantic search]

Given a natural language instruction, the LLM conducts a semantic search for a task-relevant subgraph containing the assets and objects required for planning. It does so by manipulating the nodes of a collapsed 3D scene graph through expand and contract API function calls, making it feasible to plan over increasingly large-scale environments. In doing so, the LLM maintains focus on a small, informative subgraph during planning without exceeding its token limit.
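A minimal sketch of this search loop is shown below. It assumes a hypothetical SceneGraph wrapper exposing expand/contract calls and an llm callable that returns the LLM's next command (e.g. "expand kitchen" or "done"); these interfaces are assumptions for illustration, not SayPlan's actual API.

    # Minimal sketch of the LLM-driven semantic search over a collapsed
    # 3D scene graph. SceneGraph and llm are assumed interfaces.
    class SceneGraph:
        def __init__(self, children):
            self.children = children          # e.g. {"kitchen": ["fridge_1"], ...}
            self.expanded = set()

        def expand(self, node):               # reveal a node's children
            self.expanded.add(node)

        def contract(self, node):             # hide task-irrelevant children
            self.expanded.discard(node)

        def as_prompt(self):
            # Only expanded nodes expose their children, keeping the prompt small.
            return {n: (self.children[n] if n in self.expanded else "collapsed")
                    for n in self.children}

    def semantic_search(graph, instruction, llm, max_steps=10):
        for _ in range(max_steps):
            command = llm(instruction, graph.as_prompt()).split()
            if command[0] == "expand":
                graph.expand(command[1])
            elif command[0] == "contract":
                graph.contract(command[1])
            else:                             # e.g. "done": subgraph found
                break
        return graph.as_prompt()              # task-relevant subgraph for planning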

Iterative Re-planning

To ensure the executability of the proposed plan, we iteratively verify it using feedback from a scene graph simulator, correcting infeasible actions such as failing to open the fridge before placing something inside it, and thereby avoiding planning failures due to inconsistencies, hallucinations, or violations of the physical constraints imposed by the environment. We additionally relax the need for the LLM to produce long navigation sequences by incorporating an existing path planner, such as Dijkstra's algorithm, to connect the high-level nodes generated by the LLM.

[Animation: iterative re-planning]
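The loop below sketches this process under assumed interfaces: a simulator whose verify method returns the first infeasible action as a message (or None if the plan is executable), an llm callable that revises the plan from that feedback, and networkx's Dijkstra shortest path to expand consecutive high-level goal nodes into full routes. None of these names come from the SayPlan codebase; this is an illustrative sketch only.

    # Illustrative iterative re-planning loop; simulator, llm and the plan
    # format are assumptions, not SayPlan's implementation.
    import networkx as nx

    def replan_until_feasible(instruction, subgraph, llm, simulator, max_iters=5):
        feedback = None
        for _ in range(max_iters):
            plan = llm(instruction, subgraph, feedback)   # list of high-level actions
            feedback = simulator.verify(plan)             # e.g. "fridge_1 is closed"
            if feedback is None:
                return plan                               # plan is executable
        raise RuntimeError("No feasible plan found within the iteration budget.")

    def insert_navigation(goal_nodes, nav_graph):
        """Connect consecutive high-level goal nodes with shortest paths
        (Dijkstra) so the LLM never has to emit long navigation sequences."""
        route = [goal_nodes[0]]
        for a, b in zip(goal_nodes, goal_nodes[1:]):
            route += nx.dijkstra_path(nav_graph, a, b)[1:]
        return route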

Results

We demonstrate the ability to autonomously deploy the generated plan on a real mobile manipulator robot using a visually grounded motion controller. We utilise a pre-constructed 3D scene graph of an office floor spanning 36 rooms and containing 150 different assets and objects that the agent can interact with. Note the ability of the agent to generate grounded plans across multiple rooms. In all the videos shown, the robot operates fully autonomously using visual feedback from its in-hand camera for manipulation and localises itself within the scene graph using its onboard laser scanner.

BibTeX


      @inproceedings{rana2023sayplan,
        title={SayPlan: Grounding Large Language Models using 3D Scene Graphs for Scalable Task Planning},
        author={Krishan Rana and Jesse Haviland and Sourav Garg and Jad Abou-Chakra and Ian Reid and Niko Suenderhauf},
        booktitle={7th Annual Conference on Robot Learning},
        year={2023},
        url={https://openreview.net/forum?id=wMpOMO0Ss7a}
      }