Learning Task Planning from Multi-Modal Demonstration
for Multi-Stage Contact-Rich Manipulation

Large Language Models (LLMs) have gained popularity in task planning for long-horizon manipulation tasks. To enhance the validity of LLM-generated plans, visual demonstrations and online videos have been widely employed to guide the planning process. However, for manipulation tasks involving subtle movements but rich contact interactions, visual perception alone may be insufficient for the LLM to fully interpret the demonstration. Additionally, visual data provides limited information on force-related parameters and conditions, which are crucial for effective execution on real robots.

We introduce an in-context learning framework that incorporates tactile and force-torque information from human demonstrations to enhance LLMs' ability to generate plans for new task scenarios. We propose a bootstrapped reasoning pipeline that sequentially integrates each modality into a comprehensive task plan. This task plan is then used as a reference for planning in new task configurations. Real-world experiments on two different sequential manipulation tasks demonstrate the effectiveness of our framework in improving LLMs' understanding of multi-modal demonstrations and enhancing the overall planning performance.

In-context Learning Framework

We deploy our framework on two sequential manipulation tasks: cable mounting and cap tightening.
Click on each image below to view the corresponding workflow and experiment results.
Cable Mounting
Cap Tightening
1. Pre-processing

Segmentation with Object Status

The tactile classifier distinguishes four interaction classes, each mapped to an object status:
Compression (object status: grasped)
Decompression (object status: released)
Linear force (object status: under a linear force)
Rotational force (object status: under torque)

We fine-tune the video classifier TimeSformer on a dataset of labeled tactile videos. Applying this classifier to the complete demonstration then segments it into events whenever a new interaction occurs.
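The segmentation step can be sketched as follows, assuming the fine-tuned classifier has already produced one tactile label per sliding window (the label strings and windowing are illustrative, not the paper's exact interface):

```python
# Sketch of the event-segmentation step: consecutive identical per-window
# labels are merged, so a new event starts whenever the predicted
# interaction class changes.
from itertools import groupby

def segment_events(window_labels):
    """Group per-window tactile labels into (label, start, end) events.

    `end` is exclusive, so each event covers windows start..end-1.
    """
    events = []
    start = 0
    for label, run in groupby(window_labels):
        length = len(list(run))
        events.append((label, start, start + length))
        start += length
    return events

# Synthetic classifier output for one demonstration.
labels = ["released", "released", "grasped", "grasped", "grasped",
          "under a linear force", "grasped"]
events = segment_events(labels)
# The demonstration is cut wherever the label changes: four events here.
```

The event boundaries then serve as the timestamps at which key camera frames are extracted in the next step.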

Key Camera Frames

Camera frames at the detected transition timestamps are extracted as key frames.

Skill Library Translation

Taking the user's skill library as input, the LLM is requested to translate it into a PDDL domain.
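A minimal sketch of how such a translation request could be assembled; the skill names, descriptions, and prompt wording below are hypothetical placeholders, not the paper's actual skill library or prompt:

```python
# Hypothetical skill library: names and descriptions are illustrative only.
SKILL_LIBRARY = {
    "move_object": "Move a grasped object to a target location.",
    "insert": "Insert the grasped object into a fixture.",
    "stretch": "Pull the object taut until a force threshold is reached.",
}

def build_domain_prompt(skill_library):
    """Assemble a request asking the LLM to emit a PDDL domain."""
    lines = ["Translate the following robot skills into a PDDL domain,",
             "defining one action per skill with preconditions and effects:",
             ""]
    for name, desc in skill_library.items():
        lines.append(f"- {name}: {desc}")
    return "\n".join(lines)

prompt = build_domain_prompt(SKILL_LIBRARY)
```

The LLM's response would then be the PDDL domain whose actions mirror the skill names above.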
2. Skill Reasoning

Given the demo task description, the user queries the LLM over successive rounds, and the LLM responds with the demonstrated skill sequence for each round.

3. Condition Reasoning

Given the user's request, the LLM responds with success conditions for each skill, which are then updated with the F/T signals from the demonstration.
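As an illustration of how F/T signals could ground a success condition, the sketch below derives a force threshold for a skill from the peak force in the demonstration segment; the margin factor and readings are invented for illustration, not values from the paper:

```python
# Hypothetical condition update: a new execution of the skill (e.g. stretch)
# counts as successful if its measured force exceeds a fraction of the peak
# force seen in the demonstration. The 0.8 margin is an illustrative choice.
def force_threshold(ft_trace, margin=0.8):
    """Return the force level a new execution must exceed to count as success."""
    peak = max(abs(f) for f in ft_trace)
    return margin * peak

demo_forces = [0.1, 0.4, 1.2, 2.5, 2.4, 0.3]  # synthetic demo F/T readings (N)
threshold = force_threshold(demo_forces)       # 0.8 * 2.5 N
```

The resulting threshold replaces the LLM's initially guessed condition with one calibrated to the demonstration.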
4. Task Planning

Using the demo task plan as an in-context reference, the user requests a plan for the new task configuration, and the LLM responds with the new task plan.
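One way such an in-context planning query could look; the demo plan entries and prompt wording are illustrative assumptions, not the paper's actual plan format:

```python
# Hypothetical demo task plan used as the in-context reference.
DEMO_PLAN = [
    ("move_object", "clip8"),
    ("insert", "clip8"),
    ("stretch", "cable"),
]

def build_planning_prompt(demo_plan, new_task):
    """Combine the demo plan (reference) with a request for a new task plan."""
    ref = "\n".join(f"{i + 1}. {skill} {arg}"
                    for i, (skill, arg) in enumerate(demo_plan))
    return (f"Reference plan from demonstration:\n{ref}\n\n"
            f"Generate a plan for the new task: {new_task}")

prompt = build_planning_prompt(DEMO_PLAN, "mount the cable on clips 3 and 5")
```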


Evaluation on Demonstration Reasoning

We present real-world experiments to evaluate the effectiveness of our demonstration reasoning pipeline and of the resulting plans for new tasks. This evaluation is conducted as an ablation study: by disabling or replacing components of our framework, we design the following control groups.

A. Transition Frames Without Object Status: Key frames in our demonstration reasoning pipeline are replaced by frames at key timestamps but without status annotation.

B. Uniformly Sampled Frames Without Object Status: Frames are sampled uniformly from the video, again without status annotations.

C. Conditions Without F/T Signals: Force/torque signals are excluded from the demonstration reasoning pipeline, so the success conditions remain as initially generated by the LLM without any updates.

D. Without Demonstrations: No demonstration data is provided and the LLM generates the plan solely based on its prior knowledge.
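To make the difference between our keyframe selection and control group B concrete, here is a small sketch (assumed interfaces) of the two sampling strategies: frames at detected transition starts versus frames spread uniformly over the video:

```python
def uniform_indices(num_frames, k):
    """k frame indices (k >= 2) spread evenly over a num_frames-long video."""
    return [round(i * (num_frames - 1) / (k - 1)) for i in range(k)]

def transition_indices(events):
    """One frame index at the start of each segmented (label, start, end) event."""
    return [start for _, start, _ in events]
```

Uniform sampling can miss or straddle the short contact events that matter, which is what the ablation below probes.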

Skill Sequences

Full Demo (ours): the LLM infers the skill "move_object clip8", reasoning that one of the robots is moving the cable towards the position of clip 8.

Transition Frames Without Object Status: the LLM infers "move_object clip8", reasoning that the robot on the right is moving the cable towards the position of clip8.

Uniform Sampling Without Object Status: the LLM infers "move_object cable", reasoning that the cable is being moved towards the position of clip8.
Transition Conditions

[Figures: F/T measurements for the updated stretch skill and for the updated insert skill on the C-clip and U-clip, recorded with the F/T sensor]

Success Rate

[Table: success rates of the updated conditions]


Evaluation on Task Planning

New Task Plan

[Figures: new task plans generated from the full demo and under ablations A (without status) / B (uniformly sampled), C (without F/T signals), and D (without any demo)]

Success Rate

[Table: overall task success rates across all conditions]