Learning Task Planning from Multi-Modal Demonstration
for Multi-Stage Contact-Rich Manipulation
Large Language Models (LLMs) have gained popularity in task planning for long-horizon manipulation tasks. To enhance the validity of LLM-generated plans, visual demonstrations and online videos have been widely employed to guide the planning process. However, for manipulation tasks involving subtle movements but rich contact interactions, visual perception alone may be insufficient for the LLM to fully interpret the demonstration. Additionally, visual data provides limited information on force-related parameters and conditions, which are crucial for effective execution on real robots.
We introduce an in-context learning framework that incorporates tactile and force-torque information from human demonstrations to enhance LLMs' ability to generate plans for new task scenarios. We propose a bootstrapped reasoning pipeline that sequentially integrates each modality into a comprehensive task plan. This task plan is then used as a reference for planning in new task configurations. Real-world experiments on two different sequential manipulation tasks demonstrate the effectiveness of our framework in improving LLMs' understanding of multi-modal demonstrations and enhancing the overall planning performance.
In-context Learning Framework
We apply our framework to two sequential manipulation tasks.
Click on each image below to view the corresponding workflow and experiment results.
1. Pre-processing
Segmentation with Object Status
Compression (object status: grasped)
Decompression (object status: released)
Linear Force (object status: under a linear force)
Rotational Force (object status: under torque)
We fine-tune the TimeSformer video classifier on a dataset of labeled tactile videos. Applying this classifier to the complete demonstration then segments it into events whenever a new interaction occurs.
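A minimal sketch of this segmentation step is shown below, assuming a TimeSformer checkpoint fine-tuned on the four object-status classes above; the checkpoint path, window length, and stride are placeholders rather than the exact values used in our system.

```python
# Sketch: segment a tactile demonstration video into object-status events
# by classifying sliding windows with a fine-tuned TimeSformer.
# The checkpoint path, window length, and stride are illustrative.
import torch
from transformers import AutoImageProcessor, TimesformerForVideoClassification

STATUS_LABELS = ["grasped", "released", "under a linear force", "under torque"]

processor = AutoImageProcessor.from_pretrained("facebook/timesformer-base-finetuned-k400")
model = TimesformerForVideoClassification.from_pretrained(
    "path/to/tactile-timesformer",   # hypothetical fine-tuned checkpoint
    num_labels=len(STATUS_LABELS),
)
model.eval()

def segment_demo(frames, window=8, stride=4):
    """frames: list of HxWx3 uint8 tactile images; returns (start_index, status) events."""
    events, prev_status = [], None
    for start in range(0, len(frames) - window + 1, stride):
        clip = frames[start:start + window]
        inputs = processor(clip, return_tensors="pt")   # pixel_values: (1, T, 3, H, W)
        with torch.no_grad():
            logits = model(**inputs).logits
        status = STATUS_LABELS[int(logits.argmax(-1))]
        if status != prev_status:                       # a new interaction begins here
            events.append((start, status))
            prev_status = status
    return events
```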
Key Camera Frames
Pre-processing Translation
User Input (Skill Library)
User Request
LLM Response (PDDL Domain)
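To illustrate this translation step, the sketch below builds the query that asks the LLM to express the skill library as a PDDL domain. The skill entries, prompt wording, and model name are placeholders, not the exact ones used in our prompts.

```python
# Sketch: ask an LLM to translate a natural-language skill library into a PDDL domain.
# The skill descriptions, prompt wording, and model name are illustrative placeholders.
from openai import OpenAI

SKILL_LIBRARY = {
    "move_object": "Move a grasped object to a target location.",
    "grasp": "Close the gripper around an object.",
    "release": "Open the gripper and release the object.",
    "insert": "Insert a grasped object into a fixture until a force threshold is reached.",
}

def skill_library_to_pddl(skills: dict) -> str:
    listing = "\n".join(f"- {name}: {desc}" for name, desc in skills.items())
    messages = [
        {"role": "system", "content": "You translate robot skill libraries into PDDL domains."},
        {"role": "user", "content":
            "Here is the skill library:\n" + listing +
            "\nWrite a PDDL domain whose actions correspond one-to-one to these skills, "
            "with preconditions and effects over object and gripper predicates."},
    ]
    client = OpenAI()
    response = client.chat.completions.create(model="gpt-4o", messages=messages)
    return response.choices[0].message.content
```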
2. Skill Reasoning
1. User Input (Demo Task Description)
1. User Request
1. LLM Response (Demo Skill Sequence)
2. User Request
2. LLM Response (Demo Skill Sequence)
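The sketch below illustrates how one segmented event, i.e., its key camera frames together with the tactile object status, could be packed into a query that asks the LLM to name the skill being executed and explain why. The event fields, prompt wording, and model name are illustrative assumptions.

```python
# Sketch: query the LLM for the skill executed in one segmented event.
# Event fields, prompt wording, and model name are illustrative.
import base64
from openai import OpenAI

def identify_skill(event, pddl_domain: str, task_description: str) -> str:
    """event: dict with 'keyframes' (list of JPEG bytes) and 'object_status' (str)."""
    content = [
        {"type": "text", "text":
            f"Task: {task_description}\n"
            f"PDDL domain:\n{pddl_domain}\n"
            f"Tactile object status during this event: {event['object_status']}\n"
            "Which skill from the domain is being executed in the key frames below? "
            "Answer with the skill name and its arguments, then briefly explain why."},
    ]
    for jpeg in event["keyframes"]:
        content.append({
            "type": "image_url",
            "image_url": {"url": "data:image/jpeg;base64," + base64.b64encode(jpeg).decode()},
        })
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o", messages=[{"role": "user", "content": content}]
    )
    return response.choices[0].message.content
```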
3. Condition Reasoning
1. User Request
1. LLM Response
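To make the role of the force/torque signals concrete, the sketch below summarizes an event's wrench measurements into a short textual description that can be appended to the condition-reasoning query; the field names and chosen statistics are illustrative.

```python
# Sketch: summarize the force/torque signal of one event into text for the
# condition-reasoning query. Field names and statistics are illustrative.
import numpy as np

def summarize_wrench(wrench: np.ndarray) -> str:
    """wrench: (T, 6) array of [fx, fy, fz, tx, ty, tz] samples for one event."""
    force = np.linalg.norm(wrench[:, :3], axis=1)
    torque = np.linalg.norm(wrench[:, 3:], axis=1)
    return (
        f"peak force {force.max():.1f} N, mean force {force.mean():.1f} N, "
        f"peak torque {torque.max():.2f} Nm, "
        f"final force {force[-1]:.1f} N at the end of the event"
    )

# The summary string is appended to the event description so the LLM can refine the
# skill's success condition, e.g. requiring a minimum contact force for an insertion.
```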
4. Task Planning
User Request
User Input (Demo Task Plan)
LLM Response (New Task Plan)
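The sketch below shows how the final planning query could be assembled, with the reasoned demonstration plan provided as an in-context reference for the new task configuration; the prompt wording and model name are illustrative.

```python
# Sketch: plan a new task configuration with the reasoned demo plan as in-context reference.
# Prompt wording and model name are illustrative.
from openai import OpenAI

def plan_new_task(demo_task_plan: str, pddl_domain: str, new_task_description: str) -> str:
    messages = [
        {"role": "system", "content": "You are a task planner for a dual-arm robot."},
        {"role": "user", "content":
            f"PDDL domain:\n{pddl_domain}\n\n"
            f"Reference plan reasoned from the demonstration:\n{demo_task_plan}\n\n"
            f"New task configuration: {new_task_description}\n"
            "Produce a skill sequence for the new configuration, including the force/torque "
            "success condition attached to each skill."},
    ]
    client = OpenAI()
    response = client.chat.completions.create(model="gpt-4o", messages=messages)
    return response.choices[0].message.content
```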
Evaluation on Demonstration Reasoning
We present real-world experiments to evaluate the effectiveness of our demonstration reasoning pipeline and the resulting plans for new tasks. The evaluation is conducted as an ablation study: by disabling or replacing certain components of our framework, we define the following control groups.
A. Transition Frames Without Object Status: The annotated keyframes in our demonstration reasoning pipeline are replaced by frames at the same transition timestamps, but without object-status annotations.
B. Uniformly Sampled Frames Without Object Status: Frames are sampled uniformly from the demonstration video, again without object-status annotations.
C. Conditions Without F/T Signals: Force/torque signals are excluded from the demonstration reasoning pipeline, so the success conditions remain as initially generated by the LLM without any updates.
D. Without Demonstrations: No demonstration data is provided and the LLM generates the plan solely based on its prior knowledge.
Skill Sequences
Full Demo
Keyframes and Reasoning
Skill: move_object clip8
Because one of the robots is moving the cable towards the position of clip 8.
Transition Frames Without Object Status
Keyframes and Reasoning
Skill: move_object clip8
Because the robot on the right is moving the cable towards the position of clip8.
Uniformly Sampled Frames Without Object Status
Keyframes and Reasoning
Skill: move_object cable
Because the cable is being moved towards the position of clip8.