Integrating Multimodal Communication and Comprehension Evaluation during
Human-Robot Collaboration for
Increased Reliability of Foundation Model-based Task Planning Systems

1Université catholique de Louvain (UCLouvain), 2Ritsumeikan University,
3Karlstad University, 4Kyoto University
Accepted at IEEE/SICE SII 2025.


Abstract

Foundation models provide the adaptability needed in robotics but often require explicit tasks or human verification due to potential unreliability in their responses, complicating human-robot collaboration (HRC). To enhance the reliability of such task-planning systems, we propose 1) an adaptive task-planning system for HRC that reliably performs non-predefined tasks implicitly instructed through HRC, and 2) an integrated system combining multimodal large language model (LLM)-based task planning with multimodal communication of human intention to increase the HRC success rate and comfort. The proposed system integrates GPT-4V for adaptive task planning and comprehension evaluation during HRC with multimodal communication of human intention through speech and deictic gestures. Four pick-and-place tasks of gradually increasing difficulty were used in three experiments, each evaluating a key aspect of the proposed system: task planning, comprehension evaluation, and multimodal communication. The quantitative results show that the proposed system can interpret implicitly instructed tabletop pick-and-place tasks through HRC, providing the next object to pick and the correct position to place it, achieving a mean success rate of 0.80. Additionally, the system can evaluate its comprehension of three of the four tasks with an average precision of 0.87. The qualitative results show that multimodal communication not only significantly enhances the success rate but also improves feelings of trust and control, the willingness to use the system again, and the sense of collaboration during HRC.
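
To make the planning step concrete, below is a minimal sketch of a single query to the multimodal LLM, assuming the OpenAI Python client. The prompt wording, the JSON response schema, and the model identifier are illustrative assumptions, not the exact prompts or interfaces used in the paper.

    import base64
    import json
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def plan_next_action(image_path: str, human_utterance: str) -> dict:
        """Ask the model for the next pick-and-place step, given a scene
        image and the human's (possibly implicit) spoken instruction."""
        with open(image_path, "rb") as f:
            image_b64 = base64.b64encode(f.read()).decode()
        response = client.chat.completions.create(
            model="gpt-4-vision-preview",  # GPT-4V-class model; identifier is an assumption
            messages=[{
                "role": "user",
                "content": [
                    {"type": "text",
                     "text": ("You assist a human in a tabletop pick-and-place task. "
                              f"The human said: '{human_utterance}'. "
                              'Reply with JSON only: {"object": <name>, '
                              '"place": <position>, "confidence": <0 to 1>}')},
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
                ],
            }],
        )
        # The model is instructed to reply with JSON only; a production
        # system would validate the reply before acting on it.
        return json.loads(response.choices[0].message.content)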

Overview


The proposed system enhances the reliability of foundation model-based task planning in HRC (blue area) through comprehension evaluation, in which the robot agent infers a confidence level (green area), and through multimodal communication, in which the human agent uses deictic gestures and speech instructions (orange area).
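
The sketch below shows one way such an inferred confidence level could gate execution, reusing the plan_next_action() helper above. The threshold value and the robot/speaker interfaces are hypothetical placeholders, not values or APIs from the paper.

    CONFIDENCE_THRESHOLD = 0.7  # hypothetical cut-off, not a value from the paper

    def act_or_clarify(plan: dict, robot, speaker) -> None:
        """Execute the planned step only if the model's self-reported
        confidence is high enough; otherwise ask the human to clarify."""
        if plan["confidence"] >= CONFIDENCE_THRESHOLD:
            robot.pick(plan["object"])   # hypothetical robot interface
            robot.place(plan["place"])
        else:
            speaker.say(f"I think you want the {plan['object']} placed at "
                        f"{plan['place']}, but I am not sure. Could you confirm?")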

Multimodal Communication and Comprehension Evaluation for Task Planning


Implementation of the proposed system. The colored areas represent the four modules of the proposed system: the cloud API, the sensors, the computer, and the robot. The black arrows indicate the flow of information, while the red lines denote the types of connections between the modules.
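
The skeleton below sketches one way information could flow through these four modules in a single collaboration step, reusing the plan_next_action() and act_or_clarify() sketches above. All class and method names are hypothetical placeholders for the actual sensor drivers and robot controller.

    class Sensors:
        """Hypothetical wrapper around the camera and microphone drivers."""
        def capture_image(self) -> str: ...           # path to an RGB frame
        def record_speech(self) -> str: ...           # transcribed utterance
        def detect_pointing(self) -> str | None: ...  # object indicated by a deictic gesture

    def collaboration_step(sensors: Sensors, robot, speaker) -> None:
        image = sensors.capture_image()
        utterance = sensors.record_speech()
        pointed_at = sensors.detect_pointing()
        if pointed_at:  # fuse the deictic gesture into the verbal instruction
            utterance += f" (the human is pointing at the {pointed_at})"
        plan = plan_next_action(image, utterance)  # cloud API module
        act_or_clarify(plan, robot, speaker)       # robot module acts or asks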

Experiment 1: Task Planning


Experiment 2: Comprehension Evaluation


Experiment 3: Multimodal Communication


BibTeX


      @inproceedings{martin2025mcce,
        author={Martin, Eden and Hasegawa, Shoichi and Solis, Jorge and Macq, Benoit and Ronsse, Renaud and Garcia Ricardez, Gustavo Alfonso and El Hafi, Lotfi and Taniguchi, Tadahiro},
        title={Integrating Multimodal Communication and Comprehension Evaluation during Human-Robot Collaboration for Increased Reliability of Foundation Model-based Task Planning Systems},
        booktitle={IEEE/SICE International Symposium on System Integration (SII)},
        year={2025},
        note={Accepted}
      }
    

Related Research

Laboratory Information

Funding

This work was supported by the JST Moonshot Research & Development Program (Grant Number JPMJMS2011).