Object Instance Retrieval in Assistive Robotics: Leveraging Fine-Tuned SimSiam with Multi-View Images Based on 3D Semantic Map

¹Ritsumeikan University, ²Soka University, ³Kyoto University
Accepted at IEEE/RSJ IROS 2024

*Corresponding Author

Abstract

Robots that assist in daily life must locate specific object instances in the environment that match the object the user desires. This task is known as Instance-Specific Image Goal Navigation (InstanceImageNav) and requires a model capable of distinguishing between different instances within the same class. A significant challenge in robotics is that when a robot observes the same object from various 3D viewpoints, its appearance can differ greatly, making accurate recognition and localization difficult. In this study, we introduce SimView, a method that leverages multi-view images based on a 3D semantic map of the environment and self-supervised learning with SimSiam to train an instance identification model on-site. The effectiveness of our approach is validated in a photorealistic simulator using Habitat-Matterport 3D environments created by scanning real homes. Our results demonstrate a 1.7-fold improvement in task accuracy over CLIP, a multimodal model pre-trained by contrastive learning, for object search. This improvement highlights the benefits of the proposed fine-tuning method in enhancing the performance of assistive robots on InstanceImageNav tasks.

Overview


Focused task in this study. (Top) The robot identifies the position of an object shown in a query image provided by the user's mobile phone. (Bottom left) Domain gap: the quality of the image taken by the user's mobile phone differs significantly from that of the object image observed by the real robot. (Bottom right) Contrastive learning aligns images of the same instance, captured at different image qualities, in the latent space.

SimView

[Figure 1: Diagram of the proposed SimView system.]

In the proposed system, a robot explores the environment, identifies which of the collected object images shows the same instance as a given query image, and uses a 3D semantic map of the environment to locate the target object's position. In addition, we propose Semantic Instance Multi-view Contrastive Fine-tuning (SimView), a method for fine-tuning pre-trained models with a self-supervised learning framework to improve task accuracy in the target environment. Figure 1 shows a diagram of the proposed system.
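
As a rough illustration of the retrieval step, the sketch below matches a query image against the object images collected during exploration by cosine similarity of encoder features and returns the stored 3D position of the best match. The encoder, function, and variable names here are assumptions for illustration, not the actual implementation.

import torch
import torch.nn.functional as F

def locate_instance(encoder, query_image, object_images, object_positions):
    """Return the 3D position (from the semantic map) of the collected
    object image most similar to the query image."""
    with torch.no_grad():
        q = F.normalize(encoder(query_image.unsqueeze(0)), dim=-1)        # (1, D)
        feats = F.normalize(encoder(torch.stack(object_images)), dim=-1)  # (N, D)
    similarity = (feats @ q.T).squeeze(-1)   # cosine similarity of each object to the query
    best = similarity.argmax().item()        # index of the best-matching instance
    return object_positions[best]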

Self-Supervised Fine-tuning Module

[Figure: Overview of the self-supervised fine-tuning module.]

This module fine-tunes an image encoder pre-trained by contrastive learning, using self-supervised learning on object images observed by the robot while exploring the environment together with their pseudo-labels. When the robot explores the environment and observes objects, the images of a single instance cover many different viewing angles. In a preliminary experiment, we confirmed that fine-tuning a pre-trained model with contrastive learning alone on such a dataset yields worse instance discrimination accuracy than the original pre-trained model. We therefore propose training a linear classifier simultaneously with contrastive learning, using the object instance IDs \( y_{true} \) obtained from the 3D semantic map of the robot's environment as pseudo-labels. In addition, contrastive learning methods that rely on negative pairs typically require very large batch sizes and large amounts of training data, so fine-tuning with them would force the robot to keep exploring the environment for a long time to collect enough object images. We therefore use SimSiam for fine-tuning, since it can learn even with a small batch size.
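
The sketch below shows one plausible form of this joint objective, under our own assumptions: the standard SimSiam loss (negative cosine similarity with a stop-gradient, symmetrized over both views) plus a cross-entropy term from a linear classifier trained on the pseudo-labels taken from the 3D semantic map. The module and parameter names (backbone, projector, predictor, classifier, lambda_cls) are illustrative, not the exact implementation.

import torch
import torch.nn.functional as F

def simsiam_loss(p, z):
    # Negative cosine similarity; the stop-gradient on z is essential in SimSiam.
    return -F.cosine_similarity(p, z.detach(), dim=-1).mean()

def training_step(backbone, projector, predictor, classifier,
                  view1, view2, pseudo_labels, lambda_cls=1.0):
    # Two augmented views of the same object image pass through the shared encoder.
    h1, h2 = backbone(view1), backbone(view2)
    z1, z2 = projector(h1), projector(h2)
    p1, p2 = predictor(z1), predictor(z2)

    # Symmetric SimSiam loss: no negative pairs, so small batches are fine.
    loss_ssl = 0.5 * (simsiam_loss(p1, z2) + simsiam_loss(p2, z1))

    # Linear classifier trained on the instance IDs y_true from the 3D semantic map.
    logits = classifier(h1)
    loss_cls = F.cross_entropy(logits, pseudo_labels)

    return loss_ssl + lambda_cls * loss_cls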

Poster

BibTeX


      @inproceedings{sakaguchi2024simview,
        author={Sakaguchi, Taichi and Taniguchi, Akira and Hagiwara, Yoshinobu and El Hafi, Lotfi and Hasegawa, Shoichi and Taniguchi, Tadahiro},
        title={Object Instance Retrieval in Assistive Robotics: Leveraging Fine-Tuned SimSiam with Multi-View Images Based on 3D Semantic Map},
        booktitle={IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)},
        year={2024},
        note={In press}
      }
    

Laboratory Information

Funding

This work was supported by JSPS KAKENHI Grants-in-Aid for Scientific Research (Grant Numbers JP23K16975, 22K12212) and JST Moonshot Research & Development Program (Grant Number JPMJMS2011).