Pointing Frame Estimation with Audio-Visual Time Series Data for Daily Life Service Robots

1Ritsumeikan University, 2Soka University, 3Kyoto University
Winner of the Best Paper Award at IEEE SMC 2024 (1st of 755)!

*Corresponding Author

Abstract

Daily life support robots in home environments can accomplish more instructions when they interpret the user's pointing gestures together with the accompanying verbal instructions. This study aims to improve the performance of pointing frame estimation by using speech information when a person gives pointing or verbal instructions to a robot. The pointing frame, i.e., the moment at which the user points, helps the robot understand the instruction. We therefore perform pointing frame estimation with a time-series model that uses the user's speech, images, and speech-recognized text observed by the robot. In our experiments, we set up realistic communication conditions, such as speech containing everyday conversation, non-upright posture, actions other than pointing, and reference objects outside the robot's field of view. The results showed that adding speech information improved estimation performance, especially for the Transformer model with Mel-spectrogram features. In the future, this work could be applied to object localization and action planning by robots in 3D environments.

Overview

image of model

An overview of our study. The robot estimates the moment at which a person gives a pointing instruction, using the person's speech, the video showing the pointing gesture, and the speech-recognized sentence.

Pointing Frame Estimator with Audio-Visual Time Series Data

image of model

An overview of the proposed pointing frame estimator. The model takes the user's speech, the video of the pointing gesture, and the speech-recognized text as time-series inputs and estimates the frame at which the pointing instruction is given.
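
The paper itself is the reference for the model details. As a rough, non-authoritative illustration of the idea, the sketch below shows one way a per-frame pointing classifier over time-aligned audio, visual, and text features could look in PyTorch; all layer sizes, feature dimensions, and the simple additive fusion are assumptions for illustration, not the authors' implementation.

    import torch
    import torch.nn as nn

    class PointingFrameEstimator(nn.Module):
        """Hypothetical sketch: classify each frame as pointing / not pointing
        from fused audio, visual, and text features (dimensions are illustrative)."""

        def __init__(self, audio_dim=64, visual_dim=512, text_dim=768, d_model=256):
            super().__init__()
            # Project each modality to a shared dimension before fusion.
            self.audio_proj = nn.Linear(audio_dim, d_model)
            self.visual_proj = nn.Linear(visual_dim, d_model)
            self.text_proj = nn.Linear(text_dim, d_model)
            encoder_layer = nn.TransformerEncoderLayer(
                d_model=d_model, nhead=4, dim_feedforward=512, batch_first=True)
            self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
            self.classifier = nn.Linear(d_model, 1)  # one logit per frame

        def forward(self, audio_feats, visual_feats, text_feats):
            # Each input: (batch, num_frames, feature_dim), time-aligned per video frame.
            fused = (self.audio_proj(audio_feats)
                     + self.visual_proj(visual_feats)
                     + self.text_proj(text_feats))
            encoded = self.encoder(fused)                 # (batch, num_frames, d_model)
            return self.classifier(encoded).squeeze(-1)   # (batch, num_frames) logits

    # Example on random, time-aligned features for a 100-frame clip.
    model = PointingFrameEstimator()
    logits = model(torch.randn(1, 100, 64), torch.randn(1, 100, 512), torch.randn(1, 100, 768))
    pointing_frames = torch.sigmoid(logits) > 0.5         # per-frame pointing decision

Training such a classifier against per-frame pointing labels with a binary cross-entropy loss would yield the frame-wise estimates evaluated in the results below.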

Scenarios of Instructions

We assume four scenarios in which the user gives pointing instructions to the robot, reflecting realistic communication conditions such as everyday conversational speech, non-upright posture, actions other than pointing, and reference objects outside the robot's field of view.

Pointing frame estimation results

We used the F1 score as the evaluation metric.
The proposed method achieved a high score when combining Mel-spectrogram features with a Transformer architecture.

image of representations
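
As a hedged illustration of how Mel-spectrogram audio features might be computed and roughly aligned with video frames, the sketch below uses torchaudio; the file name, sample rate, FFT size, and hop length are assumptions for illustration rather than the settings used in the paper.

    import torch
    import torchaudio

    SAMPLE_RATE = 16_000                    # Hz (assumed)
    VIDEO_FPS = 30                          # video frame rate to align with (assumed)
    hop_length = SAMPLE_RATE // VIDEO_FPS   # one spectrogram column per video frame

    mel = torchaudio.transforms.MelSpectrogram(
        sample_rate=SAMPLE_RATE, n_fft=1024, hop_length=hop_length, n_mels=64)

    # Hypothetical recording of a pointing instruction; shape (channels, samples).
    waveform, sr = torchaudio.load("instruction.wav")
    if sr != SAMPLE_RATE:
        waveform = torchaudio.functional.resample(waveform, sr, SAMPLE_RATE)

    # (channels, n_mels, time) -> log scale, mono, time-major: (time, n_mels)
    features = torch.log(mel(waveform) + 1e-6).mean(dim=0).transpose(0, 1)
    print(features.shape)                   # roughly one 64-dim vector per video frame

Each row of features can then serve as the audio input to the time-series model for the corresponding video frame.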

Poster

BibTeX


      @inproceedings{nakagawa2024pointing,
        author    = {Nakagawa, Hikaru and Hasegawa, Shoichi and Hagiwara, Yoshinobu and Taniguchi, Akira and Taniguchi, Tadahiro},
        title     = {Pointing Frame Estimation with Audio-Visual Time Series Data for Daily Life Service Robots},
        booktitle = {IEEE International Conference on Systems, Man, and Cybernetics (SMC)},
        year      = {2024},
        note      = {In press}
      }
    

Laboratory Information

Funding

This work was supported by JSPS KAKENHI Grants-in-Aid for Scientific Research (Grant Numbers JP23K16975 and 22K12212), the JST Moonshot Research & Development Program (Grant Number JPMJMS2011), and JST SPRING (Grant Number JPMJSP2101).