The AICO Multimodal Corpus – Data Collection and Preliminary Analyses



This paper describes the data collection and the first exploratory research on the AICO Multimodal Corpus. The corpus contains eye-gaze, Kinect, and video recordings of human-robot and human-human interactions, and was collected to study the cooperation, engagement, and attention of human participants in task-based as well as chat-type interactive situations. In particular, the goal was to enable comparison between human-human and human-robot interactions, in addition to studying multimodal behaviour and attention in the different dialogue activities. The robot partner was a humanoid Nao robot. It was expected that its agent-like behaviour would make human-robot interactions similar to human-human interaction, but would also highlight important differences due to the robot’s limited conversational capabilities. The paper reports preliminary studies on the corpus concerning the participants’ eye-gaze and gesturing behaviours. These were chosen as objective measures for studying differences in the participants' multimodal behaviour patterns with a human and with a robot partner.


eye-tracking, gesturing, multimodal corpus collection, human-human and human-robot dialogues


Current development of interactive robot agents is backed by extensive research on methods and tools concerning neural models and big data, as well as on symbolic and rule-based systems incorporating models for knowledge, reasoning and cooperation (Siciliano and Khatib, 2016). Research has been conducted on verbal and non-verbal communication and on building multimodal systems (see an overview in Almeida et al. 2018), but few investigations have compared human multimodal behaviour in interactions with a human versus a robot partner.

In human-human interactions (HHI), multimodal signals play a fundamental role in turn management, feedback, and meaning creation. They are related to coordinating the conversation and building a shared context in which to achieve task goals, seek information, and form social bonds. By extension, such behaviour is also important in human-robot interaction (HRI). Humans intuitively use the same manner of interaction when dealing with social robots, responding effectively to signals that indicate the partner’s (mis)understanding, agreement and emotional state (Jokinen, 2019).

Humans perceive verbal and non-verbal communication effortlessly. In HRI, however, social signals are still less commonly modelled, and the robot’s signalling is less smooth and less effective in serving communicative goals. In experimental settings, users often evaluate the robot’s communicative patterns as inflexible and monotonous, and comment that the robot talks too much. The robot agent does not provide the kind of feedback or non-verbal engagement that human partners do.

This paper discusses our data collection as a starting point for comparing human behaviour in HHI and HRI. The main goal is to study the understanding, engagement, and attention of human participants in various interaction activities, and to enable comparison between similar human-human and human-robot interactions. We expect interactions with an agent-like robot to show similarities with human-human interactions. We also expect them to differ due to the robot’s limited conversational capabilities (turn-taking, feedback, understanding). We explore these differences through the participants’ multimodal behaviour, concentrating especially on visual attention (eye-tracker data). The corpus also contains video data, which has been used for gesture studies and personality experiments. It further contains Kinect data, which is available for further investigations of the participants’ movement in HHI and HRI. The corpus thus provides a useful starting point for systematic comparisons and for modelling the human partner’s engagement and understanding depending on the conversational partner.

The paper is structured as follows. Section 2 describes the setup of the data collection and gives a basic presentation of the data. Section 3 provides preliminary analyses based on the data collected so far, focusing on human gaze patterns and gesturing. Finally, Section 4 presents conclusions and future research directions.
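Visual attention in eye-tracker data is commonly summarised as the share of fixation time that falls on an area of interest (AOI), such as the partner's face, computed separately for each condition. The sketch below shows that computation for an HHI versus HRI comparison. It is a minimal illustration under stated assumptions, not the measure used in the AICO analyses. The `Fixation` record, its field names, the AOI labels and the toy numbers are all hypothetical and are not taken from the corpus.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Fixation:
    """One eye-tracker fixation (hypothetical record format, not the AICO schema)."""
    condition: str      # e.g. "HHI" or "HRI"
    aoi: str            # area of interest hit by the fixation, e.g. "partner_face"
    duration_ms: float  # fixation duration in milliseconds


def gaze_share(fixations, target_aoi):
    """Return, per condition, the fraction of total fixation time spent on target_aoi."""
    total = defaultdict(float)
    on_target = defaultdict(float)
    for fx in fixations:
        total[fx.condition] += fx.duration_ms
        if fx.aoi == target_aoi:
            on_target[fx.condition] += fx.duration_ms
    # Skip conditions with zero recorded fixation time to avoid division by zero.
    return {cond: on_target[cond] / total[cond] for cond in total if total[cond] > 0}


if __name__ == "__main__":
    # Invented toy values for illustration only; these are NOT corpus results.
    toy = [
        Fixation("HHI", "partner_face", 600.0),
        Fixation("HHI", "task_area", 400.0),
        Fixation("HRI", "partner_face", 300.0),
        Fixation("HRI", "task_area", 700.0),
    ]
    print(gaze_share(toy, "partner_face"))  # {'HHI': 0.6, 'HRI': 0.3}
```

Weighting by duration rather than counting fixations keeps a few long looks at the partner from being outweighed by many short glances elsewhere. A real analysis would also normalise per participant before comparing conditions.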

Conclusions and Future Work

The paper has presented the AICO corpus, a multimodal corpus of corresponding human-human and human-robot interactions. It is a systematic collection of eye-tracking and video data that takes into account different interactive activities and languages, with the aim of comparing engagement and attention in human-human and human-robot interaction. The main purpose of the corpus is to serve as training and testing data to bootstrap studies on engagement, awareness, and attention in naturally occurring interactions (i.e. data in the wild). These studies can use qualitative and quantitative research methods as well as neural modelling (e.g. transfer learning and attention networks). The corpus is available for research purposes by contacting the author.

Several preliminary analyses of the AICO corpus have already been conducted and reported in other publications. Currently, the corpus is being analysed further with respect to speech and dialogue acts. In parallel, models are being built that fuse gaze and gesture behaviour with the analysis of spoken utterances in order to coordinate dialogue interactions. Future research will aim at more detailed analyses of gaze and gesturing. The goal is to deepen our understanding of how visual attention, action, collaboration, and engagement are used and correlated in interactive situations. Moreover, the data can be used for computational modelling of natural and engaging interactions, and for exploring neural techniques for designing and developing interactive systems. Such models and systems can be applied in a variety of contexts, including everyday tasks and context-aware applications in the care-giving and educational domains. Finally, the corpus provides a starting point for discussions of ethical, legal, and privacy issues concerning robot agents. Some important aspects are also discussed in Jokinen et al. (2019).
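Fusing gaze and gesture behaviour with spoken utterance analysis first requires relating the annotation tiers in time. The sketch below shows one simple way to do this: each utterance is linked to the gaze and gesture events whose time spans overlap it. This is an illustrative assumption, not the fusion model under development for AICO. The `Interval` format, the helper names and the example labels are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """A time-stamped annotation on one tier (hypothetical format); times in seconds."""
    start: float
    end: float
    label: str


def overlaps(a, b):
    """True if the half-open intervals [a.start, a.end) and [b.start, b.end) share any time."""
    return a.start < b.end and b.start < a.end


def align_tiers(utterances, events):
    """Map each utterance label to the labels of gaze/gesture events overlapping it.

    Assumes utterance labels are unique, since they are used as dictionary keys.
    """
    return {
        utt.label: [ev.label for ev in events if overlaps(utt, ev)]
        for utt in utterances
    }


if __name__ == "__main__":
    # Hypothetical annotations, invented for illustration.
    utterances = [Interval(0.0, 2.5, "u1"), Interval(3.0, 5.0, "u2")]
    events = [
        Interval(1.0, 1.8, "gaze:partner"),
        Interval(2.4, 3.2, "gesture:point"),
        Interval(4.0, 6.0, "gaze:task"),
    ]
    print(align_tiers(utterances, events))
    # {'u1': ['gaze:partner', 'gesture:point'], 'u2': ['gesture:point', 'gaze:task']}
```

An event that spans an utterance boundary, such as the pointing gesture above, is attached to both utterances. Whether that is desirable depends on the downstream model, and a thresholded overlap ratio is a common alternative.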





Kristiina Jokinen, AI Research Center, AIST Tokyo Waterfront, 2-4-7 Aomi, Koto-ku, Tokyo 135-0064, JAPAN





Published in: European Language Resources Association


