Multimodal Human Computer Interaction with MIDAS Intelligent Infokiosk




In this paper, we present an intelligent information kiosk called MIDAS (Multimodal Interactive-Dialogue Automaton for Self-service), covering its hardware and software architecture and the stages of deployment of its speech recognition and synthesis technologies. MIDAS uses the Wizard of Oz (WOZ) methodology, which allows an expert to correct speech recognition results and control the dialogue flow. User statistics of multimodal human-computer interaction (HCI) have been analyzed for the operation of the kiosk in the automatic and automated modes. The infokiosk offers information about the structure and staff of laboratories and the locations and phone numbers of the institution's departments and employees. The multimodal user interface provides a touchscreen, natural speech input, and head and manual gestures, for both ordinary and physically handicapped users.


Keywords: multimodal user interfaces, human-computer interaction, automatic speech recognition, speech synthesis, artificial intelligence, infokiosk


In recent years, smart information kiosks employing speech and multimodal user interfaces have been actively developed. The following systems can be mentioned as examples: Touch’n’Speak, developed by the University of Tampere (Finland); the Memphis Intelligent Kiosk Initiative (MIKI) [1] from the University of Memphis (USA); the French Multimodal-Multimedia Automated Service Kiosk (MASK); and the Multimodal Access to City Help Kiosk (MATCHKiosk) [2] developed by AT&T. In a smart kiosk, information can be input via a touchscreen or keyboard, by voice, or by manual or body gestures.

A prototype of an infokiosk called MIDAS has been developed and installed in the SPIIRAS hall. The infokiosk is able to detect a human being inside the working zone with the Haar-based object detector [3], to track his/her movements, and to demonstrate awareness through a 3D avatar, which tracks clients by rotating its talking head [4]. The infokiosk realizes a mixed-initiative dialogue strategy, starting from the system’s initiative and handing the initiative for the query over to the user after a verbal welcome. The architecture of MIDAS is presented in Figure 1. It integrates many hardware and software technologies, which work simultaneously and synchronously. The most important of these modules are: (1) video processing with two non-stereo video cameras and a computer vision technology for detecting the human’s position, face, and some facial organs; (2) a speaker-independent system for automatic recognition of continuous speech that uses an array of 4 microphones with a T-shaped geometry to eliminate acoustic noise and to localize the source of a relevant acoustic signal for distant speech processing; (3) modules for audio-visual speech synthesis, applied to a talking avatar; (4) an interactive graphical user interface with a touchscreen; (5) a dialogue and data manager that accesses an application database, generates multimodal output, and synchronizes input modality fusion and output modality fission. The kiosk has also been designed for multimodal HCI by users with special needs. It includes a module for contactless control of the mouse cursor by head movements, which is helpful for hand-disabled people; it is based on a Lucas-Kanade feature tracker for optical flow and uses two video cameras. We are also working on equipping the infokiosk with technologies for sign language analysis and synthesis [5], extending the 3D talking head, and automatic recognition of manual gestures of sign language [6], making it available to deaf people.
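The paper attributes the contactless cursor control to a Lucas-Kanade feature tracker for optical flow. As an illustration only (not the authors' implementation, which in practice would use pyramidal tracking on camera images, e.g. via OpenCV), the following minimal sketch implements a single-window, single-level Lucas-Kanade step in pure Python and recovers a small synthetic sub-pixel shift:

```python
import math

def make_frame(shift_x=0.0, shift_y=0.0, size=21):
    """Sample a smooth synthetic intensity pattern on a size x size grid,
    optionally translated by (shift_x, shift_y)."""
    return [[math.sin(0.4 * (x - shift_x)) + math.cos(0.3 * (y - shift_y))
             for x in range(size)] for y in range(size)]

def lucas_kanade(prev, curr, cx, cy, r=5):
    """Estimate the (dx, dy) motion of the patch centred at (cx, cy).

    Classic single-level Lucas-Kanade: least-squares solution of the
    optical-flow constraint Ix*u + Iy*v + It = 0 over a (2r+1)^2 window.
    """
    sxx = sxy = syy = sxt = syt = 0.0
    for y in range(cy - r, cy + r + 1):
        for x in range(cx - r, cx + r + 1):
            ix = (prev[y][x + 1] - prev[y][x - 1]) / 2.0  # spatial gradients
            iy = (prev[y + 1][x] - prev[y - 1][x]) / 2.0
            it = curr[y][x] - prev[y][x]                  # temporal gradient
            sxx += ix * ix; sxy += ix * iy; syy += iy * iy
            sxt += ix * it; syt += iy * it
    det = sxx * syy - sxy * sxy  # determinant of the structure tensor
    u = (-syy * sxt + sxy * syt) / det
    v = (sxy * sxt - sxx * syt) / det
    return u, v

prev = make_frame()
curr = make_frame(shift_x=0.3, shift_y=0.2)  # pattern moved right and down
dx, dy = lucas_kanade(prev, curr, 10, 10)
```

In the kiosk scenario, the estimated (dx, dy) of tracked facial features would be mapped to cursor displacements. Real trackers add image pyramids for large motions and select well-textured corner points before tracking, details omitted here for brevity.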
The proposed kiosk provides information about the structure and staff of laboratories, the locations and phone numbers of departments and employees, and current events, as well as contact information needed by both visitors and employees of the institute. To access the interactive diagram of the SPIIRAS structure, users may use the touchscreen, voice requests, or both.


The infokiosk is able to detect clients and support verbal HCI as described. It combines both the standard means for information input/output (touchscreen and loudspeaker) and devices for contactless HCI (video cameras and microphones). Usability results and the analysis of real-world operation have demonstrated that people prefer to communicate multimodally: more than 60% of user requests were made by natural speech and the rest by touch. In the automated mode, with WOZ speech dialogue completion, the recognition rate was 96%. MIDAS has since been launched in the automatic mode, and those results will be reported later.

About KSRA

The Kavian Scientific Research Association (KSRA) is a non-profit research organization founded in December 2013 to provide research and educational services. The members of the community had initially formed a virtual group on the Viber social network, and the core of the Kavian Scientific Association was formed with these members as founders. These individuals, led by Professor Siavosh Kaviani, decided to launch a scientific/research association with an emphasis on education.

As a non-profit research firm, the KSRA research association is committed to providing research services in the field of knowledge. The main beneficiaries of this association are public and private knowledge-based companies, students, researchers, professors, universities, and industrial and semi-industrial centers around the world.

Our main services are based on education for people across the whole spectrum worldwide. We want to integrate research and education, and we believe education is a fundamental human right, so our services concentrate on inclusive education.

The KSRA team partners with under-served local communities around the world to improve access to, and the quality of, education-based knowledge; to amplify and augment learning programs where they exist; and to create new opportunities for e-learning where traditional education systems are lacking or non-existent.

Full paper PDF file:

Multimodal Human Computer Interaction with MIDAS Intelligent Infokiosk



Alexey Karpov, Andrey Ronzhin, Irina Kipyatkova, Alexander Ronzhin, St. Petersburg Institute for Informatics and Automation of RAS (SPIIRAS), Russia, e-mail: {karpov, ronzhin, kipyatkova}
Lale Akarun, Department of Computer Engineering, Boğaziçi University, İstanbul, Turkey, e-mail:





Published in

International Conference on Pattern Recognition







Nasim Gazerani was born in 1983 in Arak. She holds a Master's degree in Software Engineering from UM University of Malaysia.


Professor Siavosh Kaviani was born in 1961 in Tehran. He held a professorship, and he holds a Ph.D. in Software Engineering from the QL University of Software Development Methodology and an honorary Ph.D. from the University of Chelsea.


Somayeh Nosrati was born in 1982 in Tehran. She holds a Master's degree in artificial intelligence from Khatam University of Tehran.