Extending Automated Usability Evaluation Tools for Multimodal Input




In this work the following three basic research questions are discussed: (1) can significant effects of modality efficiency and input performance on the selection of input modalities in multimodal HCI be disclosed by unified experimental investigations? (2) Can a utility-driven computational model of modality selection be formed based on empirical data? (3) Can the compiled model for modality selection be utilized for the practical application in the field of automated usability evaluation?

Initially, foundations of decision-making in multimodal HCI are discussed, and the state of the art in automated usability evaluation (AUE) is described. It is shown that there are currently no uniform empirical results on factors influencing modality choice that allow for the creation of a computational model. As part of this work, two AUE tools, the MeMo workbench and CogTool, are extended by a newly created computational model for the simulation of multimodal HCI.

Aiming at answering the first research question, the empirical part of the thesis describes three experiments with a mobile application integrating touch screen and speech input. In summary, the results indicate that modality efficiency and input performance are important moderators of modality choice. The second research question is answered by the derivation of a utility-driven model for input modality choice in multimodal HCI based on the empirical data. The model provides probability estimations of modality usage, based on different levels of the parameters modality efficiency and input performance. Four variants of the model that differ in training data are tested. The analysis reveals a considerable fit for models based on averaged modality usage data.
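As a rough illustration of what a utility-driven choice model of this kind can look like, the following Python sketch maps modality efficiency (steps per modality) and input performance (error rates) to a probability of choosing speech. The functional form (negative expected steps as utility, a logistic choice rule), the `sensitivity` parameter, and all function names are assumptions made for illustration; they are not the model derived in the thesis.

```python
import math


def expected_steps(nominal_steps: int, error_rate: float) -> float:
    """Expected number of inputs when each input fails independently with
    probability `error_rate` and has to be repeated (illustrative assumption)."""
    if not 0.0 <= error_rate < 1.0:
        raise ValueError("error_rate must be in [0, 1)")
    return nominal_steps / (1.0 - error_rate)


def speech_probability(steps_touch: int, steps_speech: int,
                       err_touch: float, err_speech: float,
                       sensitivity: float = 1.0) -> float:
    """Probability of choosing speech over touch.

    Utility is assumed to be the negative expected effort of a modality.
    With two options, a softmax over utilities reduces to a logistic
    function of the utility difference.
    """
    u_touch = -expected_steps(steps_touch, err_touch)
    u_speech = -expected_steps(steps_speech, err_speech)
    return 1.0 / (1.0 + math.exp(-sensitivity * (u_speech - u_touch)))


if __name__ == "__main__":
    # Speech saves two steps and both modalities work flawlessly.
    print(round(speech_probability(3, 1, 0.0, 0.0), 3))
    # Same task, but speech recognition fails on 30% of inputs.
    print(round(speech_probability(3, 1, 0.0, 0.3), 3))
```

Under these assumptions the sketch reproduces the qualitative pattern reported in the experiments: the probability of speech rises when speech saves steps and falls when its error rate increases.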

Answering the third research question, it is illustrated how the modality choice model can be deployed within AUE tools for simulating multimodal interaction. The multimodal extension as well as the practical utilization of MeMo is depicted, and it is described how unimodal CogTool models of touch screen and speech-based interaction can be combined into multimodal models. A comparison of data generated by simulations with the AUE tools with predictions of the derived modality selection algorithm verifies the correct integration of the model into the tools. The practical application demonstrates the usefulness of the modality choice model for predicting the number of steps and the total time needed to solve specific tasks with multimodal systems. The practical part is concluded by a comparison of MeMo and CogTool. Both tools are classified, and an assessment on a subjective basis as well as on the basis of the quality of predictions is conducted.


Recent developments in human-computer interaction (HCI) show that increasing numbers of dialog systems offer more than one option for user input. Smartphones or navigational systems often come with an additional speech interface (Schaffer et al., 2016). Speech-based interactive systems are a subject of current research (Jokinen and Cheng, 2010), and with new devices such as smartwatches, speech as an input modality is becoming increasingly important (Nurminen et al., 2015). However, the Graphical User Interface (GUI) is still the more common input modality in those systems.

Touch screens are now a standard input method for the GUI. If an additional speech interface is integrated into a system where only a GUI was previously available, a multimodal dialog system (MDS) is built. Examples of such systems are Apple’s iPhone extended with the speech interface Siri (Apple, 2011) and Android smartphones provided with Google Voice Search (Franz et al., 2006). Both systems enable the use of selected spoken commands for specific interactions. Employing speech input often saves interaction steps or time. A novice user who wants to exploit this benefit, however, will not know exactly whether or how a speech-based interaction is possible. Experience is needed and, if the user is not accustomed to the system, reasoning and decision-making processes increase the cognitive load (Sweller, 1988). One possible way to avoid this additional load would be to make any modality possible at any point in the interaction.

Input modalities would then be processed sequentially and independently (Nigay and Coutaz, 1993). Sequential processing here means that users may perform system input only consecutively regardless of which modality they use. Independent processing means that modalities are interpreted separately and no semantic fusion is performed. In doing so, a system would provide a graphical input element like a button and a speech input for each interaction. The previously mentioned smartphone examples fit only partly into this category of systems, as they integrate speech interaction only for very specialized tasks like menu navigation or keyboard typing.
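The two properties can be made concrete with a small Python sketch. It is only an illustration of sequential, independent input handling: the event type, the command tables, and the function names are hypothetical and do not describe any of the systems mentioned here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InputEvent:
    modality: str  # "touch" or "speech"
    payload: str   # button id or recognized utterance


# Hypothetical per-modality command tables. Every command is reachable
# through a button and through a spoken phrase.
TOUCH_COMMANDS = {"btn_search": "search", "btn_back": "back"}
SPEECH_COMMANDS = {"search": "search", "go back": "back"}


def interpret(event: InputEvent):
    """Interpret one event using only its own modality (no semantic fusion)."""
    if event.modality == "touch":
        table = TOUCH_COMMANDS
    elif event.modality == "speech":
        table = SPEECH_COMMANDS
    else:
        raise ValueError(f"unknown modality: {event.modality}")
    return table.get(event.payload.lower())


def process(events):
    """Handle events strictly one after another, whatever their modality."""
    commands = []
    for event in events:
        command = interpret(event)
        if command is not None:  # unrecognized input is simply ignored
            commands.append(command)
    return commands


if __name__ == "__main__":
    dialog = [InputEvent("touch", "btn_search"), InputEvent("speech", "Go back")]
    print(process(dialog))
```

Because each event is resolved on its own, the user can switch modality between any two steps without the system having to combine inputs.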

In the domain of consumer products, almost no sequential independent multimodal systems (SIMS) that allow any interaction using any modality have appeared on the market so far. An unsolved issue with speech input is that direct interaction via spoken commands still does not work sufficiently well because current automatic speech recognition (ASR) modules cause too many errors, e.g., by processing extraneous background noise as user input. Accordingly, push-to-talk or "open microphone in combination with keyword-spotting systems" (OKS) are used to enable speech input. However, these implementations need at least one additional step during the interaction to activate ASR.

Assuming further improvements in ASR technology during the next years, future SIMS might have no need for push-to-talk or OKS. Alternatively, OKS might be implemented in an effective manner that minimizes interaction time and cognitive resources. With these systems, both graphical and speech input will be available for every achievable task. Consequently, users will have to select an input modality at each step of the dialog with the system.

The selection of input modalities is influenced by various factors. Several studies revealed that input performance (like ASR error rate) and modality efficiency (measured in the number of turns to solve a task) are significant moderators in multimodal systems integrating a graphical and speech input (Möller et al., 2011; Schaffer et al., 2011a; Wechsung et al., 2010). Other documented influence factors are cognitive demand (Wickens and Hollands, 2000), interaction time (Bevan, 1995), hedonic quality (Hassenzahl et al., 2003), environment, and dynamic as well as static user attributes (Bohn et al., 2005; Ren et al., 2000).

The evaluation of the quality of MDS is a current research area (Möller et al., 2011; Perakakis and Potamianos, 2008; Turunen et al., 2010). Many methods are based on evaluations that use questionnaires to capture quality perceptions from real system users (Metze et al., 2009; Kühnel et al., 2010; Turunen et al., 2010). However, studies with real test subjects are usually expensive with respect to typically limited resources of time and money. Here, effort could be saved by automated usability evaluation (AUE). In AUE, predictions about quality factors of a system can be created automatically. The predictions are usually derived from parameters that are obtained by simulating the interaction between models of users and systems.


Future developments in HCI will enable sequential independent multimodal systems (SIMS), thereby enabling free choice of input modalities. Users’ modality choice is moderated by various factors. The motivation of this work was to examine the factors of input performance and modality efficiency and to build a model enabling the prediction of modality usage. The usefulness of the model was demonstrated by the deployment in the AUE tools MeMo and CogTool. The foundations of the work were laid in Chapter 2 starting with an introduction to multimodal interaction including theories of human decision making, and to the topic of automated usability evaluation. Further, the two AUE tools MeMo and CogTool were introduced. Finally, the research questions of the work were formulated.

The empirical results from three experiments, reported in Chapter 3, reveal that users of multimodal systems adapt modality usage to estimated modality efficiency as well as to input performance of modalities. On the one hand, speech input is increasingly preferred if speech gets more efficient in terms of interaction steps. On the other hand, the usage probability of a modality decreases if its input performance is limited (e.g., due to ASR errors or touch screen malfunction). Previous research, mostly in line with these findings, revealed rather discrete insights into the continuum of parameters influencing modality choice (Bilici et al., 2000; Wechsung et al., 2010). The presented series of experiments describes the relationships of multiple factor levels and gives a coherent idea about essential moderators of modality choice. The empirical results turned out to be consistent across experiments. The empirical findings answer the first research question (RQ1): significant effects of modality efficiency and input performance on modality selection in multimodal HCI can be disclosed by unified experimental investigations.

The presented theoretical foundations and the observed user behavior inspire a utility theory-driven model that is derived in Chapter 4. The model forecasts the average user's modality choice behavior with considerable predictive power. Model comparison revealed that an integrative model that incorporates data about all available input performance conditions is well suited for estimating modality usage. Particularly high prediction performance on unseen data and on conditions resting on sparse data indicates the reliability of the integrative model. If individual subject data is predicted, substantial variance in individual modality usage profiles leads to decreased accuracy. Individual users appear to have different interaction strategies than those demonstrated by the model. An application example showed that the model is able to simulate plausible interaction between an average user model and a system model. Predicted average speech usage is mostly in line with human data. The simulation was realized as a state machine, which is a common concept in the AUE area. The modality selection mechanism can therefore beneficially extend existing AUE tools. A utility-driven computational model of modality selection could be formed based on the empirical data, which provides an answer to research question 2 (RQ2).
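To illustrate how a state-machine simulation with probabilistic modality selection can yield step predictions, consider the following minimal Python sketch. It is not the simulation used in the thesis: the linear task structure, the fixed speech probability `p_speech`, the shortcut length `speech_jump`, and the retry-on-error behavior are all simplifying assumptions.

```python
import random


def simulate_task(n_states, p_speech, speech_jump, err_speech, err_touch, rng):
    """Run one simulated dialog and return (steps, speech_uses).

    The task is a chain of `n_states` states. A successful touch input
    advances one state; a successful speech input advances `speech_jump`
    states (speech as a shortcut). A failed input leaves the state unchanged.
    """
    if not (0.0 <= err_speech < 1.0 and 0.0 <= err_touch < 1.0):
        raise ValueError("error rates must be in [0, 1)")
    state, steps, speech_uses = 0, 0, 0
    while state < n_states:
        steps += 1
        if rng.random() < p_speech:          # simulated user chooses speech
            speech_uses += 1
            if rng.random() >= err_speech:   # input recognized correctly
                state += speech_jump
        else:                                # simulated user chooses touch
            if rng.random() >= err_touch:
                state += 1
    return steps, speech_uses


def mean_steps(runs, seed=0, **task):
    """Average number of interaction steps over `runs` simulated dialogs."""
    rng = random.Random(seed)
    total = sum(simulate_task(rng=rng, **task)[0] for _ in range(runs))
    return total / runs


if __name__ == "__main__":
    task = dict(n_states=4, p_speech=0.6, speech_jump=4,
                err_speech=0.2, err_touch=0.0)
    print(mean_steps(10_000, **task))
```

Averaging over many runs corresponds to predicting the behavior of an average user; a single run corresponds to one possible interaction path through the system model.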

The utilization of the modality selection model for the MeMo workbench and for CogTool-based simulations is presented in Chapter 5. The multimodal extension of the MeMo workbench was documented, and the creation of a multimodal system model of the RBA was outlined. Compared to the predictions of the bare algorithm, the modality selection behavior of the MeMo simulations showed good results. The corrected predictions of the total number of interaction steps of three tasks with different modality efficiency provide useful results for two different error conditions. In combination with the MeMo reports, the prediction results provided valuable insights into the usability of multimodal interaction. The reports revealed realistic modality usage as well as different possible interaction strategies. The uncorrected prediction includes interaction errors of the model, indicating a usability problem of the RBA: within the list screens a back button is missing.

Regarding CogTool, the development of a multimodal procedure is documented that combines a touch screen ACT-R model and a speech ACT-R model generated by CogTool into one multimodal ACT-R model. The multimodal model is augmented with the modality selection algorithm. The modeling of the RBA with CogTool and the adaptations made to the multimodal ACT-R model are outlined. For the CogTool-based predictions of modality selection, the comparison to the predictions of the bare algorithm also showed good results. The usefulness of the multimodal ACT-R model was illustrated by an application example for the prediction of total task duration. CogTool was used to generate baseline predictions that were compared to human data and the predictions of the multimodal ACT-R model. In combination with the CogTool reports, the prediction results provided valuable insights into the usability of multimodal interaction. The results revealed realistic performance predictions. Further, the ACT-R simulation provides several automatically generated task solutions which would have to be demonstrated manually with CogTool.

Both tools have been compared by means of taxonomy, subjective assessment, and common performance measures. The results showed that both tools are on an equal footing in all disciplines and that the tools per se have different fields of application. It was concluded that using both tools in sequence could be valuable if a comprehensive analysis based on AUE methods is to be conducted. The utilization of the derived model for modality selection in MeMo and CogTool answered research question 3 (RQ3).





Stefan Schaffer





Published in

Technische Universität Berlin, Doctoral Thesis








