User experience with voice user interfaces – a diary study using Amazon Echo as an example

Voice user interfaces are by no means a new phenomenon. Siri has been on the market since 2011. With the seemingly omnipresent advertising for Amazon Echo, however, the topic is now more relevant than ever, and the Echo’s competitors are very active in this field too: Google with Assistant, Microsoft with Cortana and Facebook with its Messenger assistant M are all vying for consumer favour. Graphical interfaces are increasingly being replaced by voice user interfaces, or used only in combination with them, since speech has been the simplest and most widely used form of communication for millennia. Even today, voice user interfaces are not without their faults. The combination of artificial intelligence and machine learning, however, means that there’s a bright future on the horizon for this technology, even if it’s not suitable for every human-machine interaction. When an interface requires complex selection processes that place high demands on working memory, or when multiple pieces of information need to be processed at the same time, written language and visually presented content are often more appropriate.

Why this study?

The starting point of this study was to establish how acceptance and use of voice user interfaces develop among users during an initial test phase. Using the example of Amazon Echo, we wanted to find out whether voice interaction would make an impression on users, what aspects make for a good user experience and how this technology might be adopted. We were particularly interested in how the development of skills can contribute to a better UX.

Diary study monitored users during their first test of Alexa

To answer these questions, Facit Digital carried out a two-week diary study with 26 first-time Amazon Echo users, in cooperation with the voice interface agency VUI.agency. As part of the study, the participants were given an Echo Dot, which they activated using their own Amazon accounts. Every two days, the users were asked to describe their thoughts and feelings when using the Echo, where problems arose and what impressed them. They were also asked to use “Brain Challenge”, a skill designed specifically for the experiment that offers mental arithmetic, riddles, quizzes and brain training at three difficulty levels.

Our study partner VUI.agency developed two versions of the skill. In the version without a tutorial, users had to navigate it without active help prompts; the tutorial version regularly offered help explaining how to navigate it.

Expectations of Alexa were often met

The results show that after the two-week test period, over half of the participants reported that Alexa had met their expectations. They had great fun interacting with the assistant and liked the friendly voice, the ease of interaction and the wide range of skills available. The average daily usage time for this group was around 30 minutes.

42% of the users felt that their expectations were only partially met or not met at all. They often found themselves frustrated by Alexa’s unnatural communication and inflexible behaviour. A recurring problem was that the voice assistant did not completely understand the users, partly because they were too far away from the device and partly because of background noise. If the Echo was playing music, for example, users had to “shout” to make Alexa hear them. The same was true for foreign words or names, which Alexa often failed to understand correctly.

But even when correct commands were given, some participants still had difficulties. Some, for example, could not remember the right invocation name to launch the skill, so the command did not work. In other cases, commands were not given in the correct order, which meant that skills were not executed correctly. Relatedly, users criticised Alexa’s trouble understanding context. To the question “When was Mozart born?”, Alexa gave the correct answer of 1756. But she could not answer the follow-up question “And where was he born?”, because she no longer made the connection with Mozart. Users said they would appreciate more “intelligence” in this regard.
The more critical users repeatedly expressed unease during the experiment at the idea of a device that is constantly “eavesdropping”.

Not much impressive content

The content on offer is obviously a decisive factor in the user experience. During the two-week test period, only a few of the most frequently used skills deviated from the “standard content”: listening to the news, music or radio, using Spotify, and checking weather and traffic updates.

Use and user experience over time

Excluding the time spent setting up the Echo on the first day, usage quickly averaged out at between 20 and 25 minutes per day. The most enthusiastic users spent an average of around 30 minutes a day with the device; the less convinced users averaged around 15 minutes per day.

We also observed a steep learning curve. After just one week, fewer than half of the users reported having learned something new about interacting with Alexa; after two weeks, this proportion sank to under a third. Everything suggests that Alexa is relatively easy to learn to use.

This also explains the change in the way people reacted to Alexa over the course of the two weeks.

After the initial “honeymoon period”, interest in Alexa dwindled, particularly when participants ran into issues or snags in the interaction. But satisfaction soon recovered, which is linked to how quickly users were able to learn to use Alexa, as mentioned above. How much people liked Alexa grew from the beginning onwards, reaching its high point after around 10 days; after that, the familiarisation effect caused the positive feelings to fade somewhat. For a test lasting two weeks, however, the approval rating was still remarkably high.

Satisfaction rests on skill design

The comparison of performance indicators for the two skill versions is particularly interesting. We could not confirm the hypothesis that people would become less satisfied with the skill featuring a detailed tutorial after using it for an extended period of time. Quite the opposite: the participant group with the tutorial version was significantly more satisfied over the course of the entire experiment. The skill with no tutorial made up some ground towards the end of the experiment, but could not catch up with the tutorial version.

Questioning the participants directly also revealed that users who were given no tutorial would have appreciated some guidance. Even more surprisingly, almost a third of the group with the extensive tutorial wanted even more help, and only 41% in this group wanted less. This shows that the guidance and assistance features were received very positively.

The same holds for the results of the subjective assessment of the voice recognition system. The following figures show the perceived relative differences in opinion between users with the tutorial version of the skill and those without.

Conclusion: What makes a good skill?

A well-designed skill walks the user through its functions and how to use them. Just as with graphical interfaces, helpful dialogues make a skill easier to use and its users more satisfied. A familiar example from graphical user interfaces is the dialogue window: “Would you like to save before closing the file?” or “Open recently viewed files?” If such prompts do not appear, users may get lost or make mistakes. This applies even more to voice user interfaces. When skills are professionally developed, the needs and capabilities of users interacting by voice need to be taken into account from the start. Facit Digital and VUI.agency have the tools and capabilities to support this process and thereby contribute to a better experience for the user.

Expert recommendation by Patrick Esslinger (VUI.agency): In the study, the skill with a tutorial performed much better than the stripped-down version without one. But we must not forget that more help means more time: unlike written prompts, spoken instructions and explanations cannot simply be skipped or clicked through. A middle ground between a guided and an unguided user experience is to gradually decrease the number of prompts. When the skill is first opened, one or several detailed explanations of its functions are read aloud, depending on how complex the skill is. On the second launch there are fewer prompts, and from the third launch onwards none at all (the explanations could be replayed after the skill has not been used for more than 14 days, for example).
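To illustrate, here is a minimal sketch in Python of how such a graduated prompt schedule could be decided. This is not the actual Alexa Skills Kit API; the function name, thresholds and return values are illustrative assumptions based on the recommendation above.

    from datetime import datetime, timedelta
    from typing import Optional

    # Illustrative threshold, assumed from the recommendation above.
    REPLAY_AFTER = timedelta(days=14)  # replay the tutorial after 14 days of inactivity

    def tutorial_level(launch_count: int, last_used: Optional[datetime]) -> str:
        """Decide how much guidance to read aloud on this launch.

        launch_count -- how many times the skill has been opened, including this one
        last_used    -- timestamp of the previous session, or None on first use
        """
        # Long absence: treat the user like a newcomer again.
        if last_used is not None and datetime.now() - last_used > REPLAY_AFTER:
            return "full"
        if launch_count <= 1:
            return "full"   # first launch: detailed explanation of the skill's functions
        if launch_count == 2:
            return "short"  # second launch: brief reminders only
        return "none"       # third launch onwards: no prompts at all

    # Example: third session, last used yesterday -> no tutorial prompts
    print(tutorial_level(3, datetime.now() - timedelta(days=1)))  # prints "none"

In a real skill, values like the launch count and the time of last use would typically be stored between sessions, so the decision can be made each time the skill is opened.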
