Voice Interfaces – The three biggest challenges

Marcel Naujeck

Senior Software Engineer, hmmh

16 November 2018

As with every trend, many see voice interfaces as a magic bullet. Yet they are not relevant to every situation. So, for which services do they offer genuine added value? What characterises a good dialogue, and how do we guarantee that a customer’s data is handled securely? Let us show you what you should be paying attention to.

In theory, voice interfaces should integrate perfectly into our everyday lives. We are accustomed to packing information into language and expressing our wishes verbally. However, this is not our only means of communicating. Information is also passed on non-verbally, often through gestures, facial expressions and tone of voice. In online chats, we try to compensate for the scant possibilities of non-verbal communication with numerous emojis. When describing superlatives, most of us turn to wild gesticulation: we use sweeping gestures to convey the size or width of something.

If we see something extraordinary and want to describe it in a phone call, email or letter, we can only do so verbally; this often feels limiting and explains why we gladly fall back on sending pictures. When we come across a great gadget online and tell a friend about it, we tend to list only a few of its attributes. We do so not only because our time is limited, but also because we know that the other person might find different features exciting. Experience tells us that it is much better to simply send friends a link to the product so that they can see for themselves what they like most about it.

The limits of verbal communication in everyday life also apply to communication with voice interfaces. Not every application has the potential to generate added value through a voice interface. Amazon Alexa’s Skill Store is an example. Skills are the voice-interface equivalent of apps in the mobile world, and the store contains many so-called ‘useless skills’: poorly rated skills that nobody uses. What characterises these useless skills? They offer the user no added value. Either they are simply not suited to voice interfaces, or their dialogues are poorly designed and thus cause user frustration. Why is that? What can be done better, and how can useless skills be avoided?

Find a meaningful application

We often use everyday phrases like “Can you just…?”, “I need a quick…” or “What was that again…?”. This is especially true when we are short on time or have our hands full. In these situations, we do not have the opportunity to sit in front of a computer or get our mobile phones out. And this is exactly where the ideal scenarios for practical voice interface usage are found.
A voice interface can provide all kinds of information, control connected systems such as smart homes, or give access to services such as rental car bookings. All ‘hands-free’ scenarios are also well suited to voice interfaces, from the mechatronics engineer who is working on an engine with oily hands and needs some information on a spare part, to the amateur cook who wants to know the next step of a recipe while kneading dough.
In such situations, software serves to make our everyday lives easier and more pleasant. And that is exactly what counts when using voice interfaces. It is a matter of short questions, logical support and fast results. Pragmatism is key. It is therefore important to consider exactly which service or application you want to offer with a voice interface and whether it will really help users in their private or professional lives.

Remember to always think in terms of dialogue and never in visual concepts

When the smartphone and mobile app revolution flooded the market, existing concepts were simply scaled down and carried over. Only over time were these concepts refined and properly adapted to the mobile format. The same mistake is easy to repeat with voice, because the way in which people process visual information is very selective. The subconscious mind acts like a filter, directing our attention to the things that are important to us. Additional information only reaches us later. Auditory perception works quite differently. Here, the subconscious mind cannot decide which information to absorb first. Instead, we process everything we hear in a predetermined order.

And this is where the first big mistake arises: when designing a skill for a voice interface, it is often wrongly assumed that all it takes is a simple adaptation of an already functioning visual concept. Yet visual concepts contain too much information for a voice interface. If you use all this content, the user is flooded with long texts and an endless amount of information. The result is both exhausting and unpleasant. For this reason, Amazon has introduced the ‘one-breath rule’. It states that the text Alexa speaks in a single interaction with the user must be no longer than what can be said in one slow breath. To ensure that the user does not feel overwhelmed and that the content suits the voice interface, it is important to examine the information to be communicated in detail and to limit both the length of the text and the amount of information.
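The one-breath rule can be turned into a rough automated check on response texts. The following is only a sketch: the speaking rate and the length of a breath are assumptions chosen for illustration, not figures published by Amazon, and the function names are hypothetical.

```python
# Rough check of the 'one-breath rule' for a spoken response.
# Both constants are illustrative assumptions, not official values.
WORDS_PER_MINUTE = 150      # assumed average speaking rate
ONE_BREATH_SECONDS = 8.0    # assumed duration of one slow breath


def estimated_speech_seconds(text: str) -> float:
    """Estimate how long it takes to speak `text` at the assumed rate."""
    word_count = len(text.split())
    return word_count * 60.0 / WORDS_PER_MINUTE


def fits_one_breath(text: str) -> bool:
    """Return True if the estimated speaking time fits into one breath."""
    return estimated_speech_seconds(text) <= ONE_BREATH_SECONDS


if __name__ == "__main__":
    prompt = "Where would you like to travel?"
    print(fits_one_breath(prompt))
```

A check like this only counts words. It says nothing about whether the content is easy to follow, so it can support a review of the dialogue texts but not replace listening to them.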

Avoid long dialogues: A second big mistake, which also stems from the adaptation of visual concepts, is overly long stretches of dialogue. Especially in e-commerce, we are used to being led through a process page by page so that, by the end, the system has all the information needed to make a purchase. These processes are stable and, in most cases, lead to success. With a voice interface, the situation is different. Even a simple multi-step question-and-answer dialogue that the interface can execute quickly may take several minutes. If the user takes too long to answer, the dialogue is usually simply ended. If something is incorrect or misunderstood, errors can follow. In addition, some well-known interfaces occasionally drop the dialogue for no apparent reason. The further such a sluggish dialogue has progressed, the more annoying this is.

To avoid this, certain basic user information can be queried when the voice interface is used for the first time and then assumed during further use. If necessary, this default data can also be obtained from another source. For example, to book a trip, the voice interface needs the following data: place of departure, destination, date, time, preferred method of travel and payment type. Suppose the user has previously stated that they live in Hamburg, mostly travel by train and usually pay by credit card, and the next possible connection is selected as the default departure date and time. The interface would then be able to make a valid booking by asking just one question, namely the destination, for example Munich. And all this without a long, possibly error-prone and repetitive question-and-answer game with a poor dynamic. The user should always be able to change the stored data afterwards.
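The booking example can be sketched as a small slot-filling routine. This is a minimal illustration and not the API of any particular voice platform: the slot names, the `merge_slots` and `missing_slots` helpers and the "next available" placeholder are assumptions made for the example.

```python
# Slots needed for a booking, following the example in the text.
REQUIRED_SLOTS = (
    "departure", "destination", "date", "time", "travel_mode", "payment_type",
)


def merge_slots(spoken: dict, profile: dict, fallbacks: dict) -> dict:
    """Fill each slot from the first source that provides a value.

    Priority: what the user just said, then the stored profile,
    then general fallbacks. Spoken values always win, so the user
    can override stored defaults at any time.
    """
    merged = {}
    for slot in REQUIRED_SLOTS:
        for source in (spoken, profile, fallbacks):
            value = source.get(slot)
            if value is not None:
                merged[slot] = value
                break
    return merged


def missing_slots(merged: dict) -> list:
    """Return the slots the interface still has to ask for."""
    return [slot for slot in REQUIRED_SLOTS if slot not in merged]


# Hypothetical stored data for the user from the example.
profile = {"departure": "Hamburg", "travel_mode": "train",
           "payment_type": "credit card"}
fallbacks = {"date": "next available", "time": "next available"}

if __name__ == "__main__":
    print(missing_slots(merge_slots({}, profile, fallbacks)))
    print(missing_slots(merge_slots({"destination": "Munich"}, profile, fallbacks)))
```

With the stored profile and fallbacks, only the destination remains open, so a single question completes the booking. Because spoken values take priority, a user who says "by plane" this time simply overrides the stored preference.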

Different phrases employed at the right time with a pleasant dynamic: Language gives us the opportunity to express a specific statement in many different ways. Linguistic variation is an expression of intelligence. So why shouldn’t voice interfaces also vary their formulations? A more dynamic flow and a larger stock of phrases make the process and the overall interaction feel much more natural. The interface adapts to the user, instead of the other way around. These linguistic adjustments should also take repeated use of the interface into account. If the interface explains everything in detail the first time it is used, the usage instructions should not be repeated later unless the user asks for them.
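A simple way to approximate this behaviour is to keep several formulations per prompt and to track how often the user has already used the interface. The sketch below is illustrative only; the prompt texts and the `choose_prompt` function are invented for the example.

```python
import random
from typing import Optional

# Detailed explanation, spoken only on first use (example text).
FIRST_USE_PROMPT = (
    "Welcome. You can book a trip by telling me your destination. "
    "Where would you like to travel?"
)

# Short variants for returning users (example texts).
SHORT_PROMPTS = [
    "Where to?",
    "Where would you like to go?",
    "What is your destination?",
]


def choose_prompt(previous_uses: int, last_prompt: Optional[str],
                  rng: random.Random) -> str:
    """Pick a prompt: detailed on first use, short and varied afterwards."""
    if previous_uses == 0:
        return FIRST_USE_PROMPT
    # Avoid repeating the phrase the user heard most recently.
    candidates = [p for p in SHORT_PROMPTS if p != last_prompt]
    return rng.choice(candidates)


if __name__ == "__main__":
    rng = random.Random()
    print(choose_prompt(0, None, rng))
    print(choose_prompt(3, "Where to?", rng))
```

Passing the random generator in as a parameter keeps the choice testable; in production a single shared generator would do.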

In situations where the user needs help, there is also a lot to take into account. It is not always clear how to use a voice interface, so there should be an option to ask for help. The interface can take into account the situation in which users find themselves. After all, it knows whether the user is currently in the shopping cart or specifying a date for a trip. This makes it easy to answer a help request with shopping cart-related help precisely when the user is dealing with the shopping cart. This knowledge should definitely be harnessed to provide the best possible in-situ support.
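Context-sensitive help of this kind can be as simple as a lookup keyed by the current dialogue state. The following sketch assumes the interface tracks a context name such as `"shopping_cart"`; the context names and help texts are made up for illustration.

```python
from typing import Optional

# Help texts per dialogue context (illustrative examples).
HELP_BY_CONTEXT = {
    "shopping_cart": "You can say 'remove item', 'change quantity' or 'check out'.",
    "travel_date": "Tell me a date, for example 'next Friday' or 'the third of May'.",
}

# Fallback when the context is unknown or has no specific help.
GENERAL_HELP = "You can ask me to book a trip or to open your shopping cart."


def help_text(context: Optional[str]) -> str:
    """Return help that matches where the user currently is in the dialogue."""
    return HELP_BY_CONTEXT.get(context, GENERAL_HELP)


if __name__ == "__main__":
    print(help_text("shopping_cart"))
    print(help_text(None))
```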

Ensuring secure dialogues

As with any software development, data security is a key issue when developing voice interfaces. So, what must be considered during analysis and conception? In the article ‘Voice Interfaces – The Here and Now’, the big players were put under the magnifying glass. The interfaces it describes are all cloud-based. Thus, the language analysis and processing does not take place locally on the user’s device, but in the provider’s data centres. Under the GDPR, these providers not only have to disclose where their processing servers are located, but also have to comply with the applicable regulations.

However, the question arises: why would a financial service provider or health insurance company want to store highly sensitive customer data in the cloud of a foreign company? Amazon, for example, offers a high level of security when accessing its services through encrypted transmission and authentication via OAuth 2.0, yet everything else in its infrastructure is invisible to users and developers. It is almost impossible to anonymise a voice interface that works with sensitive data in such a way that no knowledge about the user can be obtained on the cloud side. Everything that is said is processed in the cloud, as is everything that the interface communicates to the user. Cloud-based voice interfaces are therefore only advisable in situations where no sensitive data is handled.

Why the cloud? The blessing and curse of current voice interfaces is that sentence transcription and analysis are based on machine-learning technology. Once a dialogue model has been developed, the system must learn this model so that it can then understand similar sentence variants. This ‘learning’ is a computationally intensive process that is performed on server hardware. From this perspective, cloud solutions are both pragmatic and, seemingly, essential. However, there are a few solutions in the field of voice interfaces that can run on local machines or servers. For example, with its speech recognition software Dragon, the software manufacturer Nuance offers a tool that enables transcription on local hardware.

What needs to be considered when dealing with PINs and passwords? Another aspect of data security is the type of interface in question. With a visual interface, it is easy to glance around and check whether anyone is watching while we enter a password. With spoken language, this is far more problematic, because anyone within earshot can listen in, which makes eavesdropping on security-sensitive data easy. PINs and passwords should therefore never be part of a voice interface. Here, combining the voice interface with a visual component is more advisable. The user is authenticated via the visual component, while further operations are carried out using the auditory component.
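The split between visual authentication and spoken operations might look like the following sketch. It assumes a hypothetical session object whose `authenticated` flag is set only by the visual component, for example a companion app, after its own login; the intent names are invented for the example.

```python
from dataclasses import dataclass

# Intents that require a prior login (hypothetical names).
SENSITIVE_INTENTS = {"show_balance", "transfer_money"}


@dataclass
class Session:
    """Shared state between the visual and the auditory component."""
    authenticated: bool = False


def mark_authenticated(session: Session) -> None:
    """Called by the visual component after its own successful login."""
    session.authenticated = True


def handle_voice_intent(session: Session, intent: str) -> str:
    """Run a spoken request; never ask for a PIN or password by voice."""
    if intent in SENSITIVE_INTENTS and not session.authenticated:
        return "Please confirm your login in the app first."
    return f"OK, running {intent}."


if __name__ == "__main__":
    session = Session()
    print(handle_voice_intent(session, "show_balance"))
    mark_authenticated(session)
    print(handle_voice_intent(session, "show_balance"))
```

The voice handler never receives a secret. It only checks a flag that the visual channel controls, so nothing security-relevant has to be spoken aloud.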

Conclusion

The handling of sensitive data still represents one of the biggest challenges when using voice interfaces. Here, it is important to work with a particularly critical eye and design dialogues accordingly. Security prompts such as PINs and passwords should never be part of a voice interface dialogue. Tempting as it may be, visual concepts should never be transferred directly to a voice interface. Doing so overwhelms the user and leads to dialogues being cut off because they are too long or because errors occur. If all of these points are taken into consideration, the user will find working with a voice interface pleasant, natural and helpful. Of course, whether the interface makes sense overall largely depends on the concept and field of application.

This is the third part of a four-part series on the subject of voice interfaces:

Part 1: ‘Voice Interfaces – A Trend with History’
Part 2: ‘Voice Interfaces – The Here and Now’
Part 3: ‘Voice Interfaces – The 3 Biggest Challenges’
Part 4: ‘Voice Interfaces – Looking to the Future’
