As with every trend, many see voice interfaces as a magic bullet. Yet their application is not relevant to every situation. So, for which services do they offer a genuine incremental value? What characterises a good dialogue and how do we guarantee that a customer’s data is handled securely? Let us show you what you should be paying attention to.

In theory, voice interfaces should fit perfectly into our everyday lives. We are accustomed to packing information into language and expressing our wishes verbally. However, this is not our only means of communicating. Information is also passed on non-verbally, often through gestures, facial expressions and tone. In online chats, we try to compensate for the scant possibilities of non-verbal communication with the help of numerous emojis. When describing superlatives, most of us turn to wild gesticulation. For example, we use sweeping gestures to explain the size or width of something. If we see something extraordinary and want to describe it in a phone call, email or letter, we can only do so verbally; this often feels limiting and explains why we gladly rely on sending pictures. With countless gadgets available online, when we come across a great one and tell a friend about it, we tend to enumerate only a few of its attributes. We do so not only because our time is limited, but also because we know that our counterparts might find different features exciting. Experience tells us that it is much better to simply send friends a link to the product so that they can see for themselves what they like most about the gadget.

What holds for verbal communication in everyday life also holds for communication with voice interfaces. Not every application has the potential to generate added value through the use of a voice interface. One example is Amazon Alexa’s Skill Store. Skills are the voice interface equivalent of apps in the mobile world, and the store contains a lot of so-called ‘useless skills’: poorly rated skills that nobody uses. But what characterises these useless skills? They have no incremental value for the user. Either they are simply not suited to voice interfaces, or their dialogues are poorly designed and thus cause user frustration. But why is that? What can be done better, and how can useless skills be avoided?

Find a meaningful application

We often use everyday phrases like “Can you just…?”, “I need a quick…” or “What was that again…?”. This is especially true when we are short on time or have our hands full. Especially in these situations, we do not have the opportunity to sit in front of a computer or to get our mobile phones out. And this is exactly where the ideal scenarios for practical voice interface usage are found.
The possible uses range from providing all kinds of information, to controlling connected systems such as smart homes, to using services such as rental car bookings. All ‘hands-free’ scenarios are also predestined for voice interfaces, from the mechatronics engineer who is working on an engine with oily hands and needs some information on a spare part, to the amateur cook who wants to know the next step of a recipe while kneading dough.
In such situations, software serves to make our everyday lives easier and more pleasant. And that’s exactly what counts when using voice interfaces. It’s about short questions, logical support and fast results. Pragmatism is key. It is therefore important to consider exactly which service or application you want to offer with a voice interface and whether it will really help users in their private or professional lives.

Remember to always think in terms of dialogue and never in visual concepts

When the smartphone and mobile app revolution swept the market, existing concepts were simply scaled down and carried over. Only over the course of time were these concepts refined and properly adapted to the mobile format. The way in which people process visual information is very selective. The subconscious mind acts like a filter, directing our attention to the things that are important to us. Additional information only reaches us later. By contrast, auditory perception works quite differently. In this case, the subconscious mind cannot decide which information to absorb first. Instead, we process everything we hear in a predetermined order.

And this is where the first big mistake arises: when designing a skill for a voice interface, it is often falsely assumed that all it takes is a simple adaptation of an already functioning visual concept. Yet visual concepts contain too much information for a voice interface. If you use all this content, the user is flooded with long texts and an endless amount of information. The result is both exhausting and unpleasant. For this reason, Amazon has introduced the ‘one-breath rule’. It states that the text Alexa speaks in a single interaction with the user should be no longer than can be said in one slow breath. To ensure that the user does not feel overwhelmed and that the voice interface is easy to follow, it is important to examine the information to be communicated in detail and to limit both the length of the text and the amount of information.
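A length check along the lines of the one-breath rule could be sketched as follows. This is only a rough illustration in Python: the speaking rate, the breath length and the function names are assumptions made for this example and do not come from Amazon.

```python
# Hypothetical thresholds: neither value comes from Amazon's guidance.
WORDS_PER_MINUTE = 150  # assumed average speaking rate
MAX_SECONDS = 8.0       # assumed length of "one slow breath"


def estimated_seconds(prompt: str) -> float:
    """Estimate how long a prompt takes to speak, based on its word count."""
    return len(prompt.split()) / WORDS_PER_MINUTE * 60


def fits_one_breath(prompt: str) -> bool:
    """Return True if the estimated speaking time stays within the limit."""
    return estimated_seconds(prompt) <= MAX_SECONDS


short_prompt = "Your train to Munich leaves at nine. Shall I book it?"
long_prompt = " ".join(["word"] * 40)  # stand-in for text lifted from a web page

print(fits_one_breath(short_prompt))
print(fits_one_breath(long_prompt))
```

A check like this can run over every response text during design, so that overly long prompts are split up or shortened before a user ever hears them.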

Avoid long dialogues: A second big mistake in terms of dialogue, which also stems from the adaptation of visual concepts, is overly long stretches of dialogue. Especially when it comes to e-commerce, we are used to being led through a process page by page so that by the end of the process, the system contains all the information needed to make a purchase. These processes are stable and, in most cases, lead to success. With a voice interface, the situation is different. Even a simple, multi-step, question-answer dialogue that the interface itself can execute quickly can still take several minutes. If the user takes too long to answer, the dialogue is usually simply ended. If something is incorrect or misunderstood, it can lead to errors. In addition, some well-known interfaces simply drop the dialogue for no apparent reason. The further such a sluggish dialogue has progressed, the more annoying this is.

In order to avoid this, certain basic user information can be queried when a voice interface is used for the first time and then assumed during further use. If necessary, you can also obtain this default data from another source. For example, if a user wants to book a trip to Munich, the voice interface needs the following data: place of departure, final destination, date, time, preferred method of travel and payment type. The user has previously stated that they live in Hamburg, mostly travel by train and often pay by credit card. The next possible time is selected as the default departure time. The interface would therefore be able to make a valid booking by asking just one question, namely the destination. And all this without a long, repetitive and possibly error-prone question-and-answer game with a poor dynamic. The user should always be able to change the existing data afterwards.
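The default-based booking described above can be sketched as follows. This is a minimal illustration in Python, not a real assistant API: `UserProfile`, `fill_slots`, `missing_slots` and the sample departure value are all hypothetical names and data chosen for this example.

```python
from dataclasses import dataclass

# The six pieces of information the booking needs, as listed in the text.
REQUIRED_SLOTS = ("departure", "destination", "date", "time",
                  "travel_method", "payment_type")


@dataclass
class UserProfile:
    """Hypothetical store for answers the user gave during first use."""
    home_city: str
    preferred_travel_method: str
    preferred_payment_type: str


def fill_slots(profile: UserProfile, spoken: dict, next_departure: tuple) -> dict:
    """Merge profile defaults with what the user just said; spoken values win."""
    date, time = next_departure
    slots = {
        "departure": profile.home_city,
        "destination": None,  # no sensible default, so this must be asked
        "date": date,
        "time": time,
        "travel_method": profile.preferred_travel_method,
        "payment_type": profile.preferred_payment_type,
    }
    slots.update(spoken)
    return slots


def missing_slots(slots: dict) -> list:
    """List the slots the interface still has to ask for."""
    return [name for name in REQUIRED_SLOTS if slots.get(name) is None]


profile = UserProfile("Hamburg", "train", "credit card")
next_departure = ("2018-06-01", "09:15")  # made-up value for illustration

slots = fill_slots(profile, spoken={}, next_departure=next_departure)
print(missing_slots(slots))  # ['destination']

slots = fill_slots(profile, {"destination": "Munich", "payment_type": "invoice"},
                   next_departure)
print(missing_slots(slots))  # []
```

Because spoken values overwrite the defaults, the user can still change any stored detail, here the payment type, without being asked about it every time.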

Different phrases employed at the right time with a pleasant dynamic: Language gives us the opportunity to express a specific statement in many different ways. Linguistic variation is an expression of intelligence. So why shouldn’t voice interfaces also vary their formulations? A livelier dynamic and a wider range of phrases make the process and the overall interaction feel much more natural. The interface adapts to the user, instead of the other way around. These linguistic adjustments should also take repeated use of the interface into account. If the interface explains everything in detail the first time you use it, it should not repeat these usage instructions later unless the user asks for them.
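As a rough sketch of this idea, the following Python snippet picks one of several phrasings at random and adds usage instructions only on first use. The phrases, the `visits` counter and the function name are assumptions made for illustration, not part of any particular voice platform.

```python
import random

# Hypothetical phrasings for one and the same confirmation.
CONFIRMATIONS = [
    "Done, your trip is booked.",
    "All set, I have booked your trip.",
    "Your trip is booked.",
]
INSTRUCTIONS = "You can say 'change date' or 'cancel' at any time."


def confirm(visits: int, rng: random.Random) -> str:
    """Pick a phrasing at random; explain the commands only on first use."""
    phrase = rng.choice(CONFIRMATIONS)
    if visits == 0:
        return f"{phrase} {INSTRUCTIONS}"
    return phrase


rng = random.Random()
print(confirm(visits=0, rng=rng))  # first use: confirmation plus instructions
print(confirm(visits=5, rng=rng))  # repeated use: confirmation only
```

Passing the random generator in as a parameter keeps the variation testable, since a seeded generator makes the choice reproducible.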

In situations where the user needs help, there is also a lot to take into account. It is not always clear how to use a voice interface, so users should have the option of asking for help. The interface can take into account the situation in which the user finds themselves. After all, it knows whether the user is currently in the shopping cart or specifying a date for a trip. This makes it easy to provide shopping cart-related help precisely when the user is dealing with the shopping cart. This knowledge should definitely be harnessed to provide the best possible in-situ support.
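Context-dependent help of this kind can be as simple as a lookup keyed by the current dialogue state. The following Python sketch assumes hypothetical state names and help texts; a real skill would take the state from its own dialogue management.

```python
# Hypothetical dialogue states and help texts, for illustration only.
HELP_TEXTS = {
    "shopping_cart": "You can say 'remove item', 'change quantity' or 'check out'.",
    "travel_date": "Tell me a day, for example 'next Friday' or 'the third of May'.",
}
GENERAL_HELP = "You can ask me to book a trip or to open your shopping cart."


def help_for(state: str) -> str:
    """Return help for the current dialogue state, or general help if unknown."""
    return HELP_TEXTS.get(state, GENERAL_HELP)


print(help_for("shopping_cart"))  # cart-specific help
print(help_for("main_menu"))      # falls back to general help
```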

Ensuring secure dialogues

As with any software development, data security is a key issue when developing voice interfaces. So, what must be considered during analysis and conception? In the article ‘Voice Interfaces – The Here and Now’, the big players were put under the magnifying glass. The interfaces it describes are all cloud-based. Thus, speech analysis and processing do not take place locally on the user’s device, but in the provider’s data centres. Under the GDPR, these providers not only have to disclose where their processing servers are located, but must also comply with the applicable regulations. However, the question arises: why would a financial service provider or health insurance company want to store highly sensitive customer data in the cloud of a foreign company? Amazon, for example, offers a high level of security when accessing its services through encrypted transmission and authentication via OAuth 2.0, yet everything else in its infrastructure is invisible to users and developers. It is almost impossible to anonymise a voice interface that works with sensitive data in such a way that no knowledge about the user can be gleaned on the cloud side. Everything that is said is processed in the cloud, as is everything that the interface communicates to the user. Cloud-based voice interfaces are therefore only advisable in situations where no sensitive data is handled.

Why the cloud? The blessing and curse of current voice interfaces is that sentence transcription and analysis are based on machine-learning technology. Once a dialogue model has been developed, the system must learn this model so that it can then understand similar sentence variants. This ‘learning’ is a computationally intensive process that is performed on server hardware. From this perspective, cloud solutions are both pragmatic and, seemingly, essential. However, there are a few solutions in the field of voice interfaces that can run on local machines or servers. For example, with its speech recognition software Dragon, the software manufacturer Nuance offers a tool that enables transcription on local hardware.

What needs to be considered when dealing with PINs and passwords? Another aspect of data security is the type of interface in question. With a visual interface, it is easy to glance around and check whether anyone is watching as we enter our password; with spoken language, this is far more problematic, and security-sensitive data is easily overheard. PINs and passwords should therefore never be part of a voice interface. Here, a combination with a visual component is more advisable. The user is authenticated via the visual component, while additional operations are carried out using the auditory component.


The handling of sensitive data still represents one of the biggest challenges when using voice interfaces. Here, it is important to work with a particularly critical eye and design dialogues accordingly. Security questions should never be part of a voice interface dialogue. And while it may be tempting, visual concepts should never be transferred directly to a voice interface. Doing so overwhelms the user and leads to dialogues being broken off because they are too long or because of errors. If all of these points are taken into consideration, the user will find working with a voice interface pleasant, natural and helpful. Of course, whether the interface makes sense overall largely depends on the concept and field of application.

This is the third contribution of a four-part series on the subject of voice interfaces.

What voice internet means for the future of digital marketing

The screenless internet: A bold prediction for the future

At the end of 2016, Gartner published a bold prediction: by 2020, 30% of web browsing sessions would be done without a screen. The main driver behind this push into a screenless future would be young and tech-savvy target groups fully embracing digital assistants like Siri and Google Assistant on mobile, Microsoft’s Cortana and Amazon’s Echo.

While 30% still feels slightly optimistic in mid-2018, the vision of an increasingly screenless internet becomes more and more realistic every day. Three years after launch, the adoption rate of smart speakers is outpacing the smartphone adoption rate in the United States. And what’s maybe most surprising: it isn’t only the young early adopter crowd that is behind this success story, but parents and families. Interacting with technology seamlessly and naturally through conversation is making digital services more attractive to a wider range of consumers.

The new ubiquity of voice assistants

And it isn’t only stationary smart speakers that are growing in usage and capability. Every major smartphone features its own digital assistant, and consumers can interact with their TVs and cars through voice as well. The major tech players are investing massively in the field, and within the next few years every electronic device we put in our homes, carry with us or wear will be voice-capable.

So, have we finally reached peak mobile, and can we walk the earth with our chins held high again, freed from the chains of our smartphone screens? Well, not so fast.
There’s one issue many digital assistants still face, and let’s be perfectly honest here: despite being labeled “smart” they are still pretty dumb.

Computer speech recognition has reached human-level accuracy through advancements in artificial intelligence and machine learning. But just because the machine now understands us does not mean it is capable of answering in an equally meaningful way, and a lot of voice apps and services are still severely lacking. Designing better voice services and communicating with consumers is a big challenge, especially in marketing.

Peak mobile and “voice first” as the new mantra for marketing

Ever since the launch of the original iPhone in 2007 and the smartphone boom that followed, “mobile first” has been marketing’s mantra. Transforming every service and touchpoint from a desktop computer to a smaller screen and adapting to an entirely new usage situation on the go was a challenge. And even 10 years later, a lot of companies still struggle with certain aspects of the mobile revolution.

The rising popularity of video advertising on the web certainly helped iron out many issues in terms of classic advertising. After all, a pre-roll ad on a smartphone screen catches at least as much attention as it does in a browser. We figured out how to design apps, websites and shops for mobile, reduced complexity and shifted user experiences towards a new ecosystem. But this mostly worked by taking the visual assets representing our brands and services and making them smaller and touch-capable.

Brand building in a post-screen digital world

With voice, this becomes a whole new struggle. We have to reinvent how brands speak to their consumers. Literally. And this time without the training wheels of established visual assets. At this year’s SXSW, Chris Ferrel of the Richards Group gave a great talk on this topic, and one of his slides has been on my mind ever since: “The visual web was about how your brand looks. The voice web is about how your brand looks at the world.”

In recent decades, radio advertising has mostly been reduced to a push-to-store vehicle: loud, obnoxious, and annoying consumers just long enough that visiting a store on their way home from work became a more attractive prospect than listening to any more radio ads.

On the screenless internet, we could see a renaissance of the long-lost art of audio branding. A lot of podcast advertising is already moving in this direction, although there it is mostly carried by the personalities of the hosts. Turning brands into these kinds of personalities should have priority.

The challenges of voice search and voice commerce

We will also have to look at changing search patterns in voice. Text search tends to be short and precise, mostly one to three words. With voice, search queries become longer and follow a more natural speech pattern, so keyword advertising and SEO will have to adapt.

Voice-enabled commerce poses a few interesting challenges as well. How do you sell a product when your customer can’t see it? This might be less of an issue than initially imagined, though. “Alexa, order me kitchen towels” is pretty straightforward, and Amazon already knows the brand I buy regularly. Utilizing existing customer data and working with the big marketplaces will be key, at least for FMCG brands.

But how do you get into the consumer’s relevant set? And what about sectors like fashion that rely heavily on visual impressions? This is where tightly combining all marketing touchpoints comes into play: voice as a channel can’t be isolated from all other brand communication. Obviously, voice will not replace all other marketing channels, but it might become the first point of reference for consumers due to its ubiquity and seamless integration into their daily lives. Finding its role in the overall brand strategy will be crucial.

Navigating the twilight zone of technological evolution

What may be the biggest challenge of this brave new world of voice marketing is the fact that our connected world isn’t as connected as we would like it to be. The landscape of voice assistants is heavily fragmented and, more importantly, the devices act in very isolated environments. While I can tell my digital assistant to turn on my kitchen lights or fire up my PlayStation when using compatible smart home hubs and devices, a supposedly simple task like “Siri, show me cool summer jackets from H&M on the bedroom TV” isn’t as easily accomplished.

Right now, it is often still up to the users to act as the interface between voice assistants and the other gadgets in their living spaces. The screenless internet isn’t the natural endpoint in the evolution of technology; it’s more of an unavoidable consequence of iterative steps in development. For now, we have to navigate through this weird, not fully realized vision of a connected world and hope for technology to catch up and become truly interconnected. So, let’s find the voices of our brands until they regain the capability of also showing us their connected personality.