Voice interfaces are still far from perfect: vocabulary is misheard, and entire sentences are misinterpreted. The most widely available interfaces also impose significant limitations on developers. What technological improvements are needed to achieve greater acceptance among users? And what development trends are the big players on the market pursuing? We take a look at how voice interfaces are developing and where major potential can be unlocked.

The international market for voice interfaces is developing rapidly and in different directions. A few businesses are focusing on improving the understanding of speech, while others are working to add convenience functions to established technologies. For instance, Alexa will soon be able to distinguish between multiple users by applying voice analysis. Smart assistants will be equipped with deeper knowledge so they can understand increasingly complex speech inputs and grow smarter in the process.

Thus, for example, Samsung are currently working on models for Viv that external developers will be able to expand in future in order to create an increasingly broad knowledge base. What’s more, niche markets are forming for highly specific application fields for conversational interfaces. These are already available for working with product data or in-car solutions, for example.

The big players’ plans

With Alexa, Amazon’s goal was not merely to bring a smart assistant to market; rather, the idea was to offer developers the opportunity to create new skills for it. Its functionality is designed to grow, thus expanding the range of possible applications and ensuring that a market develops specifically for this interface. Other systems tend to be difficult for external developers to expand. For instance, adding a domain to Siri’s knowledge – in other words, teaching it about a particular field – would have a huge impact on its overall functionality.

A good example of this is the word “dry”, which can mean “arid” but can also be used to describe wine. If both knowledge domains were implemented without being coordinated with each other, a sentence such as “I like it dry” would be difficult to interpret. By contrast, the classification would be unambiguous if there were only one knowledge domain. That’s why the Apple environment offers no way of programming Siri independently. With Cortana and the Google Assistant, the expansion opportunities are restricted: Voice Skills or Actions (Google’s equivalent to Alexa Skills) can be developed, but they cannot access existing domain knowledge. This puts developers on an equal footing with those building for Alexa.
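To make the ambiguity concrete, here is a minimal sketch – with invented vocabulary lists, not any real assistant’s data – of what happens when two uncoordinated knowledge domains both claim the same word:

```python
# Minimal sketch (hypothetical data): why uncoordinated knowledge
# domains make an utterance like "I like it dry" hard to classify.

# Each domain registers the vocabulary it understands.
DOMAIN_VOCABULARY = {
    "weather": {"dry", "arid", "rain", "humid"},
    "wine": {"dry", "sweet", "red", "white"},
}

def candidate_domains(utterance):
    """Return every domain that claims at least one word of the utterance."""
    words = {w.strip(".,!?").lower() for w in utterance.split()}
    return {
        domain
        for domain, vocab in DOMAIN_VOCABULARY.items()
        if words & vocab
    }

print(candidate_domains("I like it dry"))       # both domains match -> ambiguous
print(candidate_domains("Is it humid today?"))  # only "weather" matches
```

With a single knowledge domain the second case is typical: exactly one domain matches, so the classification is unambiguous.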

Amazon focuses on in-skill purchasing

Microsoft and Amazon are working to integrate Alexa into Cortana, and vice versa, in order to broaden the market. Initial reviews can already be found online. What’s more, Amazon is bringing more and more hardware for Alexa (or with direct support for Alexa) onto the market. This includes buzzers – simple buttons that let users trigger an action, increasing the scope for gamification – as well as all kinds of Echo devices and even smart-hub integration for Philips Hue, among others.

So far, however, there has been little money to be made in the market for Alexa Skills. Revenues boiled down to the profits generated through use of Amazon Web Services, and were only earned once a specific usage volume was reached. The introduction of “in-skill purchasing” has changed all that – in the USA, at least. In-skill purchasing is similar to in-app purchases and is the first method of voice-interface monetisation to be supported by a provider. Amazon takes a 30% cut of every purchase and every subscription, roughly equivalent to what Apple and its competitors charge on the app market. This model is set to come to Germany soon, although Amazon has not yet released any more specific information on the subject.
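The split is easy to put into numbers. A quick back-of-the-envelope calculation – the 30% figure comes from the text above, while the purchase prices are invented for illustration:

```python
# Developer payout under the stated 30% platform revenue share.
# The purchase prices below are invented for illustration.

PLATFORM_CUT = 0.30

def developer_payout(gross):
    """Amount the skill developer receives after the platform's cut."""
    return round(gross * (1 - PLATFORM_CUT), 2)

print(developer_payout(0.99))  # a one-off in-skill purchase
print(developer_payout(4.99))  # a monthly subscription payment
```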

Google focuses on artificial intelligence

Google is tackling a much broader field in its development of voice interfaces. The Duplex system was unveiled at this year’s “Google I/O” conference, and provides additional functionality for the Google Assistant. It uses artificial intelligence (AI), is capable of understanding conversations, and speaks with a remarkably realistic human voice.

But what exactly does that mean? Suppose my favourite sushi delivery service doesn’t let me place orders online, and I need to order over the phone. All telephone conversations of this kind follow the same principle: I state where I live and the dish I want to order, and in reply I am told how much I need to pay and when the food will arrive. Google created Duplex for exactly this kind of situation. It can be instructed to make phone calls independently and, for example, arrange appointments on your behalf. And when it does so, it’s hard to believe there isn’t a genuine caller on the line. Intonation, pauses and the natural flow of its speech play a special role here. Duplex thus benefits from Google’s long-standing, deep engagement with natural language.

Google also developed Tacotron 2 to generate a human speaking voice artificially (a process known as speech synthesis). Like its predecessor, the new system builds on DeepMind’s established WaveNet neural network to generate natural speech; the new feature is that the neural network now also receives pitch data. This YouTube video by CodeEmporium shows exactly how this works and how the system functions. The system can also be tested with different languages via Cloud Text-to-Speech – just make sure you select a “WaveNet” voice model when you do so. Prospective users should note, however, that these voices are four times as expensive as the standard Cloud Text-to-Speech voices.
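For a sense of what “specify the WaveNet voice model” means in practice, here is a sketch of a request body for the Cloud Text-to-Speech REST API (the `text:synthesize` endpoint). The field names follow Google’s public API documentation at the time of writing, and the voice name “de-DE-Wavenet-A” is one of the German WaveNet voices; actually sending the request requires credentials, which this sketch deliberately omits:

```python
import json

# Sketch of a Cloud Text-to-Speech request body. Choosing a voice whose
# name contains "Wavenet" (rather than "Standard") is what selects the
# more natural -- and pricier -- WaveNet synthesis discussed above.
request_body = {
    "input": {"text": "Hallo, wie geht es dir?"},
    "voice": {"languageCode": "de-DE", "name": "de-DE-Wavenet-A"},
    "audioConfig": {"audioEncoding": "MP3"},
}

print(json.dumps(request_body, indent=2))
```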

Samsung and Apple keep their cards close to their chests

Unfortunately, it’s completely unclear why Samsung acquired Viv Labs and how this system is being developed. It remains to be seen whether Viv will replace Samsung’s previous Bixby solution, or whether the Viv technology will be integrated into Bixby. However, it is clear from Viv’s overall history that it represents a significantly improved version of Siri, with major potential (see Voice Interfaces – The Here and Now).

By contrast, Siri’s development seems to be stagnating. The only major innovation over the past year was voice macros, which make it possible to trigger small automated routines using a pre-saved voice command. Yet this could be the proverbial calm before the storm. After all, Apple’s HomePod would be an ideal competitor to Alexa. To achieve that, however, Apple would first need to open up the Siri interface to developers and make it possible to write software for the HomePod.

Where does the journey lead?

Beyond voice and conversational interfaces, machine learning is also a major buzzword at the moment. The advances that have been made in voice interfaces over the last few years would have been impossible to achieve without machine learning. Whether for transcription, text analysis or speech synthesis, neural networks are used everywhere, and are yielding ever more astonishing results.

For example, with the help of neural networks, a voice interface trained on a single voice could clearly recognise and process that specific person’s speech – even through a din of background noise – because it has learned the characteristic features of that voice. Anyone who has tried to use their Alexa smart-home controls while watching a film will understand how important this development would be. After all, users do not want to shout at their voice interface to make themselves heard over the ambient noise; they want to communicate at a normal volume. What’s more, if individual voices could be separated, that would significantly expand the fields of application for voice interfaces.
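The recognition step can be sketched in miniature. Real systems derive a fixed-size “speaker embedding” from audio with a neural network and compare it against the enrolled user’s embedding; in this toy version the embeddings are hand-made vectors so that only the comparison step is shown:

```python
import math

# Toy sketch of speaker verification. The three vectors below are
# invented stand-ins for neural-network speaker embeddings.

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

THRESHOLD = 0.8  # arbitrary acceptance threshold for this illustration

enrolled = [0.9, 0.1, 0.3]          # embedding stored for the known user
same_speaker = [0.85, 0.15, 0.28]   # new utterance, same voice
tv_in_background = [0.1, 0.9, 0.4]  # a different voice (e.g. the film's audio)

print(cosine_similarity(enrolled, same_speaker) > THRESHOLD)      # accept
print(cosine_similarity(enrolled, tv_in_background) > THRESHOLD)  # reject
```

Only utterances whose embedding sits close enough to the enrolled one are processed; everything else – including the film playing in the background – is ignored.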

Looking beyond optimised speech processing, it is striking that all smart assistants to date have been completely impersonal. All that might be about to change, however, as a completely digital newsreader has just been showcased in China. This offers significant potential for product providers. Even if the film “Her” depicts a particularly personal relationship with a voice, it is undoubtedly true that people build closer emotional connections with realistic personalities. Just look at the success of influencer marketing. VR and AR technology might also allow this kind of assistant to keep us company in human form wherever we go.

Where does the greatest potential lie?

Computing power:

Given the privacy and security concerns raised by the fact that all data processing for voice interfaces currently takes place in the cloud, we can expect more and more solutions in future in which processing happens locally. At present, almost all data is processed and stored in the provider’s cloud, mainly because many solutions exceed the capacity of the user’s own device. Yet processing power is constantly growing and getting cheaper, so it is only a matter of time before voice interfaces can run entirely offline on a smartphone.

Language comprehension:

Many companies are also working on understanding speech at the level of content. All modern voice interfaces break down when it comes to interpreting more than one individual sentence – such as the content of an entire story. As they currently stand, voice interfaces focus primarily on statements of intent rather than on knowledge content: the interface is designed to understand what the user wants so that it can provide a response. Extracting knowledge from texts, by contrast, is about capturing information and saving it in ordered structures.

Let’s take the example of a hotline employee who has to handle a five-minute conversation with a customer about a complaint. A few solutions already exist to support such employees by identifying keywords in the conversation and displaying relevant topics on a screen. It would be even more useful, though, if the interface could extract the essential content from the conversation and display the key points for the employee to address in the discussion. For this to happen, the system would need to understand the content of what the customer is saying and evaluate or prioritise it appropriately. Going further, a conversational interface could also extract information from emails or even chatbots and quickly make all the relevant facts available to service employees.
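The keyword-spotting step described above can be sketched very simply: count the content words in a transcript after removing filler words. Real solutions use far richer natural-language processing; the stop-word list and the sample transcript here are invented for illustration:

```python
from collections import Counter

# Rough sketch of keyword spotting in a hotline transcript:
# strip filler ("stop") words, then rank the remaining words by frequency.

STOP_WORDS = {
    "i", "the", "a", "my", "was", "and", "to", "it", "is", "in",
    "me", "you", "for", "of", "that", "on", "not", "but",
}

def top_keywords(transcript, n=3):
    words = [w.strip(".,!?").lower() for w in transcript.split()]
    counts = Counter(w for w in words if w and w not in STOP_WORDS)
    return [word for word, _ in counts.most_common(n)]

transcript = (
    "I booked a room for Friday and the room was not cleaned. "
    "The booking confirmation promised a cleaned room and breakfast, "
    "but breakfast was not included in my booking."
)
print(top_keywords(transcript))  # e.g. ['room', 'cleaned', 'booking']
```

A screen showing these three words would already point the employee towards the complaint’s core; the harder, still-open problem is summarising the conversation’s actual meaning rather than its vocabulary.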

A great deal of additional research is currently underway in the field of knowledge representation and natural language understanding. Likewise, more and more self-learning technologies are being developed to undertake text analysis, such as word embeddings. Here too, it is only a matter of time before systems become available that can understand highly complex content.

Recognising and verbalising image content:

Something that most people tend to encounter only peripherally is accessibility in the digital world. In the past, Siri made a major contribution to helping people with visual impairments use smartphones conveniently. Voice interfaces are particularly relevant to people in this position.

In addition, the field of machine learning is home to many research projects focused on recognising image content. This is no longer merely a matter of telling dogs apart from cats; it is about complex scenes with many components. Imagine, for example, a system that can recognise and describe a street scene – what is in front, what is behind – or can tell whether a traffic light is currently red, or read the symbols on road signs. Taken together, these technologies would deliver significant added value: a system for the visually impaired that describes what is currently happening in front of the user, warns them when obstacles come into view, and provides reliable navigation.


Voice interfaces have come a long way – yet in day-to-day use, they still don’t feel completely natural, because their capacity to understand speech remains too limited. However, people are working on these problems, and it is possible to envisage a future in which we talk to our digital assistants almost routinely – perhaps even telling them about our ups and downs, and receiving understanding responses or even ideas and encouragement in return. Time will tell what impact this might have on our social lives. Every major technology to date has brought both advantages and disadvantages in its wake – we just need to make sure we deploy this one prudently.

This is the final instalment of a four-part series on voice interfaces.