Voice Interfaces – A Trend with History - 19. October 2018
Until 2015, voice interfaces were perceived by most as a nice gimmick limited to smartphones and navigation systems. But with Amazon Echo, this technology entered the living rooms of consumers around the world virtually overnight. Amazon has not released exact sales figures or other details, but according to the news portal Business Insider, 2.4 million Amazon Echos were sold worldwide in 2015 alone, rising to 5.2 million in 2016. As a result, Apple also revamped the previously neglected Siri and, after six years of silence concerning its speech-recognition programme, announced a unique device in June 2017: the HomePod. Other companies were subsequently forced to follow this trend, even if they were unsure how to handle it.
Back to the roots
At the same time, voice and conversational interfaces are not an entirely new concept. Voice interfaces are essentially conversational interfaces with a special input channel: analogue speech. Many market observers may even be familiar with the development stages of the past decades. If you look at the technology behind a voice interface today, you will find two different components: one is responsible for transcribing analogue speech into text; the other analyses that text and reacts accordingly, a task carried out by natural language processing (NLP) and other artificial intelligence (AI) technologies. Both components have existed as separate technologies for a very long time:
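The two-component structure described above can be sketched in a few lines of Python. The function names and the stubbed transcription result are purely illustrative placeholders, not a real speech API:

```python
# Minimal sketch of the two-component voice-interface pipeline.
# transcribe() and interpret() are hypothetical stand-ins, not a real API.

def transcribe(audio_bytes: bytes) -> str:
    """Component 1: speech-to-text (stubbed here with a fixed transcript)."""
    return "turn on the kitchen light"

def interpret(text: str) -> str:
    """Component 2: NLP, mapping the transcript to an action or response."""
    words = text.split()
    if "light" in words and "on" in words:
        return "ACTION: lights_on"
    return "Sorry, I did not understand."

def voice_interface(audio_bytes: bytes) -> str:
    """Chain the two components: audio in, reaction out."""
    return interpret(transcribe(audio_bytes))
```

The point of the sketch is only the separation of concerns: transcription and language understanding are independent stages, which is why the two technologies could evolve separately for decades.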
1) Transcription

Transcribing simply means transforming spoken language, or even sign language, into written form. Corresponding software has been available since 1982, when Dragon Systems launched its first product: the somewhat rudimentary DragonDictate, developed for the former DOS (x86). Continuous transcription was not yet possible; 15 years later, however, the same company launched Dragon NaturallySpeaking 1.0. That software already understood natural language so well that it was mainly used for computer dictation. These early systems had to be heavily voice-trained, or the vocabulary had to be restricted, in order to improve recognition accuracy. There were therefore prefabricated language packs for professions with highly specialised language, such as lawyers or medical practitioners. Once optimised, these early systems delivered amazingly good results. In addition, Dragon already offered the option to control a Windows system with voice commands.
2) Natural Language Processing
After the speech has been transcribed, the text can be processed further. When considering a technology that can work with natural-sounding input text and react coherently to it, one quickly thinks of chatbots. These are a subclass of autonomous programmes, called bots, that can carry out certain tasks on their own. Chatbots simulate conversation partners and often operate within specific topic areas. Although they have enjoyed increasing popularity in recent years, this is better described as a renaissance: the first chatbot was born 52 years ago, when computer scientist Joseph Weizenbaum developed ‘ELIZA’, which successfully demonstrated the processing of natural language and is today considered the prototype of modern chatbots.
3) Artificial Intelligence
The development of ELIZA showed that simple means suffice to achieve good results in the Turing test of artificial intelligence (AI), which rests on the subjective evaluation of a conversation. In spite of the bot’s simple mechanisms, test subjects began to build a personal bond with it and even wrote about private matters. Experience with this first conversational interface attracted a lot of attention and drove continuous improvement of chatbot technologies.
For example, in 1981, BITNET (Because It’s There NETwork) was launched, a network linking US research and teaching institutions. One component of this network was Bitnet Relay, a chat client that later became the Internet Relay Chat (IRC). Over the years, students and enthusiasts developed countless, more or less simple, chatbots for these chat systems, including ICQ. Like ELIZA, they were based on the simple recognition of sentences and not on the evaluation of knowledge.
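The ‘simple recognition of sentences’ these bots relied on can be illustrated with a minimal ELIZA-style responder. The patterns below are invented for illustration and are not Weizenbaum’s original script:

```python
import re

# A minimal ELIZA-style responder: it matches surface patterns in the
# input and reflects fragments back; no knowledge representation involved.
RULES = [
    (re.compile(r"\bI am (.+)", re.IGNORECASE), "Why do you say you are {0}?"),
    (re.compile(r"\bI feel (.+)", re.IGNORECASE), "How long have you felt {0}?"),
    (re.compile(r"\bmy (\w+)", re.IGNORECASE), "Tell me more about your {0}."),
]

def respond(utterance: str) -> str:
    """Return the reply of the first matching rule, else a stock phrase."""
    for pattern, template in RULES:
        match = pattern.search(utterance)
        if match:
            return template.format(*match.groups())
    return "Please go on."  # default when no pattern matches
```

A handful of such rules is enough to sustain the illusion of a conversation, which is exactly why test subjects bonded with ELIZA despite its lack of any real understanding.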
In 2003, another important development was sparked that paved the way for a new class of chatbots: Smart Assistants such as Siri. CALO, the ‘Cognitive Assistant that Learns and Organizes’, was a project initiated by the Defense Advanced Research Projects Agency (DARPA) involving many American universities. The system was intended to help the user interact with information more effectively and to provide assistance by constantly improving its ability to interpret the user’s wishes correctly. The concept is based on digital knowledge representation, through which knowledge can be captured in a digital system and made usable. Semantic networks map objects and their capabilities in relation to other objects, enabling the Smart Assistant to understand what a user wants to express with a given utterance. For example, if a customer orders a ‘dry wine’ through their Smart Assistant, it needs to understand the connection between the terms ‘dry’ and ‘wine’ in context. Only then does it understand that ‘dry’ refers here to a taste sensation and not to the absence of liquid.
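The ‘dry wine’ disambiguation can be sketched as a toy semantic network. The relation names and sense labels below are invented for illustration; a real knowledge base would be far richer:

```python
# A toy semantic network: edges relate concepts to other concepts, so the
# assistant can resolve that 'dry' refers to taste when attached to 'wine'.
EDGES = {
    ("wine", "is_a"): "beverage",
    ("wine", "has_property"): "taste",
    ("towel", "has_property"): "moisture",
    ("dry", "sense_for_taste"): "low_sweetness",
    ("dry", "sense_for_moisture"): "absence_of_liquid",
}

def interpret_modifier(noun: str, modifier: str) -> str:
    """Pick the modifier sense that matches a salient property of the noun."""
    prop = EDGES.get((noun, "has_property"))
    return EDGES[(modifier, f"sense_for_{prop}")]
```

The lookup chain, noun to property to modifier sense, is what distinguishes this approach from pure sentence matching: the answer follows from relations between concepts, not from the wording of the request.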
The simple recognition and comparison of texts, also called matching, and intelligent analysis by means of knowledge representation are two different technologies that have evolved independently of each other. With the matching approach, most applications can be implemented with straightforward resources. For more complex queries, however, a Smart Assistant is much better suited. In turn, this technology is more demanding to develop and implement because it requires a broad knowledge base.
Currently, the chatbots one usually comes across are based on matching technology and can be trained with the help of machine learning (ML). With this method, the system is given as many text variants of a certain statement as possible; it learns these in order to recognise other, similar sentences in subsequent use, without needing any special knowledge.
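This training-by-example idea can be sketched with a minimal intent matcher. Here, simple word-overlap (Jaccard) similarity stands in for a real ML model, and the intents and example sentences are invented for illustration:

```python
# Matching-based intent recognition in miniature: each intent is "trained"
# with example sentences, and a new utterance is assigned to the intent
# whose closest example shares the most words with it.
TRAINING = {
    "order_pizza": ["I want to order a pizza", "get me a pizza please"],
    "weather": ["what is the weather today", "will it rain tomorrow"],
}

def tokens(sentence: str) -> set:
    """Lowercased word set of a sentence."""
    return set(sentence.lower().split())

def classify(utterance: str) -> str:
    """Return the intent whose best example has the highest Jaccard overlap."""
    words = tokens(utterance)

    def best_overlap(examples):
        return max(
            len(words & tokens(ex)) / len(words | tokens(ex)) for ex in examples
        )

    return max(TRAINING, key=lambda intent: best_overlap(TRAINING[intent]))
```

Note that the classifier never needs to know what a pizza is; it only compares surface forms, which is precisely the limitation the knowledge-based approach addresses.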
Today we can choose between two technologies for building a conversational interface. Depending on the requirements, one must ask whether a system that compares what has been said with learned sentence structures is sufficient, or whether a system is needed that understands the meaning of what has been said and reacts accordingly.
This is the first contribution of a multi-part series on the subject of voice interfaces:
- Part 1: ‘Voice Interfaces – A Trend with History’
- Part 2: ‘Voice Interfaces – The Here and Now’
- Part 3: ‘Voice Interfaces – The 3 Biggest Challenges’
- Part 4: ‘Voice Interfaces – Looking to the Future’