Dr. Pfleger, more and more people have voice assistants at home or in the car. What needs to happen behind the scenes so that I can ask a machine, “What’s showing in the movie theater tonight?”
Norbert Pfleger: The first stage is the voice recognition system, which records a map of the voice's frequency characteristics and derives from it the words the system believes it has heard. Good microphones with a noise-suppression function are extremely useful here. This gives us only an initial idea of what the user might have said. For this reason, we need a dialog manager as a second step to derive the most likely interpretation based on the particular context: What did the user mean? Then we need access to background information, usually devices, applications and web-based services, to determine which movies are showing in which theaters, for instance.
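The three stages described here (recognition, dialog manager, background services) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `Hypothesis` type, the keyword table, the 0.2 weighting and the `lookup_showtimes` stand-in are hypothetical and do not describe how any real assistant is built.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    text: str     # words the recognizer believes it heard
    score: float  # recognizer confidence, assumed to lie in [0, 1]


# Hypothetical keywords a dialog manager might expect in a given context.
CONTEXT_KEYWORDS = {
    "cinema": {"movie", "theater", "showing", "tonight"},
    "navigation": {"route", "drive", "street"},
}


def interpret(hypotheses, context):
    """Stage 2: pick the hypothesis that best fits the current context."""
    keywords = CONTEXT_KEYWORDS.get(context, set())

    def fit(h):
        words = set(h.text.lower().replace("?", "").split())
        # Assumed weighting: each context keyword adds 0.2 to the score.
        return h.score + 0.2 * len(words & keywords)

    return max(hypotheses, key=fit)


def lookup_showtimes(city):
    """Stage 3: stand-in for a web service; the data is a placeholder."""
    data = {"Saarbrücken": ["Action movie, 22:00", "Romance, 20:30"]}
    return data.get(city, [])


def answer(hypotheses, context, city):
    best = interpret(hypotheses, context)
    if "movie" in best.text.lower():
        return lookup_showtimes(city)
    return []


# Stage 1 is assumed to deliver several candidate transcriptions.
candidates = [
    Hypothesis("what's snowing in the movie heater tonight", 0.55),
    Hypothesis("what's showing in the movie theater tonight", 0.50),
]
print(interpret(candidates, "cinema").text)
print(answer(candidates, "cinema", "Saarbrücken"))
```

In this sketch the acoustically stronger but nonsensical candidate loses to the one that fits the cinema context, which is the role the dialog manager plays in the description above.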
In other words, you need rapid access to as many databases as possible?
Exactly. A really smart digital assistant must be able to take the context — or knowledge of the outside world — into account. For example, I am currently sitting in Saarbrücken at just after eight o’clock at night. So, I’m interested in movies that are available in the late showing at around ten. Plus, a good assistant must include acquired information. For instance, if it knows I prefer action movies to tales of romance, it can use that information to recommend a suitable movie for me. None of this is possible without intelligence, which is also the only way that the assistant can genuinely add value. At the end of the day, you need a component that can present this acquired information in a useful format. What do I convey verbally? What do I prefer to display on the screen? After all, it makes no sense to read out a long list of 40 movies.
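As a rough illustration of combining context, acquired preferences and presentation, the following sketch filters showings by the current time, ranks them by preferred genre and splits the result into a short spoken list and a longer on-screen list. The function name, the 60-minute lead time and the sample data are assumptions, not part of any system mentioned in the interview.

```python
from datetime import datetime, timedelta


def recommend(showings, now, preferred_genres, max_spoken=3, lead_minutes=60):
    """showings: list of (title, genre, start) tuples, start as datetime."""
    # Context: keep only showings the user can still reach (assumed lead time).
    earliest = now + timedelta(minutes=lead_minutes)
    upcoming = [s for s in showings if s[2] >= earliest]
    # Acquired information: preferred genres first, then earlier start times.
    upcoming.sort(key=lambda s: (s[1] not in preferred_genres, s[2]))
    # Presentation: read out only a few titles, show the rest on screen.
    return upcoming[:max_spoken], upcoming[max_spoken:]


now = datetime(2024, 1, 1, 20, 5)  # just after eight in the evening
showings = [
    ("A", "romance", datetime(2024, 1, 1, 21, 30)),
    ("B", "action", datetime(2024, 1, 1, 22, 0)),
    ("C", "action", datetime(2024, 1, 1, 20, 30)),
    ("D", "comedy", datetime(2024, 1, 1, 22, 15)),
]
spoken, on_screen = recommend(showings, now, {"action"}, max_spoken=2)
print([title for title, _, _ in spoken])
print([title for title, _, _ in on_screen])
```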
Let’s be honest — today’s most common user interfaces are generally incapable of making such recommendations. Why is that?
In reality, the problem often comes down to simple things like the user’s accent or double meanings. Today’s smart speakers are simply not yet at the level of an assistant because they lack contextual data as well as acquired information about the user. On the other hand, good systems are capable of dealing with ellipses — that is, incomplete sentences — or references. Our system in an Audi A8, for instance, knows that I have just talked to someone on the phone and understands what I mean when I say “Take me there” after hanging up.
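Resolving a reference such as "Take me there" requires some memory of what happened earlier in the dialog. The sketch below shows one simple way to model that, assuming a hypothetical `DialogContext` that stores recently mentioned entities; it is not the implementation used in the Audi A8 system.

```python
from dataclasses import dataclass, field


@dataclass
class DialogContext:
    # Entities involved in recent interactions, newest last.
    entities: list = field(default_factory=list)

    def remember(self, kind, value):
        self.entities.append((kind, value))

    def resolve(self, kind):
        """Return the most recent entity of the given kind, or None."""
        for k, v in reversed(self.entities):
            if k == kind:
                return v
        return None


def handle(utterance, ctx):
    text = utterance.lower().strip(" .!")
    if text == "take me there":
        # "there" refers to the most recently mentioned location.
        place = ctx.resolve("location")
        if place is None:
            return "Where would you like to go?"
        return f"Starting navigation to {place}."
    return "Sorry, I did not understand."


ctx = DialogContext()
# Assumed: the phone call just ended left a contact and an address behind.
ctx.remember("contact", "Anna")
ctx.remember("location", "Anna's office")
print(handle("Take me there.", ctx))
```

Without the remembered location, the same utterance falls back to a clarifying question, which is the behavior of a system that lacks contextual data.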
Based on the current state of the art, what are the biggest hurdles standing in the way of truly intelligent voice assistants?
There are two areas of particular relevance here. First of all, the services are not yet sufficiently integrated and networked to enable delegation of tasks. You could call it “insufficient intelligence.” Today’s assistants are essentially a collection of standalone applications. For example, I have to say, “Hello voice-activated service, tell MyTaxi that I need a cab.” As a user, I have to know the name of the app or skill and must understand how to operate it. Put simply, I have to perform an unnecessary number of steps. What I actually want is to say “Call me a cab” and the system does the rest. It’s similar to the problem with using smartphones, where you have to spend ages scrolling back and forth to find the right app.
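The difference between naming an app and simply stating a goal can be sketched as intent-based delegation: the assistant maps the request to an intent and picks the service itself. The registry, intent names and keyword matching below are hypothetical simplifications; a real system would use a far more capable language-understanding component.

```python
SERVICES = {}


def service(intent):
    """Register a handler for an intent, so users never name the app."""
    def register(fn):
        SERVICES[intent] = fn
        return fn
    return register


@service("order_taxi")
def order_taxi(location):
    return f"Cab ordered to {location}."


@service("play_music")
def play_music(location):
    return "Playing music."


# Assumed keyword table standing in for real intent recognition.
INTENT_KEYWORDS = {
    "order_taxi": ("cab", "taxi"),
    "play_music": ("music", "song"),
}


def delegate(utterance, location):
    words = utterance.lower().strip(".!?").split()
    for intent, keywords in INTENT_KEYWORDS.items():
        if any(k in words for k in keywords):
            return SERVICES[intent](location)
    return "Sorry, no service can handle that."


print(delegate("Call me a cab.", "Saarbrücken"))
```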
And the second hurdle?
On top of everything else, they are all still very technical interfaces that interact with me in a pretty inflexible way. What’s missing is the sense of compassion, of empathy. I’m looking for an assistant that understands me and addresses my feelings — one that can engage with me on an emotional level. If it can attune to the user’s emotional state, the dialog will be better. Picture me in the car, stuck in traffic and stressed out — I’m unlikely to appreciate receiving a reminder to dash to the store to pick up something. However, if the system knows how I’m feeling, it can convey the information in a different way. Smart speakers these days are mainly used to perform simple tasks such as controlling the lighting, playing music and sending messages. But if systems are to take over more and more everyday tasks, they must engage with us in a different way to avoid causing frustration and acceptance problems.
Despite its early stage of development, how has voice recognition altered our everyday lives so far?
There’s no denying the increasing use of these systems in our daily lives — especially by children. I see this in my family. For my two daughters aged six and ten, it is the most normal thing in the world to talk to a device, for instance, to change the TV station. They instantly and intuitively understood that this is an efficient way of doing something.
And what about in the workplace?
There too, these types of systems are becoming increasingly popular because they save time and money, for example when dictating text in law firms or doctors' letters. They are even used in medical technology. One of our customers manufactures a robotic camera guidance system used in keyhole surgical procedures. The surgeon can control the camera by voice while keeping both hands on the surgical instruments. This provides huge benefits in terms of flexibility because the surgeon does not need to wait for the theater nurse or assistant. Virtually all warehouses now employ speech dialog systems that tell workers where the next product is located so they can keep both hands free. In short, voice control can be used to optimize numerous workflows and reduce waiting times.
What does the future hold? Outline for us how voice control will define the networked world in five or ten years.
I wouldn’t say it will define things, but rather that voice control will support us in our everyday lives. Although these systems will be available around the clock, they will operate very much in the background so that we are not constantly aware of them. I anticipate that we will deal less with individual machines and rely instead on a combination of multiple interfaces, such as voice, gestures or touch screens, backed by artificial intelligence. These systems will not need that many commands. Instead, they will use what they have learned from and about us to offer smart and therefore subtle support. Conventional operating interfaces will gradually be phased out and replaced by a superordinate system, with intelligence operating behind the scenes of my day-to-day life.
Ignoring keyboards, touch screens and gestures for a moment — is voice really the most natural way for us humans to communicate with machines?
It is one of the most efficient ways, but not the only way. It always depends on what I want to do at any given time. Take the process of dictating an IBAN code, for example. That doesn’t work well even in human-to-human communication because I can say the wrong number or write something down incorrectly. This kind of information is best typed directly into a device or photographed. So, what we’re talking about here is a combination, or multi-modal system, where the user has a choice. The fact that I can switch the lights on or off using a voice command is completely superfluous when I’m standing right beside the switch, but very useful if I happen to be sitting on the couch.
Dr. Norbert Pfleger
Dr. Norbert Pfleger holds a PhD in Computer Science and worked at the German Research Center for Artificial Intelligence (DFKI) from 2002 to 2008. He is co-founder and managing director of paragon semvox GmbH in Saarbrücken, Germany. The company emerged in 2008 from a project conducted at the DFKI and now develops semantic technologies and voice communication solutions — including a natural-language speech dialog system in the Audi A8. In 2018, the company became part of paragon GmbH & Co. KGaA.
What other, rather more futuristic operating possibilities do you see on the horizon?
If you examine the range of interpersonal communications, there are still lots of possibilities. We exchange information using gestures, facial expressions and looks. All of these are important input sources for the future. Control by thought is the next major, exciting step: the system would deduce what I want without any words or actions on my part.
Given these many areas that have yet to be developed, is a truly intelligent voice assistant still a long way off or just around the corner?
Despite the enormous strides we have made in microphone technology and voice recognition, the technology as a whole is still in its infancy. We have yet to create what can be described as a truly intelligent companion. This becomes apparent whenever I give lectures and ask my audience which of them use a smart speaker. Between 80 and 90 percent raise their hands. But when I ask if they consider these systems to be genuine assistants, not a single hand goes up.
Machines don’t care whether we say “please” or “thank you,” they merely respond to commands. How will voice control change our interpersonal relationships?
That’s a question of perspective — the extent to which we manage to incorporate empathy so that a system does not merely execute commands but engages with me on an equal footing. This will influence our behavior toward machines. It is also a question of product design. The assistant must display a certain form of politeness or personality, and then we will reflect that. On the other hand, this type of system should not come across as being overly human as this would merely raise our expectations, which would be swiftly disappointed.
Many users have concerns that a box that is part of their everyday lives is always listening and may even be recording or sharing that information…
Clear-cut boundaries need to be set in this area. I don’t expect the type of centralized cloud services used for voice assistants today to continue in their current form into the future. The edge computing paradigm, where I have a server in my own home, for instance, will become increasingly important. As an individual, I should have a digital representative in the shape of an assistant that lets me control where data is stored and processed, and that I can bring with me to use on other platforms. Admittedly, this will mean moving away from free-of-charge models. Instead, my intelligent companion will play such a key role in my life that I will be willing to pay for it. This scenario will also do away with a fundamental acceptance problem, namely the legitimate fear among users that their data could be misused.
What areas of your everyday life should remain off limits to voice assistants?
I would cast the net a little further. There are undoubtedly areas such as the bedroom or the children’s room that should be kept free of electronic devices. This relates to the issue of electronics-free space that is so important to ensuring a degree of mental hygiene.