Conversation design: The right approach to crafting voice interfaces

Apple’s App Store launch feels like a distant memory now — the glossy buttons, overused gradients, and harsh drop shadows have faded in the rearview for most of us. Similarly, in a few years we’ll have forgotten the arguments we’re now having with our voice assistants — the casual request “Hey Siri” eventually turning into a frustrated “HEY SIRI — SET THE STUPID TIMER.”

It may seem to us now that voice user interfaces (VUIs) aren’t learning quickly enough, but they’re actually evolving at a pretty good pace. The primary platforms, for example, are making substantial strides to define the process and practice of crafting a VUI so that third-party UX designers can bring us new and hopefully better experiences. But UX designers also need time to adapt. Yes, they can follow many of the fundamental guidelines they’re used to applying to visual interfaces, but they also need to operate with new tools and new rules. What makes VUI design most difficult is the absence of a visible interface — there are no confines of a screen to keep users boxed in. In this space, we have to design for every possible situation without that visible safety net.

Platforms that allow for third-party Skills and Actions have given some rough guidelines about how to design for voice, but at its last I/O conference, Google gave us a new term: conversation design. With it came a more robust set of guidelines and best practices for creating Actions for Google Assistant, but many of Google’s fundamental ideas can apply to every voice interface regardless of the platform.

Better error handling

Error handling, as such, is not a component of human-to-human conversation. When one party misunderstands something, we don’t respond with “I’m sorry, I don’t understand what you said” and walk away. If VUIs are expected to be modeled after human conversation, this interaction should be more adaptive. Error handling is one of the most overlooked cases in voice. Poor error handling puts the mental workload on the user rather than the computer. Users who run into an error once are less likely to initiate an interaction again.


Google does an excellent job of defining error types with three buckets: no input, no match, and system errors. No input refers to a situation in which the VUI did not hear the user. No match is when the user input cannot be interpreted. Lastly, system error refers to either a technical issue or an invalid request. There are different approaches to how each of these should be handled, but the core idea of error handling is: If at first you don’t succeed, try and try again. As designers, we should approach error handling the same way we would in a human conversation. A well-handled no-match error could look something like this:

User: “I need a cake recipe.”

VUI: “I have a recipe for lemon poppy seed and red velvet, do you like either of these?”

User: “Uhh, I’m not a fan of red velvet.”

VUI: “I’m sorry, did you want to look for other recipes, or will one of those work?”

Instead of giving a general prompt of “I’m sorry, I didn’t understand,” the VUI gave the user an alternative. In human conversation, if we don’t get the answer we’re hoping for, we approach the situation from a different angle to keep the conversation going. This is conversation design — a human approach to a technical problem.
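
To make that escalation concrete, here is a minimal TypeScript sketch of how the three error buckets might be handled. The handleError function, Session counter, and respond callback are hypothetical names invented for illustration, not any platform’s real API. The key design choice is that each retry adds information for the user instead of repeating the same apology.

```typescript
// The three error buckets described above.
type ErrorKind = "no-input" | "no-match" | "system";

// Hypothetical per-conversation state: how many errors in a row.
interface Session {
  errorCount: number;
}

// A sketch of escalating reprompts: quick retry first, explicit
// options second, then a graceful exit instead of an endless loop.
function handleError(
  kind: ErrorKind,
  session: Session,
  respond: (text: string) => void
): void {
  session.errorCount += 1;

  if (kind === "system") {
    respond("Something went wrong on my end. Let's try that again.");
    return;
  }

  switch (session.errorCount) {
    case 1:
      respond(
        kind === "no-input"
          ? "Sorry, which recipe did you want?"
          : "Did you want lemon poppy seed, red velvet, or something else?"
      );
      break;
    case 2:
      respond("You can say 'lemon poppy seed', 'red velvet', or 'other recipes'.");
      break;
    default:
      respond("Okay, let's leave it there. Ask me again whenever you're ready to bake.");
  }
}
```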

Personas

Voice interfaces introduce a new element to design — a voice persona. The idea behind creating a voice persona for an Action is to make users feel like they are talking to a person rather than a computer. This persona is a character that talks to your users and has quirks, a tone, and a brand. We all know the sound of Alexa’s voice and how she responds to general commands. An Action is more personalized than the platform voice assistants, delivering responses to a smaller set of users — your target audience. For this reason, the responses an Action persona delivers can take on the personality of the product. A persona for an adventure travel Action might be upbeat with a casual delivery, while an Action for world news may need an air of confidence and intellect.

To create a VUI persona, you need to invest in several different disciplines — from user experience and interaction design to sound and voice design. However, like the colors and typography of your app, the voice of your VUI is an extension of your brand. It gives you the ability to separate your product from the herd. Google is currently the only primary platform that allows third parties to create personas for their Actions. It will be interesting to see how this influences the other platforms.
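
In practice, much of a persona comes down to consistent wording across every response. The TypeScript sketch below illustrates the idea with two hypothetical personas for the travel and news examples above; the Persona interface and the response strings are invented for illustration, not part of any platform API.

```typescript
// A persona is a consistent voice: every response type gets a
// variant written in that character's tone.
interface Persona {
  name: string;
  greet: () => string;
  confirm: (topic: string) => string;
}

// Upbeat, casual delivery for an adventure travel Action.
const adventurePersona: Persona = {
  name: "Scout",
  greet: () => "Hey there, explorer! Where are we headed today?",
  confirm: (topic) => `Nice pick! Packing up everything on ${topic} now.`,
};

// Confident, measured delivery for a world news Action.
const newsPersona: Persona = {
  name: "Atlas",
  greet: () => "Good evening. Here are today's top stories.",
  confirm: (topic) => `Understood. Pulling up coverage of ${topic}.`,
};

console.log(adventurePersona.greet()); // upbeat and casual
console.log(newsPersona.greet());      // confident and measured
```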

Intermodal experiences

If you are using a VUI in conjunction with a connected device that can deliver supplementary visual information — such as a smartphone, tablet, or car interface — then that’s called an “intermodal experience.”

Alexa, Siri, and Google have been providing these types of experiences for quite some time. But recently, there has been a shift in how voice information and visual information are delivered to the user. Instead of the visual interface simply replicating the voice interaction, we are starting to see visual and voice take on different roles. As designers, we should see visual and voice as a partnership. Sometimes users begin an interaction with voice and finish it on a device and vice versa. They are not mutually exclusive, nor should they be copies of one another.

A common issue with providing a visual experience alongside voice is that the experience is often duplicated; the VUI responds, and the UI shows a copy of that response on the screen. Instead, these experiences should fit the interface. Humans talk differently than they write, and what is communicated in the two channels should reflect that difference. Let’s evaluate how this can be applied to someone baking in the kitchen:

A user is baking from a recipe they found on their tablet. Searching for ingredients in their kitchen, they ask: “What are the dry ingredients for this recipe?” We need to take this request in context. The prompt asked for a specific set of ingredients and was the first voice command from the user. We can assume a couple of things: First, the user has just begun baking; second, they are gathering dry ingredients. Given those two pieces of information, the best step is to deliver a quick response with only what the user requested. For the VUI, this means providing the list of dry ingredients but withholding the measurements. Not only would measurements lengthen the response considerably, but voice is simply the wrong channel of communication for those details. On the tablet, however, when the user asks for dry ingredients, they are listed with measurements, ready for the user whenever they finish gathering their ingredients.
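
Here is a minimal sketch of that split, assuming a hypothetical Ingredient type and formatting helpers: the same data drives both channels, but the voice string carries only the names while the screen list carries the full measurements.

```typescript
// One data source, two renderings: names for the ear,
// measurements for the eye.
interface Ingredient {
  name: string;
  amount: string;
}

const dryIngredients: Ingredient[] = [
  { name: "flour", amount: "2 cups" },
  { name: "sugar", amount: "1 cup" },
  { name: "baking powder", amount: "2 tsp" },
];

// Voice: short and scannable by ear, no measurements.
function voiceResponse(items: Ingredient[]): string {
  return `You'll need ${items.map((i) => i.name).join(", ")}.`;
}

// Screen: full detail the user can reference at their own pace.
function screenResponse(items: Ingredient[]): string[] {
  return items.map((i) => `${i.amount} ${i.name}`);
}

console.log(voiceResponse(dryIngredients));
// "You'll need flour, sugar, baking powder."
console.log(screenResponse(dryIngredients));
// ["2 cups flour", "1 cup sugar", "2 tsp baking powder"]
```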

Voice is useful for giving quick, high-level information and pulling in detail only when prompted. Visuals are suited to delivering details, giving users the ability to take what they want from them. Consider this real-world situation in a restaurant or bar: You sit down and start to browse the drink menu for a beer. You find the header “Beer” with two items listed — bottled and on tap — but no details on the specific types of beer. When you ask your server what they have on tap, they begin reciting a list of 15 beers in their inventory. Do you remember the third beer they listed? The seventh? Most likely not. You were given too much verbal information in a short amount of time, which usually leads to a rushed decision. If instead the restaurant had listed the beers on the menu, you could scan it, compare the items that sound most interesting, and have confidence in your selection. It’s the same content, just delivered through a different channel.

What’s next

In the early years of the App Store and Google Play Store, it was difficult to convince a room full of stakeholders of the importance of user experience in a native application. Now, we build apps for products and services that are mobile-first, sometimes even mobile-only. It is no longer acceptable to take a design built for desktop and shrink it down to fit a mobile device. VUI will work out its kinks just as early native apps did, but that can’t happen if we keep crafting the same experiences with outdated guidelines. Conversation design is the notion that an experience should be crafted for its interface from the start, rather than as an afterthought, if voice is to achieve the same success as visual interfaces.

Meghan Dever is Lead Designer at POSSIBLE Mobile.