VUI Challenge #010: My designs

Aug 9, 2021

Continuing to share my progress as I work through Jesús Martín’s VUI Challenge…

I’ve got company!

My friend Maaike Groenewegge suggested we synchronise our blogs. We’re going to release our independent designs for the same challenge together — this should mean that anyone who’s interested can compare our approaches and glean more insights (not least the two of us). I’m game!

Here’s Maaike’s.

Some other designers have shown interest too, so my plan is to collect all the designs together in a list.

This time we’re focusing on VUI Challenge 010.

Challenge 010 — Info details and SSML

(Exactly as received in Jesús’ email)

The challenge
Create an interaction where you inform customers about the address and telephone of a restaurant called Lion’s House (you can change the name if you want). You can decide to use a single or multi turn experience. Play special attention to SSML tags where needed.
Jesús’ Tip
Are you familiar with the term cognitive load? Our brains are incredibly limited and there’s always a limited amount of information we can handle. For this challenge, think about what your users want to achieve and share the information so it is easier for them to be successful.

My notes on the challenge

This challenge could make me think “I’ll just write any old prompt and then try to use SSML to tidy it up.” I’m wary of that. SSML can improve how writing is expressed but it can’t fix a poorly worded prompt. The wording and the markups need to work together.
I want to lay some groundwork before I can get a feel for the SSML challenge. Context is everything, so I need to (at least) have a feel for the experience the utterance would be written for. I’ll write a few turns between user and bot to lead me into the main focus: the bot sharing contact details with the user.
Although I’ll aim to ‘lead in’ to the challenge with some context first, my focus is purely on the task at hand — making sure the SSML supports the user’s cognition of the info details. In other words, the bot needs to speak to be understood!
I have to juggle various considerations:

the overall design
the context of the conversation
whether address and phone number should be shared together in one utterance (my gut feeling is that could be too much: immediately after hearing a phone number I’ll struggle to remember all of it unless I write it down straight away, so there’s no way I’ll remember an address too)
whether the address can be shared in a shortened form (street, number and postcode) or through a service like What3Words, letting the user ask for more if they need it (city, region, etc.). That only works if the bot is 100% confident it knows where the user is before sharing the information, if the user actually uses a service like What3Words, and so on…

Here’s my rough plan:

Demo the interaction up to the utterance where the voice assistant shares ‘contact’ info for the restaurant
Scrub up the contacts info utterance
Apply SSML to improve it further

My Design Process

Setting the context

As said above — this is just to lead me in. I’m not aiming to make this perfect. It’s purely to give me a feel for the challenge, and is un-scrubbed DEMO content:

[User is in Birmingham, England and the voice assistant knows their exact home address from previously shared data]
USER: [to themself] I want a steak pie… [out loud] Hey Alexa, where near me can I get English pub food?
ALEXA: I’ve found the three nearest restaurants serving English pub food; would you like to hear distance from your home, their user ratings or their accessibility options?
USER: Tell me their ratings
ALEXA: Lion’s House has the highest user rating of 4.8, Red Lion has 4.7 and Three Lions has 4.3. Which would you choose?
USER: Lion’s House
ALEXA: Lion’s House address is 38 Kingston Drive, Birmingham. Their telephone number is 0121 496 0999. Would you like to hear that again or have it sent to your phone?

So near yet so far... At least I have a feel for the interaction. Now I’ll focus on the last Alexa utterance and scrub it up.

The Challenge — scrubbing the utterance

Here’s the utterance I wrote:

ALEXA: Lion’s House address is 38 Kingston Drive, Birmingham. Their telephone number is 0121 496 0999. Would you like to hear that again or have it sent to your phone?

(Notes:

This would be a very short challenge if I just assumed that every user will request to have it sent to their phone, but that’s cheating! Some users will want to understand it the first time they hear it without relying on multi-modality
It seems odd to me to give the user BOTH the address and the phone number at the same time. Who needs that? Those are separate pieces of information and I can’t think when anyone would use both at the same time. We’re used to seeing them together on a ‘contacts’ page but that’s a different form of information architecture. This is about what’s best for speech…
A user may need both the address and phone number, but at different times — perhaps needing the phone number immediately to make a reservation, and then needing the address later to find the restaurant
Alexa can become a telephone because it uses speech as the input and audio as the output. Why not connect the user directly without sharing the phone number? Alexa supports making phone calls. That would speed up the interaction — Alexa could call the restaurant instantly and thereby connect the user — no need for an ugly old phone number…)

Instead, it could be (2nd attempt):

Lion’s House is 0.2 miles away. Would you like to call them now and receive the address on your phone?

(Notes:

That feels pretty streamlined to me. The user gets confirmation that the restaurant is close, they get the chance to call now and receive the address for later. It makes a lot of assumptions though — it leads the user towards an outcome that I think they want.

Although the aim is to be frictionless, some people really do want to hear the address and telephone number. I’m thinking of friends’ parents who get irritated when information bounces between their devices and they don’t feel like they’re in control. On the other hand, they don’t mind friction as long as they feel they’re in control. I can’t assume that everyone is as tech-savvy as me and my friends.

So, I’ll attempt to write the utterance with all information presented to the user without any attempt to redirect them elsewhere. I’m assuming this is what Jesús intended for this challenge, but it doesn’t hurt for me to work through the problem and consider other possible pathways and solutions — this conversation could have multiple pathways depending on each user’s needs.

Alexa knows where the user is — so it can just give the street name and number. The phone number needs to be carefully presented, as it could sound like ‘machine-gun speech’. I think the best way to solve this is to bunch numbers together into small groupings)
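
As a rough sketch of that grouping idea in Python: the function name and the 4-3-4 grouping are assumptions that fit this one number, not a rule for UK phone numbers in general.

```python
def chunk_digits(number, sizes=(4, 3, 4)):
    """Split a phone number into small groups, ignoring spaces.

    `sizes` is a hypothetical grouping chosen to fit 0121 496 0999.
    """
    digits = "".join(ch for ch in number if ch.isdigit())
    if sum(sizes) != len(digits):
        raise ValueError("group sizes must add up to the number of digits")
    groups = []
    start = 0
    for size in sizes:
        groups.append(digits[start:start + size])
        start += size
    return groups


# Commas between the groups give the TTS voice a place to pause.
print(", ".join(chunk_digits("01214960999")))
```

The point is only that the grouping is decided before any wording or markup is applied, so each chunk is short enough to hold in memory.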

3rd attempt:

ALEXA: Lion’s House is on 38 Kingston Drive. Their number is 0121 496 0999. Would you like the full address or the number again?

(Notes:

I’m not sure what else I would change here. I feel like this does what it’s expected to.

It provides the address and phone number, and gives the user options if they need more info, or didn’t catch the number.)

Then the user might ask for the full address and Alexa would say this:

ALEXA: The full address for Lion’s House is 38 Kingston Drive, Birmingham, B5 2RH, United Kingdom. Would you like that again?

Or they might ask for the phone number again:

ALEXA: The number for Lion’s House is 0121 496 0999. Would you like to hear that again?
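
The three prompts above form a small branching structure, which could be sketched like this. The request labels (`full_address`, `number_again`) and the dictionary layout are hypothetical names for illustration, not anything from a real Alexa skill.

```python
# Details taken from the demo conversation in this post.
RESTAURANT = {
    "name": "Lion's House",
    "short_address": "38 Kingston Drive",
    "full_address": "38 Kingston Drive, Birmingham, B5 2RH, United Kingdom",
    "phone": "0121 496 0999",
}


def prompt_for(request, place=RESTAURANT):
    """Return the prompt for a (hypothetical) follow-up request label."""
    if request == "full_address":
        return (f"The full address for {place['name']} is "
                f"{place['full_address']}. Would you like that again?")
    if request == "number_again":
        return (f"The number for {place['name']} is {place['phone']}. "
                "Would you like to hear that again?")
    # Default: the short first answer that offers both follow-ups.
    return (f"{place['name']} is on {place['short_address']}. "
            f"Their number is {place['phone']}. "
            "Would you like the full address or the number again?")


print(prompt_for("number_again"))
```

Each branch carries one piece of information and ends with a question, so the user is never asked to hold the address and the number at the same time after the first turn.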

I feel that this is working. Onto the SSML!

The Challenge — scrubbing the SSML

Here’s the final prompt again:

ALEXA: Lion’s House is on 38 Kingston Drive. Their number is 0121 496 0999. Would you like the full address or the number again?

Within Amazon Polly that looks like this:

(I put each sentence on a new line as I know I’ll add SSML tags later — it makes no difference to how Polly reads the text)

<speak>
Lion’s House is on 38 Kingston Drive. 
Their number is 0121 496 0999. 
Would you like the full address or the number again?
</speak>

And that sounds like this:

(Notes:

I think “nine nine nine” sounds clumsy. The user could think ‘was that two nines or three?’

It can easily be cleared up to become “treble nine” or “triple nine”. Also simply adding commas helps to ‘chunk’ the data in a way that’s better for cognition)

Chunked with commas, ‘999’ becomes ‘treble 9’:

<speak>
Their number is 0121, 496, 0 treble 9. 
</speak>

(Notes:

I feel that’s better but it’s odd how the first four numbers are read separately (“oh one two one”), then the middle numbers are read like a single number (“four hundred ninety six”) and the last numbers are presented in a different way again (“oh treble nine.”)

I wonder if it’s clearer when the style is more consistent?)

I’m going to do something very fake to get a natural result!

<speak>
Their number is 0121, four nine six, 0 treble 9. 
</speak>

To my eyes it looks bizarre but to my ears it sounds fine (something I’m very used to as a sound designer 😄). Have a listen:

(Notes:

If I’m being really nitpicky (and I know this is now a case of overcooking the stew), it’s odd that the first ‘0’ is pronounced as “oh” whereas the second ‘0’ is pronounced as “zero”. I’m not sure how much difference this makes for the user but the writing could be altered to clear that up. In my experience it’s also something that brands and localization teams would want corrected)

Altered to become as consistent as possible:

<speak>
Their number is zero one two one, four nine six, zero treble nine. 
</speak>

(Notes:

So, I feel that utterance is now complete and it works well when spoken aloud by Amazon Polly. You could say “hey, you didn’t use any SSML tags!” and you would be right. However to my ears it sounds good and I’m not sure what I would change. Instead of using SSML, my approach was to alter the wording of the utterance so that Polly read the same information in a better manner — I guess you could say that’s SSML 😂

I feel that at this point it makes sense to test it with users.)
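
The hand-written rewording could also be generated, so other numbers get the same treatment. This is a minimal sketch that assumes the written number already uses spaces as chunk boundaries. The function names are made up for illustration, and only a run of exactly three identical digits becomes “treble”.

```python
DIGIT_WORDS = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]


def speak_group(group):
    """Spell one chunk digit by digit, turning a run of three into 'treble'."""
    words = []
    i = 0
    while i < len(group):
        # Measure how many times the current digit repeats in a row.
        run = 1
        while i + run < len(group) and group[i + run] == group[i]:
            run += 1
        word = DIGIT_WORDS[int(group[i])]
        if run == 3:
            words.append("treble " + word)
        else:
            words.extend([word] * run)
        i += run
    return " ".join(words)


def speak_number(number):
    """Spell each space-separated chunk and join the chunks with commas."""
    return ", ".join(speak_group(group) for group in number.split())


print(speak_number("0121 496 0999"))
# zero one two one, four nine six, zero treble nine
```

Every zero comes out as “zero” here, which keeps the style consistent. Swapping the first entry of `DIGIT_WORDS` for “oh” would be the place to change that if a brand preferred it.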

Here’s the final text as read by Amazon Polly (Joanna voice):

<speak>
Lion’s House is on 38 Kingston Drive.
Their number is zero one two one, four nine six, zero treble nine. 
Would you like the full address or the number again?
</speak>
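
To render that text from a script rather than by pasting it into the Polly console, something like the following should work. It is a sketch under assumptions: it uses the boto3 library with AWS credentials already configured, and the output file name is arbitrary.

```python
FINAL_SSML = """<speak>
Lion's House is on 38 Kingston Drive.
Their number is zero one two one, four nine six, zero treble nine.
Would you like the full address or the number again?
</speak>"""


def build_polly_request(ssml, voice_id="Joanna"):
    """Collect the arguments for Polly's synthesize_speech call."""
    return {
        "Text": ssml,
        "TextType": "ssml",   # tell Polly the text is SSML, not plain text
        "VoiceId": voice_id,
        "OutputFormat": "mp3",
    }


def synthesize(ssml, path="lions_house.mp3"):
    """Send the SSML to Amazon Polly and save the audio (needs AWS access)."""
    import boto3  # imported here so the rest of the sketch runs without it

    polly = boto3.client("polly")
    response = polly.synthesize_speech(**build_polly_request(ssml))
    with open(path, "wb") as audio_file:
        audio_file.write(response["AudioStream"].read())


# synthesize(FINAL_SSML)  # uncomment once credentials are set up
```
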

And this is how it sounds:

Final thoughts

That’s my thought process for this challenge. I know I meander a lot but this took me just a few hours and my aim is to get the best result — regardless of the process that leads me there. Language can be ambiguous. Conversational AI design is complicated. All the users will know about your design process are the final designs you settled on, not the various things you threw away to get there!

When I close my eyes and listen to the final version it sounds clear and I can follow along comfortably. My assumption is it would work well if that was the first time I heard it. I always listen to TTS without looking at the screen because reading along is misleading: no user will experience it that way. With eyes shut I’m hearing it like someone using a voice skill. The problem is that by the time I’ve tweaked it I’ve heard it so many times it’s impossible to imagine exactly how it sounds for a new user.

I expected to need more SSML too! I’m surprised that I got results without any <tags> but I trust my ears.

Perhaps this confirms my feeling at the start that when the prompt is worded properly there is less of a need to try and ‘fix’ it with SSML.

Whether that’s true or not, anything created for a conversational design needs testing to take it from hypothesis to an effective product. Perhaps with thorough testing I would find myself using a lot of <tags>!
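
If testing did point towards tags, one candidate to compare against the reworded version might look like the sketch below. It is an untested hypothesis, not a recommendation: it leans on `<say-as interpret-as="digits">` and `<break>`, both of which Polly documents, but the 300ms pause is a guess and whether a voice says “oh” or “zero” is left to the engine.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def tagged_number(number, pause_ms=300):
    """Wrap each space-separated chunk in say-as, with a break between chunks."""
    chunks = [
        f'<say-as interpret-as="digits">{escape(chunk)}</say-as>'
        for chunk in number.split()
    ]
    pause = f'<break time="{pause_ms}ms"/>'
    return f"<speak>Their number is {pause.join(chunks)}.</speak>"


ssml = tagged_number("0121 496 0999")
ET.fromstring(ssml)  # raises ParseError if the markup is not well formed
print(ssml)
```

Listening to both versions with eyes shut, and ideally with real users, would be the way to decide between them.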

Benjamin McCulloch

Conversation Designer (with audio superpowers)

Conch.design
