We know that writing is hugely important to conversation design.
But I think there’s something else that is equally important and doesn’t get enough attention.
Great writing will take you toward the final result, but the user will experience the voice that embodies your writing.
I find it peculiar how some conversation designers place less importance on how the voice talks than on what it says. To me, this is like saying “my final results don’t matter; the work I produce before that is more important.”
Some designers act as if converting written utterances to a spoken voice were as simple as converting a .doc to a .pdf. In fact a profound transformation occurs, and I think conversation designers need to be aware of it.
The voice ties together the writing and the persona design. It finalises what the design was intended to be, and either achieves it or misses the mark.
Users won’t be aware of how your wonderful writing looked when you wrote it. Writing is one medium and verbal communication is another. They’re not direct copies of each other. This is why I think something is missing in many discussions around conversation design.
Why do I feel so strongly about this? I’ve experienced it many times working in sound production.
Talking is easy, but delivering a message is hard
Perhaps you’re thinking:
‘It’s just talking. Talking is easy! Writing is the hard part.’
‘Just talking’ is what you do when you’re wasting time: having a blether, chitchat, natter or chinwag, whatever you want to call it. If the only price of failure is having to repeat yourself while you chat with a friend over coffee, then that’s ‘just talking’. On the other hand, if the price of failure is that it reflects poorly on a brand, destroys trust, or misleads or confuses the user, then that is something much more serious.
I’ve engineered and directed many recording sessions with voice actors. If you’ve never been in a sound studio you wouldn’t believe the focus that is placed on the human voice while recording. Everything is under a microscope. Everything is amplified.
For example, I once engineered a recording for a washing detergent TV commercial. The voice artist was only asked to say the words ‘goji berries’ repeatedly for one hour. The director wanted exactly the right intonation so that the commercial made sense. It was costing them thousands of dollars but they felt it was worth it.
It’s common to have that level of focus in sound production. Every single syllable, breath and vocalisation is labored over. Why? Because the recorded voice is the final result of the process that began with the writer imagining it. The voice has to embody the persona and the things that persona would say.
If there is so much care taken with sound production for a brand’s commercials then why don’t conversation designers take the same care with their voice skills? What if it’s the same brand? Are we implying that the delivery of a brand’s message only matters in commercials, but in a voice experience it’s only a minor concern? What if the user saw the commercial and wanted to talk with that brand but discovers that they’re a terribly boring conversational partner?
The voice needs to embody everything the design was supposed to be, because if we don’t hear the detail in the voice it’s gone forever. Users aren’t listening to writing, they’re listening to ‘what that person just said to me’. The conversation design is a spokesperson: the voice of a brand, or organization, or charity, or movement, or nation, or religion, or any other client.
Your writing will transform
People will hear your writing because the voice assistant communicates with audio. Your writing will transform.
It starts as an idea in your head, then you find the words and revise until the copy is the best you can write (and you should read aloud while you write), and then it’s spoken aloud by the voice you selected for the design.
That final stage — speaking aloud — won’t sound exactly like the persona you imagined or the copy you wrote. The voice will give it a different spin. The emphasis and intonation will be placed differently from the way you imagined. Your words will sound funny with that accent. Sometimes the meaning of those words will be enhanced, and sometimes it will be altered, and sometimes it will be ruined.
Your writing will be interpreted.
Please think about what I mean by interpretation.
Look at the sheet music on the left. Sheet music is just instructions for a musical performance.
The song is ‘All Along The Watchtower’ by Bob Dylan.
It was interpreted by Jimi Hendrix.
How can the same chords, words and melody produce such a wildly different result?
Because Hendrix and Dylan are two wildly different personalities who interpreted the same instructions differently.
Here’s a recipe. This isn’t a meal — it’s just instructions. The final result will vary depending on the skill of the cook, the ingredients they have available, their focus and their equipment when they turn that recipe (idea!) into a meal (actual worldly result).
A theatrical script is not a performance. How many times have Shakespeare’s most famous works been interpreted? Countless times, for radio, film and the stage. Baz Luhrmann’s film of Romeo and Juliet added pop culture references and retro-futuristic design to the same dialogue that Shakespeare published in 1597. The result was wildly different from what Shakespeare could have imagined.
Interestingly, with TV, film and theatre the writer is hugely important but rarely has final say over the finished product; that is the director’s responsibility. Often many changes happen to a script without the writer’s consent because the interpretation of their idea has led to different results and the director needs to turn it into a cohesive audiovisual experience. The writer creates the script and a vast crew of people interprets it with the director guiding the process.
Finally, the image of an interaction flow below isn’t a conversation:
The conversation will happen when a user converses with it.
Have you ever experienced this transformation? Have you written something and discovered that it sounded different from how you intended when TTS or a voice actor read it aloud for you? Was the meaning altered by the performance?
You write instructions that become a real worldly experience when they’re presented by a voice that converses with a user.
What difference does it make?
Let me put it this way:
Do you want the user to stop engaging with the experience because they need to think about what your utterances mean? How bad a user experience is it if your design is misunderstood?
What if an ‘exciting and daring’ brand is presented by a voice that talks with all the enthusiasm of someone reading out their tax return?
Your writing needs to be presented well because the alternative is confusion, frustration and cognitive load. Don’t be lured into thinking a boring result is a functional one — boredom negatively affects engagement, which will likely make the experience less efficient.
We’re so used to talking and listening that it’s jarring when someone breaks from convention.
TTS speaks incredibly clearly but it doesn’t know how to present material. Is that what your brand wants? If the only goal of talking aloud was to say words clearly then TTS would fit the bill perfectly, but sadly it’s not that simple. Conversations are about negotiation, persuasion, connection, empathy, teaching, listening, and many more purposes. Sometimes they are all of those things in the same conversation!
With synthetic voices you’re faced with a contradiction: the designer alone can decide on the final result, but they have very limited capabilities to achieve it. Finessing takes a great deal of time, and you need a concept for what you want to achieve before you can make any meaningful adjustments. Tweaking and listening repeatedly is the only way to improve the result, and that is a slow and tedious process. TTS is only a quick solution if you forgo the necessary SSML that makes it sound good.
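To give a feel for that tweak-and-listen loop, here is a minimal sketch in Python that generates a few SSML variants of one line so you can audition them side by side. The function names (`ssml_variant`, `audition_set`) and the example sentence are hypothetical, made up for illustration. The `<speak>`, `<break>` and `<emphasis>` elements come from the W3C SSML specification, but how a given TTS engine renders them (or whether it supports them at all) varies, so treat this as a starting point rather than a recipe.

```python
from xml.sax.saxutils import escape

# Emphasis levels taken from the SSML spec; engine support is an assumption.
EMPHASIS_LEVELS = ("reduced", "moderate", "strong")


def ssml_variant(before: str, phrase: str, after: str,
                 level: str, pause_ms: int) -> str:
    """Wrap one key phrase in <emphasis>, preceded by a short pause."""
    if level not in EMPHASIS_LEVELS:
        raise ValueError(f"unknown emphasis level: {level!r}")
    return (
        "<speak>"
        f"{escape(before)}"                       # escape &, < and > in the copy
        f'<break time="{pause_ms}ms"/>'           # pause before the key phrase
        f'<emphasis level="{level}">{escape(phrase)}</emphasis>'
        f"{escape(after)}"
        "</speak>"
    )


def audition_set(before: str, phrase: str, after: str,
                 pause_ms: int = 200) -> list[str]:
    """Return one variant per emphasis level, ready to send to a TTS engine."""
    return [
        ssml_variant(before, phrase, after, level, pause_ms)
        for level in EMPHASIS_LEVELS
    ]


# Hypothetical line of copy; only the emphasis on the key phrase changes.
variants = audition_set("Now with ", "goji berries", ".")
for ssml in variants:
    print(ssml)
```

Each string would then be sent to whichever TTS service you use and listened to in turn. The script only saves typing; deciding which variant fits the persona is still a matter of listening.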
On the other hand, there are voice actors who know how to present material in various contexts (news, marketing, drama, explainer videos, etc.) and usually need only a little guidance to get great results. I know from personal experience that a voice actor will get a better result than TTS in a fraction of the time because you get everything in the same moment: your script will be spoken clearly, with the persona and the performance you asked for. Of course, that comes at a price.
Only you know what the design is supposed to be. You want the end user to experience it the way you intended so how do you achieve that?
Whichever voice you choose — voice talent or synthetic — your design will be interpreted and the results will likely be different from what you intended.
Do the results fit the message, the persona you designed, the brand (and their style guide for VO) and the target group’s way of talking?
If you trust synthetic voices to present your writing well without any tweaking then that strategy amounts to ‘hit and hope’. That’s not a design strategy.
When your design speaks well it will improve the experience.