Voice assistants – speak or die?

On: September 25, 2017
About Aleksandra Straczek


Tech’s Big Five feel honour-bound to win the race for the best voice assistant. Since Siri’s introduction on the iPhone 4s in 2011 and Alexa’s official debut on the Echo smart speaker in 2015, the appetite for more natural ways of human-computer interaction has only grown. As of today, Amazon’s Alexa comes in five Echo variants: the classic Echo, Tap, Dot, Look and Show; Apple’s Siri works on iPhone, iPad, Mac, Apple Watch, Apple TV, and soon also on the HomePod smart speaker; Google, with its Google Assistant for Android, iOS and the Google Home smart speaker, does not lag behind; and Microsoft, with Cortana, is chasing the breakaway group. Facebook is more careful – according to Stan Chudnovsky, Head of Product for Messenger, voice technology is still too unreliable (Wagner). But rumour has it that Facebook is working intensively on voice-based services as well (Lunden).

Nevertheless, before any of these companies declares victory, a few challenges have to be faced. There is a big gap between what Donald Norman calls real affordances – functions attached to an object – and perceived affordances – functions that are clear to the user (Davis and Chouinard 2). Let’s focus on Alexa for a moment. It has over 20,000 skills (Amazon Echo & Alexa Stats), but people use it for very basic tasks like setting a timer, playing a song or reading the news (Suplizio et al.), and even if they try out a new skill, only 3% will still be using it a week later (Condliffe). What are the possible reasons? First of all, to use a new skill one has to find it in the Alexa mobile app or on the Amazon website and enable it by hand. Furthermore, once enabled, the skill has no graphical user interface that would let the user explore the available options. To formulate a task one has to know the right way to phrase it, which often means memorizing the skill’s name. “Alexa, ask [skill’s name] to [action]” is the most popular combination. And I really can’t imagine anyone saying “Alexa, ask Sushi Facts to tell me an interesting fact about sushi” without an indulgent smile, even if he or she finds facts about sushi fascinating. We shouldn’t pretend that this is what a natural conversation looks like. We should make it clear instead that, because of the shortcomings of natural language processing and machine learning, communicating with Alexa at the moment resembles giving commands to your dog – “sit” might work, but “take a load off, buddy” not necessarily. Tech companies vacuum up millions of sentences every day to advance computers’ capabilities of parsing, understanding and responding to human speech. The users of voice-controlled assistants definitely make the whole process easier, since their queries and commands may be recorded, transformed into anonymized data and used as samples (Cao and Bass), but the technology is still far from perfect.

Taking this further, maybe voice user interfaces should simply be (at least for now) another way of triggering a desired chain of actions, not a new, all-embracing solution. The introduction of the Echo Show (a touchscreen device with built-in Alexa) points in this direction. One of the most convincing theories is presented in the article “What voice UI is good for (and what it isn’t)” by Des Traynor. He refers to Bill Buxton’s concept of placeonas, which is itself based on the popular concept of personas (fictional characters meant to represent a certain type of user) but takes their current location into consideration: “The ‘in a library wearing headphones’ placeona is ‘hands free, eyes free, voice restricted, ears free’ (…) The ‘driving’ placeona is ‘hands busy, eyes busy, ears free, voice free’” (Traynor). I would argue it’s more about activity than location, but never mind. The point is that there are situations in which voice user interfaces are probably the best choice (e.g. while driving), and others (e.g. in a library) in which graphical user interfaces are more suitable.

“Voice won’t kill touchscreens. Touchscreens didn’t kill the mouse. The mouse didn’t kill the command line” (Traynor).

But people don’t think about voice interfaces rationally. We are dreamers with Spike Jonze’s movie “Her” in the back of our heads. Luger and Sellen’s semi-structured interviews with 14 regular users of Siri, Google Now and Cortana support this view. People, especially those who describe themselves as having little or no technical knowledge, tend to have very high expectations of a voice assistant’s (or, as Luger and Sellen’s report calls it, a conversational agent’s) capability and intelligence. These people also have a habit of describing voice assistants with gendered pronouns and of assigning specifically human traits to them: “There was one time I was very [sarcastic] to it, I was like ‘oh thanks that’s really helpful’ and it just said, I swear, in an equally sarcastic tone ‘that’s fine it’s my pleasure’” (Luger and Sellen 5292).

So on the one hand we have learned that machines have their own, non-human way of processing information. We have got used to entering keywords into search engines (after all, nobody but 86-year-old May Ashworth types “Please translate these roman numerals mcmxcviii thank you” into Google anymore). But on the other hand Nass and Reeves claim that we “respond to communication media, media technologies, and mediated images as we do to actual people and places” (Weiss 636). This seems to be even more true of a technology that has been transformed from a simple tool into something we can talk to. SPACE10 launched a worldwide survey called “Do You Speak Human” to find out how people would like their future artificial intelligence to be. 73% said their AI should be humanlike rather than robotic. The majority want their AI to reflect their values and worldviews (69%) and to be able to detect and react to emotions (85%). Brad Abrams, group product manager for the Google Assistant Platform, claims that, according to Google’s research, the stronger the persona a conversational bot has, the better the retention it achieves (Wilson). Siri’s mean answer to “What is zero divided by zero?” made headlines around the world and became the inspiration for countless memes (if you haven’t tried it yet, catch up, it’s worth it).

Siri’s answers change over time. The answer to “I’m drunk” used to be “Just don’t breathe on me”; now it is “Neither of us is driving home”, accompanied by a “Call me a taxi” button. And here we get to another important factor when thinking about voice assistants – money. Alexa, Siri and Google Assistant are important not only because it’s more natural for humans to speak than to click, but also because of their underlying business model. They are platforms, which means they provide an infrastructure for value-creating interactions between external producers and consumers and set governance conditions for them (Parker, Van Alstyne and Choudary). By selling a smart speaker to customers, Amazon sells access to convenient ways of shopping (by the way, that’s exactly what Amazon Dash was supposed to do), ordering food, managing the home, getting around and so on. At the same time Amazon opens up a second revenue stream coming from producers and service providers who want to plug into the system it has created. This raises a few questions, since our voice-controlled assistants, being in the language of Thaler and Sunstein our choice architects (Thaler and Sunstein), will most likely propose the most cost-effective product or service, but from the perspective of the platform’s profit and not ours. By default Alexa plays music from the Amazon Music library and audiobooks from Amazon’s Audible library. Choosing another music service is possible, but restricted. Yes to Spotify, no to Tidal.

These platforms are also a titbit for brand managers, since they offer completely new ways of engaging customers. Johnnie Walker teamed up with Amazon last year and produced a skill that takes whisky connoisseurs on a guided tasting. As they sniff and sip, Alexa describes the scents and flavour notes they are experiencing and presents different ways of serving the chosen blend. Sounds better than just visiting the website, right?



Amazon Echo & Alexa Stats. 2017. Voicebot. 24 September 2017. <https://www.voicebot.ai/amazon-echo-alexa-stats/>.

Cao, Jing and Dina Bass. “Why Google, Microsoft and Amazon Love the Sound of Your Voice.” Bloomberg Technology. 2017. 24 September 2017. <https://www.bloomberg.com/news/articles/2016-12-13/why-google-microsoft-and-amazon-love-the-sound-of-your-voice>.

Condliffe, Jamie. “AI Voice Assistant Apps are Proliferating, but People Don’t Use Them.” MIT Technology Review. 2017. 24 September 2017. <https://www.technologyreview.com/s/603420/ai-voice-assistant-apps-are-proliferating-but-people-dont-use-them/>.

Davis, Jenny L. and James B. Chouinard. “Theorizing Affordances: From Request to Refuse.” Bulletin of Science, Technology & Society. Prepublished 16 June 2017. 24 September 2017. <http://journals.sagepub.com/toc/bsta/0/0>.

Do You Speak Human. 2017. SPACE10. 24 September 2017. <http://doyouspeakhuman.com>.

Lunden, Ingrid. “Facebook design head loud on voice, silent on Alexa and hardware.” TechCrunch. 2017. 24 September 2017. <https://techcrunch.com/2017/09/18/facebook-design-head-bullish-on-voice-dodges-questions-on-voice-apps-and-hardware/>.

Luger, Ewa and Abigail Sellen. “‘Like Having a Really Bad PA’: The Gulf between User Expectation and Experience of Conversational Agents.” Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems (2016): 5286-5297. 24 September 2017. <https://www.microsoft.com/en-us/research/wp-content/uploads/2016/08/p5286-luger.pdf>.

Parker, Geoffrey G., Marshall W. Van Alstyne and Sangeet Paul Choudary. Platform Revolution: How Networked Markets Are Transforming the Economy and How to Make Them Work for You. New York: W. W. Norton & Company, 2016.

Suplizio, Aaron et al. “Unpacking the Breakout Success of the Amazon Echo.” Experian. 2016. 24 September 2017. <https://www.experian.com/innovation/thought-leadership/amazon-echo-consumer-survey.jsp>.

Thaler, Richard H. and Cass R. Sunstein. Nudge: Improving Decisions About Health, Wealth, and Happiness. New Haven and London: Yale University Press, 2008.

Traynor, Des. “What voice UI is good for (and what it isn’t).” Intercom. 2017. 24 September 2017. <https://blog.intercom.com/benefits-of-voice-ui/>.

Wagner, Kurt. “Here’s why Facebook Messenger isn’t building a voice-controlled assistant like Alexa or Siri.” Recode. 2017. 24 September 2017. <https://www.recode.net/2017/5/2/15525048/facebook-messenger-m-voice-control-assistant>.

Weiss, David. “Media Equation Theory.” Encyclopedia of Communication Theory. Eds. Stephen W. Littlejohn and Karen A. Foss. N.p.: SAGE Publications, 2009. 636-637.

Wilson, Mark. “Google’s 3 Secrets To Designing Perfect Conversations.” Co.Design. 2017. 24 September 2017. <https://www.fastcodesign.com/90126166/googles-3-secrets-to-designing-perfect-conversations>.

