Voice Machine

Spoken word and voice play

113. Lynn Hershman Leeson's DiNA, Artificial Intelligent Agent Installation (2002–2004) is an animated, artificially intelligent female character with speech recognition and expressive facial gestures. DiNA converses with gallery guests, generating answers to questions and becoming “increasingly intelligent through interaction.”

Brief

Create an interactive chatbot, eccentric virtual character, or spoken word game centered on speech input and/or speech output. For example, you might make a rhyming game or memory challenge; a voice-controlled book; a voice-controlled painting tool; a text adventure; or an oracular interlocutor (think: Monty Python's Keeper of the Bridge of Death). Consider the creative affordances of using a tightly restricted vocabulary as well as the dramatic potential of rhythm, intonation, and volume of speech. Keep in mind that speech recognition is error-prone, so find ways to embrace the lag and the glitches, at least for another couple of years. Graphics are optional.
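
One way to work with a tightly restricted vocabulary is to map whatever the recognizer hears onto the nearest word in a short command list, so that misrecognitions still land somewhere useful. The following is a minimal sketch in Python using the standard library's difflib module. The vocabulary, the function name match_command, and the 0.6 similarity cutoff are all assumptions chosen for illustration, and the transcript is assumed to arrive as plain text from whichever speech recognizer you use.

```python
import difflib

# Hypothetical command vocabulary for a voice-controlled painting tool.
VOCABULARY = ["red", "blue", "bigger", "smaller", "undo", "clear"]


def match_command(transcript, vocabulary=VOCABULARY, cutoff=0.6):
    """Return the first known command that some word in the transcript
    resembles closely enough, or None if nothing comes close."""
    for word in transcript.lower().split():
        # get_close_matches ranks candidates by string similarity (0.0 to 1.0)
        # and drops any that score below the cutoff.
        matches = difflib.get_close_matches(word, vocabulary, n=1, cutoff=cutoff)
        if matches:
            return matches[0]
    return None


if __name__ == "__main__":
    for heard in ["paint it blew", "make it biger", "zzz"]:
        print(repr(heard), "->", match_command(heard))
```

Lowering the cutoff makes the system more forgiving and more surprising; raising it makes the system stricter. Either extreme can be used deliberately as part of the character of the piece.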

Learning Objectives

Variations

Making It Meaningful

The capacity to speak has long been perceived as a sign of intelligence. For this reason, machines that speak can seem uncanny or even supernatural, as they decouple ancient bindings between voice, living matter, and intelligence. In the field of interaction design, voice interfaces are thought to make technologies more intuitive and accessible than their visual or typographic counterparts—but such anthropomorphized machines do this at the expense of our ability to accurately estimate how much they actually “understand.”

In a conversation, information is transmitted and received on multiple registers—in not only what is said, but also how it is delivered, and by whom. We have exquisitely tuned capacities for inferring contextual information like emotion, gender, age, health, and socioeconomic status from a speaking voice. Intonation, rhythm, pace, and rhyme are also used to create drama, suspense, sarcasm, and humor. When creating new experiences through speech, simple operations may be the most generative. For example, a vast range of meanings can arise just from altering the emphasis of words in a sentence.
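
As a small illustration of how much can come from shifting emphasis, the sketch below produces one version of a sentence per word, each stressing a different word. It marks the stress with the emphasis element from the W3C Speech Synthesis Markup Language (SSML). Whether and how a given speech synthesizer honors that element is an assumption to check against your own tools, and the function name emphasis_variations is hypothetical.

```python
def emphasis_variations(sentence):
    """Return one SSML string per word, each stressing a different word."""
    words = sentence.split()
    variations = []
    for i in range(len(words)):
        # Wrap only the i-th word in an SSML emphasis element.
        marked = [
            f'<emphasis level="strong">{w}</emphasis>' if j == i else w
            for j, w in enumerate(words)
        ]
        variations.append("<speak>" + " ".join(marked) + "</speak>")
    return variations


if __name__ == "__main__":
    for line in emphasis_variations("I never said she stole my money"):
        print(line)
```

Read aloud, each of the seven variations implies something different about who said what, and about what happened to the money.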

Speech has a key paralinguistic social role. Through chit-chat and banter we establish trust, build relations, and create intimacy. Wordplay, punnery, and other playful verbal exchanges create a protected space for this social activity by exploiting ambiguities in the rules of language itself. Culture is embedded and propagated in the protocols of knock-knock jokes, call-and-response songs, and once-upon-a-time fairy tales. These rule-based media lend themselves well to creative manipulation with code. Potentially helpful tools for generating speech algorithmically include context-free grammars, Markov chains, recurrent neural networks (RNNs), and long short-term memory (LSTM) networks. Note that some commercial speech analysis tools transmit users’ voice data to the cloud, raising issues of data ownership and privacy.
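
Of the generative tools mentioned above, a context-free grammar is perhaps the quickest to try. The sketch below is a toy example in Python: the grammar, its symbol names, and the functions expand and generate are all invented for illustration. Each uppercase key is a rule that expands into one of several alternatives, and anything that is not a key is treated as literal text. The resulting string could then be handed to any text-to-speech tool.

```python
import random

# A toy grammar invented for illustration. Each key maps to a list of
# alternative expansions; each expansion is a list of symbols or literal text.
GRAMMAR = {
    "STORY": [["Once upon a time,", "HERO", "DEED", "."]],
    "HERO": [["a", "ADJ", "CREATURE"]],
    "ADJ": [["sleepy"], ["talkative"], ["rusty"]],
    "CREATURE": [["robot"], ["parrot"], ["blender"]],
    "DEED": [["sang to", "HERO"], ["whispered a secret"], ["forgot its own name"]],
}


def expand(symbol, rng, grammar=GRAMMAR):
    """Recursively expand a symbol into a flat list of literal strings."""
    if symbol not in grammar:
        return [symbol]  # not a rule name, so it is literal text
    production = rng.choice(grammar[symbol])
    pieces = []
    for part in production:
        pieces.extend(expand(part, rng, grammar))
    return pieces


def generate(start="STORY", seed=None):
    """Generate one sentence; passing a seed makes the result repeatable."""
    rng = random.Random(seed)
    text = " ".join(expand(start, rng))
    return text.replace(" .", ".")  # tidy the space before the final period


if __name__ == "__main__":
    for _ in range(3):
        print(generate())
```

Adding more alternatives to a rule, or letting rules refer to one another, quickly widens the range of sentences. Take care that any rule which refers back to itself also has an alternative that ends, or the expansion may never stop.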

114. In Conversations with Bina48 (2014), Stephanie Dinkins performs improvised conversations about algorithmic bias with BINA48, a chatbot-enabled face robot. BINA48 was commissioned by entrepreneur Martine Rothblatt, and constructed by roboticist David Hanson, to resemble Rothblatt's wife Bina.

115. Hey Robot (2019) by Everybody House Games is a game in which teams compete to make a smart home assistant (such as Amazon Alexa or Google Home) say specific words.

116. In David Rokeby's The Giver of Names (1991–1997), a camera detects objects placed on a pedestal by members of the audience. A computerized voice then describes what it sees—with strange, uncanny, and often poetic results.

117. When Things Talk Back (2018), by Roi Lev and Anastasis Germanidis, is a mobile AR app that gives voice to everyday objects. The software automatically identifies objects observed by the system's camera, and anthropomorphizes them with AR overlays of simple faces. It then uses ConceptNet, a freely available semantic network, to retrieve information about the objects and their possible interrelationships. The app uses this information to generate humorous and sometimes poignant conversations between the objects in the scene.

118. David Lublin's Game of Phones (2012) is “the children's game of telephone, played by telephone.” Players receive a phone call and hear a prerecorded message left by the previous player; they are then prompted to re-record what they recall hearing for the next person in the queue. After a week, the entire chain of messages is published online.

119. Neil Thapen's playful Pink Trombone (2017) is an interactive articulatory speech synthesizer for “bare-handed speech synthesis.” Using a richly instrumented simulation of the human vocal tract, the project enables a wide range of vocal noisemaking in the browser.

120. Kelly Dobson's Blendie (2003–2004) is a 1950s Osterizer blender, adapted to respond empathetically to a user's voice. A person induces the blender to spin by vocalizing. Blendie then mechanically mimics the person's pitch and power level, from a low growl to a screaming howl.

121. In Nicole He's speech-driven forensics game, ENHANCE.COMPUTER (2018), players yell out commands like “Enhance!”—living out a science-fiction fantasy of infinitely zoomable images.

Additional Projects

Readings

  1. Zed Adams and Shannon Mattern, “April 2: Contemporary Vocal Interfaces,” readings from Thinking through Interfaces (The New School, Spring 2019).
  2. Takayuki Arai et al., “Hands-On Speech Science Exhibition for Children at a Science Museum” (paper presented at WOCCI 2012, Portland, OR, September 2012).
  3. Melissa Brinks, “The Weird and Wonderful World of Nicole He's Technological Art,” Forbes, October 29, 2018.
  4. Geoff Cox and Christopher Alex McLean, “Vocable Code,” in Speaking Code: Coding as Aesthetic and Political Expression (Cambridge, MA: MIT Press, 2013), 18–38.
  5. Stephanie Dinkins, “Five Artificial Intelligence Insiders in Their Own Words,” New York Times, October 19, 2018.
  6. Andrea L. Guzman, “Voices in and of the Machine: Source Orientation toward Mobile Virtual Assistants,” Computers in Human Behavior 90 (2019): 343–350.
  7. Nicole He, “Fifteen Unconventional Uses of Voice Technology,” Medium.com, November 26, 2018.
  8. Nicole He, “Talking to Computers” (lecture, awwwards conference, New York, NY, November 28, 2018).
  9. Halcyon M. Lawrence, “Inauthentically Speaking: Speech Technology, Accent Bias and Digital Imperialism” (lecture, SIGCIS Command Lines: Software, Power & Performance, Mountain View, CA, March 2017), video, 1:26–17:16.
  10. Halcyon M. Lawrence and Lauren Neefe, “When I Talk to Siri,” TechStyle: Flash Readings 4 (podcast), September 6, 2017, 10:14.
  11. Shannon Mattern, “Urban Auscultation; or, Perceiving the Action of the Heart,” Places Journal, April 2020.
  12. Mara Mills, “Media and Prosthesis: The Vocoder, the Artificial Larynx, and the History of Signal Processing,” Qui Parle: Critical Humanities and Social Sciences 21, no. 1 (2012): 107–149.
  13. Danielle Van Jaarsveld and Winifred Poster, “Call Centers: Emotional Labor over the Phone,” in Emotional Labor in the 21st Century: Diverse Perspectives on Emotion Regulation at Work, ed. Alicia A. Grandey, James M. Diefendorff, and Deborah E. Rupp (New York: Routledge, 2012), 153–173.
  14. “Vocal Vowels,” Exploratorium Online Exhibits, accessed April 14, 2020.
  15. Adelheid Voskuhl, “Humans, Machines, and Conversations: An Ethnographic Study of the Making of Automatic Speech Recognition Technologies,” Social Studies of Science 34, no. 3 (2004).