We’ve been dealing with standard stereo presentation up to now, but there are other ways we can create the spatial perception of sound. As with all other aspects we’ve been looking at, spatial audio is an enormous area of practice and research, and it is the one area of sound that is significantly in flux at the moment. Although Audacity has a rudimentary surround sound option, it has no native spatial sound tools. The skills and knowledge are quite specialized, but an overview to help you understand the area is useful. To do any real work in spatial audio, though, you’ll need to switch to one of the other DAW tools available. We also need to distinguish between surround sound, which is the presentation of sound on multiple speakers (the hardware), and spatial sound, which is the location of the sound in a 3D space, accomplished in the software.
One of the goals of surround sound is immersion and envelopment. Envelopment is the sensation of being surrounded by sound or the feeling of being inside a physical space (enveloped by that sound). Most commonly, this feeling is accomplished through the use of the subwoofer and bass frequencies, which create a physical, tangible presence for sound in a space. A subwoofer is specially designed to handle the lowest frequencies in the range of human hearing: from 20 Hz to about 200 Hz (the top end of this range varies according to specification, with 100 Hz common for professional live sound and 80 Hz for THX certification, but 200 Hz for consumer-level products).
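To make the crossover idea concrete, here is a minimal sketch, assuming NumPy and SciPy are available, of how a bass-management stage might low-pass a mix into a subwoofer feed. The 80 Hz crossover comes from the figures above; the filter order and the choice to leave the main channels full-range are illustrative simplifications, not a standard.

```python
import numpy as np
from scipy.signal import butter, sosfilt

def split_for_subwoofer(stereo, fs, crossover_hz=80.0, order=4):
    """Split a stereo mix into a low-passed mono sub feed plus the mains.

    crossover_hz: 80 Hz is a common home-theater crossover; live sound and
    consumer gear may use 100-200 Hz (see text). Values are illustrative.
    """
    mono = stereo.mean(axis=1)                                   # sum to mono for the sub
    sos = butter(order, crossover_hz, btype="low", fs=fs, output="sos")
    sub_feed = sosfilt(sos, mono)                                # keep only the lowest frequencies
    return sub_feed, stereo                                      # mains left full-range in this sketch
```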
In film, the sense of envelopment is perhaps the main reason for using surround sound, but when it comes to some other media like games, another really important purpose to spatializing sound emerges: to localize sound (that is, to locate sounds in space).
Humans have two ears, and this simple fact means that we are quite good at localizing sounds. There are three components to a sound's location in space: its azimuth (the angle on the horizontal plane around us), its elevation (the vertical angle, or height), and its distance from us.
Sounds located in front of us are the easiest for us to localize; sounds behind us (and to a lesser extent to the sides) are harder to localize accurately. The reason for this, we learned in chapter 1, is that the pinna is, in part, responsible for our localization ability. Sounds coming from behind do not reflect in the pinnae and so are harder to localize precisely. But the sound itself can influence how well we can determine where it comes from. We are quite poor at localizing bass frequencies and sounds with few spectral components (such as sine waves). Sounds with close frequencies can seem to come from the same source, which is why ambulance and police sirens are made up of multiple tones: we need to be able to localize the sound quickly, and alternating the frequencies helps us locate the source more effectively.
Figure 7.1
Human sound localization.
Figure 7.2
Time difference and level difference between our two ears.
Because we have two ears, there are differences in how the sound signal reaches those two ears, which we use for localization (figure 7.2). The interaural level difference (ILD) refers to the fact that a sound's amplitude (sound pressure level) is slightly reduced at the farther ear. If the sound is directly in front of or directly behind the listener, the cue will be the same in each ear, making it harder to localize the sound. The interaural time difference (ITD) refers to the fact that sounds take a little longer to reach the farther ear.
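To put rough numbers on the time cue, here is a small sketch using Woodworth's classic spherical-head approximation of the ITD. The head radius is an assumed average, and the ILD is strongly frequency dependent, so only the time difference is computed here.

```python
import numpy as np

SPEED_OF_SOUND = 343.0   # m/s
HEAD_RADIUS = 0.0875     # m, an assumed average adult head radius

def interaural_time_difference(azimuth_deg):
    """Woodworth's spherical-head approximation of ITD, in seconds.

    0 degrees = straight ahead; 90 degrees = directly to one side.
    """
    theta = np.radians(azimuth_deg)
    return (HEAD_RADIUS / SPEED_OF_SOUND) * (theta + np.sin(theta))

# A source at 90 degrees gives roughly 0.65 ms of extra delay to the far ear;
# a source straight ahead gives 0 ms, which is part of why front/back
# confusions happen: both directions produce the same (zero) time difference.
print(interaural_time_difference(90))   # ~0.00066 s
print(interaural_time_difference(0))    # 0.0 s
```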
At times, sounds can have the same interaural time difference and the same interaural level difference; this can result in what is called the cone of confusion. For instance, in figure 7.3, two sounds, B and D, occur at the same elevation and distance but at different azimuths: they will reach the ears at the same time. Likewise, sounds A and C are at the same azimuth and distance but at different elevations: they, too, will reach the ears at the same time.
The head-related transfer function, or HRTF, describes how our head gets in the way of some sounds reaching the farther ear and how the pinnae alter the intensity of the frequencies we hear. Perhaps head-related transform function would have been a more accurate name, since the term refers to how the sound is transformed by our unique bodies. We can localize high-frequency sounds more easily than low-frequency sounds, because high frequencies are shadowed (filtered out) by the head, creating a clear difference between the two ears. Low-frequency sounds are harder to localize because the head casts no acoustic shadow at those wavelengths. These spectral cues are also a key component in how we determine sound location.
Figure 7.3
The cone of confusion.
Figure 7.4
Spectral cues altered by a head shadow.
Other aspects of localizing a sound source also come into play, such as the initial time delay between a sound and early reflections, the amount of direct versus reverberant sound, and auditory motion parallax (when listeners move, sounds from sources close by appear to move more quickly than sounds from farther away—see Yost 2018), but these are less commonly used in artificially simulating spatialization.
Exercise 7.1 Where Does Sound Localization Fall Apart? (Partner Exercise)
An easy way to demonstrate where localization tends to weaken is to sit, blindfolded or with eyes closed, and have someone else snap their fingers in a variety of places around your head: is there a place where you can’t figure out where the sound is coming from? Did your partner indicate where you went wrong?
Exercise 7.2 360 Degrees of Sound
Stand in one spot and point a directional microphone at the space. Slowly, over the course of several minutes, turn around on the spot, so you are recording 360 degrees of sound. Note your observations when you play back the file.
Binaural audio takes into account the main differences (in level, time, and spectral properties) between the two auditory inputs that reach our brain. These differences can't really be reproduced over stereo or surround speakers because of what's called cross-talk: what is in each speaker will reach both of our ears. If we pan a sound hard left into the left speaker, our right ear will still hear the sound, because it's traveling through the air and can't be isolated to one ear.
When we use headphones, we can isolate the sound and eliminate cross-talk. Binaural audio, then, relies on the listener using headphones, which is why it hasn't been used a lot in media, although you'll find that at least one film (e.g., Bad Boy Bubby), a handful of music albums (e.g., Pearl Jam's Binaural), and an increasing number of video games (e.g., Papa Sangre) have made use of binaural audio.
Sound can be recorded binaurally using two microphones, but this requires a head to block some frequencies of the sound, to provide the distance between the two microphones needed for the level and time differences, and, ideally, pinnae to reflect the sound. For that reason, when recording binaural audio, we can use dummy heads, which are built to mimic a human head, or place tiny microphones in our own ears (some people secure microphones to the sides of a hat, but then they will miss the spectral changes created by the pinnae). More commonly, it's easier to create binaural audio in post-production by positioning the sound using binaural plugins for audio software. These plugins use a generic HRTF, an averaged-out sense of how people hear (perhaps most commonly, the datasets from MIT Media Lab's KEMAR measurements or IRCAM's Listen HRTF Database). That generic aspect means it's not quite as accurate as our own unique pair of ears, but it is still more accurate than stereo. I could find no native plugins for binaural audio for Audacity, but you can find some VST ("Virtual Studio Technology") plugins that you can install into Audacity that will enable you to explore binaural audio. The microphone and headphone manufacturer Sennheiser has released a free binaural VST plugin called Ambeo Orbit, and although I couldn't get it to work in Audacity, it worked fine in other software.
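As a rough illustration of what such plugins do internally, here is a sketch, assuming SciPy and a pair of head-related impulse response (HRIR) files for one direction. The filenames are hypothetical placeholders, and a real binaural renderer interpolates between many measured directions and handles sample-rate matching properly.

```python
import numpy as np
from scipy.io import wavfile
from scipy.signal import fftconvolve

# Hypothetical filenames: a left/right HRIR pair for one direction, e.g.
# exported from the MIT KEMAR dataset at 30 degrees azimuth. Both files are
# assumed to have the same length and sample rate as the source.
fs_h, hrir_left = wavfile.read("hrir_az030_left.wav")
fs_h, hrir_right = wavfile.read("hrir_az030_right.wav")
fs_s, source = wavfile.read("mono_source.wav")        # a mono sound to position

source = source.astype(np.float64)
left = fftconvolve(source, hrir_left.astype(np.float64))    # filter for the left ear
right = fftconvolve(source, hrir_right.astype(np.float64))  # filter for the right ear

binaural = np.stack([left, right], axis=1)
binaural /= np.max(np.abs(binaural))                   # normalize to avoid clipping
wavfile.write("binaural_out.wav", fs_s, (binaural * 32767).astype(np.int16))
```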
Figure 7.5
We hear each speaker with both ears, leading to cross-talk.
Exercise 7.3 Listen to a Binaural Recording
Check out some binaural mixes from music, film, or games. What do you notice about the difference in localization versus stereo?
Figure 7.6
Ambeo Orbit (running in Adobe Audition).
Exercise 7.4 Binaural Recording
If you don’t have binaural mics, tape a mic to each side of your hat by your ears, and head out on a sound walk, recording as you go. Listen to the file when you get back. What do you notice is the difference between this recording and others you’ve taken on your sound walk? Did the binaural aspect work even without the pinnae being involved?
Exercise 7.5 Binaural Panning
If you’ve got a binaural plugin for your software, try to recreate a scene from your sound walks using binaural panning.
While it may be easy to think we can drag-and-drop our sounds into position for a stereo or surround mix, the potential use of headphones by our listener presents a challenge for most mixers because the lack of cross-talk leads to what’s called in-head localization—the feeling that the sound is coming from inside our head. As we saw above, the front speakers in surround sound, or the two speakers in stereo sound, are optimally positioned to create an equilateral triangle, where the head is located at the sweet spot of an even mix from right and left channels. In factory mixer settings, for instance, the stereo imaging is usually set by default to this 60° angle.
Almost all headphones are designed to transmit one stereo channel to each ear exclusively (that is, the left channel transmits to the left earphone), so it is possible for one ear to hear an entirely different sound than the other. Sounds panned hard into one channel for speakers can therefore sound unnatural when translated to headphones, since in the natural world we do not normally hear a sound with just one ear. Some cross-feed software plugins, which bleed some of the left and right channels into each other, simulate the loudspeaker experience for audio mixed for headphones. However, these plugins can still result in a not entirely natural feel.
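A minimal sketch of the cross-feed idea, assuming NumPy and SciPy: a delayed, low-passed, attenuated copy of each channel is bled into the other, loosely mimicking the extra travel time and head shadowing of the far ear. The delay, cutoff, and level values are illustrative only and not taken from any particular plugin.

```python
import numpy as np
from scipy.signal import butter, sosfilt

def simple_crossfeed(stereo, fs, delay_ms=0.3, cutoff_hz=700.0, level=0.3):
    """Bleed a delayed, low-passed copy of each channel into the other channel."""
    left, right = stereo[:, 0], stereo[:, 1]
    delay = int(fs * delay_ms / 1000.0)
    sos = butter(2, cutoff_hz, btype="low", fs=fs, output="sos")

    def bleed(ch):
        # Delay by padding with zeros, then low-pass to mimic head shadowing.
        delayed = np.concatenate([np.zeros(delay), ch])[: len(ch)]
        return level * sosfilt(sos, delayed)

    out = np.stack([left + bleed(right), right + bleed(left)], axis=1)
    return out / np.max(np.abs(out))   # normalize to avoid clipping
```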
When we eliminate cross-talk, the optimal positioning of sounds becomes much wider, and is typically tripled for headphone mixes (up to 180°). The reason for that wide spread is that in the equilateral triangle mix, the sounds appear to emanate from inside our heads, rather than from around us, and generally this is considered bad by mixers—we want to feel like we’re standing on stage with the band, not like the band is playing inside our heads.
Exercise 7.6 In-Head Localization
If you have access to stereo panning tools, try to place sounds intentionally in and outside the in-head localization position. How does it change the way that you perceive sound? Reflect on where you might want to use in-head localization on purpose.
Because most people don’t use headphones to listen to film, most film is mixed for some form of surround sound. Surround sound can mean different speaker setups, and there are different standards based on what setup you are using. The basics of surround sound is to have more than the two standard stereo speakers, but how many additional speakers is fluctuating. In the 1970s the standard was quadrophonic, or four-speaker sound. Since the 1990s the standard for home theater has been 5.1, where “5” means 5 main speakers and the .1 means a subwoofer. Some new formats introduce height speakers as well, since in standard surround the speakers are all at the same azimuth. These can be called 9.1.2, for instance, where the “.2” at the end indicates the height speakers. You may come across 7.1, 9.1.4, 22.2, and more.
The basic setup keeps the two front speakers (front left and front right, or FL and FR) where they are located in the stereo position, each at 30° from the listener's center line, making the familiar 60° equilateral triangle (figure 7.7). The center (C) is directly ahead, at 0°, and the two rears, or surrounds (SL and SR), are each placed at about 110° from center (anywhere between 100° and 120° is within spec), roughly twice the angle of the front speakers. This is sometimes referred to as the ITU standard (it's ITU-R BS.775, to get technical). It's not the only standard, however: the National Academy of Recording Arts and Sciences (NARAS) recommends the rear speakers be placed between 110° and 150°. For home theater, Dolby and THX recommend the rear speakers be between 90° and 110°. Because the subwoofer's frequencies aren't localized well, you can place your sub pretty much anywhere.
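To illustrate how a panner might spread a single sound across this layout, here is a simplified constant-power, pairwise panning sketch over the ITU-style angles described above. Real surround panners (VBAP, Atmos panners, DAW surround panners) are considerably more sophisticated, so treat this as a toy model.

```python
import numpy as np

# ITU-R BS.775-style 5.1 layout (subwoofer omitted: it isn't localized).
SPEAKERS = {"L": -30.0, "C": 0.0, "R": 30.0, "Rs": 110.0, "Ls": -110.0}

def pan_5_1(azimuth_deg):
    """Constant-power pairwise panning: per-speaker gains for one mono source."""
    names = sorted(SPEAKERS, key=SPEAKERS.get)          # by angle: Ls, L, C, R, Rs
    angles = [SPEAKERS[n] for n in names]
    az = (azimuth_deg + 180.0) % 360.0 - 180.0          # wrap to [-180, 180)

    gains = {n: 0.0 for n in names}
    for i in range(len(names)):
        a1, a2 = angles[i], angles[(i + 1) % len(names)]
        span = (a2 - a1) % 360.0                        # handles the gap behind the listener
        offset = (az - a1) % 360.0
        if offset <= span:                              # the source falls between this pair
            f = offset / span
            gains[names[i]] = np.cos(f * np.pi / 2)     # constant-power crossfade
            gains[names[(i + 1) % len(names)]] = np.sin(f * np.pi / 2)
            break
    return gains

print(pan_5_1(15))    # mostly shared between C and R
print(pan_5_1(180))   # split evenly between the two surrounds
```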
There are three major competing formats for surround sound: Dolby, DTS, and THX. Each of these has its own unique recommended placement of speakers, and if you’ve got a home stereo system with separate components (rather than an all-in-one or Bluetooth setup), you’ll probably find options on your receiver to use Dolby, DTS, and/or THX. Each compresses audio differently, and each uses different means to decide which channel to send information to and which to apply equalization to. You may come across Dolby Digital, Dolby TrueHD, Dolby Atmos, DTS:X, DTS HD, THX Ultra2, and more. You will need to look up the specifications for setting up your equipment.
Figure 7.7
Standard surround sound setup.
The differences are a matter not just of speaker location but also of how the systems play back sound. We know, for instance, that our perception of bass drops off at lower listening levels. In theaters, the volume is high, so we get plenty of bass response. At home, you're probably not going to crank up the volume (at least, not if you live in an apartment), so you're going to miss out on some of that bass. By EQing the mix during decoding, we can reintroduce the illusion of that amplitude, as we have seen. In other words, how the setup uses EQ can alter your experience of the sound.
Exercise 7.7 Test Surround Modes
If you have the equipment, grab a great DVD or Blu-ray (yes, a real physical DVD, not a stream—ideally you can get your hands on a demo disc on eBay, designed to demonstrate the capabilities of the system) and try out different surround modes. Can you hear a difference? What are the most obvious differences you hear?
Exercise 7.8 Surround Mix in Audacity
It is possible to mix in surround in Audacity. You may need to enable this feature in Audacity's preferences first. With three or more separate tracks, select File > Export > Export as WAV, then hit Save. The Advanced Mixing Options window should pop up. This lets you map the tracks to output channels (figure 7.8). You can click on a track name followed by a channel number to alter that track's mapping. Mix a bunch of files in surround sound: how does it change your listening experience?
Figure 7.8
Advanced Mixing Options in Audacity.
Named by film sound designer Ben Burtt, the exit-sign effect is a phenomenon related to discrete sounds placed in the rear surround loudspeakers. With film or other audiovisual media, pans of fast-moving sounds beyond the left or right edges of the screen lead the viewer's eye to follow the sound through the acousmatic (off-screen) space toward the exit signs at the sides of the cinema. For this reason, when we have a viewer sitting and facing a screen, we don't tend to put spot sounds in the rear speakers (unless we want to shock the audience by making it appear that something is suddenly behind them). With audio-only media, however, since it doesn't matter where the listener is looking, we can use spot sounds to our heart's content.
Unlike surround-sound formats, which typically put all the speakers on one horizontal plane, ambisonics is a format that has been around for a long time and is designed for a full sphere of sound. Rather than encoding information for each speaker channel, ambisonics creates a virtual sound field, called B-format, that is decoded at the consumer end based on the individual speaker setup. The creator, then, doesn't worry about assigning sounds to specific speakers but rather assigns them to a position on the sphere. The actual speaker mapping is chosen by the decoder and depends on the listener's system.
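Here is a minimal sketch of first-order ambisonic encoding into B-format (the W, X, Y, Z channels), using the traditional FuMa weighting. Note that the newer AmbiX convention orders and normalizes the channels differently, and higher-order ambisonics adds more channels for greater spatial resolution.

```python
import numpy as np

def encode_first_order_bformat(signal, azimuth_deg, elevation_deg):
    """Encode a mono signal into first-order B-format (W, X, Y, Z).

    Uses the traditional FuMa convention (W scaled by 1/sqrt(2)).
    """
    theta = np.radians(azimuth_deg)    # horizontal angle
    phi = np.radians(elevation_deg)    # vertical angle
    w = signal * (1.0 / np.sqrt(2.0))  # omnidirectional component
    x = signal * np.cos(theta) * np.cos(phi)
    y = signal * np.sin(theta) * np.cos(phi)
    z = signal * np.sin(phi)
    return np.stack([w, x, y, z], axis=-1)

# The decoder, not the encoder, decides how these four channels map onto the
# listener's actual speakers (or onto binaural headphones).
```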
Dolby Atmos takes a similar approach, allowing creators to position sounds in a 3D spherical space without assigning speaker channels; the sound is then decoded on the consumer side depending on the individual setup. Other similar formats have been emerging as spatial audio gains interest, including DTS:X and Auro-3D. These formats are often referred to as object-based audio, since they are independent of speaker channels and instead position sound objects in a virtual space. It's a bit like visually placing an object in a room relative to us standing in the middle, and then placing that template into any room, so that the sound is always in the same relative position to the listener, no matter how large the speakers or where they are in the room. In this way, the creator need only position the sounds once, and they are appropriately decoded on the user's device, whether that is a surround sound setup or a pair of headphones. The advantage for the creator, of course, is not having to create multiple deliverables.
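As a conceptual sketch only (the real Atmos and DTS:X metadata formats are proprietary and far richer), an audio object can be thought of as a mono signal plus position metadata that the playback device renders to whatever speakers it actually has.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class AudioObject:
    """One 'object': a mono signal plus position metadata, rendered at playback.

    Illustrative only; real object-based formats define their own metadata.
    """
    samples: np.ndarray       # mono audio
    azimuth_deg: float        # horizontal angle relative to the listener
    elevation_deg: float      # height angle
    distance_m: float         # distance from the listener

# The same AudioObject can be rendered to 5.1, 9.1.2, or headphones; only the
# renderer on the playback device needs to know the speaker layout.
```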
New tools like object-based audio often mean new challenges. If you think about film ambiences, we used to create stereo ambiences and then just kind of leave them in the surround speakers. Now, we have to think about spatializing those sounds in three dimensions. A helicopter rising overhead could sound like it’s actually flying right over our heads. But we still have to be aware of the exit-sign effect and not make spatialized spot sounds so obvious that they pull a person’s attention from the screen. Of course, if we’re not designing sound for film, then the challenges of object-based sounds don’t become challenges so much as interesting creative opportunities.
To work with Atmos, DTS:X, or Auro-3D you'll need specialized tools, but ambisonics is not proprietary, and there are many tools you can find to explore it. Most of these are open source or designed by researchers, so they may not be as user-friendly as some of the proprietary tools. Two free options designed by major companies that do sound for VR and games are Facebook's Spatial Workstation and Audiokinetic's Wwise. It is also possible to encode an ambisonic format in Audacity using the Advanced Mixing Options.
There are many spatial sound ("3D") algorithms and approaches, and more advanced software is required to use spatial audio, so we won't go into too much depth here. Rather than referring to the position of the speakers or the channel assignment, spatial audio usually refers to the processing done on the sound source itself before it gets to the speakers or headphones, so it is usually used in virtual environments like VR and games. The goal of spatial audio is to create a three-dimensional position for the sounds. Note that with surround sound the speakers are usually on one plane (that is, at the same height, although newer approaches to surround add a second set of height speakers, as described above), and most sounds happen in the front speakers. With spatial audio, we have a 360° spherical positioning of sounds in the software. How that positioning gets decoded into surround setups or headphones depends on the decoder used, as we saw above.
When watching movies, we are in what is called a head-locked position: our eyes are pointed forward and we aren't moving around much. However, in virtual reality (VR) applications, we can turn our heads, and our field of view changes with the position of our head. For this reason, the audio also needs to change. If there is a bird chirping on our left in VR and we turn to look at it, the bird's chirp should now be in front of us. Head-tracked audio, which follows where our head is pointing, can therefore be a big element of realism in virtual reality spatial audio. Real-time, head-tracked spatial audio can be quite computationally expensive (that is, it needs powerful processors), so we often have to compromise on the accuracy of the audio presentation in VR today, but this is quickly changing with GPU-based processing.
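A toy sketch of the core of head tracking, reduced to a single axis: the renderer subtracts the head's yaw from the source's world position on every update, so the source stays put in the world while the listener turns. The sign convention here (negative angles = to the left) is an assumption made for the example; a real head-tracked renderer uses full 3D rotations (yaw, pitch, roll) every audio block.

```python
def head_relative_azimuth(source_azimuth_deg, head_yaw_deg):
    """Rotate a world-positioned source into head-relative coordinates (one axis)."""
    relative = source_azimuth_deg - head_yaw_deg
    return (relative + 180.0) % 360.0 - 180.0   # wrap to [-180, 180)

# Bird at 90 degrees to our left; we turn our head 90 degrees toward it:
# the rendered azimuth becomes 0, so the chirp now sits in front of us.
print(head_relative_azimuth(-90.0, -90.0))   # 0.0
```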
An important element of spatial audio rendering is understanding how sound waves propagate: how they travel through space, how they reflect off objects or are absorbed by them, and how they decay over time. We looked at some basic propagation earlier (chapter 4), but with spatial audio we have to consider even more aspects of propagation. The wave's path through space, including all of its reflections and absorptions, must be simulated by the software for a realistic presentation.
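A minimal sketch of the simplest propagation effects on the direct path, distance attenuation and travel-time delay, assuming NumPy. Reflections, surface absorption, and the air's damping of high frequencies would be layered on top of this in a real engine.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s at room temperature

def propagate_direct_path(signal, distance_m, fs, reference_m=1.0):
    """Apply inverse-distance attenuation and travel-time delay to the direct path."""
    gain = reference_m / max(distance_m, reference_m)            # 1/r level falloff
    delay_samples = int(round(fs * distance_m / SPEED_OF_SOUND)) # travel time as samples
    return np.concatenate([np.zeros(delay_samples), signal * gain])
```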
Diffraction is the bending of sound around obstacles and through openings. For instance, a sound that passes through a small opening can re-radiate as if it were a new sound source, depending on its frequency.
With occlusion, the direct path of the sound is obstructed, and the reverberations/reflections are muffled as well. A common occluder is, for instance, a wall between the sound source and the listener.
With exclusion, the direct path is clear, but the reflections are obstructed in some way: a door in a wall, for instance, will let the direct sound through, but many of the reflections may be blocked by the wall.
With obstructions, the direct path is obstructed in some way, but the reflections can get around the obstructing object. Examples might be a large tree, a column, a car, and so on.
Figure 7.9
Sound diffraction through a small opening.
Figure 7.10
Occlusion effect.
Sound engines designed for spatialization must take into account these diffraction behaviors of sound as it propagates in the virtual space, if the goal is to create a space that is true to life.
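A crude sketch of how an engine might treat these cases, assuming SciPy: occlusion filters and attenuates both the direct path and the reflections, whereas obstruction would filter only the direct path and exclusion only the reflections. The cutoff and gain values are arbitrary illustrative numbers; middleware such as Wwise and FMOD exposes comparable per-path controls for designers to tune.

```python
import numpy as np
from scipy.signal import butter, sosfilt

def apply_occlusion(direct, reflections, fs, occluded=True,
                    cutoff_hz=800.0, direct_drop=0.25, reflection_drop=0.5):
    """Crude occlusion: muffle and attenuate both the direct sound and its reflections.

    Assumes `direct` and `reflections` are time-aligned arrays of equal length.
    Obstruction would filter only `direct`; exclusion only `reflections`.
    """
    if not occluded:
        return direct + reflections
    sos = butter(2, cutoff_hz, btype="low", fs=fs, output="sos")
    muffled_direct = direct_drop * sosfilt(sos, direct)
    muffled_reflections = reflection_drop * sosfilt(sos, reflections)
    return muffled_direct + muffled_reflections
```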
Exercise 7.9 Real-Life Propagation
Use your home or school to set up some real-life propagation experiments. Play a broadband (many frequencies) sound from a speaker behind a door, then open the door. Duck behind some furniture. Note how the sound changes in the space. Record the sound and run a spectral analyzer. What frequencies drop out as the location of the source changes?
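If you want to compare your recordings numerically rather than by eye, here is a small sketch using SciPy's Welch spectrum estimate. The filenames are hypothetical placeholders for your own door-open and door-closed takes, which are assumed to share a sample rate.

```python
import numpy as np
from scipy.io import wavfile
from scipy.signal import welch

def averaged_spectrum_db(path):
    """Average power spectrum (in dB) of a WAV recording, via Welch's method."""
    fs, x = wavfile.read(path)
    if x.ndim > 1:
        x = x.mean(axis=1)                 # fold stereo to mono
    freqs, psd = welch(x.astype(np.float64), fs=fs, nperseg=4096)
    return freqs, 10 * np.log10(psd + 1e-12)

# Hypothetical filenames: compare the door-open and door-closed takes to see
# which frequency bands the closed door (occlusion) removes.
f_open, open_db = averaged_spectrum_db("door_open.wav")
f_closed, closed_db = averaged_spectrum_db("door_closed.wav")
high_band = f_open > 2000                  # high frequencies usually suffer most
print("Average drop above 2 kHz:",
      np.mean(open_db[high_band] - closed_db[high_band]), "dB")
```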
Figure 7.11
Exclusion effect.
Figure 7.12
Obstruction effect.
Surround and spatial sound are not just technical tools but also creative tools, as Stephan Schütze explains:
I actually think it’s a lot more important than unfortunately many of us actually have the opportunity to make use of. And I’ll give you a couple of examples. I’ve always played games, and I love games, and I’m probably addicted to them and that’s why it’s good that I work in the industry. And so in a lot of the studios that I worked in, we would play the games at lunch time or after work, etc. And because I had my own little space and because I was mixing these in surround I had very, very simple surround setup initially, but we would play the games that everybody else was playing, and at one point we were playing Counter-Strike. And there was a particular level where there was a particular little nook where you could hide, and I would hide there occasionally, and one day somebody very, very carefully crept up behind me, turned the corner, and I shot them. They were like, “How did you know I was there?” I said, “I heard you coming.” He said, “But I was really, really quiet.” “Yeah, but I’ve got rear speakers.” And as a sound designer, I’m very, very aware that the sound of you walking through the corrugated iron tunnel to get to me has a very distinct sound. So as soon as I hear a corrugated iron tunnel anywhere in the level, I know there’s somebody in the corridor behind me. But the sound, the surround sound pinpointed that exactly.
I do find it really interesting that there are people playing really competitive, high-end competitive first person shooter games with stereo headphones, and I’m like, “What? Your critical sense for threat analysis and warnings is your hearing.” They say your ears don’t blink. They don’t go to sleep, either, really. And so surround sound, from a tactical point of view, is brilliant, but from an environmental point of view, if we want to simulate an environment, you’ve gone out to a beautiful jungle or a beautiful forest somewhere, or even a desert, with insects, and it’s like you’ve got your head underwater. In the same way that water completely envelops you, sound completely envelops you and you listen to a really amazing movie in surround sound. You take those speakers away and all of a sudden it’s like looking at a flat painting. If you take so much out of the experience and for games, it’s experiential. It’s narrative support. It’s tactical opportunities, there’s so many things in there, and from the point of view of creating an audio environment. It’s like if I was an artist with, somebody sort of said, “OK, here’s your pencil. Draw this beautiful picture.” And I’m like, “OK, cool, I’m ready to colour it in, can I have the colour pencils?” “Like, no.” I was feeling like I’m missing all the colour when I don’t have the ability to do things in surround. (quoted in Collins 2016, 198)
Wright’s article is an academic study of some history of Dolby Atmos and surround sound, and then some approaches used by Hollywood film mixers to surround sound and Atmos in particular.
Although quite technical, some of the best information about sound spatialization techniques can be found in the documentation for software tools such as the game audio middleware Wwise and FMOD.
You’ll find binaural audio in an increasing number of video games; for instance, Papa Sangre (2010), Hellblade (2017), and Sniper Elite 4 (2017). Pearl Jam released an album recorded in binaural audio (Binaural, 2000), as did Can (Flow Motion, 1976). You can also find some classical music recorded in binaural sound. If you have the playback technology, you can also switch between listening to media in standard stereo and in surround sound, so you can get a sense of what the surround speakers bring to an experience. Some music is also mixed in surround formats. For a while in the 1970s, quadrophonic (four-speaker, or 4.0 surround) music was popular with many artists (you’ll have to find a copy of the quadrophonic version, not the stereo version, to listen to it). In the late 1990s and early 2000s, 5.1 surround music albums were also released for a short time, as “super audio CDs,” or SACDs. Now that most people have shifted to streamed music, you can pick up some SACDs cheaply, along with the used hardware required to play it.