Chapter 9. Audio and Music Processing

Deep in the back of my mind is an unrealized sound
Every feeling I get from the street says it soon could be found
When I hear the cold lies of the pusher, I know it exists
It’s confirmed in the eyes of the kids, emphasized with their fists . . .

The music must change
For we’re chewing a bone
We soared like the sparrow hawk flied
Then we dropped like a stone
Like the tide and the waves
Growing slowly in range
Crushing mountains as old as the Earth
So the music must change

The Who, “Music Must Change”

Audio and music can be approached in three different ways with Mathematica: (1) as traditional musical notes with associated pitch names and other specifications, such as duration, timbre, loudness, etc.; (2) as abstract mathematical waveforms that represent vibrating systems; and (3) as digitally represented sound—just think of .wav and .aiff files. If nothing else, this chapter should hint at the ease with which Mathematica can be put in the service of the arts. Let’s make some music!

Mathematica allows you to approach music and sound in at least three different ways. You can talk to Mathematica about musical notes such as "C" or "Fsharp", and you can directly specify other traditional concepts, such as timbre and loudness, with Mathematica’s Sound and SoundNote functions. You can ask Mathematica to play analog waveforms with Play. And you can ask Mathematica to interpret digital sound samples with ListPlay.

Mathematica has implemented 60 percussion instruments as specified in the General MIDI (musical instrument digital interface) specification.

Here the percussion instruments are listed in alphabetical order. Some of the names are not obvious. For example, there is no plain triangle or conga; instead there are "MuteTriangle", "OpenTriangle", "HighCongaMute", "HighCongaOpen", and "LowConga".

In[724]:= allPerc = {"BassDrum", "BassDrum2", "BellTree", "Cabasa", "Castanets",
               "ChineseCymbal", "Clap", "Claves", "Cowbell", "CrashCymbal",
               "CrashCymbal2", "ElectricSnare", "GuiroLong", "GuiroShort", "HighAgogo",
               "HighBongo", "HighCongaMute", "HighCongaOpen", "HighFloorTom",
               "HighTimbale", "HighTom", "HighWoodblock", "HiHatClosed", "HiHatOpen",
               "HiHatPedal", "JingleBell", "LowAgogo", "LowBongo", "LowConga",
               "LowFloorTom", "LowTimbale", "LowTom", "LowWoodblock", "Maracas",
               "MetronomeBell", "MetronomeClick", "MidTom", "MidTom2", "MuteCuica",
               "MuteSurdo", "MuteTriangle", "OpenCuica", "OpenSurdo", "OpenTriangle",
               "RideBell", "RideCymbal", "RideCymbal2", "ScratchPull", "ScratchPush",
               "Shaker", "SideStick", "Slap", "Snare", "SplashCymbal", "SquareClick",
               "Sticks", "Tambourine", "Vibraslap", "WhistleLong", "WhistleShort"};

Here’s what each instrument sounds like. The instrument name is fed into SoundNote where, more typically, a note specification would appear. In fact, in the General MIDI specification, each percussion instrument is represented as a single pitch in a "drum" patch. So, for example, "BassDrum" is C0, "BassDrum2" is C#0, "Snare" is D0, and so on. Therefore, it makes sense for Mathematica to treat these instruments as notes, not as "instruments" as was done above for "Piano", "GuitarMuted", and "GuitarOverdriven".

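A minimal sketch of such an audition (the 0.2-second duration per hit is an arbitrary choice):

    (* play all 60 percussion "notes" in sequence *)
    Sound[SoundNote[#, 0.2] & /@ allPerc]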

Here’s a measure’s worth of closed hi-hat:

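One way to write this, assuming a 4/4 measure subdivided into eight eighth notes of 0.25 seconds each:

    Sound[Table[SoundNote["HiHatClosed", 0.25], {8}]]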

And here’s something with a little more pizzazz. Both the choice of instrument and volume are randomized.

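A sketch along those lines, which also defines the groove pattern used in the discussion that follows. Each inner list of groove is a set of drums struck simultaneously; the RandomChoice snare substitution and the SoundVolume range are illustrative assumptions:

    groove = {{"HiHatClosed", "BassDrum", None}, {"HiHatClosed", None, None},
       {"HiHatClosed", None, "Snare"}, {"HiHatClosed", "BassDrum", None},
       {"HiHatClosed", "BassDrum", None}, {"HiHatClosed", None, None},
       {"HiHatClosed", None, "Snare"}, {"HiHatClosed", None, None}};
    Sound[
     SoundNote[# /. "Snare" :> RandomChoice[{"Snare", "ElectricSnare"}], 0.25,
        SoundVolume -> RandomReal[{0.4, 1}]] & /@
      Flatten[Table[groove, {4}], 1]]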

Getting the curly braces just right in Mathematica’s syntax can be a little frustrating. Without Flatten in the example above, SoundNote is confused by the list-within-list result of the Table function: each SoundNote wraps an entire copy of the groove rather than a single hit. Consequently, you get no sound.

In[734]:= Sound[SoundNote[#, 0.25] & /@ Table[groove, {4}]]

Out[734]= Sound[
            {SoundNote[{{"HiHatClosed", "BassDrum", None}, {"HiHatClosed", None, None},
               {"HiHatClosed", None, "Snare"}, {"HiHatClosed", "BassDrum", None},
               {"HiHatClosed", "BassDrum", None}, {"HiHatClosed", None, None},
               {"HiHatClosed", None, "Snare"}, {"HiHatClosed", None, None}}, 0.25`],
             SoundNote[{{"HiHatClosed", "BassDrum", None}, {"HiHatClosed", None, None},
               {"HiHatClosed", None, "Snare"}, {"HiHatClosed", "BassDrum", None},
               {"HiHatClosed", "BassDrum", None}, {"HiHatClosed", None, None},
               {"HiHatClosed", None, "Snare"}, {"HiHatClosed", None, None}}, 0.25`],
             SoundNote[{{"HiHatClosed", "BassDrum", None}, {"HiHatClosed", None, None},
               {"HiHatClosed", None, "Snare"}, {"HiHatClosed", "BassDrum", None},
               {"HiHatClosed", "BassDrum", None}, {"HiHatClosed", None, None},
               {"HiHatClosed", None, "Snare"}, {"HiHatClosed", None, None}}, 0.25`],
             SoundNote[{{"HiHatClosed", "BassDrum", None}, {"HiHatClosed", None, None},
               {"HiHatClosed", None, "Snare"}, {"HiHatClosed", "BassDrum", None},
               {"HiHatClosed", "BassDrum", None}, {"HiHatClosed", None, None},
               {"HiHatClosed", None, "Snare"}, {"HiHatClosed", None, None}}, 0.25`]}]

Conversely, with a complete Flatten wrapped around the Table function, each hit is treated individually; we lose the chordal quality of the drums hitting simultaneously. Go back and notice that the correct idea is to remove just one layer of braces by using Flatten[..., 1].
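For reference, the corrected input would look something like this; each inner triple now reaches SoundNote intact and sounds as a chord:

    Sound[SoundNote[#, 0.25] & /@ Flatten[Table[groove, {4}], 1]]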

Discussion

Modern Western music uses tempered tuning, which is a slight compromise to the vibrations of the natural world, or at least the perfection of the natural world as the Greeks described it 3,000 years ago. The ancient Greeks (and even earlier, the Babylonians) noticed that when objects vibrate in simple, integer ratios to each other, the resulting sound is pleasant. The simple ratio of 2:1 is so pleasant that we perceive it as an equivalence. When two notes vibrate in a ratio of 2:1, we say they have the same pitch but are in different octaves. The history of music has been the history of partitioning the octave.

The first obvious division of the octave is created by the next simplest ratio, 3:1. Consider the following schematic of a vibrating string. The only requirement on the string is that its endpoints remain fixed. The string can vibrate in many different modes, as shown below. Each mode has a characteristic number of still points, called "nodes," that appear symmetrically along the length of the string. Each mode also has a characteristic rate of vibration, which is a simple integer multiple of the lowest, fundamental frequency. Notice that three out of the first four harmonics are octave equivalences. The third harmonic, situated between the second and fourth harmonics, has a ratio of 3:2 to the second harmonic and 3:4 to the fourth. These were the kinds of simple ratios that appealed to the Greeks.

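A few lines of Mathematica can reproduce such a schematic; here is a sketch of the first four modes of a string with fixed endpoints:

    (* harmonic n has n - 1 interior nodes and vibrates at n times the fundamental *)
    GraphicsColumn[
     Table[Plot[Sin[n Pi x], {x, 0, 1}, Ticks -> None,
       PlotLabel -> Row[{"harmonic ", n, ", frequency ratio ", n, ":1"}]], {n, 1, 4}]]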

Successive applications of the 3:2 ratio can be used to build the entire chromatic scale. After 12 applications of this ratio, every note of the modern chromatic scale has been visited once and we are returned to the starting pitch (sort of!).

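A sketch of that construction in code (octaveReduce, which folds each ratio back into a single octave, is a hypothetical helper name):

    (* twelve successive 3:2 steps, each folded back into the octave [1, 2) *)
    octaveReduce[r_] := r/2^Floor[Log2[r]];
    Sort[octaveReduce /@ NestList[3/2 # &, 1, 12]] // N

The first two entries of the sorted result, 1. and 1.01364, differ by the famous Pythagorean comma, 531441/524288; that small discrepancy is the "sort of!" above.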

There’s a problem: (3/2)^12 represents the C seven octaves above the starting C and should equal a C with a frequency ratio of 2^7 = 128, but (3/2)^12 equals 129.75. The equal temperament solution to this problem is to distribute this discrepancy equally over all the intervals. In other words, in equal temperament, every interval is made slightly, and equally, "out of tune." Johann Sebastian Bach composed a series of keyboard pieces in 1722 called "The Well-Tempered Clavier" to demonstrate that this compromise was basically imperceptible and had no negative impact on the beauty of the music.

Mathematically, equal temperament means that the frequency of each pitch should have the same ratio to its immediate lower neighbor’s frequency. Call this ratio α. Since a chromatic scale, which contains 12 pitches, takes you from some frequency to twice that frequency, it must be the case that α^12 = 2, so α = 2^(1/12). The ratio of a semitone in equal temperament is therefore about 1.0595.

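In Mathematica this is one line; keeping α exact lets later computations request as many digits as needed:

    α = 2^(1/12);
    N[α]   (* 1.05946 *)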

However, now that we have the octave in perfect shape, every other interval is slightly "wrong," or at least wrong according to the manner in which the Greeks tried to construct their intervals. For example, the equal-tempered fifth is slightly flat compared to the Pythagorean fifth of 3/2 = 1.5 (the musical interval of a fifth is composed of seven half-steps).

In[756]:= α^7 // N
Out[756]= 1.49831

In[757]:= N[α^7, 7]
Out[757]= 1.498307

Now that we’ve gone through the basics of tuning, how do you use Mathematica to explore alternate tunings?
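One way to begin exploring is to let your ears judge; a sketch using Play on raw sine waves (the A440 reference pitch and one-second duration are arbitrary choices):

    (* a just 3:2 fifth above A440, then the slightly flat equal-tempered fifth *)
    Play[Sin[2 Pi 440 t] + Sin[2 Pi 440 (3/2) t], {t, 0, 1}]
    Play[Sin[2 Pi 440 t] + Sin[2 Pi 440 2^(7/12) t], {t, 0, 1}]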

Mathematica imports many standard file formats. Both AIFF and WAV are in the list.

In[762]:= $ImportFormats
Out[762]= {3DS, ACO, AIFF, ApacheLog, AU, AVI, Base64, Binary, Bit, BMP, Byte, BYU,
           BZIP2, CDED, CDF, Character16, Character8, Complex128, Complex256,
           Complex64, CSV, CUR, DBF, DICOM, DIF, Directory, DXF, EDF, ExpressionML,
           FASTA, FITS, FLAC, GenBank, GeoTIFF, GIF, Graph6, GTOPO30, GZIP,
           HarwellBoeing, HDF, HDF5, HTML, ICO, Integer128, Integer16, Integer24,
           Integer32, Integer64, Integer8, JPEG, JPEG2000, JVX, LaTeX, List, LWO,
           MAT, MathML, MBOX, MDB, MGF, MMCIF, MOL, MOL2, MPS, MTP, MTX, MX, NB,
           NetCDF, NOFF, OBJ, ODS, OFF, Package, PBM, PCX, PDB, PDF, PGM, PLY, PNG,
           PNM, PPM, PXR, QuickTime, RawBitmap, Real128, Real32, Real64, RIB,
           RSS, RTF, SCT, SDF, SDTS, SDTSDEM, SHP, SMILES, SND, SP3, Sparse6, STL,
           String, SXC, Table, TAR, TerminatedString, Text, TGA, TIFF, TIGER,
           TSV, UnsignedInteger128, UnsignedInteger16, UnsignedInteger24,
           UnsignedInteger32, UnsignedInteger64, UnsignedInteger8, USGSDEM, UUE,
           VCF, WAV, Wave64, WDX, XBM, XHTML, XHTMLMathML, XLS, XML, XPORT, XYZ, ZIP}

Using the "Data" specification will save you the aggravation of decoding the syntax of the imported data. Don’t forget the semicolon, which prevents Mathematica from listing all the sample points. The easiest way to access a file is to type Import[], place your cursor between the empty brackets, choose File Path... from the Insert menu, and navigate in the dialog box to the file you want to open.

In[763]:= file = FileNameJoin[{NotebookDirectory[], "..", "data", "JCK_01.aif"}];
          data = Flatten@Import[file, "Data"];

You’ll need to know the sample rate and whether this file is mono or stereo, so do a second Import on the same file, but specify "Options".

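Something along these lines; the precise rules returned (sample rate, channel count, sample depth, and so on) depend on the file:

    Import[file, "Options"]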

If you simply want to play the file, specify "Sound" as the second parameter.

In[766]:=  snd = Import[file, "Sound"];

This returns a Sound object.

In[767]:=  snd // Head
Out[767]=  Sound

It can be played like so:

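One way is EmitSound; evaluating snd by itself also displays a playable sound object:

    EmitSound[snd]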

Typically you’ll start with a digitized signal. The sampling rate will determine the highest frequency that can be investigated. This highest frequency is called the Nyquist frequency and is always exactly one half the sampling rate. For this "Yes we can!" sample, which was digitized at 48 kHz, the highest frequency is 24 kHz. (It’s not coincidental that this frequency is slightly greater than the limits of human hearing.) Notice the plot is symmetric about the Nyquist frequency.

The number of sample points used in any analysis is also critical. Here exactly one second of audio, that is, 48,000 sample points, is being analyzed. The 48,000 points from the time domain yield 48,000 points in the frequency domain, but as you can see, the right side of the plot, between points 24,000 and 48,000, is just a mirror duplication of the points between 0 and 24,000. This is an artifact of the underlying mathematics (the Fourier transform of real-valued data is conjugate-symmetric), and there is no additional information in this half of the plot.

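A sketch of this analysis (assuming the clip is at least one second long and that data holds a mono signal):

    (* magnitude spectrum of the first second; note the mirror symmetry about bin 24,000 *)
    spectrum = Abs[Fourier[Take[data, 48000]]];
    ListLinePlot[spectrum, PlotRange -> All]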

Since this is speech, you can focus on the first 2,000 points, which correspond to frequencies 0 to 2,000 Hz. Later you’ll see that 2,000 points of a Fourier analysis doesn’t always mean frequencies 0 through 2,000 Hz. It does in this case because you started with 48,000 sample points in the time domain, which equals the sampling rate, creating a one-to-one relationship between data points and frequencies in the frequency domain. You can see that this speaker has four significant frequency resonances in his voice at approximately 150 Hz, 300 Hz, 490 Hz, and 700 Hz. These resonances are known as formants. Notice that the Ticks option customizes the labeling of the x-axis.

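A sketch; placing ticks at the four formant frequencies is one plausible choice:

    ListLinePlot[Take[spectrum, 2000], PlotRange -> All,
     Ticks -> {{150, 300, 490, 700}, Automatic}]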

Typically, when analyzing voice, one second is too long a sample. Just think how many syllables you utter in one second of normal speech. A much more appropriate length would be 1/10 or 1/20 or even 1/30 of a second. You can easily identify various phonemes of "yes we can" in the plot below: the "yeh" and "sss" of the "yes," the singular vowel sound of "we," and the hard "c" and "an" of "can."

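A time-domain plot of the one-second clip, scaled to seconds on the x-axis:

    ListLinePlot[Take[data, 48000], DataRange -> {0, 1}, PlotRange -> All]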

Here’s the "we," which is very homogeneous.

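A sketch; the exact position of the "we" within the clip is a guess you would read off the plot above:

    we = data[[19201 ;; 28800]];   (* 9,600 samples = 1/5 second *)
    ListLinePlot[we, PlotRange -> All]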

You’re now looking at 9,600 sample points (9,600/48,000 = 1/5 sec) in the time domain, so each point in the frequency domain represents 48,000/9,600 = 5 Hz. There’s a direct trade-off between using as few sample points as possible, to narrow the analysis to a single phoneme, and sampling enough points to achieve the desired precision in the frequency domain.

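The corresponding spectrum; at 5 Hz per bin, the first 400 bins cover 0 to 2,000 Hz:

    ListLinePlot[Take[Abs[Fourier[we]], 400], PlotRange -> All]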

Here, half as many points (4,800) sampled from the same region focuses our analysis in the time domain, but each point in the frequency domain now represents 10 Hz. Perhaps we’re losing some detail in the 150-200 Hz range, as well as the 300-350 Hz range?

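A sketch of the coarser analysis; at 10 Hz per bin, the first 200 bins cover the same 0 to 2,000 Hz:

    ListLinePlot[Take[Abs[Fourier[Take[we, 4800]]], 200], PlotRange -> All]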