This chapter explores the practical side of implementing audio-related AI features in your Swift apps. Taking a top-down approach, we explore two audio tasks and how to implement them using Swift and various AI tools.
Here are the two audio-related practical AI tasks that we explore in this chapter:
Making a computer understand human words is incredibly useful. You can take dictation or order a computer around.
Classification is going to crop up repeatedly in this book. We build a sound classifier app that can tell us what animal sound we’re listening to.
Images might be the trendy hot topic that triggered an explosion of deep learning, machine learning, and artificial intelligence (AI) features in products, and activity classification might be a novel way of using the myriad sensors in a modern iOS device, but sound is one of the real stars of practical applications of machine learning. Almost everyone has used sound at least once on their mobile device (like the music identification service, Shazam), even before AI was (yet again) a buzzword.
Speech recognition is one of those touchpoints of AI that most people have used at some point or another: whether it’s on a phone call with an irritating phone robot that’s trying to understand your voice, or actively using your computer with assistive and accessibility technologies, speech recognition has been pervasive a lot longer than many other forms of practical AI in consumer applications.
For the first of our two practical AI audio tasks, we’re going to explore how you can add speech-recognition capabilities to your Swift applications quickly, easily, and without any model training involved.
As with the face-detection task we covered in “Task: Face Detection”, speech recognition is a little easier than many of the others in this book in that the toolkit for performing speech recognition is largely provided by Apple (“Apple’s Other Frameworks”).
You could train a model that understands human speech for each of the languages you want to support in your project, but Apple has done the work for you, for lots and lots of languages. So why would you?
This task, therefore, takes a similar approach to “Task: Image Similarity”, in which we covered checking images for similarity, and “Task: Face Detection”, in which we looked at face detection.
For this task, we’re going to explore the practical side of speech recognition by doing the following:
Making an app that can recognize human speech and display it on screen
Building an app that allows us to listen to some speech and display it as text
Using Apple’s tools for doing this without training a model
Exploring the potential next steps for speech recognition
Speech recognition is absolutely everywhere. There’s not much more to say. It’s pervasive, widely understood, and doesn’t require much explanation to users. You can use it for everything from allowing the user to dictate text (although there are other, more appropriate ways to do that), to controlling an app with voice (again, there are other more appropriate ways to do that), to voice-driven features that revolve around understanding what the user is saying.
We’re going to build the Speech Recognizer app shown in Figure 5-1.
As we did in for many of the tasks in Chapter 4, we’re going to be using Apple’s newest user interface (UI) framework, SwiftUI, to build the app for exploring speech recognition.
The final form of the app we’re going to build for this task was shown earlier, in Figure 5-1, and consists of the following SwiftUI components:

A NavigationView in which to display the title of the app

Some Button components to start and stop the listening process for speech recognition

A Text component for the result of the speech recognition to be displayed (and for instructions prior to the app being used)
This book is here to teach you the practical side of using AI and machine learning features with Swift and on Apple’s platforms. Because of this, we don’t explain the fine details of how to build apps; we assume you mostly know that (although if you don’t, we think you’ll be able to follow along just fine if you pay attention). If you want to learn Swift, we recommend picking up Learning Swift (also by us!) from the lovely folks at O’Reilly Media.
If you don’t want to manually build the iOS app, you can download the code from our website and find the project named SRDemo. After you have that, we strongly recommend that you still proceed through this section, comparing the notes here with the code you downloaded.
To make the app yourself, you’ll need to do the following:
Create an iOS app project in Xcode, choosing the “Single View App” template, and selecting the SwiftUI checkbox.
After your project is created, add a new Swift file called Speech.swift to the project. In that file, add a new class called SpeechRecognizer:

class SpeechRecognizer {
}
Add some attributes, covering all the necessary components you need to recognize speech:
private let audioEngine: AVAudioEngine
private let session: AVAudioSession
private let recognizer: SFSpeechRecognizer
private let inputBus: AVAudioNodeBus
private let inputNode: AVAudioInputNode

private var request: SFSpeechAudioBufferRecognitionRequest?
private var task: SFSpeechRecognitionTask?
private var permissions: Bool = false
We’re creating an AVAudioEngine, which is used to perform audio input or output; an AVAudioSession, which is used to help you specify to the operating system (OS) what kind of audio you’ll be working with; and an AVAudioNodeBus and AVAudioInputNode, which are used to establish connections with the input hardware on an iOS device (i.e., the microphones).

We also create an SFSpeechRecognizer, which allows us to initiate speech recognition and is part of Apple’s provided Speech framework. We also use SFSpeechAudioBufferRecognitionRequest to capture audio from a live buffer (i.e., a device’s microphone) in order to recognize speech.

An alternative to SFSpeechAudioBufferRecognitionRequest is SFSpeechURLRecognitionRequest, which allows you to perform speech recognition on a preexisting recorded audio file instead.

We also create an SFSpeechRecognitionTask, which represents an ongoing speech recognition task. We can use this to see when the task is done or to cancel it.
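We don’t use SFSpeechURLRecognitionRequest in this app, but here’s a minimal sketch (assuming you already have a recorded audio file URL on hand) of what recognizing speech from a file might look like:

import Speech

// A sketch only: recognizing speech in a prerecorded audio file instead of a live buffer.
func recognizeSpeech(in fileURL: URL, completion: @escaping (String?) -> ()) {
    guard let recognizer = SFSpeechRecognizer() else { return completion(nil) }

    let request = SFSpeechURLRecognitionRequest(url: fileURL)

    // in a real app, keep a reference to the returned task so you can cancel it
    _ = recognizer.recognitionTask(with: request) { result, error in
        guard error == nil, let result = result else { return completion(nil) }

        if result.isFinal {
            completion(result.bestTranscription.formattedString)
        }
    }
}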
Add an initializer:
init?(inputBus: AVAudioNodeBus = 0) {
    self.audioEngine = AVAudioEngine()
    self.session = AVAudioSession.sharedInstance()

    guard let recognizer = SFSpeechRecognizer() else { return nil }
    self.recognizer = recognizer

    self.inputBus = inputBus
    self.inputNode = audioEngine.inputNode
}
Our initializer creates the necessary audio capture components and assigns the bits and pieces we created a moment ago appropriately.
Add a function to check that we have the appropriate permissions to listen on the microphone (in order to do speech recognition):
func checkSessionPermissions(
    _ session: AVAudioSession, completion: @escaping (Bool) -> ()) {

    if session.responds(
        to: #selector(AVAudioSession.requestRecordPermission(_:))) {

        session.requestRecordPermission(completion)
    }
}
Add a function to start the recording, and some setup at the top:

func startRecording(completion: @escaping (String?) -> ()) {

    audioEngine.prepare()

    request = SFSpeechAudioBufferRecognitionRequest()
    request?.shouldReportPartialResults = true
}
Within this function, below the setup, check for audio and microphone access permissions:
// audio/microphone access permissions
checkSessionPermissions(session) { success in
    self.permissions = success
}

guard let _ = try? session.setCategory(
        .record,
        mode: .measurement,
        options: .duckOthers),
    let _ = try? session.setActive(
        true,
        options: .notifyOthersOnDeactivation),
    let _ = try? audioEngine.start(),
    let request = self.request
    else {
        return completion(nil)
}
Set the recording format and create the necessary buffer:
let recordingFormat = inputNode.outputFormat(forBus: inputBus)

inputNode.installTap(
    onBus: inputBus,
    bufferSize: 1024,
    format: recordingFormat) { (buffer: AVAudioPCMBuffer, when: AVAudioTime) in

    self.request?.append(buffer)
}
Print out a message (to the console, not visually in the app) that recording (listening) has started:
print("Started recording...")
You can display the console in Xcode by going to the View menu → Debug Area → Activate Console.
Begin the recognition:
task = recognizer.recognitionTask(with: request) { result, error in
    if let result = result {
        let transcript = result.bestTranscription.formattedString

        print("Heard: \"\(transcript)\"")

        completion(transcript)
    }

    if error != nil || result?.isFinal == true {
        self.stopRecording()
        completion(nil)
    }
}
In the Speech.swift file, add a function to stop recording:
func stopRecording() {
    print("...stopped recording.")

    request?.endAudio()
    audioEngine.stop()
    inputNode.removeTap(onBus: 0)

    request = nil
    task = nil
}
Because we’re going to access the microphone, you’ll need to add the NSMicrophoneUsageDescription key to the Info.plist file, along with an explanation for why we’re using the microphone.

You’ll also need to add NSSpeechRecognitionUsageDescription for speech recognition. The messages will be displayed to the user. Figure 5-2 shows our messages.
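The demo code in this section only checks microphone permission explicitly. If you also want to request speech-recognition authorization in code (Apple’s guidelines ask you to get the user’s consent before recognizing speech), a small sketch using SFSpeechRecognizer.requestAuthorization might look like this:

import Foundation
import Speech

// A sketch only: explicitly requesting speech-recognition authorization up front.
func requestSpeechAuthorization() {
    SFSpeechRecognizer.requestAuthorization { status in
        DispatchQueue.main.async {
            switch status {
            case .authorized:
                print("Speech recognition authorized")
            case .denied, .restricted, .notDetermined:
                print("Speech recognition not available: \(status)")
            @unknown default:
                print("Unknown speech recognition authorization status")
            }
        }
    }
}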
Next, we need to start working with the view file, ContentView.swift:
At the top of the file, update the imports:

import Speech
import SwiftUI
import AVFoundation
In this, we bring in Speech for speech recognition, SwiftUI for SwiftUI, and AVFoundation for audio capabilities.

Create a SwiftUI View to use within a Button, to make it a bit fancier looking. Let’s name it ButtonLabel:
struct ButtonLabel: View {
    private let title: String
    private let background: Color

    var body: some View {
        HStack {
            Spacer()
            Text(title)
                .font(.title)
                .bold()
                .foregroundColor(.white)
            Spacer()
        }.padding().background(background).cornerRadius(10)
    }

    init(_ title: String, background: Color) {
        self.title = title
        self.background = background
    }
}
This view basically allows us to style some text in a reusable fashion. It’s a Text view, wrapped in an HStack, with an initializer that allows us to provide a title String and a Color, for convenience.

We move now to the bulk of the code in the View, the ContentView. Much of this came with the project template, but we’ll be starting with something that looks like this (it’s probably already there):
struct ContentView: View {
}

Into this View, we need to add some @State variables:
@State var recording: Bool = false
@State var speech: String = ""

recording is a Bool that reflects the current state of recording, and speech is a String that will store the recognized text.
Move down below the body View (still within the ContentView struct) and add a variable named recognizer to store a SpeechRecognizer:
private let recognizer: SpeechRecognizer

init() {
    guard let recognizer = SpeechRecognizer() else {
        fatalError("Something went wrong...")
    }
    self.recognizer = recognizer
}
In this, we initialize a new SpeechRecognizer (the class defined in Speech.swift) and store it in recognizer, which we defined a moment ago.

Add a function named startRecording(), which will start listening:
private func startRecording() {
    self.recording = true
    self.speech = ""

    recognizer.startRecording { result in
        if let text = result {
            self.speech = text
        } else {
            self.stopRecording()
        }
    }
}
This function sets the recording state variable to true and the speech state variable to an empty String, and then uses our SpeechRecognizer (recognizer) to start recording, storing the result in speech, if there is one.

Add a function to stop recording, creatively called stopRecording():
private func stopRecording() {
    self.recording = false
    recognizer.stopRecording()
}
This function sets the recording state variable to false and instructs the SpeechRecognizer in recognizer to stop recording.

We don’t need to touch the ContentView_Previews struct in this case.
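The steps above don’t show the body of ContentView; here’s a minimal sketch of one possible body, using the recording and speech state variables and the ButtonLabel view we made earlier (the layout in the downloadable SRDemo project may differ):

var body: some View {
    NavigationView {
        VStack {
            // show instructions until we have some recognized speech
            Text(speech.isEmpty ? "Tap Start and say something!" : speech)
                .padding()

            Spacer()

            if recording {
                Button(action: { self.stopRecording() }) {
                    ButtonLabel("Stop Recording", background: .red)
                }
            } else {
                Button(action: { self.startRecording() }) {
                    ButtonLabel("Start Recording", background: .blue)
                }
            }
        }
        .padding()
        .navigationBarTitle("Speech Recognizer")
    }
}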
You can now run the app. Tap the button and speak, and you should see the words you say appear in the Text component.
As we did back in Chapter 4, we used one of Apple’s provided frameworks to do literally all the AI work for this practical example. SFSpeechRecognizer is Apple’s provided speech recognition framework, and as of macOS Catalina (10.15), it’s available to both iOS and macOS apps.
You can also do speech recognition on watchOS and tvOS, but it’s a little bit different and beyond the scope of this book. To learn more about speech recognition on Apple platforms in general, head to https://apple.co/33Hry2t.
SFSpeechRecognizer supports offline speech recognition for many languages, but it also might (i.e., does) rely on Apple’s server support (which is not something you need to configure) as needed. Apple’s documentation is vague about which languages support offline recognition and under what conditions the server is contacted, but it strongly emphasizes that speech recognition via SFSpeechRecognizer should always be assumed to require connectivity.
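If you’re targeting iOS 13 or later and want to check for (or insist on) on-device recognition, a small sketch, assuming the same request types we use in this task:

import Speech

// A sketch: prefer on-device recognition when the current recognizer supports it.
func makeRequestPreferringOnDevice(
    for recognizer: SFSpeechRecognizer) -> SFSpeechAudioBufferRecognitionRequest {

    let request = SFSpeechAudioBufferRecognitionRequest()

    if recognizer.supportsOnDeviceRecognition {
        // keep audio on the device; recognition fails rather than contacting the server
        request.requiresOnDeviceRecognition = true
    }

    return request
}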
It’s very important to always follow Apple’s guidelines for asking permission. When using SFSpeechRecognizer, Apple requests that you always ask permission from the user to perform speech recognition because it might be cloud-based. Awareness of privacy implications is very important. Do pay attention, 007…
There are possibly some limits (e.g., per device, per day, and so on) to how much speech recognition you can perform. Apple isn’t clear on this just yet, and implies that it will evolve and crystallize with time.
Answering “What’s next?” is a complex question for this topic. If you want to add speech recognition to your iOS or macOS apps, this is everything you need. You’re ready to go.
Because this book is about the practical side of AI, and we want to approach things from the top down, we think that this is everything you really need right now.
However, if you’re curious, you can go further by exploring how you might train a model to recognize speech. We’re not going to step through this, because it’s definitely beyond the scope of this book, but the toolkit and data that we’d be exploring for doing this from scratch would resemble the following:
The Speech Commands Dataset, available from Google Brain (this is a very large file!)
The Common Voice Dataset, available from Mozilla
The Python version of TensorFlow
Alternatively, if you get a copy of the TensorFlow source tree, build anything necessary for you to run TensorFlow, and want to try building your own very small speech recognition model, you could do the following:
Execute the Python script train.py, located in the examples/speech_commands directory of the TensorFlow tree. This downloads the aforementioned Speech Commands Dataset (this might take a while) and begins training.
You will see the training occur, step by step, and occasionally a confusion matrix will be displayed that shows you what mistakes the model is making at the time.
You will also see some validation measures output, showing the validation accuracy of the model on the validation dataset (which is a 10% split that is done automatically by the train.py script).
Eventually, after what could be many hours, you will have a model. You will need to freeze the model, which compacts it for use on mobile devices, using the freeze.py script located in the same directory.
You can use the label_wav.py script, also in the same directory, to pass audio files into the model to test it.
There’s a full tutorial for a process similar to the one we just outlined available in the TensorFlow documentation.
The simple model that can be trained using TensorFlow that we outline here is based on the paper “Convolutional Neural Networks for Small-footprint Keyword Spotting.” If you’re interested in going beyond practical AI, it’s definitely one of the more readable “proper” AI papers.
You can also use the TensorFlow to CoreML Converter, which is a project from both Apple and Google, to convert the model from TensorFlow’s format to an .mlmodel file. This would allow you to use it with CoreML in an iOS app.
Check back to Chapter 2 for details on how to use Apple’s CoreMLTools and the TensorFlow to CoreML Converter. Later in this book, in both Chapter 8 and Chapter 9, we use CoreML Tools to convert models for use with CoreML.
Exploring this in its entirety is beyond the scope of this book, but it is the next step if you’re curious. Visit our website for articles and links that explore this sort of thing.
For our next audio task, we want you to imagine that you’re building an app for a zoo. One of the features that you’ve been asked to create is a system in which users can open up the app when they hear animals in the distance making a noise, and the app can identify and inform the users what kind of animal they’re hearing. This is a sound classification problem.
Sound classifiers, given a sound, will assign it to one of a predetermined collection of labels. They’re classifiers, so they work like any other classifiers. (We discuss how they work under the hood in “Sound Classification”.)
None of the sound classification features provided by Apple’s machine-learning tools are designed to be used with human speech. You can use them on weird noises that you might want to make, but they’re not designed around speech recognition.
In this chapter, we build an app, the final version of which is shown in Figure 5-3, that can assign a sound it hears to one of nine different buckets.
For this task, we explore the practical side of sound classification by doing the following:
Building an app that can record some audio, perform a sound classification on the recording, and inform us as to what animal made the noise
Selecting a toolkit for creating the sound classification model and assembling a dataset for the problem
Building and training our sound classification model
Incorporating the sound classification model into our app
Improving our app
After that, we’ll quickly touch on the theory of how sound classification works and point to some further resources for improvements and changes you can make on your own. Let’s get started.
Our sound classification app is going to use UIKit, Apple’s older UI framework for iOS. This app makes slightly more advanced use of native iOS views, including a UICollectionView and a UIProgressView, so if you’re unfamiliar with those, this app might look a little scary.
Never fear. We explain them as we go in a little more detail than other iOS views that we’ve been using.
This book is here to teach you the practical side of using AI and machine learning features with Swift and on Apple’s platforms. Because of this, we don’t explain the fine details of how to build apps; we assume you mostly know that (although if you don’t, we think you’ll be able to follow along just fine if you pay attention). If you want to learn Swift, we recommend picking up Learning Swift (also by us!) from the lovely folks at O’Reilly Media.
The general look of the app we’re going to build here, even in its starting point form, is shown in Figure 5-3. The starting point will have the following components:
A UIButton to trigger the recording (and later the automatic classification of) a sound

A UICollectionView, showing a collection of different animals (each in a UICollectionViewCell) in emoji form, which will light up depending on what type of animal sound is heard

A UIProgressView to indicate how far through its recording (listening) process the app is

An AVAudioRecorder (and its associated AVAudioRecorderDelegate) to record audio
If you don’t want to manually build the starting point iOS app, you can download the code from our website and find the project named SCDemo-Starter. After you have that, skim through the rest of this section (don’t skip it!) and then meet us at “AI Toolkit and Dataset”.
To make the sound classification starting point yourself, follow these steps:
Create an iOS app project in Xcode, choosing the “Single View App” template. Don’t select any of the checkboxes below the Language drop-down (which are, as usual, set to “Swift”).
We’re going to start with code instead of the storyboard. That’s because we’re creating some custom classes that inherit from standard UI objects.
Add a new Swift file to the project and name it Animals.swift. In that file, add the following enum:

enum Animal: String, CaseIterable {
}
We’re going to use this enum type to represent the animal sounds that the app can detect. Note that the enum we created, which is called Animal, conforms to both String and CaseIterable. What String means should be obvious: this is an enumeration of Strings, but conforming to CaseIterable allows us to access a collection of all of the cases of Animal by using the .allCases property.

You can read more about the CaseIterable protocol in Apple’s documentation.
With our Animal type in place, we need to add some cases. Add the following to the top, within the Animal type:

case dog, pig, cow, frog, cat, insects, sheep, crow, chicken
These are the nine different animal cases for which we’ll be able to classify the sounds.
Add an initializer so that the right case can be assigned when an Animal is needed:

init?(rawValue: String) {
    if let match = Self.allCases.first(where: { $0.rawValue == rawValue }) {
        self = match
    } else if rawValue == "rooster" || rawValue == "hen" {
        self = .chicken
    } else {
        return nil
    }
}
This matches the incoming raw value to one of the cases, except in the case of the incoming raw value being either the string “rooster” or the string “hen”, which are both also matched to the chicken case because they’re varieties of chicken (for the purposes of this app, just in case there are any chickenologists out there who disagree…).
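As a quick illustration of how this initializer and CaseIterable behave (these lines are just an example, not something you need to add to the app):

Animal(rawValue: "dog")      // .dog
Animal(rawValue: "rooster")  // .chicken, because a rooster is a variety of chicken
Animal(rawValue: "whale")    // nil, we don't classify whales
Animal.allCases.count        // 9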
We want to return a nice icon (which will just be an emoji) for each case:
var icon: String {
    switch self {
        // an emoji for each animal (any suitable emoji will do)
        case .dog: return "🐕"
        case .pig: return "🐖"
        case .cow: return "🐄"
        case .frog: return "🐸"
        case .cat: return "🐈"
        case .insects: return "🐛"
        case .sheep: return "🐑"
        case .crow: return "🐦"
        case .chicken: return "🐔"
    }
}
Assign a color to each animal so that the views that we ultimately display them in look nice:
var color: UIColor {
    switch self {
        case .dog: return .systemRed
        case .pig: return .systemBlue
        case .cow: return .systemOrange
        case .frog: return .systemYellow
        case .cat: return .systemTeal
        case .insects: return .systemPink
        case .sheep: return .systemPurple
        case .crow: return .systemGreen
        case .chicken: return .systemIndigo
    }
}
We’ve just arbitrarily picked some colors here, so go nuts if you have any better ideas than we did. We definitely feel that insects are pink, though.
That’s everything we need to do in Animals.swift, so make sure that you save the file, and then let’s move on to the ViewController.swift file. There’s a fair bit of work to do there.
The first thing we need to do in ViewController.swift is create a button that can move between three different states. We’re going to use this button to allow users to record a sound, which will ultimately be classified.
The button needs to be able to switch from being a nice, friendly button inviting users to trigger a recording, to showing that recording is in progress. We also want to set up a state in which it’s disabled in case something prevents a recording from being made or the app is busy classifying the recording (which could take some time).
We could perform all these state changes manually on a standard UIButton, but we want to make sure the code that connects to the AI features later on is as clean and simple as possible, so we’re abstracting a few bits and pieces out in ways that make that code more obvious. Also, it’s good practice to do it like this!
Add a new class to the ViewController.swift file:

class ThreeStateButton: UIButton {
}
This is just a new class called ThreeStateButton that inherits from UIButton. At this point, we could implement some ThreeStateButtons, and they’d just be UIButtons.
Add an enum to represent the different states of the button:

enum ButtonState {
    case enabled(title: String, color: UIColor)
    case inProgress(title: String, color: UIColor)
    case disabled(title: String, color: UIColor)
}
Add a function to change the state of the button:
func changeState(to state: ThreeStateButton.ButtonState) {
    switch state {
        case .enabled(let title, let color):
            self.setTitle(title, for: .normal)
            self.backgroundColor = color
            self.isEnabled = true

        case .inProgress(let title, let color):
            self.setTitle(title, for: .disabled)
            self.backgroundColor = color
            self.isEnabled = false

        case .disabled(let title, let color):
            self.setTitle(title, for: .disabled)
            self.backgroundColor = color
            self.isEnabled = false
    }
}
This function takes a state (which is the ButtonState enum we created a moment ago) and changes the ThreeStateButton to that state. Each state involves a change of title (which is provided when this is called; it is not predefined), a new background color (also provided when this is called), and an actual enabling or disabling of the button.
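For example, the call we’ll make later in recordAudio() to show that a recording is in progress looks like this:

recordButton.changeState(
    to: .inProgress(title: "Recording...", color: .systemRed)
)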
The time has come to build our UI storyboard, but we need one more thing. Because we’re going to use a UICollectionView, which is made up of a collection (bet you’d never have guessed!) of cells, we’re going to subclass UICollectionViewCell and use it to display each of the animal types for which the app can detect the sound.
Add the following code to the ViewController.swift file, outside of any existing classes or definitions (we recommend adding it to the very bottom):
class AnimalCell: UICollectionViewCell {
    static let identifier = "AnimalCollectionViewCell"
}
This creates a subclass of UICollectionViewCell named AnimalCell and provides an identifier by which we can refer to it within our storyboard (which we make next. We promise!)
Now, you can open the Main.storyboard file, and create a UI:
Add the following components to your storyboard:
A UIButton, to trigger the sound recording (and show that a recording is in progress)

A UIProgressView, which shows the length of the recording

A UICollectionView, which holds cells to show each animal type for which the app can detect the sound

Within the UICollectionView, a prototype UICollectionViewCell, which displays each animal.
You can see an image of our storyboard in Figure 5-4. Make sure you add the necessary constraints!
We need to change the class of some of these components to the custom types we created in code earlier. Select the UICollectionViewCell inside the UICollectionView and then, with the Identity Inspector open, change its class to AnimalCell (it should autocomplete for you), as shown in Figure 5-5.
Within the cell’s view, in the storyboard, add a large UILabel and center it appropriately using constraints.
We need to add some outlets for the AnimalCell. Add the following outlets within the AnimalCell class definition in ViewController.swift, connecting them to the “Cell View” (a UIView, which came premade in the cell) and the UILabel you created, respectively:

@IBOutlet weak var cellView: UIView!
@IBOutlet weak var textLabel: UILabel!
Select the UIButton you created when we first started working with this storyboard and change its class to ThreeStateButton in its Identity Inspector (just as we did for the UICollectionViewCell/AnimalCell).
Connect some outlets to the ViewController class itself:

@IBOutlet weak var collectionView: UICollectionView!
@IBOutlet weak var progressBar: UIProgressView!
@IBOutlet weak var recordButton: ThreeStateButton!
These outlets are for the UICollectionView itself, the UIProgressView, and the UIButton for which we just changed the class to ThreeStateButton.

Add an action, connected to the ThreeStateButton:

@IBAction func recordButtonPressed(_ sender: Any) {
    // start audio recording
    recordAudio()
}
Add the following attributes to the ViewController class:

private var recordingLength: Double = 5.0
private var classification: Animal?

private lazy var audioRecorder: AVAudioRecorder? = {
    return initialiseAudioRecorder()
}()

private lazy var recordedAudioFilename: URL = {
    let directory = FileManager.default.urls(
        for: .documentDirectory, in: .userDomainMask)[0]

    return directory.appendingPathComponent("recording.m4a")
}()
These attributes define a recording length, a variable in which to store the ultimate classification of a sound, an AVAudioRecorder, and a filename for the file in which we’ll store the recording.

Update the viewDidLoad() function to look like the following:

override func viewDidLoad() {
    super.viewDidLoad()
    collectionView.dataSource = self
}
Add a function to start the audio recording, using the attribute we created earlier to access AVAudioRecorder:

private func recordAudio() {
    guard let audioRecorder = audioRecorder else { return }

    classification = nil
    collectionView.reloadData()

    recordButton.changeState(
        to: .inProgress(title: "Recording...", color: .systemRed)
    )
    progressBar.isHidden = false

    audioRecorder.record(forDuration: TimeInterval(recordingLength))

    UIView.animate(withDuration: recordingLength) {
        self.progressBar.setProgress(
            Float(self.recordingLength), animated: true)
    }
}
You’ll also need a function to finish recording, taking a Bool as a parameter to indicate whether the recording was successful, just in case it wasn’t (it defaults to true):

private func finishRecording(success: Bool = true) {
    progressBar.isHidden = true
    progressBar.progress = 0

    if success, let audioFile = try? AVAudioFile(forReading: recordedAudioFilename) {
        recordButton.changeState(
            to: .disabled(title: "Record Sound", color: .systemGray)
        )
        classifySound(file: audioFile)
    } else {
        summonAlertView()
        classify(nil)
    }
}
Add a method to update the UICollectionView to show which animal we think the sound is. This function takes an Animal as input:

private func classify(_ animal: Animal?) {
    classification = animal
    recordButton.changeState(
        to: .enabled(title: "Record Sound", color: .systemBlue)
    )
    collectionView.reloadData()
}
In the ViewController class (still within the ViewController.swift file), add a function that takes an AVAudioFile (the result of our recording) and does something with it (it will do a lot more after we add the AI features):

private func classifySound(file: AVAudioFile) {
    classify(Animal.allCases.randomElement()!)
}
We’re done with the ViewController class for the moment. Now, we need to add an extension or three to the class:

Below the end of the ViewController class, but still within the ViewController.swift file, add the following extension, which will allow us to present an alert view (a pop up) in case of a problem:
extension ViewController {
    private func summonAlertView(message: String? = nil) {
        let alertController = UIAlertController(
            title: "Error",
            message: message ?? "Action could not be completed.",
            preferredStyle: .alert
        )
        alertController.addAction(UIAlertAction(title: "OK", style: .default))
        present(alertController, animated: true)
    }
}
Below that, add another extension, allowing us to conform to the AVAudioRecorderDelegate in order to work with the AVAudioRecorder:

extension ViewController: AVAudioRecorderDelegate {
    func audioRecorderDidFinishRecording(
        _ recorder: AVAudioRecorder, successfully flag: Bool) {

        finishRecording(success: flag)
    }

    private func initialiseAudioRecorder() -> AVAudioRecorder? {
        let settings = [
            AVFormatIDKey: Int(kAudioFormatMPEG4AAC),
            AVSampleRateKey: 12000,
            AVNumberOfChannelsKey: 1,
            AVEncoderAudioQualityKey: AVAudioQuality.high.rawValue
        ]

        let recorder = try? AVAudioRecorder(
            url: recordedAudioFilename, settings: settings)
        recorder?.delegate = self
        return recorder
    }
}
Again, below that, add one final extension to allow us to conform to the UICollectionViewDataSource, which provides the ability to populate a UICollectionView (we’re going to fill it with AnimalCells):

extension ViewController: UICollectionViewDataSource {
    func collectionView(_ collectionView: UICollectionView,
        numberOfItemsInSection section: Int) -> Int {

        return Animal.allCases.count
    }

    func collectionView(_ collectionView: UICollectionView,
        cellForItemAt indexPath: IndexPath) -> UICollectionViewCell {

        guard let cell = collectionView.dequeueReusableCell(
            withReuseIdentifier: AnimalCell.identifier,
            for: indexPath) as? AnimalCell else {
                return UICollectionViewCell()
        }

        let animal = Animal.allCases[indexPath.item]

        cell.textLabel.text = animal.icon
        cell.backgroundColor =
            (animal == self.classification) ? animal.color : .systemGray

        return cell
    }
}
Add a launch screen and an icon, if you’d like to (as with the previous practical tasks, we’ve provided some in the downloadable resources), and launch the app in the simulator. You should see something that looks like the figure from earlier, Figure 5-3.
This app actually does record audio, but it doesn’t do anything with it. There’s no way to play it back, and it’s obviously not yet connected to any form of machine-learning model. Back in the classifySound() function that we wrote, we just randomly pick one of the animals each time.
As usual with our practical AI tasks, we need to assemble a toolkit with which to tackle the problem. The primary tools that we use in this case are Python to prepare the data for training, the CreateML application for training, and CoreML to read the model in our app.
To make a model that will power our app’s ability to classify animal sounds, we’ll need a dataset full of animal sounds. As is often the case (you might have noticed a pattern) with machine learning and AI datasets, the boffins have us covered.
The Dataset for Environmental Sound Classification (ESC) is a collection of short environmental recordings of a variety of sounds spanning five major categories, as shown in Figure 5-6.
Head over to the ESC-50 GitHub repository and download a copy of the dataset. Save it somewhere safe.
You could do everything that our Python script does manually by yourself, but it would probably take longer than doing it with a script. It’s always good practice to make things repeatable when you’re working on machine-learning problems.
Fire up a new Python environment, following the instructions in “Python”, and then do the following:
Create the following Python script (ours is called preparation.py) using your favorite text editor (we like Visual Studio Code and BBEdit, but anything works):
import os
import shutil
import pandas as pd

# Make output directory
try:
    os.makedirs(output_directory)
except OSError:
    if not os.path.isdir(output_directory):
        raise

# Make class directories within it
for class_name in classes_to_include:
    class_directory = output_directory + class_name + '/'
    try:
        os.makedirs(class_directory)
    except OSError:
        if not os.path.isdir(class_directory):
            raise

# Go through CSV to sort audio into class directories
classes_file = pd.read_csv(
    input_classes_filename, encoding='utf-8', header='infer')

# format: filename, fold, target, category, esc10, src_file, take
for line in classes_file.itertuples(index=False):
    if include_unlicensed or line[4] == True:
        file_class = line[3]

        if file_class in classes_to_include:
            file_name = line[0]
            file_src = sounds_directory + file_name
            file_dst = output_directory + file_class + '/' + file_name

            try:
                shutil.copy2(file_src, file_dst)
            except IOError:
                raise

This script imports Pandas (as shown in Figure 2-18), makes an output folder that contains subfolders for each class, and then parses the comma-separated values (CSV) file and writes out files in a new format.
We use the Pandas framework here to access its CSV-reading capabilities. Very useful stuff.
To point the script to the appropriate classes and input files, at the top of your Python script, add the following configuration variables, after the import statements but before the actual script starts:

# Configure as required
input_classes_filename = '/Users/mars/Desktop/ESC-50-master/meta/esc50.csv'
sounds_directory = '/Users/mars/Desktop/ESC-50-master/audio/'
output_directory = '/Users/mars/Desktop/ESC-50-master/classes/'

classes_to_include = [
    'dog', 'rooster', 'pig', 'cow', 'frog',
    'cat', 'hen', 'insects', 'sheep', 'crow'
]

# whether to use whole ESC-50 dataset or lesser-restricted ESC-10 subset
include_unlicensed = False
Update each of the paths to point to the proper place on your system: input_classes_filename should point to the esc50.csv file that came with your copy of the dataset; sounds_directory to the /audio/ folder; and output_directory to wherever you want the script to place the subset of files it will need for training (we made a folder called /classes/ in the dataset download).

The classes_to_include list includes all of the animals for which we’re going to be allowing our app to classify the sound (which is not all of the animal sounds present in the dataset).

Run the preparation Python script by executing it on the command line (python preparation.py). Your data should now be prepared, and you should have it in a folder structure that looks like Figure 5-7.
With our animal sound dataset ready to go, let’s turn to Apple’s CreateML application, just as we did in Chapter 4, to build a sound classification model.
To learn more about the various incarnations of CreateML, check back to “CreateML”.
Let’s build our animal sound classifier model:
Fire up CreateML and create a new Sound Classifier project by selecting the appropriate template, as shown in Figure 5-8.
After giving your project some details, you’ll see a new empty Sound Classifier project, as shown in Figure 5-9.
In the Training Data section, click the drop-down box and browse to the folder where you put the prepared data earlier (it should have 10 different animal-themed folders in it). Select this folder and then, in the top bar of the CreateML app, click the Play button (the right-facing triangle.)
Training the sound classifier won’t take as long as the image classifier back in “Task: Image Classification” did. But it might take a few minutes. (Watch five minutes of an episode of Person of Interest.)
When the training is done, you’ll see something resembling Figure 5-10. You’ll be able to drag the model file out from the Output box in the upper-right corner of the window. Drag this file somewhere safe.
With our CoreML model trained by CreateML, we’re ready to put it to work in our app.
You could have also trained the sound classification model using the CreateML framework and the MLSoundClassifier structure. You can learn more about it in Apple’s documentation.
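For the curious, here’s a macOS-only sketch of training the same kind of model in code with the CreateML framework (for example, in a macOS playground). The paths are hypothetical and should point to the class folders our Python script produced and to wherever you want the model written:

import CreateML
import Foundation

// Train a sound classifier from a folder of labeled subfolders (one per class).
let trainingData = URL(fileURLWithPath: "/Users/mars/Desktop/ESC-50-master/classes/")

let classifier = try MLSoundClassifier(
    trainingData: .labeledDirectories(at: trainingData))

// Inspect how well it did before exporting.
print(classifier.trainingMetrics)
print(classifier.validationMetrics)

// Write out a CoreML model we can drag into Xcode.
try classifier.write(
    to: URL(fileURLWithPath: "/Users/mars/Desktop/AnimalSounds.mlmodel"))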
At this point, if you’ve been following along, you have a starter iOS app written in Swift, and a trained sound classifier model built using the CreateML application all ready to go. Let’s combine them and make the iOS app capable of sound classification.
If you didn’t build the starting point yourself (following the instructions in “Building the App”), you can download the code from our website and find the project named SCDemo-Starter. We’ll be progressing from that point in this section.
You’ll also find a trained sound classifier model in the same folder as the demo project folder, as shown in Figure 5-11.
If you don’t want to follow along and manually work with the iOS app’s code to add the AI features via the sound classification model that we trained, you can also download the project named SCDemo-Complete. If you choose to download the SCDemo-Complete project instead of stepping through this section, we still urge you to read this section and look at the relevant bits of the SCDemo-Complete project.
As usual, we’re going to need to change quite a few things to get the app working with our sound classification model:
Drag the .mlmodel file you created earlier into the project’s root, allowing Xcode to copy as needed.
Add the SoundAnalysis framework to the imports in ViewController.swift:

import SoundAnalysis

SoundAnalysis is a framework provided by Apple that lets you analyze audio and classify it. SoundAnalysis works with a model trained using CreateML’s MLSoundClassifier (which is what you end up with whether you used the CreateML app, like we did, or the CreateML framework).
Add an attribute for the classifier, pointing to our model file:

private let classifier = AudioClassifier(model: AnimalSounds().model)

Make sure to change the name from AnimalSounds().model to that of your model if your model is named something else (for example, if your model is named MyAnimalClassifier.mlmodel, set this to MyAnimalClassifier().model).
Add a new function, refresh(), after the viewDidLoad() function:

private func refresh(clear: Bool = false) {
    if clear {
        classification = nil
    }
    collectionView.reloadData()
}

This function is so that we can ask the UICollectionView to refresh as needed.
Change the recordAudio() function to be as follows:

private func recordAudio() {
    guard let audioRecorder = audioRecorder else { return }

    refresh(clear: true)

    recordButton.changeState(
        to: .inProgress(title: "Recording...", color: .systemRed)
    )
    progressBar.isHidden = false

    audioRecorder.record(forDuration: TimeInterval(recordingLength))

    UIView.animate(withDuration: recordingLength) {
        self.progressBar.setProgress(
            Float(self.recordingLength), animated: true)
    }
}

This sets it up so that instead of performing the refresh itself, the function calls our new refresh() function (which we created a moment ago).
Update the finishRecording() function to be as follows:

private func finishRecording(success: Bool = true) {
    progressBar.isHidden = true
    progressBar.progress = 0

    if success {
        recordButton.changeState(
            to: .disabled(title: "Record Sound", color: .systemGray)
        )
        classifySound(file: recordedAudioFilename)
    } else {
        classify(nil)
    }
}

This function is called when recording is finished. If the success Bool is true, it disables the record button and calls classifySound(), passing the recordedAudioFilename.
Replace the call to collectionView.reloadData() in the classify() function with the following:

refresh()

if classification == nil {
    summonAlertView()
}
Update the classifySound() function, as follows:

private func classifySound(file: URL) {
    classifier?.classify(audioFile: file) { result in
        self.classify(Animal(rawValue: result ?? ""))
    }
}
This removes the random animal that we used as a placeholder in the starter app, and actually uses our model to perform a classification. You can now launch the app. You should see something that looks exactly like it always did, as shown in Figure 5-12.
You can now tap the Record Sound button, record some noises, and the app should light up the animal relating to the sound that it thinks it heard. Amazing!
You’ll need to add the NSMicrophoneUsageDescription key to your Info.plist file, as we did for the “Task: Speech Recognition”, as shown in Figure 5-13.
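The AudioClassifier type we used in classifySound() ships with the downloadable SCDemo projects rather than being listed here. If you’re curious how a file-based classifier could work, here’s a minimal sketch (the class and method names are our own, for illustration) built on SNAudioFileAnalyzer from the SoundAnalysis framework:

import CoreML
import Foundation
import SoundAnalysis

// A sketch only: classify a recorded audio file with a CreateML sound classifier model.
class FileSoundClassifier {
    private class Observer: NSObject, SNResultsObserving {
        let completion: (String?) -> ()
        init(completion: @escaping (String?) -> ()) { self.completion = completion }

        func request(_ request: SNRequest, didProduce result: SNResult) {
            guard let results = result as? SNClassificationResult,
                let best = results.classifications.first else { return }
            DispatchQueue.main.async { self.completion(best.identifier) }
        }

        func request(_ request: SNRequest, didFailWithError error: Error) {
            DispatchQueue.main.async { self.completion(nil) }
        }
    }

    private let request: SNClassifySoundRequest
    private var observer: Observer?

    init?(model: MLModel) {
        guard let request = try? SNClassifySoundRequest(mlModel: model) else { return nil }
        self.request = request
    }

    func classify(audioFile: URL, completion: @escaping (String?) -> ()) {
        guard let analyzer = try? SNAudioFileAnalyzer(url: audioFile) else {
            return completion(nil)
        }

        let observer = Observer(completion: completion)
        self.observer = observer

        guard let _ = try? analyzer.add(request, withObserver: observer) else {
            return completion(nil)
        }

        // analyze() runs synchronously; in a real app, run it on a background queue
        analyzer.analyze()
    }
}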
At this point, we have an iOS application, written in Swift and using UIKit and integrated with a CoreML model generated using CreateML, that can, with a reasonable degree of reliability, record some audio and then perform a classification on it, using the Sound Analysis framework, and tell us which of nine possible animals the sound in the audio recording might belong to. What’s the next step?
In this section, we improve the sound classification app, making it capable of performing real-time sound classification instead of having to record an audio file and then classify it.
You’ll need to have completed all the steps presented prior to this section to follow from here.
If you don’t want to do that or you need a clean starting point, you can download the resources for this book from our website and find the project SCDemo-Complete. We’ll be building on the app from there. If you don’t want to follow the instructions in this section, you can also find the project SCDemo-Improved, which is the end result of this section. If you go down that route, we strongly recommend reading the code as we discuss it in this section and comparing it with the code in SCDemo-Improved.
There are a lot of code changes required here, so take your time. To begin, create a new Swift file, named Audio.swift, in the project:
Add the following imports:

import CoreML
import AVFoundation
import SoundAnalysis
Add the following class to it:
class ResultsObserver: NSObject, SNResultsObserving {
    private var completion: (String?) -> ()

    init(completion: @escaping (String?) -> ()) {
        self.completion = completion
    }

    func request(_ request: SNRequest, didProduce result: SNResult) {
        guard let results = result as? SNClassificationResult,
            let result = results.classifications.first else { return }

        let label = result.confidence > 0.7 ? result.identifier : nil

        DispatchQueue.main.async {
            self.completion(label)
        }
    }

    func request(_ request: SNRequest, didFailWithError error: Error) {
        completion(nil)
    }
}

This class implements the SNResultsObserving protocol, which is part of the SoundAnalysis framework that we imported. It allows us to create an interface with which to receive the results of a sound analysis request.
Create a class to represent the process of classifying audio. Let’s call it AudioClassifier (we’re incredibly creative). Add the following class to the Audio.swift file:

class AudioClassifier {
}
Add the following attributes:
private let model: MLModel
private let request: SNClassifySoundRequest

private let audioEngine = AVAudioEngine()
private let analysisQueue = DispatchQueue(label: "com.apple.AnalysisQueue")

private let inputFormat: AVAudioFormat
private let analyzer: SNAudioStreamAnalyzer
private let inputBus: AVAudioNodeBus

private var observer: ResultsObserver?
Each of these attributes should be fairly self-explanatory. If you’d like to know more, you can check Apple’s documentation.
Add an initializer:
init?(model: MLModel, inputBus: AVAudioNodeBus = 0) {
    guard let request = try? SNClassifySoundRequest(mlModel: model) else {
        return nil
    }

    self.model = model
    self.request = request
    self.inputBus = inputBus

    self.inputFormat = audioEngine.inputNode.inputFormat(forBus: inputBus)
    self.analyzer = SNAudioStreamAnalyzer(format: inputFormat)
}
The initializer should also be fairly self-explanatory: it sets the various attributes appropriately.
Add a function to begin the analysis to perform a classification:
func beginAnalysis(completion: @escaping (String?) -> ()) {
    guard let _ = try? audioEngine.start() else { return }

    print("Begin recording...")

    let observer = ResultsObserver(completion: completion)

    guard let _ = try? analyzer.add(request, withObserver: observer) else {
        return
    }

    self.observer = observer

    audioEngine.inputNode.installTap(
        onBus: inputBus,
        bufferSize: 8192,
        format: inputFormat) { buffer, time in

        self.analysisQueue.async {
            self.analyzer.analyze(
                buffer, atAudioFramePosition: time.sampleTime)
        }
    }
}
This code starts the analysis. It first attempts to fire up the audio system and then, effectively, just waits for results.
Add a function to stop the analysis:
func stopAnalysis() {
    print("End recording...")

    analyzer.completeAnalysis()
    analyzer.remove(request)

    audioEngine.inputNode.removeTap(onBus: inputBus)
    audioEngine.stop()
}
That’s it for the Audio.swift file. Make sure that you save it and then open ViewController.swift:
Replace the entire set of attributes in the ViewController class, as follows:

@IBOutlet weak var collectionView: UICollectionView!
@IBOutlet weak var progressBar: UIProgressView!
@IBOutlet weak var recordButton: ThreeStateButton!

@IBAction func recordButtonPressed(_ sender: Any) {
    toggleRecording()
}

private var recording: Bool = false
private var classification: Animal?

private let classifier = AudioClassifier(model: AnimalSounds().model)
Because we’re doing some of the work in our new AudioClassifier class (which we just created in Audio.swift), we no longer need quite so much code here. Make sure the IBOutlet and IBAction attributes are still connected or reconnected to the correct place in the storyboard (which should remain unmodified).

Comment out the recordAudio() function and add a new function, toggleRecording(), as follows:
private func toggleRecording() {
    recording = !recording

    if recording {
        refresh(clear: true)

        recordButton.changeState(
            to: .inProgress(title: "Stop", color: .systemRed)
        )

        classifier?.beginAnalysis { result in
            self.classify(Animal(rawValue: result ?? ""))
        }
    } else {
        refresh()

        recordButton.changeState(
            to: .enabled(title: "Record Sound", color: .systemBlue)
        )

        classifier?.stopAnalysis()
    }
}
Comment out the entire classifySound() function, and then update the classify() function to look as follows:

private func classify(_ animal: Animal?) {
    classification = animal
    refresh()
}
You can also comment out the entire extension of ViewController that conforms to AVAudioRecorderDelegate (we’ve moved out the audio functionality and changed how it works).

For a cleaner UI, update the .inProgress case of the switch statement in the changeState() function of the ThreeStateButton class, as follows:
case .inProgress(let title, let color):
    self.setTitle(title, for: .normal)
    self.backgroundColor = color
    self.isEnabled = true
Launch the improved app in the simulator. You should be able to tap the button and the app will perform live classification on the sounds it hears, lighting up the associated animal emoji.
Figure 5-14 shows our finished sound classifier.
That’s all for the audio chapter. We’ve covered some common audio-related practical AI tasks that you might want to accomplish with Swift, and we used a variety of tools to do so. We built two apps, exploring two practical AI tasks related to audio:
Using Apple’s new SwiftUI for the interface and Apple’s provided speech recognition framework, we built an app that can turn human speech into text. We didn’t need to train our own model.
We used Apple’s UIKit framework for the interface, some Python scripting to prepare the data (a whole bunch of animal sounds), and Apple’s CreateML application to train the model. We also used CoreML in the app, to work with our trained model.
Later, in Chapter 11, we’ll look at what happened under the hood, algorithm-wise, for each of the tasks we explored in this chapter (“Audio”). In the next chapter we look at text and language.
For more audio-related practical AI tasks, check out our website.