Building a new app using the model

Perform the following steps to build a complete new Android app that uses the speech_commands_graph.pb model we built in the last section:

  1. Create a new Android app named AudioRecognition, accepting all the defaults as in the previous chapters, then add the compile 'org.tensorflow:tensorflow-android:+' line to the end of the dependencies block in the app's build.gradle file (see the snippet after step 3).
  2. Add <uses-permission android:name="android.permission.RECORD_AUDIO" /> to the app's AndroidManifest.xml file so the app is allowed to record audio.
  3. Create a new assets folder, then drag and drop the speech_commands_graph.pb and conv_actions_labels.txt files, generated in steps 2 and 3 of the previous section, into the assets folder.
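
For reference, the dependencies block mentioned in step 1 should end up looking roughly like the following; the support-library entries and version numbers shown here are just placeholders for whatever your generated project already contains:

dependencies {
    implementation fileTree(dir: 'libs', include: ['*.jar'])
    implementation 'com.android.support:appcompat-v7:27.1.1'
    implementation 'com.android.support.constraint:constraint-layout:1.1.0'
    compile 'org.tensorflow:tensorflow-android:+'
}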
  4. Change the activity_main.xml file to hold three UI elements. The first one is a TextView for recognition result display:
<TextView
    android:id="@+id/textview"
    android:layout_width="wrap_content"
    android:layout_height="wrap_content"
    android:text=""
    android:textSize="24sp"
    android:textStyle="bold"
    app:layout_constraintBottom_toBottomOf="parent"
    app:layout_constraintLeft_toLeftOf="parent"
    app:layout_constraintRight_toRightOf="parent"
    app:layout_constraintTop_toTopOf="parent" />

The second TextView displays the 10 default commands we trained with the train.py Python program in step 2 of the last section:

<TextView
    android:layout_width="wrap_content"
    android:layout_height="wrap_content"
    android:text="yes no up down left right on off stop go"
    app:layout_constraintBottom_toBottomOf="parent"
    app:layout_constraintHorizontal_bias="0.50"
    app:layout_constraintLeft_toLeftOf="parent"
    app:layout_constraintRight_toRightOf="parent"
    app:layout_constraintTop_toTopOf="parent"
    app:layout_constraintVertical_bias="0.25" />

The last UI element is a Button that, when tapped, starts recording audio for one second and then sends the recording to our model for recognition:

<Button
    android:id="@+id/button"
    android:layout_width="wrap_content"
    android:layout_height="wrap_content"
    android:text="Start"
    app:layout_constraintBottom_toBottomOf="parent"
    app:layout_constraintHorizontal_bias="0.50"
    app:layout_constraintLeft_toLeftOf="parent"
    app:layout_constraintRight_toRightOf="parent"
    app:layout_constraintTop_toTopOf="parent"
    app:layout_constraintVertical_bias="0.8" />
  5. Open MainActivity.java and first make MainActivity implement Runnable. Then add the following constants, which define the model name, the label filename, the two input node names, and the output node name:
private static final String MODEL_FILENAME = "file:///android_asset/speech_commands_graph.pb";
private static final String LABEL_FILENAME = "file:///android_asset/conv_actions_labels.txt";
private static final String INPUT_DATA_NAME = "decoded_sample_data:0";
private static final String INPUT_SAMPLE_RATE_NAME = "decoded_sample_data:1";
private static final String OUTPUT_NODE_NAME = "labels_softmax";
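
The code in the later steps also references SAMPLE_RATE and RECORDING_LENGTH constants. One reasonable definition, assuming a 1-second recording at the 16,000 Hz sample rate the model was trained on, is:

private static final int SAMPLE_RATE = 16000;             // 16 kHz mono, what the model expects
private static final int RECORDING_LENGTH = SAMPLE_RATE;  // 1 second of 16-bit samples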
  6. Declare four instance variables:
private TensorFlowInferenceInterface mInferenceInterface;
private List<String> mLabels = new ArrayList<String>();
private Button mButton;
private TextView mTextView;
  7. In the onCreate method, we first instantiate mButton and mTextView, then set up the button click event handler, which first changes the button title and then launches a thread to do the recording and recognition:
mButton = findViewById(R.id.button);
mTextView = findViewById(R.id.textview);
mButton.setOnClickListener(new View.OnClickListener() {
    @Override
    public void onClick(View v) {
        mButton.setText("Listening...");
        Thread thread = new Thread(MainActivity.this);
        thread.start();
    }
});
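
Note that on Android 6.0 (API level 23) and later, RECORD_AUDIO is a dangerous permission that also has to be granted at runtime; the manifest entry from step 2 is not enough by itself. A minimal sketch of such a check, assuming you use the support library's ContextCompat and ActivityCompat helpers and an arbitrary request code of 1, could be added to onCreate:

// Ask the user for the RECORD_AUDIO permission if it hasn't been granted yet (API 23+)
if (ContextCompat.checkSelfPermission(this, Manifest.permission.RECORD_AUDIO)
        != PackageManager.PERMISSION_GRANTED) {
    ActivityCompat.requestPermissions(this,
            new String[] {Manifest.permission.RECORD_AUDIO}, 1);
}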

At the end of the onCreate method, we read the content of the label file line by line and save each line in the mLabels array list.
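
A minimal sketch of that label-loading code, assuming the file:///android_asset/ prefix is stripped from LABEL_FILENAME so the file can be opened through the AssetManager:

// Strip the "file:///android_asset/" prefix to get the actual asset filename
String actualLabelFilename = LABEL_FILENAME.split("file:///android_asset/")[1];
try {
    // Read conv_actions_labels.txt from assets, one label per line
    BufferedReader br = new BufferedReader(
            new InputStreamReader(getAssets().open(actualLabelFilename)));
    String line;
    while ((line = br.readLine()) != null) {
        mLabels.add(line);
    }
    br.close();
} catch (IOException e) {
    throw new RuntimeException("Problem reading label file!", e);
}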

  8. At the beginning of the public void run() method, which is started when the Start button is tapped, add code that first gets the minimum buffer size needed to create an Android AudioRecord object, then uses that bufferSize to create a new AudioRecord instance with a SAMPLE_RATE of 16,000 and the 16-bit mono format, the type of raw audio our model expects, and finally starts recording from the AudioRecord instance:
int bufferSize = AudioRecord.getMinBufferSize(SAMPLE_RATE, AudioFormat.CHANNEL_IN_MONO, AudioFormat.ENCODING_PCM_16BIT);
AudioRecord record = new AudioRecord(MediaRecorder.AudioSource.DEFAULT, SAMPLE_RATE, AudioFormat.CHANNEL_IN_MONO, AudioFormat.ENCODING_PCM_16BIT, bufferSize);

if (record.getState() != AudioRecord.STATE_INITIALIZED) return;
record.startRecording();
There are two classes in Android for recording audio: MediaRecorder and AudioRecord. MediaRecorder is easier to use than AudioRecord, but it only saves compressed audio files; recording raw, unprocessed audio with it is not supported until Android API Level 24 (Android 7.0). According to https://developer.android.com/about/dashboards/index.html, as of January 2018 more than 70% of Android devices in the market still run versions older than 7.0, so you probably don't want to target your app only at Android 7.0 or above. In addition, to decode the compressed audio recorded by MediaRecorder, you have to use MediaCodec, which is pretty complicated. AudioRecord, albeit a low-level API, is actually perfect for recording the raw, unprocessed data that is then sent to the speech commands recognition model.
  9. Create two arrays of 16-bit short integers, audioBuffer and recordingBuffer; for the 1-second recording, every time the AudioRecord object reads and fills the audioBuffer array, the data actually read is appended to the recordingBuffer:
long shortsRead = 0;
int recordingOffset = 0;
short[] audioBuffer = new short[bufferSize / 2];
short[] recordingBuffer = new short[RECORDING_LENGTH];
while (shortsRead < RECORDING_LENGTH) { // 1 second of recording
    int numberOfShort = record.read(audioBuffer, 0, audioBuffer.length);
    shortsRead += numberOfShort;
    // Don't copy past the end of recordingBuffer on the last read
    int numberToCopy = Math.min(numberOfShort, RECORDING_LENGTH - recordingOffset);
    System.arraycopy(audioBuffer, 0, recordingBuffer, recordingOffset, numberToCopy);
    recordingOffset += numberToCopy;
}
record.stop();
record.release();
  10. After the recording is done, we first change the button title to "Recognizing...":
runOnUiThread(new Runnable() {
    @Override
    public void run() {
        mButton.setText("Recognizing...");
    }
});

Then we convert the recordingBuffer short array to a float array, normalizing each element to the range of -1.0 to 1.0, as our model expects floats between -1.0 and 1.0:

float[] floatInputBuffer = new float[RECORDING_LENGTH];
for (int i = 0; i < RECORDING_LENGTH; ++i) {
    floatInputBuffer[i] = recordingBuffer[i] / 32767.0f;
}
  11. Create a new TensorFlowInferenceInterface as we did in the previous chapters, then call its feed method with the two input nodes' names and values; one is the sample rate and the other is the raw audio data stored in the floatInputBuffer array:
AssetManager assetManager = getAssets();
mInferenceInterface = new TensorFlowInferenceInterface(assetManager, MODEL_FILENAME);

int[] sampleRate = new int[] {SAMPLE_RATE};
mInferenceInterface.feed(INPUT_SAMPLE_RATE_NAME, sampleRate);

mInferenceInterface.feed(INPUT_DATA_NAME, floatInputBuffer, RECORDING_LENGTH, 1);

After that, we call the run method to run the recognition inference on our model, then fetch the output scores for each of the 10 speech commands and the "unknown" and "silence" outputs:

String[] outputScoresNames = new String[] {OUTPUT_NODE_NAME};
mInferenceInterface.run(outputScoresNames);

float[] outputScores = new float[mLabels.size()];
mInferenceInterface.fetch(OUTPUT_NODE_NAME, outputScores);
  12. The outputScores array matches the mLabels list, so we can easily find the top score and get its command name:
float max = outputScores[0];
int idx = 0;
for (int i = 1; i < outputScores.length; i++) {
    if (outputScores[i] > max) {
        max = outputScores[i];
        idx = i;
    }
}
final String result = mLabels.get(idx);

Finally, we show the result in the TextView and change the button title back to "Start" so the user can tap it again to record and recognize more speech commands:

runOnUiThread(new Runnable() {
    @Override
    public void run() {
        mButton.setText("Start");
        mTextView.setText(result);
    }
});