Speech recognition

In this section, we will develop a speech recognition example in Python. We will make use of the requests module (discussed in the previous chapter) to transcribe audio using wit.ai (https://wit.ai/).

There are several speech recognition tools, including Google's Speech API, IBM Watson, and Microsoft Bing's speech recognition API. We use wit.ai here as an example.

Speech recognition can be useful in applications where we would like the Raspberry Pi Zero to respond to voice commands. For example, in Chapter 10, Home Automation Using the Raspberry Pi Zero, we will be working on a home automation project, where speech recognition could be used to respond to voice commands.

Let's review building the speech recognition application in Python using wit.ai (its documentation is available at https://github.com/wit-ai/pywit). In order to perform speech recognition and recognize voice commands, we will need a microphone. However, for this demonstration, we will use a readily available audio sample instead. We will make use of audio samples made available by a research publication (available at http://ecs.utdallas.edu/loizou/speech/noizeus/clean.zip).
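Before uploading a sample, it can be worth confirming its format, since the request will need to declare it. The following is a minimal sketch using the wave module from the Python standard library; the helper name describe_wav is an assumption made for illustration:

```python
import wave


def describe_wav(path):
    """Return (channels, sample_rate_hz, duration_seconds) for a WAV file."""
    with wave.open(path, "rb") as wav_file:
        channels = wav_file.getnchannels()
        sample_rate = wav_file.getframerate()
        frames = wav_file.getnframes()
    # Duration is the number of frames divided by frames per second
    return channels, sample_rate, frames / sample_rate
```

Calling describe_wav('sp02.wav') on one of the unpacked samples should report its channel count, its sampling frequency in Hz, and its length in seconds.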

The wit.ai API license states that the tool is free to use, but the audio uploaded to their servers is used to tune their speech transcription tool.

We will now attempt to transcribe the sp02.wav audio sample by performing the following steps:

The first step is signing up for an account with wit.ai. Make a note of the API key as shown in the following screenshot:

The second step is installing the requests library. It can be installed as follows:

       pip3 install requests

According to the wit.ai documentation, we need to add custom headers to our request that include the API key (replace $TOKEN with the token from your account). We also need to specify the file format in the header. In this case, it is a .wav file with a sampling frequency of 8000 Hz:

       import requests 

       if __name__ == "__main__": 
         url = 'https://api.wit.ai/speech?v=20161002' 
         headers = {"Authorization": "Bearer $TOKEN", 
                    "Content-Type": "audio/wav"}

In order to transcribe the audio sample, we need to attach the audio sample in the request body:

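         # Send the audio file as the raw request body; the with
         # statement closes the file once the request completes
         with open('sp02.wav', 'rb') as audio_file:
           response = requests.post(url, headers=headers, data=audio_file)
         print(response.status_code)
         print(response.text)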
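The headers above embed the token directly in the script, which makes it easy to leak by accident when the file is shared. As a minimal sketch, the headers could instead be built from an environment variable. The variable name WIT_AI_TOKEN and the helper build_headers are assumptions made for illustration; they are not part of the wit.ai API:

```python
import os


def build_headers(token=None, content_type="audio/wav"):
    """Return request headers for the wit.ai speech endpoint.

    WIT_AI_TOKEN is a hypothetical environment variable name
    chosen for this sketch.
    """
    if token is None:
        token = os.environ.get("WIT_AI_TOKEN")
    if not token:
        raise ValueError("No wit.ai token supplied")
    return {"Authorization": "Bearer " + token,
            "Content-Type": content_type}


if __name__ == "__main__":
    # "my-token" is a placeholder, not a real credential
    print(build_headers("my-token"))
```

The returned dictionary can be passed to requests.post through the headers argument in the same way as the hand-written one.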

Putting it all together gives us this:

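       #!/usr/bin/python3

       import requests

       if __name__ == "__main__":
         url = 'https://api.wit.ai/speech?v=20161002'
         # Replace $TOKEN with the token from your wit.ai account
         headers = {"Authorization": "Bearer $TOKEN",
                    "Content-Type": "audio/wav"}
         # Send the audio file as the raw request body; the with
         # statement closes the file once the request completes
         with open('sp02.wav', 'rb') as audio_file:
           response = requests.post(url, headers=headers, data=audio_file)
         print(response.status_code)
         print(response.text)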

The preceding code sample is available for download along with this chapter as wit_ai.py. Try executing it; it should transcribe the audio sample, sp02.wav. We get the following output:

200
{
  "msg_id" : "fae9cc3a-f7ed-4831-87ba-6a08e95f515b",
  "_text" : "he knew the the great young actress",
  "outcomes" : [ {
    "_text" : "he knew the the great young actress",
    "confidence" : 0.678,
    "intent" : "DataQuery",
    "entities" : {
      "value" : [ {
        "confidence" : 0.7145905790744499,
        "type" : "value",
        "value" : "he",
        "suggested" : true
      }, {
        "confidence" : 0.5699616515542044,
        "type" : "value",
        "value" : "the",
        "suggested" : true
      }, {
        "confidence" : 0.5981701138805214,
        "type" : "value",
        "value" : "great",
        "suggested" : true
      }, {
        "confidence" : 0.8999612482250062,
        "type" : "value",
        "value" : "actress",
        "suggested" : true
      } ]
    }
  } ],
  "WARNING" : "DEPRECATED"
}
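In practice, we usually want only the transcribed text rather than the whole response. The following is a minimal sketch of pulling the transcript and its confidence out of a response shaped like the one above. The helper name extract_transcript is hypothetical, and because the response itself carries a deprecation warning, the field names may differ in later API versions:

```python
import json

# Abbreviated copy of the response shown above
SAMPLE_RESPONSE = """
{
  "msg_id" : "fae9cc3a-f7ed-4831-87ba-6a08e95f515b",
  "_text" : "he knew the the great young actress",
  "outcomes" : [ {
    "_text" : "he knew the the great young actress",
    "confidence" : 0.678,
    "intent" : "DataQuery"
  } ]
}
"""


def extract_transcript(response_text):
    """Return (transcript, confidence) from a response body like the sample."""
    result = json.loads(response_text)
    transcript = result.get("_text", "")
    outcomes = result.get("outcomes") or []
    # Confidence is taken from the first outcome, if there is one
    confidence = outcomes[0].get("confidence") if outcomes else None
    return transcript, confidence


if __name__ == "__main__":
    print(extract_transcript(SAMPLE_RESPONSE))
```

In the script, response.text could be passed to the same function once response.status_code has been checked.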

The audio sample contains the following recording: He knew the skill of the great young actress. According to the wit.ai API, the transcription is He knew the the great young actress. Two of the nine words (skill and of) are missing from the transcription, so the word error rate is approximately 22% (https://en.wikipedia.org/wiki/Word_error_rate).
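The word error rate can be reproduced with a short calculation: it is the word-level edit distance (substitutions, deletions, and insertions) between the reference and the transcription, divided by the number of words in the reference. The following is a minimal sketch; the function name word_error_rate is chosen for illustration:

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("Reference must contain at least one word")
    # previous[j] is the edit distance between the reference words
    # processed so far and the first j hypothesis words
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1      # reference word missing
            insertion = current[j - 1] + 1  # extra hypothesis word
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


if __name__ == "__main__":
    wer = word_error_rate("He knew the skill of the great young actress",
                          "he knew the the great young actress")
    print("{:.1%}".format(wer))  # prints 22.2%
```

Here the edit distance is 2 (two deleted words) and the reference has 9 words, which gives 2/9.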

We will be making use of the speech transcription API to issue voice commands in our home automation project.