This project is created by Silicon Labs’ summer intern Cole Morgan.
Background and motivation:
This project is a program that implements voice recognition for the GG11 using the starter kit’s onboard LEDs. My motivation to work on this project was mainly that I have never done anything remotely close to voice recognition before, and I thought it would be a good challenge. But another motivation was also that I am very interested in the Amazon Echo and the other emerging home assistant technologies.
The program works by first learning your voice through a small training protocol where the user says each keyword a couple times when prompted. After the program has learned the user’s voice, the user can turn the LED on, red, blue, green, or off simply by saying “on”, “blue”, “red”, “green”, or “off”.
My first step was getting audio input from the microphone into the microcontroller and storing it. This proved a little more difficult than I expected because I hadn’t worked with SPI or I2S before. In addition to this, I also had to design a sound detection system that captures as much significant sound as possible. I did this by squaring and summing the elements of the state buffer of the bandpass FIR filter that I apply on each sample and then setting a threshold for the result of that operation. This system turned out to be extremely useful because, in addition to saving processor time, it also time-aligned the data to be processed.
After this step, I began to implement the actual voice recognition. At first, I thought I could just find a library online and implement easily, but this turned out to be far from true. Most voice recognition libraries are much too big for a microcontroller, even one with a very large Flash memory of 2MB like the GG11. There was one library I found that was written for Arduino, but it didn’t work very well. So, I began the process of writing my own voice recognition algorithm.
After a lot of research, I decided I would use Mel’s Frequency Cepstral Coefficients (MFCCs) as the basis for my algorithm. There are a number of other audio feature coefficients, but MFCCs seemed to be the most effective. The calculation of MFCCs is basically several signal processing techniques applied in a specific order, so I used the CMSIS ARM DSP library for those functions.
After beginning work on this, I created a voice training algorithm to allow the program to learn any voice and adapt to any user. The training program has the user say each word a configurable number of times, and then calculates the MFCCs of that person’s pronunciation of the keyword and stores them in flash memory.
Next, because the input data was time-aligned, I could simply put all the MFCCs for the 4 buffers in one array and use that as the basis for comparison. In addition to this, I also calculated and stored the first derivative (delta coefficients) of the MFCC data to increase accuracy.
Lessons Learned and Next Steps:
My next steps are to apply the voice recognition to a temperature / humidity controller application, which should be easier than this LED application as the keywords are very different from each other unlike “on” and “off”.
Whoa. This is seriously awesome, and not something I expected our little MCUs were capable of. And you created your own lightweight voice recognition algorithm? Amazing!
Looks interesting. From readme.txt
"For a more detailed description of the voice recognition algorithm,
please see the "VoiceRecognitionGuide" text file."
But I can not find this file.