This article describes the steps required to build a wake word detector.
- Background
- Related Work
- How to train a model from audio files
- Datasets
- How to learn from audio files
- Libraries
- Preparing labeled dataset
- Word Alignment
- Retrieve timestamps of words
- Fixing data imbalance
- Adding noise data
- Dataloader
- Transformations
- Model
- Training
- Evaluation
- Inference
- Deploying model in the app
- Demo
- Conclusion
- References
Background
Personal assistant devices like Google Home, Alexa, and Apple HomePod constantly listen for a specific set of wake words like “Ok, Google”, “Alexa”, or “Hey Siri”; once this sequence of words is detected, the device prompts the user for subsequent commands and responds to them appropriately. Here we will create a custom wake word detector that takes audio as input and prompts the user once the configured sequence of words is detected. The goal is to provide a configurable custom detector so anyone can use it in their application to perform operations once the configured wake words are detected.
Related Work
Most of the work discussed here is heavily influenced by the two papers below.
- Howl: A Deployed, Open-Source Wake Word Detection System
- Honkling: In-Browser Personalization for Ubiquitous Keyword Spotting
I would highly recommend going through the above papers.
How to train a model from audio files
Below are the steps we will follow to train a model using audio files:
Step 1: Get an audio dataset with transcripts
Step 2: Decide on what wake words to use (mostly 2 to 3 words)
Step 3: Search for these words in the dataset transcripts and prepare a positive dataset (has a wake word) and a negative dataset (does not have a wake word)
Step 4: Extract audio features by converting them to Mel spectrograms (pictorial representation of audio)
Step 5: Using a CNN, train on the above data.
Step 6: Save and test the model
Step 7: Make live inference on the above model.
We will go through each step in detail.
Datasets
First, you need a dataset with transcripts; check out the Mozilla Common Voice dataset. As of this writing, the dataset size is 73 GB. Download and extract it, and you will see something like the below.
The clips folder has all the mp3 files, and the tsv files have the transcripts of these mp3 files. If you open train.tsv using pandas:
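A minimal example of loading it (the folder path below is an assumption about how the extracted archive is laid out):

```python
import pandas as pd

# Path is an assumption; point it at the extracted Common Voice folder.
train_df = pd.read_csv("cv-corpus/en/train.tsv", sep="\t")
print(train_df[["path", "sentence"]].head())
```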

Now, if you play the audio file common_voice_en_19731809.mp3 in the clips folder, it says “Hannu wrote of his expedition in stone.” For this use case, we only need the path and sentence columns.
There are other datasets you can explore as well.
For a noise dataset - Microsoft SNSD (Reddy et al., 2019)
How to learn from audio files
Before going further, we need to know a few concepts. I would strongly recommend going through the articles below to learn how to understand and extract features from audio files.
- Learning from Audio: Wave Forms - Towards Data Science
- Learning from Audio: Fourier Transformations - by mlearnere - Towards ….
- Understanding the Mel Spectrogram - by Leland Roberts - Medium.
Sample rate
The sounds of our voices are just a sum of many sine and cosine signals. Waves are repeated signals that oscillate and vary in amplitude, depending on their complexity. In the real world, waves are continuous and mechanical, quite different from computers, which are discrete and digital. So, how do we translate something continuous and mechanical into something discrete and digital? This is where the sample rate comes in [1]. The sample rate is the number of points per second used to trace the signal.
Say, for example, the sample rate of the recorded audio is 100; this means that for every recorded second of audio, the computer will place 100 points along the signal to best “trace” the continuous curve. Once all the points are in place, a smooth curve joins them all together so humans can visualize the sound. Since the recorded audio is in terms of amplitude and time, we can intuitively say that the waveform operates in the time domain [1].
Simply put, what resolution is to photos, the sample rate is to audio. The higher the sample rate, the higher the quality of the audio.
Usually, 16 kHz is enough to capture audio features. High-definition audio files or real-time audio streams come at 44.1 kHz or 48 kHz, so we need to downsample to 16 kHz.
For example, if you use librosa to load an mp3 audio file with a sample rate of 16000, you will get 16000 data points per second; so for one second of audio you will get a numpy array of size 16000 (filled with floating-point numbers).
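A minimal example (the clip name comes from the dataset section above):

```python
import librosa

# sr=16000 asks librosa to resample the clip to 16 kHz on load.
samples, sr = librosa.load("clips/common_voice_en_19731809.mp3", sr=16000)
print(sr)             # 16000
print(samples.shape)  # (16000 * clip duration in seconds,) of float values
```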
Fourier Transform
Fourier transformation translates the audio from the time domain to the frequency domain.

Time domain
The time domain looks at the signal’s amplitude variation over time, which helps understand its physical shape. To plot this, we need time on the x-axis and amplitude on the y-axis. The shape gives us a good idea of how loud or quiet the sound will be [2].

Frequency domain
The frequency domain observes the constituent signals our recording is composed of. By doing this, we can find a sort of “fingerprint” of the sound. To plot this, we need frequency on the x-axis and magnitude on the y-axis. The larger the magnitude, the more important that frequency is. The magnitude is simply the absolute value of our results from the FFT [2].

Below is how you can calculate the FFT for one windowed segment.
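A sketch with numpy, assuming a 512-sample Hann window (the window length is my assumption):

```python
import numpy as np

n_fft = 512                                    # window length (an assumption)
segment = samples[:n_fft] * np.hanning(n_fft)  # one Hann-windowed segment
spectrum = np.fft.rfft(segment)                # complex spectrum of that segment
magnitude = np.abs(spectrum)                   # magnitude per frequency bin
freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)     # frequency (Hz) of each bin
```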

The FFT is computed on overlapping windowed segments of the signal, and we get what is called the spectrogram. You can think of a spectrogram as a bunch of FFTs stacked on top of each other. It is a way to visually represent a signal’s loudness, or amplitude, as it varies over time at different frequencies [3].
So, to get a spectrogram:
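A minimal sketch using librosa (n_fft and hop_length match the values used later in this article):

```python
# Short-time Fourier transform: FFTs over overlapping windows, stacked into a spectrogram.
stft = librosa.stft(samples, n_fft=512, hop_length=200)
spectrogram = np.abs(stft) ** 2   # power spectrogram, shape (257, num_frames)
```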

Mel-Spectrograms
First, the spectrogram needs to be computed; for that, we need to compute the FFT (Fourier transform). The Fourier transform is a mathematical formula that allows us to decompose a signal into its individual frequencies and each frequency’s amplitude. In other words, it converts the signal from the time domain into the frequency domain. The result is called a spectrum [3].
The Mel scale is a logarithmic transformation of a signal’s frequency. The linear audio spectrogram is ideally suited for applications where all frequencies have equal importance, while Mel spectrograms are better suited for applications that need to model human hearing perception. Mel spectrogram data is also well suited for audio classification applications. The y-axis is converted to a log scale, and the color dimension is converted to decibels [3].
So, to calculate Mel spectrograms:
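A minimal sketch with librosa (parameter values match the torchaudio transform used later):

```python
# Mel spectrogram, then convert power to decibels (log scale).
mel = librosa.feature.melspectrogram(y=samples, sr=sr, n_fft=512, hop_length=200, n_mels=40)
log_mel = librosa.power_to_db(mel)   # shape (40, num_frames)
```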

So, to summarize: we need to calculate the Mel spectrograms of all audio files, giving us a pictorial representation of each audio file, which we can feed to a CNN to learn the patterns in the audio.
Libraries
Below are the libraries and frameworks we will be using
- Deep Learning Framework - PyTorch
- Audio preprocessing - Librosa, Torchaudio
- Server side Inference - Pyaudio, Websockets, Flask
- Client side Inference - ONNXjs, TensorflowJS, tflite
Preparing labeled dataset
We use the Mozilla Common Voice dataset:
- Go through each wake word and check the transcripts for a match
- If found, then it will be in the positive dataset
- If not found, then it will be in the negative dataset
- Load the corresponding mp3 files and trim the silent parts
- Save the audio as a .wav file and the transcript as a .lab file

For example, here I am using “Hey Fourth Brain”
Load train, dev and test data
Use the regex pattern below to find transcripts that contain any of the wake words.
You will get a large negative dataset, so sample it down to 1%.
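A minimal sketch of this split, assuming train_df is the dataframe loaded from train.tsv and “hey fourth brain” as the wake words:

```python
import re

wake_words = ["hey", "fourth", "brain"]
# Whole-word, case-insensitive match for any of the wake words.
pattern = re.compile(r"\b(" + "|".join(wake_words) + r")\b", re.IGNORECASE)

has_wake_word = train_df["sentence"].str.contains(pattern, na=False)
positive_df = train_df[has_wake_word]
# The negative set is huge, so keep only a 1% sample of it.
negative_df = train_df[~has_wake_word].sample(frac=0.01, random_state=42)
```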
Save them as .wav and .lab files; this format is required for the word alignment step.
Word Alignment
We need to know where the wake word occurs in the audio.
For example, if the utterance in the audio file positive/audio/common_voice_en_20812859.wav is “The fourth candidate is awarded a two-year term.”, at what timestamps in the audio file is the word fourth uttered?
To find these timestamps, we use forced-alignment software like the Montreal Forced Aligner (MFA).

For the positive dataset, we use the Montreal Forced Aligner to get the timestamps of each word in the audio.
Download the stable version
Download the Librispeech Lexicon dictionary
Known issues in MFA
Creating aligned data
Note that we only need to run alignment on the positive dataset. The above should create a bunch of TextGrid files; if you open one, it should look like the below.

Retrieve timestamps of words
Now you can read those TextGrid files to get the word timestamps and durations.
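A minimal sketch assuming the textgrid Python package and a hypothetical output path for the aligned files:

```python
import textgrid  # pip install textgrid

# Hypothetical path to one of the TextGrid files produced by MFA.
tg = textgrid.TextGrid.fromFile("aligned/common_voice_en_20812859.TextGrid")
for interval in tg.getFirst("words"):   # the word tier produced by the aligner
    if interval.mark:                   # skip empty (silence) intervals
        start, end = interval.minTime, interval.maxTime
        print(interval.mark, start, end, end - start)
```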
Apply the above methods to the positive data.
The positive train data would look like the below.

Do the same for the negative dataset; however, since we have not run MFA on the negative dataset, the timestamp fields will be empty.
The negative train data would look like the below.

Fixing data imbalance
You can check how each wake word is spread across the dataset you created above.
The above should give a result like the below. The spread is not equal across the dataset.
To fix this, you can generate additional data using Google TTS.
Using the generate_voices method, generate samples for each wake word by varying the speaking rate and speaking pitch; this way we can generate 7K samples for each wake word.
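generate_voices is the article’s helper; a minimal sketch of what such a helper might do with the Google Cloud Text-to-Speech client (all parameter choices and file names below are assumptions):

```python
from google.cloud import texttospeech  # pip install google-cloud-texttospeech

client = texttospeech.TextToSpeechClient()

def generate_voice(text, speaking_rate, pitch, out_path):
    # Synthesize one wake-word utterance with the given rate/pitch variation.
    response = client.synthesize_speech(
        input=texttospeech.SynthesisInput(text=text),
        voice=texttospeech.VoiceSelectionParams(language_code="en-US"),
        audio_config=texttospeech.AudioConfig(
            audio_encoding=texttospeech.AudioEncoding.LINEAR16,
            sample_rate_hertz=16000,
            speaking_rate=speaking_rate,  # valid range 0.25 - 4.0
            pitch=pitch,                  # valid range -20.0 - 20.0 semitones
        ),
    )
    with open(out_path, "wb") as f:
        f.write(response.audio_content)

# Example: sweep rate and pitch for one wake word.
for i, rate in enumerate([0.75, 1.0, 1.25]):
    for j, pitch in enumerate([-4.0, 0.0, 4.0]):
        generate_voice("fourth", rate, pitch, f"generated/fourth_{i}_{j}.wav")
```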
Now we can create a csv file with path and sentence columns.
Split generated data into train, dev and test
Since we already know what the generated audio files contain, we don’t need to run MFA on them, so we can add some dummy values.
Now combine generated data with actual data and check the distribution.
That should look something like the below.
Adding noise data
For wake word detection to work correctly, we need to add noise; in the real world, there will be a lot of background noise fed to the model, so the model should be aware of that noise. Without noise, the model will overfit.
You can get a noise dataset from the Microsoft Scalable Noisy Speech Dataset. We are interested in two folders, noise_train and noise_test.
Load noise dataset.
You can listen to one noise sample
Now let’s take a clean sample.
For example, if the clean_sample size is 12560, then you can take 12560 data points from the noise sample.
Now, if you mix the clean and noise samples together like below:
Here the noise sample is added at 100%, but you can bring the noise down like below:
In the above, we mixed 80% of the clean sample with 20% of the noise sample.
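A minimal sketch of that mixing, assuming clean_sample and noise_sample are 1-D numpy arrays at the same sample rate:

```python
import numpy as np

noise_clip = noise_sample[: len(clean_sample)]   # take as many noise points as the clean sample has
mixed_full = clean_sample + noise_clip           # noise added at 100%
mixed = 0.8 * clean_sample + 0.2 * noise_clip    # 80% clean, 20% noise
```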
Dataloader
We now have all the data required to train the model. Next, we need to create a data loader that can be used during training.
We will use a window size of 750 ms and a sample rate of 16 kHz, so the max length would be int(window_size_ms / 1000 * sr), which comes to 12000.
- Load the audio file
- Compute labels
- if it is a generated file, then we already know the label
- if it is from MCV, then we have timestamps to know where precisely to trim to get the word
- If the audio length > max length (12000), then trim randomly either at the start or end of the audio
- If audio length < max length (12000), then pad zeros at the start or end of the audio
- Randomly add noise ranging from 10% to 50%.
The full code will be like the below.
Now we can use the above methods to create the dataloader.
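A condensed sketch of such a dataset and loader; the column names (path, label, start_ms, end_ms), the noise_samples list, and the batch size are assumptions, not the article’s exact schema (see the notebook for the full implementation):

```python
import random
import librosa
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

class WakeWordDataset(Dataset):
    def __init__(self, df, noise_samples, sr=16000, max_length=12000):
        self.df = df.reset_index(drop=True)
        self.noise_samples = noise_samples  # list of 1-D numpy noise arrays
        self.sr = sr
        self.max_length = max_length

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        row = self.df.iloc[idx]
        audio, _ = librosa.load(row["path"], sr=self.sr)

        # MCV clips: trim to the aligned word using its MFA timestamps.
        if row["end_ms"] > 0:
            start = int(row["start_ms"] / 1000 * self.sr)
            end = int(row["end_ms"] / 1000 * self.sr)
            audio = audio[start:end]

        # Trim randomly at the start or end if too long, pad with zeros if too short.
        if len(audio) > self.max_length:
            audio = audio[: self.max_length] if random.random() < 0.5 else audio[-self.max_length:]
        elif len(audio) < self.max_length:
            pad = np.zeros(self.max_length - len(audio), dtype=audio.dtype)
            audio = np.concatenate([pad, audio] if random.random() < 0.5 else [audio, pad])

        # Mix in 10-50% noise.
        noise = random.choice(self.noise_samples)[: self.max_length]
        noise = np.pad(noise, (0, self.max_length - len(noise)))
        ratio = random.uniform(0.1, 0.5)
        audio = (1 - ratio) * audio + ratio * noise

        return torch.from_numpy(audio).float(), int(row["label"])

train_loader = DataLoader(WakeWordDataset(train_df, noise_samples), batch_size=16, shuffle=True)
```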
Transformations
MelSpectrogram
As mentioned above, the input to our model will be log Mel spectrograms. We already have the raw audio data in the dataloaders; we need to transform it to log Mel spectrograms. PyTorch audio already provides torchaudio.transforms.MelSpectrogram for this transformation.
With sample rate = 16000, FFT size = 512, hop length = 200, and 40 Mel bands: for the 750 ms sample, we know we get 12000 data points; we said we want 40 Mel bands, each with length (12000 / 200 + 1) = 61, so we end up with a 40 x 61 matrix. When you pass the audio to the above transform, you will get a tensor of shape (1, 40, 61).
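A quick shape check with those parameters (the small epsilon inside the log is my addition to avoid log(0)):

```python
import torch
import torchaudio

mel_transform = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000, n_fft=512, hop_length=200, n_mels=40
)
clip = torch.randn(1, 12000)      # stands in for one 750 ms clip at 16 kHz
mel = mel_transform(clip)         # torch.Size([1, 40, 61])
log_mel = torch.log(mel + 1e-7)   # log Mel spectrogram
```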
Now we need a CNN model which can take this tensor and train on it.
Zero Mean Unit Variance
We can further normalize the above values using ZMUV.
We take one batch of the train dataset and calculate the mean and standard deviation.
Below are the mean and standard deviation.
So, after calculating the Mel spectrogram, we subtract the above mean from each value and divide by the standard deviation to normalize.
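A minimal sketch of such a ZMUV transform, reusing train_loader and mel_transform from the earlier sketches:

```python
import torch

class ZmuvTransform(torch.nn.Module):
    # Zero-mean unit-variance normalization using the batch statistics computed above.
    def __init__(self, mean, std):
        super().__init__()
        self.register_buffer("mean", torch.tensor(mean))
        self.register_buffer("std", torch.tensor(std))

    def forward(self, x):
        return (x - self.mean) / self.std

# Statistics from one training batch.
batch_audio, _ = next(iter(train_loader))
batch_mels = torch.log(mel_transform(batch_audio.unsqueeze(1)) + 1e-7)
zmuv_transform = ZmuvTransform(batch_mels.mean().item(), batch_mels.std().item())
```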
Model
We will be using 2 conv layers and 2 linear layers.
With the below parameters:
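The exact parameter listing lives in the notebook; below is a minimal sketch of a 2-conv + 2-linear model sized for the 1 x 40 x 61 input. All channel and hidden sizes are assumptions, and num_labels assumes the three wake words plus one negative class:

```python
import torch
import torch.nn as nn

class WakeWordModel(nn.Module):
    def __init__(self, num_labels=4):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        # A (1, 40, 61) input becomes (64, 10, 15) after two 2x2 poolings
        # (a permuted (1, 61, 40) input flattens to the same size).
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 10 * 15, 128), nn.ReLU(),
            nn.Linear(128, num_labels),
        )

    def forward(self, x):
        return self.fc(self.conv(x))

model = WakeWordModel()
```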
So our model summary for a 1x40x61 input would be like below - summary(model, input_size=(1, 40, 61))

Training
Below are the hyperparameters that are used.
During training (a sketch of one training step follows this list):
- First, we pass the audio data to get the Mel spectrogram and
- then pass through zmuv_transform to get normalized values and
- we change the dimension from (batch, channels, n_mels, time) to (batch, channels, time, n_mels)
- We pass through 2 Conv layers
- Next, flatten and pass through 2 linear layers.
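A sketch of one training step following that sequence; the loss, optimizer, and learning rate are assumptions, and train_loader, mel_transform, zmuv_transform, and model refer to the earlier sketches:

```python
import torch

criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

for audio, labels in train_loader:
    mels = torch.log(mel_transform(audio.unsqueeze(1)) + 1e-7)  # (batch, 1, n_mels, time)
    mels = zmuv_transform(mels)                                 # ZMUV normalization
    mels = mels.permute(0, 1, 3, 2)                             # (batch, 1, time, n_mels)
    logits = model(mels)                                        # 2 conv + 2 linear layers
    loss = criterion(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```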


Finally save the model - torch.save(model.state_dict(), 'model.pt')
Evaluation
Now do the evaluation using the test dataset


Inference
Now you can use PyAudio to stream audio and make inferences on the live stream; every 750 ms, collect samples and feed them to the model to make the inference.
Below is how the wake word sequence will be detected.
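A minimal sketch of that loop; the mapping of label indices to wake words and the use of float32 capture are assumptions:

```python
import numpy as np
import pyaudio
import torch

CHUNK = 12000  # 750 ms at 16 kHz
wake_sequence = ["hey", "fourth", "brain"]

audio_interface = pyaudio.PyAudio()
stream = audio_interface.open(format=pyaudio.paFloat32, channels=1,
                              rate=16000, input=True, frames_per_buffer=CHUNK)

detected = []
model.eval()
while True:
    chunk = np.frombuffer(stream.read(CHUNK), dtype=np.float32).copy()
    clip = torch.from_numpy(chunk).unsqueeze(0).unsqueeze(0)   # (1, 1, 12000)
    mels = zmuv_transform(torch.log(mel_transform(clip) + 1e-7)).permute(0, 1, 3, 2)
    with torch.no_grad():
        pred = model(mels).argmax(dim=-1).item()
    if pred < len(wake_sequence):               # labels 0-2 are the wake words (an assumption)
        detected.append(wake_sequence[pred])
        if detected[-len(wake_sequence):] == wake_sequence:
            print("Wake word sequence detected!")
            detected.clear()
```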
Deploying model in the app
Usually, if you are deploying on an IoT device or Raspberry Pi that has a microphone and speaker, PyAudio should work. But if you want to deploy in an application accessed through the browser, there are a couple of ways to do it.
Using web sockets

- Use Flask-SocketIO at the server level to capture the audio buffer from the client (a minimal server sketch follows this list).
- At the client, use socket.io to send the audio buffer through the socket connection.
- Capture the audio buffer using getUserMedia, convert it to an array buffer, and stream it to the server.
- Inference happens at the server, after n batches of the 750 ms window.
- If the sequence is detected, send a detection prompt to the client.
- Server Code - application.py
- Client Code - main.js
- To run this locally

- Use Dockerfile & Dockerrun.aws.json to containerize the app and deploy to AWS Elastic BeanStalk
- Elastic Beanstalk initialize app
- Create Elastic Beanstalk instance
- A disadvantage of the above method might be privacy, since we are sending the audio buffer to the server for inference
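A minimal sketch of the server-side socket handler; event names and payload format are assumptions, and model, mel_transform, zmuv_transform, and wake_sequence come from the earlier sketches (see application.py for the real implementation):

```python
import numpy as np
import torch
from flask import Flask
from flask_socketio import SocketIO, emit

app = Flask(__name__)
socketio = SocketIO(app, cors_allowed_origins="*")

@socketio.on("audio_buffer")
def handle_audio(buffer):
    # The client sends raw float32 PCM for one 750 ms window.
    chunk = np.frombuffer(buffer, dtype=np.float32).copy()
    clip = torch.from_numpy(chunk).unsqueeze(0).unsqueeze(0)
    mels = zmuv_transform(torch.log(mel_transform(clip) + 1e-7)).permute(0, 1, 3, 2)
    with torch.no_grad():
        pred = model(mels).argmax(dim=-1).item()
    if pred < len(wake_sequence):
        emit("wake_word_detected", {"word": wake_sequence[pred]})

if __name__ == "__main__":
    socketio.run(app, host="0.0.0.0", port=5000)
```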
Using ONNX
- Use PyTorch’s ONNX export to convert the PyTorch model to an ONNX model (a minimal export sketch appears after this list)
- Pytorch to onnx convert code - convert_to_onnx.py
- Once converted, onnx model can be used at client side to do inference
- Client side, used onnx.js to do inference at client level
- Capture audio buffer at client using getUserMedia, convert to array buffer
- Used fft.js to compute Fourier Transform
- Used methods from Magenta.js audio utils to compute audio transformations like Mel spectrograms

- Below is the comparison of client-side vs server-side audio transformations

- Client side code - main.js
- To run locally
- To deploy to AWS Elastic Beanstalk, first initialize app
- Create Elastic Beanstalk instance
- Refer to standalone_onnx for the client version without Flask; you can deploy it on any static server. You can also deploy it to IPFS
- The recent version will show plots and the audio buffer for each wake word the model inferred on; click on a wake word button to see what buffer was inferred for that word.
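For the export step at the top of this list, a minimal sketch of the PyTorch-to-ONNX conversion; the dummy input shape, tensor names, and opset are assumptions (see convert_to_onnx.py for the real script):

```python
import torch

model.eval()
dummy_input = torch.randn(1, 1, 40, 61)   # one log Mel spectrogram
torch.onnx.export(
    model,
    dummy_input,
    "wake_word.onnx",
    input_names=["mel_spectrogram"],
    output_names=["logits"],
    opset_version=11,
)
```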
Using tensorflowjs
- Use onnx-tensorflow to convert the ONNX model to a TensorFlow model (a minimal conversion sketch follows this list)
- onnx to tensorflow convert code - convert_onnx_to_tf.py
- Verify the model using the below command
Output
- Refer onnx_to_tf for generated files
- Test converted model using test_tf.py
- Used tensorflowjs[wizard] to convert the savedModel to a web model
- Once the above step is done, copy the files to the web application
Example -
- Client-side used tfjs to load model and do inference
- Loading the TensorFlow model
- Do inference using above model
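For the conversion step at the top of this list, a minimal sketch using onnx-tensorflow; the ONNX file name is carried over from the export sketch, and the SavedModel directory name follows the hey_fourth_brain name used later (see convert_onnx_to_tf.py for the real script):

```python
import onnx
from onnx_tf.backend import prepare  # pip install onnx-tf

onnx_model = onnx.load("wake_word.onnx")
tf_rep = prepare(onnx_model)              # wrap the ONNX graph as a TensorFlow representation
tf_rep.export_graph("hey_fourth_brain")   # writes a TensorFlow SavedModel directory
```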
Using tflite
- Once the tensorflow model is created, it can be converted to tflite, using the below code
- Note: tf.lite.TFLiteConverter.from_saved_model("hey_fourth_brain") did not work, as it was throwing conv.cc:349 input->dims->data[3] != filter->dims->data[3] (0 != 1) on inference, so the above method was used.
- Copy the tflite model to the web application
- Used tflite js to load model and do inference
- Loading tflite model
Demo
- For a live demo
- ONNX version - https://wake-onnx.netlify.app
- Tensorflow js version - https://wake-tf.netlify.app/
- Tensorflow lite js version - https://wake-tflite.netlify.app/
- Allow microphone to capture audio
- The model is trained on hey fourth brain
- Once those words are detected in sequence, for each detected wake word, a play button to listen to the sound that was used to detect that word, along with the Mel spectrograms used, will be listed.
Conclusion
In this article, we have gone through how to extract audio features, train a model, and detect wake words, with an end-to-end example and source code. Go through the wake_word_detection.ipynb Jupyter notebook for a complete walk-through of what is discussed in this article. I hope this helps.
– RC
References
1. Learning from Audio: Wave Forms - Towards Data Science. https://towardsdatascience.com/learning-from-audio-wave-forms-46fc6f87e016
2. Learning from Audio: Fourier Transformations - by mlearnere - Towards Data Science. https://towardsdatascience.com/learning-from-audio-fourier-transformations-f000124675ee
3. Understanding the Mel Spectrogram - by Leland Roberts - Medium. https://medium.com/analytics-vidhya/understanding-the-mel-spectrogram-fca2afa2ce53