Speech Emotion Recognition


A Speech Emotion Recognition model allows a program to correctly identify the emotional state of a speaker.

The purpose of this study is to examine two primary methods of modeling sound data for speech emotion recognition: a flattened feature transform consistent with telecommunications standards, and a convolutional neural network trained on stacked spectrogram arrays.

Using data from The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS), which includes 24 professional voice actors, 12 male and 12 female, speaking the same two lines, we can process the audio files and compare both approaches.

Exploratory Data Analysis:

The RAVDESS dataset is well defined and balanced by design. Each of the 24 actors records the same two lines four times over: two repetitions at each of two emotional intensities. For five of the eight emotions, we also have the same recordings in song form in addition to speech. This means that for five of the eight labelled emotions in the dataset we have 192 samples, while for the remaining three we have only 96. Using both together, however, assumes that the model will perform equally well on speech and song sample types when it comes to classifying emotions.

So let's dig into the differences here, look at the data under several common transformations, and determine whether we can comfortably include both speech and song emotion samples in our final dataset. For the purposes of the EDA segment, we'll look at only a few specific actors to get to know what our preprocessed data looks like and begin investigating trends.
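Before plotting anything, it helps to decode each file's labels directly from its name. Here's a minimal sketch, assuming the seven-field naming convention documented with RAVDESS (modality, vocal channel, emotion, intensity, statement, repetition, actor); `parse_ravdess_name` and `EMOTIONS` are illustrative names, not part of this project's scripts:

```python
# Decode RAVDESS labels from the dataset's hyphen-separated filename fields.
# Emotion codes follow the dataset's documented convention.
EMOTIONS = {1: "neutral", 2: "calm", 3: "happy", 4: "sad",
            5: "angry", 6: "fearful", 7: "disgust", 8: "surprised"}

def parse_ravdess_name(filename):
    """Return a dict of labels for one RAVDESS audio file."""
    parts = [int(p) for p in filename.split(".")[0].split("-")]
    modality, channel, emotion, intensity, statement, repetition, actor = parts
    return {
        "vocal_channel": "speech" if channel == 1 else "song",
        "emotion": EMOTIONS[emotion],
        "intensity": "normal" if intensity == 1 else "strong",
        "statement": statement,      # 1 or 2 (the two spoken lines)
        "repetition": repetition,    # 1 or 2
        "actor": actor,              # 1-24; odd = male, even = female
    }

# e.g. actor 10, statement 2, 'happy', sung at normal intensity
print(parse_ravdess_name("03-02-03-01-02-01-10.wav"))
```

With labels recoverable from the filenames alone, we can slice the dataset by actor, emotion, or channel without any external metadata.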

The librosa sound processing library allows for quite a few visualizations; the most interesting are the spectrograms and waveforms pictured above.

Given the nature of sound information as a time series, the bottom-most figure, labeled 'waveplot,' will be the most familiar to readers. The librosa library preprocesses a waveform by taking slices from it at a preset rate, generating a reproducible series of samples from which the waveform can be recreated. There is a limitation here: if the sample rate is too low, multiple possible waveforms can be reconstructed from the same sample slices. According to the Nyquist theorem, the sampling rate must be at least double the highest analog frequency component (measured in hertz, Hz), or the samples cannot be used to recreate the sound. Given that the range of human perception is between 20 and 20,000 Hz and the range of intelligible human speech is between 1,000 and 10,000 Hz, the librosa library's standard sampling rate of 22,050 Hz should be capable of accurately capturing all waveforms in human speech patterns, and we can accept it as the standard for our preprocessing.
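As a quick sanity check on the argument above, here's a small NumPy sketch: sample a 440 Hz tone at librosa's default 22,050 Hz and confirm the tone's frequency is recoverable from the sampled signal's spectrum. The tone and variable names are illustrative assumptions, not part of the project pipeline:

```python
import numpy as np

# Verify the Nyquist criterion in practice: a tone well below sr/2
# should be exactly recoverable from its samples.
sr = 22050                       # librosa's default sampling rate
f0 = 440.0                       # a tone inside the speech band
t = np.arange(sr) / sr           # one second of sample times
y = np.sin(2 * np.pi * f0 * t)   # the sampled waveform

spectrum = np.abs(np.fft.rfft(y))
freqs = np.fft.rfftfreq(len(y), d=1 / sr)
recovered = freqs[np.argmax(spectrum)]

print(recovered)  # 440.0 -- below sr/2 = 11,025 Hz, so no aliasing
```

A tone above 11,025 Hz would alias onto a lower frequency under the same sampling rate, which is exactly the ambiguity the Nyquist theorem warns about.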

Looking at the first emotion on the list, 'happy,' across the four iterations of both speech and song for actor number 10 (chosen at random) and statement number 2, 'Dogs are sitting by the door', we can start to see a pattern emerge in the waveforms.

1. Each word shows up as a distinct spike in the time intervals, whether it's in speech or song.

2. There is an immediately visible difference in the amplitude of the pattern for two of the recordings. This most likely reflects the 'low' and 'high' intensities.

After spending more time examining the other emotions, we find these observations hold true across the board.

If we refer back to the three main methods of looking at the sound file that we explored in the first section, we see that two of them are variations on spectrograms, so let's dig into those a bit more.

Spectrograms accomplish several specific things: they allow us to visualize the sound frequencies in the file, and they add an additional dimension, volume in decibels, through the inclusion of color. As with heatmaps and density plots, this gives us additional insight. In the interest of visibility, we'll look at the log-frequency power spectrograms, which show trends more readily.
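Under the hood, a power spectrogram is nothing more exotic than a short-time Fourier transform with magnitudes mapped to decibels. Here's a simplified NumPy sketch of the idea; in practice `librosa.stft` and `librosa.amplitude_to_db` handle this with more careful padding and windowing, and the sine input below is just a stand-in for a speech clip:

```python
import numpy as np

# A power spectrogram by hand: window the signal into overlapping
# frames, take an FFT of each, and convert power to decibels.
def log_power_spectrogram(y, n_fft=2048, hop=512):
    window = np.hanning(n_fft)
    frames = [y[i:i + n_fft] * window
              for i in range(0, len(y) - n_fft, hop)]
    stft = np.fft.rfft(np.array(frames), axis=1)   # (frames, freq bins)
    power = np.abs(stft) ** 2
    return 10 * np.log10(power + 1e-10)            # decibel scale

sr = 22050
y = np.sin(2 * np.pi * 440.0 * np.arange(sr) / sr)  # stand-in for speech
S = log_power_spectrogram(y)
print(S.shape)  # one row per time slice, one column per frequency bin
```

Plotting `S` with a color scale (e.g. `librosa.display.specshow` with `y_axis='log'`) gives exactly the figures discussed above: time on one axis, frequency on the other, loudness as color.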

Since one of the two primary methods used in modern deep learning for SER involves treating sound files as images and applying 2D convolution and pooling in a neural network, this is a promising transformation to apply.

Looking at the pattern similarity column by column, the most easily visible pattern is that the 'angry' column shows more peaks and ridges across the samples.


At present, the community of data scientists working on speech emotion recognition is split across two main camps. Some scientists began using feature-transformed sound, similar to the digitization techniques used by major telecommunications companies, paired with shallow neural networks. The most common transformation applied directly to sound data is the MFCC, or Mel-Frequency Cepstral Coefficients.

Per Wikipedia, a "mel-frequency cepstrum (MFC) is a representation of the short term power spectrum of a sound, based on a linear cosine transform of a log power spectrum on a nonlinear mel scale of frequency."

If that confuses you, it's okay. There are a lot of different pieces to unpack, and if you're interested in learning more about the process, I highly recommend checking out Jack Schaedler's visual explanation of the steps involved in digital signal processing.

It's a good read and well worth your time, although the entire article is a deep dive into just the first of the five steps needed to acquire the MFCCs:

1. take a Fourier transform of the sample

2. map the results onto the mel scale

3. log the powers at each mel frequency

4. take a discrete cosine transform of each item of the mel log powers

5. sample the amplitudes of the resulting spectrum.
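The five steps above can be sketched in plain NumPy for a single frame of audio. This is a simplified, illustrative implementation: the triangular filter-bank construction and frame handling here are assumptions for the sketch, and in practice `librosa.feature.mfcc` does all of this for you:

```python
import numpy as np

def hz_to_mel(f):
    return 2595 * np.log10(1 + f / 700)

def mel_to_hz(m):
    return 700 * (10 ** (m / 2595) - 1)

def mfcc_frame(frame, sr=22050, n_mels=26, n_mfcc=13):
    # 1. take a Fourier transform of the (windowed) sample
    spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame)))) ** 2
    # 2. map the results onto the mel scale with triangular filters
    edges_hz = mel_to_hz(np.linspace(0, hz_to_mel(sr / 2), n_mels + 2))
    bins = np.floor((len(frame) + 1) * edges_hz / sr).astype(int)
    fbank = np.zeros((n_mels, len(spectrum)))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        if c > l:
            fbank[i, l:c] = np.linspace(0, 1, c - l)  # rising edge
        if r > c:
            fbank[i, c:r] = np.linspace(1, 0, r - c)  # falling edge
    mel_energies = fbank @ spectrum
    # 3. log the powers at each mel frequency
    log_mel = np.log(mel_energies + 1e-10)
    # 4. take a discrete cosine transform of the mel log powers
    n = np.arange(n_mels)
    dct = np.cos(np.pi / n_mels * (n[None, :] + 0.5) * n[:, None])
    # 5. the amplitudes of the resulting spectrum are the MFCCs
    return (dct @ log_mel)[:n_mfcc]

frame = np.random.default_rng(0).standard_normal(2048)  # stand-in audio frame
print(mfcc_frame(frame).shape)  # (13,) coefficients for this frame
```

Running this over every frame of a clip and stacking the results gives the MFCC matrix that gets flattened into the feature vector used by the shallow-network approach.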

Fortunately, this process was standardized by the telecommunications industry and is easily achieved using the librosa sound processing library (I really can't plug this library enough). Take a look at the scripts/sound_feature_extraction.py file for more details on how it was accomplished in this example.

Spectrogram Image Transformation

More recently, data scientists have been using image recognition techniques with deep learning as an alternative method, and convolutional neural networks trained on sound image data (mostly spectrograms) have seen an increase in effectiveness as a result.
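The core operation behind that approach, treating a (frequency × time) spectrogram array as an image, can be sketched as a single hand-rolled convolution-and-pooling pass. Everything here is illustrative: the kernel is a fixed edge filter rather than a learned one, and the array is random noise standing in for a real spectrogram:

```python
import numpy as np

# One convolution + pooling pass, the building block a CNN stacks
# and learns when it treats a spectrogram as an image.
def conv2d(image, kernel):
    kh, kw = kernel.shape
    h, w = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(image, size=2):
    h, w = image.shape[0] // size, image.shape[1] // size
    return image[:h * size, :w * size].reshape(h, size, w, size).max(axis=(1, 3))

spec = np.random.default_rng(0).standard_normal((128, 256))   # mel bins x frames
edge_kernel = np.array([[1, 0, -1], [1, 0, -1], [1, 0, -1]])  # vertical-edge filter
features = max_pool(np.maximum(conv2d(spec, edge_kernel), 0)) # conv -> ReLU -> pool
print(features.shape)  # (63, 127): a smaller map of local frequency patterns
```

A real CNN learns many such kernels per layer and stacks several conv/pool layers before a dense classifier, but the spatial-pattern extraction it performs on the spectrogram is exactly this operation.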

As recently as March of 2020, the debate between the two primary methods has been ongoing, and I highly recommend taking a look at the paper by Margaret Lech, Melissa Stolar, Christopher Best, and Robert Bolia in Frontiers in Computer Science.

Baseline Score:
This particular dataset is well balanced: the overall dataset has a majority class that makes up 16.6% of the observed samples, the subsets that are either song only or speech and song together are balanced at 25% per emotion, and the speech-only dataset is balanced at 16.7% per emotion. This means that in order for our models to predict the emotion of the speaker better than random chance, we need to exceed 25% accuracy.
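The majority-class baseline can be computed directly from the label counts. Here's a minimal sketch with illustrative counts, four balanced emotion classes at 25% each, matching the worst-case subset above:

```python
from collections import Counter

# The baseline score is the share of the majority class: the accuracy
# of always guessing the most common emotion.
def majority_baseline(labels):
    counts = Counter(labels)
    return max(counts.values()) / len(labels)

# Illustrative balanced labels (not the actual RAVDESS counts).
labels = (["happy"] * 192 + ["sad"] * 192
          + ["angry"] * 192 + ["fearful"] * 192)
print(majority_baseline(labels))  # 0.25
```

Any model that can't beat this number is doing no better than always predicting one emotion.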

We'll use this as our baseline moving forward.



Our full modeling notebook looks at several subsets of the data: speech only, song only, and models that look at only one of the three transformations of the full feature array.

There is only so much we can glean from the comparison spreadsheet. At a glance, we're seeing a few important points:

1. When we narrow down the dataset to only classify the emotions for which we have both speech and song samples, we see an increase in the training accuracy of 10-20%. This is true across both the flattened features and the feature array methods.

2. The full feature array takes longer to fit. With the exception of the songs only test run, the training time of the flattened feature models was below 30 seconds, while the training time of the feature array models was 63 seconds at best.

When we line up the two approaches' training and validation data by epoch side by side, a pattern starts to emerge: the feature array with a CNN produces comparable results in fewer training epochs. This is true both when examining loss by epoch and accuracy by epoch.

From what we can see, training time dramatically increases for the full feature array model. While this model produces consistent results, with a training accuracy of 62% and a validation accuracy of 57%, it is also the slowest by far, clocking in at 537.41 seconds to train at an average of 10.54 seconds per epoch.

If we compare it to the flattened feature method, we achieve a training accuracy of 69% and a testing accuracy of 63% in a total of 27.43 seconds, meaning it achieves comparable results in less than 1/20th of the time.

The full feature array method, which treats spectrograms as image data, is comparably effective to the flattened feature method yet takes 20 times as long, despite using fewer iterations. The flattened feature method is both faster and more accurate, and is therefore better suited to adaptation for real-time SER classification. Additionally, when comparing the confusion matrices of the two models, the feature array completely discards two entire categories of emotions.

Project Links

Technical Report

Github Repository Link