• karlduane

Speech Emotion Recognition- Are flat features actually better?

Updated: Jan 12, 2021

Phase 1: Problem Definition

1.1 Broad Goals

A Speech Emotion Recognition (SER) model allows a program to identify the emotional state of a speaker. The community of data scientists working on the SER problem is currently split between two main methodologies: applying feature transformations directly to the sound data, and treating spectrogram arrays as image data for a convolutional neural network. This project examines the two methodologies to determine which is better suited to classifying emotion in real-time speech rather than in pre-recorded samples.

1.2 Data Source

The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) includes 24 professional voice actors, 12 male and 12 female, each speaking the same two statements. The full dataset includes both audio and visual files.

Statement 01: "Kids are talking by the door"

Statement 02: "Dogs are sitting by the door"

Full Dataset available at:


For the purposes of this project we will be examining the audio-only files.

First, we'll need to set up a few functions to extract the sound features we need using the librosa Python library (https://librosa.org/).

File Loading:

import glob
import os

import librosa as lb
import numpy as np
import pandas as pd

#emotions, statements and vocal_channels are lookup dictionaries keyed by the
#two-digit codes in each file name; they are defined in the repository
def load_targets(target_emotions, target_actors, target_channels = ['song', 'speech']):
    #load files that contain target emotion
    sounds = []
    for file in glob.glob('./data/samples/*.wav'):
        file_name = os.path.basename(file)
        #filter out non-target channels- set to all by default
        channel = vocal_channels[file_name.split('-')[1]]
        if channel not in target_channels:
            continue
        #filter out non-target emotions
        emotion = emotions[file_name.split('-')[2]]
        if emotion not in target_emotions:
            continue
        #filter out non-target actors
        actor = file_name.split('-')[6][:-4]
        if actor not in target_actors:
            continue
        wave, sample_rate = lb.load(file)
        duration = lb.get_duration(y = wave, sr = sample_rate)
        sound_dict = {
            'file_name'  : file_name[:-4],
            'emotion'    : emotion,
            'statement'  : statements[file_name.split('-')[4]],
            'channel'    : channel,
            'mfccs'      : extract_mfccs(wave, sample_rate),
            'melspec'    : extract_melspec(wave, sample_rate),
            'chroma'     : extract_chroma(wave, sample_rate),
            'wave'            : wave,
            'duration'        : duration,
            'sr'              : sample_rate,
            'flat_feature'    : extract_features(wave,
                                 sample_rate,
                                 mfcc = True,
                                 chroma = True,
                                 mel = True),
            #extract_feature_array is defined in the repository
            'feature_array'   : extract_feature_array(wave,
                                 mfcc = True,
                                 chroma = True,
                                 mel = True)
        }
        sounds.append(sound_dict)
    return pd.DataFrame.from_dict(sounds)

This function extracts a number of feature arrays, a flattened feature list, and the sampled sound wave for each file. For the full extraction functions, check out the GitHub repository.
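The lookup dictionaries that load_targets relies on (emotions, statements, vocal_channels) are not shown in the listing. As a rough sketch, assuming they follow the file naming convention documented for RAVDESS (seven two-digit fields: modality, vocal channel, emotion, intensity, statement, repetition, actor), they might look something like the following. The label spellings and the helper name parse_ravdess_name are illustrative assumptions, not the project's actual code.

```python
# Assumed lookup tables, based on the RAVDESS file naming convention.
# The exact label strings used in the project may differ.
vocal_channels = {'01': 'speech', '02': 'song'}
emotions = {
    '01': 'neutral', '02': 'calm', '03': 'happy', '04': 'sad',
    '05': 'angry', '06': 'fearful', '07': 'disgust', '08': 'surprised',
}
statements = {
    '01': 'Kids are talking by the door',
    '02': 'Dogs are sitting by the door',
}

def parse_ravdess_name(file_name):
    """Split a name like '03-01-06-01-02-01-12.wav' into labelled fields.

    Hypothetical helper for illustration only.
    """
    stem = file_name.rsplit('.', 1)[0]   # drop the extension
    fields = stem.split('-')
    if len(fields) != 7:
        raise ValueError(f'unexpected file name: {file_name}')
    return {
        'channel': vocal_channels[fields[1]],
        'emotion': emotions[fields[2]],
        'statement': statements[fields[4]],
        'actor': fields[6],
    }

example = parse_ravdess_name('03-01-06-01-02-01-12.wav')
print(example)
```

Keeping the parsing in one place like this makes the three filters in load_targets easier to read, since each one becomes a lookup on an already-labelled field.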

Flattened Feature Extraction:

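def extract_features(audio, sample_rate, mfcc = False, chroma = False, mel = False):
    #collect the time-averaged features into a single 1D array
    result = np.array([])
    if mfcc:
        mfccs = np.mean(lb.feature.mfcc(y=audio, sr=sample_rate, n_mfcc=40).T, axis=0)
        result = np.hstack((result, mfccs))
    if chroma:
        #chroma is computed from the magnitude of the short-time Fourier transform
        stft = np.abs(lb.stft(audio))
        chroma_r = np.mean(lb.feature.chroma_stft(S=stft, sr=sample_rate).T, axis=0)
        result = np.hstack((result, chroma_r))
    if mel:
        mel_r = np.mean(lb.feature.melspectrogram(y=audio, sr=sample_rate).T, axis=0)
        result = np.hstack((result, mel_r))
    return result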

Note: This feature extraction function transposes the MFCC, chroma and mel arrays produced by the librosa library and flattens them into a single 1D array by taking the mean of each row (each coefficient, pitch class or mel band) across time, which compromises the time-series nature of the data.
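To make the cost of that averaging concrete, here is a small sketch that uses random arrays as stand-ins for the librosa outputs. The row counts (40 MFCCs, 12 chroma bins, 128 mel bands) follow the n_mfcc = 40 setting above and what I understand to be librosa's defaults; the frame count and the helper name flatten_by_time_mean are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_frames = 275  # assumed number of time frames, for illustration

# Stand-ins for librosa outputs, each shaped (n_features, n_frames).
mfccs = rng.normal(size=(40, n_frames))
chroma = rng.random(size=(12, n_frames))
mel = rng.random(size=(128, n_frames))

def flatten_by_time_mean(*feature_arrays):
    """Average each feature row over time and join the results into one vector."""
    return np.hstack([np.mean(arr.T, axis=0) for arr in feature_arrays])

flat = flatten_by_time_mean(mfccs, chroma, mel)
print(flat.shape)  # one value per feature row: 40 + 12 + 128

# Playing the frames backwards changes the sound, but not the flattened vector.
flat_reversed = flatten_by_time_mean(mfccs[:, ::-1], chroma[:, ::-1], mel[:, ::-1])
print(np.allclose(flat, flat_reversed))
```

Any reordering of the frames gives the same flattened vector, which is exactly the information about timing that the flat model never sees.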

Phase 3: Exploratory Data Analysis

The RAVDESS dataset is very well defined and is balanced by design. Each of the 24 actors records the same two lines four times over: two repetitions at each of two emotional intensities. For 5 out of 8 emotions, we also have the same recordings in song form instead of speech. This means that for five of the eight labelled emotions in the dataset we have 192 samples, while for the remaining three we have only 96. Combining the two sample types also assumes that the model will perform equally well on speech and song when it comes to classifying emotions.

So let's dig in to the differences here, look at the data with several common transformations, and determine if we can comfortably include both speech and song emotion samples in our final dataset. For the purposes of the EDA segment, we're going to be looking at only a few specific actors to get to know what our preprocessed data looks like and begin investigating trends.

The librosa library allows for quite a few visualizations; the most interesting ones are the spectrograms and waveforms:

The plot above shows three different methods of examining the same sound window side by side.

3.1 Wave plots

Given the nature of sound information as a time series object, the bottom-most figure labeled 'waveplot' will be the most familiar to readers. The librosa library preprocesses a waveform by taking samples from it at a preset rate, generating a reproducible series of samples from which the waveform can be recreated. There are limitations to this: if the sample rate is too low, several different waveforms can fit the same samples. According to the Nyquist theorem, the sampling rate must be at least double the highest frequency component of the signal (measured in hertz, Hz), or the samples cannot be used to recreate the sound. Given that the range of human hearing is between 20 and 20,000 Hz and the range of intelligible human speech is between 1,000 and 10,000 Hz, librosa's standard sampling rate of 22,050 Hz can represent frequencies up to 11,025 Hz. That should be capable of accurately capturing the waveforms in human speech patterns, and we can accept it as the standard for our preprocessing.
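As a quick illustration of the sampling limit, the sketch below samples two pure tones at 22,050 Hz using only NumPy and checks which frequency shows up strongest in the samples. The function names are made up for this example; nothing here comes from the project code.

```python
import numpy as np

def nyquist_limit(sample_rate):
    """Highest frequency (Hz) that a given sampling rate can represent."""
    return sample_rate / 2

def dominant_frequency(tone_hz, sample_rate, seconds=1.0):
    """Sample a pure sine tone and return the strongest frequency in the samples."""
    n_samples = int(sample_rate * seconds)
    t = np.arange(n_samples) / sample_rate
    samples = np.sin(2 * np.pi * tone_hz * t)
    spectrum = np.abs(np.fft.rfft(samples))
    freqs = np.fft.rfftfreq(n_samples, d=1 / sample_rate)
    return freqs[np.argmax(spectrum)]

sr = 22050
limit = nyquist_limit(sr)                 # 11025.0 Hz
inside = dominant_frequency(8000, sr)     # below the limit: recovered at 8000 Hz
outside = dominant_frequency(15000, sr)   # above the limit: shows up at 7050 Hz

print(limit, inside, outside)
```

The 15,000 Hz tone is indistinguishable from a 7,050 Hz tone once sampled (22,050 minus 15,000), which is the ambiguity the Nyquist theorem describes. Tones inside the speech range quoted above sit below the 11,025 Hz limit and are recovered as expected.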

Looking at the first emotion on the list, 'happy,' across the four iterations of both speech and song for actor number 10 (chosen at random) and statement number 2, 'Dogs are sitting by the door', we can start to see a pattern emerge in the waveforms.

1. Each word shows up as a distinct spike in the time intervals, whether it's in speech or song.

2. There is an immediately visible difference in the amplitude of the pattern for two of the speeches. This is most likely the 'low' and 'high' intensities.

After spending more time examining the other emotions, we find these observations hold true across the board.

3.2 Spectrograms

If we refer back to the three main methods of looking at the sound file that we explored in the first section, we see that two of them are variations on spectrograms, so let's dig into those a bit more. Spectrograms accomplish several specific things: they allow us to visualize the sound frequencies in the file, and they allow us to add an additional dimension of volume in decibels through the inclusion of color. Similar to heatmaps and density plots, this allows us additional insights. In the interest of visibility, we'll look at the log-frequency power spectrograms, which show trends more readily.
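For readers curious about what sits behind the colors, here is a minimal NumPy sketch of a power spectrogram converted to decibels. It is not the librosa implementation: the window, frame length (2048) and hop (512) are assumptions chosen for illustration, and it leaves the frequency axis linear rather than logarithmic.

```python
import numpy as np

def power_spectrogram_db(signal, n_fft=2048, hop_length=512):
    """Return a (frequency, time) power spectrogram in dB relative to its peak."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop_length
    # Cut the signal into overlapping, windowed frames.
    frames = np.stack([
        signal[i * hop_length : i * hop_length + n_fft] * window
        for i in range(n_frames)
    ])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    # Floor tiny values so the logarithm stays finite, then scale to the peak.
    power_db = 10.0 * np.log10(np.maximum(power, 1e-10) / power.max())
    return power_db.T

sample_rate = 22050
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 440 * t)          # one second of a 440 Hz test tone

spec_db = power_spectrogram_db(tone)
freqs = np.fft.rfftfreq(2048, d=1 / sample_rate)
loudest_hz = freqs[spec_db.mean(axis=1).argmax()]
print(spec_db.shape, loudest_hz)
```

Each column is one short slice of time and each row one frequency band, so the decibel value in a cell is what the color in the plot encodes.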

The spectrograms show distinct patterns that are quickly visible, which is promising. One of the two primary methods used in modern deep learning for SER involves treating sound files as images and applying 2D convolution and pooling layers in a neural network, so this is a useful transformation to apply. Next, let's take a look at the spectrograms across the emotions.

Looking at the pattern similarity column by column, the most easily visible pattern is that the 'angry' column shows more peaks and ridges in the samples. Looking at it from this perspective is limited, however. It's time to get the models up and running.

Phase 4: Modeling

Sound Feature Transformation

At present, the community of data scientists working on speech emotion recognition is split across two main camps. Some began by using feature-transformed sound, similar to the digitization techniques used by major telecommunications companies, paired with shallow neural networks. The most common transformation used directly on sound data is the MFCC, or Mel-Frequency Cepstral Coefficients.

Per Wikipedia, a "mel-frequency cepstrum (MFC) is a representation of the short term power spectrum of a sound, based on a linear cosine transform of a log power spectrum on a nonlinear mel scale of frequency."

If that confuses you, it's okay. There are a lot of different pieces to unpack, and if you're interested in learning more about the process, I highly recommend checking out Jack Schaedler's visual explanation of the steps involved in digital signal processing.

It's a good read and well worth your time, although the entire article is a deep dive into only the first of the 5 steps needed to compute the MFCCs:

1. take a Fourier transform of the sample

2. map the results onto the mel scale

3. log the powers at each mel frequency

4. take a discrete cosine transform of the list of mel log powers

5. keep the amplitudes of the resulting spectrum as the MFCCs.

Fortunately, this was standardized by the telecommunications industry and is easily achieved using the librosa sound processing library. (I really can't plug this library enough.) Take a look at the scripts/sound_feature_extraction.py file for more details on how it was accomplished in this example.
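For anyone who wants to see the five steps listed above in one place, here is a bare-bones NumPy sketch. It is a teaching aid under simplifying assumptions (a common textbook mel formula, unnormalised DCT, natural log), so its numbers will not match lb.feature.mfcc, which differs in details such as scaling and normalisation. All function names are hypothetical.

```python
import numpy as np

def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + np.asarray(hz) / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (np.asarray(mel) / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sample_rate):
    """Triangular filters spaced evenly on the mel scale: (n_mels, n_fft // 2 + 1)."""
    fft_freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2), n_mels + 2)
    hz_points = mel_to_hz(mel_points)
    bank = np.zeros((n_mels, len(fft_freqs)))
    for m in range(n_mels):
        lower, centre, upper = hz_points[m], hz_points[m + 1], hz_points[m + 2]
        rising = (fft_freqs - lower) / (centre - lower)
        falling = (upper - fft_freqs) / (upper - centre)
        bank[m] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank

def dct_ii(values, n_out):
    """Unnormalised DCT-II along axis 0, keeping the first n_out coefficients."""
    n = values.shape[0]
    k = np.arange(n_out)[:, None]
    i = np.arange(n)[None, :]
    basis = np.cos(np.pi * k * (2 * i + 1) / (2 * n))
    return basis @ values

def mfcc_sketch(signal, sample_rate, n_fft=2048, hop_length=512, n_mels=40, n_mfcc=13):
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop_length
    frames = np.stack([
        signal[i * hop_length : i * hop_length + n_fft] * window
        for i in range(n_frames)
    ])
    # Step 1: Fourier transform of each windowed frame (as a power spectrum).
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    # Step 2: map the powers onto the mel scale.
    mel_power = mel_filterbank(n_mels, n_fft, sample_rate) @ power.T
    # Step 3: take the log of the power in each mel band.
    log_mel = np.log(mel_power + 1e-10)
    # Steps 4 and 5: discrete cosine transform, keeping the first n_mfcc amplitudes.
    return dct_ii(log_mel, n_mfcc)

rng = np.random.default_rng(0)
sr = 22050
t = np.arange(sr) / sr
signal = np.sin(2 * np.pi * 220 * t) + 0.01 * rng.normal(size=sr)

coeffs = mfcc_sketch(signal, sr)
print(coeffs.shape)  # (n_mfcc, n_frames)
```

The output keeps one column per time frame, which is the form the feature array model uses; the flat model then averages each row over time.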

Spectrogram Image Transformation

More recently, data scientists have been using image recognition techniques with deep learning as an alternative method, and convolutional neural networks trained on sound image data (mostly spectrograms) have seen an increase in effectiveness as a result.

As recently as March of 2020, the debate between the two primary methods was still ongoing, and I highly recommend taking a look at the paper by Margaret Lech, Melissa Stolar, Christopher Best and Robert Bolia in Frontiers in Computer Science.

4.1 Flat Feature Model

We'll start with the flattened feature model. This version uses the feature extraction function from before, which compromises the time-series nature of the data by taking the mean of each feature row over time.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

flat_model = Sequential()
#first layer width is an assumption (512), continuing the halving pattern below
flat_model.add(Dense(512,
               activation = 'relu',
               input_shape = n_input))
flat_model.add(Dense(256,activation = 'relu'))
flat_model.add(Dense(128,activation = 'relu'))
flat_model.add(Dense(64,activation = 'relu'))
flat_model.add(Dense(32,activation = 'relu'))
flat_model.add(Dense(16, activation = 'relu'))
flat_model.add(Dense(7, activation = 'softmax'))
flat_model.compile(loss = 'categorical_crossentropy',
             optimizer = 'adam',
             metrics = ['accuracy'])
#TimeHistory is a custom callback from the repository that records epoch times
flat_tl = TimeHistory()
flat_res = flat_model.fit(X_train, y_train,
               validation_data = (X_test, y_test),
               epochs = 250,
               callbacks = [flat_tl],
               verbose = 0)

4.2 Feature Array

Our other model for comparison maintains the time-series nature of the data by stacking the spectrogram array transforms along the y axis and keeping the time index on the x axis, giving a 2-dimensional array.
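As a rough sketch of that stacking, the snippet below joins three (features, time) arrays on the feature axis and then pads or trims the time axis to a fixed length. Zero-padding on the right and the 275-frame target are assumptions made for illustration, as is the function name; the project's own extract_feature_array may handle clip length differently.

```python
import numpy as np

def stack_feature_array(mfccs, chroma, mel, n_frames=275):
    """Stack (features, time) arrays on the feature axis and fix the time length."""
    stacked = np.vstack([mfccs, chroma, mel])
    if stacked.shape[1] < n_frames:
        # Assumed behaviour: pad short clips with zeros at the end.
        pad = n_frames - stacked.shape[1]
        stacked = np.pad(stacked, ((0, 0), (0, pad)), mode='constant')
    else:
        # Assumed behaviour: trim long clips to the first n_frames frames.
        stacked = stacked[:, :n_frames]
    # Add a trailing channel axis so the array looks like a one-channel image.
    return stacked[..., np.newaxis]

rng = np.random.default_rng(1)
short_clip = stack_feature_array(rng.random((40, 200)),
                                 rng.random((12, 200)),
                                 rng.random((128, 200)))
long_clip = stack_feature_array(rng.random((40, 300)),
                                rng.random((12, 300)),
                                rng.random((128, 300)))
print(short_clip.shape, long_clip.shape)
```

Arrays of this shape line up with the input_shape of (180, 275, 1) that the CNN expects.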

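from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dropout, Dense
from tensorflow.keras.callbacks import EarlyStopping

ff_cnn = Sequential()
ff_cnn.add(Conv2D(filters = 16,
                    kernel_size = (3, 2),
                    activation = 'relu',
                    input_shape = (180, 275, 1)))
ff_cnn.add(MaxPooling2D(pool_size = (2,2)))
ff_cnn.add(Conv2D(filters = 8,
                    kernel_size = (2,2),
                    activation = 'relu'))
ff_cnn.add(MaxPooling2D(pool_size = (2,2)))
#flatten the pooled feature maps so the dense layers receive one vector per sample
ff_cnn.add(Flatten())
#dropout rate and position are assumptions; see the repository for the values used
ff_cnn.add(Dropout(0.25))
ff_cnn.add(Dense(128, activation = 'relu'))
ff_cnn.add(Dense(64, activation = 'relu'))
ff_cnn.add(Dense(32, activation = 'relu'))
ff_cnn.add(Dense(16, activation = 'relu'))
ff_cnn.add(Dense(7, activation = 'softmax'))
ff_cnn.compile(loss = 'categorical_crossentropy',
             optimizer = 'adam',
             metrics = ['accuracy'])
#stop once validation loss has not improved for 5 epochs and keep the best weights
early_stop = EarlyStopping(patience = 5, restore_best_weights = True)
ff_ctl = TimeHistory()
ff_cres = ff_cnn.fit(Z_train, y_train,
               batch_size = 128,
               validation_data = (Z_test, y_test),
               epochs = 100,
               callbacks = [early_stop, ff_ctl],
               verbose = 0)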

Note that in this case we're using two different regularization techniques (dropout and early stopping) as CNN models of this nature are prone to overfitting.

Phase 5: Model Analysis

Baseline Score

This particular dataset is well balanced. In the overall dataset the majority class makes up 16.6% of the observed samples; in the subsets that are either songs only or speech and song together, each emotion makes up 25%; and the speech-only dataset is balanced at 16.7% per emotion. Taking the largest of these figures, our models need to exceed 25% accuracy in order to predict the emotion of the speaker better than random chance.

We'll use this as our baseline moving forward.
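The baseline is simply the accuracy of a model that always guesses the most common label. A minimal sketch follows; the label list is a toy example with four evenly sized classes, not counts taken from the dataset.

```python
from collections import Counter

def majority_baseline(labels):
    """Accuracy of always predicting the most frequent label."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)

# Toy labels for illustration only: four emotions with ten samples each.
toy_labels = ['calm'] * 10 + ['happy'] * 10 + ['sad'] * 10 + ['angry'] * 10
baseline = majority_baseline(toy_labels)
print(f'{baseline:.0%}')  # 25%
```

Running the same function on the y labels of each subset gives the figure a model has to beat on that subset.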

Our full modeling notebook looks at several subsets of the data- speech only, song only and models that only look at one of the three transformations of the full feature array.

There is only so much we can glean from the comparison spreadsheet. At a glance, we're seeing a few important points:

1. When we narrow down the dataset to only classify the emotions for which we have both speech and song samples, we see an increase in the training accuracy of 10-20%. This is true across both the flattened features and the feature array methods.

2. The full feature array takes longer to fit. With the exception of the songs only test run, the training time of the flattened feature models was below 30 seconds, while the training time of the feature array models was 63 seconds at best.

When we line up the two approaches' training and validation data by epoch side by side, a pattern starts to emerge: the feature array with a CNN produces comparable results in fewer training epochs. This is true both when examining loss by epoch and accuracy by epoch.

From what we can see, the training time dramatically increases for the full feature array model. While this model has consistent results with a training accuracy of 62% and a validation accuracy of 57%, it is also the slowest by far, clocking in at 537.41 seconds to train, with an average epoch time of 10.54 seconds.

If we compare it to the flattened feature method, we achieve a training accuracy of 69% and a testing accuracy of 63% in a total of 27.43 seconds, meaning it achieves comparable results in roughly 1/20th of the time.
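The arithmetic behind that comparison is easy to check from the timings quoted above. The only assumption in this sketch is that the flat model ran for the full 250 epochs set in its fit call, since it has no early stopping.

```python
# Timings quoted in the text above.
array_total_s = 537.41   # full feature array model, total training time
array_epoch_s = 10.54    # full feature array model, average time per epoch
flat_total_s = 27.43     # flattened feature model, total training time
flat_epochs = 250        # assumed: the flat model ran every requested epoch

slowdown = array_total_s / flat_total_s        # how many times slower the CNN was
array_epochs = array_total_s / array_epoch_s   # implied number of CNN epochs
flat_epoch_s = flat_total_s / flat_epochs      # implied time per flat-model epoch

print(f'slowdown: {slowdown:.1f}x')
print(f'CNN epochs: {array_epochs:.0f}')
print(f'flat epoch time: {flat_epoch_s:.2f} s')
```

So the feature array model trained for about 51 epochs at roughly 10.5 seconds each, while the flat model's epochs took about a tenth of a second, which is where the near 20x difference in total time comes from.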

Phase 6: Conclusions

6.1 Revisit Problem Statement

The purpose of this study was to compare the most common SER methodologies and determine which of the two is better suited for adaptation to real-time sound processing.

6.2 Conclusion

The full feature array method, which looks at spectrograms as image data, is comparably effective to the flattened feature method but takes roughly 20 times as long to train, despite using fewer epochs. The flattened feature method is both faster and more accurate, and is therefore better suited to adaptation to real-time SER classification. Additionally, when comparing the confusion matrices of the two models, the feature array model completely discards two entire categories of emotions.

6.3 Recommendations for Future Research

The RAVDESS dataset has some limitations:

It is designed for clinical testing, in that it only uses a ‘Neutral North American’ accent.

It only uses two specific, near-identical statements: ‘Kids are talking by the door’ and ‘Dogs are sitting by the door’.

Despite including roughly 2.2 thousand samples, each emotion shows up AT MOST 376 times.

In short: we need more data, both in sheer volume and in variety. If the goal of speech emotion recognition is to reach beyond the academic, the model needs to be able to accurately predict the emotions of a broader audience.

Phase 7: Credits, References:

Livingstone SR, Russo FA (2018) The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English. PLoS ONE 13(5): e0196391. https://doi.org/10.1371/journal.pone.0196391.

EDA Segment and comprehension of the Short Term Fourier Transform inspired by:


Feature Extraction Techniques Inspired and Informed by:



Speech Intelligibility information courtesy of:


Lech M, Stolar M, Best C, Bolia R (2020) Real-Time Speech Emotion Recognition Using a Pre-trained Image Classification Network: Effects of Bandwidth Reduction and Companding. Frontiers in Computer Science.


McFee, Brian, Colin Raffel, Dawen Liang, Daniel PW Ellis, Matt McVicar, Eric Battenberg, and Oriol Nieto. “librosa: Audio and music signal analysis in python.” In Proceedings of the 14th python in science conference, pp. 18-25. 2015.
