Introduction

The project is intended to serve as an introduction to working with audio data in a machine learning context. We will focus on the feature extraction process and a few simple models we can train with those features. The goal is to begin developing an intuitive understanding of feature extraction and of how it affects the successful classification of speech examples according to characteristics such as emotion, intensity, and actor sex.

Source code can be found here

Sections

Data Inspection

The data for our project comes from the speech-only portion of the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS), which consists of 1440 samples recorded by professional actors, each conveying one of eight labeled emotions.

In this section we will examine the distributions of the data and discuss some preliminary processing.
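As a preview of that inspection, here is a minimal sketch of tallying samples per emotion. It assumes the audio files sit in a local "ravdess" folder and follow the dataset's hyphen-separated filename convention, where the third field encodes the emotion; both the folder name and the parsing step are assumptions for illustration.

```python
# Minimal sketch: count samples per emotion code.
# Assumes files live under "ravdess/" and use the RAVDESS filename convention,
# e.g. "03-01-06-01-02-01-12.wav", where the third field is the emotion code.
from collections import Counter
from pathlib import Path

emotion_counts = Counter(
    path.stem.split("-")[2]                    # third field -> emotion code
    for path in Path("ravdess").rglob("*.wav")
)
print(emotion_counts)
```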

Feature Extraction

We discuss why we transform audio before model training and what those transformations look like, with a focus on:

  • spectrograms
  • mel frequency spectrograms
  • mel frequency cepstral coefficients (mfccs)

This is followed by an interactive module that allows the user to select examples from RAVDESS, listen to the audio, and view the extracted features.
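For orientation, here is a minimal sketch of how these three feature types can be computed with librosa; the file path and parameter values are placeholders, not the settings used in the interactive module.

```python
# Minimal sketch of the three feature types using librosa.
# "speech.wav" and the parameter values below are illustrative placeholders.
import numpy as np
import librosa

y, sr = librosa.load("speech.wav", sr=None)     # audio at its native sample rate

# Spectrogram: magnitude of the short-time Fourier transform
spec = np.abs(librosa.stft(y, n_fft=2048, hop_length=512))

# Mel spectrogram: spectral energy mapped onto a mel-spaced filter bank
mel_spec = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)

# MFCCs: cepstral coefficients derived from the log-mel spectrogram
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)
```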

Binary Classification

We will train and evaluate the estimators below; in this section we give an overview of how they work.

  • Linear SVC algorithm
  • SVC with a radial basis function kernel
  • K-Nearest Neighbor

We also discuss some basic metrics and how we will evaluate the estimators' performance.
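As a rough sketch of that workflow, assuming the extracted features have already been flattened into a matrix X with matching labels y (both placeholders here), training and scoring the three estimators with scikit-learn might look like this:

```python
# Minimal sketch: train and score the three estimators.
# X (flattened features) and y (labels) are assumed to exist already.
from sklearn.svm import LinearSVC, SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

estimators = {
    "Linear SVC": LinearSVC(),
    "RBF SVC": SVC(kernel="rbf"),
    "K-Nearest Neighbors": KNeighborsClassifier(n_neighbors=5),
}

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
for name, estimator in estimators.items():
    estimator.fit(X_train, y_train)
    accuracy = accuracy_score(y_test, estimator.predict(X_test))
    print(f"{name}: {accuracy:.3f}")
```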

This is followed by another interactive module where the user can choose a characteristic (i.e., a label) for the models to classify. These include:

  • emotion (any of 8)
  • sex (female/male)
  • phrase
  • intensity
  • actor

After selecting the label that the models will attempt to classify, the user can then train the models and view performance metrics for each estimator. Furthermore, the user can experiment with the feature extraction parameters (a brief sketch of how these parameters change the feature shapes follows the list):

  • Training the model with mel spectrograms vs mfccs
  • Changing the number of mel frequencies
  • Changing the number of mfcc coefficients
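As referenced above, here is a minimal sketch of how these parameters change the shape of the extracted features; the file path and the specific values swept over are arbitrary choices for illustration.

```python
# Minimal sketch: how mel/MFCC resolution changes the feature shape.
# "speech.wav" and the swept values are illustrative placeholders.
import librosa

y, sr = librosa.load("speech.wav", sr=None)

for n_mels in (32, 64, 128):
    mel_spec = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    print("mel spectrogram, n_mels =", n_mels, "->", mel_spec.shape)

for n_mfcc in (13, 20, 40):
    mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    print("mfccs, n_mfcc =", n_mfcc, "->", mfccs.shape)
```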

Observations and Additional Resources

Finally, we will look at some of the aggregated results of our training and testing, noting how the estimators perform with different feature types, feature resolutions, and classification tasks. At the end of this section there are a number of links for those wishing to dive deeper into audio machine learning.

What we will not be doing

Training models for real-world tasks:

The RAVDESS dataset is great for our needs because the audio is clean, labeled, and well recorded. However, that cleanliness is not reflective of most real-world situations. Typically, we would add noise, reverb, and filtering to the signal, a process known as data augmentation, to better simulate real-world conditions.
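For context only (we do not do this in the project), here is a minimal sketch of one common augmentation, additive noise at a target signal-to-noise ratio; the SNR value and file path are illustrative assumptions.

```python
# Minimal sketch: additive-noise augmentation at a target SNR.
# Not used in this project; "speech.wav" and the 20 dB SNR are placeholders.
import numpy as np
import librosa

y, sr = librosa.load("speech.wav", sr=None)
snr_db = 20

noise = np.random.randn(len(y))
signal_power = np.mean(y ** 2)
noise_power = np.mean(noise ** 2)
scale = np.sqrt(signal_power / (10 ** (snr_db / 10)) / noise_power)
y_noisy = y + scale * noise
```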

Here we want to work with clean data because we are trying to develop some intuition about how the feature extraction process reduces our data to the characteristics we are most concerned with.

Tweaking hyperparameters/fine-tuning our models:

Again, we're not trying to build a real-world prediction system; we are intentionally limiting ourselves to the basics in order to better understand the impacts of feature extraction.

Working with newer/larger/more advanced models:

We won't dive into neural networks, transformers, embeddings, or LLMs. For a nice overview of the audio models out there, check out the Hugging Face audio course.


Continue to the data inspection section