Training and testing on MFCC features resulted in better performance than training/testing on mel spectrograms across almost all classification tasks, especially with the SVC-RBF model. This makes sense given that we are working with speech data, where the information tends to be encoded in the phonemes and formants, which is exactly the information MFCCs capture and highlight.
Furthermore, for this dataset and these tasks, adjusting the number of mel filters or the number of MFCCs made no meaningful difference in performance.
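For reference, the sketch below shows one way both feature types can be extracted, assuming librosa is the extraction library and that frames are averaged into a single fixed-length vector per clip (the actual pipeline may summarize frames differently); the default values of `n_mels` and `n_mfcc` here are placeholders, not the ranges actually swept.

```python
import numpy as np
import librosa

def extract_features(path, n_mels=64, n_mfcc=13):
    """Return time-averaged mel spectrogram and MFCC vectors for one clip."""
    # Load at the file's native sampling rate.
    y, sr = librosa.load(path, sr=None)

    # Mel spectrogram: n_mels controls the number of mel filters.
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    mel_db = librosa.power_to_db(mel, ref=np.max)

    # MFCCs: n_mfcc controls how many cepstral coefficients are kept.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)

    # Collapse the time axis so each clip becomes one fixed-length vector
    # (an assumption; the original pipeline may summarize frames differently).
    return mel_db.mean(axis=1), mfcc.mean(axis=1)
```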
All of the models using MFCCs classified examples by the actor's sex quite well, with all recall and precision scores above 0.95. Interestingly, the SVC with the RBF kernel (RBF model) drops from 0.99 recall and precision to 0.82 and 0.86, respectively, when using mel spectrograms, while the SVC with the linear kernel (LINEAR model) and the KNN model achieve results similar to the runs trained/tested on MFCCs.
Using MFCCs, the RBF model achieves strong results in identifying individual actors, with 0.87 recall and 0.86 precision. The LINEAR and KNN models exhibit decent performance: the LINEAR model is stronger on recall than precision (0.88/0.62), while the KNN model is the opposite (0.68/0.87). That relationship holds for models trained on mel spectrograms, though precision takes a massive hit for both RBF and LINEAR, at 0.12 and 0.18, respectively.
When classifying the phrase with MFCCs, both the RBF and LINEAR models achieve strong recall scores of ~0.9 but are prone to false positives, with precision scores around 0.5. Using mel spectrograms, the recall scores drop to 0.71 (RBF) and 0.78 (LINEAR), though the precision scores stay similar. The KNN model scores poorly, around 0.5 on both metrics, and its performance here is similar whether using MFCCs or mel spectrograms.
Likewise, in classifying intensity, both RBF and LINEAR produce decent recall scores while being prone to false positives. Again, using mel spectrograms results in degraded performance, though more noticeably for the RBF model. The KNN model using MFCCs gives decent, balanced performance here, with recall at 0.70 and precision at 0.73.
Overall performance on classifying emotions is lackluster. For the RBF and LINEAR models in particular, false positives are a problem, with precision scores below 0.5. KNN also performs poorly, but struggles with recall more than precision.
Interestingly, when looking at performance across all emotions, the RBF model posted similar recall scores for runs using MFCCs and runs using mel spectrograms.
The following tables give the average performance of each estimator, in terms of recall and precision, when trained and tested on a single label. We omit accuracy because it can be misleading on particularly unbalanced sets.
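A toy illustration of that point (synthetic labels, not taken from this dataset): a classifier that never predicts the positive class can still post high accuracy on an unbalanced set while recalling nothing.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# One positive example out of ten; a model that always predicts "negative"
# looks great on accuracy but catches no positives at all.
y_true = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [0] * 10

print(accuracy_score(y_true, y_pred))                    # 0.9
print(recall_score(y_true, y_pred, zero_division=0))     # 0.0
print(precision_score(y_true, y_pred, zero_division=0))  # 0.0
```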
For each class type, we trained and tested the models on each possible label while keeping the rest of the class types at their default "all." Furthermore, each label was trained and tested across a range of values for the number of mel filters (mel spectrogram runs) and the number of MFCCs (MFCC runs); a sketch of what a single such run might look like follows.
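The sketch below is a minimal, hypothetical version of one run, assuming scikit-learn estimators with default hyperparameters and a one-vs-rest framing of each label (1 for the target label, 0 for everything else). The random arrays at the bottom are stand-ins for the real per-clip feature vectors and labels.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score

# The three estimators compared throughout; hyperparameters are left at
# scikit-learn defaults here, which may differ from the actual runs.
ESTIMATORS = {
    "RBF": SVC(kernel="rbf"),
    "LINEAR": SVC(kernel="linear"),
    "KNN": KNeighborsClassifier(),
}

def run_once(X, y, test_size=0.25, seed=0):
    """One binary run: y is 1 for the target label, 0 for everything else.

    Returns {model name: (recall, precision)} on a held-out split.
    """
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=seed
    )
    scores = {}
    for name, model in ESTIMATORS.items():
        model.fit(X_tr, y_tr)
        pred = model.predict(X_te)
        scores[name] = (
            recall_score(y_te, pred, zero_division=0),
            precision_score(y_te, pred, zero_division=0),
        )
    return scores

# Stand-in data so the sketch runs on its own; in the real pipeline X would
# hold the per-clip MFCC or mel spectrogram vectors and y the one-vs-rest
# labels, with the run repeated over each label and each feature setting.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 13))
y_demo = rng.integers(0, 2, size=200)
print(run_once(X_demo, y_demo))
```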
The tables below give the average performance of each estimator across all labels of the class type and across all mel filter and MFCC settings. Results for training/testing on MFCC features and on mel spectrogram features are listed separately.

MFCC features:

| Class-type | RBF-Recall | RBF-Precision | Linear-Recall | Linear-Precision | KNN-Recall | KNN-Precision |
|---|---|---|---|---|---|---|
| actor | 0.87 | 0.86 | 0.88 | 0.62 | 0.68 | 0.87 |
| sex | 0.99 | 0.99 | 0.98 | 0.97 | 0.97 | 0.97 |
| phrase | 0.91 | 0.53 | 0.88 | 0.54 | 0.51 | 0.54 |
| intensity | 0.89 | 0.64 | 0.86 | 0.58 | 0.7 | 0.73 |
| emotion | 0.72 | 0.45 | 0.71 | 0.3 | 0.42 | 0.61 |

Mel spectrogram features:

| Class-type | RBF-Recall | RBF-Precision | Linear-Recall | Linear-Precision | KNN-Recall | KNN-Precision |
|---|---|---|---|---|---|---|
| actor | 0.76 | 0.12 | 0.77 | 0.18 | 0.31 | 0.51 |
| sex | 0.82 | 0.86 | 0.94 | 0.94 | 0.92 | 0.93 |
| phrase | 0.71 | 0.45 | 0.78 | 0.51 | 0.52 | 0.57 |
| intensity | 0.71 | 0.58 | 0.8 | 0.56 | 0.62 | 0.66 |
| emotion | 0.71 | 0.25 | 0.64 | 0.25 | 0.26 | 0.42 |
As above, all other class types were kept at "all" while we trained and tested the models on each emotion label individually. The per-emotion results follow, again with MFCC and mel spectrogram runs listed separately.

MFCC features:

| Emotion | RBF-Recall | RBF-Precision | Linear-Recall | Linear-Precision | KNN-Recall | KNN-Precision |
|---|---|---|---|---|---|---|
| angry | 0.73 | 0.57 | 0.79 | 0.42 | 0.37 | 0.66 |
| calm | 0.89 | 0.54 | 0.83 | 0.4 | 0.62 | 0.74 |
| disgust | 0.7 | 0.45 | 0.64 | 0.25 | 0.4 | 0.7 |
| fearful | 0.67 | 0.44 | 0.71 | 0.3 | 0.37 | 0.55 |
| happy | 0.59 | 0.38 | 0.65 | 0.24 | 0.28 | 0.54 |
| sad | 0.67 | 0.37 | 0.68 | 0.27 | 0.29 | 0.53 |
| surprised | 0.79 | 0.41 | 0.65 | 0.21 | 0.62 | 0.57 |
| neutral | 0.73 | 0.28 | 0.73 | 0.17 | 0.33 | 0.46 |

Mel spectrogram features:

| Emotion | RBF-Recall | RBF-Precision | Linear-Recall | Linear-Precision | KNN-Recall | KNN-Precision |
|---|---|---|---|---|---|---|
| angry | 0.58 | 0.46 | 0.53 | 0.5 | 0.34 | 0.66 |
| calm | 0.96 | 0.21 | 0.95 | 0.29 | 0.46 | 0.5 |
| disgust | 0.92 | 0.16 | 0.81 | 0.17 | 0.19 | 0.34 |
| fearful | 0.31 | 0.34 | 0.31 | 0.29 | 0.24 | 0.41 |
| happy | 0.34 | 0.23 | 0.24 | 0.19 | 0.18 | 0.33 |
| sad | 0.93 | 0.18 | 0.85 | 0.19 | 0.19 | 0.32 |
| surprised | 0.9 | 0.16 | 0.77 | 0.16 | 0.21 | 0.36 |
| neutral | 0.96 | 0.1 | 0.9 | 0.13 | 0.22 | 0.31 |