Training and testing on MFCC features resulted in better performance than training/testing on mel spectrograms across almost all classification tasks, especially with the SVC-RBF model. This makes sense given that we are working with speech data, where the information tends to be encoded in the phonemes and formants, which is exactly the information MFCCs capture and highlight.
Furthermore, for this dataset and these tasks, adjusting the number of mel filters or the number of MFCCs made no meaningful difference in performance.
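For reference, the sketch below shows one way both feature types can be extracted, assuming librosa is the extraction library and that frames are averaged into a single fixed-length vector per clip (the actual pipeline may summarize frames differently); the default values of `n_mels` and `n_mfcc` here are placeholders, not the ranges actually swept.

```python
import numpy as np
import librosa

def extract_features(path, n_mels=64, n_mfcc=13):
    """Return time-averaged mel spectrogram and MFCC vectors for one clip."""
    # Load at the file's native sampling rate.
    y, sr = librosa.load(path, sr=None)

    # Mel spectrogram: n_mels controls the number of mel filters.
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    mel_db = librosa.power_to_db(mel, ref=np.max)

    # MFCCs: n_mfcc controls how many cepstral coefficients are kept.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)

    # Collapse the time axis so each clip becomes one fixed-length vector
    # (an assumption; the original pipeline may summarize frames differently).
    return mel_db.mean(axis=1), mfcc.mean(axis=1)
```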
All of the models using MFCCs classified examples by the actor's sex quite well, with all recall and precision scores above 0.95. Interestingly, the SVC with the RBF kernel (RBF model) drops from 0.99 recall and precision to 0.82 and 0.86, respectively, when using mel spectrograms, while the SVC with the linear kernel (LINEAR model) and the KNN model achieve results similar to the runs trained/tested on MFCCs.
Using MFCCs, the RBF model achieves strong results in identifying individual actors, with 0.87 recall and 0.86 precision. The LINEAR and KNN models exhibit decent performance: the LINEAR model is stronger on recall than precision (0.88/0.62), while the KNN model is the opposite (0.68/0.87). That relationship holds for models trained on mel spectrograms, though precision takes a massive hit for both RBF and LINEAR, at 0.12 and 0.18, respectively.
When classifying the phrase with MFCCs, both the RBF and LINEAR models achieve strong recall scores of ~0.9 but are prone to false positives, with precision scores around 0.5. Using mel spectrograms, the recall scores drop to 0.71 (RBF) and 0.78 (LINEAR), though the precision scores stay similar. The KNN model scores poorly, around 0.5 on both metrics, and its performance here is similar whether using MFCCs or mel spectrograms.
Likewise, in classifying intensity, both RBF and LINEAR produce decent recall scores while being prone to false positives. Again, using mel spectrograms results in degraded performance, though more noticeably for the RBF model. The KNN model using MFCCs gives decent, balanced performance here, with recall at 0.70 and precision at 0.73.
Overall performance on classifying emotions is lackluster. For the RBF and LINEAR models in particular, false positives are a problem, with precision scores below 0.5. KNN also performs poorly, but struggles with recall more than precision.
Interestingly, when looking at performance across all emotions, the RBF model posted similar recall scores for runs using MFCCs and runs using mel spectrograms.
The following tables give the average performance of each estimator, in terms of recall and precision, when trained and tested on a single label. We omit accuracy because it can be misleading on particularly unbalanced sets.
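A toy illustration of that point (synthetic labels, not taken from this dataset): a classifier that never predicts the positive class can still post high accuracy on an unbalanced set while recalling nothing.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# One positive example out of ten; a model that always predicts "negative"
# looks great on accuracy but catches no positives at all.
y_true = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [0] * 10

print(accuracy_score(y_true, y_pred))                    # 0.9
print(recall_score(y_true, y_pred, zero_division=0))     # 0.0
print(precision_score(y_true, y_pred, zero_division=0))  # 0.0
```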
For each class type, we trained and tested the models on each possible label while keeping the rest of the class types at their default "all." Furthermore, each label was trained and tested across a range of values for the number of mel filters (mel spectrogram runs) and the number of MFCCs (MFCC runs); a sketch of what a single such run might look like follows.
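The sketch below is a minimal, hypothetical version of one run, assuming scikit-learn estimators with default hyperparameters and a one-vs-rest framing of each label (1 for the target label, 0 for everything else). The random arrays at the bottom are stand-ins for the real per-clip feature vectors and labels.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score

# The three estimators compared throughout; hyperparameters are left at
# scikit-learn defaults here, which may differ from the actual runs.
ESTIMATORS = {
    "RBF": SVC(kernel="rbf"),
    "LINEAR": SVC(kernel="linear"),
    "KNN": KNeighborsClassifier(),
}

def run_once(X, y, test_size=0.25, seed=0):
    """One binary run: y is 1 for the target label, 0 for everything else.

    Returns {model name: (recall, precision)} on a held-out split.
    """
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=seed
    )
    scores = {}
    for name, model in ESTIMATORS.items():
        model.fit(X_tr, y_tr)
        pred = model.predict(X_te)
        scores[name] = (
            recall_score(y_te, pred, zero_division=0),
            precision_score(y_te, pred, zero_division=0),
        )
    return scores

# Stand-in data so the sketch runs on its own; in the real pipeline X would
# hold the per-clip MFCC or mel spectrogram vectors and y the one-vs-rest
# labels, with the run repeated over each label and each feature setting.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 13))
y_demo = rng.integers(0, 2, size=200)
print(run_once(X_demo, y_demo))
```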
The tables below give the average performance of each estimator across all labels of the class type and across all mel filter and MFCC settings. Results for training/testing on MFCC features and on mel spectrogram features are listed separately.

MFCC features:

| Class-type | RBF-Recall | RBF-Precision | Linear-Recall | Linear-Precision | KNN-Recall | KNN-Precision |
|---|---|---|---|---|---|---|
| actor | 0.87 | 0.86 | 0.88 | 0.62 | 0.68 | 0.87 |
| sex | 0.99 | 0.99 | 0.98 | 0.97 | 0.97 | 0.97 |
| phrase | 0.91 | 0.53 | 0.88 | 0.54 | 0.51 | 0.54 |
| intensity | 0.89 | 0.64 | 0.86 | 0.58 | 0.7 | 0.73 |
| emotion | 0.72 | 0.45 | 0.71 | 0.3 | 0.42 | 0.61 |

Mel spectrogram features:

| Class-type | RBF-Recall | RBF-Precision | Linear-Recall | Linear-Precision | KNN-Recall | KNN-Precision |
|---|---|---|---|---|---|---|
| actor | 0.76 | 0.12 | 0.77 | 0.18 | 0.31 | 0.51 |
| sex | 0.82 | 0.86 | 0.94 | 0.94 | 0.92 | 0.93 |
| phrase | 0.71 | 0.45 | 0.78 | 0.51 | 0.52 | 0.57 |
| intensity | 0.71 | 0.58 | 0.8 | 0.56 | 0.62 | 0.66 |
| emotion | 0.71 | 0.25 | 0.64 | 0.25 | 0.26 | 0.42 |
As above, all other class types were kept at "all" while we trained and tested the models on each emotion label individually. The per-emotion results follow, again with MFCC and mel spectrogram runs listed separately.

MFCC features:

| Emotion | RBF-Recall | RBF-Precision | Linear-Recall | Linear-Precision | KNN-Recall | KNN-Precision |
|---|---|---|---|---|---|---|
| angry | 0.73 | 0.57 | 0.79 | 0.42 | 0.37 | 0.66 |
| calm | 0.89 | 0.54 | 0.83 | 0.4 | 0.62 | 0.74 |
| disgust | 0.7 | 0.45 | 0.64 | 0.25 | 0.4 | 0.7 |
| fearful | 0.67 | 0.44 | 0.71 | 0.3 | 0.37 | 0.55 |
| happy | 0.59 | 0.38 | 0.65 | 0.24 | 0.28 | 0.54 |
| sad | 0.67 | 0.37 | 0.68 | 0.27 | 0.29 | 0.53 |
| surprised | 0.79 | 0.41 | 0.65 | 0.21 | 0.62 | 0.57 |
| neutral | 0.73 | 0.28 | 0.73 | 0.17 | 0.33 | 0.46 |

Mel spectrogram features:

| Emotion | RBF-Recall | RBF-Precision | Linear-Recall | Linear-Precision | KNN-Recall | KNN-Precision |
|---|---|---|---|---|---|---|
| angry | 0.58 | 0.46 | 0.53 | 0.5 | 0.34 | 0.66 |
| calm | 0.96 | 0.21 | 0.95 | 0.29 | 0.46 | 0.5 |
| disgust | 0.92 | 0.16 | 0.81 | 0.17 | 0.19 | 0.34 |
| fearful | 0.31 | 0.34 | 0.31 | 0.29 | 0.24 | 0.41 |
| happy | 0.34 | 0.23 | 0.24 | 0.19 | 0.18 | 0.33 |
| sad | 0.93 | 0.18 | 0.85 | 0.19 | 0.19 | 0.32 |
| surprised | 0.9 | 0.16 | 0.77 | 0.16 | 0.21 | 0.36 |
| neutral | 0.96 | 0.1 | 0.9 | 0.13 | 0.22 | 0.31 |