We use the speech-only portion of the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS), which consists of 1,440 samples.
Before moving on to feature extraction, we reduce the size of our data by trimming each sample to 2.5 seconds and resampling it to 16 kHz.
Trimming removes the silence at the beginning of each recording, which carries no useful information for our purposes, and lets us work with smaller files, which speeds up both feature extraction and training.
We downsample the examples from 48 kHz to 16 kHz for similar reasons. There is little speech information above 8 kHz, and by the Nyquist theorem a 16 kHz sampling rate still captures everything below that, so we keep the content that matters to our application while discarding high-frequency noise. A sketch of this preprocessing step is shown below.
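As a rough illustration, the trim-and-resample step might look like the following sketch using librosa. The file paths are hypothetical, and whether the leading silence is removed with an energy threshold or a fixed offset is an assumption; the write-up only says that the leading silence is dropped and 2.5 seconds of audio are kept.

```python
import librosa
import soundfile as sf

TARGET_SR = 16_000   # downsample from 48 kHz to 16 kHz
CLIP_SECONDS = 2.5   # keep 2.5 s of speech per sample

def preprocess(path, out_path):
    # Load at the original sample rate, then drop the quiet lead-in/out.
    # (Energy-based trimming with a 30 dB threshold is an assumption.)
    y, sr = librosa.load(path, sr=None)
    y, _ = librosa.effects.trim(y, top_db=30)

    # Resample to 16 kHz; content above 8 kHz is discarded (Nyquist).
    y = librosa.resample(y, orig_sr=sr, target_sr=TARGET_SR)

    # Keep the first 2.5 seconds, zero-padding clips that end up shorter.
    n_samples = int(CLIP_SECONDS * TARGET_SR)
    y = librosa.util.fix_length(y, size=n_samples)

    sf.write(out_path, y, TARGET_SR)

# Example RAVDESS speech file (audio-only, speech channel, Actor 01).
preprocess("Actor_01/03-01-01-01-01-01-01.wav",
           "processed/03-01-01-01-01-01-01.wav")
```

Padding shorter clips to a fixed length is one common way to keep every example the same size for batching; the original text does not say how clips shorter than 2.5 seconds are handled.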
Each sample was rated ten times by individuals drawn from a pool of 349 participants. Details on the perceptual validation can be found in the original RAVDESS publication.