International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056

Volume: 12 Issue: 10 | Oct 2025 www.irjet.net p-ISSN: 2395-0072

Machine Learning and Deep Learning Approaches for Speech Emotion Recognition: A Survey

Jash Shah¹, Labdhi Shah², Anish Deshpande³, Puransh Kawdia⁴, Pramila M. Chawan⁵

¹B.Tech Student, Dept. of Computer Engineering and IT, VJTI College, Mumbai, Maharashtra, India

²B.Tech Student, Dept. of Computer Engineering and IT, VJTI College, Mumbai, Maharashtra, India

³B.Tech Student, Dept. of Computer Engineering and IT, VJTI College, Mumbai, Maharashtra, India

⁴B.Tech Student, Dept. of Computer Engineering and IT, VJTI College, Mumbai, Maharashtra, India

⁵Associate Professor, Dept. of Computer Engineering and IT, VJTI College, Mumbai, Maharashtra, India

Abstract - This paper presents a comprehensive survey of systems for recognizing human emotions from speech signals. The ability of machines to understand emotional context is a critical step toward more natural and empathetic human-computer interaction (HCI). This survey explores the complete pipeline of modern Speech Emotion Recognition (SER) systems, beginning with a discussion of benchmark datasets and feature extraction techniques, particularly the use of Mel-spectrograms. We delve into a detailed analysis of various machine learning and deep learning models, highlighting the evolution from traditional classifiers to advanced architectures like Convolutional Neural Networks (CNNs), Long Short-Term Memory (LSTM) networks, and state-of-the-art hybrid models. The paper also investigates advanced techniques such as data augmentation with Generative Adversarial Networks (GANs) and the fusion of multiple modalities. We propose a robust SER system based on a hybrid CNN-BiLSTM architecture designed to achieve high accuracy by effectively modeling both the spectral and temporal characteristics of emotional speech.

Key Words: Deep Learning, Speech Emotion Recognition (SER), CNN-BiLSTM, RAVDESS Dataset, Feature Extraction, Data Augmentation.

1. INTRODUCTION

1.1 Background and Importance

Human speech is a complex signal rich with emotional information that complements its linguistic content. Speech Emotion Recognition (SER) is a rapidly advancing field at the intersection of digital signal processing and artificial intelligence, aiming to automatically identify the emotional state of a speaker from their voice. The development of accurate SER systems is paramount for the next generation of HCI. Applications are wide-ranging and impactful, including:

• Smart Assistants: Enabling virtual assistants to respond more empathetically to a user's tone.

• Healthcare: Monitoring the emotional well-being of patients in remote therapy or detecting stress and depression.

• Automotive Safety: Detecting driver states like anger, drowsiness, or stress to prevent accidents.

1.2 Core Challenges in SER

Despite its potential, SER faces significant challenges. Emotions are subjective and can be expressed differently across cultures, genders, and contexts. The acoustic features that correlate with emotions are often subtle and intertwined with the linguistic content of speech. Furthermore, the scarcity of large-scale, realistically labeled emotional speech datasets remains a major bottleneck, often leading to models that perform well in a lab setting but fail in real-world, noisy environments.

Several key challenges are consistently identified in the literature:

• Ambiguity of Emotional Labels: There is no universally accepted standard for labeling emotions. Emotions are often blended (e.g., happily surprised) and do not have distinct boundaries, making it difficult to assign a single, discrete label to a speech utterance. This leads to inconsistencies during the data annotation process.

• Data Scarcity and Quality: Most available datasets are "acted" rather than "spontaneous." While acted datasets are clean and balanced, they may not accurately reflect the subtlety of natural, real-world emotions. Spontaneous datasets are more realistic but are much harder to collect and label accurately, and are often imbalanced.

• Environmental and Speaker Variability: Models trained on one dataset often fail to generalize to another due to differences in recording conditions, background noise, languages, and speaker characteristics. This problem, known as cross-corpus generalization, is a major hurdle for creating universally applicable SER systems.

• Feature Engineering and Selection: While deep learning can learn features automatically, the choice of input representation is still critical. Using very large feature sets can lead to the "curse of dimensionality," where the model becomes overly complex and overfits the training data, a challenge addressed by techniques like sequential feature selection (a brief sketch follows this list).
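To illustrate wrapper-based sequential feature selection, the following minimal Python sketch uses scikit-learn's SequentialFeatureSelector. The feature matrix, labels, choice of classifier, and all parameter values are illustrative assumptions, not settings taken from the surveyed papers.

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.svm import SVC

# Placeholder data: 200 utterances described by 60 handcrafted acoustic
# features (e.g., pitch/energy statistics) with 5 emotion classes.
rng = np.random.default_rng(0)
X = rng.random((200, 60))
y = rng.integers(0, 5, 200)

# Forward SFS: greedily add the feature that most improves cross-validated
# accuracy; direction="backward" gives SBS instead.
selector = SequentialFeatureSelector(
    SVC(kernel="rbf"),          # wrapper classifier scoring each subset
    n_features_to_select=20,    # keep only the most discriminative features
    direction="forward",
    cv=5,
)
X_reduced = selector.fit_transform(X, y)
print(X_reduced.shape)  # (200, 20)
```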

2. METHODOLOGIES IN SPEECH EMOTION RECOGNITION

The process of building an SER system involves several stages, from processing raw audio to classifying emotions using sophisticated models.

2.1 Feature Extraction

The first step is to convert raw audio waveforms into a compact and informative numerical representation. While traditional features like pitch and energy are useful, modern SER systems heavily rely on 2D representations that can be processed by deep learning models.

Mel-spectrograms are a popular choice. They are a visual representation of the spectrum of frequencies in a sound as they vary with time, with the frequency axis scaled to the Mel scale to mimic human auditory perception. This makes them a rich input for vision-based models like CNNs.
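As a concrete illustration, here is a minimal sketch of Mel-spectrogram extraction with the librosa library; the file name and parameter values are assumptions for demonstration, not settings prescribed by the surveyed papers.

```python
import numpy as np
import librosa

def extract_mel_spectrogram(path, sr=22050, n_mels=128):
    """Convert an utterance into a log-scaled Mel-spectrogram (n_mels x frames)."""
    y, sr = librosa.load(path, sr=sr)  # resample to a uniform rate
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=2048, hop_length=512, n_mels=n_mels
    )
    # Log (dB) scaling compresses dynamic range, mimicking loudness perception.
    return librosa.power_to_db(mel, ref=np.max)

# Example usage (hypothetical file name):
# spec = extract_mel_spectrogram("utterance.wav")
```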

2.2 Classification Models

• Traditional Machine Learning: Early approaches utilized models like Support Vector Machines (SVM) and Random Forests (RF). These models perform well with handcrafted features but are often limited in their ability to learn complex, hierarchical patterns from the data.

• Convolutional Neural Networks (CNNs): CNNs are the cornerstone of modern computer vision and have been successfully adapted for SER. When applied to Mel-spectrograms, their convolutional filters act as learnable feature extractors, automatically detecting key spectral shapes, textures, and patterns (like formants) that correspond to different emotional states.

• Recurrent Neural Networks (RNNs) and LSTMs: Since speech is a sequence, its temporal dynamics are crucial. Long Short-Term Memory (LSTM) networks, a type of RNN, are designed to model long-range dependencies in sequences. They use a system of gates to remember or forget information over time, allowing them to learn the contextual flow of emotional expression throughout an utterance. Bidirectional LSTMs (Bi-LSTMs) further enhance this by processing the sequence both forwards and backwards, providing a more complete context.

• Hybrid Models (CNN-BiLSTM): This architecture has emerged as the state of the art for SER. It combines the strengths of both models: a CNN front-end extracts powerful spatial features from each time-frame of the spectrogram, and a Bi-LSTM back-end models the temporal sequence of these features. This allows the system to simultaneously learn what is being said spectrally and how it evolves contextually (see the sketch below).
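To make the hybrid design concrete, the following is a minimal Keras sketch of one possible CNN-BiLSTM; the layer counts, filter sizes, and input shape are illustrative assumptions rather than a fixed specification from the literature.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_cnn_bilstm(input_shape=(128, 128, 1), n_classes=5):
    """CNN front-end over Mel-spectrograms, Bi-LSTM back-end over time frames."""
    inp = layers.Input(shape=input_shape)             # (mel bins, frames, 1)
    x = layers.Conv2D(32, 3, padding="same", activation="relu")(inp)
    x = layers.MaxPooling2D(2)(x)
    x = layers.Conv2D(64, 3, padding="same", activation="relu")(x)
    x = layers.MaxPooling2D(2)(x)                     # -> (32, 32, 64)
    # Re-arrange so the time axis comes first, then flatten each frame's
    # spectral features into one vector per time step.
    x = layers.Permute((2, 1, 3))(x)                  # (frames, mel bins, ch)
    x = layers.Reshape((32, 32 * 64))(x)
    x = layers.Bidirectional(layers.LSTM(128))(x)     # temporal context
    x = layers.Dropout(0.3)(x)                        # regularization
    x = layers.Dense(64, activation="relu")(x)
    out = layers.Dense(n_classes, activation="softmax")(x)
    return models.Model(inp, out)
```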

3. COMPARATIVE ANALYSIS OF SER TECHNIQUES

The following table provides a summary of key research papers, focusing on their unique contributions and limitations rather than just the authors.

Table -1: Comparison of Key SER Methodologies

| Methodology | Dataset(s) Used | Key Contribution | Limitations / Observations |
|---|---|---|---|
| Random Forest | RAVDESS, EmoDB | Strong ML baseline (~85% acc.) using ensemble learning. | Relies heavily on feature engineering; poor generalization. |
| DCGAN Augmentation | EmoDB, RAVDESS | Used GANs to generate spectrograms, boosting accuracy by solving data scarcity. | Generated data quality is a concern; high compute cost. |
| Feature Selection (SFS/SBS) | IEMOCAP, RAVDESS | Improved efficiency and reduced overfitting by selecting dominant features. | The optimal feature set is often dataset-specific. |
| Hybrid CNN-BiLSTM | Multiple datasets | State-of-the-art results by capturing spatio-temporal features effectively. | High data and compute needs; requires careful regularization. |

4. PROPOSED SYSTEM ARCHITECTURE

To address the challenges of SER, we propose a system based on a robust, state-of-the-art deep learning architecture.

4.1 Problem Statement: To develop an accurate and efficient system for real-time classification of human emotions (e.g., happy, sad, angry, neutral, calm) from raw speech signals.


4.2 System Pipeline:

1. Data Pre-processing: Input audio will be standardized to a uniform sampling rate and length. Silent portions at the beginning and end of utterances will be trimmed to remove irrelevant data.

2. Feature Extraction: The pre-processed audio will be converted into Mel-spectrograms. This provides a rich, 2D representation suitable for our deep learning model.

3. Model Architecture: A hybrid CNN-BiLSTM model will be used.

o The CNN block will consist of multiple convolutional layers with ReLU activation and Max Pooling. This part will act as a powerful feature extractor.

o The output feature maps from the CNN will be reshaped and fed into a Bi-LSTM block. This will model the temporal dependencies within the utterance.

o Finally, a set of Dense layers with a Softmax activation function will perform the final classification into emotion categories. Dropout layers will be used throughout the network to prevent overfitting.

4. Training and Evaluation: The model will be trained on the RAVDESS dataset, a high-quality, balanced corpus of North American English speech. We will use the Adam optimizer and categorical cross-entropy as the loss function. Performance will be evaluated using standard metrics: accuracy, precision, recall, and F1-score. A minimal sketch of this pipeline follows.
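The sketch below reuses build_cnn_bilstm from the Section 2.2 sketch; the silence-trimming threshold, clip duration, training hyperparameters, and the placeholder arrays standing in for the processed RAVDESS split are all illustrative assumptions.

```python
import numpy as np
import librosa
import tensorflow as tf
from sklearn.metrics import classification_report

def preprocess(path, sr=22050, duration=3.0):
    """Standardize sampling rate and length; trim leading/trailing silence."""
    y, _ = librosa.load(path, sr=sr)
    y, _ = librosa.effects.trim(y, top_db=25)  # drop silent edges
    target = int(sr * duration)
    return np.pad(y, (0, max(0, target - len(y))))[:target]  # fixed length

# Placeholder tensors standing in for processed RAVDESS spectrograms/labels.
rng = np.random.default_rng(0)
X_train = rng.random((64, 128, 128, 1)).astype("float32")
y_train = tf.keras.utils.to_categorical(rng.integers(0, 5, 64), 5)
X_test = rng.random((16, 128, 128, 1)).astype("float32")
y_test = tf.keras.utils.to_categorical(rng.integers(0, 5, 16), 5)

model = build_cnn_bilstm()  # from the Section 2.2 sketch
model.compile(
    optimizer=tf.keras.optimizers.Adam(1e-3),
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)
model.fit(X_train, y_train, epochs=5, batch_size=16, validation_split=0.1)

# Accuracy, precision, recall, and F1 on the held-out set.
y_pred = model.predict(X_test).argmax(axis=1)
print(classification_report(y_test.argmax(axis=1), y_pred))
```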

5. CONCLUSIONS

This survey confirms that the field of Speech Emotion Recognition has significantly benefited from the adoption of deep learning. While traditional machine learning models provide strong baselines, hybrid architectures like the CNN-BiLSTM have demonstrated superior performance by adeptly capturing the complex spatio-temporal nature of emotional speech. Techniques like data augmentation are vital for overcoming the persistent challenge of limited data. The system we propose, based on a CNN-BiLSTM architecture, is designed to be both robust and accurate. Future work should continue to explore cross-corpus generalization, real-world noise robustness, and the recognition of more subtle and blended emotional states to bring SER technology closer to human-level performance.

ACKNOWLEDGEMENT

We would like to express our sincere gratitude to our project guide, Prof. Pramila M. Chawan, for her invaluable guidance, constant encouragement, and support throughout this research. Her insights were instrumental in the successful completion of this survey paper. We would also like to thank the Department of Computer Engineering & IT at Veermata Jijabai Technological Institute (VJTI), Mumbai, for providing the necessary resources and a conducive environment for our work.

REFERENCES

[1] P. Koromilas and T. Giannakopoulos, "Deep Multimodal Emotion Recognition on Human Speech: A Review," Applied Sciences, vol. 11, no. 17, p. 7962, 2021.

[2] F. Harby, M. Alohali, A. Thaljaoui, and A. S. Talaat, "Exploring Sequential Feature Selection in Deep Bi-LSTM Models for Speech Emotion Recognition," Computers, Materials & Continua, vol. 78, no. 2, pp. 2689-2719, 2024.

[3] C. Barhoumi and Y. Ben Ayed, "Real-time speech emotion recognition using deep learning and data augmentation," Artificial Intelligence Review, vol. 58, p. 49, 2025.

[4] T. M. Wani, T. S. Gunawan, S. A. A. Qadri, M. Kartiwi, and E. Ambikairajah, "A Comprehensive Review of Speech Emotion Recognition Systems," IEEE Access, vol. 9, pp. 47795-47814, 2021.

[5] S. G. Shaila, A. Sindhu, L. Monish, D. Shivamma, and B. Vaishali, "Speech Emotion Recognition Using Machine Learning Approach," in ICAMIDA 2022, ACSR 105, pp. 592-599, 2023.

BIOGRAPHIES

Prof. Pramila M. Chawan holds the position of Associate Professor in the Computer Engineering Department of VJTI, Mumbai. She pursued her B.E. (Computer Engineering) and M.E. (Computer Engineering) from VJTI College of Engineering. She has guided 100+ M.Tech and 150+ B.Tech projects over her 32 years in the profession. She has published 181+ papers in peer-reviewed international journals and at international conferences and symposiums. She has been on the planning committee for six faculty development programs and 29 international conferences, and is a consulting editor on 9 scientific research journals. The Society of Innovative Educationalist & Scientific Research Professional, Chennai (SIESRP) awarded her the 'Innovative & Dedicated Educationalist Award, Specialization: Computer Engineering & I.T.'

Jash Shah, B.Tech Student, Dept. of Computer Engineering and IT, VJTI, Mumbai, Maharashtra, India.

Labdhi Shah, B.Tech Student, Dept. of Computer Engineering and IT, VJTI, Mumbai, Maharashtra, India.

Anish Deshpande, B.Tech Student, Dept. of Computer Engineering and IT, VJTI, Mumbai, Maharashtra, India.

Puransh Kawdia, B.Tech Student, Dept. of Computer Engineering and IT, VJTI, Mumbai, Maharashtra, India.
