Sentiment Analysis: A comparative study of Deep Learning and Machine Learning

International Research Journal of Engineering and Technology (IRJET) e-ISSN:2395-0056

Volume: 09 Issue: 11 | Nov 2022 www.irjet.net p-ISSN:2395-0072

Sentiment Analysis: A comparative study of Deep Learning and Machine Learning

Rachit Manshani1 , Toshi Thatte2 , Kanhaiya Suryawanshi3

Abstract - As the population of customers grows, it becomes increasingly tedious to analyze customer feedback. By 2025 it is estimated that more than 75% of the data relating to customer reviews will be unstructured, meaning that the challenges to analyze and gain insights from these reviews will increase.

With help of sentiment analysis by machine learning or deep learning, data gathered via reviews can be categorized under the labels of positive, negative and neutral. Sentiment analysis makes it convenient to understand the customer’s mindset and preferences, assisting companies in changing their strategies and products to better cater the needs of their customer base.

Key Words: SentimentAnalysis,MachineLearning,Deep Learning,LSTM,CNN,RNN,LogisticRegressionalgorithm, TF-IDF.

1. INTRODUCTION

ViaSentimentanalysis,weclassifyablockoftextunderan umbrellaoflabelslikepositive,negativeandneutral.Using Sentimentanalysis,onecandeterminethegeneralmindset ofthecustomerpoolwhichinturnhelpsthecompaniesto introspect onthe qualityof services or productsthatthey areproviding,thusassistingthemagreatdealinupgrading themselves.

Various machine learning algorithms and deep learning techniques canbeusedfor sentimentanalysis,though the objectiveofthispaperistodeterminewhichworkapproach will yield the best results [1]. At the end of this research, we’llhaveageneralizedideaonhowsentimentanalysiscan beperformedviamachinelearninganddeeplearningand whichpracticestoemployfortheoptimumpre-processing andmodellingofthedataset.

2. WORK DONE

E.LakshmiandEdwarddeducedpre-processingofthedatato improvearawphrase’sconsistencystructure.Theyutilized theLSAtechniqueandcosinesimilarity.

Basant Agarwal et al. used the term pattern system for sentimentanalysis.Itworksonextractingcertaincontextual and syntactic information from the dataset. The author wantedtoputforwardaspect-focusedopinionpollingbased onunlabeledfreeformcustomerreviews.

M. Karamibekr and A.A. Ghorbani on the other hand suggested an alternative approach for sentiment classificationofa textweighingonverbsasavital opinion expression and hence can be used as a tool for sentiment analysis.

SentiFulisasentimentlexicon/algorithmwhichutilizesand expands it by synonyms, antonyms, hyponyms etc. They identified four types of affixes based on their position in sentiment features namely propagation, weakening, reversing,andintensifying.Forsentimentanalysis,a lotof researchershaveexploredandusedsoft-computingmethods, mostlyfuzzylogicandneuralworks.

Significant contribution has been made by using a fuzzy domainsentimentontologytreeextractionalgorithm.This algorithm makes a fuzzy domain sentiment ontology tree based on feedback, which involves extracting sentiment terms, product attributes, and feature relationships, and accurately predicting the polarity of the reviews [2]. They deduced and standardized the method of increasing the strengthofreviewer'sopinionsinthepresenceofadverbial modifiersbydesigningmembershipfunctions.Theyusedthe technique to analyze adverbial modifier trigram patterns [10].

3. APPROACH

3.1 Determining the Dataset

There are several datasets available that can be used to efficientlyforSentimentAnalysisandwe’lldeterminewhich datasetwillbetheappropriatefortheresearchwehavein mind.Someprominentdatasetsare:

1)IMDBMoviesReviewDataset:

This dataset contains about 50,000 movie reviews from IMDB.However,inthissetonlyhighlypolarizedreviewsare beingconsidered.Fromatotalof10,thenegativehasascore oflessthan4andthepositivehasascoreofgreaterthan7.

2)CornellFilmReviewData:

This dataset contains around 220,000 conversations betweenpairsofdifferentmoviecharacters.

3)MovieLens25Mdataset:

This dataset is collated from the Movie Lens website, this datasetcomprisesaround25millionratingswith1million tagapplicationsgiventoabout62,000moviesintotal.

|

Certified Journal | Page147

IRJET

Impact Factor value: 7.529

ISO 9001:2008

***

International Research Journal of Engineering and Technology (IRJET) e-ISSN:2395-0056

Volume: 09 Issue: 11 | Nov 2022 www.irjet.net p-ISSN:2395-0072

4)RottenTomatoesDataset:

Inthisdataset,eachentryrepresentsamovieavailableon Rotten Tomatoes website with the URL that is used for analyzingwithdatavalueslikemovietitle,descriptionofthe movie, genre to which the belongs, duration, directors, actors,ratingsgivenbythe usersandratingsgivenbythe critics.

Afteranalyzingseveraldatasets,wedecidedtoproceedwith theRottenTomatoesMovieReviewDatasetforthesakeof this research as we felt that the above-mentioned dataset cateredmosttoourneeds.

Here, the submissions are evaluated using classification accuracywhichisthepercentageoflabelsthatarepredicted correctlyforeveryparsedphrase.Thesentimentlabelsare asfollows:

0-negative 1-somewhatnegative 2-neutral 3-somewhatpositive 4–positive

3.2 Initial Inspection

First most, we can see from the given dataset that each sentenceisbrokenintomultiplephrases,eachhavingtheir own sentimentclassification.Eachsentence isdenoted by ‘SentenceId’column.Hence,westartwithcheckingoutthe totalnumberofuniquesentencesandunderstandingmore about the sentence structure within both the training and testdatasets.Wethenfindtheaveragecountofphrasesper sentence,perdataset.

Therearemuchlargerproportionofphrasescomparedto sentences in our dataset, and the average word length of phrases at 7 is quite low. This will need to be considered whencleaningthetextinordertostrike therightbalance betweenmakingthedataneaterandlosingtoomuchdata thatrendersMachineLearningmoredifficult.

3.3 Text Preprocessing

In this step, we combine both the datasets (training and testing)andclean.Noiseremovalisoneofthemostessential text preprocessing steps as it helps us to emit out certain characters,digitsetc.whichmighthinderourtextanalysis

Forexample,languagestopwords(forinstance-the,of,is, am,inetc.)URLsorpunctuationsandlinks-allofthesewill be considered as noise. Noise removal step deals with eliminationofalltypesofnoisyentitiespresentinthetextof ourdataset.

Typically,wewoulddefineafunctiontoclearallthisnoise, but this dataset contains a huge number of single word entries. Removing numbers, punctuation marks, special charactersandwholewordswouldwipeoutalargechunkof the available observations, potentially resulting in more trouble during our modeling. Therefore, we are going to keep them in and proceed with a more ‘light touch’ noise removalfunction.

Ourfirststepwouldbetointroducethebasictextcleaning function. We make all the text characters lowercase and removewhitespaces.Thenweperformconversiontostring. We apply this function and create a new column for the cleanedtext.Withthetextcleanednow,wecansplitthedata backintotrainingandtestsets.

Wenowmarkthesentimentcolumnas-999withinthetest set which indicates that there is zero chance of crosscontamination.

3.4 Preparation of Text Matrix

After the data pre-processing a text matrix is created for analyzing our dataset efficiently and making it algorithm compatible.Fordoingthesamewe’llbeusingtheapproach of“TF-IDF”(TermFrequency-InverseDocumentFrequency) model.

Inthismodel‘TF’canbedefinedasthenumberoftimesa term,say‘dog’,occursinadocumentandIDFcanbestated asthelogarithmoftheratiooftotaldocumentsavailableto the number of documents containing the term ‘dog’. Combined,theTF.IDFformulastatestherelativerelevance oftheterm‘dog’inalistofconsidereddocuments[6].

We’llfirstgetthetotaln-gramcountorallthesingle-ortwowordcombinationsfromourdataset.Onceextracted,we’ll proceed to check the sparsity or the number of non-zero valuespresentinourdataset.Oursparsitycomesouttobe 0.01%whichworksinourfavorasthesparserourdatais, themorechallengingit’llbeforustomodelit.Sparsityand missingdatacanbedistinguishedaswhenthereismissing data,itmeansthatmanydata pointsareunknown,onthe other hand, if the data is sparse, all the data points are known,butmostofthemhavezerovalue

Since we have calculated the term counts for both the trainingandtestingdatasets,wecanproceedwithapplying theTF.IDFtransformerforcalculatingtheweightsforeach terminboththesets.

©

Certified Journal | Page148

2022, IRJET | Impact Factor value: 7.529 | ISO 9001:2008

International Research Journal of Engineering and Technology (IRJET) e-ISSN:2395-0056

Volume: 09 Issue: 11 | Nov 2022 www.irjet.net p-ISSN:2395-0072

3.5 Model Selection

MachineLearningcanbedefinedasthestudyofcomputer programs that utilize algorithms and statistical models to learnthroughdeductionsandpatternswithoutbeingpreprogrammedtodoso.

ThereisahumongousnumberofMLalgorithmswhichcan becategorizedintothreedistinctlabels:

1.SupervisedLearning:Inthistype,thealgorithmisinitially trained on labeled data. The ML algorithm is first given a smalltrainingdatasettolearnfrom.Thistrainingdatasetisa part of the main dataset and has the role to teach the algorithm the basic nature of our problem, its probable solution,anddatapointstoworkwith.Thealgorithmthen looks for connections between the parameters given, establishing a cause-and-effect relationship between the variables in the dataset. At the end of this process, the algorithm has an overview of how the data fits in and the connectionbetweentheinputandtheoutput.Thissolution isthenputinplacetoemploythefinaldataset.

Dependingonthenatureofthetargetvariable,itcanbea discrete target variable task (Classification) model or a continuoustargetvariabletask(Regression)model.

● Classification Supervised Learning: It is a Supervised Learning task where output contains defined labels or discrete value. The main goal in this type of supervised learning is to predict discrete values belonging to a particularclassandthenassessthemonaccuracy.

● Regression Supervised Learning: It is a Supervised Learningtaskwhereoutputhasacontinuousvaluerather than the discrete values present in Classification learning. The main idea here is to predict a value as closer to the actual output value and then final evaluation is done by determiningtheerrorvalue.Theinsignificanttheerror,the bettertheaccuracyofourmodel.

ExamplesofSupervisedlearningalgorithms:

● LogisticRegression

● SupportVectorMachine

● DecisionForest

2. Unsupervised Learning: Unsupervised learning can be describedasthetrainingofanalgorithmusingdatathatis neither labeled nor classified and letting the algorithm deduceonthatdatawithoutanyguidance.Themajorroleof the machine is to gather unsorted information based on variousfactorslikepatterns,similaritiesanddifferences,all withoutanypriortraining.

UnsupervisedLearningcanbeclassifiedintothefollowing twocategories:

Clustering:Aclusteringproblemiswhereweaimtoclassify theinherentgroupingsinthedataset.

Association:Anassociationrulelearningproblemiswhere we aim to find orders that describe large portions of our dataset.

ExamplesofUnsupervisedmachinelearningare:

● K-meansclustering

● Fuzzymeans

3. Reinforcement Learning: In this case, the algorithm discoversdataviatrialanderrorandthenconcludeswhat action will result in higher efficiency. Three major components that comprises reinforcement learning: the agent, the environment,and the actions.Theagent can be considered as the learner and decision-maker, the environmentcompriseseverythingthattheagentinteracts with,andtheactionsarewhattheagentperforms.Onesuch exampleistheMarkovDecisionprocess.

Fig -1:ClassificationofMachineLearningAlgorithms

Now for our research’s sake, we’ll be considering Logistic Regression as our chief algorithm as it is convenient in handling multi class classification in a relatively shorter duration.

LogisticRegression:Itisasupervisedlearningclassification algorithm that can be used to predict the probability of a YES/NO event (binary event) happening. Therefore, the outcomemustbeadiscretevalue.ItcanbeeitherNoorYes, 1or0,etc.butinsteadofgivingtheprecisevalue,itgivesthe probabilisticvalueswhichliebetweenthetwoextremes[9].

TheequationusedbythebasiclogisticmodelisLn(p/1-p) =a0+a1*x+a2*xThisiscalledthelogisticfunction.

|

Certified Journal | Page149

IRJET

Impact Factor value: 7.529 | ISO 9001:2008

International Research Journal of Engineering and Technology (IRJET) e-ISSN:2395-0056

Volume: 09 Issue: 11 | Nov 2022 www.irjet.net p-ISSN:2395-0072

Fig - 2:PrincipalGraphforLogisticRegression

Thebasicideaofthisistotest,train,findaccuracyandthen create a confusion matrix. The classification report with metricssuchasprecision,Recall,F1scoreetc.willgiveus insightsabouthowourmodelisperforming.

Fig - 4:DeepLearningModel

Deep learning can properly analyze data in its sequential form, something where our traditional machine learning models fail, as depicted from the inadequate accuracy obtainedfromtheconventionalLogisticRegressionModel [4].

Fig - 3:ClassificationReport

The accuracy obtained from the above model is 56.2% which is not suitable for our requirements and therefore we’llbeturningtodeeplearning.

3.6 Deep Learning Approach

Deeplearningmimicsthebasicstructureandfunctionalityof the human brain for recognition tasks and patterns. Deep learning can obtain certain useful features which in turn increases the efficiency of our predictive model making it possibletoanalyzeourdatamoreprecisely[3].

Theoretically, the model used in deep learning is a mathematical function - f: X → Y. Deep learning is the evolution of a more in-depth Artificial Neural Network (ANN)usingmorethanonehiddenlayerformodelingofthe provideddata[11].

Itcontainsthreeprominentlayers:

1.InputLayer:thislayerreceivesdatafromvariableX.

2. Hidden Layer: This layer receives data from the input layer. Each hidden layer trains a unique set of features. Increasednumbersofhiddenlayersincreasethecomplexity andabstractionofourmodel.

3.OutputLayer:thelayer’sneuronsreceivedatafromthe last hidden layer producing an output value and is the culminationofthecalculationofthevariableXtovariableY.

Inthisapproachwe’llbeusingtheLongshort-termmemory (LSTM) algorithm. LSTM is a type of a Recurrent Neural Networks(RNN)whichisdistinguishedbyitsuniquenessof processingofsequentialinformationwhichisderivedfrom the internal memory stored by directed cycles and hence RNN can not only remember but use previous computed informationbyapplyingittotheforthcomingelement.

LSTM,moreover,iscapabletouselongmemoryasaninput for activating the functions present in the hidden layer. LSTMwasintroducedto overcomethevanishinggradient problem usually associated with typical RNNs models. In shortifasequenceislong,thenitbecomesdifficultforRNN tocarryacertainpieceofinformationfromoneinstancetoa previousone,afoibleremovedbyLSTMbyusingamemory cell[7].

Fig - 5:LSTMArchitecture

|

Certified Journal | Page150

IRJET

Impact Factor value: 7.529 | ISO 9001:2008

International Research Journal of Engineering and Technology (IRJET) e-ISSN:2395-0056

Volume: 09 Issue: 11 | Nov 2022 www.irjet.net p-ISSN:2395-0072

Furthermore, we’ll be using a CNN model alongside our LSTMmodeltoimprovetheoverallefficiency.

CNN (Convolutional Neural Networks) is a type of neural networks which is distinguishable from its counterparts becauseofitsabilitytoefficaciouslyprocessdatahavinga gridlikestructure,forexampleanimage[5].

ACNNconsistsofthreelayers:

1.Convolutionlayer:Thislayeristhefoundationalbuilding block and manages the significant portion of the computational load. Convolution layer is tasked with performingadotproductbetweentwomatrices,thekernel (setoflearnableparameters)andtherestrictedportionof thereceptivefield.

AfterloadingthenecessarylibrariesfromKeraspackage,we proceededwithcertainprocessingstepstoprepareourdata further.Thesestepsare:

1.Tokenization:Tokenizationcanbedefinedasbreakingof raw text into smaller parts also known as tokens. This is done to ease the understanding of the context and progressingthemodelbyanalyzingthesequenceofwords asnowthetokenscanbeconsideredasdiscreteelements.

2.Indexing:We’llperformindexingbygivingourwordsan indexandconvertingthemintoadictionary-likestructure. We’ll perform this by using texts_to_sequences () which converts our text into integer sequences, thus creating a mappingofwordsandtheirspecificintegerindices.

3. Index Representation and padding length: We’ll now transformourindexedlistintoa2Darray.Thisarraywillbe of the shape of (num_samples, num_timesteps) where num_timestepsiseitherthelongestsequencelengthorthe ‘maxlen’argument.Inourcasewe’llsetout‘maxlen’to20. Values are padded to sequences which are shorter than ‘maxlen’ and similarly sequences which are longer are curtailedtoobtainthedesiredlength.

Fig - 6:ConvolutionLayerFormula

2. Pooling layer: This layer is tasked with replacing the outputatcertainpositionswithaderivedsummarystatistics of nearby outputs which in turn helps in reducing the requiredamountofweightandcomputation.

Finally proceeding to our hybrid model of LSTM and CNN combined.We’llproceedwithmakinganembeddingmatrix usingthewordvectorstrainedonCommonCrawlandone hot encoding our y variable or training set. Our model containsacommencinglayerofLSTMthatwillcollectallthe wordembeddingscorrespondingtoeachtokenasaninput. Ourmainaimisthatthefinaloutputtokenswillcontainthe informationofallthetokensprecedingit;Orwecansay,the LSTM layer is fabricating a new encoding for the original input. The output is then transferred into a convolution layerwhosefunctionwillbetoextractallthelocalfeatures [8].

Fig - 7:PaddingLayerFormula

3. Fully Connected Layer: This layer helps in mapping the representation between the output and input. Neurons in thislayerhavecompleteconnectivitywithalltheneuronsin precedingandsucceedingstages.

Moving over, the CNNs output will be amalgamated to a smaller dimension and ultimately derived as a positive/negativelabel.Thefinalaccuracyreceivedusingthe hybridmodelis 86.12% whichisfarmoresatisfactorythan theoneobtainedviamachinelearning.

Fig - 8:CNNArchitecture

©

Certified Journal | Page151

2022, IRJET | Impact Factor value: 7.529 | ISO 9001:2008

International Research Journal of Engineering and Technology (IRJET) e-ISSN:2395-0056

Volume: 09 Issue: 11 | Nov 2022 www.irjet.net p-ISSN:2395-0072

AfterusingahybridmodelofLSTM-CNN,wedid receivean optimumperformancefromourmodel.Thecorresponding accuracyobtainedwas 86.12% whichcateredtoourneeds more.

Epoch 00009: val_loss improved from 0.31239 to 0.31127, saving model to best_model.hdf5

Epoch 10/15

39168/140454 [=======>......................] - ETA: 33s

- loss: 0.3120 - acc: 0.8612

Hence,wecanclearlyobservethatourdeeplearningmodel outperformsourmachinelearningmodelwhichisprimarily becausedeeplearningpossessestheabilitytocaptureany kindofsequentialdependencybetweendifferentwordsina sentence, this is made possible because of the Recurrent NeuralNetworks,whichLSTMisaspecialtypeof.

5. CONCLUSION

Fig - 9:HybridLSTM-CNNModel

4. RESULTS AND OBSERVATIONS

As we used a modified version of the infamous Rotten Tomatodatasetwhichisfurtherdividedintotrainingdataset andtestingdataset.Thecorrespondingsentimentlabelsthat wehaveusedare: 0-negative 1-somewhatnegative 2-neutral 3-somewhatpositive 4–positive

First, we have utilized the machine learning algorithmLogisticRegression.However,theaccuracyofthemodeli.e., 56.2% obtained from logistic regression was not satisfactorysoweturnedtowardsdeeplearning.

Todeterminewhichapproach(MachineLearningorDeep Learning)ismostsuitableforthepurposeofthisresearch i.e., for precise sentiment analysis of our dataset, we calculatedtheaccuraciesofourchosenarchitectures.

Ourresearchindicatesthatourhybriddeeplearningmodel ofLSTM-CNN(Accuracy-86.12%)complywithourneeds betterthanthemachinelearningmodel,LogisticRegression (Accuracy-56.9%)andhence,deeplearningalgorithmsare moresuitableforsentimentanalysisontherottentomato dataset.

REFERENCES

[1]LiuB.Sentimentanalysis:miningopinions,sentiments, andemotions.TheCambridgeUniversityPress,2015.

[3]PangBandLeeL.Opinionminingandsentimentanalysis. FoundationsandTrendsinInformation Retrieval, 2008.2(1–2):pp.1–135.

[4]Goodfellow I,BengioY,CourvilleA.Deeplearning.The MITPress.2016.

[5]GlorotX,BordesA,BengioY. Deepsparserectifier neural networks. In Proceedings of the International ConferenceonArtificialIntelligenceandStatistics(AISTATS 2011),2011.

[6] Rumelhart D.E, Hinton G.E, Williams R.J. Learning representations by back-propagating errors. Cognitive modeling,1988.

Fig

- 10:

ConfusionMatrixforLogisticRegressionModel

©

| Page152

2022, IRJET | Impact Factor value: 7.529 | ISO 9001:2008 Certified Journal

International Research Journal of Engineering and Technology (IRJET) e-ISSN:2395-0056 Volume: 09 Issue: 11 | Nov 2022 www.irjet.net p-ISSN:2395-0072

[7]CollobertR, Weston J,BottouL,KarlenM,Kavukcuoglu K,andKuksaP.Naturallanguageprocessing(almost)from scratch.Journalof MachineLearningResearch,2011.

[8] Goldberg Y. A primer on neural network models for naturallanguageprocessing.JournalofArtificialIntelligence Research,2016.

[9] L. Liu, X. Nie, and H. Wang,” Toward a Fuzzy Domain SentimentOntologyTreeforSentimentAnalysis,”Processed ofthe5thImageInternationalCongressonSignalProcessing (CISP),pp.1620–1624,2012.

[10] R. Srivastava, M. P. S. Bhatia,” Quantifying Modified OpinionStrength:AFuzzyInferenceSystemforSentiment Analysis,” International Conference on Advances in Computing,CommunicationsandInformatics(ICACCI),pp. 1512-1519,2013.

[11]ApoorvAgarwal,BoyiXie,Ilia Vovsha, Owen Rambow, Rebecca Passonneau. Sentiment Analysis on twitter data. DepartmentofComputerScience,ColumbiaUniversity,New York,NY10027USA.

2022, IRJET | Impact Factor value: 7.529 | ISO 9001:2008 Certified

©

| Page153

Journal

Turn static files into dynamic content formats.

Create a flipbook