Comparison analysis of Bangla news articles classification using support vector machine and logistic

TELKOMNIKATelecommunication,Computing,ElectronicsandControl

Vol.21,No.3,June2023,pp.584∼591

ISSN:1693-6930,DOI:10.12928/TELKOMNIKA.v21i3.23416

❒ 584

ComparisonanalysisofBanglanewsarticlesclassification usingsupportvectormachineandlogisticregression

MdGulzarHussain1,2,BabeSultana1,MahmudaRahman1,MdRashidulHasan1

1DepartmentofComputerScienceandEngineering,FacultyofScienceandEngineering,GreenUniversityofBangladesh, Dhaka,Bangladesh

2DepartmentofComputerApplicationTechnology,SchoolofComputerScienceandArtificialIntelligence,Changzhou University,Changzhou,Jiangsu,China

ArticleInfo

Articlehistory:

ReceivedFeb21,2022

RevisedOct29,2022

AcceptedNov12,2022

Keywords:

Banglanews

Banglatextclassification

Logisticregression

Naturallanguageprocessing

Newsclassification

Supportvectormachine

ABSTRACT

Intheinformationage,Banglanewsarticlesontheinternetarefast-growing.For organizing,everynewssitehasaparticularstructureandcategorization.Newsarticleclassificationisamethodtodetermineadocument’sclassificationbasedonvariouspredefinedcategories.ThisresearchdiscussestheclassificationofBanglanews articlesontheonlineplatformandtriestomakeconstructivecomparisonusingseveralclassificationalgorithms.ForBanglanewsarticlesclassification,termfrequencyinversedocumentfrequency(TF-IDF)weightingandcountvectorizerhavebeenused asafeatureextractionprocess,andtwocommonclassifiersnamedsupportvectormachine(SVM)andlogisticregression(LR)employedforclassifyingthedocuments.It isclearthattheaccuracyoftheexperimentalresultsbyapplyingSVMis84.0%and LRis81.0%fortwelvecategoriesofnewsarticles.Inthisresearchwork,whenwe havemadecomparisontworenownedclassificationalgorithmsappliedontheBangla newsarticles,LRwasoutperformedbySVM.

Thisisanopenaccessarticleunderthe CCBY-SA license.

CorrespondingAuthor:

MdGulzarHussain

DepartmentofComputerScienceandEngineering,FacultyofScienceandEngineering GreenUniversityofBangladesh,Dhaka,Bangladesh

Email:gulzar.ace@gmail.com

1. INTRODUCTION

Textdataisthemostcomprehensivesourceofinformationbutduetoitsunstructurednature,itis difficultandtimeconsumingtodrawinsightsfromit.Advancementinmachinelearning(ML)andnatural languageprocessing(NLP)aremakingiteasiertoanalyzetextdatathroughtextclassification.Textclassificationtechniquesareusedtoorganize,structureandcategorizetextdataforanalysisofsentiment,topic labelling,spamdetection,intentdetectionandsoon[1].Thisstudyappliestextclassificationtechniquesto Banglanewsarticlesasthenewsarticlesarethemostcommonformoftextonline.Manystudieshavebeen conductedontextclassification,anddifferentclassificationtechniquessuchasrules-basedmethod,decision trees,k-nearestneighbors(KNN),naiveBayes,logisticregression,supportvectormachines(SVM),neural networks(NN)andsoonhavebeendeveloped[2].Studiesintheliteratureofvariousresearchpapershave shownthatresearchershavefocusedonclassifyingtextsindifferentlanguages,suchasEnglishandArabic andsoon[3],[4]however,theamountofanalysisontheBengalilanguageismuchless.In[5]haveusedmachinelearningalgorithmstocategorizeBanglanewspaperarticlesintofivedistinctgroups.Theyusedlogistic regression,SVM,andmulti-layerneuralnetworktodoso.Multi-layerdenseneuralnetworkapproachhasa

Journalhomepage: http://telkomnika.uad.ac.id

accuracyof95.50%whichishighercomparedtoothermodels.Thereisarepresentation[6]ofthebehavior ofleast-squaresSVMs,twinSVMs,andleast-squarestwinSVM(LS-TWSVM)classifiersonthenewsdata isshowntohandlemulti-categorydata.AndtheirperformanceevaluationshowedthatLS-TWSVMisthe bestofallthreewith92.96%accuracy.Maisha etal. [7]performsentimentanalysisonBanglanewsusing ”pipeline”classalongwithsixstate-of-the-artsupervisedMLalgorithmswhichincludesdecisiontree(DT), multinomialnaiveBayes(MNB),k-nearestneighbor(KNN),logisticregression(LR),randomforest(RF)and lagrangiansupportvectormachine(LSVM).Randomforestalgorithmoutstandsallotheralgorithmssecuring 98%accuracyinpercentagesplitmethod.Thismethodologycategorizesthecontenteitheraspositiveornegative.RabbinnovandKobilov[8]usedSVM,decisiontreeclassifier,randomforest,LR,andMNBamongsix othermachine-learningalgorithmstoconductmulti-classtextcategorizationofinternetUzbeknewsarticles. ForSVM,theyusedradialbasisfunction(RBFSVM)whichgavethebestaccuracy(86.88%)performance ofotherclassifiers.Severalsupervisedmachinelearningaswellasdeeplearningalgorithmsforcategorizing Bengalinewsdocumentsarediscussedin[9].AnevermethodforclassifyingBanglatextualcontenthasbeen developedby[10].Thedeeplearningrecurrentneuralnetworks(RNN)-basedattentionlayerandtheRNN withBiLSTMachievedaccuracyratesof97.72%and86.56%,respectively.In[11]proposedamodelstructure namedtheDCLSTM-MLPmodelforthecategorizationofnewstextdocuments,anideaofacustomizedalgorithmthatcombinesdeeplearningalgorithmssuchaslongtermshortmemory(LSTM),convolutionalneural network(CNN),andmulti-layerperceptron(MLP).Byapplyingthismodelstructuretheyhavetriedtosolve theproblemsoftextuallength,thecomplexityofextractingfeaturesfromnewscontentandcategorizingnews texteffectivelyandachieved94.82%accuracy.However,themainissueoftheirpaperisthatthenumberof samplesissmall,andthedistributionofdifferenttypesofnewsisunequal,resultinginthemodel’snarrow effectiveness.

Kowsher etal.of[12]usedseveralwordembeddingmethodstoincorporatethetextfromBangla newspaperdata,aswellasmachinelearningalgorithmstocategorizetheincorporatedtext.Adeeprecurrent neuralNetworkwasusedtoprovideanewmethodofevaluatingBanglanewsarticlesin[13].Thedeep recurrentneuralnetworkfeaturingBi-LSTMobtained98.33%inBengalitextcategorization,whichisgreater thanpreviouswell-knownclassificationtechniques.Forclassifyingtextordocuments,manyofsupervised techniquesareusedsuchasKNN,naiveBayes(NB),DT,n-grams,neuralnetworks(NNet).Butaccordingto previousresearchliteraturereviewsSVM[14]ismostfrequentlyusedclassifieralgorithm.In[15]proposed anewsarticleclassificationmodelframeworkbasedondeephybridlearningandcomparedittotraditional textclassificationtodemonstratethesuperiorityofnetworknewstextclassification,anditoutperformsthe standardizedmethodforclassifyingnewstextsintermsofoverallperformance.In[16]showedacomparative analysisamongDT,KNN,NB,andRocchio’salgorithmwheretheirstudiessaidthatSVMoutperformsbetter thanallotherclassifiers.AsidefromtheEnglishdocument,therehasalsobeenextensiveresearchonother languages.In[17]focusedontheclassificationofself-createdIndonesiannewscorpus,whichincludesfour separatecategoriesand472Indonesiannewspaperarticlesfromavarietyofsources.Theyproducedfivemodels withtenepocheachusing377datafortrainingand95testingdata,withCNNhavingthehighestaccuracyrate atabout90.74percentcategorizingIndonesiannewsdata.In[18]conductedonanIndonesiannewscorpus foundthatthecombinationofTF-IDFandMNBbeatsotherclassificationmodelssuchasmultivariateBernoulli naiveBayes(BNB)andSVMwithanaccuracyof85%.IntermsofArabictextclassification,in[19]Arabic medicaltextdocumentsareclassifiedandauthorsusedanrule-basedclassifierforclassificationofArabictext andhavinganaccuracyof90.6%.Theyusedthreeclassificationalgorithm:majorityvoting,ordereddecision list,andK-NNwhichareusedtovalidatethemodel.

AmongallthecoupleworkscoveredintheBangla,Chy etal. [20]workedonBanglalanguagenews classificationandtheyusednaiveBayesclassifier.ButtheproblemisthatinthenaiveBayesmodelifthe testingdatasetcontainsacategoricalvariableofacategorythatwasnotpartofthetrainingdataset,itwill giveitzeroprobabilityandbeunabletomakeanypredictions.Nahar etal. [21]demonstratedacomparative analysisutilizingnaiveBayesclassifier,SVM,andneuralnetworkstofilterBanglasportsandpoliticalnews onlyononlinenetworksfromtextdata.MandalandSen[22]showedthecomparisonperformanceevaluationof the4supervisedlearningtechniques:decisiontree,naiveBayes,k-nearestneighbor,andSVMfortheBengali textcategorization.TheyusedTF-IDFtocompriseafeaturevectorusing1000webdocumentswhichcontains 22,218words.Closetoourresearchwork,byusingTF-IDFasafeatureselectionapproachandSVMasa classifier,theyattainedaclassificationaccuracyof92.57percentfor12categoriesofBanglatextfiles.They used3191textsamplespercategorywherewehavetriedtouse11770articlesoftop12categories(showed

ComparisonanalysisofBanglanewsarticlesclassificationusing...(MdGulzarHussain)

TELKOMNIKATelecommunComputElControl ❒ 585

inTable1)from20differentcategories.AndbestofourknowledgetheyusedonlySVMwherewehaveused SVMandlogisticregression(LR)andLRisprimarilyusedtoclassifyobservationsintoadiscretenumber ofcategorieswithrapidclassificationofunknownrecords.Inthisresearchpaper,wehavetriedtoshowa comparativeperformanceanalysisusingSVMandLRwhichcanbringrighteouspathforfutureresearchers whoarewillingtoworkwithBanglalanguageinthisera.

Inthisstudy,LRandSVMareusedtoclassifytheBanglanewsarticlesasthestudy[23]showsthat thelogisticregressionoutperformsothertechniquessuchasrandomforestandknearestneighboursalgorithm. ItclassifiestheBanglanewsarticlesandlabelthemtoacertaintopicbasedontheircontents.Itextractsfeatures fromthecorpususingtheTF-IDFfeaturevectorsinceitisenabledtotheextractionofrelevantfeaturesaswell asremovingcommonwords.

2. METHOD

Theaimofnewsclassificationseemstoallocatecategoriesasperthecontentofanewsarticle.PreprocessingofBanglanewsarticlesandfeaturesetextractionarealsoneededpriortotrainingandmodelconstructionfordocumentclassification,likeEnglishtextclassification.TheoverallBanglanewsarticleclassificationprocesshavebeenusedinthisexperimentasillustratedinFigure1.

Despitethelimitedamountofthedata,Hossain etal. [24]discoveredthattext-graphconvolutional neural(GCN)accomplishedbetterthanGRU-LSTM,BiLSTM,Char-CNN,LSTM,andbidirectionalencoder representationsfromtransformers(BERT)inclassifyingonlineBanglanews.Assofar,thereisascarcityof standarddatasetintheBanglalanguage,soadatasethavebeenpreparedbyscrapingthenewsarticlesfrom variouselectronicnewssitessuchas’https://www.prothomalo.com/’.Atthetimeofscrapping,thearticleswere labeledwiththeircategories.Toassesstherecommendationresults,wehavecollectedaround12.5Klabeled newsarticlesconsistingof20categories.Amongthemtop12categoriesareconsideredforthisresearchwork whichhasmentionedinTable1.

586 ❒

ISSN:1693-6930

Testing Training DataCleaning & Preprocessing Feature Selectionand Extraction Apply Classifier Algorithms BanglaNews Article TrainingData BanglaNews Article TestData Performance Evaluation CountVectorizer Tf-idf Vectorizer Training Dataset DataCleaning & Preprocessing Feature Selectionand Extraction Classifier Model

Figure1.Systemarchitectureoftheproposedwork 2.1. Datacollection

SerialCategoryArticlesSerialCategoryArticles 1Bangladeshnews50297Feedbacknews578 2Sportsnews13878Othernews447 3NorthAmericanews8099Citizennews414 4Entertainmentnews75010Newsofdistancemigrants393 5Internationalnews74611Lifestylenews264 6Economicnews71112Scienceandtechnologynews242 TELKOMNIKATelecommunComputElControl,Vol.21,No.3,June2023:584–591

Table1.Category-wisecountoftheBanglanewsarticles

2.2. Datapreprocessing

Preprocessingisperformedtominimizethenoiseinthetext,whichhelpstoincreasetheclassifier’s efficacy.Thepreprocessingstepsclearthetextdataandpreparingitfortheartificiallearningmodel.Thenthe tokenizationisperformedontextdocumentsandbreakdownthesequenceofcharactersorwordstogetthe features.Sincetextdocumentscontainsentences,likeasequenceofcharacterorwords,togetthefeatures,it shouldbebrokenintotokens.Tokenizationistheprocesstobreakdownthetextintosentencesandthenwords. Tokenizationistheprocessbywhichthetextisdividedintophrasesandthenwords.Aftertokenization, differentsymbolssuchas!,x,¿,¡,$,%,andnumbers,whicharenotveryimportantforclassification,are excluded.Stop-wordsinBanglaarealsoeliminatedatthesameperiod.In[25]isalistofstop-wordsusedin thisresearch.

2.3. Featureextraction

Thegeneralmethodofconversionofacollectionoftextdocumentstonumericalfeaturevectors isvectorization.Countvectorizertransformsatextdocumentarrayintoatokencountingmatrix:contains tokencountsoccurrencesineachdocument.Asparserepresentationofthenumbersisgeneratedbythis implementation.Ontheotherhand,thepurposeofusingTF-IDFinsteadoftherawfrequenciesofatoken’s occurrenceinagivendocumentistoscaledowntheinfluenceoftokensthatoccurinagivencorpusvery regularly.Andthusempiricallylessinformativecharacteristicsthatoccurinasmallfractionofthetraining corpusareremoved.Itisamethodofdataretrievalthatweightsthefrequencyofwords(TF)andtheinversed documentfrequency(IDF).Everywordhasit’sownTFandIDFratings,respectively.TheTF-IDFweightis theproductofaterm’sTFandIDFscores.Therarerthewordandviceversa,thehighertheTF-IDFscore.

2.4. Classifiers

Inthisstudy,twocommonclassifiers:theSVMclassifierwithalinearkernelandthelogisticregressionhavebeenused.Intextclassification,SVMhasbeenusedeffectively.Andlogisticregressionisa statisticalmethodofdataanalysisinwhichoneormorevariablesareusedtodeterminetheresult.

2.4.1. SVM

Inessence,SVMisasupervisedmachinelearningapproachknownasabinaryclassifier.In[25] useshyper-planetoclassifydataintotwotypes.Supportvectorsareclosertothe-hyper-planedatapointsthat influencethepositionandorientationofthehyper-plane.Thesupportvectorisusedtooptimizetheperimeter oftheclassifier.Byeliminatingthesupportvectors,thehyper-plane’spositionwillchange.SVMrecognizes thefollowingsignfunctionofequation[26]mathematically,

��(x)=sin(wx + b) (1)

where w isan n weightedvectorin Rn .Bydividingthespace Rn intotwohalf-spaceswiththemaximum margin,SVMfindsthehyper-planein(2).

y = wx + b (2)

Generally,SVMisabinaryclassifier,butthestrategieslikeone-to-oneandone-vs-restcanbeusedto expandintoamulti-classclassifier.InSVM,linearandradialbasisfunctionkernelsareusedtomakedecisions. Sincemosttextsarelinearlyseparable,thelinearkernelispreferredfortextclassification.

2.4.2. Logisticregression

Logisticregressionusesalogisticfunctiontoestimateprobabilitiesfortherelationshipbetweenone ormoreindependentvariablesandthedependentvariableofthecategories.Thelogisticfunctionequationis alsoknownasthelogisticcurvewhichisacommon“S”shapedcurvedescribedbythe(3).Thesigmoidcurve isanothernameforthelogisticcurve.

(3)

Where, L isthecurve’shighestvalue, e thebaseofanaturallogarithm(oreuler’snumber), k isthelogistic growthrateorcurvesteepness, v0 isthesigmoidmidpoint’s x-value.

ComparisonanalysisofBanglanewsarticlesclassificationusing...(MdGulzarHussain)

TELKOMNIKATelecommunComputElControl ❒ 587

L 1+ e k

��(x)=

(v v0 )

3. RESULTSANDDISCUSSION

TheperformanceofSVMclassifiersandlogisticregrassiononourdata-setisdiscussedinthissection. Alittleanalysisonthedata-setisprovidedhere.Later,theperformanceanalysisofSVMandLRontheBangla textdataandcomparisonwithsomesimilarworksaredemonstratedinTable4.

3.1. Datasetanalysis

Among12.5thousandofarticleswith20categoriesweused11770articlesoftop12categoriesin thisresearchwork.Bar-chartfornumberofnewsarticlesofeachcategoriesaregiveninFigure2.InFigure 3generatedwordcloudbasedonTF-IDFandcountvectorizerfromtotalarticlesaregiven.Thisisgenerated beforerunningthepreprocessingonthedata.

Wecanseemanyunnecessarywordsareavailableinthecloudsandthecloud-basedonTF-IDF isdifferentthanthecloud-basedoncountvectorizervalue.Butsomewordsarehavingaroundthesame importanceforbothTF-IDFandcountvectorizervalue.Wehavesplitthedataintotwoparts,fortrainingthe model,80%ofthedataisusedandtheremaining20%isusedtotesttheperformance.

TELKOMNIKATelecommunComputElControl,Vol.21,No.3,June2023:584–591

588 ❒

ISSN:1693-6930

Figure2.NumbersoftheBanglanewsarticlesforeachcategory Figure3.WordcloudforthearticlesusedbasedonTF-IDF(left)andcountvectorizer(right)

3.2. Performancemeasures

In-textcategorization,avarietyofevaluationmetricsareemployed.Theprecision,recall,andFmeasureareamongthemostwidelyusedperformancemeasuresconsideredinourexperiments.Eachcategory’sprecision,recall,andF-measureexperimentaremeasuredforSVMandLR.TheF-measureisaveraged toassessperformancethroughoutcategories.Microaverageandmacroaveragearethetwotypesofaverage valueandthemacroaverageisused.Table2showhowtwoclassifiersperformedonourdata-setcorpusin termsofprecision,recall,andF-measure.OntheotherhandTable3showscomparisonofaccuracy,average precision,averagerecallandaverageF1-scoreresultforSVMandLRclassifier.

Table2.SVMandLRclassifierresultsof12categories

ItisobservedthatSVMwithlinearkernelachievesaccuracyof0.84andLRachieves0.81from Table3andSVMoutperformsLRincaseofaverageprecision,recallandF1-score.ThoughfromTable2it isfoundthatforsomeclassesorcategoriesLRoutperformsSVMincaseofindividualprecision,recalland F1-score.Figure4showstheconfusionmatrixforSVMclassifierwiththelinearkernelandlogisticregression respectively.Table4showsthecomparisonbetweentherecentworksusingSVMclassifierfindingswiththe resultsofthisresearchfinding.Thecomparisonshowsthatourresearchworkutilizingthecombinationof TF-IDFandcountvectorizerfeaturesintheSVMclassifierachievesbetteraccuracythanotherrecentworks whichalsousedSVMintheirresearchworks.

Table3.Accuracy,averageprecision,averagerecallandaverageF1-scoreresultofSVMandLRclassifier AccuracyPrecision(Avg)Recall(Avg)F1-score(Avg)

TELKOMNIKATelecommunComputElControl ❒ 589

Categ.PrecisionRecallF1-ScoreCateg.PrecisionRecallF1-Score SVMLRSVMLRSVMLRSVMLRSVMLRSVMLR 00.890.840.960.960.920.9060.690.640.740.690.710.66 10.940.910.980.980.960.9570.770.780.780.760.770.77 20.740.780.720.680.730.7380.510.500.210.130.290.21 30.880.840.870.830.880.8490.590.600.490.410.530.49 40.700.650.680.610.690.63100.880.930.610.490.720.64 50.810.800.780.730.790.76110.690.680.570.430.620.53

SVMLRSVMLRSVMLRSVMLR 0.840.810.760.750.700.640.720.68

Figure4.Confusion-matrixforSVM(left)andLR(right) ComparisonanalysisofBanglanewsarticlesclassificationusing...(MdGulzarHussain)

ISSN:1693-6930

Table4.ComparisonbetweenthisexperimentandrecentsimilarresearchesusingSVM ExperimentsFeatureusedAccuracy

ThisexperimentTF-IDF+Countvectorizer0.84 Islam etal. [27]TrigramTF-IDF0.80 Rahman etal. [28]TF-IDF0.82 Yeasmin etal. [5]TF-IDF0.83

4. CONCLUSION

Inthefieldofinformationsystems,textcategorizationisacontentiousissue.Inthispaper,wehave usedSVMandLRclassifiersonowndevelopedcorpusandtheirperformanceismeasured.Theproposed methodologyinthisstudyadvocatestheassumptionthattheBanglalanguagecanindeedberightlyclassified usingSVMandLRwithlimitedresources.Theoutcomeofthesuggestedapproachispromising.Still,more precisioncouldhaveattained.Goodoutcomescouldalsohaveobtainedifwehadbeenabletodealwithall ofthenewscategoriesandutilizemultiplecategorizationmethodstocomewithaconstructiveopinion.Inthe next,we’dliketoaddmorecategoriesandcomparetheresultsusingdifferentclassificationalgorithms.

ACKNOWLEDGEMENT

ThisworkwassupportedinpartbytheCenterforResearch,Innovation,andTransformationofGreen UniversityofBangladesh.

REFERENCES

[1] X.Zhou etal., “Asurveyontextclassificationanditsapplications,” WebIntelligence,vol.18,no.3,pp.205-216,2020,doi: 10.3233/WEB-200442.

[2] K.Kowsari,K.J.Meimandi,M.Heidarysafa,S.Mendu,L.Barnes,andD.Brown,“Textclassificationalgorithms:Asurvey,” Information,vol.10,no.4,2019,doi:10.3390/info10040150.

[3] A.Fesseha,S.Xiong,E.D.Emiru,M.Diallo,andA.Dahou,“Textclassificationbasedonconvolutionalneuralnetworksandword embeddingforlow-resourcelanguages:Tigrinya,” Information,vol.12,no.2,2021,doi:10.3390/info12020052.

[4] K.T.Nwet,“Machinelearningalgorithmsformyanmarnewsclassification,” InternationalJournalonNaturalLanguageComputing, vol.8,no.4,pp.17–24,2019,doi:10.5121/ijnlc.2019.8402.

[5] S.Yeasmin,R.Kuri,A.R.M.M.H.Rana,A.Uddin,A.Q.M.S.U.Pathan,andH.Riaz,“Multi-categorybanglanewsclassification usingmachinelearningclassifiersandmulti-layerdenseneuralnetwork,” InternationalJournalofAdvancedComputerScienceand Applications,vol.12,no.5,2021,doi:10.14569/IJACSA.2021.0120588.

[6] P.SaigalandV.Khanna,“Multi-categorynewsclassificationusingsupportvectormachinebasedclassifiers,” SNAppliedSciences, vol.2,2020,doi:10.1007/s42452-020-2266-6.

[7] S.J.Maisha,N.Nafisa,andA.K.M.Masum,“SupervisedmachinelearningalgorithmsforsentimentanalysisofBanglanewspaper,” InternationalJournalofInnovativeComputing,vol.11,no.2,pp.15–23,2021,doi:10.11113/ijic.v11n2.321.

[8] I.M.RabbimovandS.S.Kobilov,“Multi-classtextclassificationofuzbeknewsarticlesusingmachinelearning,”in Journalof Physics:ConferenceSeries,2020,doi:10.1088/1742-6596/1546/1/012097.

[9] M.R.Hossain,S.Sarkar,andM.Rahman,“Differentmachinelearningbasedapproachesofbaselineanddeeplearning modelsforbengalinewscategorization,” InternationalJournalofComputerApplications,vol.176,no.18,pp.10-16,doi: 10.5120/ijca2020920107.

[10] M.Ahmed,P.Chakraborty,andT.Choudhury,“BanglaDocumentCategorizationUsingDeepRNNModelwithAttentionMechanism,” CyberIntelligenceandInformationRetrieval,vol.291,pp.137–147,2021.

[11] M.Zhang,“Applicationsofdeeplearninginnewstextclassification,” ScientificProgramming,vol.2021,2021,doi: 10.1155/2021/6095354.

[12] M.Kowsher,A.Tahabilder,N.J.Prottasha,M.A-Rakib,M.M.Uddin,andP.Saha,“Banglatopicclassificationusingsupervised learning,” ComputationalIntelligenceinPatternRecognition,vol.1349,pp.505–518,2022.

[13] S.RahmanandP.Chakraborty,“Bangladocumentclassificationusingdeeprecurrentneuralnetworkwithbilstm,”in PComputer Science,2021.

[14] C.CortesandV.Vapnik,“Support-vectornetworks,” MachineLearning,vol.20,pp.273–297,1995,doi:10.1007/BF00994018.

[15] N.SunandC.Du,“Newstextclassificationmethodandsimulationbasedonthehybriddeeplearningmodel,” Complexity,vol. 2021,2021,doi:10.1155/2021/8064579.

[16] P.Y.PawarandS.H.Gawande,“Acomparativestudyondifferenttypesofapproachestotextcategorization,” InternationalJournal ofMachineLearningandComputing,vol.2,no.4,pp.423-426,2012.[Online].Available:http://www.ijmlc.org/papers/158C01020-R001.pdf

[17] M.A.Ramdhani,D.S.Maylawati,andT.Mantoro,“Indonesiannewsclassificationusingconvolutionalneuralnetwork,” Indonesian JournalofElectricalEngineeringandComputerScience,vol.19,no.2,pp.1000–1009,2020,doi:10.11591/ijeecs.v19.i2.pp10001009.

[18] R.Wongso,F.A.Luwinda,B.C.Trisnajaya,O.Rusli,Rudy,“NewsarticletextclassificationinIndonesianlanguage,” Procedia ComputerScience,vol.116,pp.137–143,2017,doi:10.1016/j.procs.2017.10.039.

TELKOMNIKATelecommunComputElControl,Vol.21,No.3,June2023:584–591

590 ❒

[19] Q.A.Al-RadaidehandS.S.Al-Khateeb,“Anassociativerule-basedclassifierforarabicmedicaltext,” InternationalJournalof KnowledgeEngineeringandDataMining,vol.3,no.3-4,pp.255–273,2015,doi:10.1504/IJKEDM.2015.074071.

[20] A.N.Chy,M.H.Seddiqui,andS.Das,“Banglanewsclassificationusingnaivebayesclassifier,” 16thInt’lConf.Computerand InformationTechnology,2014,pp.366-371,doi:10.1109/ICCITechn.2014.6997369.

[21] L.Nahar,Z.Sultana,N.Jahan,andU.Jannat,“FilteringBengaliPoliticalandSportsNewsofSocialMediafromTextualInformation,” 20191stInternationalConferenceonAdvancesinScience,EngineeringandRoboticsTechnology(ICASERT),2019,pp.1-6, doi:10.1109/ICASERT.2019.8934605.

[22] A.K.MandalandR.Sen,“Supervisedlearningmethodsforbanglawebdocumentcategorization,” InternationalJournalofArtificial IntelligenceApplications,vol.5,no.5,2014,doi:10.5121/ijaia.2014.5508.

[23] K.Shah,H.Patel,D.Sanghvi,andM.Shah,“Acomparativeanalysisoflogisticregression,randomforestandKNNmodelsforthe textclassification,” AugmentedHumanResearch,vol.5,pp.1–16,2020,doi:10.1007/s41133-020-00032-0.

[24] M.R.Hossain,M.M.Hoque,N.Siddique,andI.H.Sarker,“Bengalitextdocumentcategorizationbasedonverydeepconvolution neuralnetwork,” ExpertSystemswithApplications,vol.184,2021,doi:10.1016/j.eswa.2021.115394.

[25] Genediazjr,“Stopwords-ISO/stopwords-BN:Bengalistopwordscollection,”in Github,Oct2016.Online.[Available]: https://github.com/stopwords-iso/stopwords-bn(Accessed:December20,2022).

[26]

[27]

[28]

J.Kim,J.Yoo,H.Lim,H.Qiu,Z.Kozareva,andA.Galstyan,“Sentimentpredictionusingcollaborativefiltering,”in Proceedings oftheInternationalAAAIConferenceonWebandSocialMedia,vol.7,no.1,pp.685–688,2013,doi:10.1609/icwsm.v7i1.14461.

T.Islam,A.I.Prince,M.M.Z.Khan,M.I.Jabiullah,andM.T.Habib,“Anin-depthexplorationofBanglablogpostclassification,” BulletinofElectricalEngineeringandInformatics,vol.10,no.2,pp.742–749,2021,doi:10.11591/eei.v10i2.2873.

S.Rahman,S.K.Mithila,A.AktherandK.M.Alam,“AnEmpiricalStudyofMachineLearning-basedBanglaNewsClassification Methods,” 202112thInternationalConferenceonComputingCommunicationandNetworkingTechnologies(ICCCNT),pp.1-6, 2021,doi:10.1109/ICCCNT51525.2021.9579655.

BIOGRAPHIESOFAUTHORS

MdGulzarHussain wasborninNimnagar,DinajpurCity.HereceivedtheBachelorofSciencedegree(2018)inComputerScienceandEngineeringfromtheGreenUniversityof Bangladesh,Dhaka,Bangladesh.Since2019,hehasbeenaLecturerwiththeComputerScience andEngineeringDepartment,GreenUniversityofBangladesh,Dhaka,Bangladesh.Heiscurrently studyingMScinComputerScienceandTechnologyinChangzhouUniversity,Changzhou,Jiangsu, China.HeisalsoaffiliatedwithIEEEasstudentmemberandIAERasprofessionalmember.HisresearchinterestsincludeArtificialIntelligence,MachineLearning,DeepLearning,NaturalLanguage Processing,TextMining,andTopicModeling.Hecanbecontactedatemail:gulzar.ace@gmail.com.

BabeSultana wasbornin,Cox’sBazar,Bangladesh,in1994.ShereceivedherB.Sc.DegreeinComputerScienceandEngineeringfromGreenUniversityofBangladesh(GUB)in2018.At presentsheisworkingasaLecturer,Dept.ofCSE,GreenUniversityofBangladesh.Also,herpublicationwasabout”Multi-modeProjectSchedulingwithLimitedResourceandBudgetConstraints” publishedinInternationalConferenceonInnovationinEngineeringandTechnology(ICIET)27-28 December,2018andshegotBestPaperAwardandIEEEBestPaperAwardonthisconference. HerresearchinterestsincludeTheoryofOptimization,NaturallanguageProcessing,Renewableand SustainableEnergy.Shecanbecontactedatemail:babecse@gmail.com.

MahmudaRahman recivedherMaster’sinCSfromJahangirnagarUniversityandB.Sc. inCSEfromGreenUniversityofBangladesh.ShehasrecievdViceChancellor’sGoldMedalaward forheroutstandingacademicperformancein4thconvocationofGreenUniversityofBangladesh. ShehasbeenworkingasaLecturer,Dept.ofCSE,GreenUniversityofBangladeshsince2019. HerresearchinterestincludesNaturalLanuageProcessing,MedicalimageprocessingandMachine learning.Shecanbecontactedatemail:aurthi018@gmail.com.

MdRashidulHasan receivedtheBachelorofSciencedegreeinComputerScienceand EngineeringfromGreenUniversityofBangladesh.HewasborninPabna,Bangladeshin1994.His researchinterestincludeMachineLearning,NaturalLanguageProcessing,andDataMining.Hecan becontactedatemail:rashidulhasan777@gmail.com.

ComparisonanalysisofBanglanewsarticlesclassificationusing...(MdGulzarHussain)

TELKOMNIKATelecommunComputElControl ❒ 591

Turn static files into dynamic content formats.

Create a flipbook

Comparison analysis of Bangla news articles classification using support vector machine and logistic

Published on Apr 26, 2023

ICWT1 TELKOMNIKA

In the information age, Bangla news articles on the internet are fast-growing. For organizing, every news site has a particular structure and categorization. News article classification is a method to determine a document’s classification based on various predefined categories. This research discusses the classification of Bangla news articles on the online platform and tries to make constructive comparison using several classification algorithms. For Bangla news articles classification, term frequencyinverse document frequency (TF-IDF) weighting and count vectorizer have been used as a feature extraction process, and two common classifiers named support vector machine (SVM) and logistic regression (LR) employed for classifying the documents. It is clear that the accuracy of the experimental results by applying SVM is 84.0% and LR is 81.0% for twelve categories of news articles. In this research work, when we have made comparison two renowned classification algorithms applied on the Bangla news articles, LR w