Big Data projects are frequently managed with Hadoop technology. Hadoop is currently the de facto standard and is used for analyzing large amounts of unstructured data, as well as certain structured data [7]. The Hadoop Distributed File System (HDFS) is intended to store very large data sets reliably and to transmit such data sets to user applications at high bandwidth [8]. As a result, offering SQL analytic capabilities over huge data stored in HDFS is becoming increasingly critical. Although alternative SQL-on-Hadoop systems exist, such as HortonWorks Stinger and Cloudera Impala, Hive is a pioneering system that provides SQL-like analysis of HDFS data. We have used Hive for the experimentation presented here.
Key Words: Big Data, Hive, Hadoop, ORC, Parquet, Avro, Performance Evaluation

1. INTRODUCTION
Volume: 09 Issue: 01 | Jan 2022 | www.irjet.net | p-ISSN: 2395-0072 | © 2022, IRJET | Impact Factor value: 7.529 | ISO 9001:2008 Certified Journal
Data drives the 21st century and many more generations to come. Vast amounts of data are being generated and collected every minute from various devices, software, emails, transactional data, organizational data, and user data collected by various big tech giants, on the scale of zettabytes. As the data increases, the physical storage space must also increase to cater to such humongous data. Storing and preserving this data becomes a significant challenge when data grows exponentially, as does accessing it in real time. This is where storage formats come into the picture. Different types of data formats are used to store big data. This paper will compare four data formats, Avro, ORC, Parquet, and Textfile, covering their pros and cons and their usage.
HDFS storage space can be utilized more efficiently according to the task defined, and several binary data storage formats exist inside HDFS [4]. Some of them are RCFile, ORC, Avro, and Parquet [4]. These formats are designed for systems that use MapReduce-style frameworks, and each is a systematic combination of multiple components, including data storage format, data compression, and optimization techniques for data reading [4].
International Research Journal of Engineering and Technology (IRJET) | e-ISSN: 2395-0056
2. BACKGROUND
Vishal Naidu
The data storage formats mentioned in the Introduction section (Text/CSV [9], JSON [10], Avro [11], SequenceFile [12], RCFile [13], ORC file [14], Parquet [15]) have some advantages and disadvantages. Only Avro, SequenceFile, RCFile, ORC file, and Parquet offer compressed storage. Furthermore, the Avro and Parquet data formats offer schema evolution. This is the primary reason Avro and Parquet were selected for the studies.
Performance Enhancement using Appropriate File Formats in Big Data Hadoop Ecosystem
The amount of data captured by enterprise organizations, social media, advertising companies, and various applications increases exponentially. Hadoop is one of the popular open-source Big Data frameworks in the industry today, capable of carrying out common Big Data related tasks [3]. The file system that Hadoop supports is called the Hadoop Distributed File System (HDFS). HDFS is a traditional way to store big data at terabyte and petabyte scale because of its reliability, scalability, and fault-tolerant nature. HDFS contains a number of server machines/nodes, on the scale of hundreds to thousands, across which the data is distributed and stored in parts. Traditional solutions are only efficient for specific file sizes or file formats [2]. HDFS provides high-throughput access to application data and is suitable for applications that have large data sets [1]. The main advantage of HDFS is that it is highly fault-tolerant, and it can be deployed on low-cost commodity hardware. Handling and understanding a huge amount of data is a big challenge. HDFS supports many file formats, but choosing a specific format depends on the use case, and many factors must be considered.

Raw data is generally stored in a text format such as CSV, Text file, JSON, or XML. These are also known as human-readable formats, i.e., they can be read and edited by humans. However, the issue with raw data formats is that they take a tremendous amount of storage space. Storing raw data in HDFS is an even more significant issue because, in HDFS, each data block is replicated three times. So, the amount of data stored in HDFS is replicated three times, thus increasing the size to three times the original file size. Therefore, it becomes crucial to compress the files stored in HDFS.
Avro is a row-based, cross-language file format in Hadoop and a schema-based serialization technique. Avro was created with a primary goal of schema evolution, and the schema is segregated from the data, unlike in many other traditional file formats. Data can be written with no prior knowledge of the schema, and the resulting serialized data is smaller in size as compared to the source. Avro stores its schema in JSON format, making it easy to read and interpret by any programming language. Avro creates a binary structured format that is both compressible and splittable. As a result, it can be efficiently used as input to Hadoop MapReduce jobs. Avro can capture transactions such as Update, Insert, and Delete logs, so it is widely used for OLTP (online transaction
Department of Electronics and Telecommunication, Ramrao Adik Institute of Technology, Mumbai, India
***
Abstract

Parquet is an open-source storage format designed to bring an efficient columnar storage layout to Hadoop. Parquet has a nested structure, and data is populated sparsely. The idea behind a columnar format is straightforward: instead of storing the records row by row, store the records by column, which is beneficial for analytical processing. Multiple types of data can be converted to and stored in Parquet format. Parquet supports well-organized compression and encoding schemes. Parquet makes efficient columnar data representation available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language [5]. The main advantage of Parquet is that a query can read and operate on all values for a column while reading only a small fraction of the data from a data file or table.
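The columnar advantage described above can be sketched in HiveQL. This is an illustrative example only; the table and column names are hypothetical, not taken from the paper's experiments.

```sql
-- Illustrative Parquet-backed table (hypothetical schema).
CREATE TABLE sales_parquet (
  order_id  BIGINT,
  customer  STRING,
  amount    DOUBLE,
  order_dt  STRING
)
STORED AS PARQUET;

-- An aggregation over one column only needs to read the 'amount'
-- column chunks from disk, not every full row, which is the
-- column-pruning benefit described above.
SELECT SUM(amount) FROM sales_parquet;
```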
Question 2: Which data format (Avro, Parquet and ORC) is more compact?
processing), and it has noteworthy performance in updating, deleting, inserting, and querying all columns of rows or rows of the table [4]. Avro can easily handle changes in schema, such as additions or changes to existing fields; thus, new programs can read old data, and old programs can read new data. Avro is extremely useful in systems where data is written continuously.
Fig -1: Avro File Format Structure
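The schema evolution behavior described above can be sketched with an Avro-backed Hive table. This is a hedged sketch: the table and column names are hypothetical, used only to illustrate the mechanism.

```sql
-- Hypothetical Avro-backed Hive table.
CREATE TABLE events_avro (
  event_id   BIGINT,
  event_type STRING,
  payload    STRING
)
STORED AS AVRO;

-- Schema evolution: a column added later does not invalidate old data;
-- rows written before the change are read back with NULL for the new
-- field, so new programs can read old data.
ALTER TABLE events_avro ADD COLUMNS (source STRING);
```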
3. GOALS AND OBJECTIVES
The fundamental question is: what are the performance differences between Parquet and Avro in terms of query execution time?
ORC, Optimized Row Columnar, is the most widely used file format in Hadoop. It can store data in a more optimized way than the other file formats. Original data can be reduced by about 75%; due to this, there is an increase in data processing speed, and it shows better performance than the other file formats discussed. An ORC file contains rows of data organized into stripes and a file footer. When Hive is processing data, the ORC format enhances speed. We are unable to load data into ORCFILE directly. We must first load data into another table and then move it into our freshly formed ORCFILE. It has various advantages over other file formats: it stores columns separately, stores statistics (Min, Max, Sum, Count), has a lightweight index, and skips blocks of rows that are not part of the query.
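The two-step load into ORC described above can be sketched in HiveQL. The table names, columns, and file path here are hypothetical placeholders, not the paper's actual schema.

```sql
-- Step 1: stage the raw data in a plain-text table (hypothetical schema).
CREATE TABLE staging_text (id INT, name STRING, amount DOUBLE)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

LOAD DATA INPATH '/data/raw/input.csv' INTO TABLE staging_text;

-- Step 2: ORC cannot be loaded directly from raw files, so populate
-- the ORC table from the staging table instead.
CREATE TABLE final_orc (id INT, name STRING, amount DOUBLE)
STORED AS ORC;

INSERT INTO TABLE final_orc SELECT * FROM staging_text;
```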
From the Introduction and Background sections of the article, we see how important it becomes to select the right file format to process our jobs, and also that doing so will in turn help processing speed in Hadoop ecosystem tools like Hive. We take up the task of performing experiments to answer these questions:
Question 1: What are the differences in performance (query execution time) between Avro, Parquet and ORC?
Fig -3: ORC File Format Structure
The experiment has been chosen as the research method to address the research questions. There are five steps to the experimenting process. Scoping is the first step, followed by planning, execution, analysis, interpretation, and finally reporting. Independent variables have been defined in order to formulate the scope of the experiments. The data format type (Avro / Parquet) has been designated as an independent variable, while performance and compactness have been designated as dependent variables. As a result, the experiment's scope has been defined as follows: analyze the data formats Avro and Parquet with the purpose of comparing performance and compactness from a
Fig -2: Parquet File Format Structure




Now we look into the aspect of how data is stored into tables of different data formats. We have inserted different numbers of records into all the tables and have seen that the ORC file format performs better in terms of data insertion. The operational time taken is quite low as compared to the rest of the file formats that we have discussed.
4.3 Queries Performed
5. EXPERIMENTAL RESULTS
After scoping and planning, the operation stage has been performed. Organizing the experiments includes preparation, execution, and data validation tasks that are described in the next section.
Various data formats and data from different databases are used for this experiment. The data being used was scaled 30 times, which is estimated to be around 300 GB. All the data has been sent to the HDFS directory of the Hadoop cluster. As a plan, we have first sent all the data in plain text format, and then those data have been converted to the different formats. Different tables have different numbers of rows of data in them; the experiment then compares all the formats presented. Data is loaded into the Hive table using different methods for different file formats. Plain text tables have been created using the CREATE TABLE method with "stored as TEXTFILE", the Parquet files have been stored using "stored as PARQUET", the files in Avro format have been stored using "stored as AVRO", and finally the files in ORC have been stored using "stored as ORC".
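The four table variants described above can be sketched as follows. The column list here is a hypothetical placeholder, since the paper does not list the actual schema; only the STORED AS clauses are taken from the text.

```sql
-- One table per storage format (illustrative two-column schema).
CREATE TABLE t_text    (id INT, val STRING) STORED AS TEXTFILE;
CREATE TABLE t_parquet (id INT, val STRING) STORED AS PARQUET;
CREATE TABLE t_avro    (id INT, val STRING) STORED AS AVRO;
CREATE TABLE t_orc     (id INT, val STRING) STORED AS ORC;

-- Converting the plain-text data into one of the binary formats:
INSERT INTO TABLE t_orc SELECT * FROM t_text;
```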
4.1 Cluster Specifications
This section presents the results of the experiments and the answers to the research questions.
Loading data into Hadoop and converting it from plain text to Avro and Parquet formats saves a lot of storage space. The same data needs 2 times less storage space in Avro format and 3 times less in Parquet format, as seen in Fig. 8. This is a response to RQ.2, the second research question: which data format is more compact (Avro or Parquet)? As a result, Parquet has a smaller footprint than Avro. Despite the fact that Avro and Parquet both use the Snappy compression algorithm, the difference between the two demonstrates that Parquet is roughly 1.5 times more compact than Avro.
4. RESEARCH METHODOLOGY
For example, MapReduce's processing method differs from Spark's DAG approach. The HDP Hadoop distribution from Hortonworks has been chosen for this paper. The key reason for this is the platform's widespread popularity due to its openness. More open-source Hadoop ecosystem projects have been merged into it than into any other platform. As a result, it enjoys greater acceptance among businesses because it does not result in vendor lock-in.
researcher's perspective in the context of a Big Data storage format. The experiments chose Avro and Parquet based on the assumption that Avro's row-oriented data access should provide better performance on scan queries, e.g., when all columns are of interest, while the Parquet format should provide better performance on column-oriented queries, e.g., when only a subset of the columns is selected.
These experiments were performed in a Hadoop cluster based on a HortonWorks setup. The Hortonworks version we are using for this test is version 2.6, one of the most stable versions released by Hortonworks. The cluster has been set up in such a way that it can seamlessly support large text processing. Two name nodes are running in a highly available manner; this is the number of master nodes recommended by Hortonworks. The remaining 10 data nodes run the worker roles for the Hadoop services. This is an empirically chosen number of data nodes.
The queries were performed on all the file formats. The various queries were picked up from a Git repo [29], which has the necessary combinations. The change is connected to the clause "l_shipdate = date '1998-12-01' - interval '[DELTA]' day (3)" in the 'where' clause. Because data loaded into Hive without a workaround approach of at least four steps (create a temp table, load data, create a table with correct data types, and insert data from the temp table) only supports string-type date values, the date interval has been replaced with the exact date, and the function to_date() has been added to return the date from a string-type date value stored in the Hive table.
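The date workaround described above can be sketched as follows. This is an assumption-laden sketch: the table and column names follow the TPC-H lineitem convention, and the fixed date shown is illustrative, standing in for the pre-computed value of the original interval expression.

```sql
-- The original predicate subtracts an interval from a DATE literal.
-- Because l_shipdate is stored as a STRING here, the interval is
-- replaced with a pre-computed exact date and to_date() converts the
-- stored string into a date for comparison.
SELECT COUNT(*)
FROM   lineitem
WHERE  to_date(l_shipdate) <= '1998-09-01';
```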
Chart -1: File Compression across File Formats
Numerous extensive data management systems are available today, including Oracle's Big Data Appliance, IBM's Apache Hadoop, Cloudera's CDH, Hortonworks' HDP, Apache Spark, etc. These systems are primarily focused on massive data storage and processing, but their techniques may differ.
4.2 Data used for experiments


In order to answer the research concerns about the Parquet and Avro formats, a gap and the need for additional experiments and investigations were identified as a consequence of a comprehensive literature assessment. Because of their design specifics, none of the 17 studies reviewed at the end of the systematic literature review have a direct focus on comparing the three binary data storage formats Parquet, Avro, and ORC.

The experiments reveal that using Avro is only beneficial in terms of saving storage space. Even queries from Textfile format tables are faster than queries from Avro tables. Queries from ORC format tables, on the other hand, provide a
Chart -2: Size comparison of File Formats
Chart -3: Performance comparison on querying
The outcomes of the grouping process and determining the maximum value across several formats are shown below. It's important to note that this is merely a mathematical process; it's very similar to a Data Analytics use case. I'm pleased to report that all binary formats (Avro, Parquet, and ORC) performed admirably. Parquet and ORC were nearly identical in terms of performance, although ORC had a smaller file footprint. It comes as no surprise that ORC adoption has plummeted in recent years. I barely ever come across projects in ORC format these days.
TABLE 2: Query execution time (seconds) across file formats

Query     Textfile   Avro     Parquet   ORC
Query0    132        200      34        28
Query1    306        320      142       122
Query2    Failed     Failed   Failed    Failed
Query3    430        500      277       230
Query4    350        390      200       170
Query5    500        550      320       280
Query6    145        230      64        44
Query7    633        640      436       380
Query8    Failed     Failed   Failed    Failed
Query9    Failed     Failed   Failed    Failed
Query10   403        460      230       180
Query11   325        310      270       213
Query12   325        360      180       167
Query13   216        244      200       162
Query14   275        315      150       118
Query15   608        675      325       299
Query16   280        300      238       198
Query17   600        700      344       312
Query18   680        800      428       409
Query19   Failed     Failed   Failed    Failed
Query20   540        645      390       365
Query21   1000       1266     670       634
Query22   210        300      150       134
Query23   28         55       25        18

6. CONCLUSIONS
TABLE 1: File Loading Time Across File Formats

No. of records   Parquet (sec)   Avro (sec)   ORC (sec)
50000            5.51            10.09        1.7
100000           7.02            15.98        2.34
200000           12.84           26.75        5.73
To avoid repetition, and owing to space limits, some of the final conclusions are offered in this part, which is far from a full review. The studies in this article were based on a thorough examination of SQL-on-Hadoop using compact data formats [6].
Performance Comparison: results of the file sizes across different formats. Notice the vast variation in file sizes for 500 million rows. JSON has the largest footprint, whereas Parquet has the lowest.



[19] M. Stonebraker, D. J. Abadi, A. Batkin, X. Chen, M. Cherniack, M. Ferreira and P. O'Neil, "C-store: a column-oriented DBMS," in Proc. of the 31st International Conference on Very Large Data Bases, VLDB Endowment, 2005, pp. 553-564.
[9] CSV files. [Online]. Available: https://tools.ietf.org/html/rfc4180.
[13] Y. He, R. Lee, Y. Huai, Z. Shao, N. Jain, X. Zhang and Z. Xu, "RCFile: A fast and space-efficient data placement structure in MapReduce-based warehouse systems," in Proc. of IEEE 27th International Conference on Data Engineering (ICDE), 2011, pp. 1199-1208.

[14] ORC Files. [Online]. Available: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC.
REFERENCES

[1] https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html
[12] Sequence File documentation. [Online]. Available: https://wiki.apache.org/hadoop/SequenceFile. [Accessed: 1 Nov 2016].
[15] Parquet official documentation. [Online]. Available: https://parquet.apache.org/documentation/latest.
significant performance improvement over Textfile, Parquet and Avro queries. When compared to the others, ORC can deliver a 2x faster execution time on average. There isn't much of a difference between the scan and aggregation queries offered. Experiments with GIT datasets have gotten a lot of attention; GIT decision support benchmarks are frequently utilised in relational database system performance evaluation nowadays. Because DBGEN allows for datasets with scale factors greater than 1 TB, GIT datasets can be used to evaluate the performance of Big Data management systems. In the future, query performance could be measured using the TPC-DS standard benchmark, which is more appropriate for Big Data systems. Other query engines and frameworks such as Impala, HAWQ, IBM Big SQL, Drill, Tajo, Pig, Presto, Spark, Cascading, and Crunch could also be investigated in additional experiments to obtain more detailed experience with compact data formats.
[10] JSON specification. [Online]. Available: https://tools.ietf.org/html/rfc7159. [Accessed: 1 Nov 2016].
[17] N. Palmer, E. Miron, R. Kemp, T. Kielmann and H. Bal, "Towards collaborative editing of structured data on mobile devices," in Proc. of 12th IEEE International Conference on Mobile Data Management (MDM), vol. 1, 2011 (June), pp. 194-199.
[5] Gartner Says Smartphone Sales Surpassed One Billion Units in 2014. [Online]. Available: http://www.gartner.com/newsroom/id/2996817.
[2] Shuo Zhang, Li Miao, Dafang Zhang, "A Strategy to Deal with Mass Small Files in HDFS," 2014 Sixth International Conference on Intelligent Human-Machine Systems and Cybernetics.

[3] Daiga Plase, Laila Niedrite, Romans Taranovs, "Accelerating Data Queries on Hadoop Framework by Using Compact Data Formats," 2016 IEEE 4th Workshop on Advances in Information, Electronic and Electrical Engineering (AIEEE).

[4] M. Sharma, N. Hasteer, A. Tuli and A. Bansal, "Investigating the inclinations of research and practices in Hadoop: A systematic review," Confluence The Next Generation Information Technology Summit (Confluence), in Proc. of 5th International Conference, IEEE, 2014, pp. 227-231.
[8] K. Shvachko, H. Kuang, S. Radia and R. Chansler, "The Hadoop distributed file system," in Proc. of IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), 2010, pp. 1-10.
[11] Avro specification. [Online]. Available: http://avro.apache.org/docs/current/spec.html. [Accessed: 1 Nov 2016].
[7] Y. Chen, X. Qin, H. Bian, J. Chen, Z. Dong, X. Du and H. Zhang, "A study of SQL-on-Hadoop systems," in Proc. of Workshop on Big Data Benchmarks, Performance Optimization, and Emerging Hardware, Springer International Publishing, 2014, pp. 154-166.
[16] Apache Thrift. [Online]. Available: http://thrift.apache.org.
[18] S. Zhang, L. Miao, D. Zhang and Y. Wang, "A strategy to deal with mass small files in HDFS," in Proc. of IEEE Sixth International Conference on Intelligent Human-Machine Systems and Cybernetics (IHMSC), vol. 1, 2014 (August), pp. 331-334.
[6] D. Plase, "A systematic review of SQL on Hadoop by using compact data formats," Preprint. [Online]. Available: https://dspace.lu.lv/dspace/handle/7/34452. [Accessed: 2 Nov 2016].
