A Literature Review on Plagiarism Detection in Computer Programming Assignments by IRJET Journal

5Assistant Professor, Department of Computer Science and Engineering, Dayananda Sagar College of Engineering, Bengaluru, Karnataka, India ***

review, we came across several approachesthatpreviousresearchershavedeveloped.The approaches were similarity based, logic based, machine learningbasedetc.Severalcomparisonalgorithmssuchas theGreedyStringTilingalgorithm,winnowingalgorithmetc. have been used in similarity based detection. While logic basedalgorithmstrytofindoutonedissimilarityintermsof output and execution paths. For a system, which involved multiple submissions of a single exercise, similarity of consecutivesubmissionsmadebyastudentwereconsidered

2. MOTIVATION

Volume: 09 Issue: 03 | Mar 2022 www.irjet.net

| ISO

p ISSN: 2395 0072 2022, IRJET Impact Factor value: 7.529 9001:2008

1,2,3,4Students, Department of Computer Science and Engineering, Dayananda Sagar College of Engineering, Bengaluru, Karnataka, India

1.INTRODUCTION

Programmingassignmentsplayanimportantroletohone and evaluate programming skills of a student. But the process of finding a solution to the assignments seems frustrating to the majority of students that they borrow solutions from classmates or from external sources. ‘Plagiarism’ispresentingsomeone else’swork orideasas yourown,withorwithouttheirconsent,byincorporatingit into your work without full acknowledgement. The increasing plagiarism cases in academics hinders fair evaluationofstudents,henceraisestheneedforplagiarism detection. Manual plagiarism detection takes a lot of time and effort, in the case of a large number of assignments. Hence, reliable automatic plagiarism detection techniques areDuringnecessary.thisliterature

whicheliminatestheneedforfindingsimilaritybetweenall the source codes submitted by all the students. Machine learningtechniquessuchasXGBoost,SVM,randomforest, decisiontreeshavealsobeenusedinsomeapproaches.The takeawayandlimitationsofeachofthemhavebeenbriefed inthisreview.Thisreviewgivesusanideaofwhatareasto workonwhilebuildingaplagiarismdetectionsystem.

ThispaperbyK.J.Ottenstein[1]talksaboutoneoftheearliest approachestosolvingtheproblemofdetectingsimilarities in student’s computer programming assignments. This feature based approach was designed for and tested on programs written in FORTRAN. It considers Halstead’s metrics[2] numberofuniqueoperators(n1),number of unique operators (n2), Total number of occurrences of operators(N1), Total number of occurrences of operands(N2).Ifn1andn2isfoundtooccurexactlyN1and N2 times respectively in two assignments then those assignmentsareflaggedasplagiarized.

Abstract Our research aims to detect plagiarism in computer programming assignments. Plagiarism has been a problem for a long time and the problem has evolved with time. With the rise of the internet the theft of intellectual property has risen significantly and recognizing these thefts has become difficult. With this survey we have identified techniques used in detecting plagiarism. Withchangingtimes, the tools needed to detect plagiarism have to be evolved. However, to develop a tool with an ability to achieve high accuracy and greater accessibility of data has always been a demand. A comparative study on plagiarism checking tools with the technology used is presented in this paper. This study would help us determine the algorithm and methodology to proceed with the development of code to detect plagiarism.

Key Words: plagiarism detection; machine learning; artifcial intelligence; deep learning; neural network

Certified Journal | Page1073

A Literature Review on Plagiarism Detection in Computer Programming Assignments

Keerthana T V 1 , Pushti Dixit 2 , Rhuthu Hegde 3 , Sonali S K 4 , Prameetha Pai5

Thepracticeofplagiarismisnota strangethinganymore. Students,programmersandevenlecturersplagiarizeorcopy thesourcecodefromdifferentsourcesbeforesubmittingit to the evaluator. This isn't just limited to schools and colleges but also in the industries. Detecting plagiarism practices is a solution that should be done so that the fraudulentactionscanbeminimized.Detectionofplagiarism clustersisveryimportanttofindouthowmanystudentsor groupsaccomplishingprogramhomeworkindependently.It gives instructors more opportunity to enhance or modify education.Usingthesoftwarecanbeadeterrentforstudents toplagiarism.However,usingthissoftwaredoesnotprovide the final answer, which is why the authors have come up withanideaofusingtwoapproachesandcomparingthemto findoutwhichperformsbetter,thefirsttechniqueistouse machinelearningtechniqueslikeXGBoostalgorithm,SVM classifier and the second technique is to make use of Artificial Neural Network and backpropagation on the datasettoidentifyifthereisplagiarismincode.

International Research Journal of Engineering and Technology (IRJET) e ISSN: 2395 0056

3. RELATED WORK

| ISO

p ISSN: 2395 0072 2022, IRJET Impact Factor value: 7.529 9001:2008

Phase1.2Theinputprogramistransformedtoastatement ordersequenceand sequencesofthepairofassignments arecomparedtodetectsimilarity.

Rabinalgorithmforstringmatching[7]isusedto compareallpairsofk gramsinthetwodocuments.

SvenMeyerzuEissenandBennoStein[9]havefocusedtheir researchontheintrinsicplagiarismmethod.Theirresearch on previously used methods have led to the usage and analysis of intrinsic plagiarism. They have divided the

This paper by John L Donaldson et al. [3] designed a plagiarism detection system that analyses the input programsintwophases.Inthefirst,i.e.,theDatacollection phase the system keeps track of eight features. This informationisstoredinatwo dimensionalarray.Thesecond phaseistheDataAnalysisphasewhichisfurthersubdivided into2phases:

in assignments by students has posed a lot of difficulties for the evaluators and to avoid that the author Michael J.Wise et.al [5]hasproposeda systemknown as YAP3whichisthethirdversionofYAPwhichworksintwo phases primarily. It removes the comments and string constants,convertsfromuppercasetolowercase,mapsthe synonymstoacommonform,reordersthefunctionintheir calling order and also removes the token which is not a reservedwordfromtheprogram.Also,thepaperfocuseson Running Karp RabinGreedy String Tiling(RKR GST)which wasmadeaftertheobservationofYAPandothersystemsfor detection. The method can be used to detect transposed subsequencetoo.Also,thepapertalksaboutusageonYAP onEnglishtextswhichwasasuccess.

ThispaperbyAlexAikenetal.[6]describestheideabehind MOSS(MeasureofSoftwareSimilarity)atoolthatautomates detecting plagiarised programming assignments. MOSS accepts the programs as input and returns HTML pages illustratingpartsoftheacceptedprogramsthatitdetectsto besimilar.Thepaperdescribesthewinnowingalgorithm. Theinputisconvertedintok grams(acontinuoussubstring oflengthk)wherethevalueofkischosenbytheuser.Each kgramishashed.Asubsetofthehashesischosentobethe document's fingerprint and this paper describes the winnowing to select the hashes.A window of size w is createdand,ineachwindow,minimumhashvalueischosen. Ifthereismorethanoneminimumthentherightmosthash isselected.Thisalgorithmwasfoundtobeefficient bythe

TheyhaveevaluatedJPlagagainstbothoriginalandartificial programs. It was found that JPlag was able to perfectly identifymorethan90%ofthe77plagiarismsandtherest wereatleasttermedsuspicious.Runtimeisalsojustafew secondsforaround100programsofseveral100lineseach. JPlagislimitedtolanguageslikeJavaandsupportslanguages suchasC,C++andScheme;supportforotherlanguagesis stillanareaofconcern.

Thedistancebetweenthepointsdeterminesthesimilarityof thetwoprograms.ItisobservedbyLutzPrecheltetal.[8] that feature based systems do not consider valuable structural information of the programs and also observed thataddingfurthermetricsforcomparisondoesnotimprove the Alanaccuracy.Parkeret

Certified Journal | Page1074

[1][3]arefeaturebasedmethodsthatusesoftwaremetrics toconverttheinputprogramintoafeaturevectorthatcan bemappedtoapointinann dimensionalcartesianspace.

al[4]hasgivenusaglimpsetothealgorithms thatcanbeusedtodetectplagiarism.Thepaperfocuseson analgorithmthatisbasedonstringcomparisons.Itremoves thecomments,blankspaces,comparesstringandmaintains countofthepercentagewherethecharactersarethesame. The authors described six levels of plagiarism and their examples are shown in the paper. These algorithms have beendevelopedonthetheoriesofHalstead’smetricswhich bringsoutstrongrelationwithsoftwaremetrics.

Phase1.1determinessimilarityusingtheinformationstored inthecounters thatkeep track ofthe eight features.The threetechniquesusedhereareSumofDifferences,countof similarityandweightedcountofsimilarity.

Prechelt et al. [8] have described JPlag’s architecture, it’s evaluation results, among others. JPlag is a web service whichdetectsplagiarism,givenasetofprogramsasinput. Firstly, it takes a set of programs as input, and then comparestheprogramsinpairs,calculatingtotalsimilarity valueandasetofsimilarityregionsforeachpair.Theyhave modifiedWise’sGreedyStringTilingalgorithmbyapplying thebasicideaofKarp Rabinpatternmatchingalgorithm,to compareprograms.TheoutputisasetofHTMLpageswhich allowsustounderstandthesimilarityregionsindetail.

Thisapproachwasfoundtobelanguagespecific[18]

Sincethiswasanoldpaper,theresearchinthisbroughtout automatingthetextualplagiarismtherebyreducinghuman Plagiarismefforts.

Volume: 09 Issue: 03 | Mar 2022 www.irjet.net

International Research Journal of Engineering and Technology (IRJET) e ISSN: 2395 0056

Then,authors.Karp

ThispaperbyRichardMKarpandMichaelORabin[7]gives us insightful information on the development of the Karp Rabin Algorithm, transitioning from the older techniques and identifying the underlying problem which led to the development of the algorithm. This is a string matching algorithm in which fingerprint functions are used in the algorithm to identify the patterns. This algorithm is also suitable for multi dimensional rectangular arrays. This algorithm short expected computed time with negligible probability of error and has a wide range of applications suitableforcheckingtextualplagiarism.

Thesystemisprettyrobustaccordingtotheauthorandis effective in clustering. They plan to implement fuzzy clustering in the detection system and support more programminglanguagessuchasJava,BasicandDelphi.

InthefollowingpaperbyCynthiaKustantoet.al[11],they have developed ‘Deimos’ a web application interface that receives input and then triggers background process to display the result on the application. Deimos detects plagiarism for source code written in Pascal, Lisp and C programming languages. Deimos performs following functions a) Detects plagiarism b) Displays the result in readableformc)Deletestheresult.Theapplicationparses the source code and transforms it into tokens and then compares each pair of tokens using Running Karp Rabin GreedyStringTilingalgorithm.Theadvantagesofthisisthat it detects plagiarism efficiently, can be accessed from any computersystem,andcanbeusedonotherprogramming languagestoo.Wecanalsosetthedetectionsensitivityandit can process more than 100 source code. The method proposed might work on long programs and not so accurately on small codes. Also processing multiple programmingassignmentsatatimemighttakelongerthan expected.Alotofmechanismscanbeaddedtothistomakeit betterandmoreefficient.

DejanSrakaetal[12]focusesontheplagiarismdoneatthe education levels and identifies the reasons behind plagiarism.Theauthorshavealsoconductedvarioussurveys andbroughtouttheresultswhichhelpus understandthe reasonsandanalysethetypeofplagiarismthataregenerally conducted. They have drawn conclusions from the survey suchthattheremustbeformalrulesandregulationsforthe proceduresandstudentsandteachersmustbeeducatedto understandtheimportanceofauthorship,intellectualrights

Duric and Gasevic [14] addresses the problem of making structural modifications to source code which can make detecting plagiarism very difficult and have presented a sourcecodesimilaritydetectionsystem(SCSDS)whichuses a combination of two similarity measurement algorithms suchasRKR GSTalgorithmandWinnowingalgorithm.The approach consists of five phases, which are: 1) Pre processing: In this phase, all sorts of comments from the sourcecodefileareremoved. 2)Tokenization:Converting thesourcecodeintotokens,andthesetokensarechosenina waydifficultfortheplagiariststomodify,butstillmaintains the essence of the program. 3) Exclusion: In this phase, template code is excluded which can avoid many false positives. 4) Similarity measurement: RKR GST and Winnowingalgorithmareusedtomeasuresimilarity.This phaseisrepeatedtwicedue totheimplementation oftwo algorithms.5) Finalsimilaritycalculation:Thiscalculation isperformedontheresultsobtainedinthepreviousphase. The performance of SCSDS similarity measurement had shown promising results in comparison with JPlag. The tokenization phase and the usage of several similarity measurement algorithms contributes to the promising resultsobtained,butit’sslowerduetotheusageofseveral similarity measurement algorithms which needs to be

andrulesofproperreferences.Also,theteachersmustframe questions such that it has multiple solutions. This paper howeverdoesnotfocusontheplagiarismtools,itratheris justidentifyingthereasonsandsources.

Volume: 09 Issue: 03 | Mar 2022 www.irjet.net

ThispaperproposedbyLiangZhanget.al[10]usesametric Informationdistancetomeasurethesimilaritybetweentwo programs and also detects the plagiarism clusters which helpsinfindingouthowmanyofthemhavewrittenthecode independentlyandhelpsthecoursesettersandinstructors toenhanceandmodifythewayeducationiscommunicated tothestudents.Thedetectionsysteminthepaperworksin3 phases particularly, parsing programming codes and translation into token sequences, calculating pairwise distance or similarity and clustering analysis work on similaritymatricesinphase2.

by Bandara et al. [15] describes source code plagiarismdetectionusinganattributecountingtechnique andusesameta learningalgorithmtoimprovetheaccuracy of the machine learning model. Naive Bayes, K nearest neighbouralgorithmswereusedforresearchandAdaboost algorithm was used for meta learning. Nine metrics were chosentoidentifyeachsourcecodeandtrainedonadataset of904javasourcecodefilesandtestedonavalidationsetof 741files.Theaccuracyachievedwas86.64%.Theauthors plan to use other machine learning and meta learning algorithmsinthefuturetoimprovetheaccuracy.

Martin Potthast et al [13] have presented an evaluation frameworkforplagiarismdetection.Theperformanceofthe plagiarism detection is measured with the help of the plagiarismdetectionalgorithmandmeasuretoquantifythe precisionandgranularity.Theauthorshavecomeupwith the corpus where there are three layers for plagiarism authenticitythatisrealplagiarism,artificialplagiarismand simulated plagiarism. The integration of these PAN plagiarismcorpusisdonebytheauthorsinthePAN PC10 corpus. The corpus features various kinds of plagiarism caseswhichhelpinvalidationwhichisdoneby10different retrievalmodels.Theyhaveaimedforarealistictestbedso thatbetterperformancecanbeachieved.

| ISO

Thisimproved.paper

Journal | Page1075

International Research Journal of Engineering and Technology (IRJET) e ISSN: 2395 0056

p ISSN: 2395 0072 2022, IRJET Impact Factor value: 7.529 9001:2008 Certified

document into sentences, paragraphs or sections, and analyzing various features like stylometric features and averagedwordfrequencyclass.Theirexperimentalanalysis wasonthecomputersciencearticlesinACMdigitallibrary, representedintheXMLformandplagiarismcheckedwith XMLdocuments.Theyhaverepresentedtheiranalysiswith the help of the graph and tabular form. This paper has clarifiedus withtheintrinsicmethodanditsusage,which canbeappliedonlyforthetextualdocuments.

iftwoprogrammingassignmentsreceivedasimilarityscore of 0.7 or above then there were enough code blocks that lookedsimilarandismostlikelyplagiarised.Theresultsare presentedinanexcelsheetthatisdifficulttounderstand.

Zhangetal.[18]proposedaprogramlogic basedapproach to software plagiarism detection (LoPD), which is both effective and efficient. In place of detecting similarity betweentwoprograms,LoPDdetectsdissimilarity.Aslong as it can find one dissimilarity i.e., in the form of either different output states or semantically different execution paths,itisnotacaseofplagiarism;itis,otherwise.Aninput isgiventotwosuspectedprograms.If,a)OutputStatesare different,it'snotaplagiarismcase.Iftheyarethesame,then checkforpathdeviation.b)Executionpathsarethesame,try foranotherinput.If,aftermanyiterations,wecannotfindan input for which either the output or the execution path is different,theprogramsarelikelytobeaplagiarismcase.If they are different, check path equivalence to ensure true semantic deviation and not merely caused by code thematchingareIn2modifiedsentencelexical,vectorInFuzzyoutperformedmodel.ANFIS(FIS)identifiersclusteringwhichV.NeuroSingularthefrequencytermscreatedunnsourcedescribesThishenceequivalence;solvingsymbolicWhileobfuscation.pathcharacterizationisdoneusingtechniquessuchasexecution,weakestpreconditionandconstrainttodetectpathdeviationandmeasuresemanticsconstraintsolvercanleadtofalsepositivesandisboundbylimitations.paperbyGiovanniAcamporaandGeroginaCosma[19]afuzzybasedapproachtodetectsimilaritiesincode.Firstthesourcecodeispreprocessedtoremoveecessaryinformation,afterwhichavectorspacemodeliswhichisamatrixAthatholdsthefrequencyofthepresentinthesourcecodeafterpreprocessing.Thisisnormalisedbyaglobalweighingfunction.AsnextstepthedatasetsizeisfurtherreducedusingValuedecompositionandtheresultisamatrixFuzzylearningalgorithmisappliedonthismatrixVisatwostepprocess.Step1:FuzzyCmeanswhichgroupsthesourcecodehavingsimilartogetherandgeneratesaFuzzyInferencesystemthatdescribestheruleforaparticularcluster.Step2:algorithm[19]isusedtotunetheFIStooptimisetheTheSystemwastestedonJavafilesandSelfOrganisingmaps(SOM)approachandCMeansapproach(FCM)andRKRGSTusedinJPlag.thefollowingworkbyA.Chitraet.al[21]asupportbasedparaphraserecognizerisusedwhichextractssyntacticandsemanticfeaturesontextpassages.Thelevelparaphraserecognitionsystemhasbeenbytheauthortohandlethetextpassages.Therearedifferentapproachesthathavebeenused:thefirstapproach,theinputandthesuspiciouspassagessplitintosentencesanditdeterminestheclosestsourcesentence.AnSVMclassifierisusedtolabelpairs.

Volume: 09 Issue: 03 | Mar 2022 www.irjet.net ISSN: 2395 0072 2022, IRJET Impact Factor value: 7.529 9001:2008 Certified

International Research Journal of Engineering and Technology (IRJET) e ISSN: 2395 0056

ThissurveybyPrasanthSetal[16]tellsusaboutthevarious techniques and tools used for plagiarism detection and varioustypesofplagiarismdetection.Thissurveygivesusa differentFeatureusedstatementasProcessor:wasintobasedplagiarismThistheseevolutionpaperhelpsbringscompleteunderstandingofdifferentplagiarismmethodsandoutthecomparisonbetweenthesemethodswhichuschoosethetechniqueasperourneed.ThissurveyhasjustfocusedonthebasictechniquesandwiththeoftheInternetthechallengetoplagiarismfromsourcesisstillabigconcern.paperbyWeijunChenetal.[17]proposesasourcecodedetectionsystemthataimstocombinefeatureandstructurebasedplagiarismdetectionmethodsasinglesystem.Asystemconsistingoffourcomponentsdesignedbytheauthors.Thecomponentsare:PreThiscomponentremovesthenoiseelementssuchheaderfiles,comments,whitespaces,anyinputoutputandanystringliteralsastheseelementscouldbetofooltheplagiarismdetector.BasedComponent:Thiscomponentconsiderstwocategoriesoffeaturesi.e.,Physicalfeaturesand

Journal | Page1076

| ISO

Halstead’sSoftwaremetrics andbuildsa featurevector of the program given as input. Number of words and the numberofsourcecodelinesarethetwophysicalfeatures considered in this component. Six Halstead’s software metricsareconsiderednamelyarithmeticoperatormetrics, relational operator metrics, logical operator metrics, execution flow metrics, operand metrics and number of differentoperands.

The feature vectors of the two programs are compared to calculate the similarity. A sensitivity coefficient is used to define the strictness of the comparison. This component considers both Physical similarity(S1) and Halstead similarity(S2)alongwithweightsw1andw2.Theauthors haveconsideredtheweightsw1=0.2andw2=0.8.

IntegrationComponent:Combinestheresultofbothfeature and structure based components using the integration algorithm. The authors tested the system by changing the variablenames,addinguselessstatementsandheaderfiles and the similarity score reported was 1.0. The similarity scorereportedafterchangingthestructureoftheprogram was 0.9. The system was also tested using the programs submittedbythestudentsofanIntroductoryprogramming classofTsinghuaUniversity,Chinaanditwasobservedthat

StructureBasedComponent:Thiscomponentusesasetof rulespre definedbytheauthorstosubstitutetheidentifiers inthesourceprogramtoasetofstandardtokenstoensure thatchangestonamesofvariablesorfunctionsdoesn’tfool the plagiarism detector. To increase the efficiency further the token is further mapped to a single character using a mappingtable defined by theauthors resultingina token string.Thetokenstring(separatedintodifferentblocksby the curly braces) is compared to derive a similarity score usingtheLongestCommonSubsequence(LCS)Algorithm.

better execution. The authors later plan on including an integrationofSRL SVMwithatranslatormethodtoremove thelimitationofthepreviousproposedmethod.

ThispaperbyJitendraYasaswiet.al[27]describesamethod which takes student’s submissions to programming assignmentsasinputandextractsthestaticfeaturesfrom the intermediate representation of the program using the MILEPOSTGCCfeatureextractionplugin[28].Thispluginis capableofextracting65features.Thisfeatureismappedto ann dimensionalspacewheren=55.Thesimilaritybetween twodifferentsubmissionsiscalculatedusingtheEuclidean Distance formula and based on the similarity these submissions are clustered together. The results are consistentwiththeresultsobtainedfromMOSSonthesame dataset. Mostly, [27] performs better and five cases illustratingthisaredescribedinthepaper.Thepaperdoes

It was found that two out of the five subgroups could be ignoredbyanydetectiontechniques(sincetheyrepresent incomplete and template files), while the rest were rich enough for coding style metrics to be applied to detect

Inthesecondapproach,paraphraserecognitionfeaturesare extracteddirectlyfromthe sourceandsuspicious passage and the features extracted are used to determine if the suspiciouspassageisplagiarisedfromthesourcepassage. The evaluation measures considered here are Accuracy, Precision, Recall, and F measure. The system gave a good performance in both passage level and sentence level approach. It falls short in recognition of inputs having greaterlexicalvariationalsoinhandlingverysimilarinput phrasesorpassages.

Jitendraplagiarism.Yasaswi

p ISSN: 2395 0072 2022, IRJET Impact Factor value: 7.529 9001:2008 Certified Journal

| ISO

The following paper by Matija Novak et. al [22] gives a reviewofthesourcecodedetectioninacademia.According to the paper, no similarity detection engine is powerful enoughto beabletodetect all theplagiarised code.Using textsimilaritydetectioncanbeusefulattimestodetectthe plagiarism in the source code though not entirely. The author says, there should be more study on different similaritydetectionenginesonthesamedataset.Thetools currently which are used for detection and stand out are jPlag, MOSS, SIM, Sherlock, and Piggie. To improve the accuracyofthedetectiononecan,dopre processingofthe data or clean the code from unnecessary parts. There are manyothertoolslikeGATEandGplagwhichhaven’tbeen compared more by the authors and should be the future scopefor Plagiarismresearch.detection

Theydesigneda small Java programwhichatfirstfetches random samples from the dataset, as the dataset contains duplicatefiles,onlyonefilewiththesameIDisfetchedoutof themanywithsameIDs,topreventduplication.Then,the number of lines and size of each source code file is measured(counted). Followed by measurement of complexity by counting the number of loops based on commonloopwordssuchasfor,whileetc.Ultimately,they aregroupedintofivesubgroupsbasedontheabovefeatures.

is a difficult task as different programming languages could have different syntax and they are not only found in academic works but also in industrysoftwarecodes.InthefollowingpaperbyMayank Agrawal et. al [23], the author describes two types of plagiarismtechniques,textualandsourcecodeplagiarism. Therearedifferenttoolswhicharedescribedinthepaper for both.Theauthorhasalsoexplainedabout many other paperswhichhaveusedtechniqueslikeNaturalLanguage Processing (NLP), Machine Learning Technique, Running Karp Rabin Greedy String Tiling algorithm, Character N Grams, Data Mining, Latent Semantic Analysis (LSA) and GreedyString Tiling.Acomparativeanalysisisperformedon the techniques, methods and the objectives and outcomes have been listed. The literature tells that copying is a dynamic process and is a danger to educational Inorganisations.thefollowing

| Page1077

paper proposed byauthorAhmedHamza Osmanet.al[24],itusesSemanticRoleLabelling(SRL)and SupportVectorMachine(SVM)forpredictinganddetecting plagiarised text passages. The algorithm analyses the text basedonsemanticpositionforthetext.Textpre processing is done on the input original text and the suspected text which includes text segmentation, stemming process and stopwordremoval.SRLprocedureisthenfollowedtofind andnametermsintextdocumentsandpassages.Precision and recall are the execution measures for the algorithms. ThedatasetusedbytheauthorsisPAN PC 10dataset.The proposedmethodisbetterthantheothermethodsandhas

Volume: 09 Issue: 03 | Mar 2022 www.irjet.net

Mirzaetal.[25]analysedaBlackBoxdataset,whichcontains genuinestudentprogrammingassignments(BlackBoxisa project that collects data from the users of BlueJ online educational software), to check if the dataset was rich enoughtoapplycodingstylemetricsto detect plagiarism. They considered random samples of 250 Java files each downloadedfromthedataset.Thefileswerepre processed toremovewhitespaceandfileheaders.

International Research Journal of Engineering and Technology (IRJET) e ISSN: 2395 0056

etal [26]usesdeeplearningandnatural languageprocessing(NLP)todetectplagiarism.Theauthors haveusedchar RNNforfeatureextractionwithanattempt to attain high accuracy. They have focused on statistical language modelling by training the char RNN model on LinuxKernelcodes.TheLSTM(Longshort termmemory) unitshavebeenaddedinchar RNNsothatthelearnfeatures fromCprogramscanbecapturedandusedforcomparison. Overall,thechar RNNmodelistrainedfirst,thenitisfine tuned with some C programs which leads to obtaining embeddings for programming assignments that are submittedandusetheseembeddings(learnunits)todetect plagiarism. However, they have used sequence prediction whichinfactcanbedirectlyappliedonthedifferentdataset withouttheneedtofinetuneeachproblem set.

Siddharth Tata et al [32] have worked on the extrinsic plagiarismdetectionwheretheassignmentsandprojectsare copied from external sources like the Internet or fellow students.Theauthorshavefocusedongeneratingn grams and using the Karp Rabin algorithm which generates the hashvalues.Thewinnowingalgorithmisusedtoselectthe specific hash values and uses Jaccard similarity on the fingerprints generated from the passages to detect the plagiarism. The authors are yet to work on the content available on the Internet. They have currently worked on textfilesandwouldextendworkingontheprograms.

Page

International Research Journal of Engineering and Technology (IRJET) e ISSN: 2395 0056

Volume: 09 Issue: 03 | Mar 2022 www.irjet.net

BudimanandKarnalim[31]haveproposedanapproachto drawhintsfromseatingpositionandsourcecodecreation processtodetectplagiarismaswellastheplagiaristswithan accuracy of at least 80.87%and 76.88% respectively. The approachconsistsoftwomodules,theyare:Studentmodule: This module is installed as a plug in in the student’s

ThefollowingpaperbyK.K.Chaturvediet.al[34]analyses and scrutinises the dataset, techniques and tools used to mine the software engineering data. The authors have basicallycategoriseddifferenttoolsusedinMiningSoftware Repositories.Thecategorizationofthetoolsisdonebased on the ones which are newly developed, traditional, prototypeimplementedandscripts.Inmostofthepapers, studiesbytheauthordataretrieval,datapre processingand postprocessingaremostwidelyusedandareimportant.The

TahaeiandNoelle[29]havepresentedanapproachwhich doesn’trequirethepotentialplagiarismsourcesinorderto comparesimilarity,insteadassumesanonlinesystemwhich allows for submission of multiple solutions for a single exercise, providing formative feedback with each submission. It compares pairs of submissions by an individual student,forsimilarity.Theyhaveusedthe‘diff’ algorithm,whichgivestheminimumnumberofadditionsor deletionsneededtotransformonefileintoanother.Thiswas takenasasubmissiondifferencebetweentwoconsecutive submissions,avectorofn 1dimensionalityiscreatedfrom the submission differences of each pair of consecutive submissions, starting from the submission difference betweenfirstandsecondsubmissions.Severalfeaturessuch as number of submissions; average, maximum, minimum and last differences were considered. For each examined subset of features, logistic regression was performed, identifying logistic sigmoid parameters which maximised plagiarismclassificationaccuracyoverthetrainingset.The parameters were then used to calculate a plagiarism probabilityscore.The resultswereevaluatedagainstdata collectedfromactualstudentsenrolledinanundergraduate computer programming class at a research university. Thoughitcouldn’tperfectlyclassifyplagiarismcases,there was strong correlation between these scores and actual cases of plagiarism. However, this method would fail if a studentmakesasinglesubmissionandalsoifthestudents (whointendtoplagiarise)areawareofthefeaturesbasedon whichplagiarismwillbedetected,astheywilltrytobypass

etal[30]havepublishedtheprototype Hyplag which is able to detect strong textual plagiarism. Theirtargetreviewerswerereviewersofsuchworksuchas journaleditorsorPhDadvisors.TheHyplaghasthemulti stagedetectionprocesswherethereiscandidateretrieval, detailedcomparisonandhumaninspection.Theyhavetaken intoconsiderationmathematicalsimilarity,imagesimilarity, citationsimilarityandtextsimilarity.TheHyplagprototype consistsofthebackendserverwhichisrealisedinJavausing the Spring Boot framework and a web based frontend coupledviaRESTwebserviceinterface.Theydemonstrated theirworkbyshowingresultsoverviewaswellasadetailed comparisonview.Theauthorswereabletodemonstratethe hybridanalysisoftheretractedsourcehavingthecontent featuresandbringoutinteractivevisualisationsthatwould helpthereviewersinassessingthelegitimacyofdocuments.

ThispaperbyHuangQuiboetal[33].describesamethodto extract features from submitted programs and uses a combination of Random Forest algorithm and Gradient Boosting Decision Tree. The authors propose two algorithms. Algorithm 1: A Similarity Degree Threshold (SRT)issetalongwithaToplimit.Ifthesimilarity(sim)of the programs is less than SRT then the program is not plagiarised.IfsimisgreaterthanToplimitthentheprogram is suspected to be plagiarised and the student is asked to confirm if they want to submit the assignment. If they confirmthentheprogramissentforfurtherreviewtothe courseinstructor.Iftheydecline,theysystemassumesthat theprogramwasplagiarised.IfthesimvalueisbetweenSRT andToplimitthensomemorefeaturesneedtobeextracted andtheprogramisevaluatedusingtheRandomForestand GradientBoostDecisiontreemodelstocalculateplagiarism suspect level. Algorithm 2: The authors constructed a Random Forest containing multiple decision trees using entropyascriteriaforclassificationandaGradientboosting decision tree where each tree is a regression tree. Experiments show that algorithm 2 achieved a greater accuracycomparedtoalgorithm1andtheaccuracyratecan reachupto95.9%.

p ISSN: 2395 0072 2022, IRJET Impact Factor value: 7.529 ISO 9001:2008 Certified Journal | 1078

Normandetection.Meuschke

notidentifyanydynamicfeaturesandalsodoesnotconsider thecaseofpartiallyplagiarisedsubmissions.

programmingworkspace.Itfrequentlycapturesthesource codesnapshotswhicharecompressedusingHuffmancoding algorithm for space efficiency. The capturing period and seatingpositionIDaresetbeforemoduleinstallation.After assessment completion, all snapshots are merged as an archiveandsenttotheexaminermodule.Examinermodule: Itfilterscodepairsbasedonseatingpositionwhichreduces pairstobeinvestigatedby80.87%andsuggestssuspected codepairsandplagiaristsbasedonsnapshots.Inthefuture, theyplantouseversioningsystemssuchasGitHubtorecord snapshotsand implementadvancedsimilarityalgorithms.

MichaelDuraciketal[37]haspublishedthepapertodetect thefraudulentactivitydone bythestudentsinsubmitting computerprograms.TheauthorresearchedtoolslikeJPlag, MOSS,Plaggieanddesignedananti plagiarismsystembased on that. The input source code is processed with abstract syntaxtree(AST)andvectorizationisdoneunderlyingthe concepts of Deckard Algorithm. They have designed an algorithmforvectorizationtoaddmultiplevectors.Postthis the vectors are sent for clustering, where apart from incrementalalgorithmtheyhavealsooptimisedtheK means algorithm.Theplagiarismcanbedetectedoncethemerging between the matching vectors takes place. For the best efficiency this method is compared with MOSS and JPlag systemsandtheresultsarenoted.Theyfacedfewproblems inbringingoutthenormalisationofinputsandbringingout anefficientalgorithmforannotationofnon significantcode.

| Impact

HussainKChowdhuryand Dhrubha K Bhattacharyya [39] have presented a taxonomy of various plagiarism forms, tools and various machine learning methods employed to detectthesame.Theauthorshaveidentifiedvarioustypesof plagiarism,bringingoutclaritytoextremeextentsinwhich the former can be achieved. They also give us a detailed explanation on various types of plagiarism detection methods and a wide range of tools used. With clear diagrams,theirsurveybringsouttheclarityonitsmethods andtoolstoidentifyplagiarism.Theyhavealsohighlighted theissues.Theyhaveidentifiedallthemajorrequirementsof plagiarismandcanhelpadevelopertodevelopacodeaptly byreferringandidentifyingallthepreviouslyusedmethods andaimtoimprovisetheaccuracy.

4. FLOW DIAGRAM

Theauthorsfurtherplantostudyfunctions of all the tools used in Mining software Repositories and theirapplicationswhichcanhelpresearchersingettingtools fortheirapplicationsandformoreresearch.

p ISSN: 2395 0072 IRJET Factor value: 7.529 9001:2008

| ISO

tools which are used most of the time depend on the availability and most of them use a combination of many toolsforthetask.

Volume: 09 Issue: 03 | Mar 2022 www.irjet.net

Mining repository is one of the techniques which is employedintheprojectandthefollowingpaperbyThai Bao Do et. al [35] highlights more on the same. The paper proposes a methodto detect forks and duplicates in the repository and also checks any correlation between the forking patterns, software health, risks and success indicators.Therearemoreandmoredatawhicharebeing pushed into version control systems like GitHub, GitLab, Bitbucket, PyPI for Python and Debian for open source softwareandthepapertalksaboutthemethodwhichcanbe usedtoextractthemetadatafromtherepositorieshostedon platforms like GitHub, GitLab and Bitbucket. The study is doneonSoftwareHeritagedatasetwhichconsistsofmore thanthreemillionsoftwarerepositoriesfrommanyversion control systems. The data extracted from the dataset is stored which can be used for future investigations the approachusedbytheauthorsworkwellandshowspossible correlation between the metrics. But there is no concrete conclusionontherelationshipwhichisalsoapartoftheir future Nisheshworks.Awale

etal[36]saysthattheaccuracyofdetecting plagiarismishighbyusingthexgboostmodelwithSupport VectorMachine(SVM).Theauthorsaysthattheyfocusedon themachinelearningtechniquebyworkingonfeature based extractiondonebyrecurrentneuralnetworks.Thexgboost algorithm is trained with the features extracted and the resultscanbeusedwithSupportVectorMachine(SVM)to detectplagiarism.Withtheadvantageofhighaccuracy,they arestillworkingonincorporatingcompiler basedfeatures.

Certified Journal | Page1079

PAshwinetal[38]saysthatforplagiarismdetectiontobe successful,onemustusemachinelearningalgorithms. The

authorswerespecificwiththesoftwaretoolsandalgorithms theyusedfordevelopingtheplagiarismdetectiontool.They haveusedAnacondaNavigatorcloud,Djangoandnode.jsfor creatingalocalenvironmentforhostingonlineassessments andcontests.Thedatasetisobtainedfromtheuserand is used for testing and for training they use from the inbuilt libraries. They worked on normalisation using Natural Language Processing (NLP) and tokenization for the pre processing. They focused on the cosine similarity and N gram algorithms for detecting plagiarism. However, they werenotabletoscrapthedatafromthewebcontentandare workingonit.

International Research Journal of Engineering and Technology (IRJET) e ISSN: 2395 0056

Can be applieddirectly on datasets,different no need to fine tune for each dataset.

Bayes,Naive msalgorithurneighbonearestK

Certified Journal | Page1080

YasaswiJitendra et al )memorytermshort(LongRNN,CharLSTMunits

Tahaei and Noelle ‘diff’ ersparametsigmoidlogisticn,regressiomalgorithlogistic allowssystem,Online exercisesinglesolutionsmultiplesubmissionforoffora

Author Model description Algorith m/featu res Advantages Disadvanta ges

Has not been tested Javaotherlanguagesonthan

Slow,dueto theusageof ntmeasuremesimilariseveraltyalgorithm

CosmaGeorginaandAcamporaGiovanni

Bandara et al.

Fuzzy malgorithANFISg,clusterinmeansC problemsOvercame of reduemisdetectiondependency,languagetocodeshuffling

The results understanddifficultsheetanpresentedareinexcelthatisto

Volume: 09 Issue: 03 | Mar 2022 www.irjet.net p ISSN: 2395 0072 2022, IRJET Impact Factor value: 7.529 ISO 9001:2008

A.Chitraet. al classifierSVM approachlevelandpassageinperformanceGoodbothlevelsentence

International Research Journal of Engineering and Technology (IRJET) e ISSN: 2395 0056

OsmanHamzaAhmed et. al (SVM)MachineVectorSupportand(SRL)LabellingRoleSemantic hugemeasurablyweretheexaminedstrategyproposedadvantagesfoundofConsequencesTTeststheofinpaper

Author Model description Algorith m/featu res Advantages Disadvanta ges phrasesinput or passages.

et.alYasaswiJitendra DistancenEuclideanextractiofeatureTMILEPOSusingobtainedfeaturesStaticGCCplugin, comparedbetterPerforms to MOSS

achievedAccuracy was 86.64%. planAuthors to improve accuracythe

Duric Gasevicand RKR malgorithngWinnowimalgorithGSTand resultsPromising withcomparisoninJPlag

Falls short veryhandlingandvariationlexicalgreaterhavingofrecognitionininputsinsimilar

Methodfails if a submissionamakesstudentonlysingle

etMeuschkeNormanal bliograp(DCT),BimTransforCosineDiscretemeasure,distanceRelative similaritycitationsimilarity,imagesimilarity,mathematicalConsiderssystem,Onlineand

Zhangetal. ritydissimilagdetectinLoPD, techniquesobfuscationtypesagainstResiliencemostof positivesleadsolverConstraintmaytofalse

Table 1: SurveyedPapers

Does not identify submissionsplagiarisedpartiallycaseconsiderdoesfeatures,dynamicanynottheof

5. TABLE

ChenWeijunetal. Six Halstead’ mAlgorith(LCS)enceSubsequCommonLongestmetrics,softwares method0.7)scoressimilarityreceivedassignmentsPlagiarizedhigh(aboveusingthis

The plagiarismsemanticlanguagecrossdetectcannotO(n^2)efficiencytimeisthe

Yet to work featuresbasedcompileron

While many efforts were concentrated towards detecting textualplagiarism,significantstrideshavebeenmadeinthe fieldofsourcecodeplagiarismdetection.Wecanobservethe leapfrommanualplagiarismcheckingtoalgorithm based, automatic plagiarism checking made possible by advancementsintechnology.Wecanalsoobservetheshift from using local client applications to web based applicationsand now tocloud basedapplications, making plagiarism detection systems easy to use and available everywhere. The detection methods started from simple feature based approach, structure based approach, then moved on to hybrid approaches that used similarity measurement and string matching algorithms as students started to evade these systems. Recent approaches using

Author Model description Algorith m/featu res Advantages Disadvanta ges m.algorithEncoplotg,matchinstring(CC),fullgChunkination(GCT),CitTilingCitationreedy(LCCS),GeSequencCitationCommongest(BC),LonCouplinghicthe

Worksontext files

AwaleNisheshetal. (SVM)MachineVectorSupportwithmodelXGBoost

| Page

Does not make use systemsversioningof

QuiboHuangetal Tree,DecisionBoostingGradientmalgorithForestRandomtionCombinaofand

Accuracy rate can reach up to95.9%

Volume: 09 Issue: 03 | Mar 2022 www.irjet.net

Author Model description Algorith m/featu res Advantages Disadvanta ges d(SRT)ThresholySimilaritDegree

DuracikMichael et al. malgorithmeansTree,SyntaxAbstractm,algorithDeckardK

Yet to work on the availablecontent on the programsworksystemextendYetInternet,tothetoon

TataSiddharthetal ysimilaritJaccardm,algorithRabinKarpm,algorithngWinnowi

PAshwinet al. msalgorithgramysimilaritCosineandN

p ISSN: 2395 0072 2022, IRJET Impact Factor value: 7.529 | ISO 9001:2008 Certified Journal 1081

Highaccuracy

Better results than MOSS andJPlag codesignificantofannotationandonnormalizatiregardingIssuesofinputswithnon

Accuracyofat least 80.87% and processcreationsourcepositionseatingConsiders76.88%,andcode

resultsPromising

Yet to work on scrapingweb

International Research Journal of Engineering and Technology (IRJET) e ISSN: 2395 0056

KarnalimandBudiman malgcodingHuffmanorith

textsimilarity

6. CONCLUSIONS

Plagiarismisaubiquitousproblemfacedbypractitionersof differentfieldslikeacademia,journalism,literature,artand soonfordecades.Thefieldhasbeenresearchedintensively sincethe1970’s.Withtheadvancesintechnologyandthe pervasivenessoftheworldwideweb,everyonehasallthe information they need at their fingertips. Especially in academia, this poses a problem to fair evaluation of the studentsandalsoinhibitsthestudent’slearningprocess.

[9] Sven Meyer zu Eissen and Benno Stein, “Intrinsic Plagiarism Detection”, M. Lalmas et al. (Eds.): ECIR 2006, LNCS3936,pp.565 569,2006.

Machine Learning and Deep Learning algorithms and techniqueshaveshownsomepromisingresultstoimprove the accuracy and automating the process of plagiarism detection.

[11] Cynthia Kustanto and Inggriani Liem, “Automatic Source Code Plagiarism Detection”, 2009 10th ACIS InternationalConferenceonSoftwareEngineering,Artificial Intelligences, Networking and Parallel/Distributed

[12] Dejan Sraka and Branko Kauþiþ, “Source Code Plagiarism”,ProceedingsoftheITI200931stInt.Conf.on InformationTechnologyInterfaces,June22 25,2009,Cavtat, [13]Croatia.

[15] UpulBandaraandGaminiWijayrathna,“Detectionof SourceCodePlagiarismUsingMachineLearningApproach”, InternationalJournalofComputerTheoryandEngineering, Vol.4,No.5,October2012.

[8] Lutz Prechelt and Guido Malpohl, “Finding Plagiarisms among a Set of Programs with JPlag”, March 2003,JournalOfUniversalComputerScience8(11).

[18]ProceedingsConferenceonEducation2013,OfficialConference2013.FangfangZhang,DinghaoWu,PengLiuandSencun

[7] Richard M. Karp and Michael O. Rabin, “Efficient randomizedpattern matchingalgorithms”,Publishedin:IBM JournalofResearchandDevelopment(Volume:31,Issue:2, March1987),Page(s):249 260.

[22] Matija Novak, “Review of source code plagiarism detectioninacademia”,201639thInternationalConvention

[16] Prasanth. S,Rajshree. R and Saravana Balaji B, “A Survey on Plagiarism Detection”, International Journal of ComputerApplications(0975 8887),Volume86 No19, January2014.

Zhu,“ProgramLogicBasedSoftwarePlagiarismDetection”, ISSRE'14:Proceedingsofthe2014IEEE25thInternational SymposiumonSoftwareReliabilityEngineering,November 2014,Pages66 77.

Martin Potthast, Benno Stein, Alberto Barrón Cedeño and Paolo Rosso, “An Evaluation Framework for Plagiarism Detection”, Coling 2010: Poster Volume, pages 997 1005,Beijing,August2010.

[3] JohnLDonaldson,AnnMarieLancasterandPaulaH Sposato, “A plagiarism detection system”, SIGCSE '81: ProceedingsofthetwelfthSIGCSEtechnicalsymposiumon Computerscienceeducation,February1981,Pages21 25.

International Research Journal of Engineering and Technology (IRJET) e ISSN: 2395 0056

[14] Zoran Đurić and Dragan Gašević, “A Source Code SimilaritySystemforPlagiarismDetection”,TheComputer Journal, Volume 56, Issue 1, January 2013, Pages 70 86, https://doi.org/10.1093/comjnl/bxs018, Published: 13 March2012.

[17] WeijunChen,ChenlingDuan,LiZhengandYoujian Zhao, “A Hybrid Method for Detecting Source code Plagiarism in Computer Programming Courses”, The European

[20] J. S.R. Jang, “Input selection for ANFIS learning”, Proceedings of IEEE 5th International Fuzzy Systems, 06 August2002,0 7803 3645 3,NewOrleans,LA,USA.

[2] JosephL.F.DeKerf,“APLandHalstead'stheoryof softwaremetrics”,APL'81:Proceedingsoftheinternational conferenceonAPL,October1981,Pages89 93.

[21] A. Chitra and Anupriya Rajkumar, “Plagiarism Detection Using Machine Learning Based Paraphrase Recognizer”,FromthejournalJournalofIntelligentSystems, https://doi.org/10.1515/jisys 2014 0146.

[19] GiovanniAcamporaandGeorginaCosma,“AFuzzy based Approach to Programming Language Independent Source CodePlagiarismDetection”,Publishedin:2015IEEE InternationalConferenceonFuzzySystems(FUZZ IEEE),30 November2015.

Volume: 09 Issue: 03 | Mar 2022 www.irjet.net p ISSN: 2395 0072 2022, IRJET Impact Factor value: 7.529 | ISO 9001:2008 Certified Journal 1082

Computing, Date Added to IEEE Xplore: 13 October 2009, PrintISBN:978 0 7695 3642 2.

[5] Michael J Wise, “YAP3: improved detection of similaritiesincomputerprogramandothertexts”,SIGCSE '96: Proceedings of the twenty seventh SIGCSE technical symposium on Computer science education, March 1996, Pages130 134.

REFERENCES

| Page

[4] Alan Parker and James O. Hamblen, “Computer AlgorithmsforPlagiarismDetection”,IEEETransactionsOn Education,Vol.32,No.2.May1989.

[10] LiangZhang,Yue tingZhuangandZhen mingYuan, “A Program Plagiarism Detection Model Based on Information Distance and Clustering”, Published in: The 2007 International Conference on Intelligent Pervasive Computing (IPC 2007), Date Added to IEEE Xplore: 22 January2008,PrintISBN:978 0 7695 3006 2.

[1] KarlJOttenstein,“Analgorithmicapproachtothe detection and prevention of plagiarism”, ACM SIGCSE Bulletin,Volume8,Issue4,Dec.1976,pp30 41.

[26] JitendraYasaswi,SureshPuriniandC.V.Jawahar, “PlagiarismDetectioninProgrammingAssignmentsUsing DeepFeatures”,20174thIAPRAsianConferenceonPattern Recognition, 17 December 2018, 978 1 5386 3354 0, Nanjing,China,DOI:10.1109/ACPR.2017.146.

IEEE)andStefanToth,“AbstractSyntaxTreeBasedSource CodeAntiPlagiarismSystemforLargeProjectsSet”,October 6, 2020, Volume 8, 2020,Digital Object [38]Identifier:10.1109/ACCESS.2020.3026422.P.Ashwin,M.B.Boominathanand G.Suresh, “Plagiarism Detection Tool for Coding Platform using Machine Learning”, International Research Journal of EngineeringandTechnology(IRJET),Volume:08Issue:05| May2021,TamilNadu,India.

[32] Siddharth Tata, Suguri Charan Kumar and Varampati Reddy Kumar, “Extrinsic Plagiarism Detection Using Fingerprinting”,International Journal of Computer ScienceAndTechnology(IJCST)Vol.10,Issue4, Oct Dec [33]2019.

[35] Thai BaoDo,Huu NghiaH.Nguyen,Bao LinhL.Mai and Vu Nguyen, “Mining and Creating a Software RepositoriesDataset”,20207thNAFOSTEDConferenceon Information and Computer Science (NICS), 02 February 2021,978 0 7381 0553 6,HoChiMinhCity,Vietnam,DOI: [36]10.1109/NICS51282.2020.9335894.NisheshAwale,MiteshPandey,

and Bela Gipp, “HyPlag: A Hybrid Approach to Academic PlagiarismDetection”,SIGIR'18:The41stInternationalACM SIGIR Conference on Research & Development in Information Retrieval, June 2018, Pages 1321 1324, [31]DOI:10.1145/3209978.3210177.ArielElbertBudiman and Oscar Karnalim, “AutomatedHintsGenerationforInvestigatingSourceCode Plagiarism and Identifying The Culprits on In Class Individual Programming Assessment”, Computers 2019,

onInformationandCommunicationTechnology,Electronics andMicroelectronics(MIPRO),28July2016,978 953 233 086 1,Opatija,Croatia,DOI:10.1109/MIPRO.2016.7522248.

[27] JitendraYasaswi,SriKailash,AnilChilupuri,Suresh Purini and C. V. Jawahar, “Unsupervised Learning Based Approach for Plagiarism Detection in Programming Assignments”,ISEC'17:Proceedingsofthe10thInnovations inSoftwareEngineeringConference,February2017,Pages 117 121,DOI:10.1145/3021460.3021473.

HuangQiubo,TangJingdongandFangGuozheng, “Research on Code Plagiarism Detection Model Based on Random Forest and Gradient Boosting Decision Tree”, ICDMML 2019: Proceedings of the 2019 International Conference on Data Mining and Machine Learning, April 2019,Pages97 102,DOI:10.1145/3335656.3335692.

Anish Dulal and Bibek Timsina, “Plagiarism Detection in Programming AssignmentsusingMachineLearning”,JournalofArtificial Intelligence and Capsule Networks (2020),21.07.2020,Vol.02/ No. 03, Pages: 177 184, [37]DOI:10.36548/jaicn.2020.3.005.MichalDuracik,PatrikHrkut,EmilKrsak,(Member,

International Research Journal of Engineering and Technology (IRJET) e ISSN: 2395 0056

[28] ReferencepublicationsaboutcTuning.orglong term vision:GCCSummit'09,ACMTACO'10journalandIJPP'11 [29]https://ctuning.org/wiki/index.php/CTools:MilepostGCC.journal, Narjes Tahaei and David C. Noelle, “Automated PlagiarismDetectionforComputerProgrammingExercises BasedonPatternsofResubmission”,ICER'18:Proceedings of the 2018 ACM Conference on International Computing Education Research,August 2018, Pages 178 186, [30]DOI:10.1145/3230977.3231006.NormanMeuschke,VincentStange,MoritzSchubotz

[23] MayankAgrawalandDilipKumarSharma,“Astate of art on source code plagiarism detection”, 2016 2nd International Conference on Next Generation Computing Technologies(NGCT),16March2017,978 1 5090 3257 0, Dehradun,India,DOI:10.1109/NGCT.2016.7877421.

[24] AhmedHamzaOsmanandOmarM.Barukab,“SVM significantroleselectionmethodforimprovingsemantictext plagiarismdetection”,InternationalJournalofAdvancedand AppliedSciences,4(8)2017,Pages:112 122.

[34] K.K.Chaturvedi,V.B.SingandPrashastSingh,“Tools inMiningSoftwareRepositories”,201313thInternational Conference on Computational Science and Its Applications,12December2013,978 0 7695 5045 9,HoChi MinhCity,Vietnam,DOI:10.1109/ICCSA.2013.22.

[39] HussainAChowdhuryandDhrubaKBhattacharyya, “Plagiarism: Taxonomy, Tools and Detection Techniques”,Paper of the 19th National Convention on Knowledge,LibraryandInformationNetworking(NACLIN 2016)heldatTezpurUniversity,Assam,IndiafromOctober 26 28,2016,ISBN:978 93 82735 08 3.

[25] OlfatM.Mirza,MikeJoyandGeorginaCosma,“Style Analysis for Source Code Plagiarism Detection An Analysis of a Dataset of Student Coursework”, 2017 IEEE 17th International Conference on Advanced Learning Technologies(ICALT),08August2017,978 1 5386 3870 5, Timișoara,Romania,DOI:10.1109/ICALT.2017.117.

8(1), 11, DOI:10.3390/computers8010011,Published: 2 February2019

Volume: 09 Issue: 03 | Mar 2022 www.irjet.net p ISSN: 2395 0072