Fraud Detection and Analysis for Insurance Claims Using Machine Learning by IRJET Journal

International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056 p-ISSN: 2395-0072

Volume: 11 Issue: 03 | Mar 2024 www.irjet.net

Fraud Detection and Analysis for Insurance Claims Using Machine Learning

1Jaya Vani Vankara, 2V Seshadri Naidu, 3D Govardhan, 4V Vivek , 5P VNikhil

1Assistant Professor, Dept. of CSE, GITAM (Deemed to be University), Visakhapatnam, Andhra Pradesh, India. 2,3,4,5Student, GITAM (Deemed to be University), Visakhapatnam, Andhra Pradesh, India. ***

Abstract - Insurance claim fraud is a serious issue that the insurance business faces. It costs insurance companies money andraises policyholders'rates.Machinelearninghas becomea potential method for insurance claim fraud investigation and detection in recent years. Machine learning algorithms such as data mining and deep learning techniques have been effectively applied to identify trends and abnormalities in insurance claim data that point to fraudulent activity. These algorithms could significantly increase the precision and effectiveness of fraud detection, saving insurance firms money and giving consumers excellent safety. Machine learning algorithms can precisely analyze vast volumes of data and spot trends and abnormalities that can point to insurance claim fraud. These algorithms can look at various factors in claimdata, includingclaim type, policyholder details,and past claim history, to identify anomalies or suspicious trends. With thehelp ofmachine learning, insurancecompanies will be able to build predictive models that will give each claim a Fraud Probability Score (FPS). In this project, we’re focusing on identifying auto insurance fraud using machine learning. An insurance agent should be able to investigate every case and determine if it’s real. But this not only takes time, but it’s also expensive. Hiring and financing the skilled labor needed to review every claim filed daily is impossible. This is where machine learning comes in. In this case, we will use one of the most widely used machine learning algorithms.

Key Words: Fraud Insurance, XGBoost, Artificial Neural Network, Random Forest, Logistic Regression, Decision Tree, SVC.

1.INTRODUCTION

We have discovered a significant issue with insurance fraud inthis project. Claimsfiled to deceiveaninsurance company areknownasfalsecoverageclaims.Sincethebeginningof the insurance sector, there has been a persistent problem withinsurancefraud,withasignificantportionofreceived claims being fake. Insurance firms suffer financial losses from fraudulent claims, and policyholders pay higher premiums. Machine learning algorithms, which use data

mininganddeeplearningtechniques,helpidentifypatterns andanomaliesininsuranceclaimdatathatmaybesignsof fraudulentconduct.Thesealgorithmshavethepotentialto increasetheaccuracysignificantly.Theseadvancedmethods presenttheinsurancesectorwithaviablewaytoimprove frauddetectionandlessentheeffectsoffraudulentclaims. Insurance companies could save money, and consumers wouldfeelsaferifthesealgorithmssignificantlyimproved the accuracy and efficacy of fraud detection. These algorithms analyze variousaspects of claim data,suchas the kind of claim, policyholder information, and prior claim history, to spot abnormalities or questionable patterns. Insurance firms can create predictive models that use machinelearningtoassignaFraudProbabilityScore(FPS) to each claim. This research primarily focuses on using machinelearningtodetectfraudwithautoinsurance.

Wehavechosenoneofthemostpopularmachinelearning methodstodothis.Thesealgorithmsproducedthehighest accuracy in projected results and annual expenses, amountingtobillionsofdollars,provingtheirapplicabilityto our dataset. Insurance fraud can take many forms in differentinsurancerealms,anditcanrangeinseverityfrom small-scale claim embellishment to deliberate acts of destructionorharm.Autoinsurancefraudisoneofinsurers' mostsignificantandwell-knownproblems.Aclaimsagent should look at costs due to fraudulent claims, which highlightsthesignificanceofdifferentiatingbetweengenuine andfraudulentclaims.Althoughaclaimsagentshouldlook into each case separately, this is frequently an expensive, time-consuming, and inefficient procedure. Examining all of the many claims that are filed every day would be very impossible. To detect and mitigate fraudulent claims, machinelearning offers a practical, quick, and economical solution.

2. Literature Survey

[1] T. Badriyah, Lailul Rahmanian I. Syarif, titled 'Nearest NeighbourandStatisticsMethodbasedforDetectingFraudm Auto Insurance" provides an overview of the nearest

International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056 p-ISSN: 2395-0072

Volume: 11 Issue: 03 | Mar 2024 www.irjet.net

neighbormethodandinterquartilemethodtodetectfraudin car insurance data.Theresultsoftheconductedstudy,thebest result using the paper, concluded that the Feature selection process improves fraud detection accuracy. The distancebased algorithm yields the best fraud detection results.Performancemeasurementissuperiorinsomecases.-The best result of fraud detection is using a distance-based algorithm.

[2]XiLiu,Jian-BoYang,Dong-LingXu,KarimDerrick,Chris Stubbs,andMartinStockdaletitled"AutomobileInsurance FraudDetectionusingEvidentialReasoningApproachand Data-driven Inferential Modeling." Automobile insurance frauddetectionhasbecomecrucialforloweringinsurance companies' prices. Expenence-based knowledge is interpretable and reusable, but the simplistic way this knowledgeisusedinpracticeoftenleadstomisjudgment. This paper proposes tosetupa uniqueEvidential Reasoning (ER) rule that mixes impartial proof from each reveal primarilybasedontotalsignsandchancesoffraudreceived fromhistoricaldataEachpieceofevidenceisweightedand thencombinedconjunctivelywiththetightsoptimizedusing a maximum likelihood evidential reasoning (MAKER)frameworkfordata-driveninferentialmodeling.

[3]K.SuprajaandS.J.Saritha,titled'RobustFuzzyrule-based techniques to detect frauds in insurance," provide an overviewofTheFuzzyRule-basedtechniqueappliedtothe trainingdatasetandbasedontheinstancesofthedegreeof fraudorlegallypredicted-Concludedthatthistechniqueis usedforhighdimensionallargedatasetswithaccuracy.The paperdiscussesfuzzyrule-basedtechniquestoimprovefraud detectioninvehicleinsurance.Thelimitationofthe paper mentioned is that Bayesian visualization is unsuitable for abundantdata-FuzzyLogicusedtoimproveFraudDetection.

[4] Richard A. Bauder and Taghi M. Khoshgoftaar, titled "Medicare Fraud Detection Using Machine Learning Methods," provides insights into the paper comparing different machine learning methods for detecting Medicare fraud and finding that supervised methods perform better thansupervised or hybrid methods. This paper mentions supervised, unsupervised, and hybrid machine learning methods, as well as class imbalance reduction via oversampling and 80-20 undersampling methods

The paper by [5] Machinya Tongesai, Godfrey Mbizo, and KudakwasheZvarevashetitled"InsuranceFraudDetection using Machine Learning proposes an automated fraud detectionapplicationframeworkusingmachinelearningand the XGBoost method to accurately identify fraudulent

insurance claims in a shorter amount of time. The paper concludes that machine learning and data analysis is an automated fraud detection framework that can help insurancecompaniesaccuratelyidentifyfraudulentclaims quickly,improvingthetraditionalclaiminvestigationprocess andresultinginfinancialsavingsanda positiveimpacton society.

3. Project Goal

To successfully create a model that uses ML Algorithms that ultimately aid in detecting fraud insurance claims effectively andefficientlyandhelptheinsuranceindustries.

Problem Identification

Globally, there are over a thousand companies in the insurance market, and they gather premiums of over one trillion dollars annually. The most common insurance fraud is fabricating accident claims, which may be done with a vehicle. Utilizing machine learning techniques, this researchaimstodetectcarinsurancefraud.

Dataset Description

Thedatasetselectedforthis specifictaskwasextractedfrom anonlinesourcenamedKaggle,whichcontainsnearly1000 rowsofhistoricaldataand39columnscontainingvarious criteriafortheproject.

11 Issue: 03 | Mar 2024 www.irjet.net

6. Proposed Method

6.1 Data collection:

Gatheringdatafortrainingamachinelearningmodelisthe initialstageinamachinelearningpipeline.Datacollection involves gathering information from diverse sources to addresspertinentquestions.Theaccuracyofthepredictions generatedbyamodelisinherentlytiedtothequalityofthe datausedforitstraining.Challengesduringdatacollection includeunreliabledata,missingdata,anddataimbalance.To mitigatetheseproblems,datapreprocessingisconductedon thecollecteddata.

6.2 Data Preprocessing:

Becausetheyarefragmentary,inconsistent,anddevoid of specific behavioral patterns, rawdata fromthe real world areprobablynotdependable.Thus,pre-processingisdone afterdatacollectiontocleanupthedataandprepareitfor machine-learningmodelconstruction.

6.3 Data cleaning:

Erroneously added or classified data can be removed manually or automatically. Imputation of data: Standard deviation, mean, and median are three common methods most machine-learning systems use to balance or fill in missingvalues.

6.4 Oversampling:

Biased or unbalanced datasets can be added to the underrepresented classes after being corrected using strategieslikeoversamplingandrepetition.Exploratory

6.5 Data Analysis:

Exploratory Data Analysis, abbreviated as EDA, fundamentallyservesasaformofstorytellingthatenables the revelation of concealed insights and patterns, the identificationofoutliersandanomalies,andthevalidationof underlyingstructures.

6.6 Clustering:

Toachievethegoalofhavingdatapointsinthesamegroup be more similar to each other and less different from the data points in other groups, clustering is the process of dividingthepopulationordatapointsintomanygroups.In this case, the anticipated value is subtracted from the measured value concerning the optimal line to find the residue.

7. Methodology

e-ISSN: 2395-0056 p-ISSN: 2395-0072

1.Intheinitialstep,weimportthepackages

2.UploadtheCSVfile(datasets)

3.Wecanusefunctionslikedescribeandinfofunctionstothe dataset

4.Datapre-processing

i.Missingvalues(visualizingthemissingvalues)

ii.Handlingmissingvalues

5.Displayingcorrelationbetweenfeatures

6. Dropping the columns which are not necessary for prediction

7.MaskingMulticollinearity

8.Separatingthefeatureandtargetcolumns

9.EncodingCategoricalcolumns

➢ Extractingcategoricalcolumns

➢ DropDummies

➢ Extractingthenumericalcolumns

➢ CombiningtheNumericalandCategoricaldataframes togetthefinaldataset

10.OutliersDetection

11.Splittingdataintotrainingsetsandtestsets.

12.Scalingthenumericvaluesinthedataset

13.Modelscreationandtraining

14.ModelPerformanceComparison

Fig - 2:EvaluationMetrics

International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056 p-ISSN: 2395-0072

Volume: 11 Issue: 03 | Mar 2024 www.irjet.net

Fig-3: Comparisonofmetrics

8. CONCLUSIONS

Compared and contrasted various deep learning and machinelearning modelsforfraudinsurance detectionusing SVM, Random Forest, XGBoost, Decision Tree, Logistics Regression,andNeuralNetworks.

Out of all the models, Random Forest showed the highest accuracy,i.e.,91%,withanerrorrateof0.08.

9. Future scope

Future research directions are indicated, emphasizing enhancing the system's functionality, accuracy, and efficiencyandaddingmoreadvancedfeaturestotheproject so that consumers or policyholders can access it more convenientlyanddependably.

10. REFERENCES

[1] X.Liu,J.-B.Yang,D.-L.Xu,K.Derrick,C.Stubbs,andM. Stockdale, “Automobile Insurance Fraud Detection using the Evidential Reasoning Approach and DataDrivenInferentialModelling,”2020IEEEInternational ConferenceonFuzzySystems(FUZZ-IEEE),Jul.2020.

insurance fraud, K.Supraja, S.J. Saritha,01 Aug 2017.

[3] I.Sadgalian,Saelaf&Benabboua(2019).Performance

[2] Robustfuzzyrule-basedtechniquetodetectvehicle of machine learning techniques in the detection of financialfrauds.

[4] NearestNeighbourandStatisticsMethodBasedfor DetectingFraudinAutoInsuranceTessyBadriyah, LailulRahmaniah,IwanSyar,01Oct2018.

[5] Medicare Fraud Detection Using Machine Learning MethodsRichardA.Bauder,TaghiM.Khoshgofta,01 Dec2017.

[6] InsuranceFraudDetectionUsingMachineLearning Machinya Tongesai, Godfrey Mbizo, Kudakwashe Zvarevashe,09Nov2022.

[7] S.SubudhiandS.Panigrahi,“DetectionofAutomobile Insurance Fraud Using Feature Selection and Data Mining Techniques,” International Journal of Rough Sets and Data Analysis, vol. 5, no. 3, pp. 1–20, Jul. 2018.

[8] V. Khadse, P. N. Mahalle, and S. V. Biraris, “An Empirical Comparison of Supervised Machine Learning Algorithms for Internet of Things Data,” Proc. - 2018 4th Int. Conf. Comput. Commun. Control Autom. ICCUBEA 2018, pp. 1–6, 2018, doi: 10.1109/ICCUBEA.2018.8697476.

[9] BartBaesens,S.H.(2021).Dataengineeringforfraud detection,DecisionSupportSystems.

[10] S. Ray, “A Quick Review of Machine Learning Algorithms,” Proc. Int. Conf. Mach. Learn. Big Data, CloudParallelComput.Trends,PerspectivesProspect. Com. 2019, pp. 35–39, 2019, doi: 10.1109/COMITCon.2019.8862451.

Fig -4: GraphicalAnalysis