Automated payment fraud detection using logistic regression and support vector machines

dc.contributor.advisor: Nel, J. H. [en_ZA]
dc.contributor.author: Thetard, Heinrich Mathias [en_ZA]
dc.contributor.other: Stellenbosch University. Faculty of Economic and Management Science. Dept. of Logistics. [en_ZA]
dc.date.accessioned: 2021-03-06T16:40:04Z
dc.date.accessioned: 2021-04-21T14:36:56Z
dc.date.available: 2021-03-06T16:40:04Z
dc.date.available: 2021-04-21T14:36:56Z
dc.date.issued: 2021-03
dc.description: Thesis (MComm)--Stellenbosch University, 2021. [en_ZA]
dc.description.abstract: ENGLISH ABSTRACT: The financial technology sector is a fast-moving environment. There are many innovations in the automation and efficiency spheres, where less human intervention is required and processing speed is rapidly increasing. In the payments space this is evident: payments are processed faster each year, and the vast majority of these transactions are driven automatically. This has opened up a platform for fraudsters to operate on. The use of machine learning (ML) in fraud detection has grown in popularity. Two methods used to identify fraud, logistic regression (LR) and support vector machines (SVMs), are investigated in this thesis. LR is less complex than SVMs, but there are particular situations in which SVMs outperform any other ML model [31]. Each method is assessed on its application conditions and measured using a set of confusion-matrix-based metrics. The two methods are applied to a data set from a bank that participates in the automated payment environment. It was evident that the sample proportions selected had a major impact on model performance, especially with regard to sensitivity and specificity. This was an exercise in fraud identification, in which sensitivity is the most important metric. This may not be the case for all data sets and environments, as the cost of investigating false positives may be higher than the actual cost of the fraud prevented. Condition testing and post-model application diagnostics were applied in this research. It was evident that principal component analysis (PCA) feature selection was inferior to stepwise feature selection. The relatively poor performance of the PCA feature selection models is due to a loss of information when variables are removed in choosing the components. When considering the odds ratios for LR, several variables were protective factors and others were risk factors.
These factors either increased or decreased the odds of a case being fraudulent. It was found that a debit order (DO) associated with an older person was more likely to be fraudulent than a DO associated with a younger person. It was also found that if a DO had a value of R99 or R45, the odds of the case being fraudulent increased several-fold. LR models produced results equivalent to those of the more complex SVM models with a much shorter run time. From a practical point of view, this means that LR is preferred on larger data sets. [en_ZA]
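The two quantities the abstract relies on, confusion-matrix metrics and LR odds ratios, can be illustrated with a minimal Python sketch. It is not taken from the thesis: the function names, the counts, and the coefficients are hypothetical and chosen only for illustration. It assumes the standard definitions, namely sensitivity = TP / (TP + FN), specificity = TN / (TN + FP), and odds ratio = exp(coefficient).

```python
import math


def sensitivity_specificity(tp, fn, tn, fp):
    """Return (sensitivity, specificity) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)  # share of fraudulent cases that were flagged
    specificity = tn / (tn + fp)  # share of legitimate cases that were passed
    return sensitivity, specificity


def odds_ratio(coefficient):
    """Odds ratio implied by a logistic regression coefficient."""
    return math.exp(coefficient)


# Hypothetical counts, not results from the thesis.
sens, spec = sensitivity_specificity(tp=80, fn=20, tn=900, fp=100)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")

# Hypothetical coefficients: an odds ratio above 1 marks a risk factor,
# an odds ratio below 1 marks a protective factor.
for name, beta in {"risk_example": 1.2, "protective_example": -0.7}.items():
    ratio = odds_ratio(beta)
    kind = "risk factor" if ratio > 1 else "protective factor"
    print(f"{name}: odds ratio {ratio:.2f} ({kind})")
```

When missed fraud costs more than investigating a false alarm, sensitivity is the figure to watch. When investigation costs dominate, specificity carries more weight, which is the trade-off the abstract points to.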
dc.description.abstract: AFRIKAANS ABSTRACT (translated into English): The financial technology sector is a fast-moving environment. There are many innovations in the field of automation and efficiency, where human intervention is needed less and processing speed is increasing rapidly. In the payments space it is apparent that payments are processed faster every year, with the vast majority of payment transactions processed automatically. This has created a platform for fraudsters. Consequently, the popularity of machine learning (ML) in fraud detection continues to grow. Two methods, logistic regression (LR) and support vector machines (SVMs), are used to identify fraud and are investigated in this thesis. LR is less complex than SVMs, but there are particular situations in which SVMs will outperform any other ML model. Each of these methods is assessed on the basis of application conditions, and performance is measured using a set of metrics based on the confusion matrix. The two methods are applied to a data set from a bank that participates in the automated payment environment. It was clear that the selected sample proportions had a large influence on model performance, sensitivity and specificity. In this study the aim is the identification of fraud, and the measurement of sensitivity is therefore the most important. This may not be the case for all data sets and environments, since the cost of investigating false positive cases can be higher than the actual cost of preventing fraud. Testing of conditions and analysis of post-model diagnostics were applied in this research. It was clear that principal component analysis (PCA) performed worse than stepwise selection methods.
The relatively poor performance of the PCA selection models is due to the loss of information when variables are eliminated in choosing the components. In considering the odds ratios for LR, several variables were protective factors and others were risk factors. These factors increased or decreased the odds of a case being fraudulent. It was found that when a debit order (DO) is associated with an older person, it is more likely to be classified as fraud than when the DO is associated with a younger person. It was also found that if a DO has a value of R99 or R45, the odds that it is a fraud case increase further. LR models deliver results equivalent to those of the more complex SVM models with a much shorter run time. From a practical point of view, this means that LR models will be preferred for larger data sets. [af_ZA]
dc.description.version: Masters
dc.format.extent: 125 pages [en_ZA]
dc.identifier.uri: http://hdl.handle.net/10019.1/110024
dc.language.iso: en_ZA [en_ZA]
dc.publisher: Stellenbosch : Stellenbosch University [en_ZA]
dc.rights.holder: Stellenbosch University [en_ZA]
dc.subject: Logistic regression analysis [en_ZA]
dc.subject: Machine learning [en_ZA]
dc.subject: Support vector machines [en_ZA]
dc.subject: SVMs (Algorithms) [en_ZA]
dc.subject: Automated tellers [en_ZA]
dc.subject: ATMs (Banking) [en_ZA]
dc.subject: Remote sensing [en_ZA]
dc.subject: Contingency tables -- Computer programs [en_ZA]
dc.subject: Commercial crimes [en_ZA]
dc.subject: Banks and banking -- Security measures [en_ZA]
dc.subject: Bank fraud [en_ZA]
dc.subject: UCTD
dc.title: Automated payment fraud detection using logistic regression and support vector machines [en_ZA]
dc.type: Thesis [en_ZA]
Files
Original bundle
Name: thetard_payment_2021.pdf
Size: 2.17 MB
Format: Adobe Portable Document Format