Mobile Money Transaction Fraud Detection
The purpose of this study is to identify fraudulent transactions in an extremely imbalanced dataset, using the PaySim synthetic dataset available on Kaggle.
"PaySim simulates mobile money transactions based on a sample of real transactions extracted from one month of financial logs from a mobile money service implemented in an African country. The original logs were provided by a multinational company, who is the provider of the mobile financial service which is currently running in more than 14 countries all around the world."
The Kaggle dataset is scaled down to 1/4 the size of the original and simulates 30 days of mobile money transactions.
Exploratory Data Analysis:
The primary takeaway from our exploratory data analysis is that some of the included features have little to no bearing on whether a transaction is labeled as fraudulent. Of the five transaction types ('cash in', 'cash out', 'debit', 'payment', and 'transfer'), only cash outs and transfers contain fraudulent transactions.
When we examine the accounts and account names, we find that the account types of 'merchant' and 'customer' have little connection to which transactions are fraudulent, and in fact are inconsistently applied. Additionally, very few accounts are involved in more than one fraudulent transaction, whether as sender or recipient.
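The per-type check described above can be sketched as follows. This is a minimal illustration on made-up rows, not the notebook's actual code; the type labels and the 0/1 fraud flag mirror the `type` and `isFraud` columns of the Kaggle CSV, which is an assumption since the text does not name its columns.

```python
from collections import Counter

# Hypothetical rows: (transaction type, fraud flag). Values are made up.
transactions = [
    ("CASH_IN", 0), ("CASH_OUT", 1), ("CASH_OUT", 0), ("DEBIT", 0),
    ("PAYMENT", 0), ("TRANSFER", 1), ("TRANSFER", 0), ("PAYMENT", 0),
]

def fraud_counts_by_type(rows):
    """Count fraudulent transactions for each transaction type."""
    counts = Counter()
    for tx_type, is_fraud in rows:
        counts[tx_type] += is_fraud
    return dict(counts)

def types_with_fraud(rows):
    """Return the transaction types that contain at least one fraud case."""
    return sorted(t for t, n in fraud_counts_by_type(rows).items() if n > 0)

fraud_types = types_with_fraud(transactions)

# Keep only the transaction types that can contain fraud.
filtered = [row for row in transactions if row[0] in fraud_types]
print(fraud_types, len(filtered))
```

Narrowing to these types before modeling removes rows that can never be positive, which is the filtering step the rest of the analysis relies on.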
When we narrow our focus to only the transaction types that contain fraud, a few factors emerge:
- the overall number of fraudulent transactions per hour (time step) remains relatively constant, even though the total number of transactions per step varies.
- the strongest correlations with fraudulent activity are in the balance columns of the source and target accounts of the transaction.
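The first observation can be checked by tallying total and fraudulent transactions for each time step. The sketch below uses made-up `(step, is_fraud)` pairs; the helper name `per_step_counts` is hypothetical.

```python
from collections import defaultdict

def per_step_counts(rows):
    """Map each time step to a (total transactions, fraudulent transactions) pair."""
    tallies = defaultdict(lambda: [0, 0])
    for step, is_fraud in rows:
        tallies[step][0] += 1          # every transaction counts toward the total
        tallies[step][1] += is_fraud   # only flagged ones count toward fraud
    return {step: tuple(pair) for step, pair in sorted(tallies.items())}

# Hypothetical data: total volume varies per step, fraud count does not.
rows = [(1, 0), (1, 0), (1, 1),
        (2, 0), (2, 1),
        (3, 0), (3, 0), (3, 0), (3, 0), (3, 1)]

counts = per_step_counts(rows)
for step, (total, fraud) in counts.items():
    print(step, total, fraud)
```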
After exploring this connection further, we find errors in the recorded balances of both source and destination accounts. In some transactions, the balance of the account receiving the money does not reflect the amount transferred. For example:
- The old balance is 0
- The transferred amount is 23.00
- The new balance is 0
In this case, we would expect the new account balance to be 23.00, but it is instead 0, so our target account variance is 23.00: $23 of unaccounted-for money.
This is our best link yet. The errors in the transaction's target account show a clear difference between fraudulent and legitimate transactions. The target account's variance rarely if ever goes negative for fraudulent transactions. In a nutshell, the target accounts of fraudulent transactions are more likely to have a positive amount of money that is unaccounted for by the recorded transactions.
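The target account variance described above can be written as a one-line feature. This is a sketch: the function name `dest_variance` is hypothetical, and the sign convention (positive means money that was sent but never shows up in the recorded balance) is inferred from the worked example.

```python
def dest_variance(old_balance, amount, new_balance):
    """Amount sent to the target account that its recorded new balance does not show.

    expected new balance = old balance + amount
    variance             = expected new balance - recorded new balance
    """
    return round(old_balance + amount - new_balance, 2)

# The example from the text: 0 before, 23.00 transferred, 0 after.
example = dest_variance(0.0, 23.00, 0.0)

# A consistent record for comparison: the balance rises by exactly the amount.
consistent = dest_variance(100.0, 23.00, 123.0)

print(example, consistent)
```

Computed per transaction, this value can be added as an extra column alongside the raw balance columns before training.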
Because we're dealing with a highly skewed and imbalanced dataset, my modeling notebooks explore several methods of handling imbalanced data, including oversampling, undersampling, and the SMOTE technique. Interestingly, the best results came from leaving the dataset unaltered and choosing a model that handles imbalanced data well.
After several trials, the model that performs best is the Random Forest Classifier, a model type that uses an ensemble of decision trees trained on bootstrapped data and therefore handles imbalanced datasets well. Interestingly enough, when oversampling the minority class, the number of misclassifications more than doubled to 20 samples; when undersampling the majority class, it rose to 207 samples; and when combining undersampling with SMOTE (synthetic minority oversampling), it reached 40 samples.
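For reference, the two simplest resampling strategies mentioned above, random oversampling of the minority class and random undersampling of the majority class, can be sketched with the standard library alone. This is an illustration on made-up records, not the notebooks' code. SMOTE, which synthesizes new minority samples by interpolating between neighbouring minority points, is not shown here; the imbalanced-learn package provides an implementation.

```python
import random

def random_oversample(majority, minority, seed=0):
    """Duplicate randomly chosen minority samples until both classes are the same size."""
    rng = random.Random(seed)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return majority + minority + extra

def random_undersample(majority, minority, seed=0):
    """Keep a random subset of the majority class the same size as the minority class."""
    rng = random.Random(seed)
    return rng.sample(majority, len(minority)) + minority

# Hypothetical records: (label, id)
legit = [("legit", i) for i in range(20)]
fraud = [("fraud", i) for i in range(3)]

oversampled = random_oversample(legit, fraud)
undersampled = random_undersample(legit, fraud)
print(len(oversampled), len(undersampled))
```

Resampling of this kind is normally applied to the training split only, so that the reserved test set keeps the original class balance.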
Grid searching and tuning the random forest resulted in a model that is 99.9987% accurate, with a total of 7 misclassifications out of the 692,603 reserved testing samples.
Our baseline for this model is 99.7%: the dataset is so imbalanced that simply assuming all transactions are legitimate, and accepting the resulting 0.3% misclassifications, reaches what would be an exceptional accuracy in most other settings. Unfortunately this misses the point of detecting fraud, so we'll just have to beat this standard.
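The baseline figure follows directly from the class counts in the reserved test set. A quick check, using the 692,603 test samples and the 2,053 fraudulent cases quoted elsewhere in this write-up:

```python
test_size = 692_603    # reserved testing samples (from the text)
fraud_cases = 2_053    # fraudulent transactions in the test set (from the text)

# Predicting "legitimate" for everything gets every non-fraud sample right
# and every fraud sample wrong.
baseline_accuracy = (test_size - fraud_cases) / test_size
baseline_error = fraud_cases / test_size

print(f"baseline accuracy: {baseline_accuracy:.2%}")
print(f"baseline error:    {baseline_error:.2%}")
```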
As mentioned above, the model that performs best is the Random Forest Classifier, a model type that handles imbalanced datasets well. Interestingly enough, when oversampling the minority class, the number of misclassifications more than doubled to 20 samples; when undersampling the majority class, it rose to 207 samples; and when combining undersampling with SMOTE (synthetic minority oversampling), it reached 40 samples.
Grid searching (programmatically testing different combinations of hyperparameters) and tuning the random forest resulted in a model that is 99.9987% accurate, with a total of 7 misclassifications in the 692,603 reserved testing samples.
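A grid search over a random forest might look like the sketch below, assuming scikit-learn. The synthetic data, the parameter grid, and the choice of recall as the scoring metric are all illustrative assumptions; the text does not state which hyperparameters or metric the notebooks used.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Small synthetic stand-in for the real data: roughly 5% positive class.
X, y = make_classification(n_samples=2000, n_features=8,
                           weights=[0.95, 0.05], random_state=42)

# Stratify so the rare class appears in both splits at the same rate.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

# Illustrative grid: 2 x 2 = 4 candidate combinations.
param_grid = {"n_estimators": [25, 50], "max_depth": [4, None]}

search = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid, scoring="recall", cv=3)
search.fit(X_train, y_train)

best_model = search.best_estimator_
test_accuracy = best_model.score(X_test, y_test)
print(search.best_params_, test_accuracy)
```

Scoring on recall rather than accuracy is one way to keep the search from rewarding models that simply predict the majority class.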
The original purpose of this project was to develop a model that identifies fraudulent transactions in an extremely imbalanced dataset and to serve as a proof of concept for handling similar datasets in the future.
Ultimately, this model does exactly that: it gets a mere 7 samples incorrect out of a reserved testing set of 692,603. Unfortunately, it misses 7 cases of fraudulent activity; however, this is a drastic improvement over the 2,053 that would have been missed by assuming all transactions were legitimate.
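Because accuracy is dominated by the majority class, recall on the fraud class is a more telling summary of this result. The arithmetic below uses only the figures quoted above and assumes, as the text states, that all 7 errors are missed fraud cases (so there are no false positives).

```python
fraud_cases = 2_053       # fraudulent transactions in the test set (from the text)
false_negatives = 7       # fraud cases the model missed (from the text)
false_positives = 0       # assumption: the text attributes all 7 errors to missed fraud

true_positives = fraud_cases - false_negatives

# Recall: share of actual fraud that was caught.
recall = true_positives / (true_positives + false_negatives)

# Precision: share of fraud alerts that were correct.
precision = true_positives / (true_positives + false_positives)

print(f"caught {true_positives} of {fraud_cases} fraud cases")
print(f"recall:    {recall:.4f}")
print(f"precision: {precision:.4f}")
```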
Recommendations for Further Development:
Attempts to tune the model to prefer false positives (labeling a legitimate transaction as fraudulent) over false negatives (missing a case of fraud) drastically reduced the overall accuracy and worsened both the false negative and false positive counts. I've been working with additional models that are reputed to handle imbalanced classes well (XGBoost, for example), and I'm excited to see what results I can get with another model in the mix.
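One common way to bias a classifier toward false positives is to lower the probability threshold at which a transaction is flagged. The sketch below illustrates the trade-off on made-up scores; it is not necessarily the tuning method tried in the notebooks, and the helper names are hypothetical. With scikit-learn, the scores would typically come from the fraud-class column of `predict_proba`.

```python
def classify(fraud_probabilities, threshold=0.5):
    """Flag a transaction as fraud (1) when its score meets the threshold."""
    return [1 if p >= threshold else 0 for p in fraud_probabilities]

def error_counts(y_true, y_pred):
    """Return (false positives, false negatives)."""
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return fp, fn

# Hypothetical labels and predicted fraud probabilities.
y_true = [0, 0, 0, 1, 1, 0, 1, 0]
scores = [0.05, 0.20, 0.35, 0.40, 0.90, 0.10, 0.60, 0.45]

default_errors = error_counts(y_true, classify(scores, threshold=0.5))
lenient_errors = error_counts(y_true, classify(scores, threshold=0.3))

print("threshold 0.5 (FP, FN):", default_errors)
print("threshold 0.3 (FP, FN):", lenient_errors)
```

On this toy data, lowering the threshold trades one missed fraud case for two false alarms; whether that exchange is worthwhile depends on the relative cost of each kind of error.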