Fraudulent Insurance Claim Detection Using Machine Learning

Healthcare and medical insurance is fertile ground for fraud schemes because its processes are complex and bureaucratic, requiring many approvals, verifications, and other paperwork. The most common scams include fake claims that use false or invalid social security numbers, duplicate claims, billing for medically unnecessary tests, and fake diagnoses. Both hospitals and insurance companies suffer from these issues: insurance carriers lose money, and hospitals risk being implicated in serious crimes, such as drug trafficking. Several data analytics approaches can mitigate these fraud risks.

The machine learning approach to fraud detection has received a lot of publicity in recent years and shifted industry interest from rule-based fraud detection systems to ML-based solutions.

The Rule-Based Approach

Detecting fraudulent activity is nuanced, but some fraud produces signals that algorithms can observe directly. For example, unusually large claims or claims occurring in atypical locations warrant additional verification. Purely rule-based systems run a set of fraud detection scenarios written manually by fraud analysts. Today, legacy systems apply about 300 different rules on average to approve a claim. Even so, rule-based systems remain too simplistic: scenarios must be added and adjusted manually, and the rules can hardly detect implicit correlations. On top of that, rule-based systems often run on legacy software that struggles to process the real-time data streams that are critical in the digital space.
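To make the idea of manually written scenarios concrete, here is a minimal sketch of two such rules, one for unusually large claims and one for atypical locations. The `Claim` fields, the threshold value, and the rule names are all assumptions made for illustration, not part of any real system.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Claim:
    # Hypothetical fields chosen for this example only
    claim_id: str
    amount: float
    location: str
    home_location: str


# Assumed threshold; an analyst would have to choose and maintain it by hand
MAX_AMOUNT = 10_000.0


def large_amount(claim: Claim) -> bool:
    """Rule: the claim amount is unusually large."""
    return claim.amount > MAX_AMOUNT


def atypical_location(claim: Claim) -> bool:
    """Rule: the service location differs from the member's usual one."""
    return claim.location != claim.home_location


# Every new fraud scenario means adding another entry here manually
RULES = {"large_amount": large_amount, "atypical_location": atypical_location}


def triggered_rules(claim: Claim) -> List[str]:
    """Return the names of all rules the claim triggers."""
    return [name for name, rule in RULES.items() if rule(claim)]


claim = Claim("C-1", 25_000.0, "Miami", "Boston")
print(triggered_rules(claim))
```

Each rule checks one signal in isolation, which is why a system like this cannot pick up correlations that nobody thought to write down.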

ML-Based Fraud Detection

There are also subtle, hidden events in user behavior that may not be evident but still signal possible fraud. Machine learning makes it possible to build algorithms that process large datasets with many variables and help find these hidden correlations between user behavior and the likelihood of fraudulent actions. Other strengths of machine learning systems, compared to rule-based systems, are faster data processing and less manual work. ML-based methods also do not require a high level of domain knowledge to define rules.

Data and Methodology

Anomaly detection is a common data science approach for fraud detection. It is based on classifying all objects in the available data into two groups: normal observations and outliers. Outliers, in this case, are the objects (e.g. claims) that deviate from normal ones and are considered potentially fraudulent.
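As a simple illustration of splitting data into normal observations and outliers, the sketch below flags values whose modified z-score (based on the median and the median absolute deviation) is unusually high. This is a stand-in technique chosen for brevity, not the method used in our study, and the claim amounts and the 3.5 cutoff are assumptions.

```python
from statistics import median


def mad_outliers(values, threshold=3.5):
    """Return the indices of values whose modified z-score exceeds the threshold."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []  # no spread in the data, so nothing can be ranked as an outlier
    # 0.6745 rescales the MAD so the score is comparable to a standard z-score
    return [i for i, v in enumerate(values)
            if 0.6745 * abs(v - med) / mad > threshold]


# Made-up claim amounts; the seventh one is far from the rest
claim_amounts = [120, 135, 128, 140, 131, 125, 2400, 138]
print(mad_outliers(claim_amounts))
```

The median-based score is used here because a single extreme claim barely moves the median, whereas it would inflate a mean and standard deviation.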

The variables in data that can be used for fraud detection are numerous. By analyzing these parameters, anomaly detection algorithms can answer the following questions:

  1. Do clients access services in an expected way?
  2. Are user actions normal?
  3. Are claims typical?
  4. Are there any inconsistencies in the information provided by users?

We pulled in data from four different datasets:

  1. Beneficiary data (information about the beneficiary and their personal and insurance details)
  2. Inpatient data (beneficiaries that were admitted to a hospital or clinic for treatment)
  3. Outpatient data (beneficiaries that were not admitted to a hospital/clinic and received treatment/advice during a visit)
  4. Flag data (whether or not the claims filed by providers were detected as fraudulent)
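One way the four sources could be combined into a single labeled table is sketched below. Every field name, the toy records, and the choice to key the fraud flag by provider are assumptions for illustration; the real schema may differ.

```python
# All records and field names below are illustrative assumptions
beneficiaries = {
    "B1": {"age": 67, "chronic_conditions": 2},
    "B2": {"age": 54, "chronic_conditions": 0},
}
inpatient = [
    {"claim_id": "I1", "beneficiary": "B1", "provider": "P1", "amount": 9000},
]
outpatient = [
    {"claim_id": "O1", "beneficiary": "B1", "provider": "P2", "amount": 150},
    {"claim_id": "O2", "beneficiary": "B2", "provider": "P1", "amount": 300},
]
provider_flags = {"P1": True, "P2": False}  # True = flagged as fraudulent


def build_rows(beneficiaries, inpatient, outpatient, provider_flags):
    """Join the four sources into one labeled row per claim."""
    rows = []
    for source, claims in (("inpatient", inpatient), ("outpatient", outpatient)):
        for claim in claims:
            row = dict(claim)  # copy so the source record is not modified
            row["is_inpatient"] = source == "inpatient"
            row.update(beneficiaries[claim["beneficiary"]])  # add beneficiary details
            row["fraud"] = provider_flags[claim["provider"]]  # attach the label
            rows.append(row)
    return rows


rows = build_rows(beneficiaries, inpatient, outpatient, provider_flags)
for row in rows:
    print(row)
```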

Supervised Fraud Detection Methods

Logistic Regression

A Logistic Regression (LR) model estimates the probability that a certain class or event exists. In our case, it is the probability that a claim is fraudulent rather than non-fraudulent. LR uses the logistic function to model a binary dependent variable.
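The sketch below shows how a fitted LR model turns claim features into a fraud probability: a weighted sum of the features is passed through the logistic (sigmoid) function. The weights, bias, and feature values are made-up numbers, not parameters from our model.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(features, weights, bias):
    """Probability that a claim is fraudulent under a fitted LR model."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


# Hypothetical fitted parameters and one hypothetical claim
weights = [0.8, 1.5]
bias = -2.0
p = predict_proba([1.0, 2.0], weights, bias)
print(round(p, 3))
```

A claim is then labeled fraudulent when this probability exceeds a chosen cutoff, commonly 0.5.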

According to Figure 1, our LR model performs almost identically on the train and validation sets. Our exploratory data analysis showed that the dataset is imbalanced. In such a case, accuracy alone can be misleading, so we use evaluation metrics such as the AUROC (Area Under the Receiver Operating Characteristic curve) and the F-1 score to get a good overview of how the model is doing.
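To see why imbalance matters for the choice of metric, consider a trivial model that labels every claim as legitimate. The sketch below uses made-up labels (95 legitimate claims and 5 fraudulent ones) and computes accuracy and the F-1 score by hand.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive (fraud) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no fraud was caught, so precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up imbalanced labels: 1 = fraud, 0 = legitimate
y_true = [0] * 95 + [1] * 5
always_legit = [0] * 100  # a "model" that never predicts fraud

print(accuracy(y_true, always_legit))  # 0.95
print(f1_score(y_true, always_legit))  # 0.0
```

The model looks strong on accuracy while catching no fraud at all, which the F-1 score exposes.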

Figure 1: Predictions of Train and Validation

The area under the ROC curve for our LR model is 94%. To see in detail how the model classifies and misclassifies claims, we examine the True Positive rate (TPR) vs. the False Positive rate (FPR).
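For readers unfamiliar with how TPR, FPR, and the area under the curve relate, the sketch below sweeps the decision threshold over a set of model scores, records the resulting (FPR, TPR) points, and integrates them with the trapezoidal rule. The labels and scores are made up and are unrelated to our results.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs obtained by lowering the decision threshold."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]
    tp = fp = 0
    # Visit claims from the highest score to the lowest
    ranked = sorted(zip(scores, y_true), reverse=True)
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last claim in a group of tied scores
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / negatives, tp / positives))
    return points


def auroc(points):
    """Area under the ROC curve using the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Made-up labels (1 = fraud) and model scores
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]
print(auroc(roc_points(y_true, scores)))
```

The result can be read as the probability that a randomly chosen fraudulent claim receives a higher score than a randomly chosen legitimate one.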

Random Forest

Random Forests (RF) are an ensemble learning method for classification. They operate by constructing a multitude of decision trees at training time and outputting the class that has been “voted” by the decision trees most frequently.
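The voting step can be sketched as follows. The three "trees" here are hand-written threshold rules standing in for trained decision trees, and the feature names are assumptions; a real Random Forest learns each tree from a random sample of the training data.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the class predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]


# Hypothetical stand-ins for trained decision trees
trees = [
    lambda claim: "fraud" if claim["amount"] > 5000 else "non-fraud",
    lambda claim: "fraud" if claim["num_procedures"] > 10 else "non-fraud",
    lambda claim: "fraud" if claim["days_admitted"] > 30 else "non-fraud",
]


def forest_predict(trees, claim):
    """Collect one vote per tree and return the winning class."""
    return majority_vote([tree(claim) for tree in trees])


claim = {"amount": 8000, "num_procedures": 12, "days_admitted": 3}
print(forest_predict(trees, claim))  # two of the three trees vote "fraud"
```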

The area under the ROC curve for our RF model is 93%. To see in detail how the model distinguishes between the classes, we examine the TPR vs. FPR. Looking at Figure 2, we conclude that our RF model performs well at identifying true positives (TP) and true negatives (TN). However, it misclassifies more fraud and non-fraud claims than our LR model does.

The Confusion Matrix confirms our reading of the TPR vs. FPR plot. The F-1 score achieved by the Random Forest model was 0.58 on the validation set.

Figure 2: TPR vs. FPR, Frauds and Non-Frauds

Unsupervised Fraud Detection Methods


An autoencoder is a type of artificial neural network used to learn efficient data encodings in an unsupervised manner. The aim of an autoencoder is to learn a representation (encoding) for a set of data, typically for dimensionality reduction. Along with the reduction side, a reconstructing side is learned, where the autoencoder tries to generate from the reduced encoding a representation as close as possible to its original input (hence its name). During training, we minimize the reconstruction error between the input and the output, which pushes the autoencoder to learn the important features present in the data.


We train the network on a large chunk of “non-fraud” data and reserve another chunk for the test set. The validation set for the Autoencoder network consists of all the “fraud” data and a small portion of “non-fraud.” Once training has finished, the autoencoder knows how to reproduce feature vectors representing legitimate claims at the output layer. Fraudulent claims, which it has never seen, should be reconstructed less accurately, so a high reconstruction error marks a claim as potentially fraudulent.
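The sketch below shows how a trained autoencoder's output could be turned into a fraud flag: measure the reconstruction error of each claim and flag those above a threshold derived from errors on non-fraud training data. The error values, the 95% quantile, and the helper names are assumptions for illustration; the network itself is not shown.

```python
def reconstruction_error(original, reconstructed):
    """Mean squared error between an input vector and its reconstruction."""
    return sum((o - r) ** 2 for o, r in zip(original, reconstructed)) / len(original)


def pick_threshold(train_errors, quantile=0.95):
    """Pick the error value at the given quantile of the training errors."""
    ordered = sorted(train_errors)
    index = min(int(quantile * len(ordered)), len(ordered) - 1)
    return ordered[index]


def flag_claims(errors, threshold):
    """Flag a claim as potentially fraudulent when its error exceeds the threshold."""
    return [error > threshold for error in errors]


# Made-up reconstruction errors for non-fraud training claims
train_errors = [0.01, 0.02, 0.015, 0.03, 0.012, 0.018, 0.025, 0.011, 0.02, 0.016]
threshold = pick_threshold(train_errors)

# Made-up errors for validation claims; the large ones are poorly reconstructed
validation_errors = [0.02, 0.4, 0.015, 0.9]
print(threshold)
print(flag_claims(validation_errors, threshold))
```

Lowering the quantile flags more claims for review at the cost of more false alarms, so the threshold is a tuning choice rather than a fixed constant.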

The Autoencoder network achieved an accuracy of only 74%, which is lower than the 94% and 93% achieved by our supervised methods. Although the accuracy is poor, the Autoencoder model does well at detecting non-fraud claims and achieves an F-1 score of 0.55 with only 2 added layers and 100 epochs. This suggests that there is considerable potential to improve this model, possibly to the point of outperforming our supervised methods.

For a more in-depth analysis of our methods and findings, be sure to download our Fraud Detection paper below:
