Overview
Your credit score is a big deal. If you have a bad one, you're less likely to get good rates on credit cards and loans. Many different aspects of your financial past and present are taken into consideration to predict how risky it will be to offer you credit. Machine learning makes it easier for credit card and loan companies to process all of your data and see if they should lend you money.
This project assessed several machine learning algorithms to determine which one is best suited to predicting credit risk. For something this important, the goal is to capture as many true positives as possible, which means the model's recall scores should be high. The EasyEnsembleClassifier model had recall scores of 91% for high-risk and 94% for low-risk predictions and had the best performance overall. To learn more about this model and the others that were tested, please continue reading.
The Data
The data used for this analysis was provided by the University of Minnesota Data Analysis Boot Camp and comes from LendingClub, a peer-to-peer lending services company. This data is from the first quarter of 2019 and includes credit card data. The data was previously cleaned of personally identifiable information. The following columns were utilized in the analysis:
"loan_amnt", "int_rate", "installment", "home_ownership","annual_inc", "verification_status", "issue_d", "loan_status","pymnt_plan", "dti", "delinq_2yrs", "inq_last_6mths","open_acc", "pub_rec", "revol_bal", "total_acc","initial_list_status", "out_prncp", "out_prncp_inv", "total_pymnt","total_pymnt_inv", "total_rec_prncp", "total_rec_int", "total_rec_late_fee","recoveries", "collection_recovery_fee", "last_pymnt_amnt", "next_pymnt_d","collections_12_mths_ex_med", "policy_code", "application_type", "acc_now_delinq","tot_coll_amt", "tot_cur_bal", "open_acc_6m", "open_act_il","open_il_12m", "open_il_24m", "mths_since_rcnt_il", "total_bal_il","il_util", "open_rv_12m", "open_rv_24m", "max_bal_bc","all_util", "total_rev_hi_lim", "inq_fi", "total_cu_tl","inq_last_12m", "acc_open_past_24mths", "avg_cur_bal", "bc_open_to_buy","bc_util", "chargeoff_within_12_mths", "delinq_amnt", "mo_sin_old_il_acct","mo_sin_old_rev_tl_op", "mo_sin_rcnt_rev_tl_op", "mo_sin_rcnt_tl", "mort_acc","mths_since_recent_bc", "mths_since_recent_inq", "num_accts_ever_120_pd", "num_actv_bc_tl","num_actv_rev_tl", "num_bc_sats", "num_bc_tl", "num_il_tl","num_op_rev_tl", "num_rev_accts", "num_rev_tl_bal_gt_0","num_sats", "num_tl_120dpd_2m", "num_tl_30dpd", "num_tl_90g_dpd_24m","num_tl_op_past_12m", "pct_tl_nvr_dlq", "percent_bc_gt_75", "pub_rec_bankruptcies","tax_liens", "tot_hi_cred_lim", "total_bal_ex_mort", "total_bc_limit","total_il_high_credit_limit", "hardship_flag", "debt_settlement_flag".
The CSV file used is provided below.
The Analysis
Background
Predicting credit risk is, at a basic level, a classification problem (high risk vs. low risk). It is unbalanced, however, because good loans far outnumber risky ones. Therefore, for this project it was important to try different techniques to train and evaluate models with unbalanced classes. Several algorithms that use resampling were evaluated, including RandomOverSampler, Synthetic Minority Oversampling Technique (SMOTE), ClusterCentroids, and Synthetic Minority Oversampling Technique/Edited Nearest Neighbors (SMOTEENN). Two more that use ensemble methods were also tried: BalancedRandomForestClassifier and EasyEnsembleClassifier.
The resampling algorithms use over-sampling, under-sampling, and combination approaches to prepare the data before making predictions. Over-sampling duplicates samples from the minority class, while under-sampling removes samples from the majority class; this is done to correct an imbalance that may be present already or one that develops after random sampling (Kurtis Pykes, "Oversampling and Undersampling: A technique for Imbalanced…," Towards Data Science). The ensemble methods differ in how they attempt to balance the classes and reduce bias. According to Thomas Wood in an article on deepai.org, a random forest is an "ensemble method, meaning that a random forest model is made up of a large number of small decision trees, called estimators, which each produce their own predictions."
The over-sampling algorithms are RandomOverSampler and SMOTE. The under-sampling algorithm is ClusterCentroids. The combination algorithm is SMOTEENN, and the two ensemble algorithms that work to reduce bias are BalancedRandomForestClassifier and EasyEnsembleClassifier. The balanced accuracy score (the average of the true positive rate and true negative rate) and classification reports (with a focus on recall) were used to evaluate model effectiveness. Recall is important because it measures the percentage of actual positives the model correctly identifies. Models with scores of 70% or higher are generally accepted as "good."
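To make these metrics concrete, here is a minimal sketch of how they are computed from confusion-matrix counts (the counts below are made up for illustration, not results from this project):

```python
# Hypothetical confusion-matrix counts for the high-risk class
# (illustrative numbers only, not results from this project)
tp, fn = 70, 30   # high-risk loans caught vs. missed
tn, fp = 85, 15   # low-risk loans correctly passed vs. wrongly flagged

recall_high = tp / (tp + fn)                          # true positive rate: 0.70
recall_low = tn / (tn + fp)                           # true negative rate: 0.85
balanced_accuracy = (recall_high + recall_low) / 2    # 0.775

print(f"high-risk recall: {recall_high:.2f}")
print(f"low-risk recall: {recall_low:.2f}")
print(f"balanced accuracy: {balanced_accuracy:.2f}")
```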
Process
This analysis was completed using Python and Jupyter Notebooks. The GitHub link for the code is: two-suns/Credit_Risk_Analysis (github.com). The code files are credit_risk_resampling.ipynb and credit_risk_ensemble.ipynb.
To start, some further cleaning was done: only the columns needed for the algorithms (listed above) were kept, string columns were converted to numeric values so the algorithms could function properly, and null and "n/a" values were removed.
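Here is a minimal sketch of that kind of preparation with pandas; the file name and the column subset shown are illustrative (see the notebooks linked above for the actual code):

```python
import pandas as pd

# Load the LendingClub data (file name is illustrative)
df = pd.read_csv("LoanStats_2019Q1.csv")

# Keep only the columns needed for the algorithms (subset shown here)
df = df[["loan_amnt", "int_rate", "home_ownership", "annual_inc", "loan_status"]]

# Drop null and "n/a" values
df = df.replace("n/a", pd.NA).dropna()

# Convert string (categorical) columns to numeric via one-hot encoding
df = pd.get_dummies(df, columns=["home_ownership"])
```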
Results
Next, brief descriptions of the algorithms are provided along with the results obtained and a brief analysis. Please note that a logistic regression classifier was used with all of the resampling algorithms; BalancedRandomForest and EasyEnsemble come with their own classifiers.
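Each resampling algorithm below plugs into the same basic pattern. Here is a minimal sketch of it, assuming `X_train`, `y_train`, `X_test`, and `y_test` were already produced by scikit-learn's `train_test_split` (RandomOverSampler, the first model below, is used as the example resampler; the others drop into the same spot):

```python
from imblearn.over_sampling import RandomOverSampler
from imblearn.metrics import classification_report_imbalanced
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score

# Resample the training data to balance the classes
resampler = RandomOverSampler(random_state=1)
X_resampled, y_resampled = resampler.fit_resample(X_train, y_train)

# Train a logistic regression classifier on the resampled data
model = LogisticRegression(solver="lbfgs", random_state=1)
model.fit(X_resampled, y_resampled)

# Evaluate with balanced accuracy and an imbalanced classification report
y_pred = model.predict(X_test)
print(balanced_accuracy_score(y_test, y_pred))
print(classification_report_imbalanced(y_test, y_pred))
```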
RandomOverSampler
This model over-samples by creating duplicates of data items from the minority class in the training dataset; it is the resampler shown in the sketch above.
- Balanced Accuracy Score
- Classification Report
This model had perfect precision (pre) for low-risk customers, but only 1% for high-risk ones. For recall (rec) and balanced accuracy score, it fell below the good threshold of 70%.
SMOTE
The SMOTE algorithm also over-samples. However, instead of duplicating data, it creates synthetic data points by interpolating between existing minority-class points, so the new points end up slightly different from the originals.
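To swap SMOTE into the pattern above, only the resampler changes (a sketch; `sampling_strategy="auto"` is the library default, written out here for clarity):

```python
from imblearn.over_sampling import SMOTE

# Synthesize new minority-class points by interpolating between neighbors
smote = SMOTE(random_state=1, sampling_strategy="auto")
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)
```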
- Balanced Accuracy Score
- Classification Report
The balanced accuracy score for the SMOTE algorithm improved slightly, but it did not reach the 70% threshold. Precision was exactly the same. Recall for identifying low-risk customers improved by 10 percentage points.
ClusterCentroids
This algorithm takes an under-sampling approach. According to imbalanced-learn.org, it works this way: "Method that under-samples the majority class by replacing a cluster of majority samples by the cluster centroid of a KMeans algorithm. This algorithm keeps N majority samples by fitting the KMeans algorithm with N cluster to the majority class and using the coordinates of the N cluster centroids as the new majority samples." In other words, KMeans groups the majority samples into clusters around their means, and the centroids of those clusters become the new, smaller set of majority samples.
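Swapping ClusterCentroids into the same pattern looks like this (a sketch):

```python
from imblearn.under_sampling import ClusterCentroids

# Replace clusters of majority-class samples with their KMeans centroids
cc = ClusterCentroids(random_state=1)
X_resampled, y_resampled = cc.fit_resample(X_train, y_train)
```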
- Balanced Accuracy Score
- Classification Report
This model did much better at correctly identifying high-risk loans, but did much worse with the low-risk loans. The balanced accuracy score also dropped drastically with under-sampling.
SMOTEENN
The SMOTEENN algorithm combines over- and under-sampling. It uses SMOTE to over-sample and EditedNearestNeighbours to under-sample. See above for an explanation of SMOTE. Edited Nearest Neighbours requires a more in-depth explanation; see this article on towardsdatascience.com by Raden Aurelius Andhika Viadinugroho: Imbalanced Classification in Python: SMOTE-ENN Method | by Raden Aurelius Andhika Viadinugroho | Towards Data Science.
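As with the other resamplers, SMOTEENN drops into the same pattern (a sketch):

```python
from imblearn.combine import SMOTEENN

# Over-sample with SMOTE, then clean noisy points with Edited Nearest Neighbours
smote_enn = SMOTEENN(random_state=1)
X_resampled, y_resampled = smote_enn.fit_resample(X_train, y_train)
```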
- Balanced Accuracy Score
- Classification Report
This is the first model to reach 70% recall in either category. However, because both categories did not clear the threshold, this model is still not the best.
BalancedRandomForestClassifier
According to BalancedRandomForestClassifier — Version 0.11.0.dev0 (imbalanced-learn.org), "A balanced random forest randomly under-samples each bootstrap (Bootstrap Sample: Definition, Example - Statistics How To) sample to balance it."
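Because this algorithm ships with its own classifier, no separate resampling step is needed; a minimal sketch (the `n_estimators` value here is an assumption, not necessarily what the notebooks used):

```python
from imblearn.ensemble import BalancedRandomForestClassifier

# Each tree trains on a bootstrap sample that is randomly under-sampled to balance it
brf = BalancedRandomForestClassifier(n_estimators=100, random_state=1)
brf.fit(X_train, y_train)
y_pred = brf.predict(X_test)
```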
- Balanced Accuracy Score
- Classification Report
This algorithm achieved the best balanced accuracy score yet at 79%. Recall for low risk climbed all the way to 91%, and recall for high risk came close to 70%. One more model was tried to see if these scores could be improved even more.
EasyEnsembleClassifier
This is another under-sampling technique, but it also tries to further reduce bias. According to Wik Hung Pun in Comparison of Classifiers and EasyEnsemble | Kaggle, "The undersampling technique, EasyEnsemble proposed by Liu, Wu, and Zhou (2008), samples a subset of the negative cases to create a balanced dataset. A classifier is then trained on this reduced dataset and generates predictions for the test set. This procedure is repeated multiple times and the test predictions are aggregated."
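Like BalancedRandomForestClassifier, this model is used directly as a classifier; a minimal sketch (again, the `n_estimators` value is an assumption):

```python
from imblearn.ensemble import EasyEnsembleClassifier

# Train an ensemble of AdaBoost learners, each on a balanced under-sampled
# subset, and aggregate their predictions
eec = EasyEnsembleClassifier(n_estimators=100, random_state=1)
eec.fit(X_train, y_train)
y_pred = eec.predict(X_test)
```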
- Balanced Accuracy Score
- Classification Report
The EasyEnsembleClassifier clearly outperformed all of the other algorithms. The balanced accuracy score was 92.5%, and the high- and low-risk recall scores were both over 90%.
Final Thoughts
After trying all of the algorithms, it is clear that, in this case, the best model to recommend is the EasyEnsembleClassifier. It achieved the best balanced accuracy score at 92.5%. It also achieved excellent recall scores: 91% for high risk and 94% for low risk. This means it was the best at finding the true positives for each type of risk, which matters when determining someone's credit risk, especially for potentially high-risk clients, because you would not want to misidentify someone who could be a viable candidate.
Thank you for reading! If you enjoyed this please follow and connect with me at Bryan Eckard | LinkedIn. Also, feel free to message me if you would like to find out more about me or my work.