5. Exploratory Data Analysis#
Warning
I decided not to actively pursue my ideas for exploration due to already having a strong set of hypotheses to test and the time constraint. Consider this page just a brain dump, please skip to next section
5.1. Retailers who missed the first collection attempt#
In the Loans dataset, there are a few cases when a retailer is not able to pay back the money they borrowed in the first collection attempt. That does not necessarily mean they will default on their payment, perhaps it’s just a delay. Let’s take a look at the retailer with the most delayed-but-paid loans:
import pandas as pd
try:
_ = loans_df
except NameError:
loans_df = pd.read_excel("../../data/Loans_Data.xlsx")
(
loans_df
# .query("FIRST_TRIAL_BALANCE < 0 and PAYMENT_STATUS == 'Paid'")
.groupby("MAIN_SYSTEM_ID")
.agg(
nunique_loan_id=("LOAN_ID", "nunique"),
nunique_loan_id_delayed_payment=(
"LOAN_ID",
lambda df: df[(loans_df["FIRST_TRIAL_BALANCE"] < 0) & (loans_df["PAYMENT_STATUS"] == "Paid")].nunique()
# the above is very inefficient compute but I'm a bit rusty with Pandas
),
)
.sort_values("nunique_loan_id_delayed_payment", ascending=False)
.head(5)
)
| nunique_loan_id | nunique_loan_id_delayed_payment | |
|---|---|---|
| MAIN_SYSTEM_ID | ||
| 42714 | 14 | 10 |
| 58316 | 19 | 10 |
| 102778 | 15 | 9 |
| 104839 | 11 | 8 |
| 24063 | 20 | 7 |
Let’s investigate retailer MAIN_SYSTEM_ID == 58316, first looking at all 10 loans
loans_df.query("MAIN_SYSTEM_ID == 58316").sort_values("LOAN_ISSUANCE_DATE")
| INDEX | MAIN_SYSTEM_ID | RETAILER_ID | LOAN_ID | LOAN_ISSUANCE_DATE | LOAN_AMOUNT | INITIAL_COST | TOTAL_INITIAL_AMOUNT | INITIAL_DATE | LOAN_PAYMENT_DATE | ... | FIRST_TRAIL_DELAYS | FIRST_TRIAL_BALANCE | SPENT | FINAL_COST | TOTAL_FINAL_AMOUNT | REPAYMENT_ID | REPAYMENT_AMOUNT | REPAYMENT_UPDATED | CUMMULATIVE_OUTSTANDING | PAYMENT_STATUS | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 47308 | 298520 | 58316 | 66776 | 697510 | 2022-09-02 00:38:59.309 | 1000.0 | 0 | 1000.0 | 2022-09-02 | 2022-09-03 19:27:33.733 | ... | 1 | 211.36 | 88.64 | 0 | 88.64 | 697510 | 0.0 | 2022-09-03 19:27:33.733 | 211.36 | Paid |
| 62663 | 432451 | 58316 | 66776 | 723339 | 2022-09-08 12:32:13.904 | 1000.0 | 0 | 1000.0 | 2022-09-08 | 2022-09-09 19:42:24.713 | ... | 1 | 191.46 | 108.54 | 0 | 108.54 | 723339 | 0.0 | 2022-09-09 19:42:24.713 | 191.46 | Paid |
| 7866 | 594403 | 58316 | 66776 | 750715 | 2022-09-11 23:38:50.204 | 1000.0 | 0 | 1000.0 | 2022-09-12 | 2022-09-12 19:01:41.513 | ... | 0 | 230.56 | 69.44 | 0 | 69.44 | 750715 | 0.0 | 2022-09-12 19:01:41.513 | 230.56 | Paid |
| 41078 | 242794 | 58316 | 66776 | 763590 | 2022-09-15 14:52:35.497 | 1000.0 | 0 | 1000.0 | 2022-09-15 | 2022-09-16 20:07:58.496 | ... | 1 | -84.87 | 84.87 | 0 | 84.87 | 771485 | 250.0 | 2022-09-17 18:28:30.091 | 165.13 | Paid |
| 49219 | 313904 | 58316 | 66776 | 776182 | 2022-09-18 20:55:42.708 | 1000.0 | 0 | 1000.0 | 2022-09-19 | 2022-09-20 18:41:03.105 | ... | 1 | -730.07 | 730.07 | 0 | 730.07 | 782955 | 800.0 | 2022-09-20 20:25:46.865 | 69.93 | Paid |
| 51662 | 334727 | 58316 | 66776 | 783557 | 2022-09-21 02:37:30.731 | 1000.0 | 0 | 1000.0 | 2022-09-21 | 2022-09-22 14:58:35.910 | ... | 1 | 46.95 | 103.05 | 0 | 103.05 | 783557 | 0.0 | 2022-09-22 14:58:35.910 | 46.95 | Paid |
| 59844 | 405915 | 58316 | 66776 | 791325 | 2022-09-23 14:20:37.542 | 1000.0 | 0 | 1000.0 | 2022-09-23 | 2022-09-24 16:21:40.566 | ... | 1 | -108.56 | 108.56 | 0 | 108.56 | 807424 | 150.0 | 2022-09-28 17:50:15.576 | 41.44 | Paid |
| 68914 | 487963 | 58316 | 66776 | 814724 | 2022-09-29 11:34:20.555 | 1000.0 | 0 | 1000.0 | 2022-09-29 | 2022-09-29 19:17:11.349 | ... | 1 | -91.36 | 91.36 | 0 | 91.36 | 816729 | 100.0 | 2022-09-29 20:05:02.673 | 8.64 | Paid |
| 12992 | 639440 | 58316 | 66776 | 816704 | 2022-09-29 19:41:44.898 | 1000.0 | 0 | 1000.0 | 2022-09-29 | 2022-10-01 17:59:32.282 | ... | 1 | -100.96 | 100.96 | 0 | 100.96 | 831639 | 101.0 | 2022-10-02 18:17:29.442 | 0.04 | Paid |
| 23374 | 86011 | 58316 | 66776 | 832285 | 2022-10-02 18:19:08.181 | 1000.0 | 0 | 1000.0 | 2022-10-02 | 2022-10-03 19:37:34.605 | ... | 1 | -100.74 | 100.74 | 0 | 100.74 | 838376 | 130.0 | 2022-10-04 11:33:34.686 | 29.26 | Paid |
| 2856 | 549976 | 58316 | 66776 | 838411 | 2022-10-04 11:35:42.055 | 1000.0 | 0 | 1000.0 | 2022-10-05 | 2022-10-07 18:53:27.332 | ... | 1 | -139.25 | 139.25 | 0 | 139.25 | 860599 | 200.0 | 2022-10-09 15:41:16.183 | 60.89 | Paid |
| 17867 | 38315 | 58316 | 66776 | 866013 | 2022-10-09 22:50:35.375 | 1000.0 | 0 | 1000.0 | 2022-10-10 | 2022-10-11 17:59:21.048 | ... | 1 | 41.78 | 58.22 | 0 | 58.22 | 866013 | 0.0 | 2022-10-11 17:59:21.048 | 41.78 | Paid |
| 14394 | 8532 | 58316 | 66776 | 873912 | 2022-10-11 23:30:19.495 | 1000.0 | 0 | 1000.0 | 2022-10-12 | 2022-10-13 17:11:45.320 | ... | 1 | -87.47 | 87.47 | 0 | 87.47 | 891721 | 100.0 | 2022-10-16 12:33:27.146 | 12.53 | Paid |
| 61535 | 422479 | 58316 | 66776 | 894023 | 2022-10-16 22:35:14.941 | 1000.0 | 0 | 1000.0 | 2022-10-17 | 2022-10-18 17:18:31.248 | ... | 1 | -55.42 | 55.42 | 0 | 55.42 | 901669 | 100.0 | 2022-10-18 18:18:48.244 | 44.58 | Paid |
| 45487 | 282051 | 58316 | 66776 | 902395 | 2022-10-18 23:01:58.502 | 1000.0 | 0 | 1000.0 | 2022-10-19 | 2022-10-20 17:14:49.627 | ... | 1 | -97.98 | 97.98 | 0 | 97.98 | 915284 | 200.0 | 2022-10-21 15:05:55.259 | 102.02 | Paid |
| 20509 | 60688 | 58316 | 66776 | 916292 | 2022-10-21 19:43:51.875 | 1000.0 | 0 | 1000.0 | 2022-10-22 | 2022-10-23 20:09:05.284 | ... | 1 | 866.01 | 133.99 | 0 | 133.99 | 917815 | 0.0 | 2022-10-23 20:09:05.284 | 866.01 | Paid |
| 22819 | 80971 | 58316 | 66776 | 917815 | 2022-10-22 10:41:49.732 | 1000.0 | 0 | 1000.0 | 2022-10-22 | 2022-10-23 20:09:05.284 | ... | 1 | 866.01 | 0.00 | 0 | 0.00 | 917815 | 0.0 | 2022-10-23 20:09:05.284 | 866.01 | Paid |
| 63143 | 435219 | 58316 | 66776 | 953683 | 2022-10-29 23:46:11.320 | 1000.0 | 0 | 1000.0 | 2022-10-30 | 2022-10-31 18:43:48.993 | ... | 1 | 352.39 | 0.00 | 0 | 0.00 | 953683 | 0.0 | 2022-10-31 18:43:48.993 | 352.39 | Paid |
| 32303 | 163978 | 58316 | 66776 | 964219 | 2022-10-31 20:58:53.714 | 1000.0 | 0 | 1000.0 | 2022-11-01 | 2022-11-02 17:18:39.520 | ... | 1 | 25.00 | 375.00 | 0 | 375.00 | 964219 | 0.0 | 2022-11-02 17:18:39.520 | 25.00 | Paid |
19 rows × 21 columns
import pandas as pd
try:
_ = loans_df
except NameError:
loans_df = pd.read_excel("../../data/Loans_Data.xlsx")
# Are there any LOAN_IDs that have multiple observations, i.e. first collection attempt, second, etc. ?
(
loans_df.query("FIRST_TRIAL_BALANCE < 0")
.groupby(["MAIN_SYSTEM_ID", "LOAN_ID"])
[["REPAYMENT_UPDATED"]]
.nunique()
.query("REPAYMENT_UPDATED > 1")
)
| REPAYMENT_UPDATED | ||
|---|---|---|
| MAIN_SYSTEM_ID | LOAN_ID |
?? If FIRST_TRIAL_BALANCE < 0, the retailer missed first collection attempt. Can we assume that retailers are given only a second collection attempt before being marked as PAYMENT_STATUS in ('Unpaid', 'Partially paid')?
I’m starting to believe I was given a snapshot of the Loans dataset instead of time series as for the other 2 datasets. We can check that by looking at the following case: retailer misses the first payment but is able to fully pay on following collection attempt. We can represent that as FIRST_TRIAL_BALANCE < 0 and PAYMENT_STATUS = 'Paid', then check if any LOAN_ID has more than one observation.
import pandas as pd
try:
_ = loans_df
except NameError:
loans_df = pd.read_excel("../../data/Loans_Data.xlsx")
(
loans_df.query("FIRST_TRIAL_BALANCE < 0 and PAYMENT_STATUS == 'Paid'")
.sort_values(by=["MAIN_SYSTEM_ID", "LOAN_ID"])
[["MAIN_SYSTEM_ID", "LOAN_ID", "REPAYMENT_ID", "REPAYMENT_AMOUNT", "TOTAL_FINAL_AMOUNT"]]
)
| MAIN_SYSTEM_ID | LOAN_ID | REPAYMENT_ID | REPAYMENT_AMOUNT | TOTAL_FINAL_AMOUNT | |
|---|---|---|---|---|---|
| 21130 | 388 | 697040 | 710745 | 1000.0 | 463.35 |
| 30622 | 484 | 723923 | 748868 | 2500.0 | 2492.69 |
| 9603 | 1559 | 817070 | 819766 | 5020.0 | 325.34 |
| 68669 | 1653 | 779240 | 787659 | 1000.0 | 855.50 |
| 19165 | 2023 | 722893 | 745754 | 1500.0 | 1487.88 |
| ... | ... | ... | ... | ... | ... |
| 717 | 155535 | 850258 | 863752 | 700.0 | 496.52 |
| 12557 | 155535 | 921816 | 933511 | 650.0 | 900.54 |
| 42368 | 155535 | 946771 | 961470 | 600.0 | 575.94 |
| 3043 | 155803 | 866687 | 879437 | 3450.0 | 3449.71 |
| 12433 | 155952 | 901395 | 915464 | 550.0 | 549.18 |
686 rows × 5 columns